생성일: 2021년 10월 16일 오후 9:55
Control Hazards
- Improves strategy for placing instructions in delay slot
- Two strategies
- Backward branch predict taken, forward branch not taken
- Profile-based prediction: record branch behavior, predict branch based on prior run
Hazards를 제외하고 Pipelining을 어렵게 만드는 요소
- Complex Addressing Modes and Instructions
- Address modes: Autoincrement causes register change during instruction execution
- Memory-Memory Move Instructions
- Condition Codes
Handling Multi-cycle Operations
- Multi-cycle instructions also lead to pipeline complexity.
- A very lengthy instruction causes everything else in the pipeline to wait for it.
- e.g. Floating Point
- Adds WAR and WAW hazards since pipelines are no longer same length
How to Handle Control Dependences
- Critical to keep the pipeline full with correct sequence of dynamic instructions
- Potential Solutions
- Stall the pipeline until we know the next fetch address
- Backward Taken Forward Not Taken
- Guess the next fetch address (branch prediction)
- Employ delayed branching (branch delay slot)
- Do something else (fine-grained multithreading)
- Eliminate control-flow instructions (predicated execution)
- Fetch from both possible paths (if you know the addresses of both possible paths)
(multipath execution)
Branch Problem ⭐
Problem : Next fetch address after a control-flow instruction is not determined after N cycles in a pipelined processor
- If we are fetching W instructions per cycle (i.e., if the pipeline is W wide)
- A branch mis-prediction leads to N x W wasted instruction slots
- Assume N = 20 (20 pipe stages), W = 5 (5 wide fetch)
- Assume: 1 out of 5 instructions is a branch
- Assume: Each 5 instruction-block ends with a branch
- Assume: Branch direction is determined at the final pipe stage
- How long does it take to fetch 500 instructions?
- 100% accuracy
- 100 cycles (all instructions fetched on the correct path)
- No wasted work
- 99% accuracy
- 100 (correct path) + 20 (wrong path) = 120 cycles
- 20% extra instructions fetched
- 98% accuracy
- 100 (correct path) + 20 * 2 (wrong path) = 140 cycles
- 40% extra instructions fetched
- 95% accuracy
- 100 (correct path) + 20 * 5 (wrong path) = 200 cycles
- 100% extra instructions fetched
Branch Prediction
- Predict the next fetch address
- Requires three things to be predicted at fetch stage
- Whether the fetched instruction is a branch
- (Conditional) branch direction
- Branch target address (if taken)
- Observation: Target address remains the same for a conditional direct branch across dynamic instances
- Idea: Store the target address from previous instance and access it with the PC
- Called Branch Target Buffer (BTB) or Branch Target Address Cache
위의 3가지 조건중 1,3번은 BTB를 이용하여 수행 가능
그렇다면 2번인 branch direction 은?
Direction Prediction ‼️
- Compile time (static)
- Always not taken
- Always taken
- Better accuracy : ~60-70%
- BTFN (Backward taken, Forward not taken)
- Profile based
- Program analysis based
- Programmer-based
💡
1. 위의 기술들을 합칠 수 있다.
2. 공통된 단점 : Cannot adapt to dynamic changes in branch behavior
- Run time (dynamic)
- Last time prediction
- Two-bit counter based prediction
- Two-level prediction
- Hybrid
- Advanced algorithms
💡
1. 장점
+ Prediction based on history of the execution of branches
+ It can adapt to dynamic changes in branch behavior
+ No need for static profiling: input set representativeness problem goes away
2. 단점
- More complex (requires additional hardware)
Last Time Predictor
- Single bit per branch (stored in BTB)
- Indicates which direction branch went last time it executed
- Always mispredicts the last iteration and the first iteration of a loop branch
- Accuracy for a loop with N iterations = (N-2)/N
- Loop branches for loops with large N (number of iterations) 일 때 유리
- Loop branches for loops will small N (number of iterations) 일 때 불리
- Last time predictor 는 그동안 계속 T가 나왔더라도 한번이라도 N이 나오면 다음 예측을 곧바로 N으로 바꾼다는 단점이 있다.
- Hysteresis 를 predictor에 추가하여 하나의 다른 결과만으로는 예측을 바꾸지 않게 향상 시킴
⇒ Two-bit Counter based prediction
Two-bit Counter based prediction
- A strong prediction does not change with one single different outcome
- Accuracy for a loop with N iterations = (N-1)/N
-
~85-90% accuracy for many programs with 2-bit counter based prediction (also called bimodal prediction)
-
Assume N = 20 (20 pipe stages), W = 5 (5 wide fetch)
-
Assume: 1 out of 5 instructions is a branch
-
Assume: Each 5 instruction-block ends with a branch
-
Assume: Branch direction is determined at the final pipe stage.
-
How long does it take to fetch 500 instructions?
-
100% accuracy
- 100 cycles (all instructions fetched on the correct path)
- No wasted work
-
95% accuracy
- 100 (correct path) + 20 * 5 (wrong path) = 200 cycles
- 100% extra instructions fetched
-
90% accuracy
- 100 (correct path) + 20 * 10 (wrong path) = 300 cycles
- 200% extra instructions fetched
-
85% accuracy
- 100 (correct path) + 20 * 15 (wrong path) = 400 cycles
- 300% extra instructions fetched
Two-Level Prediction
- Last-time and 2BC predictors exploit “last-time” predictability
- Realization 1: A branch’s outcome can be correlated with other branches’ outcomes
- Realization 2: A branch’s outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch“last-time” it was executed)
위와 같은 상황에서 첫번째 branch가 not taken이라면 당연히 두번째도 not taken이다.
- Idea: Associate branch outcomes with “global T/NT history” of all branches
- Make a prediction based on the outcome of the branch the last time the same global branch history was encountered
⇒Global history/branch predictor
- First level: Global branch history register (N bits) (A set of local history registers)
- Second level: Table of saturating(포화) counters for each history entry
- An Issue: Interference in the PHTs
- Sharing the PHTs between histories/branches leads to interference
- Reducing Interference in PHTs (I)
- Increase size of PHT
- Branch filtering
- Hashing/index-reandomization
Two level local History Branch Pridictor 그림