11. Branch Prediction 1,2

이세진·2022년 4월 3일

컴퓨터 구조

Computer Science

목록 보기

55/74

생성일: 2021년 10월 16일 오후 9:55

Control Hazards

Improves strategy for placing instructions in delay slot
Two strategies
- Backward branch predict taken, forward branch not taken
- Profile-based prediction: record branch behavior, predict branch based on prior run

Hazards를 제외하고 Pipelining을 어렵게 만드는 요소

Complex Addressing Modes and Instructions
Address modes: Autoincrement causes register change during instruction execution
Memory-Memory Move Instructions
Condition Codes

Handling Multi-cycle Operations

Multi-cycle instructions also lead to pipeline complexity.
A very lengthy instruction causes everything else in the pipeline to wait for it.
e.g. Floating Point
- Adds WAR and WAW hazards since pipelines are no longer same length

How to Handle Control Dependences

Critical to keep the pipeline full with correct sequence of dynamic instructions
Potential Solutions
- Stall the pipeline until we know the next fetch address
- Backward Taken Forward Not Taken
- Guess the next fetch address (branch prediction)
- Employ delayed branching (branch delay slot)
- Do something else (fine-grained multithreading)
- Eliminate control-flow instructions (predicated execution)
- Fetch from both possible paths (if you know the addresses of both possible paths)
  (multipath execution)

Branch Problem ⭐

Problem : Next fetch address after a control-flow instruction is not determined after N cycles in a pipelined processor

If we are fetching W instructions per cycle (i.e., if the pipeline is W wide)
- A branch mis-prediction leads to N x W wasted instruction slots
Assume N = 20 (20 pipe stages), W = 5 (5 wide fetch)
Assume: 1 out of 5 instructions is a branch
Assume: Each 5 instruction-block ends with a branch
Assume: Branch direction is determined at the final pipe stage
How long does it take to fetch 500 instructions?
100% accuracy
- 100 cycles (all instructions fetched on the correct path)
- No wasted work
99% accuracy
- 100 (correct path) + 20 (wrong path) = 120 cycles
- 20% extra instructions fetched
98% accuracy
- 100 (correct path) + 20 * 2 (wrong path) = 140 cycles
- 40% extra instructions fetched
95% accuracy
- 100 (correct path) + 20 * 5 (wrong path) = 200 cycles
- 100% extra instructions fetched

Branch Prediction

Predict the next fetch address
Requires three things to be predicted at fetch stage
1. Whether the fetched instruction is a branch
2. (Conditional) branch direction
3. Branch target address (if taken)
Observation: Target address remains the same for a conditional direct branch across dynamic instances
- Idea: Store the target address from previous instance and access it with the PC
- Called Branch Target Buffer (BTB) or Branch Target Address Cache

위의 3가지 조건중 1,3번은 BTB를 이용하여 수행 가능

그렇다면 2번인 branch direction 은?

Direction Prediction ‼️

Compile time (static)
- Always not taken
  - low accuracy : ~30-40%
- Always taken
  - Better accuracy : ~60-70%
- BTFN (Backward taken, Forward not taken)
- Profile based
  - requires hint bit in the branch instruction format
- Program analysis based
- Programmer-based

💡
1. 위의 기술들을 합칠 수 있다.
2. 공통된 단점 : Cannot adapt to dynamic changes in branch behavior

Run time (dynamic)
- Last time prediction
- Two-bit counter based prediction
- Two-level prediction
- Hybrid
- Advanced algorithms

💡
1. 장점
+ Prediction based on history of the execution of branches
+ It can adapt to dynamic changes in branch behavior
+ No need for static profiling: input set representativeness problem goes away
2. 단점
- More complex (requires additional hardware)

Last Time Predictor

Single bit per branch (stored in BTB)
Indicates which direction branch went last time it executed

Always mispredicts the last iteration and the first iteration of a loop branch
- Accuracy for a loop with N iterations = (N-2)/N
Loop branches for loops with large N (number of iterations) 일 때 유리
Loop branches for loops will small N (number of iterations) 일 때 불리

Last time predictor 는 그동안 계속 T가 나왔더라도 한번이라도 N이 나오면 다음 예측을 곧바로 N으로 바꾼다는 단점이 있다.
- Hysteresis 를 predictor에 추가하여 하나의 다른 결과만으로는 예측을 바꾸지 않게 향상 시킴
  ⇒ Two-bit Counter based prediction

Two-bit Counter based prediction

A strong prediction does not change with one single different outcome
Accuracy for a loop with N iterations = (N-1)/N

~85-90% accuracy for many programs with 2-bit counter based prediction (also called bimodal prediction)
Assume N = 20 (20 pipe stages), W = 5 (5 wide fetch)
Assume: 1 out of 5 instructions is a branch
Assume: Each 5 instruction-block ends with a branch
Assume: Branch direction is determined at the final pipe stage.
How long does it take to fetch 500 instructions?
100% accuracy
- 100 cycles (all instructions fetched on the correct path)
- No wasted work
95% accuracy
- 100 (correct path) + 20 * 5 (wrong path) = 200 cycles
- 100% extra instructions fetched
90% accuracy
- 100 (correct path) + 20 * 10 (wrong path) = 300 cycles
- 200% extra instructions fetched
85% accuracy
- 100 (correct path) + 20 * 15 (wrong path) = 400 cycles
- 300% extra instructions fetched

Two-Level Prediction

Last-time and 2BC predictors exploit “last-time” predictability
Realization 1: A branch’s outcome can be correlated with other branches’ outcomes
Realization 2: A branch’s outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch“last-time” it was executed)

위와 같은 상황에서 첫번째 branch가 not taken이라면 당연히 두번째도 not taken이다.

Idea: Associate branch outcomes with “global T/NT history” of all branches
Make a prediction based on the outcome of the branch the last time the same global branch history was encountered

⇒Global history/branch predictor

First level: Global branch history register (N bits) (A set of local history registers)
Second level: Table of saturating(포화) counters for each history entry

An Issue: Interference in the PHTs
- Sharing the PHTs between histories/branches leads to interference
Reducing Interference in PHTs (I)
1. Increase size of PHT
2. Branch filtering
3. Hashing/index-reandomization