13. Superscalar

Superpipelining increases the depth of the pipeline, leading to shorter clock cycles
A higher degree of superpipelining means
- more forwarding/hazard hardware is needed
- more unavoidable pipeline stalls (more data dependencies)
- the gain from superpipelining is not linear
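
The non-linear gain can be illustrated with a toy model. This is only a sketch: the logic delay, latch overhead, and stall rate below are made-up numbers, and the assumption that stall cycles grow in proportion to pipeline depth is a simplification chosen for illustration.

```python
def cycle_time(depth, logic_delay=10.0, latch_overhead=0.5):
    """Clock period (ns) when logic_delay ns of logic is split evenly into depth stages.
    latch_overhead is a fixed per-stage cost (hypothetical value)."""
    return logic_delay / depth + latch_overhead


def speedup(depth, base_depth=5, stall_per_stage=0.02):
    """Throughput gain of a depth-stage pipeline over a base_depth-stage one."""
    def time_per_instr(d):
        cpi = 1.0 + stall_per_stage * d  # assumption: stalls grow with depth
        return cpi * cycle_time(d)
    return time_per_instr(base_depth) / time_per_instr(depth)


for d in (5, 10, 20, 40):
    print(d, round(speedup(d), 2))
```

Under these assumed numbers, doubling the depth from 5 to 10 stages gives well under a 2x gain, and each further doubling helps less, because the latch overhead does not shrink and the stall penalty keeps growing.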

Superpipelined vs Superscalar (SS)

Superpipelined processors have longer instruction latency than SS processors, which can degrade performance in the presence of true dependencies
Superscalar processors are more susceptible to resource conflicts, but this can be fixed with extra hardware

Multiple-Issue Processor Styles

Static multiple-issue processors (aka VLIW - very long instruction word architecture)
- Decisions on which instructions to execute simultaneously are made statically (at compile time by the compiler)
Dynamic multiple-issue processors (aka superscalar)
- Decisions on which instructions to execute simultaneously are made dynamically (at run time by the hardware)

VLIW (Very Long Instruction Word)

Software (compiler) packs independent instructions in a larger “instruction bundle” to be fetched and executed concurrently
Hardware fetches and executes the instructions in the bundle concurrently
No need for hardware dependency checking between concurrently fetched instructions in the VLIW model

Traditional Characteristics
- Multiple functional units
- All instructions in a bundle are executed in lock step
- Instructions in a bundle are statically aligned so that they can be fed directly into the functional units
Example
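
As a rough illustration of what the compiler does, the sketch below packs a short instruction sequence into fixed-width bundles. The `Op` type, the `pack_bundles` helper, and the three-instruction program are all hypothetical. The packing is greedy and in program order, which is far simpler than a real VLIW scheduler (a real compiler would also reorder instructions), and the dependency test is deliberately conservative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    dst: str
    srcs: tuple


def depends(later, earlier):
    """True if `later` must not share a bundle with `earlier` (conservative check)."""
    raw = earlier.dst in later.srcs   # true dependency
    waw = earlier.dst == later.dst    # output dependency
    war = later.dst in earlier.srcs   # antidependency
    return raw or waw or war


def pack_bundles(ops, width):
    """Greedy in-order packing; a bundle closes when it is full or a dependency appears."""
    bundles, current = [], []
    for op in ops:
        if len(current) == width or any(depends(op, prev) for prev in current):
            bundles.append(current + [None] * (width - len(current)))  # pad with NOPs
            current = []
        current.append(op)
    if current:
        bundles.append(current + [None] * (width - len(current)))
    return bundles


# Hypothetical program: I2 reads r1, which I1 writes.
prog = [
    Op("I1", "r1", ("r2", "r3")),
    Op("I2", "r4", ("r1", "r5")),
    Op("I3", "r6", ("r7", "r8")),
]
for bundle in pack_bundles(prog, width=2):
    print([op.name if op else "nop" for op in bundle])
```

With a width of 2, I1 ends up alone in its bundle with a NOP in the second slot, because I2 depends on it. I2 and I3 then share the next bundle. The wasted slot shows why the compiler's ability to find independent operations matters so much for VLIW.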

Advantages
- No need for dynamic scheduling hardware → simple hardware
- No need for dependency checking within a VLIW instruction → simple hardware for multiple instruction issue + no renaming
- No need for instruction alignment/distribution after fetch to different functional units → simple hardware
Disadvantages
- Compiler needs to find N independent operations per cycle
- Recompilation is required when the execution width (N), instruction latencies, or functional units change (unlike superscalar processing)
- Lockstep execution causes independent operations to stall (no operation in a bundle can progress until the longest-latency one completes)
Summary
- Hardware can be simplified, but sophisticated compiler technology is required
- Performance may degrade
- VLIW succeeds when the compiler can easily find parallelism

Superscalar

Fetch, decode, execute, retire multiple instructions per cycle
- N-wide superscalar ⇒ N instructions per cycle
- Add hardware resources
- Hardware performs the dependence checking between concurrently fetched instructions
Multiple-Issue Datapath (or SS) Responsibilities
- Data dependencies – data hazards
- Procedural dependencies – control hazards
  - Use dynamic branch prediction
- Resource conflicts – structural hazards
  - Resolved by duplicating the resource or by pipelining the resource
  
  All three of the problems above get worse in a superscalar processor
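
The run-time dependence check on a fetch group can be sketched as follows. This is a simplified model, not any particular processor's logic: it assumes in-order issue, so issue stops at the first instruction that depends on an earlier one in the same group. The function names and register names are hypothetical.

```python
def issue_count(group):
    """How many leading instructions of `group` can issue in the same cycle.
    Each instruction is a (dst, srcs) pair. In-order issue is assumed."""
    written = set()
    for i, (dst, srcs) in enumerate(group):
        # Stop at a read of, or a second write to, a register written earlier in the group.
        if dst in written or written.intersection(srcs):
            return i
        written.add(dst)
    return len(group)


def comparators(n, srcs_per_instr=2):
    """Source-vs-earlier-destination comparisons needed for an n-wide group."""
    return srcs_per_instr * n * (n - 1) // 2


dependent = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")), ("r6", ("r7", "r8"))]
independent = [("r1", ("r2", "r3")), ("r4", ("r5", "r6")), ("r7", ("r8", "r9"))]
print(issue_count(dependent), issue_count(independent))
print([comparators(n) for n in (2, 4, 8)])
```

The second function counts the comparisons when every source operand is checked against the destination of every earlier instruction in the group. The count grows roughly with the square of the issue width, which is one reason wider superscalar machines need disproportionately more checking hardware.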

Out of Order (OoO) Execution

Instruction-issue ⇒ initiate execution
- Instruction lookahead capability – fetch, decode and issue instructions beyond the current instruction
Instruction-completion ⇒ complete execution
- Processor lookahead capability – complete issued instructions beyond the current instruction
Instruction-commit ⇒ write back results to the RegFile or D$ (i.e., change the machine state)

In-Order Issue with In-Order Completion
- Simplest policy is to issue instructions in exact program order and to complete them in the same order they were fetched (i.e., in program order)

In-Order Issue with Out-of-Order Completion
- With out-of-order completion, a later instruction may complete before a previous instruction
- With out-of-order completion, instruction issue is stalled when there is a resource conflict (e.g., for a functional unit) or when the instructions ready to issue need a result that has not yet been computed
- Handling Output Dependencies
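  - In IOI-OOC (in-order issue, out-of-order completion):
    
    When this sequence is executed, I1 may complete later than I2. ⇒ I5 then reads R3 while it holds the wrong value ⇒ I2 has an output dependency on I1 ⇒ write before write
    
    Issue may therefore have to stall. ⇒ It requires more dependency checking hardware (needed for both the read before write and write before write cases)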
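
The output-dependency problem can be reproduced with a small model of in-order issue and out-of-order completion. This is a sketch under stated assumptions: one instruction issues per cycle, each writes its destination when it completes, and the three-instruction program with labels A, B, C and its latencies is hypothetical (it is not the I1 to I5 sequence referred to above). Output dependencies are often called write-after-write (WAW) hazards in other texts.

```python
def final_writer(program, stall_on_waw):
    """Return, per register, the instruction whose result is left in it.
    program: list of (name, dst, latency), issued in order, one per cycle."""
    pending = {}   # dst -> completion cycle of the most recent write to it
    writes = []    # (completion cycle, program order, name, dst)
    cycle = 0
    for order, (name, dst, latency) in enumerate(program):
        if stall_on_waw and dst in pending:
            # Hold issue until the earlier write to the same register has landed.
            cycle = max(cycle, pending[dst])
        done = cycle + latency
        pending[dst] = done
        writes.append((done, order, name, dst))
        cycle += 1
    regs = {}
    for _, _, name, dst in sorted(writes):   # apply writes in completion order
        regs[dst] = name
    return regs


# A is slow (e.g. a multiply); C writes the same register R3 and is fast.
prog = [("A", "R3", 4), ("B", "R4", 1), ("C", "R3", 1)]
print(final_writer(prog, stall_on_waw=False))
print(final_writer(prog, stall_on_waw=True))
```

Without the stall, A completes after C and R3 is left holding A's result, although program order says C's value should be the final one. With the stall enabled, C is held back until A has written, and R3 ends with C's result.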

Out-of-Order Issue with Out-of-Order Completion
- With in-order issue, the processor stops decoding instructions whenever a decoded instruction has a resource conflict or a data dependency on an issued but uncompleted instruction
- With out-of-order issue, the processor fetches and decodes instructions beyond the conflicting one, stores them in a buffer, and flags the buffered instructions that have no resource conflict or data dependency
- Flagged instructions are issued from the buffer regardless of program order
- Here, issue means sending an instruction to an execution unit
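
The buffer-and-flag mechanism can be sketched as a loop over an instruction buffer. This is a minimal model with several assumptions: the whole program is already decoded into the buffer, only true dependencies are tracked (each register is written at most once, as if renaming had already been applied), and resource conflicts are modeled only by the issue width. The function name, program, and latencies are hypothetical.

```python
def ooo_issue(program, latency, width=1):
    """Return a list of (cycle, name) in the order instructions are issued.
    program: list of (name, dst, srcs); latency: name -> cycles until the result is ready."""
    ready_at = {}            # register -> cycle at which its value becomes available
    window = list(program)   # decoded instructions waiting in the buffer
    schedule, cycle = [], 0
    while window:
        flagged = []
        for i, (name, dst, srcs) in enumerate(window):
            # An older buffered instruction that has not issued yet still has to produce a source.
            waits_on_older = any(older[1] in srcs for older in window[:i])
            operands_ready = all(ready_at.get(s, 0) <= cycle for s in srcs)
            if not waits_on_older and operands_ready:
                flagged.append(window[i])
        for instr in flagged[:width]:   # oldest flagged instructions go first
            name, dst, _ = instr
            ready_at[dst] = cycle + latency[name]
            schedule.append((cycle, name))
            window.remove(instr)
        cycle += 1
    return schedule


program = [("I1", "r1", ("r2",)), ("I2", "r3", ("r1",)), ("I3", "r4", ("r5",))]
latency = {"I1": 3, "I2": 1, "I3": 1}
print(ooo_issue(program, latency))
```

In this example I2 needs the result of I1, which takes three cycles, so the independent I3 is issued ahead of it. An in-order issue policy would have left the issue slot idle while I2 waited.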

Antidependencies (write before read)

A kind of data dependency
With OOI, the processor also has to deal with data antidependencies

Storage Conflicts and Register Renaming

Storage conflicts can be reduced (or eliminated) by increasing or duplicating the troublesome resource
Register renaming
- The processor renames the original register identifier in the instruction to a new register (one not in the visible register set)
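
A minimal sketch of renaming, assuming an unlimited supply of new registers named p0, p1, and so on (real hardware has a finite pool and must recycle registers, which is not modeled here). The `rename` function and the four-instruction program are hypothetical.

```python
from itertools import count


def rename(program):
    """Give every write a fresh register and redirect later reads to it.
    program: list of (dst, srcs) using architectural register names."""
    fresh = (f"p{i}" for i in count())
    mapping = {}   # architectural register -> register holding its latest value
    renamed = []
    for dst, srcs in program:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)  # read the latest mapping first
        mapping[dst] = next(fresh)                         # then allocate a new destination
        renamed.append((mapping[dst], new_srcs))
    return renamed


# Hypothetical sequence in which R3 is written twice.
program = [
    ("R3", ("R3", "R5")),
    ("R4", ("R3",)),
    ("R3", ("R5",)),
    ("R7", ("R3", "R4")),
]
for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, the two writes to R3 go to different registers (p0 and p2), so the output dependency between the first and third instructions disappears, as does the antidependency between the second instruction's read and the third instruction's write. The true dependencies remain: the second instruction still reads p0, and the fourth still reads p2 and p1.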

Superscalar Advantages and Disadvantages

Advantages
- Higher IPC (instructions per cycle)
Disadvantages
- Higher complexity for dependency checking
  - Requires checking within a pipeline stage
  - Renaming becomes more complex in an OoO processor
- More hardware resources needed

Superscalar Summary

Must handle, with a combination of hardware and software fixes, the fundamental limitations of
- Data dependencies – aka data hazards
  - Made worse with OOI-OOC
- Procedural dependencies – aka control hazards
  - Use dynamic branch prediction
- Resource conflicts – aka structural hazards
  - Resource conflicts can be eliminated by duplicating the resource or by pipelining the resource
  
  All three get worse in a superscalar processor