Created: November 11, 2021, 3:36 PM
Two options
- Increase the depth of the pipeline to increase the clock rate - superpipeline
- Fetch (and execute) more than one instruction at a time - superscalar
⇒ Launching multiple instructions per stage allows the instruction execution rate, measured as CPI (cycles per instruction), to drop below 1
Superpipelining
Split the pipeline stages into smaller ones
- Increase the depth of the pipeline leading to shorter clock cycles
- Higher degree of superpipelining induces
- more forwarding/hazard hardware needed
- even more inevitable pipeline stalls (more data dependency)
- gain of superpipelining is not linear
- Superpipelined processors have longer instruction latency than superscalar (SS) processors, which can degrade performance in the presence of true dependencies
- Superscalar processors are more susceptible to resource conflicts – but we can fix this with extra hardware
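The claim that the gain of superpipelining is not linear can be illustrated with a rough model. This is only a sketch under stated assumptions: the logic delay is split evenly across stages, every stage adds a fixed latch overhead, and the average number of stall cycles per instruction grows with depth. The function names and all numeric parameters are hypothetical, not taken from the notes.

```python
def time_per_instruction(depth, logic_delay_ns=10.0, latch_overhead_ns=0.2,
                         stalls_per_instr_per_stage=0.05):
    """Average time per instruction (ns) for a pipeline with `depth` stages."""
    # Shorter stages give a shorter clock cycle, but latch overhead does not shrink.
    cycle_time = logic_delay_ns / depth + latch_overhead_ns
    # Assumption: deeper pipelines expose more dependency stalls per instruction.
    cpi = 1.0 + stalls_per_instr_per_stage * depth
    return cycle_time * cpi


def speedup(depth, base_depth=5):
    """Speedup of a `depth`-stage pipeline over a `base_depth`-stage one."""
    return time_per_instruction(base_depth) / time_per_instruction(depth)


if __name__ == "__main__":
    for d in (5, 10, 20):
        print(d, round(speedup(d), 2))
```

Under these assumed numbers, doubling the depth yields less than a 2x speedup, and each further doubling helps less.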
Multiple-Issue Processor Styles
- Static multiple-issue processors (aka VLIW - very long instruction word architecture)
- Decisions on which instructions to execute simultaneously are being made statically (at compile time by the compiler)
- Dynamic multiple-issue processors (aka superscalar)
- Decisions on which instructions to execute simultaneously are being made dynamically (at run time by the hardware)
VLIW (Very Long Instruction Word)
- Software (compiler) packs independent instructions in a larger “instruction bundle” to be fetched and executed concurrently
- Hardware fetches and executes the instructions in the bundle concurrently
- No need for hardware dependency checking between concurrently fetched instructions in the VLIW model
- Traditional Characteristics
- Multiple functional units
- All instructions in a bundle are executed in lock step
- Instructions in a bundle statically aligned to be directly fed into the functional units
- Example
- Advantages
- No need for dynamic scheduling hardware → simple hardware
- No need for dependency checking within a VLIW instruction → simple hardware for multiple instruction issue + no renaming
- No need for instruction alignment/distribution after fetch to different functional units → simple hardware
- Disadvantages
- Compiler needs to find N independent operations per cycle
- Recompilation required when execution width (N), instruction latencies, functional units change (Unlike superscalar processing)
- Lockstep execution causes independent operations to stall
- Summary
- Hardware can be simplified, but complex compiler technology is required
- Performance may degrade
- VLIW is successful when the compiler can find parallelism easily
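The compiler's job of packing independent instructions into bundles can be sketched as a simple greedy scheduler. This is an illustrative sketch, not how any particular VLIW compiler works: the `Op` type, `conflicts`, and `pack_bundles` are hypothetical names, every operation is assumed to fit any slot, and write-before-read conflicts are conservatively kept in separate bundles.

```python
from typing import NamedTuple


class Op(NamedTuple):
    name: str
    dest: str
    srcs: tuple


def conflicts(earlier, later):
    """True if `later` must not share a bundle with (or precede) `earlier`."""
    return (earlier.dest in later.srcs       # true dependency
            or earlier.dest == later.dest    # output dependency
            or later.dest in earlier.srcs)   # antidependency (conservative)


def pack_bundles(ops, width):
    """Greedily pack ops (in program order) into bundles of `width` slots."""
    bundles = []    # list of lists of Op
    slot_of = {}    # op index -> bundle index
    for i, op in enumerate(ops):
        earliest = 0
        for j in range(i):
            if conflicts(ops[j], op):
                earliest = max(earliest, slot_of[j] + 1)
        b = earliest
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1                      # bundle full, try the next one
        if b == len(bundles):
            bundles.append([])
        bundles[b].append(op)
        slot_of[i] = b
    # Unused slots are filled with explicit no-ops.
    return [[o.name for o in b] + ["nop"] * (width - len(b)) for b in bundles]


ops = [
    Op("i1", "r1", ("r2", "r3")),
    Op("i2", "r4", ("r1",)),        # needs i1's result
    Op("i3", "r5", ("r6",)),        # independent
    Op("i4", "r7", ("r4", "r5")),   # needs i2 and i3
]
print(pack_bundles(ops, width=2))
```

The `nop` padding in the result shows the cost noted above: when the compiler cannot find N independent operations, slots go unused.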
Superscalar
- Fetch, decode, execute, retire multiple instructions per cycle
- N-wide superscalar ⇒ N instructions per cycle
- Add hardware resources
- Hardware performs the dependence checking between concurrently fetched instructions
- Multiple-Issue Datapath (or SS) Responsibilities
- Data dependencies – data hazards
- Procedural dependencies – control hazards
- Use dynamic branch prediction
- Resource conflicts – structural hazards
- Resolve by duplicating the resource or by pipelining the resource
All three of these problems get worse in a superscalar processor
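The dependence check between concurrently fetched instructions can be sketched in software as follows. This is a minimal sketch assuming in-order issue and a fetch group given as `(dest, srcs)` pairs; `issuable_prefix` is a hypothetical name. Real hardware does the equivalent with parallel register-number comparators, and the number of comparisons grows roughly quadratically with the issue width.

```python
def issuable_prefix(group):
    """Return how many leading instructions of `group` can issue together.

    `group` is a list of (dest, srcs) tuples in program order. Issue stops at
    the first instruction that reads or overwrites a register written by an
    earlier instruction in the same group.
    """
    written = set()
    for count, (dest, srcs) in enumerate(group):
        if dest in written or any(s in written for s in srcs):
            return count
        written.add(dest)
    return len(group)


group = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5",)),
    ("r6", ("r1",)),   # reads r1, written by the first instruction
]
print(issuable_prefix(group))
```

In this example only the first two instructions can issue in the same cycle; the third has to wait for `r1`.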
Out of Order (OoO) Execution
- Instruction-issue ⇒ initiate execution
- Instruction lookahead capability – fetch, decode and issue instructions beyond the current instruction
- Instruction-completion ⇒ complete execution
- Processor lookahead capability – complete issued instructions beyond the current instruction
- Instruction-commit ⇒ write back results to the RegFile or D$ (i.e., change the machine state)
1. In-Order Issue with In-Order Completion
- Simplest policy is to issue instructions in exact program order and to complete them in the same order they were fetched (i.e., in program order)
2. In-Order Issue with Out-of-Order Completion
- With out-of-order completion, a later instruction may complete before a previous instruction
- When using out-of-order completion, instruction issue is stalled when there is a resource conflict (e.g., for a functional unit) or when the instructions ready to issue need a result that has not yet been computed
- Handling Output Dependencies
- Under in-order issue with out-of-order completion (IOI-OOC):
When the sequence is executed, I1 may finish later than I2. ⇒ I5 then reads a wrong value from R3 ⇒ I2 has an output dependency on I1 ⇒ write before write (WAW)
So issue may have to stall. ⇒ This requires more dependency-checking hardware (needed for the read-before-write and write-before-write cases)
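The output-dependency problem can be reproduced with a tiny simulation. This is a sketch under loud assumptions: the two-instruction program below is hypothetical and is not the example the notes refer to, one instruction issues per cycle, operands are read at issue, there are no true dependencies in the program, and register writes happen at completion time. `run` and its parameters are illustrative names.

```python
def run(program, regs, stall_on_waw):
    """Issue in order, complete out of order; return the final register file."""
    regs = dict(regs)
    writes = []        # (finish_cycle, dest, value)
    last_finish = {}   # dest -> finish cycle of the latest issued write
    cycle = 0
    for dest, srcs, op, latency in program:
        if stall_on_waw and dest in last_finish:
            # Delay issue so this write completes after the earlier write.
            cycle = max(cycle, last_finish[dest] - latency + 1)
        value = op(*(regs[s] for s in srcs))   # operands read at issue
        finish = cycle + latency
        writes.append((finish, dest, value))
        last_finish[dest] = finish
        cycle += 1
    # Apply register writes in completion order, not program order.
    for _, dest, value in sorted(writes, key=lambda w: w[0]):
        regs[dest] = value
    return regs


regs = {"R1": 2, "R2": 3, "R3": 0, "R5": 10}
program = [
    ("R3", ("R1", "R2"), lambda a, b: a * b, 3),  # slow multiply
    ("R3", ("R5",), lambda a: a + 1, 1),          # fast add, same destination
]
print(run(program, regs, stall_on_waw=False)["R3"])
print(run(program, regs, stall_on_waw=True)["R3"])
```

Without the check, the slow multiply writes R3 last, so a later reader of R3 would see the multiply result instead of the add result that program order requires. With the stall, the final value matches program order.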
3. Out-of-Order Issue with Out-of-Order Completion
- With in-order issue the processor stops decoding instructions whenever a decoded instruction has a resource conflict or a data dependency on an issued, but uncompleted instruction
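- Fetch and decode instructions beyond the conflicting one, store them in a buffer, and flag the instructions in the buffer that have no resource conflict or data dependency
- Flagged instructions are issued from the buffer regardless of program order
- Here, issue means inserting an instruction into an execution unit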
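The flag-and-issue step over the buffer can be sketched like this. It is a simplified sketch, not a description of any real scheduler: `select_ready` is a hypothetical name, each buffered instruction is a dict with assumed keys `name`, `srcs`, and `unit`, and only operand readiness and functional-unit availability are checked (antidependencies and output dependencies are ignored here).

```python
def select_ready(window, ready_regs, free_units):
    """Return the names of buffered instructions that can issue this cycle.

    window     : decoded instructions, oldest first
    ready_regs : set of registers whose values are available
    free_units : dict mapping functional-unit kind -> number of free units
    """
    free = dict(free_units)
    issued = []
    for instr in window:
        operands_ready = all(s in ready_regs for s in instr["srcs"])
        unit_free = free.get(instr["unit"], 0) > 0
        if operands_ready and unit_free:   # "flagged": no conflict, no dependency
            free[instr["unit"]] -= 1
            issued.append(instr["name"])
    return issued


window = [
    {"name": "I1", "srcs": ("r1",), "unit": "mul"},
    {"name": "I2", "srcs": ("r9",), "unit": "alu"},   # r9 not yet computed
    {"name": "I3", "srcs": ("r2",), "unit": "alu"},
]
print(select_ready(window, ready_regs={"r1", "r2"}, free_units={"mul": 1, "alu": 1}))
```

Here I3 issues ahead of the older I2, which is still waiting for `r9`.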
Antidependencies (write before read)
- A kind of data dependency
- With out-of-order issue (OOI), we also have to deal with data antidependencies
Storage Conflicts and Register Renaming
- Storage conflicts can be reduced (or eliminated) by increasing or duplicating the troublesome resource
- Register renaming
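Register renaming can be sketched as a mapping table from architectural to physical registers. This is a minimal sketch assuming an unlimited supply of physical registers and no freeing of old ones; `rename`, the `p<n>` naming, and the four-instruction program are hypothetical.

```python
def rename(program, arch_regs):
    """Rename (dest, srcs) instructions so every write gets a fresh register."""
    # Initial mapping: each architectural register has one physical register.
    mapping = {r: f"p{i}" for i, r in enumerate(arch_regs)}
    next_phys = len(arch_regs)
    renamed = []
    for dest, srcs in program:
        # Sources use the current mapping, i.e. the most recent producer.
        new_srcs = tuple(mapping[s] for s in srcs)
        # Each write allocates a new physical register.
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], new_srcs))
    return renamed


program = [
    ("R3", ("R3", "R5")),   # I1
    ("R4", ("R3",)),        # I2 reads I1's R3
    ("R3", ("R5",)),        # I3 writes R3 again
    ("R7", ("R3", "R4")),   # I4 reads I3's R3
]
for instr in rename(program, ["R3", "R4", "R5", "R7"]):
    print(instr)
```

After renaming, I1 and I3 write different physical registers, so the output dependency between them and the antidependency between I2 and I3 disappear, while the true dependencies (I1 to I2, I3 to I4, I2 to I4) are preserved.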
Superscalar Pros and Cons
- Advantages
- Higher IPC (instructions per cycle)
- Disadvantages
- Higher complexity for dependency checking
- Require checking within a pipeline stage
- Renaming becomes more complex in an OoO processor
- More hardware resources needed
Superscalar Summary
- Must handle, with a combination of hardware and software fixes, the fundamental limitations of
- Data dependencies – aka data hazards
- Procedural dependencies – aka control hazards
- Use dynamic branch prediction
- Resource conflicts – aka structural hazards
- Resource conflicts can be eliminated by duplicating the resource or by pipelining the resource
All three get worse in a superscalar processor