Created: November 11, 2021, 3:36 PM

Extracting More Performance

Two options

  1. Increase the depth of the pipeline to increase the clock rate - superpipeline
  2. Fetch (and execute) more than one instruction at a time - superscalar

⇒ Launching multiple instructions per stage allows the cycles per instruction (CPI) to drop below 1, i.e. more than one instruction can complete per cycle
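To make the CPI claim concrete, here is a small arithmetic sketch. The cycle and instruction counts are made-up illustrative numbers, not measurements, and the helper name `cpi` is hypothetical.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: total cycles divided by instructions executed."""
    return cycles / instructions


# Illustrative numbers only: the same 1000-instruction program on two ideal machines.
scalar_cpi = cpi(cycles=1000, instructions=1000)    # ideal single-issue pipeline
two_wide_cpi = cpi(cycles=500, instructions=1000)   # ideal 2-wide issue, no stalls

print("scalar CPI:", scalar_cpi)
print("2-wide CPI:", two_wide_cpi)
print("2-wide IPC:", 1 / two_wide_cpi)  # IPC is the reciprocal of CPI
```

A real multiple-issue machine rarely reaches the ideal figure, because the hazards discussed later keep some issue slots empty.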

Superpipelining

Split the pipeline stages into smaller ones

  • Increase the depth of the pipeline leading to shorter clock cycles
  • A higher degree of superpipelining means
    • more forwarding/hazard hardware is needed
    • even more unavoidable pipeline stalls (more data dependencies are exposed)
    • the gain from superpipelining is not linear in the pipeline depth
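One way to see why the gain is not linear is a toy timing model. Everything here is an assumption made for illustration: the fixed logic delay, the per-stage latch overhead, and the idea that each extra stage adds a small average stall penalty. The function names are hypothetical.

```python
def time_per_instruction(depth, logic_delay=10.0, latch_overhead=0.5,
                         stall_per_extra_stage=0.05):
    """Average time per instruction for a pipeline with `depth` stages.

    Assumed model: the logic delay is split evenly across stages, every stage
    pays a fixed latch overhead, and each stage beyond the first adds a small
    average stall penalty to the CPI.
    """
    cycle_time = logic_delay / depth + latch_overhead
    cpi = 1.0 + stall_per_extra_stage * (depth - 1)
    return cycle_time * cpi


def speedup(depth, base_depth=5):
    """Speedup of a `depth`-stage pipeline over a `base_depth`-stage one."""
    return time_per_instruction(base_depth) / time_per_instruction(depth)


for depth in (5, 10, 20, 40):
    print(depth, round(speedup(depth), 2))
```

Under these assumed parameters, doubling the depth gives well under a 2x speedup, and a very deep pipeline can end up slower than a moderately deep one.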

Superpipelined vs Superscalar(SS)

  • Superpipelined processors have longer instruction latency than SS processors, which can degrade performance in the presence of true dependencies
  • Superscalar processors are more susceptible to resource conflicts – but we can fix this with extra hardware

Multiple-Issue Processor Styles

  • Static multiple-issue processors (aka VLIW - very long instruction word architecture)
    • Decisions on which instructions to execute simultaneously are made statically (at compile time by the compiler)
  • Dynamic multiple-issue processors (aka superscalar)
    • Decisions on which instructions to execute simultaneously are made dynamically (at run time by the hardware)

VLIW (Very Long Instruction Word)

  • Software (compiler) packs independent instructions in a larger “instruction bundle” to be fetched and executed concurrently
  • Hardware fetches and executes the instructions in the bundle concurrently
  • No need for hardware dependency checking between concurrently fetched instructions in the VLIW model

  • Traditional Characteristics
    • Multiple functional units
    • All instructions in a bundle are executed in lock step
    • Instructions in a bundle statically aligned to be directly fed into the functional units
  • Advantages
    • No need for dynamic scheduling hardware → simple hardware
    • No need for dependency checking within a VLIW instruction → simple hardware for multiple instruction issue + no renaming
    • No need for instruction alignment/distribution after fetch to different functional units → simple hardware
  • Disadvantages
    • Compiler needs to find N independent operations per cycle
    • Recompilation is required when the execution width (N), instruction latencies, or functional units change (unlike superscalar processing)
    • Lockstep execution causes independent operations to stall
  • Summary
    • The hardware can be simplified, but complex compiler technology is required
    • Performance may degrade
    • VLIW is successful when the compiler can find parallelism easily
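The compiler's packing job can be sketched as a simple in-order greedy pass. This is a minimal illustration, not how a production VLIW compiler schedules code: the `Op` type, the register names, and the conservative rule that any dependency within a bundle forces a new bundle are all assumptions.

```python
from typing import NamedTuple


class Op(NamedTuple):
    name: str
    dest: str
    srcs: tuple


def conflicts(later: Op, earlier: Op) -> bool:
    """True if `later` must not share a bundle with `earlier` (conservative)."""
    return (earlier.dest in later.srcs       # true dependency
            or later.dest == earlier.dest    # output dependency
            or later.dest in earlier.srcs)   # antidependency


def pack_bundles(ops, width):
    """Greedily pack ops, in program order, into bundles of `width` slots."""
    bundles, current = [], []
    for op in ops:
        if len(current) == width or any(conflicts(op, e) for e in current):
            bundles.append(current)
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    # Unused slots are filled with NOPs, which is where VLIW loses performance.
    return [[o.name for o in b] + ["nop"] * (width - len(b)) for b in bundles]


ops = [
    Op("add", "r1", ("r2", "r3")),
    Op("sub", "r4", ("r5", "r6")),
    Op("mul", "r7", ("r1", "r4")),   # needs the results of add and sub
    Op("or", "r8", ("r9", "r10")),
]
print(pack_bundles(ops, width=3))
```

With a width of 3, the dependent `mul` forces a second bundle, so two of the six slots are wasted on NOPs. A smarter scheduler could reorder `or` into the first bundle, which is the kind of parallelism the compiler has to find.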

Superscalar

  • Fetch, decode, execute, retire multiple instructions per cycle
    • N-wide superscalar ⇒ N instructions per cycle
    • Add hardware resources
    • Hardware performs the dependence checking between concurrently fetched instructions
  • Multiple-Issue Datapath (or SS) Responsibilities
    • Data dependencies – data hazards
    • Procedural dependencies – control hazards
      • Use dynamic branch prediction
    • Resource conflicts – structural hazards
      • Resolve by duplicating the resource or by pipelining the resource

    All three of these problems get worse in a superscalar processor
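The effect of a resource conflict on issue width, and of duplicating the resource, can be sketched as follows. This is an assumed simplification: each instruction is reduced to the type of functional unit it needs, issue is in order, and data dependencies are ignored. The unit names and the function `issue_count` are hypothetical.

```python
from collections import Counter


def issue_count(group, units):
    """How many instructions of `group` issue this cycle under in-order issue.

    `group` lists the functional-unit type each instruction needs, in program
    order. `units` maps unit type to the number of copies available.
    """
    free = Counter(units)
    issued = 0
    for unit_type in group:
        if free[unit_type] == 0:
            break  # structural hazard: in-order issue stops at this instruction
        free[unit_type] -= 1
        issued += 1
    return issued


group = ["alu", "alu", "mem", "alu"]
print(issue_count(group, {"alu": 1, "mem": 1}))  # one ALU
print(issue_count(group, {"alu": 2, "mem": 1}))  # ALU duplicated
print(issue_count(group, {"alu": 3, "mem": 1}))  # ALU triplicated
```

Each extra copy of the contended unit lets more of the fetch group issue together, which is the "duplicate the resource" fix in miniature.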

Out of Order (OoO) Execution

  • Instruction-issue ⇒ initiate execution
    • Instruction lookahead capability – fetch, decode and issue instructions beyond the current instruction
  • Instruction-completion ⇒ complete execution
    • Processor lookahead capability – complete issued instructions beyond the current instruction
  • Instruction-commit ⇒ write back results to the RegFile or D$ (i.e., change the machine state)


  1. In-Order Issue with In-Order Completion

    • The simplest policy is to issue instructions in exact program order and to complete them in the same order they were fetched (i.e., in program order)

  2. In-Order Issue with Out-of-Order Completion

    • With out-of-order completion, a later instruction may complete before a previous instruction
    • With out-of-order completion, instruction issue is stalled when there is a resource conflict (e.g., for a functional unit) or when the instructions ready to issue need a result that has not yet been computed

    • Handling Output Dependencies
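      • Under IOI-OOC (in-order issue with out-of-order completion):

        When this sequence is executed, I1 may complete later than I2. ⇒ I5 then reads R3 while it holds the wrong value ⇒ I2 has an output dependency on I1 ⇒ write before write

        So instruction issue may have to stall. ⇒ This requires more dependency-checking hardware (needed for both read before write and write before write)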
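The output-dependency problem can be reproduced with a toy timing model. The two-instruction sequence, the latencies, and the stored values below are hypothetical and chosen only to trigger the hazard. The model assumes one instruction issues per cycle and each result is written when its instruction completes.

```python
def run(instrs, stall_on_output_dep):
    """Simulate in-order issue with out-of-order completion.

    `instrs` is a list of (name, dest, value, latency) in program order.
    Returns the final register contents.
    """
    pending = {}   # dest register -> cycle at which its in-flight write completes
    writes = []    # (completion_cycle, program_order, dest, value)
    cycle = 0
    for order, (name, dest, value, latency) in enumerate(instrs):
        if stall_on_output_dep and dest in pending:
            # Hold issue until the earlier write to the same register is done.
            cycle = max(cycle, pending[dest])
        done = cycle + latency
        pending[dest] = done
        writes.append((done, order, dest, value))
        cycle += 1
    regs = {}
    for done, order, dest, value in sorted(writes):
        regs[dest] = value  # later completions overwrite earlier ones
    return regs


# I1 is slow (e.g. a long-latency operation); I2 is fast and writes the same register.
program = [("I1", "R3", "from I1", 4), ("I2", "R3", "from I2", 1)]
print(run(program, stall_on_output_dep=False))  # R3 ends up with I1's stale value
print(run(program, stall_on_output_dep=True))   # R3 ends up with I2's value
```

Without the check, I1 completes last and overwrites the newer value, so any later reader of R3 sees the wrong data. The stall restores program-order writes at the cost of idle issue cycles.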

  3. Out-of-Order Issue with Out-of-Order Completion

    • With in-order issue the processor stops decoding instructions whenever a decoded instruction has a resource conflict or a data dependency on an issued, but uncompleted instruction
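    • With out-of-order issue, the processor fetches and decodes instructions beyond the conflicting one and stores them in a buffer, then flags the buffered instructions that have no resource conflict or data dependency
    • Flagged instructions are issued from the buffer regardless of program order
    • Here, issue means sending an instruction to an execution unit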
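The flagging step can be sketched as a single pass over the buffer. This is a simplified illustration: it checks only true data dependencies and functional-unit availability, and it assumes antidependencies and output dependencies are handled elsewhere (for example by renaming). The tuple layout and the name `flag_ready` are hypothetical.

```python
def flag_ready(window, in_flight_dests, free_units):
    """Return the names of buffered instructions that can issue this cycle.

    `window` holds (name, unit, dest, srcs) tuples in program order.
    `in_flight_dests` are registers whose values are still being computed.
    `free_units` maps functional-unit type to the number of free copies.
    """
    free = dict(free_units)
    unavailable = set(in_flight_dests)
    ready = []
    for name, unit, dest, srcs in window:
        has_data = not any(s in unavailable for s in srcs)
        if has_data and free.get(unit, 0) > 0:
            free[unit] -= 1
            ready.append(name)
        # Issued or not, this instruction's result is not available this cycle.
        unavailable.add(dest)
    return ready


window = [
    ("I1", "mul", "r1", ("r2", "r3")),
    ("I2", "alu", "r4", ("r1",)),        # waits for I1's result
    ("I3", "alu", "r5", ("r6", "r7")),   # independent of I1 and I2
]
print(flag_ready(window, in_flight_dests=set(), free_units={"mul": 1, "alu": 1}))
```

Here I3 is flagged and issues ahead of I2, which is exactly what in-order issue would forbid.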

Antidependencies (write before read)

  • A kind of data dependency
  • With out-of-order issue (OOI), the processor also has to deal with data antidependencies
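The three kinds of data dependency used in these notes can be told apart by comparing the registers of two instructions. The sketch below is illustrative: instructions are reduced to a (dest, srcs) pair, and the function name `classify` is hypothetical. The labels follow the notes' wording; the conventional acronyms appear in the comments.

```python
def classify(earlier, later):
    """Return the dependencies that `later` has on `earlier`.

    Each instruction is a (dest, srcs) pair, with `earlier` first in program order.
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    kinds = set()
    if e_dest in l_srcs:
        kinds.add("true dependency (read before write)")    # RAW
    if l_dest in e_srcs:
        kinds.add("antidependency (write before read)")     # WAR
    if l_dest == e_dest:
        kinds.add("output dependency (write before write)") # WAW
    return kinds


# R4 <- R3 + 1 followed by R3 <- R5 + 1: the second must not overwrite R3
# before the first has read it.
print(classify(("R4", ("R3",)), ("R3", ("R5",))))
```

Antidependencies only become a problem once a later instruction can issue ahead of an earlier one, which is why they show up with out-of-order issue.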

Storage Conflicts and Register Renaming

  • Storage conflicts can be reduced (or eliminated) by increasing or duplicating the troublesome resource
  • Register renaming
    • the processor renames the original register identifier in the instruction to a new register (one not in the visible register set)
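A minimal sketch of renaming, assuming an unlimited supply of fresh physical registers named P0, P1, and so on. Real hardware uses a finite pool and must reclaim registers, which is not modeled here. The instruction sequence and the function name `rename` are hypothetical.

```python
from itertools import count


def rename(instrs):
    """Give every destination a fresh physical register.

    `instrs` is a list of (dest, srcs) pairs in program order. Sources are
    redirected to the most recent physical register holding that value;
    sources never written in this sequence keep their architectural name.
    """
    fresh = count()
    mapping = {}   # architectural register -> current physical register
    renamed = []
    for dest, srcs in instrs:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = f"P{next(fresh)}"
        mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


program = [
    ("R3", ("R3", "R5")),   # I1: R3 <- R3 op R5
    ("R4", ("R3",)),        # I2: R4 <- R3 + 1
    ("R3", ("R5",)),        # I3: R3 <- R5 + 1  (anti- and output dependency on R3)
    ("R7", ("R3", "R4")),   # I4: R7 <- R3 op R4
]
for line in rename(program):
    print(line)
```

After renaming, I3 writes a different physical register than the one I1 writes and I2 reads, so only the true dependencies (I1 to I2, and I2 and I3 to I4) remain.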

Superscalar Pros and Cons

  • Advantages
    • Higher IPC (instructions per cycle)
  • Disadvantages
    • Higher complexity for dependency checking
      • Requires checking within a pipeline stage
      • Renaming becomes more complex in an OoO processor
    • More hardware resources needed

Superscalar Summary

  • Must handle, with a combination of hardware and software fixes, the fundamental limitations of
    • Data dependencies – aka data hazards
      • Get worse under OOI-OOC
    • Procedural dependencies – aka control hazards
      • Use dynamic branch prediction
    • Resource conflicts – aka structural hazards
      • Resource conflicts can be eliminated by duplicating the resource or by pipelining the resource

    All three get worse in a superscalar processor
