Created: November 11, 2021, 3:36 PM
Basics
- Thread
- Instruction stream with state (register and memory)
- Register state is also called the thread context
- Threads could be part of the same process (program) or from different programs
- Threads of the same program share the same address space
- Thread context switching for multitasking
- When a new thread needs to be executed, the old thread’s context in hardware is written back to memory and the new thread’s context is loaded (see the sketch below)
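Below is a minimal C sketch of what a context switch saves and restores; the `thread_context` fields and function names are hypothetical, not from any real OS or ISA.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative thread context: the register state the hardware holds
 * for the running thread (field names are hypothetical). */
typedef struct {
    uint64_t regs[32];  /* general-purpose registers */
    uint64_t pc;        /* program counter */
} thread_context;

/* Context switch: the old thread's context is written back from the
 * (simulated) hardware registers to memory, and the new thread's
 * context is loaded in its place. */
void context_switch(thread_context *hw,
                    thread_context *old_in_mem,
                    const thread_context *new_in_mem)
{
    memcpy(old_in_mem, hw, sizeof *hw);  /* write back old context */
    memcpy(hw, new_in_mem, sizeof *hw);  /* load new context */
}
```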
Hardware Multithreading
- Idea: Have multiple thread contexts in a single processor
- Why?
- To tolerate latency (reducing pipeline stalls)
- To improve system throughput
- To reduce context switch penalty
- Pros
- Latency tolerance (by running work from other threads during stall cycles)
- Better hardware utilization (because of reduced pipeline stalls)
- Reduced context switch penalty (having more register file resources for multiple threads)
- Cons
- High HW cost - Requires multiple thread contexts to be implemented in hardware (area, power, latency cost)
- Usually reduced single-thread performance
Types of Multithreading
- Fine-grained MT
- Coarse-grained MT
- Switch on event (e.g., cache miss)
- Switch on quantum/timeout
- Simultaneous MT
- Instructions from multiple threads executed concurrently in the same cycle
Fine-grained Multithreading
- Switch to another thread every clock cycle so that no two instructions from the same thread are in the pipeline concurrently (see the selection sketch at the end of this section)
- Improves pipeline utilization by taking advantage of multiple threads
- Alternative way of looking at it: Eliminate the control and data hazards among pipeline stages by overlapping the latency with useful work from other threads
- Pros
- No need for dependency checking between instructions in different pipeline stages. (only one instruction in pipeline from a single thread)
- No need for branch prediction logic for instructions in different pipeline stages.
- Otherwise-bubble cycles used for executing useful instructions from different threads
- Improved system throughput, latency tolerance, utilization
- Cons
- Extra hardware complexity: multiple hardware contexts, thread selection logic
- Reduced single thread performance (one instruction fetched every N cycles)
- Increased resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)
- Dependency checking and branch prediction for instructions in the same pipeline stage still remain (in the case of a superscalar pipeline)
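A minimal sketch of fine-grained thread selection, assuming a simple barrel-style round-robin over a hypothetical NUM_THREADS hardware contexts:

```c
#include <stdio.h>

#define NUM_THREADS 4  /* number of hardware thread contexts (assumed) */

/* Fine-grained MT: a different thread is selected every clock cycle,
 * barrel-processor style, so no two instructions from the same thread
 * are in the pipeline at once (given NUM_THREADS >= pipeline depth). */
int select_thread_fine_grained(int cycle)
{
    return cycle % NUM_THREADS;
}

int main(void)
{
    for (int cycle = 0; cycle < 8; cycle++)
        printf("cycle %d: fetch from thread %d\n",
               cycle, select_thread_fine_grained(cycle));
    return 0;  /* prints T0, T1, T2, T3, T0, T1, ... */
}
```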
Coarse-grained Multithreading
- Idea: When a thread stalls on a long-latency event (and the pipeline would stall accordingly), switch to a different hardware context (sketched below)
- Possible stall events
- Cache misses
- Floating-point operations
- Accessing slow I/O devices
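A sketch of the switch-on-event idea, assuming a long-latency event (e.g., a cache miss) marks a thread as stalled; `stalled[]` and NUM_THREADS are hypothetical simulator state.

```c
#include <stdbool.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */

/* Coarse-grained MT: keep running the current thread until it hits a
 * long-latency event; only then switch to another ready context. */
int select_thread_coarse_grained(int current, const bool stalled[])
{
    if (!stalled[current])
        return current;              /* no event: keep running */
    for (int i = 1; i < NUM_THREADS; i++) {
        int next = (current + i) % NUM_THREADS;
        if (!stalled[next])
            return next;             /* switch on event */
    }
    return current;                  /* all threads stalled */
}
```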
Fine-grained vs. Coarse-grained MT
- Relative advantages of fine-grained MT
- Simple to implement
- Switching need not have any performance overhead
- Coarse-grained requires a pipeline flush or a lot of hardware to save pipeline state
- Relative disadvantages of fine-grained MT
- Low single-thread performance: each thread gets at most 1/N of the pipeline bandwidth (N being the number of hardware thread contexts)
Simultaneous Multithreading (SMT)
- Fine-grained and coarse-grained multithreading can start execution of instructions from only a single thread at a given cycle
- In FG and CG MT methods, execution unit (or pipeline stage) utilization can be low if there are not enough instructions from a thread to “dispatch” in one cycle
- Idea: Dispatch instructions from multiple threads in the same cycle (to keep multiple execution units utilized)
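A toy sketch of SMT dispatch, assuming at most one ready instruction per thread per cycle and a hypothetical ISSUE_WIDTH; real SMT cores can also dispatch several instructions from one thread in a cycle.

```c
#include <stdbool.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */
#define ISSUE_WIDTH 4  /* dispatch slots per cycle (assumed) */

typedef struct {
    bool has_ready_insn;  /* thread has an independent, ready instruction */
} thread_state;

/* SMT: one cycle's dispatch slots can be filled with instructions
 * from several threads. Returns the number of slots filled and
 * records which thread supplied each slot. */
int smt_dispatch(const thread_state threads[], int slot_thread[])
{
    int filled = 0;
    for (int t = 0; t < NUM_THREADS && filled < ISSUE_WIDTH; t++)
        if (threads[t].has_ready_insn)
            slot_thread[filled++] = t;  /* mix threads in one cycle */
    return filled;                      /* number of slots kept busy */
}
```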
Functional Unit Utilization in a Single-Channel Pipeline
- Data dependencies reduce functional unit utilization in a pipelined processor
Functional Unit Utilization in Superscalar
- Functional unit utilization becomes lower in superscalar, OoO machines than in the simple single-channel pipeline (finding 4 instructions to execute in parallel is harder than finding one)
Predicated Execution
- Idea: Convert control dependencies into data dependencies
- FU utilization looks improved, but some of the instructions are actually NOPs (see the example below)
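A small C illustration of converting a control dependency into a data dependency, in the spirit of a conditional-move/predicated instruction (the arithmetic-select form is just one way to express it):

```c
/* Control dependency: which value is returned depends on a branch. */
int max_branch(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Data dependency: both "sides" are computed and the condition selects
 * the result, as a conditional-move/predicated instruction would.
 * The not-selected side's work is effectively a NOP. */
int max_predicated(int a, int b)
{
    int take_a = (a > b);                  /* predicate: 0 or 1 */
    return take_a * a + (1 - take_a) * b;  /* select without branching */
}
```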
Chip Multiprocessor
- Idea: Partition functional units across cores
- Still limited FU utilization within a single thread; limited single-thread performance
Fine-grained Multithreading
- Far better utilization than the single-thread case, but still low due to intra-thread dependencies
- Single thread performance suffers
Simultaneous Multithreading
- Idea: Utilize functional units with independent operations from the same or different threads
- Best performance but the highest HW cost
Horizontal vs. Vertical Waste
- Horizontal waste: issue slots left unused within a cycle (only some FUs get work)
- Vertical waste: entire cycles in which nothing issues (all FUs idle)
Simultaneous Multithreading (SMT)
- Reduces both horizontal and vertical waste
- Required hardware
- Superscalar, OoO processors already have this machinery
Basic Superscalar OoO Pipeline
SMT Pipeline
- Physical register file needs to become larger.
- Changes to pipeline for SMT
- Replicated resources
- Program counter
- Register map
- Return address stack
- Global history register
- Shared resources
- Register file (size increased)
- Instruction queue
- First- and second-level caches
- Translation lookaside buffers
- Branch predictor
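A sketch, in struct form, of the replicated-vs-shared split listed above; all sizes and field names are illustrative only.

```c
#include <stdint.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */

/* Replicated: one copy per hardware thread. */
typedef struct {
    uint64_t pc;              /* program counter */
    uint8_t  reg_map[32];     /* register map (arch -> physical) */
    uint64_t ras[16];         /* return address stack */
    uint32_t global_history;  /* global branch history register */
} per_thread_state;

/* Shared: one copy for the whole core; the physical register file is
 * enlarged to hold every thread's architectural state plus rename
 * registers. */
typedef struct {
    per_thread_state thread[NUM_THREADS];
    uint64_t phys_regs[NUM_THREADS * 32 + 64];  /* enlarged register file */
    /* instruction queue, L1/L2 caches, TLBs, and the branch predictor
     * are likewise shared in a fuller model */
} smt_core;
```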
Changes to OoO+SS Pipeline for SMT
SMT Scalability
- The gains shrink as the thread count grows. Why? ⇒ The number of issue channels of the superscalar processor is fixed.
SMT Design Consideration
- Fetch and prioritization policies
- Which thread to fetch from?
- Shared resource allocation policies
- How to prevent starvation?
- How to maximize throughput?
- How to provide fairness/QoS?
- Free-for-all vs. partitioned (see the sketch after this list)
- How to measure performance?
- Is total IPC across all threads the right metric?
- How to select threads to co-schedule
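A sketch contrasting the two allocation styles for a shared resource such as the instruction queue; IQ_SIZE and the per-thread cap are assumptions of this toy model.

```c
#include <stdbool.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */
#define IQ_SIZE 32     /* shared instruction queue entries (assumed) */

/* Partitioned: each thread owns a fixed share, so no thread can starve
 * another, but an idle thread's share goes to waste. */
bool can_allocate_partitioned(int thread, const int used_by[])
{
    return used_by[thread] < IQ_SIZE / NUM_THREADS;
}

/* Free-for-all: any thread may take any free entry, which maximizes
 * utilization but lets one slow thread monopolize the queue. */
bool can_allocate_free_for_all(int total_used)
{
    return total_used < IQ_SIZE;
}
```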
Which Thread to Fetch From?
- (Somewhat) static policies
- Round-robin
- 8 instructions from 1 thread
- 4 instructions from 2 threads
- 2 instructions from 4 threads
- Dynamic policies
- Favor threads with minimal in-flight branches
- Favor threads with minimal outstanding misses
- Favor threads with minimal in-flight instructions
- Favor threads with higher real time requirements
SMT Fetch Policies 1
- Round robin : Fetch from a different thread each cycle
- Does not work well in practice. Why?
- Instructions from slow threads monopolize the pipeline and clog the instruction window
SMT Fetch Policies 2
- ICOUNT: Fetch instructions from the thread with the fewest instructions in the earlier pipeline stages (decode, rename, instruction queues; before execution)
- It improves throughput
SMT ICOUNT Fetch Policy
- Favors faster threads that have few instructions waiting
- Advantages over round robin
- Allows faster threads to make more progress
- Higher IPC throughput
- Priority is given to threads with the fewest instructions in decode, rename, and the instruction queues. This achieves three purposes
- It prevents any one thread from filling the IQ
- It gives highest priority to threads that are moving instructions through the IQ most efficiently
- It provides a more even mix of instructions from the available threads, maximizing the parallelism in the queue
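A sketch of the ICOUNT selection step as described above; how `in_flight[]` is maintained (incremented on fetch, decremented at issue) is an assumption of this toy model.

```c
#include <limits.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */

/* ICOUNT: fetch from the thread with the fewest instructions in the
 * pre-execution stages (decode, rename, instruction queues). */
int select_thread_icount(const int in_flight[])
{
    int best = 0, best_count = INT_MAX;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (in_flight[t] < best_count) {
            best_count = in_flight[t];
            best = t;
        }
    }
    return best;  /* fewest in-flight => highest fetch priority */
}
```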