Created: November 11, 2021, 3:36 PM
Basics
- Thread
- Instruction stream with state (register and memory)
- Register state is also called the thread context
- Threads could be part of the same process (program) or from different programs
- Threads of the same program share the same address space
- Thread context switching for multitasking
- When a new thread needs to be executed, the old thread’s context in hardware is written back to memory and the new thread’s context is loaded (see the sketch below)
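Below is a minimal C sketch of what a context switch saves and restores; the `thread_context` fields and function names are hypothetical, not from any real OS or ISA.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative thread context: the register state the hardware holds
 * for the running thread (field names are hypothetical). */
typedef struct {
    uint64_t regs[32];  /* general-purpose registers */
    uint64_t pc;        /* program counter */
} thread_context;

/* Context switch: the old thread's context is written back from the
 * (simulated) hardware registers to memory, and the new thread's
 * context is loaded in its place. */
void context_switch(thread_context *hw,
                    thread_context *old_in_mem,
                    const thread_context *new_in_mem)
{
    memcpy(old_in_mem, hw, sizeof *hw);  /* write back old context */
    memcpy(hw, new_in_mem, sizeof *hw);  /* load new context */
}
```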
Hardware Multithreading
- Idea: Have multiple thread contexts in a single processor
- Why?
- To tolerate latency (reducing pipeline stalls)
- To improve system throughput
- To reduce context switch penalty
- Pros
- Latency tolerance (by running work from other threads during stall cycles)
- Better hardware utilization (because of reduced pipeline stalls)
- Reduced context switch penalty (having more register file resources for multiple threads)
- Cons
- High HW cost - Requires multiple thread contexts to be implemented in hardware (area, power, latency cost)
- Usually reduced single-thread performance
Types of Multithreading
- Fine-grained MT
- Coarse-grained MT
- Switch on event (e.g., cache miss)
- Switch on quantum/timeout
- Simultaneous MT
- Instructions from multiple threads executed concurrently in the same cycle
Fine-grained Multithreading
- Switch to another thread every clock cycle so that no two instructions from the same thread are in the pipeline concurrently (see the selection sketch at the end of this section)
- Improves pipeline utilization by taking advantage of multiple threads
- Alternative way of looking at it: Eliminate the control and data hazards among pipeline stages by overlapping the latency with useful work from other threads
- Pros
- No need for dependency checking between instructions in different pipeline stages. (only one instruction in pipeline from a single thread)
- No need for branch prediction logic for instructions in different pipeline stages.
- Otherwise-bubble cycles used for executing useful instructions from different threads
- Improved system throughput, latency tolerance, utilization
- Cons
- Extra hardware complexity: multiple hardware contexts, thread selection logic
- Reduced single thread performance (one instruction fetched every N cycles)
- Increased resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)
- Dependency checking and branch prediction for instructions in the same pipeline stage still remain (in the case of a superscalar pipeline)
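A minimal sketch of fine-grained thread selection, assuming a simple barrel-style round-robin over a hypothetical NUM_THREADS hardware contexts:

```c
#include <stdio.h>

#define NUM_THREADS 4  /* number of hardware thread contexts (assumed) */

/* Fine-grained MT: a different thread is selected every clock cycle,
 * barrel-processor style, so no two instructions from the same thread
 * are in the pipeline at once (given NUM_THREADS >= pipeline depth). */
int select_thread_fine_grained(int cycle)
{
    return cycle % NUM_THREADS;
}

int main(void)
{
    for (int cycle = 0; cycle < 8; cycle++)
        printf("cycle %d: fetch from thread %d\n",
               cycle, select_thread_fine_grained(cycle));
    return 0;  /* prints T0, T1, T2, T3, T0, T1, ... */
}
```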
Coarse-grained Multithreading
- Idea: When a thread stalls on a long-latency event (and the pipeline would stall accordingly), switch to a different hardware context (sketched below)
- Possible stall events
- Cache misses
- Floating-point operations
- Accessing slow I/O devices
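A sketch of the switch-on-event idea, assuming a long-latency event (e.g., a cache miss) marks a thread as stalled; `stalled[]` and NUM_THREADS are hypothetical simulator state.

```c
#include <stdbool.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */

/* Coarse-grained MT: keep running the current thread until it hits a
 * long-latency event; only then switch to another ready context. */
int select_thread_coarse_grained(int current, const bool stalled[])
{
    if (!stalled[current])
        return current;              /* no event: keep running */
    for (int i = 1; i < NUM_THREADS; i++) {
        int next = (current + i) % NUM_THREADS;
        if (!stalled[next])
            return next;             /* switch on event */
    }
    return current;                  /* all threads stalled */
}
```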
Fine-grained vs. Coarse-grained MT
- Relative advantages of fine-grained MT
- Simple to implement
- Switching need not have any performance overhead
- Coarse-grained requires a pipeline flush or a lot of hardware to save pipeline state
- Relative disadvantages of fine-grained MT
- Low single-thread performance: each thread gets at most 1/N of the pipeline bandwidth (N being the number of hardware thread contexts)
Simultaneous Multithreading (SMT)
- Fine-grained and coarse-grained multithreading can start execution of instructions from only a single thread at a given cycle
- In FG and CG MT methods, execution unit (or pipeline stage) utilization can be low if there are not enough instructions from a thread to “dispatch” in one cycle
- Idea: Dispatch instructions from multiple threads in the same cycle (to keep multiple execution units utilized)
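A toy sketch of SMT dispatch, assuming at most one ready instruction per thread per cycle and a hypothetical ISSUE_WIDTH; real SMT cores can also dispatch several instructions from one thread in a cycle.

```c
#include <stdbool.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */
#define ISSUE_WIDTH 4  /* dispatch slots per cycle (assumed) */

typedef struct {
    bool has_ready_insn;  /* thread has an independent, ready instruction */
} thread_state;

/* SMT: one cycle's dispatch slots can be filled with instructions
 * from several threads. Returns the number of slots filled and
 * records which thread supplied each slot. */
int smt_dispatch(const thread_state threads[], int slot_thread[])
{
    int filled = 0;
    for (int t = 0; t < NUM_THREADS && filled < ISSUE_WIDTH; t++)
        if (threads[t].has_ready_insn)
            slot_thread[filled++] = t;  /* mix threads in one cycle */
    return filled;                      /* number of slots kept busy */
}
```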
Functional Unit Utilization in a Single-Channel Pipeline
- Data dependencies reduce functional unit utilization in a pipelined processor
Functional Unit Utilization in Superscalar
- Functional unit utilization becomes lower in superscalar, OoO machines than in the simple single-channel pipeline (finding 4 instructions to execute in parallel is harder than finding one)
Predicated Execution
- Idea: Convert control dependencies into data dependencies
- FU utilization looks improved, but some of the instructions are actually NOPs (see the example below)
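A small C illustration of converting a control dependency into a data dependency, in the spirit of a conditional-move/predicated instruction (the arithmetic-select form is just one way to express it):

```c
/* Control dependency: which value is returned depends on a branch. */
int max_branch(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Data dependency: both "sides" are computed and the condition selects
 * the result, as a conditional-move/predicated instruction would.
 * The not-selected side's work is effectively a NOP. */
int max_predicated(int a, int b)
{
    int take_a = (a > b);                  /* predicate: 0 or 1 */
    return take_a * a + (1 - take_a) * b;  /* select without branching */
}
```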
Chip Multiprocessor
- Idea: Partition functional units across cores
- Still limited FU utilization within a single thread; limited single-thread performance
Fine-grained Multithreading
- Far better utilization than the single-thread case, but still low due to intra-thread dependencies
- Single thread performance suffers
Simultaneous Multithreading
- Idea: Utilize functional units with independent operations from the same or different threads
- Best performance but the highest HW cost
Horizontal vs. Vertical Waste
- Horizontal waste: issue slots left unused within a cycle (only some FUs get work)
- Vertical waste: entire cycles in which nothing issues (all FUs idle)
Simultaneous Multithreading (SMT)
- Reduces both horizontal and vertical waste
- Required hardware
- Superscalar, OoO processors already have this machinery
Basic Superscalar OoO Pipeline
SMT Pipeline
- Physical register file needs to become larger.
- Changes to pipeline for SMT
- Replicated resources
- Program counter
- Register map
- Return address stack
- Global history register
- Shared resources
- Register file (size increased)
- Instruction queue
- First- and second-level caches
- Translation lookaside buffers
- Branch predictor
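A sketch, in struct form, of the replicated-vs-shared split listed above; all sizes and field names are illustrative only.

```c
#include <stdint.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */

/* Replicated: one copy per hardware thread. */
typedef struct {
    uint64_t pc;              /* program counter */
    uint8_t  reg_map[32];     /* register map (arch -> physical) */
    uint64_t ras[16];         /* return address stack */
    uint32_t global_history;  /* global branch history register */
} per_thread_state;

/* Shared: one copy for the whole core; the physical register file is
 * enlarged to hold every thread's architectural state plus rename
 * registers. */
typedef struct {
    per_thread_state thread[NUM_THREADS];
    uint64_t phys_regs[NUM_THREADS * 32 + 64];  /* enlarged register file */
    /* instruction queue, L1/L2 caches, TLBs, and the branch predictor
     * are likewise shared in a fuller model */
} smt_core;
```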
Changes to OoO+SS Pipeline for SMT
SMT Scalability
- The gains shrink as the thread count grows. Why? ⇒ The number of issue channels of the superscalar processor is fixed.
SMT Design Consideration
- Fetch and prioritization policies
- Which thread to fetch from?
- Shared resource allocation policies
- How to prevent starvation?
- How to maximize throughput?
- How to provide fairness/QoS?
- Free-for-all vs. partitioned (see the sketch after this list)
- How to measure performance?
- Is total IPC across all threads the right metric?
- How to select threads to co-schedule
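A sketch contrasting the two allocation styles for a shared resource such as the instruction queue; IQ_SIZE and the per-thread cap are assumptions of this toy model.

```c
#include <stdbool.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */
#define IQ_SIZE 32     /* shared instruction queue entries (assumed) */

/* Partitioned: each thread owns a fixed share, so no thread can starve
 * another, but an idle thread's share goes to waste. */
bool can_allocate_partitioned(int thread, const int used_by[])
{
    return used_by[thread] < IQ_SIZE / NUM_THREADS;
}

/* Free-for-all: any thread may take any free entry, which maximizes
 * utilization but lets one slow thread monopolize the queue. */
bool can_allocate_free_for_all(int total_used)
{
    return total_used < IQ_SIZE;
}
```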
Which Thread to Fetch From?
- (Somewhat) static policies
- Round-robin
- 8 instructions from 1 thread
- 4 instructions from 2 threads
- 2 instructions from 4 threads
- Dynamic policies
- Favor threads with minimal in-flight branches
- Favor threads with minimal outstanding misses
- Favor threads with minimal in-flight instructions
- Favor threads with higher real time requirements
SMT Fetch Policies 1
- Round robin : Fetch from a different thread each cycle
- Does not work well in practice. Why?
- Instructions from slow threads monopolize the pipeline and clog the instruction window
SMT Fetch Policies 2
- ICOUNT: Fetch instructions from the thread with the fewest instructions in the earlier pipeline stages (decode, rename, instruction queues; before execution)
- It improves throughput
SMT ICOUNT Fetch Policy
- Favors faster threads that have few instructions waiting
- Advantages over round robin
- Allows faster threads to make more progress
- Higher IPC throughput
- Priority is given to threads with the fewest instructions in decode, rename, and the instruction queues. This achieves three purposes
- It prevents any one thread from filling the IQ
- It gives highest priority to threads that are moving instructions through the IQ most efficiently
- It provides a more even mix of instructions from the available threads, maximizing the parallelism in the queue
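A sketch of the ICOUNT selection step as described above; how `in_flight[]` is maintained (incremented on fetch, decremented at issue) is an assumption of this toy model.

```c
#include <limits.h>

#define NUM_THREADS 4  /* hardware contexts (assumed) */

/* ICOUNT: fetch from the thread with the fewest instructions in the
 * pre-execution stages (decode, rename, instruction queues). */
int select_thread_icount(const int in_flight[])
{
    int best = 0, best_count = INT_MAX;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (in_flight[t] < best_count) {
            best_count = in_flight[t];
            best = t;
        }
    }
    return best;  /* fewest in-flight => highest fetch priority */
}
```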