14. Multithreading 1,2,3


Basics

  • Thread
    • Instruction stream with state (register and memory)
    • The register state is also called the thread context
  • Threads could be part of the same process (program) or from different programs
    • Threads of the same program share the same address space
  • Thread context switching for multitasking
    • When a new thread needs to run, the old thread’s context in hardware is written back to memory and the new thread’s context is loaded (see the sketch below)
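
A minimal Python sketch of the context-switch idea above. The `ThreadContext` fields and `Cpu` class are illustrative assumptions, not a real ISA:

```python
# Model a thread context as register state; a context switch saves the
# running thread's state to memory and loads the next thread's state.

from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    """The register state ("thread context") saved per software thread."""
    pc: int = 0
    registers: list = field(default_factory=lambda: [0] * 32)

class Cpu:
    def __init__(self) -> None:
        self.pc = 0
        self.registers = [0] * 32                  # one physical register file
        self.saved = {}                            # "memory" holding contexts

    def context_switch(self, old_tid: int, new_tid: int) -> None:
        # Write the old thread's context back to memory ...
        self.saved[old_tid] = ThreadContext(self.pc, self.registers.copy())
        # ... then load the new thread's context into hardware.
        ctx = self.saved.get(new_tid, ThreadContext())
        self.pc, self.registers = ctx.pc, ctx.registers.copy()
```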

Hardware Multithreading

  • idea : Have multiple thread contexts in a single processor
  • Why?
    • To tolerate latency (reducing pipeline stalls)
    • To improve system throughput
    • To reduce context switch penalty
  • Advantages
    • Latency tolerance (by running other thread work during the delay time)
    • Better hardware utilization (because of reduced pipeline stalls)
    • Reduced context switch penalty (having more register file resources for multiple threads)
  • Disadvantages
    • High HW cost - Requires multiple thread contexts to be implemented in hardware (area, power, latency cost)
    • Usually reduced single-thread performance

Types of Multithreading

  • Fine-grained MT
    • Cycle by cycle
  • Coarse-grained MT
    • Switch on event (e.g., cache miss)
    • Switch on quantum/timeout
  • Simultaneous MT
    • Instructions from multiple threads executed concurrently in the same cycle

Fine-grained Multithreading

  • Switch to another thread every clock cycle so that no two instructions from the same thread are in the pipeline concurrently (see the sketch at the end of this section)
  • Improves pipeline utilization by taking advantage of multiple threads
  • Alternative way of looking at it: Eliminate the control and data hazards among pipeline stages by overlapping the latency with useful work from other threads

  • Advantages
    • No need for dependency checking between instructions in different pipeline stages. (only one instruction in pipeline from a single thread)
    • No need for branch prediction logic for instructions in different pipeline stages.
    • Cycles that would otherwise be bubbles are used to execute useful instructions from other threads
    • Improved system throughput, latency tolerance, utilization
  • Disadvantages
    • Extra hardware complexity: multiple hardware contexts, thread selection logic
    • Reduced single thread performance (one instruction fetched every N cycles)
    • Increased resource contention between threads in caches and memory
    • Dependency checking logic between threads still remains (e.g., for loads/stores)
    • Dependency checking and branch prediction among instructions in the same pipeline stage still remain (in case of superscalar)
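
A toy sketch of the cycle-by-cycle interleaving above, assuming one thread context per pipeline stage; thread names and the stage count are made up:

```python
# Fine-grained (barrel) interleaving: with as many thread contexts as
# pipeline stages, the fetch thread rotates every cycle, so no two
# instructions from one thread coexist in the pipeline and intra-thread
# hazards disappear.

NUM_STAGES = 5
threads = [f"T{i}" for i in range(NUM_STAGES)]   # one thread per stage

for cycle in range(8):
    fetched = threads[cycle % NUM_STAGES]        # round-robin thread select
    # Pipeline snapshot: stage s holds the thread fetched s cycles ago.
    pipeline = [threads[(cycle - s) % NUM_STAGES] for s in range(NUM_STAGES)]
    print(f"cycle {cycle}: fetch {fetched}, pipeline = {pipeline}")
```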

Coarse-grained Multithreading

  • idea : When a thread is stalled due to some event that incurs a long latency (and hence pipeline stalls), switch to a different hardware context (sketched below)
  • Possible stall events
    • Cache misses
    • Floating-point operations
    • Accessing slow I/O devices
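
A toy sketch of switch-on-event, assuming a simulated cache miss as the only stall event; the miss probability and thread names are made up:

```python
# Coarse-grained MT: run one thread until it hits a stall event (here a
# pretend cache miss), then switch to another hardware context.

import random

random.seed(0)
threads = ["T0", "T1"]
current = 0

for cycle in range(10):
    print(f"cycle {cycle}: executing {threads[current]}")
    if random.random() < 0.3:                    # pretend a cache miss occurred
        print(f"  cache miss in {threads[current]} -> switch context")
        current = (current + 1) % len(threads)
```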

Fine-grained vs. Coarse-grained MT

  • Relative advantages of fine-grained MT
    • Simple to implement
    • Switching need not have any performance overhead
    • Coarse-grained requires a pipeline flush or a lot of hardware to save pipeline state
  • Relative disadvantages of fine-grained MT
    • Low single-thread performance: each thread gets at most 1/Nth of the bandwidth of the pipeline (N is the number of pipeline stages, i.e., the number of interleaved threads)

Simultaneous Multithreading (SMT)

  • Fine-grained and coarse-grained multithreading can start execution of instructions from only a single thread at a given cycle
  • In FG and CG MT methods, execution unit (or pipeline stage) utilization can be low if there are not enough instructions from a thread to “dispatch” in one cycle
  • idea : Dispatch instructions from multiple threads in the same cycle (to keep multiple execution units utilized; see the sketch below)
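
A sketch of the idea, assuming per-thread queues of ready instructions and a hypothetical 4-wide issue stage:

```python
# SMT issue: each cycle, fill the issue slots with ready instructions drawn
# from multiple threads rather than from only one.

from collections import deque

ISSUE_WIDTH = 4
queues = {                            # per-thread queues of ready instructions
    "T0": deque(["add", "mul"]),
    "T1": deque(["ld", "add", "sub"]),
    "T2": deque(["br"]),
}

cycle = 0
while any(queues.values()):
    issued = []
    # Sweep the threads round-robin until the slots fill or the queues drain.
    while len(issued) < ISSUE_WIDTH and any(queues.values()):
        for tid, q in queues.items():
            if q and len(issued) < ISSUE_WIDTH:
                issued.append((tid, q.popleft()))
    print(f"cycle {cycle}: issue {issued}")
    cycle += 1
```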

Functional Unit Utilization in a Single-Channel Pipeline

  • Data dependencies reduce functional unit utilization in pipelined processor

Functional Unit Utilization in Superscalar

  • Functional unit utilization becomes lower in superscalar, OoO machines than in the simple single-channel pipeline (finding 4 instructions to execute in parallel is harder than finding one)

Predicated Execution

  • Idea : Convert control dependencies into data dependencies
  • It looks like improved FU utilization, but some of the executed instructions are actually NOPs, since the predicated-off path still occupies slots (see the sketch below)
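
A sketch of the transformation in Python terms; the two functions are hypothetical, and the final select stands in for hardware predication or a conditional move:

```python
# Predication: replace the branch by executing both sides and selecting
# with the predicate, turning a control dependence into a data dependence.
# The discarded path corresponds to the NOP-ed slots mentioned above.

def with_branch(cond: bool, a: int, b: int) -> int:
    if cond:                 # control dependence: a hard-to-predict branch
        return a + 1
    return b - 1

def predicated(cond: bool, a: int, b: int) -> int:
    t = a + 1                # executed regardless; discarded if cond is False
    f = b - 1                # executed regardless; discarded if cond is True
    return t if cond else f  # select (e.g., a conditional move)

assert with_branch(True, 3, 9) == predicated(True, 3, 9) == 4
```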

Chip Multiprocessor

  • Idea : Partition functional units across cores
  • Still limited FU utilization within a single thread; limited single-thread performance

Fine-grained Multithreading

  • Far better utilization than the single-thread case, but still low due to intra-thread dependencies
  • Single thread performance suffers

Simultaneous Multithreading

  • Idea : Utilize functional units with independent operations from the same or different threads
  • Best performance but the highest HW cost

Horizontal vs. Vertical Waste

  • Vertical waste: an entire issue cycle goes unused (no instructions issued at all, e.g., during a stall)
  • Horizontal waste: some issue slots within a cycle go unused (not enough parallel instructions to fill the issue width)
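
A small sketch that makes the two kinds of waste concrete, using a made-up trace of a 4-wide issue stage:

```python
# Count horizontal vs. vertical waste in an issue-slot trace:
# '.' = empty slot, a letter = an issued instruction.

trace = [
    "AB..",   # 2 slots used -> 2 units of horizontal waste
    "....",   # nothing issued -> 1 fully wasted cycle (vertical waste)
    "A...",
    "ABCD",   # fully utilized cycle
]

vertical = sum(1 for cycle in trace if set(cycle) == {"."})
horizontal = sum(cycle.count(".") for cycle in trace if set(cycle) != {"."})
print(f"vertical waste: {vertical} cycles, horizontal waste: {horizontal} slots")
```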

Simultaneous Multithreading (SMT)

  • Reduces both horizontal and vertical waste
  • Required hardware is modest: superscalar, OoO processors already have most of this machinery

Basic Superscalar OoO Pipeline

SMT Pipeline

  • Physical register file needs to become larger.

  • Changes to pipeline for SMT
    • Replicated resources
      • program counter
      • Register map
      • Return address stack
      • Global history register
    • Shared resources
      • Register file (size increased)
      • Instruction queue
      • First and second level caches
      • Translation lookaside buffers
      • Branch predictor
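
A sketch mirroring the lists above, grouping illustrative per-thread (replicated) state against shared structures; all names and sizes are assumptions:

```python
# Replicated vs. shared SMT state: one PerThreadState per hardware thread,
# one SharedState for the whole core.

from dataclasses import dataclass, field

@dataclass
class PerThreadState:
    """Replicated: one instance per hardware thread."""
    pc: int = 0
    register_map: dict = field(default_factory=dict)       # arch -> phys reg
    return_address_stack: list = field(default_factory=list)
    global_history_register: int = 0

@dataclass
class SharedState:
    """Shared by all hardware threads (register file must grow)."""
    physical_register_file: list = field(default_factory=lambda: [0] * 256)
    instruction_queue: list = field(default_factory=list)
    # Caches, TLBs, and the branch predictor tables are also shared.

class SmtCore:
    def __init__(self, num_threads: int) -> None:
        self.threads = [PerThreadState() for _ in range(num_threads)]
        self.shared = SharedState()

core = SmtCore(num_threads=4)       # e.g., a 4-way SMT core
```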

Changes to OoO+SS Pipeline for SMT

SMT Scalability

  • The benefit diminishes as the number of threads grows. Why? ⇒ The number of channels (issue slots) of an SS processor is fixed.

SMT Design Consideration

  • Fetch and prioritization policies
    • which thread to fetch from?
  • Shared resource allocation policies
    • How to prevent starvation?
    • How to maximize throughput?
    • How to provide fairness/QoS?
    • Free-for-all vs. partitioned
  • How to measure performance
    • Is total IPC across all threads the right metric?
  • How to select threads to co-schedule

Which Thread to Fetch From?

  • (Somewhat) static policies
    • Round-robin, e.g.:
      • 8 instructions from 1 thread
      • 4 instructions from each of 2 threads
      • 2 instructions from each of 4 threads
  • Dynamic policies
    • Favor threads with minimal in-flight branches
    • Favor threads with minimal outstanding misses
    • Favor threads with minimal in-flight instructions
    • Favor threads with higher real time requirements

SMT Fetch Policies 1

  • Round robin : Fetch from a different thread each cycle
  • Does not work well in practice. Why?
  • Instructions from slow threads monopolize the pipeline and clog the instruction window

SMT Fetch Policies 2

  • ICOUNT: Fetch instructions from the thread with the fewest instructions in the earlier pipeline stages (decode, rename, instruction queues; before execution)
  • It improves throughput

SMT ICOUNT Fetch Policy

  • Favors faster threads that have few instructions waiting
  • Advantages over round robin
    • Allows faster threads to make more progress
    • Higher IPC throughput
  • Priority is given to threads with the fewest instructions in decode, rename, and the instruction queues. This achieves three purposes (see the sketch below):
    1. It prevents any one thread from filling the IQ
    2. It gives highest priority to threads that are moving instructions through the IQ most efficiently
    3. It provides a more even mix of instructions from the available threads, maximizing the parallelism in the queue
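
A minimal sketch of ICOUNT’s selection step, assuming hypothetical per-thread counters of instructions in the pre-execution stages:

```python
# ICOUNT: fetch from the thread with the fewest instructions in the
# pre-execution stages. The counter values are made up for illustration.

def icount_pick(in_flight: dict) -> str:
    """Return the thread with the fewest pre-execution instructions."""
    return min(in_flight, key=in_flight.get)

# T1 is draining the front end fastest, so it gets fetch priority this cycle.
in_flight = {"T0": 12, "T1": 3, "T2": 7}
assert icount_pick(in_flight) == "T1"
```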