Created: December 4, 2021, 4:15 PM
GPUs are SIMD Engines Underneath
- instruction pipeline operates like a SIMD pipeline
- However, programming is done using threads, NOT SIMD instructions
How Can You Exploit Parallelism Here?
for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];
Three options for exploiting the parallelism in the code above:
- Sequential (SISD)
- Data-Parallel (SIMD)
- Multithreaded (MIMD/SPMD)
1. Sequential (SISD)
![](https://media.vlpt.us/images/lsj8706/post/1dbb6bd8-8e24-48b9-a5ae-cba931912801/Untitled.png)
- Pipelined processor
- Out-of-order execution processor
- Independent instructions executed when ready
- Different iterations are present in the instruction window and can execute in parallel in multiple functional units
- the loop is dynamically unrolled by the hardware
- Superscalar or VLIW processor
2. Data-Parallel (SIMD)
![](https://media.vlpt.us/images/lsj8706/post/e7d01581-cc45-449e-9854-6b938f580e38/Untitled%201.png)
![](https://media.vlpt.us/images/lsj8706/post/293c9fc1-0cf3-49da-a126-1b48a4fcf485/Untitled%202.png)
- Each iteration is independent
- Programmer or compiler generates a SIMD instruction to execute the same instruction from all iterations across different data
- Best executed by a SIMD processor (vector, array)
3. Multithreaded
![](https://media.vlpt.us/images/lsj8706/post/c7f28f63-0fc0-42bb-b6e2-854f667cfb42/Untitled%203.png)
- Each iteration is independent
- Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data)
- Can be executed on a MIMD machine
- This particular model is also called SPMD (Single Program Multiple Data)
- Can be executed on a SIMT machine (Single Instruction Multiple Thread)
GPU is a SIMD (SIMT) Machine
- It is programmed using threads (SPMD programming model)
- Each thread executes the same code but operates on a different piece of data
- A set of threads executing the same instruction are dynamically grouped into a warp (wavefront) by the hardware
SPMD on SIMT Machine
![](https://media.vlpt.us/images/lsj8706/post/4acc7409-e2fe-435f-9569-fe34c9c82696/Untitled%204.png)
SIMD vs. SIMT Execution Model
- SIMD: A single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs
- SIMT: Multiple instruction streams of scalar instructions → threads grouped dynamically into warps
- [LD, ADD, ST], NumThreads
- Advantages
- Can treat each thread separately
- Can group threads into warps flexibly
Fine-Grained Multithreading of Warps
- Assume a warp consists of 32 threads
- If you have 32K iterations, and 1 iteration/thread → 1K warps
- Warps can be interleaved on the same pipeline → fine-grained multithreading of warps
![](https://media.vlpt.us/images/lsj8706/post/c2edd1f1-f23a-46bf-a0b3-4586bbb4547b/Untitled%205.png)
Iter. 33·32 + 1 ⇒ 20·32 + 1, Iter. 34·32 + 2 ⇒ 20·32 + 2
Warps and Warp-Level FGMT
- Warp: A set of threads that execute the same instruction (on different data elements) → SIMT (Nvidia-speak)
- All threads run the same code
![](https://media.vlpt.us/images/lsj8706/post/181bad98-1706-4533-b678-0afcb4113634/Untitled%206.png)