Created: November 25, 2021, 9:43 PM
SIMD Processing: Exploiting Regular (Data) Parallelism
- SIMD : Single instruction operates on multiple data elements
- Array processor
- Vector processor
Data Parallelism
- Concurrency arises from performing the same operation on different pieces of data
- Single instruction multiple data (SIMD)
- Contrast with thread ("control") parallelism
- SIMD exploits operation-level parallelism on different data
- SIMD works best on arrays accessed in for loops. For SIMD to pay off, there must be a large amount of identically structured data to operate on; this is called data-level parallelism
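As a minimal illustration (pure-Python sketch, not real SIMD execution): every iteration of the loop below applies the same operation to a different element and is independent of the others, which is exactly the data-level parallelism a SIMD machine exploits.

```python
# Element-wise add: the same operation applied to many data elements.
# Each iteration is independent, so a SIMD machine could process
# several elements with a single instruction.
def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(vector_add(a, b))  # [11, 22, 33, 44]
```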
SIMD Processing
- Single instruction operates on multiple data elements
- In time or in space
- Time-space duality
- Array processor : Instruction operates on multiple data elements at the same time using different spaces
- Vector processor : Instruction operates on multiple data elements in consecutive time steps using the same space
Vector Registers
- Each vector data register holds N M-bit values
- Each register stores a vector
- Not a (single) scalar value as we saw before
Array vs. Vector Processors
SIMD Array Processing vs. VLIW
- VLIW : Multiple independent operations packed together by the compiler
- Array processor : Single operation on multiple (different) data elements
VLIW
SIMD Array processing
Vector Processors 1
Vector Stride Example : Matrix Multiply
Vector Processors 2
- A vector instruction performs an operation on each element in consecutive cycles
- Vector functional units are pipelined
- Each pipeline stage operates on a different data element
- Vector instructions allow deeper pipeline
- No intra-vector dependencies ⇒ no hardware interlocking needed within a vector
- No control flow within a vector
- Known stride allows easy address calculation for all vector elements (easy to load)
- Advantages
- No dependencies within a vector
- Pipelining & parallelization work really well
- Can have very deep pipelines (without the penalty of deep pipelines)
- Each instruction generates a lot of work (i.e., operations)
- Reduces instruction fetch bandwidth requirements
- Amortizes instruction fetch and control overhead over many data elements ⇒ Leads to high energy efficiency per operation
- No need to explicitly code loops
- Fewer branches in the instruction sequence
- Highly regular memory access pattern
- Disadvantages
- Works (only) if parallelism is regular (data/SIMD parallelism) ⇒ very inefficient if parallelism is irregular
- Limitations
- Memory (bandwidth) can easily become a bottleneck, especially if
- compute/memory operation balance is not maintained
- data is not mapped appropriately to memory banks
Vector Registers
- Each vector data register holds N M-bit values
- Vector control registers: VLEN, VSTR, VMASK
- Maximum VLEN can be N
- Vector Mask Register (VMASK)
- Indicates which elements of vector to operate on
- Set by vector test instructions (e.g., VMASK[i] = (Vk[i] == 0))
Vector Functional Units
- Use a deep pipeline to execute element operations ⇒ fast clock cycle
- Control of deep pipeline is simple because elements in vector are independent
Loading/Storing Vectors from/to Memory
- Elements separated from each other by a constant distance (stride)
- Assume stride = 1 for now
- Elements can be loaded in consecutive cycles if we can start the load of one element per cycle
- Question: How do we achieve this with a memory that takes more than 1 cycle to access? ⇒ Bank the memory; interleave the elements across banks
Memory Banking
- Memory is divided into banks that can be accessed independently; banks share address and data buses (to minimize pin cost)
- Can start and complete one bank access per cycle
- Can sustain N parallel accesses if all N go to different banks
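A sketch of the bank-mapping idea (assuming a simple modulo mapping, as in word-interleaving): element i of a stride-1 vector lands in bank i mod N, so consecutive loads hit different banks and can overlap.

```python
# Word-interleaved banking sketch (assumption: simple modulo mapping).
# Consecutive addresses map to consecutive banks, so stride-1 loads
# can be started one per cycle without waiting on a busy bank.
NUM_BANKS = 16

def bank_of(address):
    return address % NUM_BANKS

addresses = list(range(8))           # stride-1 access stream
banks = [bank_of(a) for a in addresses]
print(banks)  # [0, 1, 2, 3, 4, 5, 6, 7] -- all different, no conflict
```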
Vector Memory System
- Next address = Previous address + Stride
- If (stride == 1) && (consecutive elements interleaved across banks) && (number of banks ≥ bank latency) , then
- we can sustain 1 element/cycle throughput
- Strides need to be relatively prime to the number of memory banks
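The relatively-prime condition can be checked with a gcd: if the stride and the bank count share a common factor, the access stream cycles through only a fraction of the banks, and those banks conflict repeatedly. A small sketch:

```python
import math

# If stride and bank count share a common factor, the access stream
# touches only num_banks // gcd(stride, num_banks) distinct banks,
# so the same banks get hit again and again.
def distinct_banks(stride, num_banks):
    return num_banks // math.gcd(stride, num_banks)

print(distinct_banks(1, 16))   # 16 -> all banks used, full throughput
print(distinct_banks(10, 16))  # 8  -> only half the banks used
print(distinct_banks(8, 16))   # 2  -> severe conflicts
```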
Scalar Code Example : Element-Wise Avg.
Here, SHFR R7 = R6 >> 1 means dividing by 2. Why? Shifting a binary number one position to the right halves it (0100 ⇒ 0010 is 4 ⇒ 2)
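The same shift-equals-divide behavior, checked directly:

```python
# Right-shifting by one position divides by two (dropping the low bit),
# which is what SHFR R7 = R6 >> 1 computes for the element-wise average.
r6 = 4            # 0b0100
r7 = r6 >> 1      # 0b0010
print(r7)         # 2
print(9 >> 1)     # 4 (integer division: the remainder bit is dropped)
```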
Scalar Code Execution Time (In Order) ⭐
- Scalar execution time on an in-order processor with 1 bank
- First two loads in the loop cannot be pipelined: 2*11 cycles
- 4 + 50*40 = 2004 cycles
- Scalar execution time on an in-order processor with 16 banks (word-interleaved: consecutive words are stored in consecutive banks)
- First two loads in the loop can be pipelined (LD R4 and LD R5 complete in just 12 cycles)
- 4 + 50*30 = 1504 cycles
- Why 16 banks?
- 11-cycle memory access latency
- Having 16 (>11) banks ensures there are enough banks to overlap enough memory operations to cover memory latency
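Reproducing the cycle counts above (numbers taken from the example: a 4-cycle prologue plus 50 iterations of 40 or 30 cycles each):

```python
# Scalar execution time from the example in the notes.
iterations = 50
cycles_1_bank = 4 + iterations * 40    # loads cannot overlap: 40 cycles/iter
cycles_16_banks = 4 + iterations * 30  # banking overlaps the two loads
print(cycles_1_bank)    # 2004
print(cycles_16_banks)  # 1504
```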
Vectorizable Loops
- A loop is vectorizable if each iteration is independent of any other
- Assume no chaining (no vector data forwarding)
- One memory port (one address generator)
- 16 memory banks (word-interleaved)
Vector Chaining
- Data forwarding from one vector functional unit to another
- Chaining and 2 load ports, 1 store port in each bank
Question
Gather/Scatter Operations
- Gather/scatter operations often implemented in hardware to handle sparse vectors (matrices) or indirect indexing
- Vector loads and stores use an index vector which is added to the base register to generate the addresses
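A gather can be sketched as indexed loads: the index vector, added to a base, picks out the elements (values and indices here are made up for illustration).

```python
# Gather sketch: an index vector added to a base register generates
# the load addresses, which handles sparse or indirectly indexed data.
memory = [0, 10, 20, 30, 40, 50, 60, 70]
base = 2
index_vector = [0, 3, 5]   # hypothetical indices

gathered = [memory[base + i] for i in index_vector]
print(gathered)  # [20, 50, 70]
```

A scatter is the mirror image: the same base-plus-index addresses are used as store targets instead of load sources.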
Conditional Operations in a Loop
Sometimes, as in the code above, the operation must not be executed for certain elements
- Solution : Masked operations
- VMASK register is a bit mask determining which data elements should not be acted upon
- This is predicated execution. Execution is predicated on mask bit.
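A sketch of predicated execution (names are illustrative): a vector test sets the mask, and the subsequent vector operation updates only the elements whose mask bit is set.

```python
# Masked (predicated) execution sketch: VMASK decides, per element,
# whether the vector operation takes effect.
a = [2, 0, 5, 0]
b = [3, 7, 1, 9]

# Vector test instruction: set mask bit where a[i] != 0
vmask = [x != 0 for x in a]

# Masked multiply: only elements with mask bit set are updated
b = [x * y if m else y for x, y, m in zip(a, b, vmask)]
print(b)  # [6, 7, 5, 9]
```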
Some Issues
- Stride and banking
- As long as they are relatively prime to each other and there are enough banks to cover bank access latency, we can sustain 1 element/cycle throughput
- Storage format of a matrix
- Row major: Consecutive elements in a row are laid out consecutively in memory
- Column major: Consecutive elements in a column are laid out consecutively in memory
- You need to change the stride when accessing a row versus column ⇒ Bank Conflicts
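The stride difference follows directly from the address formula for a row-major matrix (dimensions here are illustrative): element (r, c) of a ROWS x COLS matrix lives at r*COLS + c, so a row walk has stride 1 and a column walk has stride COLS.

```python
# Address calculation for a ROWS x COLS matrix stored row-major.
ROWS, COLS = 4, 10

def addr(r, c):
    return r * COLS + c

row0 = [addr(0, c) for c in range(3)]   # walking a row: stride 1
col0 = [addr(r, 0) for r in range(3)]   # walking a column: stride COLS
print(row0)  # [0, 1, 2]
print(col0)  # [0, 10, 20]
```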
Bank Conflicts in Matrix Multiplication
- A and B matrices, both stored in memory in row-major order
- Load A’s row 0 into vector register V1
- Accesses have a stride of 1
- Load B’s column 0 into vector register V2
- Accesses have a stride of 10
⇒ The two strides differ ⇒ bank conflicts ⇒ how can we minimize them?
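To see the conflict concretely, here is which banks each access stream touches, assuming 8 banks (an assumed power-of-two count, not from the notes): stride 1 cycles through every bank, but stride 10 keeps revisiting the same four even banks.

```python
# Which banks does each access stream touch? With 8 banks, stride 1
# uses all of them, but stride 10 (gcd(10, 8) = 2) hits only the
# even banks, so those banks conflict repeatedly.
NUM_BANKS = 8

def banks_touched(stride, n_accesses):
    return sorted({(i * stride) % NUM_BANKS for i in range(n_accesses)})

print(banks_touched(1, 16))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(banks_touched(10, 16))  # [0, 2, 4, 6]
```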
Minimizing Bank Conflicts
- More banks
- More ports in each bank
- Better data layout to match the access pattern
- Better mapping of address to bank
Array vs. Vector Processors, Revisited
- Most “modern” SIMD processors are a combination of both
- They exploit data parallelism in both time and space
Vector/SIMD Processing Summary
- Vector/SIMD machines are good at exploiting regular data-level parallelism
- Same operation performed on many data elements
- Improve performance, simplify design (no intra-vector dependencies)
- Performance improvement limited by vectorizability of code
- Scalar operations limit vector machine performance
- Remember Amdahl’s Law
SIMD ISA Extensions
- Single Instruction Multiple Data (SIMD) extension instructions
- Single instruction acts on multiple pieces of data at once
- Mainly used in graphics applications
- For example: add four 8-bit numbers
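The "add four 8-bit numbers" example can be emulated in pure Python (a sketch of what a packed-add instruction does in one cycle; lane overflow wraps within each byte, as in typical packed integer adds):

```python
# SIMD-extension sketch: pack four 8-bit lanes into one 32-bit word
# and add them lane by lane, emulating a single packed-add instruction.
def packed_add_8x4(x, y):
    result = 0
    for lane in range(4):
        a = (x >> (8 * lane)) & 0xFF
        b = (y >> (8 * lane)) & 0xFF
        result |= ((a + b) & 0xFF) << (8 * lane)  # wrap within the lane
    return result

x = 0x01020304
y = 0x10203040
print(hex(packed_add_8x4(x, y)))  # 0x11223344
```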