Created: November 25, 2021, 9:43 PM
SIMD Processing: Exploiting Regular (Data) Parallelism
- SIMD : Single instruction operates on multiple data elements
- Array processor
- Vector processor
Data Parallelism
- Concurrency arises from performing the same operation on different pieces of data
- Single instruction multiple data (SIMD)
- Contrast with thread ("control") parallelism
- SIMD exploits operation-level parallelism on different data
- SIMD works best on arrays accessed in for loops. For SIMD to pay off, there must be a large amount of identically structured data to operate on; this is called data-level parallelism
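As a minimal illustration (pure-Python sketch, not real SIMD execution): every iteration of the loop below applies the same operation to a different element and is independent of the others, which is exactly the data-level parallelism a SIMD machine exploits.

```python
# Element-wise add: the same operation applied to many data elements.
# Each iteration is independent, so a SIMD machine could process
# several elements with a single instruction.
def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(vector_add(a, b))  # [11, 22, 33, 44]
```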
SIMD Processing
- Single instruction operates on multiple data elements
- In time or in space
- Time-space duality
- Array processor : Instruction operates on multiple data elements at the same time using different spaces
- Vector processor : Instruction operates on multiple data elements in consecutive time steps using the same space
Vector Registers
- Each vector data register holds N M-bit values
- Each register stores a vector
- Not a (single) scalar value as we saw before
Array vs. Vector Processors
SIMD Array Processing vs. VLIW
- VLIW : Multiple independent operations packed together by the compiler
- Array processor : Single operation on multiple (different) data elements
VLIW
SIMD Array processing
Vector Processors 1
Vector Stride Example : Matrix Multiply
Vector Processors 2
- A vector instruction performs an operation on each element in consecutive cycles
- Vector functional units are pipelined
- Each pipeline stage operates on a different data element
- Vector instructions allow deeper pipeline
- No intra-vector dependencies ⇒ no hardware interlocking needed within a vector
- No control flow within a vector
- Known stride allows easy address calculation for all vector elements (easy to load)
- Advantages
- No dependencies within a vector
- Pipelining & parallelization work really well
- Can have very deep pipelines (without the penalty of deep pipelines)
- Each instruction generates a lot of work (i.e., operations)
- Reduces instruction fetch bandwidth requirements
- Amortizes instruction fetch and control overhead over many data elements ⇒ Leads to high energy efficiency per operation
- No need to explicitly code loops
- Fewer branches in the instruction sequence
- Highly regular memory access pattern
- Disadvantages
- Works (only) if parallelism is regular (data/SIMD parallelism) ⇒ very inefficient if parallelism is irregular
- Limitations
- Memory (bandwidth) can easily become a bottleneck, especially if
- compute/memory operation balance is not maintained
- data is not mapped appropriately to memory banks
Vector Registers
- Each vector data register holds N M-bit values
- Vector control registers: VLEN, VSTR, VMASK
- Maximum VLEN can be N
- Vector Mask Register (VMASK)
- Indicates which elements of vector to operate on
- Set by vector test instructions (e.g., VMASK[i] = (Vk[i] == 0))
Vector Functional Units
- Use a deep pipeline to execute element operations ⇒ fast clock cycle
- Control of deep pipeline is simple because elements in vector are independent
Loading/Storing Vectors from/to Memory
- Elements separated from each other by a constant distance (stride)
- Assume stride = 1 for now
- Elements can be loaded in consecutive cycles if we can start the load of one element per cycle
- Question: How do we achieve this with a memory that takes more than 1 cycle to access? ⇒ Bank the memory; interleave the elements across banks
Memory Banking
- Memory is divided into banks that can be accessed independently; banks share address and data buses (to minimize pin cost)
- Can start and complete one bank access per cycle
- Can sustain N parallel accesses if all N go to different banks
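A sketch of the bank-mapping idea (assuming a simple modulo mapping, as in word-interleaving): element i of a stride-1 vector lands in bank i mod N, so consecutive loads hit different banks and can overlap.

```python
# Word-interleaved banking sketch (assumption: simple modulo mapping).
# Consecutive addresses map to consecutive banks, so stride-1 loads
# can be started one per cycle without waiting on a busy bank.
NUM_BANKS = 16

def bank_of(address):
    return address % NUM_BANKS

addresses = list(range(8))           # stride-1 access stream
banks = [bank_of(a) for a in addresses]
print(banks)  # [0, 1, 2, 3, 4, 5, 6, 7] -- all different, no conflict
```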
Vector Memory System
- Next address = Previous address + Stride
- If (stride == 1) && (consecutive elements interleaved across banks) && (number of banks ≥ bank latency) , then
- we can sustain 1 element/cycle throughput
- Strides need to be relatively prime to the number of memory banks
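The relatively-prime condition can be checked with a gcd: if the stride and the bank count share a common factor, the access stream cycles through only a fraction of the banks, and those banks conflict repeatedly. A small sketch:

```python
import math

# If stride and bank count share a common factor, the access stream
# touches only num_banks // gcd(stride, num_banks) distinct banks,
# so the same banks get hit again and again.
def distinct_banks(stride, num_banks):
    return num_banks // math.gcd(stride, num_banks)

print(distinct_banks(1, 16))   # 16 -> all banks used, full throughput
print(distinct_banks(10, 16))  # 8  -> only half the banks used
print(distinct_banks(8, 16))   # 2  -> severe conflicts
```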
Scalar Code Example : Element-Wise Avg.
Here, SHFR R7 = R6 >> 1 means dividing by 2. Why? Shifting a binary number one position to the right halves it (0100 ⇒ 0010 is 4 ⇒ 2)
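The same shift-equals-divide behavior, checked directly:

```python
# Right-shifting by one position divides by two (dropping the low bit),
# which is what SHFR R7 = R6 >> 1 computes for the element-wise average.
r6 = 4            # 0b0100
r7 = r6 >> 1      # 0b0010
print(r7)         # 2
print(9 >> 1)     # 4 (integer division: the remainder bit is dropped)
```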
Scalar Code Execution Time (In Order) ⭐
- Scalar execution time on an in-order processor with 1 bank
- First two loads in the loop cannot be pipelined: 2*11 cycles
- 4 + 50*40 = 2004 cycles
- Scalar execution time on an in-order processor with 16 banks (word-interleaved: consecutive words are stored in consecutive banks)
- First two loads in the loop can be pipelined (LD R4 and LD R5 complete in just 12 cycles)
- 4 + 50*30 = 1504 cycles
- Why 16 banks?
- 11-cycle memory access latency
- Having 16 (>11) banks ensures there are enough banks to overlap enough memory operations to cover memory latency
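Reproducing the cycle counts above (numbers taken from the example: a 4-cycle prologue plus 50 iterations of 40 or 30 cycles each):

```python
# Scalar execution time from the example in the notes.
iterations = 50
cycles_1_bank = 4 + iterations * 40    # loads cannot overlap: 40 cycles/iter
cycles_16_banks = 4 + iterations * 30  # banking overlaps the two loads
print(cycles_1_bank)    # 2004
print(cycles_16_banks)  # 1504
```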
Vectorizable Loops
- A loop is vectorizable if each iteration is independent of any other
- Assume no chaining (no vector data forwarding)
- One memory port (one address generator)
- 16 memory banks (word-interleaved)
Vector Chaining
- Data forwarding from one vector functional unit to another
- Chaining and 2 load ports, 1 store port in each bank
Question
Gather/Scatter Operations
- Gather/scatter operations often implemented in hardware to handle sparse vectors (matrices) or indirect indexing
- Vector loads and stores use an index vector which is added to the base register to generate the addresses
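A gather can be sketched as indexed loads: the index vector, added to a base, picks out the elements (values and indices here are made up for illustration).

```python
# Gather sketch: an index vector added to a base register generates
# the load addresses, which handles sparse or indirectly indexed data.
memory = [0, 10, 20, 30, 40, 50, 60, 70]
base = 2
index_vector = [0, 3, 5]   # hypothetical indices

gathered = [memory[base + i] for i in index_vector]
print(gathered)  # [20, 50, 70]
```

A scatter is the mirror image: the same base-plus-index addresses are used as store targets instead of load sources.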
Conditional Operations in a Loop
Sometimes, as in the code above, the operation must not be executed for certain elements
- Solution : Masked operations
- VMASK register is a bit mask determining which data elements should not be acted upon
- This is predicated execution. Execution is predicated on mask bit.
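A sketch of predicated execution (names are illustrative): a vector test sets the mask, and the subsequent vector operation updates only the elements whose mask bit is set.

```python
# Masked (predicated) execution sketch: VMASK decides, per element,
# whether the vector operation takes effect.
a = [2, 0, 5, 0]
b = [3, 7, 1, 9]

# Vector test instruction: set mask bit where a[i] != 0
vmask = [x != 0 for x in a]

# Masked multiply: only elements with mask bit set are updated
b = [x * y if m else y for x, y, m in zip(a, b, vmask)]
print(b)  # [6, 7, 5, 9]
```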
Some Issues
- Stride and banking
- As long as they are relatively prime to each other and there are enough banks to cover bank access latency, we can sustain 1 element/cycle throughput
- Storage format of a matrix
- Row major: Consecutive elements in a row are laid out consecutively in memory
- Column major: Consecutive elements in a column are laid out consecutively in memory
- You need to change the stride when accessing a row versus column ⇒ Bank Conflicts
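The stride difference follows directly from the address formula for a row-major matrix (dimensions here are illustrative): element (r, c) of a ROWS x COLS matrix lives at r*COLS + c, so a row walk has stride 1 and a column walk has stride COLS.

```python
# Address calculation for a ROWS x COLS matrix stored row-major.
ROWS, COLS = 4, 10

def addr(r, c):
    return r * COLS + c

row0 = [addr(0, c) for c in range(3)]   # walking a row: stride 1
col0 = [addr(r, 0) for r in range(3)]   # walking a column: stride COLS
print(row0)  # [0, 1, 2]
print(col0)  # [0, 10, 20]
```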
Bank Conflicts in Matrix Multiplication
- A and B matrices, both stored in memory in row-major order
- Load A’s row 0 into vector register V1
- Accesses have a stride of 1
- Load B’s column 0 into vector register V2
- Accesses have a stride of 10
⇒ The two strides differ ⇒ bank conflicts ⇒ how can we minimize them?
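To see the conflict concretely, here is which banks each access stream touches, assuming 8 banks (an assumed power-of-two count, not from the notes): stride 1 cycles through every bank, but stride 10 keeps revisiting the same four even banks.

```python
# Which banks does each access stream touch? With 8 banks, stride 1
# uses all of them, but stride 10 (gcd(10, 8) = 2) hits only the
# even banks, so those banks conflict repeatedly.
NUM_BANKS = 8

def banks_touched(stride, n_accesses):
    return sorted({(i * stride) % NUM_BANKS for i in range(n_accesses)})

print(banks_touched(1, 16))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(banks_touched(10, 16))  # [0, 2, 4, 6]
```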
Minimizing Bank Conflicts
- More banks
- More ports in each bank
- Better data layout to match the access pattern
- Better mapping of address to bank
Array vs. Vector Processors, Revisited
- Most “modern” SIMD processors are a combination of both
- They exploit data parallelism in both time and space
Vector/SIMD Processing Summary
- Vector/SIMD machines are good at exploiting regular data-level parallelism
- Same operation performed on many data elements
- Improve performance, simplify design (no intra-vector dependencies)
- Performance improvement limited by vectorizability of code
- Scalar operations limit vector machine performance
- Remember Amdahl’s Law
SIMD ISA Extensions
- Single Instruction Multiple Data (SIMD) extension instructions
- Single instruction acts on multiple pieces of data at once
- Mainly used in graphics applications
- For example: add four 8-bit numbers
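The "add four 8-bit numbers" example can be emulated in pure Python (a sketch of what a packed-add instruction does in one cycle; lane overflow wraps within each byte, as in typical packed integer adds):

```python
# SIMD-extension sketch: pack four 8-bit lanes into one 32-bit word
# and add them lane by lane, emulating a single packed-add instruction.
def packed_add_8x4(x, y):
    result = 0
    for lane in range(4):
        a = (x >> (8 * lane)) & 0xFF
        b = (y >> (8 * lane)) & 0xFF
        result |= ((a + b) & 0xFF) << (8 * lane)  # wrap within the lane
    return result

x = 0x01020304
y = 0x10203040
print(hex(packed_add_8x4(x, y)))  # 0x11223344
```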