OpenMP Optimization on NUMA Architectures

규규 · June 20, 2024

Pitfalls of multi-threaded programming on NUMA architectures

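Thread migration
- When the workload differs significantly between cores, the OS generally moves tasks (threads or processes) from one core to another. This improves load balancing but incurs an additional cost.
- On a NUMA system this is undesirable: once a task moves to a different node, the data it touched earlier stays behind, so the task has to reach that memory remotely over the slower interconnect.
- The fix on NUMA systems is pinning.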
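To see whether threads are actually pinned, the OpenMP runtime can be asked which place each thread is bound to. The following is a minimal sketch, assuming a compiler that supports OpenMP 4.5 or later; the helper names `record_thread_places` and `print_thread_places` are made up for illustration. `omp_get_place_num()` returns -1 for a thread that is not bound to any place.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Records the place number of every thread of one parallel region.
 * places[t] is the place thread t is bound to, or -1 if it is unbound.
 * Returns the number of entries written (team size, capped at capacity).
 * Without OpenMP support the code falls back to one unbound thread. */
int record_thread_places(int *places, int capacity)
{
    assert(places != NULL && capacity > 0);
    int team = 1;
    places[0] = -1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        if (t < capacity)
            places[t] = omp_get_place_num();
        #pragma omp single
        team = omp_get_num_threads();
    }
#endif
    return team < capacity ? team : capacity;
}

/* Prints one line per thread: its place number, or "not bound". */
void print_thread_places(void)
{
    int places[256];
    int n = record_thread_places(places, 256);
    for (int t = 0; t < n; t++) {
        if (places[t] < 0)
            printf("thread %d: not bound\n", t);
        else
            printf("thread %d: place %d\n", t, places[t]);
    }
}
```

Calling `print_thread_places()` from `main`, building with `gcc -fopenmp`, and running once with and once without the binding variables described under Binding/Pinning should show the difference between bound and unbound threads.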
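Data placement
- If the programmer allocates memory without regard to the NUMA architecture, every thread running on a socket other than the one holding the data has to access that memory over the slow interconnect.
- The first-touch policy addresses this. Because the OS is aware of the NUMA topology, it places each memory page in the local memory of the thread that touches it first.
NUMA-aware code example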

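 // Initialize in parallel so that each thread first-touches the chunk it will use later.
 #pragma omp parallel for schedule(static)
 for(int i = 0; i < N; i++)
 {
     a[i] = 0.0;
     b[i] = i;
 }

 // Same static schedule: each thread works on the chunk it initialized.
 #pragma omp parallel for schedule(static)
 for(int i = 0; i < N; i++)
 {
     a[i] = a[i] + b[i];
 }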

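If pinning is applied correctly, and both loops split the iterations across threads the same way, the data chunks end up distributed so that each thread only needs to access its own local memory.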
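A self-contained version of the same idea might look like the sketch below. The function names and the use of `malloc` are assumptions added for illustration. On Linux, a large `malloc` typically only reserves virtual address space, so the physical pages are placed when the parallel initialization loop first writes to them.

```c
#include <assert.h>
#include <stdlib.h>

/* First touch: each thread writes the chunk it will work on later,
 * so the OS can place those pages in that thread's local memory. */
void init_first_touch(double *a, double *b, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = (double)i;
    }
}

/* Same static schedule as the initialization, so thread t
 * works on the chunk that thread t touched first. */
void add_in_place(double *a, const double *b, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

/* Runs a = a + b on n elements and returns the sum of a,
 * or -1.0 if the allocation fails. */
double run_example(long n)
{
    assert(n > 0);
    double *a = malloc((size_t)n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }
    init_first_touch(a, b, n);
    add_in_place(a, b, n);

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];

    free(a);
    free(b);
    return sum;
}
```

Initializing the arrays in a serial loop instead would make the main thread touch every page first, which places all of the data on a single node.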

Binding/Pinning

The terms thread pinning, thread affinity, process binding, and process affinity are used interchangeably.
How to specify the cores OpenMP threads run on
- Two environment variables are needed: OMP_PLACES and OMP_PROC_BIND
- OMP_PLACES
  - specifies the cores on which threads may run
  - specified as an interval
    - <lowerbound>:<length>:<stride>
    - example
      - OMP_PLACES : {0}:12:2 or {0:1}:12:2 -> Places : {0},{2},{4},{6},{8},{10},{12},{14},{16},{18},{20},{22}
      - OMP_PLACES : {0,1}:6:4 or {0:2}:6:4 -> Places : {0,1},{4,5},{8,9},{12,13},{16,17},{20,21}
  - specified as a list : OMP_PLACES="{0,1,2,3}" (a single place containing cores 0 to 3)
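- OMP_PROC_BIND
  - sets the binding policy that maps threads to the places
  - if it is unset, the behavior is implementation-defined; typically threads are not bound, so the OS may schedule them on any node or core
  - value
    - true : threads are bound and cannot migrate
    - false : threads are not bound and may migrate
    - master : worker threads are placed in the same partition as the master thread
    - close : worker threads are placed in partitions close to the master thread
    - spread : worker threads are spread across the partitions as sparsely as possible, maximizing the distance between neighboring threads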
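The interval notation for OMP_PLACES is easier to follow once it is expanded by hand. The helper below is a small illustrative sketch, not part of OpenMP: `expand_places` and its parameters are hypothetical names. It expands `{first:per_place}:count:stride` into the explicit place list, in the same way as the examples above.

```c
#include <assert.h>
#include <stdio.h>

/* Expands {first:per_place}:count:stride into an explicit place list
 * such as "{0,1},{4,5}". Place p starts at core first + p * stride and
 * holds per_place consecutive cores.
 * Returns the length of the string written to out, or -1 if out is too small. */
int expand_places(int first, int per_place, int count, int stride,
                  char *out, size_t size)
{
    assert(out != NULL && size > 0 && per_place > 0 && count > 0);
    size_t used = 0;
    out[0] = '\0';
    for (int p = 0; p < count; p++) {
        for (int c = 0; c < per_place; c++) {
            /* "{" opens the first place, ",{" every later one,
             * "," separates cores inside a place. */
            const char *prefix = (c > 0) ? "," : (p > 0) ? ",{" : "{";
            int n = snprintf(out + used, size - used, "%s%d",
                             prefix, first + p * stride + c);
            if (n < 0 || (size_t)n >= size - used)
                return -1;
            used += (size_t)n;
        }
        if (size - used < 2)
            return -1;
        out[used++] = '}';
        out[used] = '\0';
    }
    return (int)used;
}
```

For instance, `expand_places(0, 2, 6, 4, buf, sizeof buf)` corresponds to `{0:2}:6:4`, and `expand_places(0, 1, 12, 2, buf, sizeof buf)` corresponds to `{0}:12:2`.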
Open MPI (mpirun) options : --bind-to, --map-by, --report-bindings
TODO : ...
Binding strategy
- You need to know the system topology of the target machine (use the cpuinfo or hwloc-ls commands)
- Placing threads on different sockets
  - increases the memory bandwidth available to the app
  - increases the cache size available to the app
  - reduces synchronization performance
- Placing threads on the same socket
  - reduces the available memory bandwidth and cache size
  - improves synchronization performance
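The trade-off can also be expressed per parallel region with the `proc_bind` clause instead of a single global setting. The sketch below assumes two hypothetical kernels chosen for illustration: a bandwidth-bound triad that asks for `spread`, and a loop that hits one shared atomic counter and asks for `close`. Whether either choice pays off depends on the machine, so treat it as a starting point for measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Bandwidth-bound: spread the threads over the places so the loop can
 * draw on the memory bandwidth and caches of several sockets. */
void triad(double *a, const double *b, const double *c, double s, long n)
{
    assert(a != NULL && b != NULL && c != NULL);
    #pragma omp parallel for schedule(static) proc_bind(spread)
    for (long i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}

/* Synchronization-heavy: every match updates one shared counter
 * atomically, so keep the threads close to each other. */
long count_matches(const int *v, long n, int key)
{
    assert(v != NULL);
    long hits = 0;
    #pragma omp parallel for schedule(static) proc_bind(close)
    for (long i = 0; i < n; i++) {
        if (v[i] == key) {
            #pragma omp atomic
            hits++;
        }
    }
    return hits;
}
```

The clause only has an effect when places are defined and binding is enabled. With OMP_PROC_BIND=false the runtime ignores it.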

Sources:

https://hpc-wiki.info/hpc/NUMA

https://hpc-wiki.info/hpc/Binding/Pinning

https://hpc-wiki.info/hpc/OpenMP_in_Small_Bites/NUMA
