[CS] lecture11

Minsolยท2024๋…„ 12์›” 14์ผ
0

๐Ÿ–ฅ๏ธCS

๋ชฉ๋ก ๋ณด๊ธฐ
10/12

Chapter 5: Large and Fast- Exploiting Memory Hierarchy (ํฌ๊ณ  ๋น ๋ฅธ - ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ํ™œ์šฉ)

Cache Coherence Problem(์บ์‹œ ์ผ๊ด€์„ฑ ๋ฌธ์ œ)

๐Ÿšจ์ด ๋ฌธ์ œ๊ฐ€ ์–ด๋–ค ์ƒํ™ฉ์ธ์ง€ ์•Œ๊ณ  ์ด๋ฆ„ ๋งํ•  ์ˆ˜ ์žˆ๋„๋ก !

  • ๋‘ ๊ฐœ์˜ CPU core๊ฐ€ ๋ฌผ๋ฆฌ์  ์ฃผ์†Œ ๊ณต๊ฐ„(physical address space)์„ ๊ณต์œ ํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•  ๋•Œ
    - Write-through caches

๐Ÿ–ผ๏ธ๋‘๊ฐœ์˜ CPU Core(CPU A, CPU B)๊ฐ€ ๋™์ผํ•œ ๋ฌผ๋ฆฌ์  ์ฃผ์†Œ๋ฅผ ๊ณต์œ ํ•˜๊ณ  ์žˆ๊ณ  ๊ฐ CPU๊ฐ€ ๋…๋ฆฝ์ ์œผ๋กœ ์บ์‹œ๋ฅผ ๊ด€๋ฆฌํ•˜๋Š” ์ƒํ™ฉ
-> ๊ฐ ์‹œ๊ฐ„ ๋‹จ๊ณ„์—์„œ ์–ด๋–ค ์ผ์ด ๋ฐœ์ƒํ•˜๋Š”์ง€ & ๊ทธ์— ๋”ฐ๋ผ ๊ฐ CPU์˜ ์บ์‹œ์™€ ๋ฉ”๋ชจ๋ฆฌ ์ƒํƒœ๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ๋ณด์—ฌ์คŒ

0: ์ฒ˜์Œ ์‹œ์ž‘์‹œ CPU A์™€ CPU B ๋ชจ๋‘ X ๋ณ€์ˆ˜์— ์ ‘๊ทผx
1: CPU A๊ฐ€ X๋ฅผ ์ฝ๊ณ  CPU A์˜ ์บ์‹œ์—์„œ X ๊ฐ’์€ ์—ฌ์ „ํžˆ 0์ด๊ณ , CPU B์˜ ์บ์‹œ์™€ ๋ฉ”๋ชจ๋ฆฌ ์—ญ์‹œ ๊ฐ’์ด ๋ณ€ํ•˜์ง€ x
2: CPU B๋„ X๋ฅผ ์ฝ๊ณ  CPU A์™€ CPU B ๋ชจ๋‘ X๋ฅผ ์บ์‹œ์— ์ €์žฅํ•˜๊ณ  ์žˆ์ง€๋งŒ, ๋‘˜ ๋‹ค 0์œผ๋กœ ๋™์ผํ•œ ๊ฐ’์„ ์œ ์ง€
3: CPU A๊ฐ€ X์— ๊ฐ’์„ 1๋กœ ์“ฐ๊ธฐ ์‹œ์ž‘ํ•˜๊ณ  ์ด๋•Œ CPU A์˜ ์บ์‹œ๋Š” X ๊ฐ’์„ 1๋กœ ์—…๋ฐ์ดํŠธํ•˜๊ณ , ๋ฉ”๋ชจ๋ฆฌ ๋˜ํ•œ 1๋กœ ์—…๋ฐ์ดํŠธ๋จ but CPU B์˜ ์บ์‹œ์—๋Š” ์•„์ง 0์ด ์ €์žฅ

โžก๏ธํ•ต์‹ฌ: ๋‘ CPU๊ฐ€ ๊ฐ™์€ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ๊ณ  ์“ฐ๋Š” ๊ฒฝ์šฐ, ๊ฐ CPU์˜ ์บ์‹œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ฐœ์ƒ <<CPU A๋Š” 1์„ ์ผ์ง€๋งŒ, CPU B๋Š” ์—ฌ์ „ํžˆ X๋ฅผ 0์œผ๋กœ ๊ฐ€์ง€๊ณ  ์žˆ์„ ์ˆ˜ ์žˆ์Œ & ์ด๋•Œ, CPU B๊ฐ€ X๋ฅผ ์ฝ์„ ๋•Œ 1์„ ์ฝ์ง€ ์•Š๊ณ  0์„ ์ฝ์œผ๋ฉด ์ผ๊ด€์„ฑ์ด ๊นจ์ง€๊ฒŒ ๋˜๋ฏ€๋กœ ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์บ์‹œ ์ผ๊ด€์„ฑ ํ”„๋กœํ† ์ฝœ์ด ํ•„์š”>>

Coherence Defined(์ผ๊ด€์„ฑ์˜ ์ •์˜)

  • ๋น„๊ณต์‹์ ์œผ๋กœ(informally): ์ฝ๊ธฐ(Reads)๋Š” ๊ฐ€์žฅ ์ตœ๊ทผ์— ๊ธฐ๋ก๋œ ๊ฐ’(most recently written value)์„ ๋ฐ˜ํ™˜ํ•ด์•ผํ•จ
  • ๊ณต์‹์ ์œผ๋กœ(formally):
    • P๊ฐ€ X์— ์“ฐ๊ณ , P๊ฐ€ X๋ฅผ ์ฝ๋Š”๋‹ค(์ค‘๊ฐ„์— ๋‹ค๋ฅธ ์“ฐ๊ธฐ ์—†์Œ=no intervening writes)
      • => ์ฝ๊ธฐ๋Š” ๊ธฐ๋ก๋œ ๊ฐ’์„ ๋ฐ˜ํ™˜
    • P1์ด X์— ์“ฐ๊ณ , P2๊ฐ€ X๋ฅผ ์ฝ๋Š”๋‹ค(์ถฉ๋ถ„ํžˆ ๋‚˜์ค‘์—==sufficiently later)
      • => ์ฝ๊ธฐ๋Š” ๊ธฐ๋ก๋œ ๊ฐ’์„ ๋ฐ˜ํ™˜
        • ex. ์œ„ ์˜ˆ์‹œ์—์„œ 3๋‹จ๊ณ„ ์ดํ›„ CPU B๊ฐ€ X๋ฅผ ์ฝ๋Š” ๊ฒฝ์šฐ ์ฐธ๊ณ 
    • P1์ด X์— ์“ฐ๊ณ , P2๊ฐ€ X์— ์“ด๋‹ค
      • => ๋ชจ๋“  ํ”„๋กœ์„ธ์„œ๋Š” ๋™์ผํ•œ ์ˆœ์„œ๋กœ ์“ฐ๊ธฐ๋ฅผ ๋ณธ๋‹ค
        • X์˜ ์ตœ์ข… ๊ฐ’์ด ๋ชจ๋‘ ๋™์ผํ•ด์•ผํ•จ

Cache Coherence Protocols(์บ์‹œ ์ผ๊ด€์„ฑ ํ”„๋กœํ† ์ฝœ)

  • ๋‹ค์ค‘ํ”„๋กœ์„ธ์„œ(multiprocessors)์—์„œ ์ผ๊ด€์„ฑ์„ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ์บ์‹œ์—์„œ ์ˆ˜ํ–‰๋˜๋Š” ์ž‘์—…
    • data๋ฅผ local cache๋กœ ์ด๋™(migration)
      • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋Œ€ํ•œ bandwidth(๋Œ€์—ญํญ) ๊ฐ์†Œ
    • read-shared data์˜ ๋ณต์ œ(replication)
      • ์ ‘๊ทผ์— ๋Œ€ํ•œ ๊ฒฝ์Ÿ(contention) ๊ฐ์†Œ
  • Snnoping protocols
    • ๊ฐ ์บ์‹œ๊ฐ€ bus์˜ reads/writes๋ฅผ ๋ชจ๋‹ˆํ„ฐ๋ง
  • Directory-based protocols
    • ์บ์‹œ์™€ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ block์˜ ๊ณต์œ  ์ƒํƒœ๋ฅผ directory์— ๊ธฐ๋ก(record)

Chapter 6: Parallel Processors from Client to Cloud(ํด๋ผ์ด์–ธํŠธ์—์„œ ํด๋ผ์šฐ๋“œ๊นŒ์ง€์˜ ๋ณ‘๋ ฌ ํ”„๋กœ์„ธ์„œ)

Parallel Programming(๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ)

  • ๋ณ‘๋ ฌ ์†Œํ”„ํŠธ์›จ์–ด(parallel software)๊ฐ€ ๋ฌธ์ œ์˜ ํ•ต์‹ฌ
  • ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์–ป์–ด์•ผํ•จ
    • ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด, ๋” ๋น ๋ฅธ ๋‹จ์ผํ”„๋กœ์„ธ์„œ(uniprocessor)๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ๋” ์‰ฌ์›€!
  • ์–ด๋ ค์šด ๊ฒƒ๋“ค
    • ๋ถ„ํ• (Partitioning)
    • ์กฐ์ •(Coordination)
    • ํ†ต์‹  ์˜ค๋ฒ„ํ—ค๋“œ(Communications overhead)

Scaling Example

Workload(์ž‘์—…): 10๊ฐœ์˜ scalars sum๊ณผ 10 ร— 10 matrix sum
- ํ”„๋กœ์„ธ์„œ ์ˆ˜๋ฅผ 10๊ฐœ์—์„œ 100๊ฐœ๋กœ ๋Š˜๋ ธ์„ ๋•Œ์˜ ์†๋„ ํ–ฅ์ƒ

  • ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ: ์‹œ๊ฐ„ = (10 + 100) ร— tadd
  • 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ:
    • Time = 10 ร— tadd + 100/10 ร— tadd = 20 ร— tadd
    • ์†๋„ํ–ฅ์ƒ(speedup) = 110/20 = 5.5 (์ž ์žฌ์  ์ตœ๋Œ€ ์„ฑ๋Šฅ์˜ 55%)
  • 100๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ:
    - Time = 10 ร— tadd + 100/100 ร— tadd = 11 ร— tadd
    - ์†๋„ํ–ฅ์ƒ(speedup) = 110/11 = 10 (์ž ์žฌ์  ์ตœ๋Œ€ ์„ฑ๋Šฅ์˜ 10%)
    โ€ป ์ž‘์—…(load)์ด ํ”„๋กœ์„ธ์„œ ๊ฐ„์— ๊ท ํ˜• ์žˆ๊ฒŒ ๋ถ„๋ฐฐ๋œ๋‹ค๊ณ  ๊ฐ€์ •

Q. ํ–‰๋ ฌ ํฌ๊ธฐ๊ฐ€ 100 ร— 100์ธ ๊ฒฝ์šฐ๋Š” ์–ด๋–ป๊ฒŒ ๋ ๊นŒ?

  • ๋‹จ์ผ ํ”„๋กœ์„ธ์„œ: ์‹œ๊ฐ„ = (10 + 10000) ร— tadd
  • 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ:
    • ์‹œ๊ฐ„ = 10 ร— tadd + 10000/10 ร— tadd = 1010 ร— tadd
    • ์†๋„ํ–ฅ์ƒ(speedup) = 10010/1010 = 9.9 (์ž ์žฌ์  ์ตœ๋Œ€ ์„ฑ๋Šฅ์˜ 99%)
  • 100๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ:
    - ์‹œ๊ฐ„ = 10 ร— tadd + 10000/100 ร— tadd = 110 ร— tadd
    - ์†๋„ํ–ฅ์ƒ(speedup) = 10010/110 = 91 (์ž ์žฌ์  ์ตœ๋Œ€ ์„ฑ๋Šฅ์˜ 91%)
    โ€ป ์ž‘์—…์ด ๊ท ํ˜• ์žˆ๊ฒŒ ๋ถ„๋ฐฐ๋œ๋‹ค๊ณ  ๊ฐ€์ •

Strong vs Weak Scaling(๊ฐ•ํ•œ ์Šค์ผ€์ผ๋ง vs ์•ฝํ•œ ์Šค์ผ€์ผ๋ง)

๐Ÿšจ๋ญ๊ฐ€ strong, weak์ธ์ง€ ๊ตฌ๋ถ„ํ•˜๊ธฐ

  • Strong scaling: ๋ฌธ์ œ ํฌ๊ธฐ ๊ณ ์ •๋จ
    • ์œ„์˜ ์˜ˆ์‹œ์™€ ๊ฐ™์Œ
  • Weak scaling: ๋ฌธ์ œ ํฌ๊ธฐ๊ฐ€ ํ”„๋กœ์„ธ์„œ ์ˆ˜์— ๋น„๋ก€(proportional)
    • 10๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ, 10 ร— 10 ํ–‰๋ ฌ
      • ์‹œ๊ฐ„ = 20 ร— tadd
    • 100๊ฐœ์˜ ํ”„๋กœ์„ธ์„œ, 32 ร— 32 ํ–‰๋ ฌ
      • ์‹œ๊ฐ„ = 10 ร— tadd + 1000/100 ร— tadd = 20 ร— tadd
        โ€ป ์ด ์˜ˆ์—์„œ๋Š” ์„ฑ๋Šฅ์ด ์ผ์ •ํ•˜๊ฒŒ ์œ ์ง€๋จ

Instruction and Data Streams

๐Ÿšจ๊ทธ๋ฆผ๊ณผ ํ‘œ๋ฅผ ๊ธฐ์–ต!

  • ๋‹ค๋ฅธ ๋ถ„๋ฅ˜ ๋ฐฉ์‹
  • SPMD(Single Program Multiple Data): ๋‹จ์ผ ํ”„๋กœ๊ทธ๋žจ ๋‹ค์ค‘ ๋ฐ์ดํ„ฐ
    • MIMD ์ปดํ“จํ„ฐ์—์„œ ์‹คํ–‰๋˜๋Š” ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ
    • ์„œ๋กœ ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์„œ๋ฅผ ์œ„ํ•œ ์กฐ๊ฑด๋ถ€ ์ฝ”๋“œ(conditional code) ํฌํ•จ

Vector vs Scalar

  • ๋ฒกํ„ฐ ์•„ํ‚คํ…์ฒ˜(vector architectures)์™€ ์ปดํŒŒ์ผ๋Ÿฌ(compilers)
    • ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์„ ๋‹จ์ˆœํ™”
    • ๋ฃจํ”„ ์ข…์†์„ฑ(loop-carried dependences)์˜ ๋ถ€์žฌ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ํ‘œํ˜„(explicit statement)
      • ํ•˜๋“œ์›จ์–ด์—์„œ์˜ checking ๊ฐ์†Œ
    • ๊ทœ์น™์ ์ธ ์ ‘๊ทผ ํŒจํ„ด์ด interleaved์™€ burst memory์˜ ์ด์ ์„ ํ™œ์šฉ
    • loop๋ฅผ ํ”ผํ•จ์œผ๋กœ์จ control hazard๋ฅผ ํšŒํ”ผ
  • ํŠน์ˆ˜ ๋ชฉ์ ์˜ ๋ฏธ๋””์–ด ํ™•์žฅ(ad-hoc media extension, such as MMX, SSE)๋ณด๋‹ค ๋” ์ผ๋ฐ˜์ ์ž„
    • ์ปดํŒŒ์ผ๋Ÿฌ ๊ธฐ์ˆ ๊ณผ ๋” ์ž˜ ๋งž์Œ

SIMD(Single Instruction, Multiple Data)

-> ๊ฑฐ์˜ ๋‹จ์ ์ด ์—†์Œ!!

  • ๋ฐ์ดํ„ฐ ๋ฒกํ„ฐ์— ๋Œ€ํ•ด ์›์†Œ๋ณ„ ์—ฐ์‚ฐ(operate elementwise) ์ˆ˜ํ–‰
    • ex. x86์˜ MMX ๋ฐ SSE ๋ช…๋ น์–ด
      • 128-bit wide register์— ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ์š”์†Œ ์ €์žฅ
  • ๋ชจ๋“  ํ”„๋กœ์„ธ์„œ๊ฐ€ ๋™์‹œ์— ๋™์ผํ•œ ๋ช…๋ น์„ ์ˆ˜ํ–‰
    • ๊ฐ ํ”„๋กœ์„ธ์„œ๋Š” ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์ฃผ์†Œ ๋“ฑ์„ ์‚ฌ์šฉ
  • ๋™๊ธฐํ™”(synchronization)๋ฅผ ๋‹จ์ˆœํ™”
  • ๋ช…๋ น์–ด ์ œ์–ด ํ•˜๋“œ์›จ์–ด(instruction control hardware)๊ฐ€ ์ ์Œ
  • ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ์ด ๋†’์€ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜(highly data-parallel applications)์—์„œ ์ตœ์ ์˜ ์„ฑ๋Šฅ ๋ฐœํœ˜

Multithreading(๋ฉ€ํ‹ฐ์Šค๋ ˆ๋”ฉ)

  • ๋ณ‘๋ ฌ(parallel)๋กœ ์—ฌ๋Ÿฌ thread๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๋ฐฉ์‹
    • ๋ ˆ์ง€์Šคํ„ฐ, PC ๋“ฑ์„ ๋ณต์ œ(replicate)
    • thread ๊ฐ„ ๋น ๋ฅธ ์ „ํ™˜(switching)
  • ์„ธ๋ฐ€ํ•œ ๋ฉ€ํ‹ฐ์Šค๋ ˆ๋”ฉ(Fine-grain multithreading)
    • ๊ฐ ์‚ฌ์ดํด๋งˆ๋‹ค ์Šค๋ ˆ๋“œ๋ฅผ ์ „ํ™˜(switch)
    • ๋ช…๋ น์–ด ์‹คํ–‰์„ ๊ต์ฐจ๋กœ ๋ฐฐ์น˜(interleave)
    • ํ•œ ์Šค๋ ˆ๋“œ๊ฐ€ stall(์ง€์—ฐ)๋˜๋ฉด, ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๋ฅผ ์‹คํ–‰
  • ๊ฑฐ์นœ ๋ฉ€ํ‹ฐ์Šค๋ ˆ๋”ฉ(Coarse-grain multithreading)
    • ๊ธด ์ง€์—ฐ(ex. L2-cache miss)์—์„œ๋งŒ ์Šค๋ ˆ๋“œ๋ฅผ ์ „ํ™˜(switch)
    • ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋‹จ์ˆœํ•˜์ง€๋งŒ, ์งง์€ stall(ex. data hazards)์—์„œ๋Š” ์ˆจ๊ธธ ์ˆ˜ ์—†์Œ

Shared Memory

๐Ÿšจinterconnection memory๋ž‘ ์—ฐ๊ฒฐํ•ด์„œ ๋ณด๊ธฐ

  • SMP: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฉ€ํ‹ฐํ”„๋กœ์„ธ์„œ(shared memory multiprocessor)
    • ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ชจ๋“  ํ”„๋กœ์„ธ์„œ์— ๋Œ€ํ•ด ๋‹จ์ผ ๋ฌผ๋ฆฌ์  ์ฃผ์†Œ ๊ณต๊ฐ„์„ ์ œ๊ณต
    • ๊ณต์œ  ๋ณ€์ˆ˜๋ฅผ locks(์ž ๊ธˆ)์„ ์‚ฌ์šฉํ•˜์—ฌ ๋™๊ธฐํ™”(synchronize)
    • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹œ๊ฐ„
      • UMA (์ผ๊ด€๋œ=uniform) vs. NUMA (๋น„์ผ๊ด€๋œ=nonuniform)
profile
๐Ÿ‘€

0๊ฐœ์˜ ๋Œ“๊ธ€