18. Advanced Caches
Created: December 10, 2021, 8:44 PM
Repercussions on pipelined processor design
![](https://media.vlpt.us/images/lsj8706/post/446ab486-6952-4568-98f9-2378b91b1eb3/Untitled.png)
- Miss on I-Cache: Insert bubbles until contents supplied
- Miss on D-Cache: Insert bubbles into WB; stall IF, ID/RR, EXEC
Cache read/write algorithms
Read
![](https://media.vlpt.us/images/lsj8706/post/59a700fe-0fb9-4e81-96f4-7797679901e1/Untitled%201.png)
![](https://media.vlpt.us/images/lsj8706/post/4ef97ea2-8f54-4b79-a71d-a75cae345a7c/Untitled%202.png)
Read Access to Cache from CPU
- The CPU sends the index to the cache. The cache looks it up and, on a hit, sends the data to the CPU; on a miss, the CPU sends a request to main memory. All of this happens in the same cycle (IF or MEM in the pipeline)
- Upon sending the address to memory, the CPU sends NOPs down to the subsequent stages until the data is read. When the data arrives, it goes to both the CPU and the cache.
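A minimal sketch of this read flow, assuming a direct-mapped cache; the class and names (`DirectMappedCache`, `read`) are illustrative, not from the lecture:

```python
# Minimal sketch of the read flow above: a direct-mapped cache that
# returns data on a hit and fills from main memory on a miss.
class DirectMappedCache:
    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        # Each line holds (valid, tag, data); all lines start invalid.
        self.lines = [(False, None, None)] * num_blocks

    def read(self, addr, memory):
        block_addr = addr // self.block_size
        index = block_addr % self.num_blocks
        tag = block_addr // self.num_blocks
        valid, line_tag, data = self.lines[index]
        if valid and line_tag == tag:
            return data, "hit"   # hit: data goes straight to the CPU
        # Miss: the CPU requests the block from main memory (bubbles/NOPs
        # meanwhile); the arriving block goes to the CPU and the cache.
        data = memory[block_addr]
        self.lines[index] = (True, tag, data)
        return data, "miss"
```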
Write
![](https://media.vlpt.us/images/lsj8706/post/eedc972c-5e2d-4f89-952f-1363054f5504/Untitled%203.png)
![](https://media.vlpt.us/images/lsj8706/post/b84ec658-2226-4e1e-a2aa-fee28dfa61af/Untitled%204.png)
Write Access to Cache from CPU
- Two policy options
- Write through policy
- Write allocate
- No-write allocate
- Write back policy
1. Write Through Policy
- Each write goes to cache. Tag is set and valid bit is set
- This is write allocate
- On a write miss, no-write allocate is also possible
- Each write also goes to write buffer
- Write buffer writes data into main memory
![](https://media.vlpt.us/images/lsj8706/post/83a02e84-d8da-41b7-9011-246c892a2333/Untitled%205.png)
2. Write back policy
- CPU writes data to the cache, setting the dirty bit
- We write to the cache
- We don't bother to update main memory
- Cache and memory are now inconsistent
![](https://media.vlpt.us/images/lsj8706/post/a5546f81-e0e6-473c-936a-ce9d08935dec/Untitled%206.png)
Comparison of the Write Policies
- Write Through
- Cache logic is simple and fast
- Can generate heavy traffic on the bus and memory when there are other accesses to them
- Write back
- Requires a dirty bit and extra logic
- Makes memory accesses more efficient
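The contrast can be summarized in a small sketch for the write-hit case (illustrative names; the write buffer is modeled as a plain list):

```python
# Illustrative contrast of the two write policies on a write hit
# (the block is assumed to be resident in the cache already).
def write_through(cache_line, addr, value, write_buffer):
    cache_line["data"] = value
    # Every write is also queued for main memory via the write buffer.
    write_buffer.append((addr, value))

def write_back(cache_line, addr, value):
    cache_line["data"] = value
    # Memory is left stale; the dirty bit records the inconsistency.
    # The block reaches memory only when it is evicted.
    cache_line["dirty"] = True
```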
- Improving miss rate
- Reducing miss penalty
These two are the key levers for improving cache performance.
- Cache designs so far operate on data items of the size handled by the instruction set (32-bit words)
- However, the unit of transfer moved by the memory subsystem does not have to be the same size
- Typically, memory can be transferred in larger units, in multiples of the word size
![](https://media.vlpt.us/images/lsj8706/post/34ec0208-7b4d-49aa-8bc5-71207917322c/Untitled%207.png)
- For example, with a 16-byte block as above
- In the earlier example:
- 4 GB memory: 32-bit address
- 256 KB cache
- Total blocks = 256 KB / 16 B = 16K blocks
- Need 14 bits to index a block
- How many bits for the block offset?
- 16 bytes (4 words), so 4 bits (2 of which select the word)
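As a worked check of these numbers (assuming byte addressing):

```python
import math

# Worked check of the address breakdown above.
cache_bytes = 256 * 1024                 # 256 KB cache
block_bytes = 16                         # 16-byte blocks (4 words)
address_bits = 32                        # 4 GB memory

num_blocks = cache_bytes // block_bytes              # 16K blocks
offset_bits = int(math.log2(block_bytes))            # 4 (2 select the word)
index_bits = int(math.log2(num_blocks))              # 14
tag_bits = address_bits - index_bits - offset_bits   # 14

print(num_blocks, offset_bits, index_bits, tag_bits)  # 16384 4 14 14
```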
![](https://media.vlpt.us/images/lsj8706/post/85ace5ed-85ee-4086-b92f-9ab838a3a4fb/Untitled%208.png)
![](https://media.vlpt.us/images/lsj8706/post/b09060c0-e608-4e94-a5ef-fa5072a75ecd/Untitled%209.png)
![](https://media.vlpt.us/images/lsj8706/post/c28777c7-3b2a-4ce5-8183-d56e4a2b6840/Untitled%2010.png)
- Increasing the block size helps reduce the miss rate, but growing it beyond a certain point actually increases the miss rate
Direct-Mapped Caches
- Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time
- Can lead to a 0% hit rate if more than one block, accessed in an interleaved manner, maps to the same index
- Assume addresses A and B have the same index bits but different tag bits
- A, B, A, B, A, B, A, B, … -> conflict in the cache index
- All accesses are conflict misses
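A toy demo of this pattern, reusing the `DirectMappedCache` sketch from above (addresses chosen so A and B share index 0 but differ in tag):

```python
# Toy demo: A and B map to the same index with different tags, so each
# access evicts the other and every access is a conflict miss.
NUM_BLOCKS, BLOCK_SIZE = 8, 8
A = 0                                  # block 0 -> index 0, tag 0
B = NUM_BLOCKS * BLOCK_SIZE            # block 8 -> index 0, tag 1

cache = DirectMappedCache(NUM_BLOCKS, BLOCK_SIZE)
memory = {0: "A-data", 8: "B-data"}    # block address -> block contents

hits = 0
for addr in [A, B] * 4:                # A, B, A, B, ...
    _, outcome = cache.read(addr, memory)
    hits += (outcome == "hit")
print(hits)                            # 0 -- a 0% hit rate
```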
Set Associativity
- Addresses 0 and 8 always conflict in direct mapped cache
- Instead of having one column of 8, have 2 columns of 4 blocks
![](https://media.vlpt.us/images/lsj8706/post/7472fec6-db36-40d4-b6b9-89cccbaf921a/Untitled%2011.png)
Fully associative cache
- As before the cache is broken up into blocks
- But now a memory reference may appear in any block
![](https://media.vlpt.us/images/lsj8706/post/9720efd1-d1b9-440d-89b6-690c6c71f3d7/Untitled%2012.png)
Classification of Cache Misses
- Compulsory miss
- first reference to an address (block) always results in a miss
- Capacity miss
- cache is too small to hold everything needed
- Conflict miss
- defined as any miss that is neither a compulsory nor a capacity miss
- Cache is not full but algorithm sends us to a line that is full
How to Reduce Each Miss Type
- Compulsory
- Caching cannot help
- Prefetching can: Anticipate which blocks will be needed soon
- Capacity
- Utilize cache space better: keep blocks that will be referenced
- Conflict
- More associativity
- Better, randomized indexing
![](https://media.vlpt.us/images/lsj8706/post/70cdf3e4-6d7b-483f-8553-8a1047170844/Untitled%2013.png)
![](https://media.vlpt.us/images/lsj8706/post/9fc67d0c-e8e6-4643-9373-3b46065a6103/Untitled%2014.png)
Cache Placement
- Toy example: 256-byte memory, 64-byte cache, 8-byte blocks
![](https://media.vlpt.us/images/lsj8706/post/aa58c82a-da1b-47a3-be37-7f62b023b8e9/Untitled%2015.png)
Higher Associativity
- Pro: Likelihood of conflict misses is even lower
- Con: More tag comparators and a wider data mux; larger tags
![](https://media.vlpt.us/images/lsj8706/post/fa72d35e-08eb-4395-8985-4dc1bb72ce6f/Untitled%2016.png)
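A sketch of the lookup logic implied here: hardware compares all tags in the set in parallel, and a loop stands in for those comparators (names are illustrative):

```python
# Sketch of an N-way set lookup. Each line is a dict with "valid" and
# "tag" fields; the loop models the parallel tag comparators.
def lookup(set_lines, tag):
    for way, line in enumerate(set_lines):
        if line["valid"] and line["tag"] == tag:
            return way        # hit: the data mux selects this way
    return None               # miss in every way
```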
Full Associativity
- A block can be placed in any cache location
![](https://media.vlpt.us/images/lsj8706/post/f7aca303-bed6-4cad-8772-88e59f254faf/Untitled%2017.png)
Associativity (and Tradeoffs)
- Higher associativity
- Pro: Higher hit rate
- Con: Slower cache access time, more expensive hardware
Set associative cache
![](https://media.vlpt.us/images/lsj8706/post/4b1169c2-c446-444d-8b4a-0deb160b79ef/Untitled%2018.png)
- Suppose a machine with 16-bit addresses and 64 KB of memory; further assume that a cache block is 16 bytes long and that 128 bytes are available for cache data
![](https://media.vlpt.us/images/lsj8706/post/73d7c93a-980e-4fcb-b917-54ec23b6a030/Untitled%2019.png)
![](https://media.vlpt.us/images/lsj8706/post/37c33fdc-0079-4add-b870-73d692565022/Untitled%2020.png)
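A sketch of the resulting address breakdown; only the sizes come from the text, and the 2-way associativity here is an assumption for illustration:

```python
import math

# Address breakdown for this example (2-way associativity assumed).
address_bits, block_bytes, cache_bytes, ways = 16, 16, 128, 2

num_blocks = cache_bytes // block_bytes               # 8 blocks
num_sets = num_blocks // ways                         # 4 sets
offset_bits = int(math.log2(block_bytes))             # 4 bits
index_bits = int(math.log2(num_sets))                 # 2 bits
tag_bits = address_bits - index_bits - offset_bits    # 10 bits

print(num_sets, offset_bits, index_bits, tag_bits)    # 4 4 2 10
```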
Issues in Set-Associative Caches
- When blocks in a set carry priorities, how do we decide those priorities?
- Three key decisions in a set
- Insertion, promotion, eviction (replacement)
- Insertion: What happens to priorities on a cache fill?
- Where to insert the incoming block, whether or not to insert the block
- Promotion: What happens to priorities on a cache hit?
- Whether and how to change block priority
- Eviction/replacement: What happens to priorities on a cache miss?
- Which block to evict (replace) and how to adjust priorities
- Any invalid block first
- If all are valid, consult the replacement policy
Implementing LRU
- Idea: Evict the least recently accessed block
- Problem: Need to keep track of access ordering of blocks
- LRU is complex ⇒ modern processors do not implement true LRU
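A minimal LRU sketch for a single set, assuming recency is tracked with an ordered map (illustrative; real hardware only approximates this):

```python
from collections import OrderedDict

# Minimal LRU sketch for one set: tags kept in recency order, LRU first.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()          # tag -> data, LRU first

    def access(self, tag, fill_data=None):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # promotion on a hit
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[tag] = fill_data         # insertion on a fill
        return "miss"
```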
Cache Replacement Policy: LRU or Random
- LRU vs. Random
- Example: 4-way cache, cyclic references to A, B, C, D, E
- 0% hit rate with LRU policy
- Set thrashing: When the “program working set” in a set is larger than set associativity
- Random replacement policy is better when thrashing occurs
- In practice, the average hit rates of LRU and random replacement are similar ⇒ a hybrid of the two is best
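A quick demo of the thrashing case, reusing the `LRUSet` sketch above; the random-replacement side is a simple set with a seeded RNG:

```python
import random
random.seed(0)

refs = list("ABCDE") * 20                 # cyclic references A..E
lru = LRUSet(4)                           # 4-way set, LRU policy
lru_hits = sum(lru.access(t) == "hit" for t in refs)

resident, rand_hits = set(), 0            # 4-way set, random policy
for t in refs:
    if t in resident:
        rand_hits += 1
    else:
        if len(resident) >= 4:
            resident.remove(random.choice(sorted(resident)))
        resident.add(t)

print(lru_hits, rand_hits)                # LRU: 0; random: usually > 0
```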
What's In A Tag Store Entry?
- Valid bit
- Tag
- Replacement policy bit
- Dirty bit
- Write back vs. write through caches
Handling Writes
- What if the processor writes to an entire block over a small amount of time?
- Is there any need to bring the block into the cache from memory in the first place?
- Why not simply write to only a portion of the block, i.e., a subblock?
- E.g., 4 bytes out of 64 bytes
- Problem: Valid and dirty bits are associated with the entire 64 bytes, not with each individual 4 bytes
Subblocked (Sectored) Caches
- Idea: Divide a block into subblocks (or sectors)
- Have separate valid and dirty bits for each subblock (sector)
- Allocate only a subblock (or a subset of subblocks) on a request
- Pros
- No need to transfer the entire cache block into the cache
- More freedom in transferring subblocks into the cache (a cache block does not need to be in the cache fully)
- Cons
- More complex design
- May not exploit spatial locality fully
![](https://media.vlpt.us/images/lsj8706/post/cd782bf3-6290-439f-b7e4-dae3a19f1960/Untitled%2021.png)
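A sketch of a subblocked cache line with per-subblock valid and dirty bits (4 subblocks here; all names are illustrative, and write-back of a dirty victim is omitted):

```python
# Sketch of a subblocked line: one tag, but per-subblock valid/dirty
# bits, so a write allocates only the subblock it touches.
class SubblockedLine:
    SUBBLOCKS = 4                         # e.g. 4 x 16 B in a 64 B block

    def __init__(self):
        self.tag = None
        self.valid = [False] * self.SUBBLOCKS
        self.dirty = [False] * self.SUBBLOCKS

    def write(self, tag, subblock):
        if self.tag != tag:
            # New block: reset subblock state; no need to transfer the
            # entire block from memory first.
            self.tag = tag
            self.valid = [False] * self.SUBBLOCKS
            self.dirty = [False] * self.SUBBLOCKS
        self.valid[subblock] = True       # allocate only this subblock
        self.dirty[subblock] = True
```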
Cache Parameters vs. Miss/Hit Rate
Cache size
- total data (not including tag) capacity
- bigger can exploit temporal locality better
- Bigger is not always better
- Too large a cache adversely affects hit and miss latency
- Too small a cache doesn’t exploit temporal locality well, useful data replaced often
Block size
- Block size is the data that is associated with an address tag
- not necessarily the unit of transfer between hierarchy levels (because of sub-blocking)
- Too small blocks
- don’t exploit spatial locality well
- have larger tag overhead
- Too large blocks
- Fewer total blocks ⇒ less temporal locality exploitation
- waste of cache space and bandwidth/energy if spatial locality is not high
Large Blocks: Critical-Word and Subblocking
- Large cache blocks can take a long time to fill into the cache
- Idea: Fill cache block critical-word first
- Large cache blocks can waste bus bandwidth
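A sketch of the critical-word-first fill order (illustrative; real memory controllers do this in hardware):

```python
# Critical-word-first fill: return the requested word first, then wrap
# around the rest of the block.
def critical_word_first(block_words, requested):
    n = len(block_words)
    return [block_words[(requested + i) % n] for i in range(n)]

print(critical_word_first(["w0", "w1", "w2", "w3"], 2))
# ['w2', 'w3', 'w0', 'w1'] -- the CPU can restart as soon as w2 arrives
```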
Associativity
- How many blocks can be present in the same index (i.e., set)?
- Larger associativity
- lower miss rate (reduced conflicts)
- higher hit latency and area cost (plus diminishing returns)
- Smaller associativity
- lower cost
- lower hit latency
Prefetching
- Idea: Fetch the data before it is needed (i.e. pre-fetch) by the program
- Why?
- Memory latency is high. If we can prefetch accurately and early enough we can reduce/eliminate that latency.
- Can eliminate compulsory cache misses
- Involves predicting which address will be needed in the future
- Works if programs have predictable miss address patterns
- Correctness
- Does a misprediction in prefetching affect correctness?
- ⇒ No, prefetched data at a “mispredicted” address is simply not used
- ⇒ There is no need for state recovery (in contrast to branch misprediction)
- In modern systems, prefetching is usually done at cache-block granularity
- Prefetching is a technique that can reduce both the miss rate and the miss penalty
- Prefetching can be done by hardware, the compiler, the programmer, or the system
- The Four Questions
- What
- What addresses to prefetch (i.e., address prediction algorithm)
- When
- When to initiate a prefetch request (early, late, on time)
- Where
- Where to place the prefetched data (caches, separate buffer)
- Where to place the prefetcher (which level in memory hierarchy)
- How
- How does the prefetcher operate and who operates it (software, hardware,..)
Challenges in Prefetching: What
- What addresses to prefetch
- Prefetching useless data wastes resources (memory bandwidth, buffer space, etc.)
- Accurate prediction of addresses to prefetch is important
- How do we know what to prefetch?
- Predict based on past access patterns
- Use the compiler’s knowledge of data structures
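A minimal sketch of one such pattern-based predictor, a stride prefetcher (illustrative names; real prefetchers typically track strides per PC or per page):

```python
# Minimal stride prefetcher: if two consecutive accesses repeat the
# same nonzero stride, predict the next address.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def on_access(self, addr):
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride   # pattern confirmed
            self.last_stride = stride
        self.last_addr = addr
        return prediction                    # address to prefetch, or None

pf = StridePrefetcher()
print([pf.on_access(a) for a in (0, 64, 128, 192)])
# [None, None, 192, 256] -- prediction starts once the stride repeats
```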
- Accuracy
- Coverage
- Timeliness
- Bandwidth consumption
- Cache pollution