18. Advanced Caches
Created: December 10, 2021, 8:44 PM
Repercussions (adverse effects) on pipelined processor design
- Miss on I-Cache: Insert bubbles until contents supplied
- Miss on D-Cache: Insert bubbles into WB; stall IF, ID/RR, EXEC
Cache read/write algorithms
Read
Read Access to Cache from CPU
- CPU sends the address to the cache. The cache looks up the indexed entry and, on a hit, sends the data to the CPU. If the cache reports a miss, the CPU sends the request to main memory. All of this happens in the same cycle (IF or MEM in the pipeline)
- Upon sending the address to memory, the CPU sends NOPs down the subsequent stages until the data is read. When the data arrives, it goes to both the CPU and the cache.
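A minimal sketch of this read path in Python, assuming a tiny direct-mapped cache; the names (BLOCK_SIZE, read, memory) are illustrative, not from the notes:

```python
# Minimal sketch of the read path described above (illustrative names).
# A direct-mapped cache: each index holds one (valid, tag, data) entry.

BLOCK_SIZE = 16          # bytes per block (assumption for this sketch)
NUM_BLOCKS = 4           # tiny cache for illustration

cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_BLOCKS)]
memory = {}              # stands in for main memory, keyed by block address

def read(addr):
    block_addr = addr // BLOCK_SIZE
    index = block_addr % NUM_BLOCKS
    tag = block_addr // NUM_BLOCKS
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        return line["data"]                      # hit: data goes to CPU
    # miss: request block from main memory (CPU inserts NOPs meanwhile)
    data = memory.get(block_addr, 0)
    line.update(valid=True, tag=tag, data=data)  # data goes to CPU and the cache
    return data
```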
Write
Write Access to Cache from CPU
- Two policy choices
- Write through policy
- Write allocate
- No-write allocate
- Write back policy
1. Write Through Policy
- Each write goes to cache. Tag is set and valid bit is set
- This is write allocate
- On a write miss, a no-write-allocate policy is also possible
- Each write also goes to write buffer
- Write buffer writes data into main memory
2. Write back policy
- CPU writes data to cache setting dirty bit
- We write to the cache
- We don't bother to update main memory
- Cache and memory are now inconsistent
Comparison of the Write Policies
- Write Through
- Cache logic is simple and fast
- Can generate high traffic on the bus and memory when there are other accesses contending for them
- Write back
- Requires a dirty bit and extra logic
- Uses memory accesses more efficiently
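A minimal sketch contrasting the two write policies on a single cache line; the Line class and write_buffer are illustrative, not a full cache model:

```python
# Sketch contrasting write-through (with a write buffer) and write-back.

class Line:
    def __init__(self):
        self.valid = False
        self.tag = None
        self.data = None
        self.dirty = False   # used only by write-back

write_buffer = []            # write-through: buffer drains to main memory

def write_through(line, tag, data):
    line.valid, line.tag, line.data = True, tag, data  # write allocate
    write_buffer.append((tag, data))                   # memory updated via buffer

def write_back(line, tag, data):
    line.valid, line.tag, line.data = True, tag, data
    line.dirty = True        # memory now stale; written back on eviction
```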
- Improving miss rate
- Reducing miss penalty
These two are the key levers for improving cache performance.
- So far, our cache designs have operated on data items of the size handled by the instruction set (32-bit words)
- However, the unit of transfer moved by the memory subsystem need not be the same size
- In general, memory can transfer larger units, in multiples of the word size
- For example, with a 16-byte block as above
- In the earlier example:
- 4 GB memory: 32-bit address
- 256 KB cache
- Total blocks = 256 KB / 16 B = 16K blocks
- Need 14 bits to index a block
- How many bits for the block offset?
- 16 bytes (4 words), so 4 bits for the byte offset (2 of which select the word); see the sketch below
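A worked version of this breakdown; the printed values follow directly from the parameters above:

```python
# Address breakdown for the example: 32-bit addresses,
# 256 KB direct-mapped cache, 16-byte blocks.
import math

ADDR_BITS   = 32
CACHE_BYTES = 256 * 1024
BLOCK_BYTES = 16

num_blocks  = CACHE_BYTES // BLOCK_BYTES            # 16K blocks
index_bits  = int(math.log2(num_blocks))            # 14
offset_bits = int(math.log2(BLOCK_BYTES))           # 4 (2 of them select the word)
tag_bits    = ADDR_BITS - index_bits - offset_bits  # 14

print(num_blocks, index_bits, offset_bits, tag_bits)  # 16384 14 4 14
```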
- Increasing the block size helps reduce the miss rate, but growing it beyond a certain point makes the miss rate larger instead.
Direct-Mapped Caches
- Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time
- Can lead to a 0% hit rate if more than one block, accessed in an interleaved manner, maps to the same index
- Assume addresses A and B have the same index bits but different tag bits
- A, B, A, B, A, B, A, B, … -> conflict in the cache index
- All accesses are conflict misses (demonstrated in the sketch below)
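A small demonstration of this pattern, assuming an 8-block direct-mapped cache with 16-byte blocks (parameters are illustrative):

```python
# A and B share index bits but differ in tag, so in a direct-mapped
# cache every access in the alternating pattern misses.

NUM_BLOCKS, BLOCK_SIZE = 8, 16
cache = [None] * NUM_BLOCKS           # each entry holds a tag (or None)

def access(addr):
    block = addr // BLOCK_SIZE
    index, tag = block % NUM_BLOCKS, block // NUM_BLOCKS
    hit = cache[index] == tag
    cache[index] = tag                # fill/replace on miss
    return hit

A, B = 0x000, 0x080                   # same index (0), different tags
hits = [access(x) for x in [A, B] * 4]
print(hits.count(True))               # 0  -> all conflict misses
```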
Classification of Cache Misses
- Compulsory miss
- first reference to an address (block) always results in a miss
- Capacity miss
- cache is too small to hold everything needed
- Conflict miss
- defined as any miss that is neither a compulsory nor a capacity miss
- The cache is not full, but the indexing function sends us to a line that is full (see the sketch below)
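A sketch of this classification using the usual shadow-model approach: a miss is compulsory if the block was never referenced before, capacity if a fully-associative LRU cache of the same total size would also have missed, and conflict otherwise. The helper names are illustrative:

```python
from collections import OrderedDict

CAPACITY = 4                        # total blocks; tiny for illustration
seen = set()                        # every block referenced so far
fa_lru = OrderedDict()              # fully-associative LRU shadow cache

def classify_miss(block):
    """Call when the real cache misses on `block` (before note_reference)."""
    if block not in seen:
        return "compulsory"
    if block not in fa_lru:
        return "capacity"
    return "conflict"               # ideal cache holds it, the real one doesn't

def note_reference(block):
    """Update shadow state on *every* access, hit or miss."""
    seen.add(block)
    fa_lru[block] = True
    fa_lru.move_to_end(block)
    if len(fa_lru) > CAPACITY:
        fa_lru.popitem(last=False)  # evict least recently used
```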
How to Reduce Each Miss Type
- Compulsory
- Caching cannot help
- Prefetching can: Anticipate which blocks will be needed soon
- Capacity
- Utilize cache space better: keep blocks that will be referenced
- Conflict
- More associativity
- Better, randomized indexing
Cache Placement
- Toy example: 256-byte memory, 64-byte cache, 8-byte blocks
Set Associativity
- Addresses 0 and 8 always conflict in direct mapped cache
- Instead of having one column of 8, have 2 columns of 4 blocks
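A sketch of this 2-way arrangement, showing that block addresses 0 and 8 can now coexist (list-based LRU per set; names are illustrative):

```python
# 2-way version of the toy cache: 8 total blocks as 4 sets x 2 ways.

NUM_SETS, WAYS = 4, 2
sets = [[] for _ in range(NUM_SETS)]   # each set holds up to WAYS tags

def access(block_addr):
    s = block_addr % NUM_SETS
    tag = block_addr // NUM_SETS
    if tag in sets[s]:
        sets[s].remove(tag)            # hit: refresh LRU position
        sets[s].append(tag)
        return True
    if len(sets[s]) == WAYS:
        sets[s].pop(0)                 # evict LRU (front of list)
    sets[s].append(tag)
    return False

print([access(b) for b in [0, 8, 0, 8]])  # [False, False, True, True]
```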
Higher Associativity
- Pros: Likelihood of conflict misses is even lower
- Cons: More tag comparators and a wider data mux; larger tags
Full Associativity
- A block can be placed in any cache location
Associativity (and Tradeoffs)
- Higher associativity
- Pros: Higher hit rate
- Cons: Slower cache access time, more expensive hardware
Fully associative cache
- As before the cache is broken up into blocks
- But now a memory reference may appear in any block
Set associative cache
- Suppose a machine with 16-bit addresses and 64 KB of memory; further assume that cache blocks are 16 bytes long and that 128 bytes are available for cache data (see the sketch below)
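A worked bit breakdown for this machine, assuming a 2-way set-associative organization (the associativity is an assumption, not fixed by the notes):

```python
ADDR_BITS, CACHE_BYTES, BLOCK_BYTES, WAYS = 16, 128, 16, 2

num_blocks  = CACHE_BYTES // BLOCK_BYTES            # 8 blocks
num_sets    = num_blocks // WAYS                    # 4 sets
offset_bits = BLOCK_BYTES.bit_length() - 1          # 4
index_bits  = num_sets.bit_length() - 1             # 2
tag_bits    = ADDR_BITS - index_bits - offset_bits  # 10

print(num_sets, offset_bits, index_bits, tag_bits)  # 4 4 2 10
```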
Issues in Set-Associative Caches
- When blocks in a set have priorities, how are those priorities determined?
- Three key decisions in a set
- Insertion, promotion, eviction (replacement)
- Insertion: What happens to priorities on a cache fill?
- Where to insert the incoming block, whether or not to insert the block
- Promotion: What happens to priorities on a cache hit?
- Whether and how to change block priority
- Eviction/replacement: What happens to priorities on a cache miss?
- Which block to evict (replace) and how to adjust priorities
- Any invalid block first
- If all are valid, consult the replacement policy
Implementing LRU
- Idea: Evict the least recently accessed block
- Problem: Need to keep track of access ordering of blocks
- LRU is complex ⇒ modern processors do not implement true LRU
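A minimal sketch of true LRU for one set, tracking full access order; this ordering is exactly the bookkeeping that makes hardware LRU expensive:

```python
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.order = OrderedDict()          # tags in access order, oldest first

    def access(self, tag):
        if tag in self.order:
            self.order.move_to_end(tag)     # promote on hit
            return True
        if len(self.order) == self.ways:
            self.order.popitem(last=False)  # evict least recently used
        self.order[tag] = True
        return False
```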
Cache Replacement Policy: LRU or Random
- LRU vs. Random
- Example: 4-way cache, cyclic references to A, B, C, D, E
- 0% hit rate with LRU policy
- Set thrashing: When the “program working set” in a set is larger than set associativity
- Random replacement policy is better when thrashing occurs
- In practice, the average hit rates of LRU and random replacement are similar ⇒ a hybrid of the two is best
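A small experiment illustrating set thrashing, assuming a 4-way set and the cyclic reference pattern above (the run helper is illustrative):

```python
# Cyclic A,B,C,D,E over a 4-way LRU set always evicts the block needed
# next, giving a 0% hit rate; random replacement gets some hits by luck.
import random

def run(policy, refs, ways=4):
    blocks, hits = [], 0                 # front of list = LRU position
    for r in refs:
        if r in blocks:
            hits += 1
            if policy == "lru":
                blocks.remove(r); blocks.append(r)   # promote on hit
        else:
            if len(blocks) == ways:
                idx = 0 if policy == "lru" else random.randrange(ways)
                blocks.pop(idx)
            blocks.append(r)
    return hits / len(refs)

refs = list("ABCDE") * 100
print(run("lru", refs))     # 0.0 under LRU
print(run("random", refs))  # > 0 on average under random
```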
What's In A Tag Store Entry?
- Valid bit
- Tag
- Replacement policy bit
- Dirty bit
- Write back vs. write through caches
Handling Writes
- What if the processor writes to an entire block over a small amount of time?
- Is there any need to bring the block into the cache from memory in the first place?
- Why not simply write to only a portion of the block, i.e., a subblock?
- E.g., 4 bytes out of 64 bytes
- Problem: Valid and dirty bits are associated with the entire 64 bytes, not with each individual 4 bytes
Subblocked (Sectored) Caches
- Idea: Divide a block into subblocks (or sectors)
- Have separate valid and dirty bits for each subblock (sector)
- Allocate only a subblock (or a subset of subblocks) on a request
- Advantages
- No need to transfer the entire cache block into the cache
- More freedom in transferring subblocks into the cache (a cache block does not need to be in the cache fully)
- Disadvantages
- More complex design
- May not exploit spatial locality fully
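A sketch of a subblocked line, assuming a 64-byte block split into four 16-byte sectors (sizes are illustrative):

```python
# One tag for the whole block, separate valid/dirty bits per subblock.

SUBBLOCKS = 4                       # 64-byte block as 4 x 16-byte sectors

class SubblockedLine:
    def __init__(self):
        self.tag = None
        self.valid = [False] * SUBBLOCKS
        self.dirty = [False] * SUBBLOCKS

    def write(self, tag, subblock):
        if self.tag != tag:         # new block: reset per-sector state
            self.tag = tag
            self.valid = [False] * SUBBLOCKS
            self.dirty = [False] * SUBBLOCKS
        self.valid[subblock] = True # only this sector is allocated
        self.dirty[subblock] = True # no need to fetch the rest from memory
```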
Cache Parameters vs. Miss/Hit Rate
Cache size
- total data (not including tag) capacity
- bigger can exploit temporal locality better
- Bigger is not always better
- Too large a cache adversely affects hit and miss latency
- Too small a cache doesn’t exploit temporal locality well, useful data replaced often
Block size
- Block size is the data that is associated with an address tag
- not necessarily the unit of transfer between hierarchy levels (because of sub-blocking)
- Too small blocks
- don’t exploit spatial locality well
- have larger tag overhead
- Too large blocks
- fewer total blocks ⇒ less temporal locality exploitation
- waste of cache space and bandwidth/energy if spatial locality is not high
Large Blocks: Critical-Word and Subblocking
- Large cache blocks can take a long time to fill into the cache
- Idea: Fill cache block critical-word first
- Large cache blocks can waste bus bandwidth
Associativity
- How many blocks can be present in the same index (i.e., set)?
- Larger associativity
- lower miss rate (reduced conflicts)
- higher hit latency and area cost (plus diminishing returns)
- Smaller associativity
- lower cost
- lower hit latency
Prefetching
- Idea: Fetch the data before it is needed (i.e. pre-fetch) by the program
- Why?
- Memory latency is high. If we can prefetch accurately and early enough we can reduce/eliminate that latency.
- Can eliminate compulsory cache misses
- Involves predicting which address will be needed in the future
- Works if programs have predictable miss address patterns
- Correctness
- Does a misprediction in prefetching affect correctness?
⇒ No, prefetched data at a “mispredicted” address is simply not used
⇒ There is no need for state recovery (in contrast to branch misprediction)
- In modern systems, prefetching is usually done at cache-block granularity
- Prefetching is a technique that can reduce both miss rate and miss penalty
- Prefetching can be done by hardware, the compiler, the programmer, or the system
- The Four Questions
- What
- What addresses to prefetch (i.e., address prediction algorithm)
- When
- When to initiate a prefetch request (early, late, on time)
- Where
- Where to place the prefetched data (caches, separate buffer)
- Where to place the prefetcher (which level in memory hierarchy)
- How
- How does the prefetcher operate and who operates it (software, hardware,..)
Challenges in Prefetching: What
- What addresses to prefetch
- Prefetching useless data wastes resources (memory bandwidth, buffer space, etc.)
- Accurate prediction of addresses to prefetch is important
- How do we know what to prefetch?
- Predict based on past access patterns
- Use the compiler’s knowledge of data structures
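As one concrete example of predicting from past access patterns, a minimal stride prefetcher sketch (illustrative; real prefetchers typically track streams per PC or memory region):

```python
# If consecutive miss addresses differ by a constant stride,
# predict the next address and prefetch it.

class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def on_miss(self, addr):
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride and stride != 0:
                prefetch = addr + stride   # stride repeated: issue prefetch
            self.last_stride = stride
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
print([pf.on_miss(a) for a in [100, 116, 132, 148]])
# [None, None, 148, 164] -> prefetches issued once the stride repeats
```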
- Accuracy
- Coverage
- Timeliness
- Bandwidth consumption
- Cache pollution