18. Advanced Caches
Created: December 10, 2021, 8:44 PM
Repercussions on pipelined processor design
![](https://media.vlpt.us/images/lsj8706/post/446ab486-6952-4568-98f9-2378b91b1eb3/Untitled.png)
- Miss on I-Cache: Insert bubbles until contents supplied
- Miss on D-Cache: Insert bubbles into WB; stall IF, ID/RR, EXEC
Cache read/write algorithms
Read
![](https://media.vlpt.us/images/lsj8706/post/59a700fe-0fb9-4e81-96f4-7797679901e1/Untitled%201.png)
![](https://media.vlpt.us/images/lsj8706/post/4ef97ea2-8f54-4b79-a71d-a75cae345a7c/Untitled%202.png)
Read Access to Cache from CPU
- The CPU sends the index to the cache. The cache looks it up and, on a hit, sends the data to the CPU; on a miss, the CPU sends a request to main memory. All of this happens in the same cycle (IF or MEM in the pipeline)
- Upon sending the address to memory, the CPU sends NOPs down to the subsequent stages until the data is read. When the data arrives, it goes to both the CPU and the cache.
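A minimal sketch of this read flow, assuming a direct-mapped cache; the class and names (`DirectMappedCache`, `read`) are illustrative, not from the lecture:

```python
# Minimal sketch of the read flow above: a direct-mapped cache that
# returns data on a hit and fills from main memory on a miss.
class DirectMappedCache:
    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        # Each line holds (valid, tag, data); all lines start invalid.
        self.lines = [(False, None, None)] * num_blocks

    def read(self, addr, memory):
        block_addr = addr // self.block_size
        index = block_addr % self.num_blocks
        tag = block_addr // self.num_blocks
        valid, line_tag, data = self.lines[index]
        if valid and line_tag == tag:
            return data, "hit"   # hit: data goes straight to the CPU
        # Miss: the CPU requests the block from main memory (bubbles/NOPs
        # meanwhile); the arriving block goes to the CPU and the cache.
        data = memory[block_addr]
        self.lines[index] = (True, tag, data)
        return data, "miss"
```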
Write
![](https://media.vlpt.us/images/lsj8706/post/eedc972c-5e2d-4f89-952f-1363054f5504/Untitled%203.png)
![](https://media.vlpt.us/images/lsj8706/post/b84ec658-2226-4e1e-a2aa-fee28dfa61af/Untitled%204.png)
Write Access to Cache from CPU
- Two policy options
- Write through policy
- Write allocate
- No-write allocate
- Write back policy
1. Write Through Policy
- Each write goes to cache. Tag is set and valid bit is set
- This is write allocate
- On a write miss, no-write allocate is also possible
- Each write also goes to write buffer
- Write buffer writes data into main memory
![](https://media.vlpt.us/images/lsj8706/post/83a02e84-d8da-41b7-9011-246c892a2333/Untitled%205.png)
2. Write back policy
- CPU writes data to the cache, setting the dirty bit
- We write to the cache
- We don't bother to update main memory
- Cache and memory are now inconsistent
![](https://media.vlpt.us/images/lsj8706/post/a5546f81-e0e6-473c-936a-ce9d08935dec/Untitled%206.png)
Comparison of the Write Policies
- Write Through
- Cache logic is simple and fast
- Can generate heavy traffic on the bus and memory when there are other accesses to them
- Write back
- Requires a dirty bit and extra logic
- Makes memory accesses more efficient
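The contrast can be summarized in a small sketch for the write-hit case (illustrative names; the write buffer is modeled as a plain list):

```python
# Illustrative contrast of the two write policies on a write hit
# (the block is assumed to be resident in the cache already).
def write_through(cache_line, addr, value, write_buffer):
    cache_line["data"] = value
    # Every write is also queued for main memory via the write buffer.
    write_buffer.append((addr, value))

def write_back(cache_line, addr, value):
    cache_line["data"] = value
    # Memory is left stale; the dirty bit records the inconsistency.
    # The block reaches memory only when it is evicted.
    cache_line["dirty"] = True
```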
- Improving miss rate
- Reducing miss penalty
These two are the key levers for improving cache performance.
- Cache designs so far operate on data items of the size handled by the instruction set (32-bit words)
- However, the unit of transfer moved by the memory subsystem does not have to be the same size
- Typically, memory can be transferred in larger units, in multiples of the word size
![](https://media.vlpt.us/images/lsj8706/post/34ec0208-7b4d-49aa-8bc5-71207917322c/Untitled%207.png)
- For example, with a 16-byte block as above
- In the earlier example:
- 4 GB memory: 32-bit address
- 256 KB cache
- Total blocks = 256 KB / 16 B = 16K blocks
- Need 14 bits to index a block
- How many bits for the block offset?
- 16 bytes (4 words), so 4 bits (2 of which select the word)
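As a worked check of these numbers (assuming byte addressing):

```python
import math

# Worked check of the address breakdown above.
cache_bytes = 256 * 1024                 # 256 KB cache
block_bytes = 16                         # 16-byte blocks (4 words)
address_bits = 32                        # 4 GB memory

num_blocks = cache_bytes // block_bytes              # 16K blocks
offset_bits = int(math.log2(block_bytes))            # 4 (2 select the word)
index_bits = int(math.log2(num_blocks))              # 14
tag_bits = address_bits - index_bits - offset_bits   # 14

print(num_blocks, offset_bits, index_bits, tag_bits)  # 16384 4 14 14
```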
![](https://media.vlpt.us/images/lsj8706/post/85ace5ed-85ee-4086-b92f-9ab838a3a4fb/Untitled%208.png)
![](https://media.vlpt.us/images/lsj8706/post/b09060c0-e608-4e94-a5ef-fa5072a75ecd/Untitled%209.png)
![](https://media.vlpt.us/images/lsj8706/post/c28777c7-3b2a-4ce5-8183-d56e4a2b6840/Untitled%2010.png)
- Increasing the block size helps reduce the miss rate, but growing it beyond a certain point actually increases the miss rate
Direct-Mapped Caches
- Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time
- Can lead to a 0% hit rate if more than one block, accessed in an interleaved manner, maps to the same index
- Assume addresses A and B have the same index bits but different tag bits
- A, B, A, B, A, B, A, B, … -> conflict in the cache index
- All accesses are conflict misses
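A toy demo of this pattern, reusing the `DirectMappedCache` sketch from above (addresses chosen so A and B share index 0 but differ in tag):

```python
# Toy demo: A and B map to the same index with different tags, so each
# access evicts the other and every access is a conflict miss.
NUM_BLOCKS, BLOCK_SIZE = 8, 8
A = 0                                  # block 0 -> index 0, tag 0
B = NUM_BLOCKS * BLOCK_SIZE            # block 8 -> index 0, tag 1

cache = DirectMappedCache(NUM_BLOCKS, BLOCK_SIZE)
memory = {0: "A-data", 8: "B-data"}    # block address -> block contents

hits = 0
for addr in [A, B] * 4:                # A, B, A, B, ...
    _, outcome = cache.read(addr, memory)
    hits += (outcome == "hit")
print(hits)                            # 0 -- a 0% hit rate
```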
Set Associativity
- Addresses 0 and 8 always conflict in direct mapped cache
- Instead of having one column of 8, have 2 columns of 4 blocks
![](https://media.vlpt.us/images/lsj8706/post/7472fec6-db36-40d4-b6b9-89cccbaf921a/Untitled%2011.png)
Fully associative cache
- As before the cache is broken up into blocks
- But now a memory reference may appear in any block
![](https://media.vlpt.us/images/lsj8706/post/9720efd1-d1b9-440d-89b6-690c6c71f3d7/Untitled%2012.png)
Classification of Cache Misses
- Compulsory miss
- first reference to an address (block) always results in a miss
- Capacity miss
- cache is too small to hold everything needed
- Conflict miss
- defined as any miss that is neither a compulsory nor a capacity miss
- Cache is not full but algorithm sends us to a line that is full
How to Reduce Each Miss Type
- Compulsory
- Caching cannot help
- Prefetching can: Anticipate which blocks will be needed soon
- Capacity
- Utilize cache space better: keep blocks that will be referenced
- Conflict
- More associativity
- Better, randomized indexing
![](https://media.vlpt.us/images/lsj8706/post/70cdf3e4-6d7b-483f-8553-8a1047170844/Untitled%2013.png)
![](https://media.vlpt.us/images/lsj8706/post/9fc67d0c-e8e6-4643-9373-3b46065a6103/Untitled%2014.png)
Cache Placement
- Toy example: 256-byte memory, 64-byte cache, 8-byte blocks
![](https://media.vlpt.us/images/lsj8706/post/aa58c82a-da1b-47a3-be37-7f62b023b8e9/Untitled%2015.png)
Higher Associativity
- Pro: Likelihood of conflict misses is even lower
- Con: More tag comparators and a wider data mux; larger tags
![](https://media.vlpt.us/images/lsj8706/post/fa72d35e-08eb-4395-8985-4dc1bb72ce6f/Untitled%2016.png)
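A sketch of the lookup logic implied here: hardware compares all tags in the set in parallel, and a loop stands in for those comparators (names are illustrative):

```python
# Sketch of an N-way set lookup. Each line is a dict with "valid" and
# "tag" fields; the loop models the parallel tag comparators.
def lookup(set_lines, tag):
    for way, line in enumerate(set_lines):
        if line["valid"] and line["tag"] == tag:
            return way        # hit: the data mux selects this way
    return None               # miss in every way
```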
Full Associativity
- A block can be placed in any cache location
![](https://media.vlpt.us/images/lsj8706/post/f7aca303-bed6-4cad-8772-88e59f254faf/Untitled%2017.png)
Associativity (and Tradeoffs)
- Higher associativity
- Pro: Higher hit rate
- Con: Slower cache access time, more expensive hardware
Set associative cache
![](https://media.vlpt.us/images/lsj8706/post/4b1169c2-c446-444d-8b4a-0deb160b79ef/Untitled%2018.png)
- Suppose a machine with 16-bit addresses and 64 KB of memory; further assume that a cache block is 16 bytes long and that 128 bytes are available for cache data
![](https://media.vlpt.us/images/lsj8706/post/73d7c93a-980e-4fcb-b917-54ec23b6a030/Untitled%2019.png)
![](https://media.vlpt.us/images/lsj8706/post/37c33fdc-0079-4add-b870-73d692565022/Untitled%2020.png)
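A sketch of the resulting address breakdown; only the sizes come from the text, and the 2-way associativity here is an assumption for illustration:

```python
import math

# Address breakdown for this example (2-way associativity assumed).
address_bits, block_bytes, cache_bytes, ways = 16, 16, 128, 2

num_blocks = cache_bytes // block_bytes               # 8 blocks
num_sets = num_blocks // ways                         # 4 sets
offset_bits = int(math.log2(block_bytes))             # 4 bits
index_bits = int(math.log2(num_sets))                 # 2 bits
tag_bits = address_bits - index_bits - offset_bits    # 10 bits

print(num_sets, offset_bits, index_bits, tag_bits)    # 4 4 2 10
```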
Issues in Set-Associative Caches
- When blocks in a set carry priorities, how do we decide those priorities?
- Three key decisions in a set
- Insertion, promotion, eviction (replacement)
- Insertion: What happens to priorities on a cache fill?
- Where to insert the incoming block, whether or not to insert the block
- Promotion: What happens to priorities on a cache hit?
- Whether and how to change block priority
- Eviction/replacement: What happens to priorities on a cache miss?
- Which block to evict (replace) and how to adjust priorities
- Any invalid block first
- If all are valid, consult the replacement policy
Implementing LRU
- Idea: Evict the least recently accessed block
- Problem: Need to keep track of access ordering of blocks
- LRU is complex ⇒ modern processors do not implement true LRU
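A minimal LRU sketch for a single set, assuming recency is tracked with an ordered map (illustrative; real hardware only approximates this):

```python
from collections import OrderedDict

# Minimal LRU sketch for one set: tags kept in recency order, LRU first.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()          # tag -> data, LRU first

    def access(self, tag, fill_data=None):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # promotion on a hit
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[tag] = fill_data         # insertion on a fill
        return "miss"
```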
Cache Replacement Policy: LRU or Random
- LRU vs. Random
- Example: 4-way cache, cyclic references to A, B, C, D, E
- 0% hit rate with LRU policy
- Set thrashing: When the “program working set” in a set is larger than set associativity
- Random replacement policy is better when thrashing occurs
- In practice, the average hit rates of LRU and random replacement are similar ⇒ a hybrid of the two is best
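A quick demo of the thrashing case, reusing the `LRUSet` sketch above; the random-replacement side is a simple set with a seeded RNG:

```python
import random
random.seed(0)

refs = list("ABCDE") * 20                 # cyclic references A..E
lru = LRUSet(4)                           # 4-way set, LRU policy
lru_hits = sum(lru.access(t) == "hit" for t in refs)

resident, rand_hits = set(), 0            # 4-way set, random policy
for t in refs:
    if t in resident:
        rand_hits += 1
    else:
        if len(resident) >= 4:
            resident.remove(random.choice(sorted(resident)))
        resident.add(t)

print(lru_hits, rand_hits)                # LRU: 0; random: usually > 0
```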
What's In A Tag Store Entry?
- Valid bit
- Tag
- Replacement policy bit
- Dirty bit
- Write back vs. write through caches
Handling Writes
- What if the processor writes to an entire block over a small amount of time?
- Is there any need to bring the block into the cache from memory in the first place?
- Why not simply write to only a portion of the block, i.e., a subblock?
- E.g., 4 bytes out of 64 bytes
- Problem: Valid and dirty bits are associated with the entire 64 bytes, not with each individual 4 bytes
Subblocked (Sectored) Caches
- Idea: Divide a block into subblocks (or sectors)
- Have separate valid and dirty bits for each subblock (sector)
- Allocate only a subblock (or a subset of subblocks) on a request
- Pros
- No need to transfer the entire cache block into the cache
- More freedom in transferring subblocks into the cache (a cache block does not need to be in the cache fully)
- Cons
- More complex design
- May not exploit spatial locality fully
![](https://media.vlpt.us/images/lsj8706/post/cd782bf3-6290-439f-b7e4-dae3a19f1960/Untitled%2021.png)
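A sketch of a subblocked cache line with per-subblock valid and dirty bits (4 subblocks here; all names are illustrative, and write-back of a dirty victim is omitted):

```python
# Sketch of a subblocked line: one tag, but per-subblock valid/dirty
# bits, so a write allocates only the subblock it touches.
class SubblockedLine:
    SUBBLOCKS = 4                         # e.g. 4 x 16 B in a 64 B block

    def __init__(self):
        self.tag = None
        self.valid = [False] * self.SUBBLOCKS
        self.dirty = [False] * self.SUBBLOCKS

    def write(self, tag, subblock):
        if self.tag != tag:
            # New block: reset subblock state; no need to transfer the
            # entire block from memory first.
            self.tag = tag
            self.valid = [False] * self.SUBBLOCKS
            self.dirty = [False] * self.SUBBLOCKS
        self.valid[subblock] = True       # allocate only this subblock
        self.dirty[subblock] = True
```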
Cache Parameters vs. Miss/Hit Rate
Cache size
- total data (not including tag) capacity
- bigger can exploit temporal locality better
- Bigger is not always better
- Too large a cache adversely affects hit and miss latency
- Too small a cache doesn’t exploit temporal locality well, useful data replaced often
Block size
- Block size is the data that is associated with an address tag
- not necessarily the unit of transfer between hierarchy levels (because of sub-blocking)
- Too small blocks
- don’t exploit spatial locality well
- have larger tag overhead
- Too large blocks
- Fewer total blocks ⇒ less temporal locality exploitation
- waste of cache space and bandwidth/energy if spatial locality is not high
Large Blocks: Critical-Word and Subblocking
- Large cache blocks can take a long time to fill into the cache
- Idea: Fill cache block critical-word first
- Large cache blocks can waste bus bandwidth
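A sketch of the critical-word-first fill order (illustrative; real memory controllers do this in hardware):

```python
# Critical-word-first fill: return the requested word first, then wrap
# around the rest of the block.
def critical_word_first(block_words, requested):
    n = len(block_words)
    return [block_words[(requested + i) % n] for i in range(n)]

print(critical_word_first(["w0", "w1", "w2", "w3"], 2))
# ['w2', 'w3', 'w0', 'w1'] -- the CPU can restart as soon as w2 arrives
```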
Associativity
- How many blocks can be present in the same index (i.e., set)?
- Larger associativity
- lower miss rate (reduced conflicts)
- higher hit latency and area cost (plus diminishing returns)
- Smaller associativity
- lower cost
- lower hit latency
Prefetching
- Idea: Fetch the data before it is needed (i.e. pre-fetch) by the program
- Why?
- Memory latency is high. If we can prefetch accurately and early enough we can reduce/eliminate that latency.
- Can eliminate compulsory cache misses
- Involves predicting which address will be needed in the future
- Works if programs have predictable miss address patterns
- Correctness
- Does a misprediction in prefetching affect correctness?
- ⇒ No, prefetched data at a “mispredicted” address is simply not used
- ⇒ There is no need for state recovery (in contrast to branch misprediction)
- In modern systems, prefetching is usually done at cache-block granularity
- Prefetching is a technique that can reduce both the miss rate and the miss penalty
- Prefetching can be done by hardware, the compiler, the programmer, or the system
- The Four Questions
- What
- What addresses to prefetch (i.e., address prediction algorithm)
- When
- When to initiate a prefetch request (early, late, on time)
- Where
- Where to place the prefetched data (caches, separate buffer)
- Where to place the prefetcher (which level in memory hierarchy)
- How
- How does the prefetcher operate and who operates it (software, hardware,..)
Challenges in Prefetching: What
- What addresses to prefetch
- Prefetching useless data wastes resources (memory bandwidth, buffer space, etc.)
- Accurate prediction of addresses to prefetch is important
- How do we know what to prefetch?
- Predict based on past access patterns
- Use the compiler’s knowledge of data structures
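A minimal sketch of one such pattern-based predictor, a stride prefetcher (illustrative names; real prefetchers typically track strides per PC or per page):

```python
# Minimal stride prefetcher: if two consecutive accesses repeat the
# same nonzero stride, predict the next address.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def on_access(self, addr):
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride   # pattern confirmed
            self.last_stride = stride
        self.last_addr = addr
        return prediction                    # address to prefetch, or None

pf = StridePrefetcher()
print([pf.on_access(a) for a in (0, 64, 128, 192)])
# [None, None, 192, 256] -- prediction starts once the stride repeats
```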
- Accuracy
- Coverage
- Timeliness
- Bandwidth consumption
- Cache pollution