

























## **Cache Operation**

Miss rates by cache size, associativity, and replacement policy:

| Size   | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random |
|--------|-----------|--------------|-----------|--------------|-----------|--------------|
| 16 KB  | 5.18%     | 5.69%        | 4.67%     | 5.29%        | 4.39%     | 4.96%        |
| 64 KB  | 1.88%     | 2.01%        | 1.54%     | 1.66%        | 1.39%     | 1.53%        |
| 256 KB | 1.15%     | 1.17%        | 1.13%     | 1.13%        | 1.12%     | 1.12%        |
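To make the LRU and Random columns concrete, here is a minimal sketch of a set-associative cache simulator that supports both replacement policies. The function name `miss_rate`, its parameters, and the synthetic access trace are assumptions for illustration; this is not the benchmark setup behind the table.

```python
import random
from collections import OrderedDict


def miss_rate(addresses, num_sets, ways, policy="lru", block_bytes=64, seed=0):
    """Simulate a set-associative cache; return the fraction of accesses that miss."""
    rng = random.Random(seed)  # seeded so the random policy is repeatable
    # One ordered map per set: block number -> None, least recently used first.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        lines = sets[block % num_sets]
        if block in lines:
            if policy == "lru":
                lines.move_to_end(block)  # mark as most recently used
            continue
        misses += 1
        if len(lines) == ways:  # set is full, pick a victim
            if policy == "lru":
                lines.popitem(last=False)  # evict the least recently used block
            else:
                del lines[rng.choice(list(lines))]  # evict a random block
        lines[block] = None
    return misses / len(addresses)


if __name__ == "__main__":
    # Hypothetical trace: three blocks that all map to set 0 of a 2-way cache.
    trace = [0, 256, 512] * 100
    print("LRU   :", miss_rate(trace, num_sets=4, ways=2, policy="lru"))
    print("Random:", miss_rate(trace, num_sets=4, ways=2, policy="random"))
```

The cyclic trace is a worst case for LRU, which always evicts the block needed next, so Random does better here. On real workloads such as those in the table, LRU is usually slightly better, and the gap shrinks as the cache grows.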











| Size   | Instruction cache | Data cache | Unified cache |
|--------|-------------------|------------|---------------|
| 1 KB   | 3.06%             | 24.61%     | 13.34%        |
| 2 KB   | 2.26%             | 20.57%     | 9.78%         |
| 4 KB   | 1.78%             | 15.94%     | 7.24%         |
| 8 KB   | 1.10%             | 10.19%     | 4.57%         |
| 16 KB  | 0.64%             | 6.47%      | 2.87%         |
| 32 KB  | 0.39%             | 4.82%      | 1.99%         |
| 64 KB  | 0.15%             | 3.77%      | 1.35%         |
| 128 KB | 0.02%             | 2.88%      | 0.95%         |





- <u>Write allocate:</u> the block is loaded on a write miss, followed by write hit actions.
- No-write allocate: the block is modified in the lower level (a lower cache level, or main memory) and not loaded into the cache.
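The difference between the two policies shows up when the same block is written repeatedly. The sketch below is a minimal model under stated assumptions: the function name `write_sequence` is hypothetical, the cache starts empty, and it has unlimited capacity so that only the allocation policy affects the counts.

```python
def write_sequence(blocks, write_allocate):
    """Apply a sequence of writes to an initially empty cache.

    Returns (hits, misses). The cache is modeled as a set of block
    addresses with unlimited capacity (an assumption of this sketch).
    """
    cached = set()
    hits = misses = 0
    for block in blocks:
        if block in cached:
            hits += 1  # write hit: update the cached copy
        else:
            misses += 1
            if write_allocate:
                cached.add(block)  # load the block, then write it as a hit would
            # No-write allocate: the write goes to the lower level only,
            # so the block is still absent on the next access.
    return hits, misses


if __name__ == "__main__":
    writes = [100, 100, 100]  # three writes to the same block
    print("write allocate   :", write_sequence(writes, write_allocate=True))
    print("no-write allocate:", write_sequence(writes, write_allocate=False))
```

With write allocate, only the first write misses. With no-write allocate, every write to that block misses until a read brings it into the cache.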

Which has the lower miss rate: a 16 KB cache for each of instructions and data, or a combined 32 KB cache? The miss rates used are 0.64% (16 KB instruction), 6.47% (16 KB data), and 1.99% (32 KB combined).

- Assume a hit takes 1 cycle and a miss costs 50 cycles; 75% of memory accesses are instruction fetches.
- Miss rate of split cache = 0.75 × 0.64% + 0.25 × 6.47% = 2.1%
- That is slightly higher than the 1.99% for the combined cache. But what about average memory access time?
- Split cache: 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05 cycles
- Combined cache: extra cycle for load/store
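The comparison can be reproduced with a few lines of arithmetic. This is a sketch, not the slide's own worked solution: the function names are hypothetical, and the combined-cache formula assumes that "extra cycle for load/store" means one additional hit cycle on every data access, because the single cache is busy with an instruction fetch.

```python
HIT_CYCLES = 1
MISS_PENALTY = 50
INSTR_FRACTION = 0.75               # share of memory accesses that are instruction fetches
DATA_FRACTION = 1 - INSTR_FRACTION  # share that are loads and stores


def split_miss_rate(instr_miss, data_miss):
    """Overall miss rate of separate instruction and data caches."""
    return INSTR_FRACTION * instr_miss + DATA_FRACTION * data_miss


def amat_split(instr_miss, data_miss):
    """Average memory access time (cycles) with split caches."""
    instr = HIT_CYCLES + instr_miss * MISS_PENALTY
    data = HIT_CYCLES + data_miss * MISS_PENALTY
    return INSTR_FRACTION * instr + DATA_FRACTION * data


def amat_combined(miss, extra_data_cycles=1):
    """Average memory access time (cycles) with one combined cache.

    Assumption: each load/store pays `extra_data_cycles` on top of the hit time.
    """
    instr = HIT_CYCLES + miss * MISS_PENALTY
    data = HIT_CYCLES + extra_data_cycles + miss * MISS_PENALTY
    return INSTR_FRACTION * instr + DATA_FRACTION * data


if __name__ == "__main__":
    print("split miss rate:", split_miss_rate(0.0064, 0.0647))
    print("split AMAT     :", amat_split(0.0064, 0.0647))
    print("combined AMAT  :", amat_combined(0.0199))
```

Under that assumption the split organization comes out at about 2.05 cycles and the combined cache at about 2.245 cycles, so the split caches win on access time despite the slightly higher miss rate.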



## **Example**

- A CPU with CPI<sub>execution</sub> = 1.1 uses a unified L1 cache with <u>write back</u>, with <u>write allocate</u>, and the probability that a cache block is dirty = 10%.
- Instruction mix: 50% arith/logic, 15% load, 15% store, 20% control.
- Assume a cache miss rate of 1.5% and a miss penalty of 50 cycles.

Fall 07 CSE4201
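Only the setup of this example is given here, so the following is one possible way to work it, under assumptions that are not stated on the slide: every instruction makes one instruction fetch, each load or store makes one data access, and a miss that evicts a dirty block costs two miss penalties (one to write the block back, one to fetch the new block). The function name `cpi_with_cache` is hypothetical.

```python
def cpi_with_cache(cpi_execution, load_frac, store_frac,
                   miss_rate, miss_penalty, dirty_prob):
    """Estimate CPI for a unified write-back, write-allocate L1 cache.

    Assumptions of this sketch: one fetch per instruction, one data access
    per load/store, and a dirty victim doubles the miss penalty.
    """
    accesses_per_instr = 1 + load_frac + store_frac
    clean_cost = miss_penalty        # fetch the missing block
    dirty_cost = 2 * miss_penalty    # write back the victim, then fetch
    avg_miss_cost = (1 - dirty_prob) * clean_cost + dirty_prob * dirty_cost
    stall_cycles_per_instr = accesses_per_instr * miss_rate * avg_miss_cost
    return cpi_execution + stall_cycles_per_instr


if __name__ == "__main__":
    cpi = cpi_with_cache(cpi_execution=1.1, load_frac=0.15, store_frac=0.15,
                         miss_rate=0.015, miss_penalty=50, dirty_prob=0.10)
    print("CPI with cache stalls:", cpi)
```

With these assumptions the memory stalls add 1.3 × 0.015 × 55 = 1.0725 cycles per instruction, for a CPI of about 2.17.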











- Reducing Hit Time
  1. Giving Reads Priority over Writes
     - E.g., a read completes before earlier writes in the write buffer
  2. Avoiding Address Translation during Cache Indexing
- Reducing Miss Penalty
  3. Multilevel Caches
- Reducing Miss Rate
  4. Larger Block Size (Compulsory misses)
  5. Larger Cache Size (Capacity misses)
  6. Higher Associativity (Conflict misses)













## **Way Prediction**

- How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
- Way prediction: keep extra bits in the cache to predict the "way," or block within the set, of the next cache access.
  - The multiplexor is set early to select the desired block, and only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
  - Miss $\Rightarrow$ first check the other blocks for matches in the next clock cycle

[Figure: access timeline showing Hit Time, Way-Miss Hit Time, and Miss Penalty]

- Accuracy ≈ 85%
- Drawback: the CPU pipeline is hard to design if a hit takes 1 or 2 cycles
  - Used for instruction caches rather than data caches
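The effect of prediction accuracy on hit time can be estimated with a weighted average. This sketch assumes a correctly predicted hit takes 1 cycle and a hit in the other way takes 2 cycles, matching the "1 or 2 cycles" mentioned above; the function name `average_hit_time` is hypothetical.

```python
def average_hit_time(accuracy, predicted_cycles=1, mispredicted_cycles=2):
    """Average hit time in cycles for a way-predicted cache.

    Assumption: a wrong way prediction that still hits costs one extra cycle.
    """
    return accuracy * predicted_cycles + (1 - accuracy) * mispredicted_cycles


if __name__ == "__main__":
    print("average hit time at 85% accuracy:", average_hit_time(0.85))
```

At 85% accuracy the average hit time is about 1.15 cycles, close to a direct-mapped cache while keeping the conflict-miss behavior of a 2-way cache.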









- Don't wait for the full block before restarting the CPU
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Spatial locality $\Rightarrow$ tend to want the next sequential word, so the size of the benefit of just early restart is not clear
- Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
  - Long blocks are more popular today $\Rightarrow$ Critical Word First is widely used
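The two techniques differ in the order in which words arrive and in how long the CPU waits. The sketch below uses hypothetical function names and assumes that, with critical word first, the remaining words follow in wrap-around order, which is one common choice but is not specified above.

```python
def fill_order(block_words, missed_word, critical_word_first):
    """Return the order in which the words of a block arrive from memory."""
    if critical_word_first:
        # Start at the missed word and wrap around the block (assumed order).
        return [(missed_word + i) % block_words for i in range(block_words)]
    return list(range(block_words))  # plain sequential fill


def words_until_restart(block_words, missed_word, critical_word_first):
    """Words that must arrive before the CPU resumes, given early restart."""
    order = fill_order(block_words, missed_word, critical_word_first)
    return order.index(missed_word) + 1


if __name__ == "__main__":
    # 8-word block, CPU missed on word 5.
    print("early restart only :", words_until_restart(8, 5, False), "words")
    print("critical word first:", words_until_restart(8, 5, True), "word")
    print("fill order         :", fill_order(8, 5, True))
```

With early restart alone the CPU still waits for every word ahead of the missed one, so the saving depends on where in the block the miss falls. Critical word first removes that dependence.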

## **Merging Write Buffers**

- A write buffer allows the processor to continue while waiting to write to memory
- If the buffer contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write buffer entry
- If so, the new data are combined with that entry
- This increases the block size of writes for a write-through cache when writes go to sequential words or bytes, since multiword writes are more efficient to memory
- The Sun T1 (Niagara) processor, among many others, uses write merging
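The address check and merge can be sketched as a small buffer model. Everything here is an illustrative assumption: the class name `MergingWriteBuffer`, four 8-byte words per entry, and four entries. It does not describe the Sun T1 implementation.

```python
WORD_BYTES = 8        # assumed word size
WORDS_PER_ENTRY = 4   # assumed entry width: four sequential words
ENTRY_BYTES = WORD_BYTES * WORDS_PER_ENTRY


class MergingWriteBuffer:
    """Write buffer that merges writes falling in the same aligned entry."""

    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        self.entries = {}  # entry base address -> {word index: value}

    def write(self, addr, value):
        """Buffer one word write; return False if the buffer is full (CPU stalls)."""
        base = addr - addr % ENTRY_BYTES
        word = (addr % ENTRY_BYTES) // WORD_BYTES
        if base in self.entries:
            self.entries[base][word] = value  # address matches: merge
            return True
        if len(self.entries) == self.num_entries:
            return False  # no free entry and nothing to merge with
        self.entries[base] = {word: value}
        return True


if __name__ == "__main__":
    buf = MergingWriteBuffer()
    for addr in (96, 104, 112, 120):  # four sequential words
        buf.write(addr, value=addr)
    print("entries used:", len(buf.entries))
    print("words in entry:", sorted(buf.entries[96]))
```

Without merging, the four sequential writes would occupy four entries and go to memory as four single-word writes. With merging they share one entry and can be sent as one multiword write.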