Appendix_C_Cache


Presentation Transcript

Computer Architecture CSC 520:

Computer Architecture CSC 520 Appendix C Cache Memory

Computer Architecture CSC 520:

Computer Architecture CSC 520 A typical memory hierarchy:

Level         Size        Speed     Reference type
Registers     500 bytes   0.25 ns   Register reference
Cache         64 KB       1 ns      Cache reference
Main memory   512 MB      100 ns    Memory reference
Disk (I/O)    100 GB      5 ms      Disk memory reference

Computer Architecture CSC 520:

Computer Architecture CSC 520
Cache hit: the CPU finds the requested data in the cache.
Cache miss: the CPU does not find the requested data in the cache.
Cache is increasingly important because memory speed is not keeping up with CPU speed.
Cache runs at or near the speed of on-chip registers.
Cache heavily utilizes the principle of locality.

Computer Architecture CSC 520:

Computer Architecture CSC 520 We can pose four questions about any level of the hierarchy:
1. Where can a block be placed in the upper level? [block placement]
2. How is a block found if it is in the upper level? [block identification]
3. Which block should be replaced on a miss? [block replacement]
4. What happens on a write? [write strategy]

Computer Architecture CSC 520:

Computer Architecture CSC 520 Cache: generally the first level of the memory hierarchy encountered once the address leaves the CPU. The term "cache" is now applied whenever buffering is employed to reuse commonly occurring items.

Where can a block be placed in cache? There are three common basic organizations:
1. Fully associative: a block can go anywhere in the cache.
2. Direct mapped: block i goes into frame (i mod number of blocks in the cache).
3. Set associative: with four sets, block i goes into set (i mod 4), and then into any frame within that set.

The majority of caches today are direct mapped, two-way set associative, or four-way set associative.

CSC 520 COMPUTER ARCHITECTURE:

CSC 520 COMPUTER ARCHITECTURE [Figure: direct mapping; main memory blocks 0 through 7 map onto cache blocks 0 through 3, block i going to cache block (i mod 4).]

CSC 520 COMPUTER ARCHITECTURE:

CSC 520 COMPUTER ARCHITECTURE [Figure: two-way set associative mapping; memory blocks 0 through 35 map onto 8 cache frames organized as 4 sets of 2 frames each. One memory block fits exactly into one cache frame.]

CSC 520 COMPUTER ARCHITECTURE:

CSC 520 COMPUTER ARCHITECTURE [Figure: fully associative mapping; the block-address portion of the real address is compared in parallel against the address tag of every block frame (0 through N), a match selects the frame, and the word offset selects the data within the block.]

Computer Architecture CSC 520:

Computer Architecture CSC 520 How is a block found in cache? Caches have an address tag on each block frame that gives the block address. The tag of every cache block that might contain the desired information is checked to see if it matches the block address of the CPU request. This is done in parallel, because speed is crucial. A valid bit is used to ensure that the data is valid. The address is divided into three fields: tag, index, and offset. Only the tag field needs to be checked to see if the block is in the block frame: the entire block is present, so the offset need not be compared, and the index was already used to select the set being examined in the first place.
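As a rough illustration of this split (not part of the original slides), here is a minimal C sketch that carves an address into tag, index, and offset; the geometry (64-byte blocks, 256 sets) is an assumed example, not a parameter from the slides.

#include <stdint.h>
#include <stdio.h>

/* Assumed example geometry: 64-byte blocks, 256 sets (both powers of two). */
#define BLOCK_SIZE 64u    /* bytes per block  -> 6 offset bits */
#define NUM_SETS   256u   /* sets in the cache -> 8 index bits */

int main(void) {
    uint32_t addr = 0x12345678u;

    uint32_t offset = addr % BLOCK_SIZE;               /* byte within the block */
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;  /* which set to look in  */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);  /* compared against stored tags */

    printf("addr=0x%08x tag=0x%x index=%u offset=%u\n", addr, tag, index, offset);
    return 0;
}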

Computer Architecture CSC 520:

Computer Architecture CSC 520 What happens on a write? Reads dominate cache accesses: all instruction accesses are reads, and most instructions don't write to memory. Writes typically constitute about 7% of memory traffic and about 25% of data cache traffic. Making the common case fast suggests optimizing caches for reads; this is further emphasized by the fact that processors typically wait for reads but rarely wait for writes. The common case (the read) is easy to make fast: the block can be read from cache at the same time the tag is read and compared, so the block read can begin as soon as the block address is available. If the read is a hit, the requested part of the block is passed on to the CPU. If it is a miss, there is no benefit but also no harm; the value read is simply ignored.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Writes cannot be handled this way: a block cannot be modified until the tag is checked to see whether the address is a hit. Writes therefore usually take longer than reads; also, more bytes can typically be read than written. Write policies often distinguish cache designs. There are two basic options:
Write through: the information is written to both the block in the cache and the block in lower-level memory.
Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced. To reduce the frequency of writing back blocks on replacement, a dirty bit is commonly used, indicating whether or not the block has been modified.
Write through is easier to implement; write back uses less memory bandwidth.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Cache performance. The best performance measure is the average time to access memory:

Average memory access time = hit time + miss rate * miss penalty

The miss penalty shows up as stall cycles:

Exec time = (CPU clock cycles + memory stall cycles) * clock cycle time
Memory stall cycles = number of misses * miss penalty
                    = IC * misses/instruction * miss penalty
                    = IC * memory accesses/instruction * miss rate * miss penalty

Computer Architecture CSC 520:

Computer Architecture CSC 520 Example. Assume a machine where CPI is 1.0 when all memory accesses hit the cache, the only data accesses are loads and stores totaling 50% of instructions, the miss rate is 2%, and the miss penalty is 25 clock cycles. How much faster would the machine be if all instructions were cache hits?

First, compute the performance of a computer that always hits:
Exec time = (CPU clock cycles + memory stall cycles) * clock cycle = (IC * CPI + 0) * clock cycle = IC * 1.0 * clock cycle

Second, the real computer with a real cache:
Memory stall cycles = IC * memory accesses/instruction * miss rate * miss penalty = IC * (1.0 + 0.5) * 0.02 * 25 = IC * 0.75

Thus Exec time = (IC * 1.0 + IC * 0.75) * clock cycle = 1.75 * IC * clock cycle.
Performance ratio = 1.75 / 1.0, so the machine with no cache misses is 1.75 times faster.
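A small C sketch (not from the slides) that reproduces this arithmetic with the stall-cycle formula above:

#include <stdio.h>

/* Reproduces the example: CPI = 1.0 with a perfect cache, 0.5 data accesses
   per instruction, 2% miss rate, 25-cycle miss penalty. */
int main(void) {
    double cpi_perfect       = 1.0;
    double accesses_per_inst = 1.0 + 0.5;  /* 1 instruction fetch + 0.5 data accesses */
    double miss_rate         = 0.02;
    double miss_penalty      = 25.0;

    double stall_per_inst = accesses_per_inst * miss_rate * miss_penalty; /* 0.75 */
    double cpi_real       = cpi_perfect + stall_per_inst;                 /* 1.75 */

    printf("speedup of a perfect cache over the real cache: %.2f\n",
           cpi_real / cpi_perfect);  /* prints 1.75 */
    return 0;
}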

Computer Architecture CSC 520:

Computer Architecture CSC 520 On a cache miss, which block should be replaced? Ideally, we would replace the block that is least likely to be used in the near future. Under some cache organizations we have no choice: with direct mapping, a given memory block can only be placed in one particular cache block. Under fully associative mapping we can put the block anywhere, and under set associative mapping we have a limited choice. Three commonly used replacement schemes:
Random: choose the block to be replaced at random.
LRU (least recently used): based on the principle of locality, replace the block that has not been referenced for the longest time.
FIFO (first come, first served): replace the block that has been in the cache longest; simpler than LRU.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Figure 5.7 illustrates the relative performance of these replacement schemes for set associative organizations (data cache misses per 1000 instructions):

          Two-way                Four-way               Eight-way
Size    LRU    Random  FIFO    LRU    Random  FIFO    LRU    Random  FIFO
16 KB   114.1  117.3   115.5   111.7  115.1   113.3   109.0  111.8   110.4
64 KB   103.4  104.3   103.9   102.4  102.3   103.1   99.7   100.5   100.3
256 KB  92.2   92.1    92.5    92.1   92.1    92.5    92.1   92.1    92.5

Computer Architecture CSC 520:

Computer Architecture CSC 520 Figure 5.8 Misses per 1000 instructions for instruction, data, and unified caches:

Size     Instruction cache   Data cache   Unified cache
8 KB     8.16                44.0         63.0
16 KB    3.82                40.9         51.0
32 KB    1.36                38.4         43.3
64 KB    0.61                36.9         39.4
128 KB   0.30                35.3         36.2
256 KB   0.02                32.6         32.9

The percentage of instruction references is about 74%.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Problem example. Which has the lower miss rate: a 16 KB instruction cache paired with a 16 KB data cache, or a 32 KB unified cache? Use the miss rates in Figure 5.8. Assume a hit takes 1 clock cycle and the miss penalty is 100 clock cycles, and that a load or store hit takes 1 extra clock cycle on the unified cache. Why? Because there is only one cache port to satisfy two simultaneous requests (an instruction fetch and a data access): a structural hazard. What is the average memory access time in each case? Assume a write-through cache with a write buffer, and ignore stalls due to the write buffer. Again, instruction fetches constitute about 74% of memory references, meaning about one third of instructions are loads or stores.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Problem example, continued. First we must convert misses per 1000 instructions into a miss rate:

Miss rate = (misses per 1000 instructions / 1000) / (memory accesses per instruction)

Since every instruction makes exactly one memory access to fetch the instruction, the instruction miss rate for the 16 KB instruction cache is (3.82/1000) / 1.00 = 0.0038, or 0.38%.

Since 36% of instructions are data transfers, the data miss rate for the 16 KB data cache is (40.9/1000) / 0.36 = 0.114, or 11.4%. (This means that for data-access instructions, the data access misses the cache 11.4% of the time.)

Unified cache: the unified cache must account for both instruction and data accesses, so the miss rate for the 32 KB unified cache is (43.3/1000) / (1.0 + 0.36) = 0.0318 misses per memory access, or 3.18%.

Computer Architecture CSC 520:

Computer Architecture CSC 520 As specified, about 74% of memory accesses are instruction references and 26% are data references, so the overall miss rate for the split caches is (74% * 0.0038) + (26% * 0.114) = 0.0324, or 3.24%. The unified cache, at 3.18%, therefore has a slightly lower miss rate than the split caches.

Now we can calculate the average memory access time for the two approaches:

Average memory access time = % instructions * (hit time + instruction miss rate * miss penalty) + % data * (hit time + data miss rate * miss penalty)

Split caches: 74% * (1 + 0.0038 * 100) + 26% * (1 + 0.114 * 100) = 0.74 * 1.38 + 0.26 * 12.4 = 1.02 + 3.22 = 4.24
Unified cache: 74% * (1 + 0.0318 * 100) + 26% * (1 + 1 + 0.0318 * 100) = 0.74 * 4.18 + 0.26 * 5.18 = 3.09 + 1.35 = 4.44

Even though the unified cache has the lower miss rate, its average memory access time is worse. Why? Because the structural hazard forces a stall (the extra clock cycle) on every data access.
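The same calculation can be checked with a short C sketch; the only inputs are the Figure 5.8 miss counts and the assumptions stated above.

#include <stdio.h>

/* Split vs unified comparison: miss rates from Figure 5.8, 100-cycle miss
   penalty, 1-cycle hit, plus 1 extra cycle for data hits in the unified cache. */
int main(void) {
    double frac_inst = 0.74, frac_data = 0.26;
    double penalty = 100.0;

    double mr_i = 3.82 / 1000.0 / 1.00;          /* 16 KB instruction cache */
    double mr_d = 40.9 / 1000.0 / 0.36;          /* 16 KB data cache        */
    double mr_u = 43.3 / 1000.0 / (1.0 + 0.36);  /* 32 KB unified cache     */

    double amat_split   = frac_inst * (1.0 + mr_i * penalty)
                        + frac_data * (1.0 + mr_d * penalty);
    double amat_unified = frac_inst * (1.0 + mr_u * penalty)
                        + frac_data * (1.0 + 1.0 + mr_u * penalty); /* +1 for the structural hazard */

    printf("split:   %.2f cycles\n", amat_split);    /* about 4.24 */
    printf("unified: %.2f cycles\n", amat_unified);  /* about 4.44 */
    return 0;
}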

Computer Architecture CSC 520:

Computer Architecture CSC 520 Effect of cache on processor performance. We wish to compare a machine with a "perfect cache" (one that always hits) to a machine with a real cache with the following characteristics, and then to a machine with no cache at all.

Real cache machine: the cache miss penalty is 100 clock cycles, all instructions normally take 1 clock cycle (ignoring memory stalls), the average miss rate is 2%, there are an average of 1.5 memory references per instruction, and the average number of cache misses per 1000 instructions is 30. What is the impact on performance when the behavior of the cache is included?

CPU time = IC * (CPI + memory stall clock cycles per instruction) * clock cycle time
         = IC * (1.0 + (30/1000 * 100)) * clock cycle time
         = IC * 4.0 * clock cycle time

For the machine with a perfect cache, CPU time = IC * 1.0 * clock cycle time, so the machine with the perfect cache is 4 times faster than the machine with the real cache.

What if the machine had no cache? Then all memory references miss:
CPU time = IC * (1.0 + 1.5 * 100) * clock cycle time = IC * 151 * clock cycle time, or about 40 times slower than the machine with the real cache.

Computer Architecture CSC 520:

Computer Architecture CSC 520 We can repeat this problem using miss rate rather than misses per 1000 instructions; the result is the same:

CPU time = IC * (CPI_exec + miss rate * memory accesses per instruction * miss penalty) * clock cycle time
         = IC * (1.0 + (1.5 * 2% * 100)) * clock cycle time
         = IC * 4.00 * clock cycle time

Computer Architecture CSC 520:

Computer Architecture CSC 520 What is the impact of two different cache organizations on the performance of a CPU? Assume CPI with a perfect cache is 2.0, the clock cycle time is 1.0 ns, there are 1.5 memory references per instruction, the size of both caches is 64 KB, and both have a block size of 64 bytes. One cache is direct mapped and the other is two-way set associative. Because of the multiplexor required for the set associative cache, its clock cycle must be 1.25 times longer. For either cache organization the miss penalty is 75 ns. Assume the hit time is 1 clock cycle, the miss rate of the direct-mapped 64 KB cache is 1.4%, and the miss rate of the two-way set associative cache of the same size is 1.0%.

First calculate the average memory access time = hit time + miss rate * miss penalty for each organization:
Direct mapped: 1.0 + (0.014 * 75) = 2.05 ns
Two-way: 1.0 * 1.25 + (0.010 * 75) = 2.0 ns
The average memory access time is better for the two-way set associative cache.

CPU performance:
CPU time = IC * (CPI + misses per instruction * miss penalty) * clock cycle
         = IC * (CPI * clock cycle) + IC * (miss rate * memory accesses per instruction * miss penalty * clock cycle)
[continued on the next slide]

Computer Architecture CSC 520:

Computer Architecture CSC 520 Substituting into these formulas (using 75 ns for miss penalty * clock cycle):

CPU time (direct) = IC * (2 * 1.0 + (1.5 * 0.014 * 75)) = 3.58 * IC
CPU time (2-way)  = IC * (2 * 1.0 * 1.25 + (1.5 * 0.010 * 75)) = 3.63 * IC

Note that the memory access time for the two-way cache is better, but its overall performance is worse. How can this be? Because the clock cycle is stretched (in the set associative cache) for ALL instructions, and two clock cycles are required per instruction; so even though the time to access memory is lower, the execution time for each instruction is longer. If the CPI were 1 with a perfect cache, the result would be different.
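For completeness, a C sketch that reproduces these two CPU-time figures; the constants are exactly the assumptions listed in the example.

#include <stdio.h>

/* Direct-mapped vs two-way: CPI = 2.0, clock = 1.0 ns, 1.5 memory references
   per instruction, 75 ns miss penalty, clock stretched by 1.25x for two-way. */
int main(void) {
    double refs_per_inst = 1.5, penalty_ns = 75.0;
    double cpi = 2.0, clock_ns = 1.0, stretch = 1.25;

    double t_direct = cpi * clock_ns           + refs_per_inst * 0.014 * penalty_ns;
    double t_2way   = cpi * clock_ns * stretch + refs_per_inst * 0.010 * penalty_ns;

    printf("CPU time per instruction, direct mapped: %.2f ns\n", t_direct); /* ~3.58 */
    printf("CPU time per instruction, two-way:       %.2f ns\n", t_2way);   /* ~3.63 */
    return 0;
}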

Computer Architecture CSC 520:

Computer Architecture CSC 520 [Figure: evolution of the performance gap between memory and CPU, 1980-2005, baseline 1980, with a logarithmic y-axis; CPU performance grows far faster than memory performance.]

Computer Architecture CSC 520:

Computer Architecture CSC 520 Improving Cache Performance: Figure 5.9, summary of performance equations in this chapter.

2^index = Cache size / (Block size * Set associativity)
CPU execution time = (CPU clock cycles + Memory stall cycles) * clock cycle time
Memory stall cycles = number of misses * miss penalty
Memory stall cycles = IC * misses per instruction * miss penalty
Misses per instruction = miss rate * memory accesses per instruction
Average memory access time = hit time + miss rate * miss penalty
CPU execution time = IC * (CPI_execution + memory stall clock cycles per instruction) * clock cycle time
CPU execution time = IC * (CPI_execution + misses per instruction * miss penalty) * clock cycle time
CPU execution time = IC * (CPI_execution + miss rate * memory accesses per instruction * miss penalty) * clock cycle time
Average memory access time = hit time_L1 + miss rate_L1 * (hit time_L2 + miss rate_L2 * miss penalty_L2)
Memory stall cycles per instruction = misses_L1 per instruction * hit time_L2 + misses_L2 per instruction * miss penalty_L2

Computer Architecture CSC 520:

Computer Architecture CSC 520 Reducing the Cache Miss Penalty. We can improve the performance of our machine by reducing the number of cache misses, but we can also improve performance by reducing the penalty we incur when we do miss the cache. We will look at five techniques for reducing the penalty on a miss.

First miss penalty reduction technique: multilevel caches.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Multilevel caches. An important architectural question: should the cache be made faster to keep pace with the speed of the CPU, or larger to overcome the widening gap between the CPU and main memory? The answer is both. We can add another level of cache between the original cache and main memory. The first-level cache can be small enough to match the clock cycle time of the fast CPU, while the second-level cache can be large enough to capture many accesses that would otherwise go to main memory, thus reducing the miss penalty.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Average memory access time for a multilevel cache (subscripts L1 and L2 refer to the first- and second-level caches, respectively):

Average memory access time = hit time_L1 + miss rate_L1 * miss penalty_L1
Miss penalty_L1 = hit time_L2 + miss rate_L2 * miss penalty_L2

So:
Average memory access time = hit time_L1 + miss rate_L1 * (hit time_L2 + miss rate_L2 * miss penalty_L2)

Computer Architecture CSC 520:

Computer Architecture CSC 520 Local miss rate: the number of misses in a cache divided by the total number of memory accesses to this cache. Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the CPU. The global miss rate of the first-level cache is simply its local miss rate. The global miss rate of the second-level cache is the local miss rate of the level-one cache times the local miss rate of the level-two cache. The local miss rate of the second-level cache is large because the first-level cache skims the cream of the memory accesses; this is why the global rate is the more accurate measure. The global rate indicates what fraction of the memory accesses that leave the CPU go all the way to memory.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Example problem: multilevel caches. Assume that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache. What are the various miss rates? Assume the miss penalty from the L2 cache to memory is 100 clock cycles, the hit time of the L2 cache is 10 clock cycles, the hit time of the L1 cache is 1 clock cycle, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.

The miss rate (both local and global) for the L1 cache is 40/1000, or 4%. The local miss rate for the L2 cache is 20/40, or 50%. The global miss rate for the L2 cache is 20/1000, or 2% (equivalently, 50% * 4% = 2% using local miss rates). [continued]

Computer Architecture CSC 520:

Computer Architecture CSC 520 Average memory access time:
= hit time_L1 + miss rate_L1 * (hit time_L2 + miss rate_L2 * miss penalty_L2)
= 1 + 4% * (10 + 50% * 100) = 1 + 4% * 60 = 3.4 clock cycles

To express misses per instruction: 1000 memory references at 1.5 references per instruction is 1000 / 1.5 = 667 instructions, so to convert misses per 1000 memory references to misses per 1000 instructions, multiply by 1.5 (or divide by 0.667). We therefore have 60 L1 misses and 30 L2 misses per 1000 instructions.

Average memory stalls per instruction
= misses per instruction_L1 * hit time_L2 + misses per instruction_L2 * miss penalty_L2
= (60/1000) * 10 + (30/1000) * 100 = 0.6 + 3.0 = 3.6 clock cycles

Alternatively, subtract the L1 hit time from the average memory access time and multiply by the average memory references per instruction: (3.4 - 1.0) * 1.5 = 3.6 clock cycles.
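A short C sketch reproducing this example end to end (the numbers are exactly the ones assumed above):

#include <stdio.h>

/* Multilevel cache example: 40 L1 misses and 20 L2 misses per 1000 memory
   references, L1 hit = 1 cycle, L2 hit = 10 cycles, L2 miss penalty = 100
   cycles, 1.5 memory references per instruction. */
int main(void) {
    double mr_l1 = 40.0 / 1000.0;      /* L1 local (= global) miss rate */
    double mr_l2_local = 20.0 / 40.0;  /* L2 local miss rate            */
    double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0;
    double refs_per_inst = 1.5;

    double amat = hit_l1 + mr_l1 * (hit_l2 + mr_l2_local * penalty_l2); /* 3.4 */
    double stalls_per_inst = (amat - hit_l1) * refs_per_inst;           /* 3.6 */

    printf("average memory access time: %.1f cycles\n", amat);
    printf("memory stalls per instruction: %.1f cycles\n", stalls_per_inst);
    return 0;
}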

Computer Architecture CSC 520:

Computer Architecture CSC 520 What is the effect of L2 associativity on the miss penalty? Assume: the hit time for a direct-mapped L2 is 10 clock cycles; two-way set associativity increases the hit time by 0.1 clock cycles, to 10.1; the local miss rate for the direct-mapped L2 is 25%; the local miss rate for the set associative L2 is 20%; and the L2 miss penalty is 100 clock cycles.

For a direct-mapped L2 cache, the first-level cache miss penalty is 10 + 25% * 100 = 35 clock cycles.
For the two-way L2 cache, it is 10.1 + 20% * 100 = 30.1 clock cycles.

In reality we could not have fractional clock cycles, so we either shave the two-way hit time down to 10 clock cycles or round it up to 11; either way, the two-way performance is better.

Computer Architecture CSC 520:

Computer Architecture CSC 520 What is in the second-level cache? It depends. If the L2 cache is much larger than the L1 cache, then everything in the L1 cache is also in the L2 cache: multilevel inclusion. If the L2 is only slightly larger than the L1, it would be wasteful to keep a copy of the L1 contents in the L2, so none of the L1 blocks should be in L2: multilevel exclusion.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Second miss penalty reduction technique: critical word first and early restart. The first technique, a second level of cache, requires additional hardware; this one does not. Generally, when we access memory, we need a single word. This technique involves sending the requested word to the CPU without waiting for the entire block to be loaded, and restarting the CPU. There are two specific strategies:
Critical word first: request the missing word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the remainder of the block is loaded.
Early restart: fetch the words in normal order, but as soon as the requested word arrives, send it to the CPU and let the CPU continue execution.

Computer Architecture CSC 520:

Computer Architecture CSC 520 An example of the benefits of critical word first. Assume a computer has a 64-byte cache block and an L2 cache that takes 11 clock cycles to return the critical 8 bytes and then 2 clock cycles per 8 bytes to fetch the rest of the block (similar to the AMD Athlon). Calculate the average miss penalty for critical word first, assuming there will be no other accesses to the rest of the block until it is completely fetched. Then recalculate assuming the following instructions read data sequentially, 8 bytes at a time, from the rest of the block. Compare the times with and without critical word first.

The average miss penalty is 11 clock cycles for critical word first. The Athlon can issue two loads per clock cycle, which is faster than the L2 cache can supply data, so with critical word first it would take 11 + (8 - 1) * 2 = 25 clock cycles for the CPU to sequentially read a full cache block. Without critical word first, it would take 25 clock cycles to load the block, and then 8/2 = 4 clocks to issue the loads, giving 29 clock cycles total.
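A C sketch of the timing arithmetic; the 11-cycle and 2-cycle figures are the example's assumptions, not general Athlon parameters.

#include <stdio.h>

/* Critical-word-first timing example: 64-byte block, 11 cycles for the
   critical 8 bytes, then 2 cycles per additional 8-byte chunk. */
int main(void) {
    int chunks = 64 / 8;  /* 8-byte chunks per block            */
    int first  = 11;      /* cycles to get the critical chunk   */
    int rest   = 2;       /* cycles per remaining chunk         */

    int cwf_single     = first;                        /* 11: only the critical word is needed */
    int cwf_sequential = first + (chunks - 1) * rest;  /* 25: CPU streams the whole block      */
    int no_cwf         = first + (chunks - 1) * rest   /* 25 to load the block, then           */
                       + chunks / 2;                   /* 4 cycles to issue 8 loads, 2 per clock */

    printf("critical word first, single word: %d cycles\n", cwf_single);
    printf("critical word first, sequential:  %d cycles\n", cwf_sequential);
    printf("without critical word first:      %d cycles\n", no_cwf);
    return 0;
}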

Computer Architecture CSC 520:

Computer Architecture CSC 520 Third miss penalty reduction technique: give priority to read misses over writes. We serve reads before writes have been completed. Proper write buffer sizing is critical, and write buffers add complications: they might hold an updated value of a location needed on a read miss.

Example:
sw R3, 512(R0)
lw R1, 1024(R0)
lw R2, 512(R0)

If the write is delayed in the buffer and the read miss goes to memory first, the second load will get the wrong (stale) value.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Fourth miss penalty reduction technique: merging write buffer. This also involves write buffers, this time improving their efficiency. Check whether data already in the write buffer waiting to be written matches the address of the new data to be written; if so, merge the new data into the existing entry. Efficiency is improved because merging writes allows multiword writes, which are usually more efficient than single-word writes.

[Figure: without merging, four writes to addresses 100, 108, 116, and 124 occupy four separate buffer entries, each with only one valid word; with merging, all four words share a single buffer entry with all four valid bits set.]
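A minimal sketch in C of the merge check a merging write buffer performs; the entry count, block size, and field names are illustrative assumptions, not a real design.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

/* Assumed geometry: 4 entries, each holding a 4-word (32-byte) aligned block
   with per-word valid bits. */
#define ENTRIES    4
#define WORDS      4
#define WORD_BYTES 8

typedef struct {
    bool     used;
    uint64_t block_addr;    /* address of the start of the block */
    bool     valid[WORDS];  /* which words in the entry hold data */
    uint64_t data[WORDS];
} WBEntry;

static WBEntry buf[ENTRIES];

/* Insert a one-word write, merging into an existing entry when the write
   falls inside a block already waiting in the buffer. */
bool wb_write(uint64_t addr, uint64_t value) {
    uint64_t block = addr - addr % (WORDS * WORD_BYTES);
    int word = (int)((addr - block) / WORD_BYTES);

    for (int i = 0; i < ENTRIES; i++)   /* try to merge first */
        if (buf[i].used && buf[i].block_addr == block) {
            buf[i].valid[word] = true;
            buf[i].data[word]  = value;
            return true;
        }
    for (int i = 0; i < ENTRIES; i++)   /* otherwise allocate a free entry */
        if (!buf[i].used) {
            memset(&buf[i], 0, sizeof buf[i]);
            buf[i].used = true;
            buf[i].block_addr = block;
            buf[i].valid[word] = true;
            buf[i].data[word]  = value;
            return true;
        }
    return false;                       /* buffer full: the CPU would stall */
}

int main(void) {
    /* The four writes from the slide's figure (100, 108, 116, 124) fall in
       one 32-byte block and merge into a single entry. */
    for (uint64_t a = 100; a <= 124; a += 8)
        wb_write(a, a);
    printf("entry 0 used: %d, block addr: %llu\n",
           buf[0].used, (unsigned long long)buf[0].block_addr);
    return 0;
}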

Computer Architecture CSC 520:

Computer Architecture CSC 520 Fifth miss penalty reduction technique: victim caches. Remember what was discarded, in case it is needed again. This requires a small, fully associative cache between a cache and its refill path. The AMD Athlon has a victim cache with 8 entries. For example, a four-entry victim cache might reduce misses by 25% in a 4 KB direct-mapped cache.

[Figure: a victim cache sits between the main cache (tag and data arrays) and lower-level memory, alongside the write buffer, and is checked on a miss before going to the next level.]

Computer Architecture CSC 520:

Computer Architecture CSC 520 Reducing the Miss Rate

Computer Architecture CSC 520:

Computer Architecture CSC 520 Reducing the miss rate. The classical approach to improving cache behavior is to reduce the miss rate. There are three categories of cache misses:
Compulsory: the very first access to a block can never hit in the cache.
Capacity: the cache cannot contain all the blocks needed during execution, so misses occur because blocks are swapped in and out.
Conflict: if the block placement strategy is set associative or direct mapped, conflict misses occur because a block may be discarded and later retrieved when too many blocks map to its set. The idea here is that hits in a fully associative cache that become misses in an N-way set-associative cache are due to more than N requests mapping to some popular sets.

Computer Architecture CSC 520:

Computer Architecture CSC 520 First miss rate reduction technique: larger block size. This reduces compulsory misses by taking advantage of the principle of locality. However, larger block sizes increase the miss penalty, and they may increase the number of conflict misses and even capacity misses, because for a given cache size, larger blocks mean fewer blocks in the cache. At some point, increasing the block size actually increases the miss rate, and the increased miss penalty can outweigh the decrease in miss rate, so one must look at average memory access time.

Computer Architecture CSC 520:

Computer Architecture CSC 520 From Figure 5.17 in the text (miss rate versus block size, for various cache sizes):

Block size (bytes)   4 KB     16 KB    64 KB    256 KB
16                   8.57%    3.94%    2.04%    1.09%
32                   7.24%    2.87%    1.35%    0.70%
64                   7.00%    2.64%    1.06%    0.51%
128                  7.78%    2.77%    1.02%    0.49%
256                  9.51%    3.29%    1.15%    0.49%

Computer Architecture CSC 520:

Computer Architecture CSC 520 Second miss rate reduction technique: larger caches. The benefit is obvious; the disadvantages are a longer hit time and greater cost.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Average memory access time (clock cycles) versus block size:

Block size (bytes)   Miss penalty   4 KB      16 KB     64 KB     256 KB
16                   82             8.027     4.231     2.673     1.894
32                   84             7.082     3.411     2.134     1.588
64                   88             7.160     3.323     1.933     1.449
128                  96             8.469     3.659     1.979     1.470
256                  112            11.651    4.685     2.288     1.549
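This table follows from average memory access time = hit time + miss rate * miss penalty with a 1-cycle hit time (the hit time is an inferred assumption, not stated on the slide); a small C sketch that recomputes it from the Figure 5.17 miss rates:

#include <stdio.h>

/* Recomputes the AMAT table: 1-cycle hit time (assumed), block-size-dependent
   miss penalties, and the miss rates from Figure 5.17. */
int main(void) {
    int    block[5]   = {16, 32, 64, 128, 256};
    double penalty[5] = {82, 84, 88, 96, 112};
    double miss[5][4] = {                      /* 4 KB, 16 KB, 64 KB, 256 KB */
        {0.0857, 0.0394, 0.0204, 0.0109},
        {0.0724, 0.0287, 0.0135, 0.0070},
        {0.0700, 0.0264, 0.0106, 0.0051},
        {0.0778, 0.0277, 0.0102, 0.0049},
        {0.0951, 0.0329, 0.0115, 0.0049},
    };

    printf("block   4KB     16KB    64KB    256KB\n");
    for (int i = 0; i < 5; i++) {
        printf("%5d", block[i]);
        for (int j = 0; j < 4; j++)
            printf("  %6.3f", 1.0 + miss[i][j] * penalty[i]);  /* AMAT = hit + mr * penalty */
        printf("\n");
    }
    return 0;
}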

Computer Architecture CSC 520:

Computer Architecture CSC 520 Third miss rate reduction technique: higher associativity. Eight-way set associative is, for all practical purposes, as effective as fully associative in reducing cache misses. The 2:1 cache rule of thumb: a direct-mapped cache of size N has about the same miss rate as a two-way set associative cache of size N/2. This holds for cache sizes less than 128 KB. Greater associativity increases the hit time because of the multiplexing involved. Faster processor clock cycle times encourage simple cache designs, but the increasing miss penalty rewards associativity.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Example: assume higher associativity would increase the clock cycle time as listed below, relative to the clock cycle time of a one-way (direct-mapped) cache:
Clock cycle time: 2-way = 1.36, 4-way = 1.44, 8-way = 1.52
Assume the hit time is 1 clock cycle, that the miss penalty for the direct-mapped cache is 25 cycles to an L2 cache that never misses, and that the miss penalty need not be rounded to an integral number of clock cycles. Using Figure 5.14 for miss rates, for which cache sizes are the following statements true?
Average memory access time: 8-way < 4-way, 4-way < 2-way, 2-way < direct (1-way)
The average memory access time for each associativity is, for 8-way:
= hit time(8-way) + miss rate(8-way) * miss penalty = 1.52 + miss rate(8-way) * 25
and similarly for the others, with stretched hit times of 1.44 (4-way), 1.36 (2-way), and 1.00 (1-way). The results are in Figure 5.19, shown on the next slide.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Average memory access time using the miss rates in Figure 5.14:

Cache size (KB)   Direct (1-way)   Two-way   Four-way   Eight-way
4                 3.44             3.25      3.22       3.28
8                 2.69             2.58      2.55       2.62
16                2.23             2.40      2.46       2.53
32                2.06             2.30      2.37       2.45
64                1.92             2.14      2.18       2.25
128               1.52             1.84      1.92       2.00
256               1.32             1.66      1.74       1.82
512               1.20             1.55      1.59       1.66
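A sketch of how each row of this table is produced; the miss rates below are placeholders roughly matching the 4 KB row (Figure 5.14 itself is not reproduced in this transcript), so only the formula, not the exact numbers, should be taken as authoritative.

#include <stdio.h>

/* AMAT for each associativity: the 1-cycle hit time is stretched by the
   clock factor, and the miss penalty is a fixed 25 cycles. Miss rates are
   placeholder values approximating the 4 KB row. */
int main(void) {
    const char *name[4]    = {"1-way", "2-way", "4-way", "8-way"};
    double clock_factor[4] = {1.00, 1.36, 1.44, 1.52};
    double miss_rate[4]    = {0.098, 0.076, 0.071, 0.070};  /* placeholders */
    double penalty = 25.0;

    for (int i = 0; i < 4; i++)
        printf("%s: AMAT = %.2f cycles\n",
               name[i], clock_factor[i] * 1.0 + miss_rate[i] * penalty);
    return 0;
}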

Computer Architecture CSC 520:

Computer Architecture CSC 520 Fourth miss rate reduction technique: way prediction and pseudo-associative caches. Way prediction uses extra bits to predict the way (block within the set) of the next cache access. The multiplexer is set early, so only one tag must be checked; a misprediction results in checking the other blocks for matches in subsequent cycles. In the Alpha 21264: a correct prediction costs 1 cycle, an incorrect prediction 3 cycles, and prediction accuracy is about 85%. Pseudo-associative caches create pseudo-sets: the most common approach is to invert the most significant bit of the index and check there for the desired block. This means pseudo-associative caches have two hit times: a fast hit when the block is where it should be (a regular hit), and a slower time when it is a "pseudo-hit".

Computer Architecture CSC 520:

Computer Architecture CSC 520 Fifth miss rate reduction technique: compiler optimizations; no hardware required. One common technique is loop interchange, used to enhance spatial locality, recognizing that most compilers store arrays in row-major order.

Example (before):
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

(after):
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

Computer Architecture CSC 520:

Computer Architecture CSC 520 Reducing Cache Miss Penalty or Miss Rate via Parallelism

Computer Architecture CSC 520:

Computer Architecture CSC 520 First technique: non-blocking caches to reduce stalls on cache misses. This works only on computers that allow out-of-order completion: the CPU need not stall on a cache miss. For example, the CPU could continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data. A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss. This option significantly increases the complexity of the cache controller.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Second technique: hardware prefetching of instructions and data. Instructions and data can be prefetched into the caches or into an external buffer that is faster than main memory. Typically, instruction prefetch is done in hardware outside of the cache: the processor fetches two blocks on a miss, the requested block and the next consecutive block. The requested block is placed in the instruction cache and the prefetched block is placed in the instruction stream buffer. If a requested instruction is found in the prefetch buffer, the cache request is cancelled, the block is read from the buffer, and the next prefetch request is issued. It was found that a single prefetch buffer could catch 15-25% of the misses in a 4 KB direct-mapped cache.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Third technique: compiler-controlled prefetching. Instead of using hardware, have the compiler insert prefetch instructions to request data before they are needed. There are two types: register prefetch, which loads the value into a register, and cache prefetch, which loads the data only into the cache, not into a register. This technique only makes sense if the processor can proceed while the prefetch operations are underway.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Reducing the Hit Time

Computer Architecture CSC 520:

Computer Architecture CSC 520 First technique to reduce hit time: small and simple caches. Using the index portion of the address to read the tag memory takes time, and smaller hardware is faster. Keep the cache small enough that it fits on the same chip as the processor, so we do not have to go off chip. Direct mapped is faster than set associative. Figure 5.24 in the text shows access times for caches of various sizes and organizations.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Second technique: avoid address translation during indexing of the cache. Use virtual addresses in the cache rather than physical addresses, so that virtual addresses do not have to be translated into physical addresses before the cache is checked. Virtual caches are not popular because of technical problems they introduce, which we will not discuss here.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Third technique: pipelined cache access. This allows one to increase the clock speed. The Pentium took one clock cycle to access the cache, the Pentium III took two cycles, and the Pentium 4 takes four clocks.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Fourth technique: trace caches. We try to identify sequences of instructions as they execute dynamically, rather than as they simply appear statically in the program; these sequences include predicted branch-taken paths. Address mapping is more complicated, and instructions within sequences may be stored multiple times in the cache.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Main Memory and Organizations for Improving Performance

Computer Architecture CSC 520:

Computer Architecture CSC 520 Main memory is the next level down from the cache(s). The latency of main memory directly affects the cache, because this latency determines the cache miss penalty; caches therefore want low memory latency. However, it is generally easier to improve memory bandwidth with new memory organizations than to reduce memory latency directly.

Example used to illustrate these organizations. Assume the basic memory organization performance is: 4 clocks to send the address, 56 clocks to access each word, and 4 clocks to send a word of data. If we have a cache block of four words and a word is 8 bytes, the miss penalty is 4 * (4 + 56 + 4) = 256 clock cycles, giving a memory bandwidth of 1/8 byte per clock (32 bytes in 256 clocks).

Computer Architecture CSC 520:

Computer Architecture CSC 520 [Figure: the basic one-word-wide memory organization; CPU, cache, bus, and memory, all one word wide.]

Computer Architecture CSC 520:

Computer Architecture CSC 520 First technique for higher memory bandwidth: wider main memory.

[Figure: CPU, a one-word-wide first-level cache, a multiplexor, a four-word-wide second-level cache, a wide bus, and wide main memory.]

First-level caches are organized one word wide because most CPU accesses are that size. In the example shown, the first-level cache remains one word wide, but the second-level cache is four words wide, and the multiplexing takes place between those caches. At this width the miss penalty is just 4 + 56 + 4 = 64 clocks, and the memory bandwidth is 1/2 byte per clock rather than the 1/8 byte per clock of the single-word organization. The cost is a wider memory bus.

Computer Architecture CSC 520:

Computer Architecture CSC 520 Second technique: interleaved memory.

[Figure: CPU, cache, and bus remain one word wide, but memory is split into four banks (Bank 0 through Bank 3).]

Memory is organized to read or write multiple words at a time. Banks are often one word wide, so the width of the bus and cache need not change; but rather than sending the read/write signal to a single bank, it is sent to multiple banks, causing each bank to be accessed in parallel. The miss penalty is 4 + 56 + (4 * 4) = 76 clocks, giving a bandwidth of about 0.4 bytes per clock. Sequential word addresses are distributed modulo 4 across the four banks. Why 4 * 4? Because it takes four clocks to actually transfer each word from a bank after it has been accessed.
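A short C sketch comparing the three organizations' miss penalties and bandwidths under the example's timing assumptions:

#include <stdio.h>

/* Three memory organizations from the slides: 4 clocks to send the address,
   56 clocks per word access, 4 clocks to transfer a word, 4-word (32-byte)
   cache blocks. */
int main(void) {
    double addr = 4, access = 56, xfer = 4;
    double words = 4, block_bytes = 32;

    double simple      = words * (addr + access + xfer);  /* 256 clocks */
    double wide        = addr + access + xfer;             /*  64 clocks (4-word-wide path) */
    double interleaved = addr + access + words * xfer;     /*  76 clocks (4 banks) */

    printf("one-word-wide: %3.0f clocks, %.3f bytes/clock\n", simple,      block_bytes / simple);
    printf("wide memory:   %3.0f clocks, %.3f bytes/clock\n", wide,        block_bytes / wide);
    printf("interleaved:   %3.0f clocks, %.3f bytes/clock\n", interleaved, block_bytes / interleaved);
    return 0;
}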

Computer Architecture CSC 520:

Computer Architecture CSC 520 Memory Technology

Computer Architecture CSC 520:

Computer Architecture CSC 520 DRAM: a single transistor per bit; must be refreshed, and the refresh technology is built into the chip. Designers generally try to keep the time spent refreshing below 5% of the total time.
SRAM (static RAM): 8 to 16 times faster than DRAM, but also 8 to 16 times more expensive.
Flash memory: non-volatile but writable and rewritable, working much like EEPROM. Read time is comparable to DRAM, but write time is 10 to 100 times slower.
Newer DRAM technology, double data rate DRAM, essentially doubles the data rate of conventional DRAM: as of the writing of this text, from a maximum of 1200 MB/sec to 2400 MB/sec.

