A Highly Configurable Cache Architecture for Embedded Systems
Chuanjun Zhang*, Frank Vahid**, and Walid Najjar
*Dept. of Electrical Engineering
Dept. of Computer Science and Engineering
University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported by the National Science Foundation and the Semiconductor Research Corporation.

Outline
Why a Configurable Cache? What Parameters?
Configurable Associativity by Way Concatenation
Configurable Size by Way Shutdown
Configurable Line Size
How to Configure Cache
Cache Parameter Explorer
A Heuristic Algorithm Searches the Pareto Set of Cache Parameters:
Tradeoff Between Energy Dissipation and Performance
The explorer is Synthesized Using Synopsys
Conclusions and Future Work

Why Choose Cache: Impacts Performance and Power
Performance impacts are well known
ARM920T: Caches consume 50% of total processor system power (Segars 01)
M*CORE: Unified cache consumes 50% of total processor system power (Lee/Moyer/Arends 99)
We’ll show that a configurable cache can cut that power nearly in half on average

Why a Configurable Cache?
An embedded system may execute a single application for its entire lifetime
Tuning the cache configuration (size, associativity, line size) can save a lot of energy
40% difference in memory-access energy between epic and mpeg2 from MediaBench (figure: miss rate and normalized energy vs. associativity)

Benefits of a Configurable Cache
Mass production
Unique chips are getting more expensive as technology scales down (ITRS)
Huge benefits to mass producing a single chip
Harder to produce chips distinguished by cache when we have 50-100 processors per chip
Adapt to program phases
Recent research shows programs have different cache requirements over time
Much research assumes a configurable cache

Caches Vary Greatly in Embedded Processors

Configurable Associativity by Way Concatenation
Four-way set-associative base cache
Ways can be concatenated to form a two-way cache
Further concatenation yields a direct-mapped cache
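As a rough sketch of the idea (not the authors' circuit, which does this with two small configuration registers and a few gates): the data array is unchanged across configurations, and only the number of sets, hence index bits, changes with the configured associativity. The 8 KB total size and 32 B line below are hypothetical example values.

```python
def cache_index(addr, assoc, total_size=8192, line_size=32):
    """Set index for a way-concatenated cache.

    The same physical array serves every configuration; concatenating
    ways halves the associativity and doubles the number of sets, so
    the index simply gains one bit. (Hypothetical 8 KB / 32 B example.)
    """
    num_sets = total_size // (line_size * assoc)  # 256 / 128 / 64 sets
    return (addr // line_size) % num_sets
```

For example, block number 6400 // 32 = 200 maps to set 200 when direct-mapped, set 72 when two-way, and set 8 when four-way; the extra index bits instead select among the concatenated ways.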
Concatenation is logical – only one way's array is accessed (C. Zhang, ISCA 03)

Way-Concatenate Cache Architecture
(figure: address a31–a0 split into tag (a31–a13), index (a12–a5), and line offset (a4–a0); configuration registers reg0/reg1 steer the index; critical path through the 6x64 data arrays; trivial area overhead)

Previous Method – Way Shutdown
Albonesi proposed a cache whose ways can be shut down
To save dynamic power
Motorola M*CORE has same way-shutdown feature
Unified cache – even allows setting each way as I, D, both, or off
Reduces dynamic power by accessing fewer ways
But, decreases total size, so may increase miss rate Way Shutdown Can be Good for Static Power : Chuanjun Zhang, UC Riverside 10 Way Shutdown Can be Good for Static Power Static power (leakage) increasingly important in nanoscale technologies
We combine way shutdown with way concatenation
Use the gated-Vdd (sleep transistor) method of Powell (ISLPED 2000)
When off, the gated-Vdd transistor prevents leakage, at about a 4% performance overhead

Cache Line Size
A 64 B cache line holds 64 B of consecutive code, but with non-consecutive code only 16 B may be useful and 48 B are wasted (C. Zhang, ISVLSI 03)

Configurable Cache Line Size With Line Concatenation
(figure: programmable counter, bus, one cache way, off-chip memory)
The physical line size is 16 bytes; four physical lines are filled when the configured line size is 64 bytes
A programmable counter is used to designate the line size
An interleaved off-chip memory organization

Computing Total Memory-Related Energy
Considers CPU stall energy and off-chip memory energy
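Returning to line concatenation: a minimal sketch of how the programmable counter determines which 16 B physical lines one miss fills. The function and names are illustrative, not the hardware interface.

```python
PHYS_LINE = 16  # bytes per physical cache line

def physical_fills(miss_addr, logical_line_size):
    """Addresses of the physical 16 B lines filled on one miss,
    for a configured (logical) line size of 16, 32, or 64 bytes."""
    counter = logical_line_size // PHYS_LINE            # programmable counter value
    base = miss_addr - (miss_addr % logical_line_size)  # align to logical line
    # with interleaved off-chip memory, these fills can proceed in parallel
    return [base + i * PHYS_LINE for i in range(counter)]
```

A miss at address 0x48 with a 16 B configured line fills only [0x40]; with a 64 B configured line it fills [0x40, 0x50, 0x60, 0x70].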
Excludes CPU active energy
Thus, represents all memory-related energy

energy_mem = energy_dynamic + energy_static
energy_miss = k_miss_energy * energy_hit
energy_static_per_cycle = k_static * energy_total_per_cycle
(We varied the k’s to account for different system implementations)
energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles * energy_static_per_cycle

Underlined – measured quantities:
SimpleScalar (cache_hits, cache_misses, cycles)
Our layout or data sheets (others)

Energy Savings
Energy savings when way concatenation, way shutdown, and cache line concatenation are implemented (C. Zhang, ACM TECS, to appear)

The Cache Parameters that Consume the Lowest Energy Vary Across Applications

How to Configure the Cache
Simulation-based methods
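The energy equations above combine into one expression. One consistent reading of the self-referential static term (energy_static_per_cycle = k_static * energy_total_per_cycle) gives energy_total = energy_dynamic / (1 - k_static). All numeric defaults below are illustrative placeholders, not the measured values from layout or data sheets.

```python
def memory_energy(cache_hits, cache_misses,
                  energy_hit=1.0,      # per-hit energy (placeholder units)
                  k_miss_energy=50.0,  # a miss costs k times a hit (placeholder)
                  k_static=0.3):       # static fraction of total energy (placeholder)
    """Total memory-related energy following the slides' model."""
    energy_miss = k_miss_energy * energy_hit  # off-chip access + stall + fill, lumped
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    # energy_static = cycles * k_static * energy_total_per_cycle
    # => energy_total = energy_dynamic / (1 - k_static)
    return energy_dynamic / (1.0 - k_static)
```

With these placeholder constants, 1000 hits and 10 misses give (1000 + 500) / 0.7 of dynamic-plus-static energy; the explorer evaluates exactly this kind of expression per candidate configuration.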
Seconds of real-time work may take tens of hours to simulate
Setting up the simulation tools adds further time
Self-exploring method
Cache parameter explorer
Incorporated on a prototype platform
Pareto parameters: a set of configurations that trade off performance and energy

Cache Self-Exploring Hardware
An explorer is used to detect the Pareto set of cache parameters
The explorer stands aside, alongside the processor, I-cache, D-cache, and memory, collecting the information used to calculate energy

Pareto Parameter Sets
(figure: points A–D on the energy/performance plane – lowest energy, best performance, tradeoff points, and one non-Pareto point)

Heuristic Algorithm
Search all possible cache configurations
Time consuming: with other configurable parameters (voltage levels, bus width, etc.), the search space quickly grows into the millions
A heuristic is proposed
First, search for point A
The order in which the parameters are searched matters
No cache flush is needed
Then search for point B
Last, search for points in region C
(figure: energy (mJ) vs. time, marking point A (lowest energy), point B (best performance), and region C (tradeoff))

Impact of Cache Parameters on Miss Rate and Energy
Average instruction cache miss rate and normalized energy of the benchmarks (figure; one-way and 32 B line-size configurations marked)

Energy Dissipation in On-Chip Cache and Off-Chip Memory
Benchmark: parser

Searching for Point A
Point A: the least-energy cache configuration
Search cache size, then line size, then associativity, then way prediction (figure: energy (mJ) vs. time, converging to A, the lowest energy)

Searching for Point B
Point B: the best-performance cache configuration
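The searches for points A and B can be sketched as a greedy one-parameter-at-a-time sweep. Here measure() is a stand-in for running the application under one configuration and returning an (energy, time) pair; all names and parameter values are illustrative, not the explorer's actual interface.

```python
def sweep(measure, start, order, key):
    """Greedy sweep: for each parameter in `order`, keep the value
    that minimizes `key` applied to measure()'s (energy, time) result."""
    best = dict(start)
    for param, values in order:
        best[param] = min(values, key=lambda v: key(measure({**best, param: v})))
    return best

def find_A_and_B(measure):
    sizes, lines, assocs = (2048, 4096, 8192), (16, 32, 64), (1, 2, 4)
    start = {"size": sizes[0], "line": lines[0], "assoc": assocs[0]}
    # Point A: minimize energy, sweeping cache size first
    a = sweep(measure, start,
              [("size", sizes), ("line", lines), ("assoc", assocs)],
              key=lambda r: r[0])
    # Point B: fix A's cache size, re-sweep line size and associativity
    # for the best execution time (way prediction off)
    b = sweep(measure, a,
              [("line", lines), ("assoc", assocs)],
              key=lambda r: r[1])
    return a, b
```

Each sweep costs one measurement per candidate value, which is where the O(m + n + l + p) cost comes from; region C is then covered by testing combinations of A's and B's parameter values.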
High associativity doesn’t mean high performance
A large line size may not be good for the data cache
Fix the cache size; search line size, then associativity; no way prediction (figure: energy (mJ) vs. time, from A to B, the best performance)

Searching for Point C
Cache parameters in region C represent the tradeoff between energy and performance
Choose cache parameters between points A and B.
If the cache sizes at points A and B are 8K and 4K respectively, then the points in region C are tested at both 8K and 4K
Combinations of point A’s and point B’s parameters are tested
Example from the slide: point A – 64 B line, 2K, 1-way; point B – 64 B line, 8K, 4-way; region C candidates combine these, e.g. 4K or 8K with 1-way or 2-way at a 64 B line
(figure: energy vs. time tradeoff curve through A, C, B)

FSM and Data Path of the Cache Explorer
(figure: FSM controlling a datapath of hit/miss/static energy inputs, hit and miss counts, a multiplier, adder, registers, muxes, a comparator tracking the lowest energy, a configure register, and execution time)

Implementing the Heuristic in Hardware
Total size of the explorer:
About 4,200 gates, or 0.041 mm2 in 0.18 micron CMOS technology.
Compared to the reported size of the MIPS 4Kp with cache, this represents just over a 3% area overhead.
2.69 mW at 200 MHz. The power overhead compared with the MIPS 4Kp would be less than 0.5%.
Furthermore, the exploring hardware is used only during the exploration stage, and can be shut down after the best configuration is determined

How Well Does the Heuristic Work?
Time complexity:
Exhaustive search: O(m × n × l × p)
Heuristic: O(m + n + l + p)
m: number of associativities; n: number of cache sizes
l: number of line sizes; p: way prediction on/off
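In concrete numbers, using the deck's example of 27 total configurations for point A (the counts below are illustrative):

```python
# Search cost for point A, with an example space of
# 3 cache sizes x 3 line sizes x 3 associativities = 27 configurations
# (way prediction would contribute a further factor/term p).
m, n, l = 3, 3, 3
exhaustive = m * n * l        # every combination: 27 measurements
heuristic_bound = m + n + l   # one sweep per parameter: at most 9
# sweeps reuse the current best configuration, so fewer measurements
# (about 5 on average, per the slide) suffice in practice
```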
On average, 5 searches instead of all 27 suffice to find point A
2 out of 19 benchmarks miss the lowest-energy cache configuration
With a different search order (line size, associativity, way prediction, then cache size), 11 out of 19 benchmarks miss the best configuration

Results of Some Other Benchmarks

Conclusion and Future Work
A configurable cache architecture is proposed
Configurable associativity, size, and line size
A cache parameter explorer is implemented to find the cache parameters.
A heuristic algorithm is proposed to search the Pareto cache parameter sets.
The complexity of the heuristic is O(m + n + l + p) instead of O(m × n × l × p)
The heuristic finds about 95% of the Pareto points
Little area and power overhead, and no performance overhead
Future work: dynamically detect the best cache parameters at run time