GSRC 12 10 2006 Eisley

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

In-Network Cache Coherence: 

In-Network Cache Coherence Noel Eisley*, Li-Shiuan Peh*, Li Shang** *Department of Electrical Engineering Princeton University {eisley, peh}@princeton.edu **Department of Electrical and Computer Engineering Queen’s University li.shang@queensu.ca

Chip-Multiprocessors (CMPs): 

Chip-Multiprocessors (CMPs) CMPs reduce complexity and offer scalability Higher frequencies and more transistors  higher power consumption Higher frequencies  many-cycle chip traversal CMPs offer a large amount of distributed on-chip storage On-chip data access is much faster than off-chip (~10-100x) An important challenge is to manage the distributed on-chip data using a scalable approach Then Now Future… Intel 80-core  

Background: Interconnection Networks: 

Background: Interconnection Networks Arbiter Link Traversal Crossbar Traversal Virtual Channel Arbitration Physical Channel Arbitration Routing Router Pipeline Ejection Injection South North East West I/O buffers

On-Chip Data Management -- Cache coherence is the first step: 

On-Chip Data Management -- Cache coherence is the first step Finding valid data Does it exist? Where is it? Optimal: go directly to the nearest sharer Updating/invalidating data Avoid unnecessary traffic (i.e. broadcast) Optimal: invalidate sharers directly A A A A A A A A

Current: Directory-Based Protocol: 

Current: Directory-Based Protocol Home node (H) keeps a list (or partial list) of sharers Read: Round trip overhead Reads don’t take advantage of locality Write: Up to two round trip overhead Invalidations inefficient H

Our Approach: Virtual Network: 

Our Approach: Virtual Network Network is not just for getting from A to B It can keep track of sharers and maintain coherence How? With trees stored within routers Tree connecting home node with sharers Tree stored within routers On the way to home node, when requests “bump” into trees, routers will re-route them towards sharers. H Bump into tree Add to tree

Outline: 

Outline Motivation and introduction In-network cache coherence protocol Methodology and results Conclusions

Reads Locate Data Efficiently: 

H Reads Locate Data Efficiently Legend H: Home node Sharer node Tree node (no data) Read request/reply Write request/reply Teardown request Acknowledgement Read Request injected into the network Tree constructed as read reply is sent back New read injected into the network Request is redirected by the network Data is obtained from sharer and returned

Writes Invalidate Data Efficiently: 

Writes Invalidate Data Efficiently H Write Request injected into the network In-transit invalidations Acknowledgements spawned at leaf nodes Wait for acknowledgements Send out write reply Legend H: Home node Sharer node Tree node (no data) Read request/reply Write request/reply Teardown request Acknowledgement

Virtual Trees - Structure: 

Virtual Trees - Structure Node types Home: same as for directory protocol Root: first sharer Internal: all others Characteristics: Directional link always points towards the root node No loops No disjoint trees H R

Router Microarchitecture: 

Router Microarchitecture Tree Cache Link Traversal Crossbar Traversal Virtual Channel Arbitration Physical Channel Arbitration Tree Cache Routing Router Pipeline Tag Data Entry H

Router Microrchitecture: Virtual Tree Cache: 

Router Microrchitecture: Virtual Tree Cache Each cache line in the virtual tree cache contains 2 bits (To, From root node) for each direction (NSEW) 1 bit if the local node contains valid data 1 busy bit 1 Outstanding request bit Tag Data Entry T T T T F F F F N S E W D B R Virtual Tree Cache

Outline: 

Outline Motivation and introduction In-network cache coherence protocol Methodology and results Conclusions

Methodology: 

Methodology Benchmarks: SPLASH-2 Multithreaded simulator: Bochs running 16 or 64-way Network and virtual tree simulator Trace-driven Full-system simulation becomes prohibitive with many cores Motivation: explore scalability Cycle-accurate Each message is modeled Models network contention (arbitration) Calculates average request completion times Bochs Network Simulator Application Memory traces Memory access latency performance

Tree Cache Configuration: 

Tree Cache Configuration Good configuration arrived at through detailed design-space analysis Cache access parameters derived from Cacti Tree Cache Configuration Simulation Timing Summary

Performance Scalability: 

Performance Scalability Compare in-network virtual tree protocol to standard directory protocol Good improvement 4x4 Mesh: Avg. 35.5% read latency reduction, 41.2% write latency reduction Scalable 8x8 Mesh: Avg. 35% read and 48% write latency reduction 4x4 Mesh 8x8 Mesh

Storage Scalability: 

Storage Scalability Directory protocol: O(# Nodes) Virtual tree protocol: O(# Ports) ~O(1) Storage overhead 4x4 mesh: 56% more storage bits 8x8 mesh: 58% fewer storage bits 16 64 # of Nodes Tree Cache Line Directory Cache Line 28 bits 28 bits 18 bits 66 bits

Related Work: 

Related Work Mizrahi et al. [1989] Goodman and Woest [1988] LimitLess (Chaiken, et al.) [1991] SCI (Gjessing, et al.) [1992] GLOW (Kaxiras and Goodman) [1996] Chang, et al. [1999] Iyer, et al. [2000] Cheng, et al. [2006]

Conclusions: 

Conclusions Want to address the scalability of on-chip data management (cache coherence) for CMPs Embedding cache coherence within the network fabric allows the network to optimize memory traffic ~35% and ~41% read and write latency reduction for 4x4 meshes ~35% and ~48% reduction for 8x8 meshes Storage overhead scalability Overhead when compared to directory protocol for a 4x4 mesh Reduction for an 8x8 mesh

Slide20: 

Thank you!

Results (cont.): 

Results (cont.) Effect of embedding the connectivity (i.e. tree cache) within the network 31% Avg. read latency reduction 49.1% Avg. write latency reduction

Network Template: 

Network Template H Legend H: Home node Sharer node Tree node(no data) Read request/reply Write request/reply Teardown Acknowledgement

Tough Questions: 

Tough Questions Wouldn’t it be faster to use look-ahead routing (e.g. 2-cycle hop) and accept more hops?

Scalability: 

Scalability Performance From previous slide, avg. read and write latencies remain constant or decrease (relative to directory) from 4x48x8 mesh Storage overhead 4x4 mesh: 56% more storage bits 8x8 mesh: 58% fewer storage bits 4x4: 8x8: Tree Full-Map Lim. Dir.

Motivation: Coherence-Aware Networks: 

Motivation: Coherence-Aware Networks Network is not just for getting from A to B Storing more information allows for in-transit optimizations Give routers the ability to source and sink coherence messages

Virtual Tree - Protocol: 

Virtual Tree - Protocol On capacity miss Tear down the tree and begin a new one On write request Tear down the tree and begin a new one Proactive evictions of tree cache Even at nodes other than the home node Tree must be torn down completely before a new one can be created A B C D H