logging in or signing up GSRC 12 10 2006 Eisley Hannah Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 160 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 29, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript In-Network Cache Coherence: In-Network Cache Coherence Noel Eisley*, Li-Shiuan Peh*, Li Shang** *Department of Electrical Engineering Princeton University {eisley, peh}@princeton.edu **Department of Electrical and Computer Engineering Queen’s University li.shang@queensu.caChip-Multiprocessors (CMPs): Chip-Multiprocessors (CMPs) CMPs reduce complexity and offer scalability Higher frequencies and more transistors higher power consumption Higher frequencies many-cycle chip traversal CMPs offer a large amount of distributed on-chip storage On-chip data access is much faster than off-chip (~10-100x) An important challenge is to manage the distributed on-chip data using a scalable approach Then Now Future… Intel 80-core Background: Interconnection Networks: Background: Interconnection Networks Arbiter Link Traversal Crossbar Traversal Virtual Channel Arbitration Physical Channel Arbitration Routing Router Pipeline Ejection Injection South North East West I/O buffers On-Chip Data Management -- Cache coherence is the first step: On-Chip Data Management -- Cache coherence is the first step Finding valid data Does it exist? Where is it? Optimal: go directly to the nearest sharer Updating/invalidating data Avoid unnecessary traffic (i.e. broadcast) Optimal: invalidate sharers directly A A A A A A A ACurrent: Directory-Based Protocol: Current: Directory-Based Protocol Home node (H) keeps a list (or partial list) of sharers Read: Round trip overhead Reads don’t take advantage of locality Write: Up to two round trip overhead Invalidations inefficient H Our Approach: Virtual Network: Our Approach: Virtual Network Network is not just for getting from A to B It can keep track of sharers and maintain coherence How? With trees stored within routers Tree connecting home node with sharers Tree stored within routers On the way to home node, when requests “bump” into trees, routers will re-route them towards sharers. H Bump into tree Add to tree Outline: Outline Motivation and introduction In-network cache coherence protocol Methodology and results ConclusionsReads Locate Data Efficiently: H Reads Locate Data Efficiently Legend H: Home node Sharer node Tree node (no data) Read request/reply Write request/reply Teardown request Acknowledgement Read Request injected into the network Tree constructed as read reply is sent back New read injected into the network Request is redirected by the network Data is obtained from sharer and returned Writes Invalidate Data Efficiently: Writes Invalidate Data Efficiently H Write Request injected into the network In-transit invalidations Acknowledgements spawned at leaf nodes Wait for acknowledgements Send out write reply Legend H: Home node Sharer node Tree node (no data) Read request/reply Write request/reply Teardown request Acknowledgement Virtual Trees - Structure: Virtual Trees - Structure Node types Home: same as for directory protocol Root: first sharer Internal: all others Characteristics: Directional link always points towards the root node No loops No disjoint trees H R Router Microarchitecture: Router Microarchitecture Tree Cache Link Traversal Crossbar Traversal Virtual Channel Arbitration Physical Channel Arbitration Tree Cache Routing Router Pipeline Tag Data Entry HRouter Microrchitecture: Virtual Tree Cache: Router Microrchitecture: Virtual Tree Cache Each cache line in the virtual tree cache contains 2 bits (To, From root node) for each direction (NSEW) 1 bit if the local node contains valid data 1 busy bit 1 Outstanding request bit Tag Data Entry T T T T F F F F N S E W D B R Virtual Tree Cache Outline: Outline Motivation and introduction In-network cache coherence protocol Methodology and results ConclusionsMethodology: Methodology Benchmarks: SPLASH-2 Multithreaded simulator: Bochs running 16 or 64-way Network and virtual tree simulator Trace-driven Full-system simulation becomes prohibitive with many cores Motivation: explore scalability Cycle-accurate Each message is modeled Models network contention (arbitration) Calculates average request completion times Bochs Network Simulator Application Memory traces Memory access latency performanceTree Cache Configuration: Tree Cache Configuration Good configuration arrived at through detailed design-space analysis Cache access parameters derived from Cacti Tree Cache Configuration Simulation Timing SummaryPerformance Scalability: Performance Scalability Compare in-network virtual tree protocol to standard directory protocol Good improvement 4x4 Mesh: Avg. 35.5% read latency reduction, 41.2% write latency reduction Scalable 8x8 Mesh: Avg. 35% read and 48% write latency reduction 4x4 Mesh 8x8 MeshStorage Scalability: Storage Scalability Directory protocol: O(# Nodes) Virtual tree protocol: O(# Ports) ~O(1) Storage overhead 4x4 mesh: 56% more storage bits 8x8 mesh: 58% fewer storage bits 16 64 # of Nodes Tree Cache Line Directory Cache Line 28 bits 28 bits 18 bits 66 bitsRelated Work: Related Work Mizrahi et al. [1989] Goodman and Woest [1988] LimitLess (Chaiken, et al.) [1991] SCI (Gjessing, et al.) [1992] GLOW (Kaxiras and Goodman) [1996] Chang, et al. [1999] Iyer, et al. [2000] Cheng, et al. [2006]Conclusions: Conclusions Want to address the scalability of on-chip data management (cache coherence) for CMPs Embedding cache coherence within the network fabric allows the network to optimize memory traffic ~35% and ~41% read and write latency reduction for 4x4 meshes ~35% and ~48% reduction for 8x8 meshes Storage overhead scalability Overhead when compared to directory protocol for a 4x4 mesh Reduction for an 8x8 mesh Slide20: Thank you!Results (cont.): Results (cont.) Effect of embedding the connectivity (i.e. tree cache) within the network 31% Avg. read latency reduction 49.1% Avg. write latency reduction Network Template: Network Template H Legend H: Home node Sharer node Tree node(no data) Read request/reply Write request/reply Teardown AcknowledgementTough Questions: Tough Questions Wouldn’t it be faster to use look-ahead routing (e.g. 2-cycle hop) and accept more hops? Scalability: Scalability Performance From previous slide, avg. read and write latencies remain constant or decrease (relative to directory) from 4x48x8 mesh Storage overhead 4x4 mesh: 56% more storage bits 8x8 mesh: 58% fewer storage bits 4x4: 8x8: Tree Full-Map Lim. Dir. Motivation: Coherence-Aware Networks: Motivation: Coherence-Aware Networks Network is not just for getting from A to B Storing more information allows for in-transit optimizations Give routers the ability to source and sink coherence messages Virtual Tree - Protocol: Virtual Tree - Protocol On capacity miss Tear down the tree and begin a new one On write request Tear down the tree and begin a new one Proactive evictions of tree cache Even at nodes other than the home node Tree must be torn down completely before a new one can be created A B C D H You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
GSRC 12 10 2006 Eisley Hannah Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 160 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 29, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript In-Network Cache Coherence: In-Network Cache Coherence Noel Eisley*, Li-Shiuan Peh*, Li Shang** *Department of Electrical Engineering Princeton University {eisley, peh}@princeton.edu **Department of Electrical and Computer Engineering Queen’s University li.shang@queensu.caChip-Multiprocessors (CMPs): Chip-Multiprocessors (CMPs) CMPs reduce complexity and offer scalability Higher frequencies and more transistors higher power consumption Higher frequencies many-cycle chip traversal CMPs offer a large amount of distributed on-chip storage On-chip data access is much faster than off-chip (~10-100x) An important challenge is to manage the distributed on-chip data using a scalable approach Then Now Future… Intel 80-core Background: Interconnection Networks: Background: Interconnection Networks Arbiter Link Traversal Crossbar Traversal Virtual Channel Arbitration Physical Channel Arbitration Routing Router Pipeline Ejection Injection South North East West I/O buffers On-Chip Data Management -- Cache coherence is the first step: On-Chip Data Management -- Cache coherence is the first step Finding valid data Does it exist? Where is it? Optimal: go directly to the nearest sharer Updating/invalidating data Avoid unnecessary traffic (i.e. broadcast) Optimal: invalidate sharers directly A A A A A A A ACurrent: Directory-Based Protocol: Current: Directory-Based Protocol Home node (H) keeps a list (or partial list) of sharers Read: Round trip overhead Reads don’t take advantage of locality Write: Up to two round trip overhead Invalidations inefficient H Our Approach: Virtual Network: Our Approach: Virtual Network Network is not just for getting from A to B It can keep track of sharers and maintain coherence How? With trees stored within routers Tree connecting home node with sharers Tree stored within routers On the way to home node, when requests “bump” into trees, routers will re-route them towards sharers. H Bump into tree Add to tree Outline: Outline Motivation and introduction In-network cache coherence protocol Methodology and results ConclusionsReads Locate Data Efficiently: H Reads Locate Data Efficiently Legend H: Home node Sharer node Tree node (no data) Read request/reply Write request/reply Teardown request Acknowledgement Read Request injected into the network Tree constructed as read reply is sent back New read injected into the network Request is redirected by the network Data is obtained from sharer and returned Writes Invalidate Data Efficiently: Writes Invalidate Data Efficiently H Write Request injected into the network In-transit invalidations Acknowledgements spawned at leaf nodes Wait for acknowledgements Send out write reply Legend H: Home node Sharer node Tree node (no data) Read request/reply Write request/reply Teardown request Acknowledgement Virtual Trees - Structure: Virtual Trees - Structure Node types Home: same as for directory protocol Root: first sharer Internal: all others Characteristics: Directional link always points towards the root node No loops No disjoint trees H R Router Microarchitecture: Router Microarchitecture Tree Cache Link Traversal Crossbar Traversal Virtual Channel Arbitration Physical Channel Arbitration Tree Cache Routing Router Pipeline Tag Data Entry HRouter Microrchitecture: Virtual Tree Cache: Router Microrchitecture: Virtual Tree Cache Each cache line in the virtual tree cache contains 2 bits (To, From root node) for each direction (NSEW) 1 bit if the local node contains valid data 1 busy bit 1 Outstanding request bit Tag Data Entry T T T T F F F F N S E W D B R Virtual Tree Cache Outline: Outline Motivation and introduction In-network cache coherence protocol Methodology and results ConclusionsMethodology: Methodology Benchmarks: SPLASH-2 Multithreaded simulator: Bochs running 16 or 64-way Network and virtual tree simulator Trace-driven Full-system simulation becomes prohibitive with many cores Motivation: explore scalability Cycle-accurate Each message is modeled Models network contention (arbitration) Calculates average request completion times Bochs Network Simulator Application Memory traces Memory access latency performanceTree Cache Configuration: Tree Cache Configuration Good configuration arrived at through detailed design-space analysis Cache access parameters derived from Cacti Tree Cache Configuration Simulation Timing SummaryPerformance Scalability: Performance Scalability Compare in-network virtual tree protocol to standard directory protocol Good improvement 4x4 Mesh: Avg. 35.5% read latency reduction, 41.2% write latency reduction Scalable 8x8 Mesh: Avg. 35% read and 48% write latency reduction 4x4 Mesh 8x8 MeshStorage Scalability: Storage Scalability Directory protocol: O(# Nodes) Virtual tree protocol: O(# Ports) ~O(1) Storage overhead 4x4 mesh: 56% more storage bits 8x8 mesh: 58% fewer storage bits 16 64 # of Nodes Tree Cache Line Directory Cache Line 28 bits 28 bits 18 bits 66 bitsRelated Work: Related Work Mizrahi et al. [1989] Goodman and Woest [1988] LimitLess (Chaiken, et al.) [1991] SCI (Gjessing, et al.) [1992] GLOW (Kaxiras and Goodman) [1996] Chang, et al. [1999] Iyer, et al. [2000] Cheng, et al. [2006]Conclusions: Conclusions Want to address the scalability of on-chip data management (cache coherence) for CMPs Embedding cache coherence within the network fabric allows the network to optimize memory traffic ~35% and ~41% read and write latency reduction for 4x4 meshes ~35% and ~48% reduction for 8x8 meshes Storage overhead scalability Overhead when compared to directory protocol for a 4x4 mesh Reduction for an 8x8 mesh Slide20: Thank you!Results (cont.): Results (cont.) Effect of embedding the connectivity (i.e. tree cache) within the network 31% Avg. read latency reduction 49.1% Avg. write latency reduction Network Template: Network Template H Legend H: Home node Sharer node Tree node(no data) Read request/reply Write request/reply Teardown AcknowledgementTough Questions: Tough Questions Wouldn’t it be faster to use look-ahead routing (e.g. 2-cycle hop) and accept more hops? Scalability: Scalability Performance From previous slide, avg. read and write latencies remain constant or decrease (relative to directory) from 4x48x8 mesh Storage overhead 4x4 mesh: 56% more storage bits 8x8 mesh: 58% fewer storage bits 4x4: 8x8: Tree Full-Map Lim. Dir. Motivation: Coherence-Aware Networks: Motivation: Coherence-Aware Networks Network is not just for getting from A to B Storing more information allows for in-transit optimizations Give routers the ability to source and sink coherence messages Virtual Tree - Protocol: Virtual Tree - Protocol On capacity miss Tear down the tree and begin a new one On write request Tear down the tree and begin a new one Proactive evictions of tree cache Even at nodes other than the home node Tree must be torn down completely before a new one can be created A B C D H