Presentation Transcript (added September 18, 2007)

Emulating Massively Parallel (PetaFLOPS) Machines:
  Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
  Department of Computer Science
  Parallel Programming Laboratory
  http://charm.cs.uiuc.edu

Roadmap:
  BlueGene Architecture
  Need for an Emulator
  Charm++ BlueGene
  Converse BlueGene
  Future Work

Blue Gene: Processor-in-memory Case Study:
  Five steps to a PetaFLOPS, taken from: http://www.research.ibm.com/bluegene/
  FUNCTIONAL MODEL: 34 x 34 x 36 cube of shared memory nodes, each having 25 processors.

SMP Node:
  25 processors
  200 processing elements
  Input/Output Buffer: 32 x 128 bytes
  Network:
    Connected to six neighbors via duplex link
    16 bit @ 500 MHz = 1 Gigabyte/s
    Latencies: 5 cycles per hop, 75 cycles per turn

Processor:
  STATS:
    500 MHz
    Memory-side cache eliminates coherency problems
    10 cycles local cache
    20 cycles remote cache
    10 cycles cache miss
    8 integer units sharing 2 floating point units
  8 x 25 x ~40,000 = ~8 x 10^6 processing elements!
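The machine-size and link-bandwidth figures quoted on the slides can be checked with a few lines of arithmetic. The sketch below is only a sanity check. All constant and function names are illustrative, and `route_latency_cycles` assumes that per-hop and per-turn costs simply add, which the slides do not state.

```python
# Figures taken from the slides; names are illustrative, not from any emulator API.
NODES_X, NODES_Y, NODES_Z = 34, 34, 36   # functional model: cube of SMP nodes
PROCESSORS_PER_NODE = 25
PES_PER_PROCESSOR = 8                    # the factor of 8 in "8 x 25 x ~40,000"
CLOCK_HZ = 500_000_000                   # 500 MHz
LINK_WIDTH_BITS = 16
HOP_CYCLES = 5
TURN_CYCLES = 75


def machine_size():
    """Return (nodes, processing elements per node, total processing elements)."""
    nodes = NODES_X * NODES_Y * NODES_Z
    pes_per_node = PROCESSORS_PER_NODE * PES_PER_PROCESSOR
    return nodes, pes_per_node, nodes * pes_per_node


def link_bandwidth_bytes_per_s():
    """16 bits per clock at 500 MHz, converted to bytes per second."""
    return LINK_WIDTH_BITS * CLOCK_HZ // 8


def route_latency_cycles(hops, turns):
    """Assumption: hop and turn latencies add linearly along a route."""
    return hops * HOP_CYCLES + turns * TURN_CYCLES


if __name__ == "__main__":
    nodes, per_node, total = machine_size()
    print(f"{nodes} nodes x {per_node} PEs/node = {total} PEs")
    print(f"link bandwidth: {link_bandwidth_bytes_per_s()} bytes/s")
    print(f"3 hops, 1 turn: {route_latency_cycles(3, 1)} cycles")
```

The exact node count is 41,616, which the slides round to ~40,000, so the exact total of 8,323,200 processing elements agrees with the quoted ~8 x 10^6.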

Need for Emulator:
  Emulator: enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine

Emulator Objectives:
  Emulate Blue Gene and other petaFLOPS machines.
  Memory and time limitations on a single processor require that the simulation MUST be performed on a parallel architecture.
  Issues:
    Assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging.
    Therefore don't need a complex event queue/rollback.

Emulator Implementation:
  What are the basic data structures/interface?
    Machine configuration (topology), handler registration
    Nodes with node-level shared data
    Threads (associated with each node) representing processing elements
    Communication between nodes
  How to handle all these objects on a parallel architecture?
  How to handle object-to-object communication?
  Difficulties of implementation eased by using Charm++, an object-oriented parallel programming paradigm.

Experiments on Emulator:
  Sample applications implemented:
    Primes
    Jacobi relaxation
    MD prototype
      ApoA-I: 92k atoms
      40,000 atoms, no bonds calculated, nearest neighbor cutoff
  Ran full Blue Gene (with 8 x 10^6 threads) on ~100 ASCI-Red processors

Collective Operations:
  Explore different algorithms for broadcasts and reductions
    RING
    LINE
    OCTREE
  [Diagram axes: x, y, z]
  Use 'primitive' 30 x 30 x 20 (10 threads) Blue Gene emulation on a 50-processor Linux cluster

Converse BlueGene Emulator Objective:
  Performance estimation (with proper time stamping)
  Provide API for building Charm++ on top of emulator.

Bluegene Emulator:
  Node Structure
    Communication threads
    Non-affinity message queue
    Affinity message queue
    Worker thread
    inBuffer

Performance:
  Pingpong
    Close to Converse pingpong; 81-103 us vs. 92 us RTT
    Charm++ pingpong: 116 us RTT
    Charm++ Bluegene pingpong: 134-175 us RTT

Charm++ on top of Emulator:
  BlueGene thread represents Charm++ node
  Name conflict:
    Cpv, Ctv
    MsgSend, etc.
    CkMyPe(), CkNumPes(), etc.

Future Work: Simulator:
  LeanMD: Fully functional MD with only cutoff
  How can we examine the performance of algorithms on variants of processor-in-memory design in a massive system?
  Several layers of detail to measure
    Basic: Correctly model performance, timestamp messages with correction for out-of-order execution
    More detailed: network performance, memory access, modeling sharing of floating-point unit, estimation techniques
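
The node structure listed on the "Bluegene Emulator" slide (inBuffer, communication threads, affinity and non-affinity message queues, worker threads) can be pictured with a minimal single-threaded sketch. This is an illustration under stated assumptions, not the emulator's actual code. The class names, the `thread` field, and the rule that a worker drains its own affinity queue before the shared non-affinity queue are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Message:
    handler: str                  # name of a registered handler (hypothetical)
    payload: Any
    thread: Optional[int] = None  # target worker index, or None for "any worker"


@dataclass
class EmulatedNode:
    num_workers: int
    in_buffer: deque = field(default_factory=deque)     # arriving messages
    non_affinity: deque = field(default_factory=deque)  # shared by all workers
    affinity: list = field(default_factory=list)        # one queue per worker

    def __post_init__(self):
        self.affinity = [deque() for _ in range(self.num_workers)]

    def deliver(self, msg: Message) -> None:
        """The network deposits a message into the node's inBuffer."""
        self.in_buffer.append(msg)

    def communication_step(self) -> None:
        """Role of a communication thread: sort inBuffer into the queues."""
        while self.in_buffer:
            msg = self.in_buffer.popleft()
            if msg.thread is None:
                self.non_affinity.append(msg)
            else:
                self.affinity[msg.thread].append(msg)

    def worker_step(self, worker: int) -> Optional[Message]:
        """Role of a worker thread. Assumption: own affinity queue goes first."""
        if self.affinity[worker]:
            return self.affinity[worker].popleft()
        if self.non_affinity:
            return self.non_affinity.popleft()
        return None


if __name__ == "__main__":
    node = EmulatedNode(num_workers=2)
    node.deliver(Message("compute", payload=1, thread=1))
    node.deliver(Message("compute", payload=2))
    node.communication_step()
    print(node.worker_step(0), node.worker_step(1))
```

A real emulator would run the communication and worker roles as concurrent threads. The sketch runs them as explicit steps so the queue routing is easy to follow.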