bluegene01

Presentation Transcript

Emulating Massively Parallel (PetaFLOPS) Machines: 

Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
Department of Computer Science, Parallel Programming Laboratory
http://charm.cs.uiuc.edu

Roadmap: 

BlueGene Architecture
Need for an Emulator
Charm++ BlueGene
Converse BlueGene
Future Work

Blue Gene: Processor-in-memory Case Study: 

Five steps to a PetaFLOPS, taken from: http://www.research.ibm.com/bluegene/
Functional model: 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.

SMP Node: 

25 processors, 200 processing elements
Input/output buffer: 32 x 128 bytes
Network: connected to six neighbors via duplex links, 16 bit @ 500 MHz = 1 Gigabyte/s
Latencies: 5 cycles per hop, 75 cycles per turn
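The link and latency figures on this slide can be turned into a rough cost model. The sketch below is only an illustration: the function names are hypothetical, and it assumes dimension-ordered routing on a mesh without wraparound links, with a "turn" taken to mean a change of dimension along the route. The slides do not state either of these.

```python
CLOCK_HZ = 500e6        # link clock, from the slide
LINK_WIDTH_BITS = 16    # link width, from the slide
CYCLES_PER_HOP = 5      # from the slide
CYCLES_PER_TURN = 75    # from the slide


def link_bandwidth_bytes_per_s():
    """16 bits per cycle at 500 MHz, converted to bytes per second."""
    return LINK_WIDTH_BITS * CLOCK_HZ / 8


def route_cycles(src, dst):
    """Estimated network cycles between two (x, y, z) nodes.

    Assumption: the message travels along x, then y, then z, so it turns
    once for every additional dimension it has to move in.
    """
    deltas = [abs(d - s) for s, d in zip(src, dst)]
    hops = sum(deltas)
    dims_used = sum(1 for delta in deltas if delta > 0)
    turns = max(dims_used - 1, 0)
    return hops * CYCLES_PER_HOP + turns * CYCLES_PER_TURN


if __name__ == "__main__":
    print(link_bandwidth_bytes_per_s())           # bytes per second per link
    print(route_cycles((0, 0, 0), (33, 33, 35)))  # corner to corner
```

Under these assumptions a corner-to-corner message in the 34 x 34 x 36 cube takes 101 hops and 2 turns.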

Processor: 

Stats: 500 MHz
Memory-side cache eliminates coherency problems
10 cycles local cache, 20 cycles remote cache, 10 cycles cache miss
8 integer units sharing 2 floating-point units
8 x 25 x ~40,000 = ~8 x 10^6 processing elements!
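The processing-element count follows directly from the machine dimensions given earlier. A quick check of the arithmetic, using only numbers from the slides (the constant names are chosen for this example):

```python
DIMS = (34, 34, 36)          # cube of nodes, from the functional model
PROCESSORS_PER_NODE = 25
UNITS_PER_PROCESSOR = 8      # integer units per processor

nodes = DIMS[0] * DIMS[1] * DIMS[2]
elements_per_node = PROCESSORS_PER_NODE * UNITS_PER_PROCESSOR
total_elements = nodes * elements_per_node

if __name__ == "__main__":
    print(nodes)              # exact node count, roughly 40,000
    print(elements_per_node)  # processing elements per node
    print(total_elements)     # roughly 8 x 10^6
```

The exact product is slightly above the rounded figure on the slide because 34 x 34 x 36 is a little more than 40,000.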

Need for Emulator: 

Emulator: enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.

Emulator Objectives: 

Emulate Blue Gene and other petaFLOPS machines.
Memory and time limitations on a single processor require that the simulation be performed on a parallel architecture.
Issues: Assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging. Therefore a complex event queue/rollback is not needed.

Emulator Implementation: 

What are the basic data structures and interfaces?
  Machine configuration (topology), handler registration
  Nodes with node-level shared data
  Threads (associated with each node) representing processing elements
  Communication between nodes
How to handle all these objects on a parallel architecture?
How to handle object-to-object communication?
Difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.
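The data structures listed on this slide could be organized roughly as follows. This is a minimal single-process sketch, not the emulator's actual interface: the class names, method names, and the block mapping of emulated nodes onto host processors are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EmulatedNode:
    coords: tuple
    num_threads: int                            # emulated processing elements
    shared: dict = field(default_factory=dict)  # node-level shared data


class Emulator:
    """Hypothetical container for machine configuration and handlers."""

    def __init__(self, dims, num_hosts, threads_per_node=200):
        self.dims = dims                  # machine configuration (topology)
        self.num_hosts = num_hosts        # real processors running the emulation
        self.threads_per_node = threads_per_node
        self.handlers = []                # registered message handlers
        self.nodes = {}                   # nodes are created on first use

    def num_nodes(self):
        x, y, z = self.dims
        return x * y * z

    def register_handler(self, fn):
        """Store a handler and return the id that messages refer to."""
        self.handlers.append(fn)
        return len(self.handlers) - 1

    def host_of(self, coords):
        """Assumed block mapping of an emulated node to a host processor."""
        x, y, z = coords
        _, size_y, size_z = self.dims
        index = (x * size_y + y) * size_z + z
        return index * self.num_hosts // self.num_nodes()

    def node(self, coords):
        if coords not in self.nodes:
            self.nodes[coords] = EmulatedNode(coords, self.threads_per_node)
        return self.nodes[coords]

    def deliver(self, coords, handler_id, payload):
        """Node-to-node communication: run the handler on the target node."""
        return self.handlers[handler_id](self.node(coords), payload)


def accumulate(node, payload):
    """Example handler that updates node-level shared data."""
    node.shared["sum"] = node.shared.get("sum", 0) + payload
    return node.shared["sum"]


if __name__ == "__main__":
    emu = Emulator((34, 34, 36), num_hosts=100)
    hid = emu.register_handler(accumulate)
    emu.deliver((1, 2, 3), hid, 2)
    print(emu.deliver((1, 2, 3), hid, 3), emu.host_of((33, 33, 35)))
```

In a real parallel run, `deliver` would forward the message to the host returned by `host_of` instead of calling the handler locally.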

Experiments on Emulator: 

Sample applications implemented:
  Primes
  Jacobi relaxation
  MD prototype
    ApoA-I: 92k atoms
    40,000 atoms, no bonds calculated, nearest-neighbor cutoff
Ran full Blue Gene (with 8 x 10^6 threads) on ~100 ASCI-Red processors

Collective Operations: 

Explore different algorithms for broadcasts and reductions: RING, LINE, OCTREE
(Diagram axes: x, y, z)
Use 'primitive' 30 x 30 x 20 (10 threads) Blue Gene emulation on a 50-processor Linux cluster
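The slides name the broadcast algorithms without defining them. As one plausible reading of LINE, the sketch below broadcasts along the x line through the root, then along y from every node on that line, then along z. The function name, the forwarding rule, and the assumption that a node can send on all its links in one step are illustrative, not taken from the slides.

```python
from collections import deque


def line_broadcast(dims, root):
    """Return {node: step at which the broadcast message arrives}."""
    arrival = {root: 0}
    queue = deque()

    def send(node, axis, direction, step):
        nxt = list(node)
        nxt[axis] += direction
        if 0 <= nxt[axis] < dims[axis]:        # mesh edge: no wraparound
            nxt = tuple(nxt)
            arrival[nxt] = step
            queue.append((nxt, axis, direction))

    # The root starts a line in both directions along every axis.
    for axis in range(3):
        for direction in (-1, 1):
            send(root, axis, direction, 1)

    while queue:
        node, axis, direction = queue.popleft()
        step = arrival[node] + 1
        send(node, axis, direction, step)      # keep the current line going
        for later_axis in range(axis + 1, 3):  # start lines on later axes
            for d in (-1, 1):
                send(node, later_axis, d, step)
    return arrival


if __name__ == "__main__":
    times = line_broadcast((30, 30, 20), (0, 0, 0))
    print(len(times), max(times.values()))
```

With this forwarding rule each node receives the message exactly once, and its arrival step equals its Manhattan distance from the root.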

Converse BlueGene Emulator Objective: 

Performance estimation (with proper time stamping)
Provide an API for building Charm++ on top of the emulator.

Bluegene Emulator : 

Node structure (diagram labels):
  Communication threads
  Non-affinity message queue
  Affinity message queue
  Worker thread
  inBuffer
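The labels on this slide suggest a per-node message path from the inBuffer through the queues to the worker threads. The following is a minimal sketch of one way such a path could work. The class and method names are hypothetical, and the policy that a worker checks its own affinity queue before the shared non-affinity queue is an assumption.

```python
from collections import deque


class NodeQueues:
    """Hypothetical per-node message queues of an emulated node."""

    def __init__(self, num_workers):
        self.in_buffer = deque()                               # incoming messages
        self.affinity = [deque() for _ in range(num_workers)]  # one per worker
        self.non_affinity = deque()                            # any worker may take

    def receive(self, msg, target_worker=None):
        """A message arrives from the network into the inBuffer."""
        self.in_buffer.append((target_worker, msg))

    def communication_step(self):
        """Communication thread: sort inBuffer messages into the queues."""
        while self.in_buffer:
            target, msg = self.in_buffer.popleft()
            if target is None:
                self.non_affinity.append(msg)
            else:
                self.affinity[target].append(msg)

    def next_message(self, worker_id):
        """Worker thread: take own affinity work first, then shared work."""
        if self.affinity[worker_id]:
            return self.affinity[worker_id].popleft()
        if self.non_affinity:
            return self.non_affinity.popleft()
        return None


if __name__ == "__main__":
    q = NodeQueues(num_workers=2)
    q.receive("for anyone")
    q.receive("for worker 1", target_worker=1)
    q.communication_step()
    print(q.next_message(1), "|", q.next_message(0), "|", q.next_message(0))
```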

Performance: 

Pingpong
  Close to Converse pingpong: 81-103 us vs. 92 us RTT
  Charm++ pingpong: 116 us RTT
  Charm++ Bluegene pingpong: 134-175 us RTT

Charm++ on top of Emulator: 

BlueGene thread represents Charm++ node
Name conflict:
  Cpv, Ctv
  MsgSend, etc.
  CkMyPe(), CkNumPes(), etc.

Future Work: Simulator: 

LeanMD: fully functional MD with only cutoff
How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
Several layers of detail to measure:
  Basic: correctly model performance, timestamp messages with correction for out-of-order execution
  More detailed: network performance, memory access, modeling sharing of floating-point units, estimation techniques
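The "basic" layer, timestamping messages with a correction for out-of-order execution, can be illustrated with a toy model. The sketch below is an assumption about how such a correction might look, not the simulator's method. It treats each message as an (arrival time, execution cost) pair handled by one emulated thread, assumes the handlers are independent of each other, and corrects the predicted time by replaying the messages in arrival-time order.

```python
def finish_time(messages):
    """Virtual finish time when messages run in the given order.

    Each message is (arrival_time, cost). A handler starts when both the
    thread is free and the message has arrived.
    """
    clock = 0
    for arrival, cost in messages:
        clock = max(clock, arrival) + cost
    return clock


def corrected_finish_time(messages):
    """Replay the same messages in the order the target machine would see."""
    return finish_time(sorted(messages))


if __name__ == "__main__":
    # The emulator happened to run the later-arriving message first.
    emulator_order = [(10, 5), (0, 5)]
    print(finish_time(emulator_order))            # uncorrected estimate
    print(corrected_finish_time(emulator_order))  # estimate after reordering
```

In this example the uncorrected estimate charges the early message for waiting behind the late one, so the corrected finish time is smaller.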