Presentation Transcript
Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla,
Joshua Mostkoff Unger, Gengbin Zheng,
Laxmikant V. Kalé
Department of Computer Science
Parallel Programming Laboratory
http://charm.cs.uiuc.edu
Roadmap:
BlueGene Architecture
Need for an Emulator
Charm++ BlueGene
Converse BlueGene
Future Work
Blue Gene: Processor-in-memory Case Study:
Five steps to a PetaFLOPS, taken from:
http://www.research.ibm.com/bluegene/
FUNCTIONAL MODEL:
34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.
SMP Node:
25 processors
200 processing elements
Input/Output Buffer
32 x 128 bytes
Network
Connected to six neighbors via duplex link
16 bit @ 500 MHz = 1 Gigabyte/s
Latencies:
5 cycles per hop
75 cycles per turn
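The link and latency figures above can be combined into a rough per-message estimate. The sketch below assumes dimension-order routing on the 34 x 34 x 36 grid with no wraparound links, and charges one "turn" each time the route switches dimension. The slides state neither assumption, and `route_cycles` and `link_bandwidth_bytes_per_s` are hypothetical helper names.

```python
HOP_CYCLES = 5     # from the slide: 5 cycles per hop
TURN_CYCLES = 75   # from the slide: 75 cycles per turn
CLOCK_HZ = 500e6   # from the slide: 500 MHz link clock


def route_cycles(src, dst):
    """Estimate network cycles between two (x, y, z) nodes.

    Assumption: dimension-order routing on a mesh; a turn is charged
    each time the route moves on to a new dimension.
    """
    deltas = [abs(d - s) for s, d in zip(src, dst)]
    hops = sum(deltas)
    dims_used = sum(1 for delta in deltas if delta > 0)
    turns = max(dims_used - 1, 0)
    return hops * HOP_CYCLES + turns * TURN_CYCLES


def link_bandwidth_bytes_per_s(width_bits=16, clock_hz=CLOCK_HZ):
    """16 bits per cycle at 500 MHz is 8 Gbit/s, i.e. 1 Gigabyte/s."""
    return width_bits * clock_hz / 8


corner_to_corner = route_cycles((0, 0, 0), (33, 33, 35))
```

Under these assumptions the turn cost dominates short routes: a three-hop route along one axis costs 15 cycles, while a route that touches all three axes pays 150 cycles in turns alone.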
Processor:
STATS:
500 MHz
Memory-side cache eliminates coherency problems
10 cycles local cache
20 cycles remote cache
10 cycles cache miss
8 integer units sharing 2 floating point units
8 x 25 x ~40,000 = ~8 x 10^6 processing elements!
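The processing-element count can be checked directly from the machine dimensions. This is only the arithmetic from the slides written out; the constant and function names are illustrative.

```python
GRID = (34, 34, 36)          # cube of shared-memory nodes
PROCESSORS_PER_NODE = 25
ELEMENTS_PER_PROCESSOR = 8   # 8 integer units per processor


def machine_size(grid=GRID):
    """Return (nodes, processors, processing_elements) for the full machine."""
    nodes = grid[0] * grid[1] * grid[2]
    processors = nodes * PROCESSORS_PER_NODE
    elements = processors * ELEMENTS_PER_PROCESSOR
    return nodes, processors, elements


nodes, processors, elements = machine_size()
print(f"{nodes:,} nodes, {processors:,} processors, {elements:,} processing elements")
```

The exact node count is 41,616, which the slides round to ~40,000, and the exact element count is 8,323,200, which they round to ~8 x 10^6.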
Need for Emulator:
An emulator enables programmers to develop, compile, and run software using the programming interface that will be used on the actual machine.
Emulator Objectives:
Emulate Blue Gene and other petaFLOPS machines.
Memory and time limitations on a single processor require that the simulation MUST be performed on a parallel architecture.
Issues:
Assume that program written for processor-in-memory machine will handle out-of-order execution and messaging.
Therefore don’t need complex event queue/rollback.
Emulator Implementation:
What are the basic data structures/interface?
Machine configuration (topology), handler registration
Nodes with node-level shared data
Threads (associated with each node) representing processing elements
Communication between nodes
How to handle all these objects on parallel architecture? How to handle object-to-object communication?
Difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.
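The data structures listed above can be illustrated with a small sequential sketch. It is not the Charm++ implementation and not the emulator's real API: `Emulator`, `Node`, `register_handler`, `send`, and `run` are hypothetical names, and threads are modeled only as round-robin IDs rather than real threads. Messages are delivered in whatever order they are found, matching the stated assumption that programs tolerate out-of-order messaging.

```python
from collections import deque
from itertools import product


class Node:
    """One emulated node: node-level shared data plus an incoming queue."""

    def __init__(self, coords, num_threads):
        self.coords = coords
        self.shared = {}        # node-level shared data
        self.inbox = deque()    # messages waiting for a thread
        self.num_threads = num_threads
        self.next_thread = 0    # round-robin thread assignment (assumption)


class Emulator:
    def __init__(self, shape, threads_per_node):
        # Machine configuration: a 3D grid of nodes.
        self.handlers = []
        self.nodes = {
            c: Node(c, threads_per_node)
            for c in product(*(range(n) for n in shape))
        }

    def register_handler(self, fn):
        """Handler registration: returns an integer handler ID."""
        self.handlers.append(fn)
        return len(self.handlers) - 1

    def send(self, dst, handler_id, payload):
        """Communication between nodes: enqueue a message at dst."""
        self.nodes[dst].inbox.append((handler_id, payload))

    def run(self):
        """Deliver messages until none remain; return the number delivered."""
        delivered = 0
        pending = True
        while pending:
            pending = False
            for node in self.nodes.values():
                while node.inbox:
                    handler_id, payload = node.inbox.popleft()
                    thread_id = node.next_thread
                    node.next_thread = (thread_id + 1) % node.num_threads
                    self.handlers[handler_id](self, node, thread_id, payload)
                    delivered += 1
                    pending = True
        return delivered


# Usage: bounce a message along the x axis of a 2 x 2 x 2 machine.
emu = Emulator(shape=(2, 2, 2), threads_per_node=4)


def relay(emu, node, thread_id, hops_left):
    node.shared["visits"] = node.shared.get("visits", 0) + 1
    if hops_left > 0:
        x, y, z = node.coords
        emu.send(((x + 1) % 2, y, z), RELAY, hops_left - 1)


RELAY = emu.register_handler(relay)
emu.send((0, 0, 0), RELAY, 3)
delivered = emu.run()
```

In a parallel implementation the node dictionary would be partitioned across real processors, which is the part the slides attribute to Charm++.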
Experiments on Emulator:
Sample applications implemented:
Primes
Jacobi relaxation
MD prototype
40,000 atoms, no bonds calculated, nearest neighbor cutoff
[Figure: ApoA-I, 92k atoms]
Ran full Blue Gene (with 8 x 10^6 threads) on ~100 ASCI-Red processors
Collective Operations:
Explore different algorithms for broadcasts and reductions: RING, LINE, OCTREE
[Figure: x, y, z axes]
Use 'primitive' 30 x 30 x 20 (10 threads) Blue Gene emulation on 50-processor Linux cluster
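The slides do not define the RING, LINE, and OCTREE algorithms, so the sketch below is one plausible reading, not the authors' implementation. It counts sequential forwarding steps for a broadcast, assuming that a line forwards node to node from one end, a ring forwards in both directions from the root, and an octree has each node split its box into up to eight octants and forward to one representative per octant. Hop and turn latencies are ignored, and all function names are hypothetical.

```python
from itertools import product


def line_steps(n):
    """Steps for a chain that starts at one end of n nodes."""
    return n - 1


def ring_steps(n):
    """Steps when the root forwards both ways around a ring of n nodes."""
    return n // 2


def octree_broadcast(shape):
    """Return (levels, nodes_reached) for a recursive octant broadcast."""
    reached = set()

    def visit(lo, hi, level):
        # The box is [lo, hi) per dimension; its representative is the lo corner.
        reached.add(lo)
        if all(h - l <= 1 for l, h in zip(lo, hi)):
            return level
        halves = []
        for l, h in zip(lo, hi):
            mid = l + (h - l + 1) // 2
            halves.append([(l, mid)] + ([(mid, h)] if mid < h else []))
        return max(
            visit(tuple(r[0] for r in octant),
                  tuple(r[1] for r in octant),
                  level + 1)
            for octant in product(*halves)
        )

    levels = visit((0, 0, 0), tuple(shape), 0)
    return levels, len(reached)


shape = (30, 30, 20)
total = shape[0] * shape[1] * shape[2]
levels, reached = octree_broadcast(shape)
print(line_steps(total), ring_steps(total), levels)
```

Under these assumptions the 30 x 30 x 20 configuration needs 17,999 steps as a line, 9,000 as a ring, and 5 levels as an octree, which suggests why comparing the variants on the emulator is worthwhile.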
Converse BlueGene Emulator Objective:
Performance estimation (with proper time stamping)
Provide API for building Charm++ on top of emulator.
Bluegene Emulator:
Node Structure
[Figure: node diagram showing communication threads, non-affinity message queue, affinity message queue, worker thread, inBuffer]
Performance:
Pingpong
Close to Converse pingpong;
81-103 us vs. 92 us RTT
Charm++ pingpong
116 us RTT
Charm++ Bluegene pingpong
134-175 us RTT
Charm++ on top of Emulator:
BlueGene thread represents Charm++ node;
Name conflict:
Cpv, Ctv
MsgSend, etc
CkMyPe(), CkNumPes(), etc
Future Work: Simulator:
LeanMD: Fully functional MD with only cutoff
How can we examine performance of algorithms on variants of processor-in-memory design in massive system?
Several layers of detail to measure
Basic: Correctly model performance, timestamp messages with correction for out-of-order execution
More detailed: network performance, memory access, modeling sharing of floating-point unit, estimation techniques
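The "basic" layer can be illustrated with a simple timestamp correction. The slides do not describe the actual correction scheme, so the sketch below shows one possibility under stated assumptions: each message carries a predicted arrival time and a compute cost, handlers are independent of one another, and the correction replays the messages in arrival-time order after the emulator has executed them in some other order. The function names are hypothetical.

```python
def naive_finish_times(events):
    """Finish times if the emulator's execution order is taken at face value.

    events: list of (arrival_time, cost) in the order they were executed.
    """
    clock = 0
    finish = []
    for arrival, cost in events:
        clock = max(clock, arrival) + cost
        finish.append(clock)
    return finish


def corrected_finish_times(events):
    """Finish times after replaying messages in arrival-timestamp order.

    Results are returned in the original execution order, so entry i
    still describes events[i].
    """
    order = sorted(range(len(events)), key=lambda i: events[i][0])
    clock = 0
    finish = [0] * len(events)
    for i in order:
        arrival, cost = events[i]
        clock = max(clock, arrival) + cost
        finish[i] = clock
    return finish


# The emulator happened to run the message stamped 100 before the one stamped 20.
executed = [(100, 10), (20, 10)]
naive = naive_finish_times(executed)
corrected = corrected_finish_times(executed)
```

Here the naive estimate delays the second message to time 120, while the corrected replay finishes it at time 30, before the first message arrives.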