Foley CCA IPS presentation final

Presentation Transcript

Design and Prototype Implementation of MCMD Support in the SWIM IPS Framework:

Samantha Foley

Overview: 

Description of the SWIM project
Description of the IPS
  What it needs to do for SWIM
  Existing implementation
Second generation
  Requirements
  Design
  Example
Prototype implementation and future work

What is SWIM?: 

Center for Simulation of RF Wave Interactions with Magnetohydrodynamics
DOE SciDAC project (Fusion + OASCR)
Integration of two established aspects of plasma physics
  Wave-plasma interactions
  Extended MHD (magnetohydrodynamics)
Scientific goal: better understanding and control of RF/plasma interactions
  Disturbances in the plasma reduce efficiency
  Increase the efficiency of fusion energy projects, like ITER
Technical goal: prototype integrated fusion simulation

Integrated Plasma Simulator: 

SWIM’s framework to integrate individual fusion codes
Focus on designing interfaces for different classes of physics components
  Multiple implementations of each class participating in the project
Existing codes have a wide range of HPC friendliness
  Scalability, parallelism, portability
Evolutionary approach toward a full component model
  Initially don’t touch physics codes at all
  Python-wrapped physics codes

IPS: Structure: 

Simulation configuration file
  Describes the components, ports, time loop, and other information for running the simulation
Framework
  Basic file movement, access to configuration information and components
Services
  Job launch, data movement, component access, framework access
Components
  Driver - physics workflow control
  Init - sets up global things like the Plasma State
  Physics components - Python-wrapped physics codes
Plasma State
  Library and set of files that the physics components use to communicate data

New IPS design requirements: 

MCMD (multiple-component, multiple-data)
  There will be components that can run concurrently, coupled, and sequentially with respect to each other
Task management
  Need to be able to launch and manage multiple components to achieve coupled or concurrent execution
  Need to work with the resource manager
Resource management
  Need to be able to manage resources efficiently for concurrent, coupled, and sequential cases
  Fault tolerance: ability to handle node failure
Fault tolerance
  Work with the CIFTS project to receive information about node failure and other system failures

New IPS design: 

Services are broken into the following groups:
  Resource manager
  Task manager
  Event service
  Data manager
  Workflow manager

New IPS: Resource manager (RM): 

Allocates nodes to a component for execution
Releases the nodes back to the pool
Interface for TM to get allocation information for a component or group of components
Keeps track of node ownership, group, availability, and status
  The status of a node indicates whether it is up, down, or suspected of having problems
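The bookkeeping described on this slide can be sketched as a small node table. This is only an illustration, not the IPS implementation: the class name, the snake_case method names, the status strings, and the choice to pass a node count directly (rather than looking requirements up per component and method) are all assumptions, and group tracking is left out.

```python
class ResourceManager:
    """Toy node table tracking an owner and a status for each node (hypothetical)."""

    VALID_STATES = ("up", "down", "suspect")  # assumed status names

    def __init__(self, node_names):
        # owner is None while the node sits in the free pool
        self.node_table = {n: {"owner": None, "status": "up"} for n in node_names}

    def get_allocation(self, comp_name, n_nodes):
        """Give n_nodes free, healthy nodes to comp_name, or fail."""
        free = [n for n, info in self.node_table.items()
                if info["owner"] is None and info["status"] == "up"]
        if len(free) < n_nodes:
            raise RuntimeError(f"not enough free nodes for {comp_name}")
        chosen = free[:n_nodes]
        for n in chosen:
            self.node_table[n]["owner"] = comp_name
        return chosen

    def release_allocation(self, comp_name):
        """Return every node owned by comp_name to the pool."""
        released = []
        for n, info in self.node_table.items():
            if info["owner"] == comp_name:
                info["owner"] = None
                released.append(n)
        return released

    def get_allocation_info(self, comp_name):
        """List the nodes currently owned by comp_name (what the TM would ask for)."""
        return [n for n, info in self.node_table.items()
                if info["owner"] == comp_name]

    def set_node_state(self, node, new_state):
        """Mark a node as up, down, or suspect."""
        if new_state not in self.VALID_STATES:
            raise ValueError(f"unknown node state: {new_state}")
        self.node_table[node]["status"] = new_state
```

Nodes marked "down" or "suspect" are simply skipped when allocating, which is one possible policy among several.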

New IPS: Task manager (TM): 

Two-level launch mechanism
  Accommodates the model where the framework/driver run on the head node and physics codes are launched as MPI jobs
  call - the driver (or a service) calls a component method, which the TM then invokes
  launchTask - called by the component’s Python wrapper to run the executable on the appropriate nodes and runtime environment
Monitors the launched tasks for users
Designed for asynchronous and coupled launches for MCMD execution
Current prototype is synchronous and does not handle concurrent launch
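A minimal, synchronous sketch of the two-level launch follows, assuming a plain subprocess stands in for the MPI launch. The class names and snake_case method names are hypothetical, and a real launcher would add the MPI launch command and the node list obtained from the resource manager.

```python
import subprocess
import sys


class TaskManager:
    """Synchronous two-level launch (illustrative only)."""

    def call(self, comp, method, args):
        # Level 1: invoke a method on the component's Python wrapper.
        return getattr(comp, method)(*args)

    def launch_task(self, executable, args, comp_name):
        # Level 2: run the executable and wait for it to finish.
        # comp_name is unused here; a real TM would use it to look up
        # the nodes allocated to the component.
        result = subprocess.run([executable, *args],
                                capture_output=True, text=True)
        return result.returncode, result.stdout


class EchoComponent:
    """Hypothetical wrapper whose 'physics code' just echoes its argument."""

    def __init__(self, tm):
        self.tm = tm

    def step(self, message):
        script = "import sys; print(sys.argv[1])"
        return self.tm.launch_task(sys.executable, ["-c", script, message], "echo")


if __name__ == "__main__":
    tm = TaskManager()
    print(tm.call(EchoComponent(tm), "step", ["hello"]))
```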

New IPS: Event service (ES): 

Based on the CCA event service specification
Primarily for logging what happens during the simulation
Eventually, it will publish events to the SWIM web portal
Interface to the FTB
Initial implementation, to be integrated with IPS
  Work of Aniruddha Shet, CIFTS project
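The logging use case can be pictured as a topic-based publish/subscribe service. The sketch below is a generic pattern, not the CCA event service API: the class, method names, topic string, and event fields are all made up for illustration.

```python
class EventService:
    """Minimal topic-based publish/subscribe (hypothetical interface)."""

    def __init__(self):
        self._listeners = {}  # topic name -> list of callables

    def subscribe(self, topic, listener):
        self._listeners.setdefault(topic, []).append(listener)

    def publish(self, topic, event):
        # Deliver the event to every listener registered for this topic.
        for listener in self._listeners.get(topic, []):
            listener(topic, event)


if __name__ == "__main__":
    log = []
    es = EventService()
    # A logger is just one subscriber; a portal or fault-tolerance
    # bridge could be attached to the same topic in the same way.
    es.subscribe("sim.log", lambda topic, event: log.append((topic, event)))
    es.publish("sim.log", {"comp": "compX", "msg": "method1 started"})
    print(log)
```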

New IPS: Overall design: 

[Diagram labels: Framework, Services (ES, RM, TM, DM, WM), Component, Driver, Init, PS, Log, FTB, Web Portal]

New IPS: APIs: 

Task Manager
  call(comp, method, args)
  callNonblocking(comp, method, args)
  waitCall(callID), getCallStatus(callID), killCall(callID)
  launchTask(executable, args, compName)
  launchTaskNonblocking(executable, args, compName)
  waitTask(taskID), getTaskStatus(taskID), killTask(taskID)
Resource Manager
  getAllocation(comp, method), releaseAllocation(compName)
  getCompReqs(comp, method)
  setNodeState(node, newState)
  getAllocationInfo(compName), getGroupAllocationInfo(groupName)
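One way the nonblocking call family could behave is sketched below with a thread pool from the Python standard library. This is an assumption about semantics, not the IPS design: the class name, snake_case names, and status strings are hypothetical, and killCall is omitted because a running Python thread cannot be killed safely.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class NonblockingCalls:
    """Sketch of callNonblocking / getCallStatus / waitCall semantics."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._calls = {}               # call ID -> Future
        self._ids = itertools.count(1)

    def call_nonblocking(self, comp, method, args):
        # Start the component method in the background and hand back an ID.
        call_id = next(self._ids)
        self._calls[call_id] = self._pool.submit(getattr(comp, method), *args)
        return call_id

    def get_call_status(self, call_id):
        # Queued calls are also reported as "running" in this sketch.
        return "done" if self._calls[call_id].done() else "running"

    def wait_call(self, call_id):
        # Block until the call finishes and return its result.
        return self._calls[call_id].result()

    def shutdown(self):
        self._pool.shutdown()


class Doubler:
    """Hypothetical component used only to exercise the sketch."""

    def step(self, x):
        return 2 * x


if __name__ == "__main__":
    tm = NonblockingCalls()
    call_id = tm.call_nonblocking(Doubler(), "step", [21])
    print(tm.wait_call(call_id), tm.get_call_status(call_id))
    tm.shutdown()
```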

Example: Task Launch: 

Driver
  …
  RM.getAllocation(compX, method1)
  DM.stageInput(compX)
  TM.call(compX, method1, args)
  DM.stageOutput(compX)
  RM.releaseAllocation(compX)
  …
RM.getAllocation(compX, method1)
  getCompReqs(compX, method1)
  If there are enough nodes, update nodeTable and other data structures
compX.requirementsInfo
  …
  if method == 'method1': return n
  …

Example: Task Launch (cont’d):

Driver
  …
  RM.getAllocation(compX, method1)
  DM.stageInput(compX)
  TM.call(compX, method1, args)
  DM.stageOutput(compX)
  RM.releaseAllocation(compX)
  …
DM.stageInput(compX)
  Get input file list
  Get input file dir
  fwk.copyFiles(inputDir, files, cwd)

Example: Task Launch (cont’d):

Driver
  …
  RM.getAllocation(compX, method1)
  DM.stageInput(compX)
  TM.call(compX, method1, args)
  DM.stageOutput(compX)
  RM.releaseAllocation(compX)
  …
TM.call(compX, method1, args)
  DM.setWorkDir(compX)
  retval = compX.method1(args)
  cd back to driver’s work dir
  return retval
compX.method1
  …
  TM.launchTask(task, args, self)
  …
TM.launchTask(exe, args, compX)
  info = RM.getAllocationInfo(compX)
  Launch executable on allocated nodes
  return retval
RM.getAllocationInfo(compX)
  Find the nodes allocated to compX
  Return resource info

Example: Task Launch (cont’d):

Driver
  …
  RM.getAllocation(compX, method1)
  DM.stageInput(compX)
  TM.call(compX, method1, args)
  DM.stageOutput(compX)
  RM.releaseAllocation(compX)
  …
DM.stageOutput(compX)
  Get output file list
  Get output file dir
  fwk.copyFiles(cwd, files, outputDir)

Example: Task Launch (cont’d):

Driver
  …
  RM.getAllocation(compX, method1)
  DM.stageInput(compX)
  TM.call(compX, method1, args)
  DM.stageOutput(compX)
  RM.releaseAllocation(compX)
  …
RM.releaseAllocation(compX)
  Find the nodes allocated to compX
  Update nodeTable and other data structures
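The five driver steps walked through in this example can be collected into one helper. The sketch below uses the method names from the slides but stubs every service with a recorder so it runs on its own. The helper name, the RecordingService class, and the try/finally that releases nodes even if a step fails are assumptions added for illustration.

```python
def run_component_step(rm, dm, tm, comp, method, args):
    """Driver-side sequence: allocate, stage in, call, stage out, release."""
    rm.getAllocation(comp, method)
    try:
        dm.stageInput(comp)
        retval = tm.call(comp, method, args)
        dm.stageOutput(comp)
    finally:
        # Assumed policy: always hand the nodes back, even after a failure.
        rm.releaseAllocation(comp)
    return retval


class RecordingService:
    """Stub service that records which methods were called, in order."""

    def __init__(self, name, trace):
        self.name = name
        self.trace = trace

    def __getattr__(self, method):
        # Only reached for attributes that do not exist, i.e. service methods.
        def record(*args):
            self.trace.append(f"{self.name}.{method}")
            return None
        return record


if __name__ == "__main__":
    trace = []
    rm, dm, tm = (RecordingService(n, trace) for n in ("RM", "DM", "TM"))
    run_component_step(rm, dm, tm, "compX", "method1", [])
    print(trace)
```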

Work in progress: 

Currently implementing the design
Prototype implementation
  Expected at the end of the summer
  Targeted system: Jaguar
  Resource management
  Synchronous, non-concurrent task launch
  Event system integrated

Future Work: 

TM: multiple component launch
  Asynchronous call and task launch
  Notion of coupling and groups
  Ability to monitor status
RM: fault tolerance
  Detection of node failure
  Suspected failure policies
Continue collaboration with CIFTS

Further IPS work: 

Extend to full MCMD functionality
Ability to deal with other types of failure
  Example: I/O failure
Sophisticated data management
  Concurrent access to plasma state
Workflow management
  Develop automatic workflow traversal system based on DAGs
Evolve to CCA-compliant IPS
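As a rough picture of DAG-based workflow traversal, the sketch below groups components into waves whose members have no unmet dependencies and so could, in principle, run concurrently. It uses graphlib from the Python standard library (3.9+). The function name and the component names in the usage example are hypothetical, and nothing here reflects a planned IPS design.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group workflow steps into waves that respect the dependency DAG.

    dependencies maps each step to the steps that must finish before it.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # Hypothetical workflow: two physics steps both depend on init.
    deps = {"init": [], "rf": ["init"], "mhd": ["init"], "output": ["rf", "mhd"]}
    print(execution_waves(deps))
```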

Questions?