wright Laurie
Added: October 03, 2007

Presentation Transcript

MPI Scheduling in Condor: An Update
Paradyn/Condor Week
Madison, WI 2002

Outline:
Review of Dedicated/MPI Scheduling in Condor
Dedicated vs. Opportunistic Backfill
Supported MPI Implementations
Supported Platforms
Future Work

What is MPI?
MPI is the “Message Passing Interface”
A library for writing parallel applications
  Fixed number of nodes
  Cannot be preempted
Lots of scientists use it for large problems
MPI is a standard with many different implementations

Dedicated Scheduling in Condor:
To schedule MPI jobs, Condor must have access to dedicated resources
More and more Condor pools are being formed from dedicated resources
Few schedulers handle both dedicated and non-dedicated resources at the same time

Problems with Dedicated Compute Clusters:
Dedicated resources are not really dedicated
Most software for controlling clusters relies on dedicated scheduling algorithms
  Assume constant availability of resources to compute fixed schedules
Due to hardware and software failure, dedicated resources are not always available over the long term

Look Familiar?

Two common views of a Cluster:

The Condor Solution:
Condor overcomes these difficulties by combining aspects of dedicated and opportunistic scheduling into a single system
Opportunistic scheduling involves placing jobs on non-dedicated resources under the assumption that the resources might not be available for the entire duration of the jobs
This is what Condor has been doing for years

The Condor Solution (cont’d):
Condor manages all resources and jobs within a single system
Administrators only have to maintain one system, saving time and money
Users can submit a wide variety of jobs: serial or parallel (including PVM + MPI)
Spend less time learning different scheduling tools, more time doing science

Claiming Resources for Dedicated Jobs:
When the dedicated scheduler (DS) has idle jobs, it queries the collector to find all dedicated resources
DS does match-making to decide which resources it wants
DS sends requests to the opportunistic scheduler to claim those resources
DS claims resources and has exclusive control (until it releases them)

Backfilling: The Problem:
All dedicated schedulers leave “holes”
Traditional solution is to use backfilling
  Use lower priority parallel jobs
  Use serial jobs
However, if you can’t checkpoint the serial jobs, and/or you don’t have any parallel jobs of the right size and duration, you’ve still got holes

Backfilling: The Condor Solution:
In Condor, we already have an infrastructure for managing non-dedicated nodes with opportunistic scheduling, so we use that to fill the holes in the dedicated schedule
Our opportunistic jobs can be checkpointed and migrated when the dedicated scheduler needs the resources again
Allows dedicated resources to be used for opportunistic jobs as needed

Specific MPI Implementations:
Supported: MPICH
Planned: MPIPro, LAM
Others?

Condor’s MPICH Support:
MPICH uses rsh to spawn jobs
Condor provides our own rsh tool
Older versions of MPICH need to be built without a hard-coded path to rsh
Newer versions of MPICH (1.2.2.3 and later) support an environment variable, P4_RSHCOMMAND, which specifies what program should be used

Condor and MPIPro:
We’ve investigated supporting MPIPro jobs with Condor
MPIPro has some issues with selecting a port for the head node in your computation, and we’re looking for a good solution

Condor + LAM = “LAMdor”:
LAM's API is better suited for a dynamic environment, where hosts can come and go from your MPI universe
Has a different mechanism for spawning jobs than MPICH
Condor working to support their methods for spawning

LAMdor (Cont’d):
LAM working to understand, expand, and fully implement the dynamic scheduling calls in their API
LAM also considering using Condor’s libraries to support checkpointing of MPI computations

Other MPI implementations:
What are people using?
Do you want to see Condor support any other MPI implementations?
If so, let us know by sending email to: condor-admin@cs.wisc.edu

Supported Platforms:
Condor’s MPI support is now available on all Condor platforms:
Unix
  Linux, Solaris, Digital Unix, IRIX, HPUX
Windows (new since last year)
  NT, 2000

Future work (short-term):
Implementing more advanced dedicated scheduling algorithms
Integrating Condor’s user priority system with its dedicated scheduling
Adding support for user-specified job priorities (among their own jobs)
Condor-MPI support for the Tool Daemon Protocol

Future work (longer term):
Solving problems w/ MPI on the Grid
  "Flocking" MPI jobs to remote pools, or even spanning pools with a single computation
  Solving issues of resource ownership on the Grid (i.e. how do you handle multiple dedicated schedulers on the grid wanting to control a given resource?)

More Future work:
Support for other kinds of dedicated jobs:
  Generic dedicated jobs
    We gather and schedule the resources, then call your program, give it the list of machines, and let the program spawn itself
  Linda (parallel programming interface)
  Gaussian (computational chemistry)

More Future work:
Better support for preempting opportunistic jobs to facilitate running high-priority dedicated ones
  “Checkpointing” vanilla jobs to swap space
Checkpointing entire MPI computations
MW using Condor-MPI

How do I start using MPI with Condor?
MPI support added and tested in the current development series (6.3.X)
MPI support is a built-in feature of the next stable series of Condor (6.4.X)
6.4.0 will be released Any Day Now™

Thanks for Listening!
Questions?
Come to the MPI “BoF”, Wednesday, 3/6/02, 11am-noon, 3385 CS
For more information:
www.cs.wisc.edu/condor
condor-admin@cs.wisc.edu
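
The four steps on the "Claiming Resources for Dedicated Jobs" slide can be illustrated with a small Python sketch. This is not Condor code: the dictionary fields (`dedicated`, `memory_mb`, `claimed_by`), the function names, and the memory-based match rule are all assumptions made up for illustration, and the request to the opportunistic scheduler is reduced to setting a field.

```python
# Hypothetical model of the claiming steps; none of these names come from Condor.

def query_dedicated(collector_ads):
    """Step 1: ask the collector for every resource marked as dedicated."""
    return [ad for ad in collector_ads if ad.get("dedicated", False)]

def matchmake(job, ads):
    """Step 2: pick enough unclaimed resources that satisfy the job."""
    suitable = [
        ad for ad in ads
        if ad["claimed_by"] is None and ad["memory_mb"] >= job["min_memory_mb"]
    ]
    if len(suitable) < job["machine_count"]:
        return None  # an MPI job needs a fixed node count: all or nothing
    return suitable[: job["machine_count"]]

def claim(job, collector_ads):
    """Steps 3-4: request the chosen resources and hold them exclusively."""
    chosen = matchmake(job, query_dedicated(collector_ads))
    if chosen is None:
        return False
    for ad in chosen:
        ad["claimed_by"] = job["name"]
    return True

def release(job, collector_ads):
    """Give the resources back so other jobs can use them."""
    for ad in collector_ads:
        if ad["claimed_by"] == job["name"]:
            ad["claimed_by"] = None

# Made-up pool: three dedicated nodes and one non-dedicated desktop.
pool = [
    {"name": "node1", "dedicated": True, "memory_mb": 512, "claimed_by": None},
    {"name": "node2", "dedicated": True, "memory_mb": 256, "claimed_by": None},
    {"name": "desk1", "dedicated": False, "memory_mb": 1024, "claimed_by": None},
    {"name": "node3", "dedicated": True, "memory_mb": 512, "claimed_by": None},
]
mpi_job = {"name": "mpi-A", "machine_count": 2, "min_memory_mb": 512}
claimed = claim(mpi_job, pool)
```

In this toy pool the job ends up holding node1 and node3; desk1 is skipped because it is not dedicated, and node2 because it has too little memory.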
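
The idea on the "Backfilling: The Condor Solution" slide, filling holes in the dedicated schedule with opportunistic jobs that can be checkpointed when the dedicated scheduler needs the nodes back, might be sketched as follows. The `Node` class, the `backfill` and `reclaim` functions, and the queue handling are hypothetical names chosen for this illustration; checkpointing is modeled simply as putting the job back on the queue.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    dedicated_job: Optional[str] = None      # placed by the dedicated scheduler
    opportunistic_job: Optional[str] = None  # backfill job, can be checkpointed

def backfill(nodes, queue):
    """Fill holes: start queued opportunistic jobs on completely idle nodes."""
    for node in nodes:
        if not queue:
            break
        if node.dedicated_job is None and node.opportunistic_job is None:
            node.opportunistic_job = queue.pop(0)

def reclaim(nodes, job, count, queue):
    """Start a dedicated job on `count` nodes, checkpointing backfill jobs."""
    available = [n for n in nodes if n.dedicated_job is None]
    if len(available) < count:
        return False
    # Prefer idle nodes so that as few opportunistic jobs as possible are disturbed.
    available.sort(key=lambda n: n.opportunistic_job is not None)
    for node in available[:count]:
        if node.opportunistic_job is not None:
            # "Checkpoint": the job returns to the queue to resume elsewhere later.
            queue.append(node.opportunistic_job)
            node.opportunistic_job = None
        node.dedicated_job = job
    return True

nodes = [Node("n1", dedicated_job="mpi-A"), Node("n2"), Node("n3"), Node("n4")]
serial_queue = ["s1", "s2"]
backfill(nodes, serial_queue)   # n2 runs s1, n3 runs s2, n4 stays idle
started = reclaim(nodes, "mpi-B", 2, serial_queue)
```

Here the second dedicated job takes the idle node n4 first and then displaces only one backfill job (s1 on n2), which goes back on the queue rather than being lost; s2 keeps running on n3.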