Presentation Transcript
Condor Parallel Universe : Condor Parallel Universe
Overview : Overview Task vs. Job Parallelism
New Condor support for Task-Parallelism
Other goodies
The Talk in one Slide : The Talk in one Slide
Parallel Universe can run any* task parallel job
Not just MPICH 1.2.4
Not just MPI…
Job vs Task Parallelism : Job vs Task Parallelism
Condor historically focused on Job Parallelism
Job parallelism either manually or via DAGMan
Rest of talk on task parallelism
Can also get task parallelism via PVM or MW
Parallel Universe : Parallel Universe Adaptation of MPI universe
Modifications based on experience with MPI
User feedback
But, more than just MPI
MPI lifecycle without Condor : MPI lifecycle without Condor LAM version
lamboot: lamboot -ssi boot ssh machine_file
mpirun: mpirun -np 8 exe arg1 arg2...
lamhalt: lamhalt
Scheduling : Scheduling Need 'Dedicated Scheduler'
'Dedicated' has a specific Condor meaning
Nodes running MPI require a dedicated scheduler
A given machine can have many opportunistic schedulers
... but only 1 dedicated scheduler
DedicatedScheduler surprises : DedicatedScheduler surprises DedicatedScheduler co-opts normal negotiation cycle
Preemption and scheduling work differently than with opportunistic schedulers
DedicatedScheduler schedules First-Fit, sorted by UserJobPrio
condor_q -analyze mystery!
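To make the first-fit bullet concrete, here is a toy sketch in shell. It is not Condor's scheduler code: the job list, the "name priority machines" record format, and the `first_fit` function are made up for illustration, and it assumes that a higher priority number sorts first.

```shell
#!/bin/sh
# Toy first-fit pass: NOT Condor's code, only an illustration.
# Input on stdin: one job per line, "name priority machines_needed".
# $1: number of idle dedicated machines (hypothetical pool size).
first_fit() {
    free=$1
    # Sort by priority, highest first (assumed ordering), then walk the list.
    sort -k2,2nr | while read -r name prio need; do
        if [ "$need" -le "$free" ]; then
            # The job fits in what is left, so it starts now.
            free=$((free - need))
            echo "run  $name (prio $prio, $need machines, $free left)"
        else
            # The job does not fit; later, smaller jobs may still start.
            echo "skip $name (prio $prio, needs $need, only $free free)"
        fi
    done
}

printf '%s\n' 'jobA 5 8' 'jobB 10 6' 'jobC 1 2' | first_fit 8
```

With 8 idle machines, the highest-priority job (jobB) starts first, jobA no longer fits and is passed over, and the small jobC still starts. A large job being passed over while smaller, lower-priority jobs run is one behavior that can surprise users of a first-fit policy.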
Job startup : Job startup Same file transfer, etc. as Vanilla
One shadow, many starters
Starter runs sshd on all machines, does key exchange
Starter runs the exe on first machine
(head node, Rank0)
Your script Here : Your script Here Script on the head node has contact file
We provide samples for LAM, MPICH
We try to mimic 'by hand' startup
Use condor_ssh to start remote jobs
When script exits, condor cleans up
Parallel Example : Parallel Example [Diagram: a submit machine running the Schedd, and three execute machines, each running a Startd, an Sshd, and a Job]
Example submit file : Example submit file
Universe = Parallel
# executable is a script
executable = script
# the real binary
transfer_input_files = executable
arguments = arg1 arg2 arg3
machine_count = 8
output = out.$(Cluster).$(NODE)
queue
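As a rough illustration of what the output line in this submit file produces, the loop below prints the file names one would expect for machine_count = 8. It assumes that $(NODE) counts from 0 and uses a made-up cluster number of 42; the `expected_outputs` function is hypothetical, not a Condor tool.

```shell
#!/bin/sh
# Print the per-node output file names implied by
#   output = out.$(Cluster).$(NODE)
# Assumption: node numbers run from 0 to machine_count - 1.
expected_outputs() {
    cluster=$1        # cluster id assigned at submit time (42 below is made up)
    machine_count=$2  # same value as machine_count in the submit file
    node=0
    while [ "$node" -lt "$machine_count" ]; do
        echo "out.$cluster.$node"
        node=$((node + 1))
    done
}

expected_outputs 42 8
```

Each node writes its own file, so output from the eight processes does not interleave.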
Example Script : Example Script
chmod 755 simple
lamboot -ssi boot rsh $MACHINE_FILE
mpirun -np $NO_MACHINES simple
lamhalt
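A slightly more defensive variant of this script stops at the first failing step but still runs lamhalt on the way out. The sketch below does that with `set -e` and an EXIT trap inside a subshell. The `DRY_RUN` switch and the `run` and `lam_job` helpers are additions for illustration, so the flow can be tried on a machine without LAM; MACHINE_FILE and NO_MACHINES are the variables from the slide, given placeholder values here.

```shell
#!/bin/sh
# Sketch: LAM lifecycle with guaranteed teardown.
# DRY_RUN, run() and lam_job() are illustrative, not part of Condor or LAM.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# lam_job BINARY [ARGS...]: boot LAM, run the MPI binary, always halt LAM.
lam_job() {
    # Run in a subshell so the EXIT trap fires when this job ends,
    # whether or not the earlier steps succeeded.
    (
        set -e
        trap 'run lamhalt' EXIT
        run chmod 755 "$1"
        run lamboot -ssi boot rsh "$MACHINE_FILE"
        run mpirun -np "$NO_MACHINES" "$@"
    )
}

# Dry-run demo with placeholder values.
MACHINE_FILE=machines.txt
NO_MACHINES=8
DRY_RUN=1
lam_job ./simple
```

In the dry run the four commands are only printed, in order, with lamhalt last. Note that `set -e` has no effect if `lam_job` is itself called as the condition of an `if` or on the left of `||`, so the function is best called as a plain command.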
Example submit file 2 : Example submit file 2
Universe = Parallel
Requirements = (Hostname == "somemachine")
queue
Requirements = (Hostname != "somemachine")
queue 7
Example Script 2 : Example Script 2
mach1=`sed -n 1p $MACHINE_FILE`
mach2=`sed -n 2p $MACHINE_FILE`
./server &
ssh $mach1 client_app
ssh $mach2 client_app
wait
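The sed idiom in this script simply prints line N of the machine file. The sketch below shows it in isolation, assuming the file lists one hostname per line; the real contact-file format may differ, and the host names and the `nth_machine` helper are invented for the demo.

```shell
#!/bin/sh
# nth_machine N FILE: print the Nth line of FILE (1-based).
# Assumes one hostname per line.
nth_machine() {
    sed -n "${1}p" "$2"
}

# Demo with a throwaway machine file and made-up host names.
demo_file=$(mktemp)
printf '%s\n' node-a.example.org node-b.example.org node-c.example.org > "$demo_file"

mach1=$(nth_machine 1 "$demo_file")
mach2=$(nth_machine 2 "$demo_file")
echo "first client host:  $mach1"
echo "second client host: $mach2"

rm -f "$demo_file"
```

Note that shell assignments must not have spaces around the `=` sign, which is why the helper's result is captured as `mach1=$(...)`.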
Summary : Summary With Parallel Universe in Condor 6.8 comes:
Support for most MPI implementations (some scripting required)
Somewhat better MPI scheduling
Better node placement via condor matchmaking
Questions? : Questions?
Thank you!