Parallel Medical and Genomics Applications on Power3 and Power4 Machines: Parallel Medical and Genomics Applications on Power3 and Power4 Machines Amit Majumdar
San Diego Supercomputer Center - UCSD
Application I : Brain Deformation Simulation in Image Guided Neurosurgery
Simon K. Warfield1, Florin Talos1,2, Alida Tei1,3, Aditya Bharatha1,4, Arya Nabavi1,2, Matthieu Ferrante1,5, Peter McL. Black2, Ferenc A. Jolesz1, Ron Kikinis1, Corey Kemper1
1Surgical Planing Laboratory and 2Dept. of Surgery Brigham and Women’s Hospital and Harvard Medical School 3Massashusetts Institute of Technology 4University of Toronto Medical School 5Telecom. Lab., Universite’ Catholique de Louvain, Belgium
Application II : Monte Carlo SPECT Imaging
Yuni Dewaraja1, Kenneth Koral1, Abhijit Bose2, Michael Ljungberg3
1Nuclear Medicine, University of Michigan
2Center for Advanced Computing, University of Michigan
3Department of Radiation Physics, Univ. of Lund, Sweden
Application III : Parallel Proteomics Application
John R. Yates1, Daniel J. Carucci2 ,Giri Chukkapalli3, Robert Sinkovits3
1The Scripps Research Institute
2Naval Medical Research Center, US NAVY
3SDSC
IBM Parallel Machines : compute nodes: IBM Parallel Machines : compute nodes Power2 and Power3 at CAC University of Michigan :
176 160Mhz Power2; 1Gbytes and 256 Mbytes memory; three 110Mbytes/sec HP switches
3 Power3 nodes, 8 cpu/node, 375 Mhz Power3,8 Gbytes/node; 420 Mbytes/sec Colony switch
Power3 at SDSC : 144 nodes with 8zcpu/node, 375 Mhz Power3; 4 Gbytes/node; 350 Mbytes/sec Colony switch
Power3 at NAVO MSRC : 334 nodes with 4 cpu/node, 375 Mhs
Power4 at TACC University of Texas : 4 Regatta-HPC frames; 16 cpu/node; 1.3 Ghz ; 32 Gbytes/node (can be 64 procs 128 Gbytes memory machine by LL); high-speed dual-plane IBM SP Switch2
Application I : Brain Deformation Simulation in Image Guided Neurosurgery: Application I : Brain Deformation Simulation in Image Guided Neurosurgery
Brain Deformation Simulation in Image Guided Neurosurgery: Brain Deformation Simulation in Image Guided Neurosurgery Challenge faced by neurosurgeons
Remove as much as possible tumor tissue while minimizing removal of healthy tissue
Avoid critical anatomical structures
Real Time Brain Mapping
Enhanced visualization of tumor and critical brain structures
Align preoperatively acquired image data with intraoperative images of patient’s brain during surgery
Real time constraints
The code must meet real-time constraints of neurosurgery – provide images within few minutes few times during surgery lasting few hours
Algorithm: Algorithm Project preoperative image onto intraoperative images
Allows fusion of images from multiple imaging modalities and with multiple contrast types
Tracks surfaces of key structures in intraoperatively acquired images – allows projection of preoperative images into the patient’s brain configuration during surgery
A volumetric deformation field is inferred from the surface changes
The field captures nonrigid deformations of the shape of the brain due to brain swelling, cerebrospinal fluid loss, anaesthetic agents and actions of neurosurgeon
Current model uses linear elastic material model to represent brain
Overall Process: Overall Process Before Image Guided Neurosurgery :
During Image Guided Neurosurgery :
Segmentation and Visualization Preoperative Planning of
Surgical Trajectory Preoperative
Data Acquisition Preoperative data
Intraoperative MRI Segmentation Registration Surface
matching Solve biomechanical
Model for volumetric
deformation Visualization Surgical
process
Volumetric Biomechanical Simulation of Brain Deformation: Volumetric Biomechanical Simulation of Brain Deformation During surgery brain shape changes due to surgical intervention
During surgery surgeon can acquire new volumetric MRI to review current configuration of the entire brain
Volumetric Biomechanical Simulation of Brain Deformation
Match surface from earlier acquisition to the new acquisition
Infer volumetric deformation based upon the surface correspondences
Apply forces to the volumetric model that will produce the same displacement field at the surface as was obtained by the surface matching
Biomechanical model allows the computation of the deformation throughout the volume
Biomechanical Simulation Equations: Biomechanical Simulation Equations
Biomechanical Simulation Equations: Biomechanical Simulation Equations Mathematical operations, plugging in of interpolation of nodes in terms of linear functions etc. etc. finally gives :
Ku = -F
K is the stiffness matrix.
Displacement at the boundary surface nodes are fixed to match those generated by the active surface model
The force vector F is set equal to the displacement vector for the boundary nodes :
F =
Now solving matrix system for unknown displacement produces deformation field for the entire mesh that matches prescribed displacements at the boundary
Signa SP (GE Medical Systems): Signa SP (GE Medical Systems) R. Pergolizzi
Brain shift (1): Brain shift (1) F. Talos
Brain shift (2): Brain shift (2) F. Talos
Linear System Solver: Linear System Solver The PETSc package is used to solve the linear system
Generalized Minimal Residual (GMRES) solver with block Jacobi preconditioning
The rows of matrix are divided equally amongst CPUs
Global matrix is assembled in parallel
Each CPU assembles the local matrix for each element in its subdomain
Each CPU has equal # of rows to process
Due to irregular connectivity of the meshes some CPUs may do more work than others
Performance on Power3 and Power4 Machines: Performance on Power3 and Power4 Machines
Timing Table: Timing Table
Observations: Observations Power4 timings:
Unexpected timings on Power4 16 and 32 processors
Scheduler gives exclusive access to nodes and CPUs
Cache , network ?
Power3 timing is consistent (with other machines)
Overall scaling is not good beyond few processors
Serial I/O part contributes to this
Petsc performance : (GMRES with block jacobi precond)
Linear system solver MFLOPS scale well with # of procs on Power3 (have not checked on Power4 yet)
Petsc sparse matrix storage allocation is efficient
# of GMRES iterations increase with # of processors – 41 to 135 iterations on 1 to 16 procs respectively - contributes partially to scaling
Future plan is to investigate scaling further
Investigate viscoelastic modelling ($funding$)
Thanks to Petsc group (Barry Smith) for valuable discussions
Application II : Monte Carlo SPECT Imaging: Application II : Monte Carlo SPECT Imaging
Slide18: Radionuclide therapy
SPECT imaging in radionuclide therapy
Monte Carlo simulation of SPECT imaging
SIMIND code
Applications
Parallel Monte Carlo code and performance
Radionuclide therapy: Radionuclide therapy Cancer cells are sterilized using internally administered ionizing radiation
Some therapeutic isotopes, ex. I-131, produce both beta particles and gamma ray photons
Beta particles kill tumor cells. Beta pathlength span several cells.
Photons used to image radioactivity distribution within patient
Radionuclide therapy has less toxic effect on normal tissue than chemotherapy.
I-131 Radionuclide therapy: I-131 Radionuclide therapy I-131 Radioimmunotherapy (RIT) : I-131 labeled antibodies selectively target radioactivity to tumor cells while sparing normal tissue.
Shows promise for the treatment of non-Hodgkin’s lymphoma (NHL). NHL is the fifth leading cause of cancer death. Median survival 6-10 years.
I-131 MIBG
Shows promise for the treatment of metastatic neuroblastoma which is a childhood cancer with poor long term survival.
I-131 RIT for NHL at University of Michigan: I-131 RIT for NHL at University of Michigan Phase II clinical trial: Out of 76 patients with no previous treatment, 48 achieved a complete response and 26 achieved a partial response.
Patient-specific infusion
Tracer dose: ~ 5 mCi for dosimetry studies to determine therapeutic dose for each patient
Therapeutic dose: 50 -100 mCi one week later
I-131 imaging: I-131 imaging Single photon emission computed tomography (SPECT) imaging using a rotating gamma camera
Components of the gamma camera
Lead collimator
Detection medium - scintillation crystal
Electronics
Tomographic reconstruction of SPECT data produces a 3-D image of the radioactivity distribution within the patient.
Activity quantification: Activity quantification For accurate quantification, SPECT data has to be compensated for
Patient attenuation
Patient scatter
Camera response
What is the role of M.C. simulation in SPECT Imaging?: What is the role of M.C. simulation in SPECT Imaging? Monte Carlo is used primarily to evaluate compensation methods for scatter, attenuation and camera response and to evaluate the overall accuracy of our clinical activity quantification.
M.C. is ideal for such evaluations because unlike in experiments, the details of photon histories are known.
M.C. simulation of SPECT imaging: M.C. simulation of SPECT imaging SIMIND Monte Carlo code
Complete photon transport in phantom and SPECT camera
Complex source distributions: analytical or digital phantoms
Relatively fast
Code has been verified by experiment
SIMIND verification using thorax phantom: SIMIND verification using thorax phantom Measurement with experimental phantom Simulation with byte-coded digital phantom based on CT images
SIMIND verification: thorax phantom: SIMIND verification: thorax phantom
M.C. applications: SPECT quantification accuracy using voxel-man phantom: M.C. applications: SPECT quantification accuracy using voxel-man phantom Clinically realistic case
Voxel-man is based on patient CT
Realistic activity distribution in organs and tumor
Quantification error for large, spherical tumors < 3%
Simulation time: 220 hours using 16 SP2 processors (60 projections; 1 billion photon per projection)
Work in progress: Monte Carlo generated patient-specific recovery coefficients: Work in progress: Monte Carlo generated patient-specific recovery coefficients
A parallel M.C. code for SPECT: motivation: A parallel M.C. code for SPECT: motivation Fast code for accurate I-131 Monte Carlo simulations
I-131 M.C. simulations are computationally tedious
Physical modeling of collimator
Variance reduction limited
SPECT require a large number of projections
Realistic simulations using high resolution voxel phantoms
When all of the above are included in a simulation, CPU time can be several months using the serial SIMIND code
SIMIND parallelization: SIMIND parallelization
In the present application the photon histories are independent of each other - “inherently parallel”
Critical to have a good parallel RNG (SPRNG)
Each processor performs entire simulation and reports results to host processor
Code replicated in each of the N processors
Host sums N partial results and calculates the final result
Standard deviation of the combined result is improved by 1/sqrt(N)
Minimal changes to original SIMIND code
Slide32: host processor other processors
SP2 timing results of one SPECT projection of the voxel-man phantom: SP2 timing results of one SPECT projection of the voxel-man phantom
Power3 timing results of one SPECT projection of the voxel-man phantom: Power3 timing results of one SPECT projection of the voxel-man phantom
Power4 Timing of a different Monte Carlo photon transport codeMay, 2002 on IBM San Mateo center 32 procs Power4 with large page set ( early access to machine may have contributed to results below): Power4 Timing of a different Monte Carlo photon transport code May, 2002 on IBM San Mateo center 32 procs Power4 with large page set ( early access to machine may have contributed to results below)
More Realistic Simulations: More Realistic Simulations
A 60-projection SPECT simulation of the voxel-man phantom simulation of the voxel-phantom.
Realistic values were used for the activity concentration ratios in several organs and tumors (based on typical I-131 RIT patient studies at U. Michigan clinic): kidneys, 81; liver, 26; lungs, 26; spleen 53; blood pool, 48; 100 cc spherical tumor, 100; 50 cc spherical tumor, 100; 20 cc spherical tumor, 100; all other structures, 4. The SPECT matrix size was 64x64x64 with pixel size of 0.8 cm x 0.8 cm x 0.4 cm. Up to 3 orders of scatter was modeled. 1 billion photons were simulated for each projection.
The simulation time on Power3 using 512 processors was 6.5 hours for all 60 projections (time for each projection was 6.5 min).
Effect of Parallel Computing: Effect of Parallel Computing Monte Carlo has enabled to evaluate and improve the quantification of I-131 tumor uptake for dosimetry in NHL patients undergoing RIT at U. Michigan clinic.
May lead to statistically significant dose-response relationships
Speed-up due to the parallel SIMIND code has enabled us to carry out clinically realistic simulations using voxel-phantoms.
In the future we will carry out M.C. based dose calculations
More accurate
Tumor and organ dose distributions
Application III : Parallel Proteomics Application: Application III : Parallel Proteomics Application
Slide39: Sequest is a proteomics application used to analyze mass spectrometer output and match to protein database to identify proteins
The Naval Medical Research Center (NMRC) is using Sequest in the research to develop a malaria vaccine based on the expression of proteins in the various stages of the malaria parasite.
Performance of the serial Sequest is currently was a major bottleneck in malaria vaccine project
SDSC computational scientists developed a parallel version of the Sequest code to reduce simulation time significantly
Slide40: More individuals on the planet with malaria today then ever before in history
300-500 million people become ill with malaria each year
1-3 million children die each year from malaria (200-300 per hour)
Drug resistance is spreading rapidly
There is no licensed vaccine available anywhere in the world
Malaria is a major cause of illness in US troops overseas
Facts about Malaria An efficient vaccine should be achievable:
Immunity can be acquired naturally
Irradiated sporozoites provide > 95% protection
Vaccines targeting single proteins were disappointing
Current strategy: multistage multicomponent vaccine
Slide41: A Proteomics View of the Malaria Parasite Life Cycle OOne Genome:
~6,000 genes
DDifferent Proteomes:
Distinct Stages Comprehensively Analyze Protein Complements from 4 P. falciparum Cell Types
Identify Stage-specific Targets for Drug and Vaccine Development
Importance of a malaria vaccine: Importance of a malaria vaccine The battle against malaria hampered by the emergence of drug resistant strains of Plasmodium falciparum, the parasite responsible for majority of malaria infections.
Most cases of malaria are concentrated in the world's poorest countries. Malaria vaccine likely to be affordable alternative to expensive drugs.
Slide43: Plasmodium falciparum
Sporozoites, Trophozoites,
Merozoites, Gametocytes Digestion Lysis High-Throughput Proteomics: MudPIT
Single Processor Optimization: Single Processor Optimization
( Timings on 375 MHz IBM Power3 procs.TF:ThermoFinnigan;TSRI:The Scripps Research institute)
Code parallelization: Code parallelization Parallel version of Sequest has been developed using MPI and incorporating all single processor optimizations.
Parallelization was done so that all the processors work on a different file simultaneously; files are picked up from a list in round robin distribution by the processors
Tests show that good load balancing is achieved by distributing units of work in a round robin fashion.
Benchmarks show almost linear scaling on thousands of files
Database file (~30 Mb) currently read in once per input file (one test case has ~30,000 input file ; ~1000 procs)
Future plan : once per MPI process (reduces I/O from ~ 1 Tb to 30 Gb); eventually database file will be read once by single node
Power 3 Timing and Speedup: Power 3 Timing and Speedup
Power3 versus Power4 Speedup on 32 procs : Power3 versus Power4 Speedup on 32 procs
Impact on science: Impact on science Calculations (anlysis of part of a whole cell lysate of a merozoite sample: late-blood stage in the malaria lifecycle) which would have required 30 days on a single processor now require less than an hour on ~1000 processors of IBM SP Power3 NAVO machine
2x speedup due to single processor tuning observed
and ~1000x speedup from parallelization observed
Sequest is estimated to be in use at 500 laboratories worldwide – this work impacts entire proteomics community
Slide49: Final Comments Genomics community has already extensively used supercomputing capabilities and will continue to use
Now the proteomics community will use supercomputing more and more
Medical community is starting to use parallel computing in clinical and operation room procedures such as imaging, modeling of physical organs and their functions etc.