logging in or signing up harvard deas Techy_Guy Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 234 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: January 03, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Slide1: David E. Keyes Center for Computational Science Old Dominion University Institute for Computer Applications in Science & Engineering (ICASE) NASA Langley Research Center Institute for Scientific Computing Research (ISCR) Lawrence Livermore National Laboratory Algorithms for Terascale Computation of PDEs and PDE-constrained OptimizationIn memoriam, Joseph-Louis Lagrange: In memoriam, Joseph-Louis Lagrange Born 1736, Turin Died 10 April 1813, Paris Winner, Prix de l’Académie des Sciences, 1772, 1774, 1780 Mécanique Analytique, 1788 Elected F.R.S., 1791 Théorie des Fonctions Analytiques, 1797 Slide3: “As long as algebra and geometry have been separated, their progress has been slow and their uses limited; but since these two sciences have been united, they have lent each mutual forces, and have marched together towards perfection.” Another Quotation of LagrangePlan of presentation: Plan of presentation Imperative of “optimal” algorithms for terascale computing Basic domain decomposition and multilevel algorithmic concepts Illustrations of solver performance on ASCI platforms (with a little help from my friends…) Terascale Optimal PDE Simulations (TOPS) software project of the U.S. DOE ConclusionsTerascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Applied Physics radiation transport supernovae Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Scientific Simulation Experiments dangerous In these, and many other areas, simulation is an important complement to experiment.Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Experiments prohibited or impossible Scientific Simulation Experiments dangerous In these, and many other areas, simulation is an important complement to experiment.Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Experiments prohibited or impossible Scientific Simulation Experiments dangerous In these, and many other areas, simulation is an important complement to experiment. Experiments difficult to instrumentTerascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Experiments prohibited or impossible Scientific Simulation Experiments dangerous In these, and many other areas, simulation is an important complement to experiment. Experiments difficult to instrument Experiments expensive Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Experiments prohibited or impossible Scientific Simulation Experiments dangerous However, simulation is far from proven! To meet expectations, we need to handle problems of multiple physical scales. Experiments difficult to instrument Experiments expensive Large platforms have been provided: ‘97 ‘98 ‘99 ‘00 ‘01 ‘02 ‘03 ‘04 ‘05 ‘06 100+ Tflop / 30 TB Time (CY) Capability 1+ Tflop / 0.5 TB 30+ Tflop / 10 TB Red 3+ Tflop / 1.5 TB Blue 10+ Tflop / 4 TB White 50+ Tflop / 25 TB Large platforms have been provided ASCI program of the U.S. DOE has roadmap to go to 100 Tflop/s by 2004 www.llnl.gov/asci/platforms Sandia Los Alamos Livermore LivermoreNSF’s 13.6 TF TeraGrid coming on line: NSF’s 13.6 TF TeraGrid coming on line 26 24 8 4 HPSS 5 HPSS HPSS UniTree External Networks External Networks External Networks External Networks Site Resources Site Resources Site Resources Site Resources TeraGrid: NCSA, SDSC, Caltech, Argonne www.teragrid.org c/o I. FosterBuilding platforms is the “easy” part: Building platforms is the “easy” part Algorithms must be highly concurrent and straightforward to load balance latency tolerant cache friendly (temporal and spatial locality of reference) highly scalable (in the sense of convergence) Goal for algorithmic scalability: fill up memory of arbitrarily large machines while preserving constant running times with respect to proportionally smaller problem on one processor Domain-decomposed multilevel methods “natural” for all of these Domain decomposition also “natural” for software engineeringAlgorithmic requirementsfrom architecture: Algorithmic requirements from architecture Must run on physically distributed memory units connected by message-passing network, each serving one or more processors with multiple levels of cache T3E “horizontal” aspects “vertical” aspects message passing, shared memory threads register blocking, cache blocking, prefetchingKeyword: “Optimal”: Keyword: “Optimal” Convergence rate nearly independent of discretization parameters Multilevel schemes for rapid linear convergence of linear problems Newton-like schemes for quadratic convergence of nonlinear problems Convergence rate as independent as possible of physical parameters Continuation schemes Physics-based preconditioningWhy Optimal Algorithms?: Why Optimal Algorithms? The more powerful the computer, the greater the importance of optimality Example: Suppose Alg1 solves a problem in time CN2, where N is the input size Suppose Alg2 solves the same problem in time CN Suppose that the machine on which Alg1 and Alg2 have been parallelized to run has 10,000 processors In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger Why Optimal?, cont.: Why Optimal?, cont. Alternatively, filling the machine’s memory, Alg1 requires 100X time, whereas Alg2 runs in constant time Is 10,000 processors a reasonable expectation? Yes, we have it today (ASCI White)! Could computational scientists really use 10,000X? Of course; we are approximating the continuum A grid for weather prediction allows points every 1km versus every 100km on the earth’s surface In 2D 10,000X disappears fast; in 3D even faster However, these machines are expensive (ASCI White is $100M, plus ongoing operating costs), and optimal algorithms are the only algorithms that we can afford to run on them Decomposition strategies for Lu=f in : Decomposition strategies for Lu=f in Operator decomposition Function space decomposition Domain decompositionOperator decomposition: Operator decomposition Consider ADI Iteration matrix consists of four sequential (“multiplicative”) substeps per timestep two sparse matrix-vector multiplies two sets of unidirectional bandsolves Parallelism within each substep But global data exchanges between bandsolve substeps Function space decomposition: Function space decomposition Consider a spectral Galerkin method Domain decomposition: Domain decomposition Consider restriction and extension operators for subdomains, , and for possible coarse grid, Replace discretized with Solve by a Krylov method, e.g., CG Matrix-vector multiplies with parallelism on each subdomain nearest-neighbor exchanges, global reductions possible small global system (not needed for parabolic case)Comparison: Comparison Operator decomposition (ADI) natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose) Function space decomposition (Fourier) natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform) Domain decomposition (Schwarz) natural domain-based assignment requires local (nearest neighbor) data exchanges, global reductions, and optional small global problem Theoretical scaling of domain decomposition (for three common network topologies): Theoretical scaling of domain decomposition (for three common network topologies) With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest neighbor interconnects: optimal number of processors scales linearly with problem size (“scalable”, assumes one subdomain per processor) With power-law-time (3D torus-based) global reductions and scalable nearest neighbor interconnects: optimal number of processors scales as three-fourths power of problem size (“almost scalable”) With linear-time (common bus) network: optimal number of processors scales as one-fourth power of problem size (*not* scalable) bad news for conventional Beowulf clusters, but see 2000 & 2001 Bell Prize “price-performance awards” using multiple commodity NICs per Beowulf node! Three Basic Concepts: Three Basic Concepts Iterative correction Schwarz preconditioning Schur preconditioningIterative correction: Iterative correction The most basic idea in iterative methods Evaluate residual accurately, but solve approximately, where is an approximate inverse to A A sequence of complementary solves can be used, e.g., with first and then one has Optimal polynomials of lead to various preconditioned Krylov methods Scale recurrence, e.g., with , leads to multilevel methods Multilevel Preconditioning: Finest Grid Multilevel PreconditioningSchwarz Preconditioning: Schwarz Preconditioning Given A x = b , partition x into subvectors, corresp. to subdomains of the domain of the PDE, nonempty, possibly overlapping, whose union is all of the elements of Let Boolean rectangular matrix extract the subset of : Let The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain supportIteration count estimates from the Schwarz theory: Iteration count estimates from the Schwarz theory In terms of N and P, where for d-dimensional isotropic problems, N=h-d and P=H-d, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows: Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square-root of the condition number of the Schwarz-preconditioned systemSchur Preconditioning: Schur Preconditioning Given a partition Condense: Let M be a good preconditioner for S Then is a preconditioner for A Moreover, solves with may be done approximately if all degrees of freedom are retained Schwarz polynomials: Schwarz polynomials Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively) Schwarz-on-Schur: Schwarz-on-Schur Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem “Neumann-Neumann” alg Multigrid on the Schur complement Schwarz-inside-Schur: Schwarz-inside-Schur Consider Newton’s method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization Constraint Objective Lagrangian Form the gradient of the Lagrangian with respect to each of x, u, and : Schwarz-inside-Schur: Schwarz-inside-Schur Equality constrained optimization leads to the KKT system for states x , designs u , and multipliers Then Schwarz-inside-Schur, cont.: Schwarz-inside-Schur, cont. Problems is the Jacobian of a PDE huge! involve Hessians of objective and constraints second derivatives and huge H is unreasonable to form, store, or invert Solutions Use Schur preconditioning on full system Form forward action of Hessians by automatic differentiation (vector-to-vector map) Form approximate inverse action of state Jacobian and its transpose by Schwarz Nonlinear Schwarz preconditioning: Nonlinear Schwarz preconditioning Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free It replaces with a new nonlinear system possessing the same root, Define a correction to the partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system: where is nonzero only in the components of the partition Then sum the corrections: Nonlinear Schwarz, cont.: Nonlinear Schwarz, cont. It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root then and have the same unique root To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate for any : The residual The Jacobian-vector product Remarkably, (Cai-Keyes, 2000) it can be shown that where and All required actions are available in terms of ! Experimental example of nonlinear Schwarz: Experimental example of nonlinear Schwarz “Unreasonable effectiveness” of Schwarz: “Unreasonable effectiveness” of Schwarz When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right! Good decompositions are a compromise between conditioning and parallel complexity, in practice Let be a complete set of orthonormal row eigenvectors for A : orSome anecdotal successes: Some anecdotal successes Newton-Krylov-Schwarz Computational aerodynamics (Anderson et al., NASA, ODU, ANL) FETI (Schwarz on Schur) Computational structural dynamics (Farhat & Pierson, CU-Boulder) LNKS (Schwarz inside Schur) PDE-constrained optimization (Biros, NYU & Ghattas, CMU)Newton-Krylov-Schwarz: Newton-Krylov-Schwarz Newton nonlinear solver asymptotically quadratic Krylov accelerator spectrally adaptive Schwarz preconditioner parallelizable Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994)Jacobian-Free Newton-Krylov Method: Jacobian-Free Newton-Krylov Method In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products These are approximated by the Fréchet derivatives so that the actual Jacobian elements are never explicitly needed, where is chosen with a fine balance between approximation and floating point rounding error Schwarz preconditions, using approximate elements Computational Aerodynamics: Computational Aerodynamics mesh c/o D. Mavriplis, ICASE Implemented in PETSc www.mcs.anl.gov/petsc Transonic “Lambda” Shock, Mach contours on surfacesFixed-size Parallel Scaling Results: Fixed-size Parallel Scaling Results c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith This scaling study, featuring our widest range of processor number, was done for the incompressible case. Fixed-size Parallel Scaling Results on ASCI RedONERA M6 Wing Test Case, Tetrahedral grid of 2.8 million vertices on up to 3072 ASCI Red Nodes (Pentium Pro 333 MHz processors): Fixed-size Parallel Scaling Results on ASCI Red ONERA M6 Wing Test Case, Tetrahedral grid of 2.8 million vertices on up to 3072 ASCI Red Nodes (Pentium Pro 333 MHz processors)PDE Workingsets: PDE Workingsets Smallest: data for single stencil Largest: data for entire subdomain Intermediate: data for a neighborhood collection of stencils, reused as possibleImprovements Resulting from Locality Reordering: Improvements Resulting from Locality ReorderingCache Traffic for PDEs: Cache Traffic for PDEs As successive workingsets “drop” into a level of memory, capacity (and with effort conflict) misses disappear, leaving only compulsory, reducing demand on main memory bandwidth Slide49: FETI-DP for structural mechanics 1mdof 4mdof 9mdof 18mdof 30mdof 60mdof c/o C. Farhat and K. Pierson Numerically scalable, hardware scalable solutions for realistic solid/shell models Used in Sandia codes Salinas, Adagio, Andante PDE-constrained Optimization: PDE-constrained Optimization c/o G. Biros and O. Ghattas Lagrange-Newton-Krylov-Schur implemented in Veltisto/PETSc Optimal control of laminar viscous flow optimization variables are surface suction/injection objective is minimum drag 700,000 states; 4,000 controls 128 Cray T3E processors ~5 hrs for optimal solution (~1 hr for analysis) www.cs.nyu.edu/~biros/veltisto/Slide51: Lab-university collaborations to develop “Integrated Software Infrastructure Centers” (ISICs) and partner with application groups For FY2002, 51 new projects at $57M/year total Approximately one-third for ISICs A third for grid infrastructure and collaboratories A third for applications groups 5 Tflop/s IBM SP platforms “Seaborg” at NERSC (#3 in latest “Top 500”) and “Cheetah” at ORNL (being installed now) available for SciDAC Introducing “Terascale Optimal PDE Simulations” (TOPS) ISIC: Introducing “Terascale Optimal PDE Simulations” (TOPS) ISIC Nine institutions, $18M, five years, 24 co-PIs TOPS: TOPS Not just algorithms, but vertically integrated software suites Portable, scalable, extensible, tunable, modular implementations Starring PETSc and hypre, among other existing packages Background of PETSc Library(in which FUN3D example was implemented): Background of PETSc Library (in which FUN3D example was implemented) Developed by Balay, Gropp, McInnes & Smith (ANL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by four additional staff (Buschelman, Kaushik, Knepley, Zhang) under SciDAC Distributed data structures as fundamental objects - index sets, vectors/gridfunctions, and matrices/arrays Iterative linear and nonlinear solvers, combinable modularly and recursively, and extensibly Portable, and callable from C, C++, Fortran Uniform high-level API, with multi-layered entry Aggressively optimized: copies minimized, communication aggregated and overlapped, caches and registers reused, memory chunks preallocated, inspector-executor model for repetitive tasks (e.g., gather/scatter) See http://www.mcs.anl.gov/petscUser Code/PETSc Library Interactions: PETSc code User code Application Initialization Function Evaluation Jacobian Evaluation Post- Processing PC KSP PETSc Main Routine Linear Solvers (SLES) Nonlinear Solvers (SNES) Timestepping Solvers (TS) User Code/PETSc Library InteractionsUser Code/PETSc Library Interactions: PETSc code User code Application Initialization Function Evaluation Jacobian Evaluation Post- Processing PC KSP PETSc Main Routine Linear Solvers (SLES) Nonlinear Solvers (SNES) Timestepping Solvers (TS) User Code/PETSc Library Interactions To be AD codeBackground of Hypre Library(to be combined with PETSc under SciDAC): Background of Hypre Library (to be combined with PETSc under SciDAC) Developed by Chow, Cleary & Falgout (LLNL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by seven additional staff (Henson, Jones, Lambert, Painter, Tong, Treadway, Yang) under ASCI and SciDAC Object-oriented design similar to PETSc Concentrates on linear problems only Richer in preconditioners than PETSc, with focus on algebraic multigrid Includes other preconditioners, including sparse approximate inverse (Parasails) and parallel ILU (Euclid) See http://www.llnl.gov/CASC/hypre/Hypre’s “Conceptual Interfaces”: Hypre’s “Conceptual Interfaces” Slide c/o E. Chow, LLNLSample of Hypre’s Scaled Efficiency: Sample of Hypre’s Scaled Efficiency Slide c/o E. Chow, LLNLScope for TOPS: Scope for TOPS Design and implementation of “solvers” Time integrators, with sens. analysis Nonlinear solvers, with sens. analysis Optimizers Linear solvers Eigensolvers Software integration Performance optimization Optimizer Linear solver Eigensolver Time integrator Nonlinear solver Indicates dependence Sens. Analyzer TOPS Philosophy on PDEs: TOPS Philosophy on PDEs Solution of a system of PDEs is rarely a goal in itself PDEs are solved to derive various outputs from specified inputs Actual goal is characterization of a response surface or a design or control strategy Together with analysis, sensitivities and stability are often desired Tools for PDE solution should also support related desires TOPS Philosophy on Operators: TOPS Philosophy on Operators A continuous operator may appear in a discrete code in many different instances Optimal algorithms tend to be hierarchical and nested iterative Processor-scalable algorithms tend to be domain-decomposed and concurrent iterative Majority of progress towards desired highly resolved, high fidelity result occurs through cost-effective low resolution, low fidelity parallel efficient stages Operator abstractions and recurrence must be supportedIt’s 2002; do you know what your solver is up to?: It’s 2002; do you know what your solver is up to? Has your solver not been updated in the past five years? Is your solver running at 1-10% of machine peak? Do you spend more time in your solver than in your physics? Is your discretization or model fidelity limited by the solver? Is your time stepping limited by stability? Are you running loops around your analysis code? Do you care how sensitive to parameters your results are? If the answer to any of these questions is “yes”, you are a potential customer!TOPS project goals/success metrics: TOPS project goals/success metrics Understand range of algorithmic options and their tradeoffs (e.g., memory vs. time, inner iteration work vs. outer) Can try all reasonable options from different sources easily without recoding or extensive recompilation Know how their solvers are performing Spend more time in their physics than in their solvers Are intelligently driving solver research, and publishing joint papers with TOPS researchers Can simulate truly new physics, as solver limits are steadily pushed back (finer meshes, higher fidelity models, complex coupling, etc.) TOPS will have succeeded if users —Conclusions: Conclusions Domain decomposition and multilevel iteration the dominant paradigm in contemporary terascale PDE simulation Several freely available software toolkits exist, and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes Concerted efforts underway to make elements of these toolkits interoperate, and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and above all — adaptive! Many challenges loom at the “next scale” of computation Undoubtedly, new theory/algorithms will be part of the solutionAcknowledgments: Acknowledgments Collaborators or Contributors: George Biros (NYU) Xiao-Chuan Cai (Univ. Colorado, Boulder) Omar Ghattas (Carnegie-Mellon) Dinesh Kaushik (ODU) Dana Knoll (LANL) Dimitri Mavriplis (ICASE) Kendall Pierson (Sandia) PETSc team at Argonne National Laboratory: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith Sponsors: DOE, NASA, NSF Computer Resources: LLNL, LANL, SNL, NERSCRelated URLs: Related URLs Personal homepage: papers, talks, etc. http://www.math.odu.edu/~keyes SciDAC initiative http://www.science.doe.gov/scidac TOPS project http://www.math.odu.edu/~keyes/scidac PETSc project http://www.mcs.anl.gov/petsc Hypre project http://www.llnl.gov/CASC/hypre ASCI platforms http://www.llnl.gov/asci/platformsBibliography: Bibliography Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, to be submitted to J. Comp. Phys. Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, to appear in SIAM J. Sci. Comp. High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362 Four Horizons for Enhancing the Performance of Parallel Simulations based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci., Springer, 1900:1-17 Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136 Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99 Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278 How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297Agenda for future research: Agenda for future research High concurrency (100,000 processors) Asynchrony Fault tolerance Automated tuning Integration of simulation with studies of sensitivity, stability, and optimization High ConcurrencyToday Future: High Concurrency Today Future 100,000 processors, in a room or as part of a grid Most phases of DD computations scale well favorable surface-to-volume comm-to-comp ratio However, latencies will nix frequent exact reductions Paradigm: extrapolate data in retarded messages; correct (if necessary) when message arrives, such as in C(p,q,j) schemes by Garbey and Tromeur-Dervout 10,000 processors in a single room with tightly coupled network DD computations scale well, when provided with network rich enough for parallel near neighbor communication fast global reductions (complexity sublinear in processor count) AsynchronyToday Future: Asynchrony Today Future Adaptivity requirements and far-flung, nondedicated networks will lead to idleness and imbalance at synchronization points Need algorithms with looser outer loops than global Newton-Krylov Can we design algorithms that are robust with respect to incomplete convergence of inner tasks, like inexact Newton? Paradigm: nonlinear Schwarz with regional (not global) nonlinear solvers where most execution time is spent A priori partitionings for quasi-static meshes provide load-balanced computational tasks between frequent synchronization points Good load balance is critical to parallel scalability on 1,000 processors and more Fault ToleranceToday Future: Fault Tolerance Today Future c/o A. Geist With 100,000 processors or worldwide networks, MTBF will be in minutes Checkpoint-restart could take longer than the time to next failure Paradigm: naturally fault tolerant algorithms, robust with respect to failure, such as a new FD algorithm at ORNL Fault tolerance is not a driver in most scientific application code projects FT handled as follows: Detection of wrong System – in hardware Framework – by runtime env Library – in math or comm lib Notification of application Interrupt – signal sent to job Error code returned by app process Recovery Restart from checkpoint Migration of task to new hardware Reassignment of work to remaining tasks Automated TuningToday Future: Automated Tuning Today Future Less knowledgeable users required to employ parallel iterative solvers in taxing applications Need safe defaults and automated tuning strategies Paradigm: parallel direct search (PDS) derivative-free optimization methods, using overall parallel computational complexity as objective function and algorithm tuning parameters as design variables, to tune solver in preproduction trial executions Knowledgeable user-developers parameterize their solvers with experience and theoretically informed intuition for: problem size/processor ratio outer solver type Krylov solver type DD preconditioner type maximum subspace dimensions overlaps fill levels inner tolerances potentially many others Integrated SoftwareToday Future: Integrated Software Today Future Analysis increasingly an “inner loop” around which more sophisticated science-driven tasks are wrapped Need PDE task functionality (e.g., residual evaluation, Jacobian evaluation, Jacobian inverse) exposed to optimization/sensitivity/stability algorithms Paradigm: integrated software based on common distributed data structures Each analysis is a “special effort” Optimization, sensitivity analysis (e.g., for uncertainty quantification), and stability analysis to fully exploit and contextualize scientific results are rare Slide75: David E. Keyes Center for Computational Science Old Dominion University Institute for Computer Applications in Science & Engineering NASA Langley Research Center Institute for Scientific Computing Research Lawrence Livermore National Laboratory Domain Decomposition in the Mainstream of Computational Science You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
harvard deas Techy_Guy Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 234 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: January 03, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Slide1: David E. Keyes Center for Computational Science Old Dominion University Institute for Computer Applications in Science & Engineering (ICASE) NASA Langley Research Center Institute for Scientific Computing Research (ISCR) Lawrence Livermore National Laboratory Algorithms for Terascale Computation of PDEs and PDE-constrained OptimizationIn memoriam, Joseph-Louis Lagrange: In memoriam, Joseph-Louis Lagrange Born 1736, Turin Died 10 April 1813, Paris Winner, Prix de l’Académie des Sciences, 1772, 1774, 1780 Mécanique Analytique, 1788 Elected F.R.S., 1791 Théorie des Fonctions Analytiques, 1797 Slide3: “As long as algebra and geometry have been separated, their progress has been slow and their uses limited; but since these two sciences have been united, they have lent each mutual forces, and have marched together towards perfection.” Another Quotation of LagrangePlan of presentation: Plan of presentation Imperative of “optimal” algorithms for terascale computing Basic domain decomposition and multilevel algorithmic concepts Illustrations of solver performance on ASCI platforms (with a little help from my friends…) Terascale Optimal PDE Simulations (TOPS) software project of the U.S. DOE ConclusionsTerascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Applied Physics radiation transport supernovae Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Scientific Simulation In these, and many other areas, simulation is an important complement to experiment.Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Scientific Simulation Experiments dangerous In these, and many other areas, simulation is an important complement to experiment.Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Experiments prohibited or impossible Scientific Simulation Experiments dangerous In these, and many other areas, simulation is an important complement to experiment.Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Experiments prohibited or impossible Scientific Simulation Experiments dangerous In these, and many other areas, simulation is an important complement to experiment. Experiments difficult to instrumentTerascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Experiments prohibited or impossible Scientific Simulation Experiments dangerous In these, and many other areas, simulation is an important complement to experiment. Experiments difficult to instrument Experiments expensive Terascale simulation has been “sold”: Terascale simulation has been “sold” Environment global climate contaminant transport Experiments controversial Applied Physics radiation transport supernovae Experiments prohibited or impossible Scientific Simulation Experiments dangerous However, simulation is far from proven! To meet expectations, we need to handle problems of multiple physical scales. Experiments difficult to instrument Experiments expensive Large platforms have been provided: ‘97 ‘98 ‘99 ‘00 ‘01 ‘02 ‘03 ‘04 ‘05 ‘06 100+ Tflop / 30 TB Time (CY) Capability 1+ Tflop / 0.5 TB 30+ Tflop / 10 TB Red 3+ Tflop / 1.5 TB Blue 10+ Tflop / 4 TB White 50+ Tflop / 25 TB Large platforms have been provided ASCI program of the U.S. DOE has roadmap to go to 100 Tflop/s by 2004 www.llnl.gov/asci/platforms Sandia Los Alamos Livermore LivermoreNSF’s 13.6 TF TeraGrid coming on line: NSF’s 13.6 TF TeraGrid coming on line 26 24 8 4 HPSS 5 HPSS HPSS UniTree External Networks External Networks External Networks External Networks Site Resources Site Resources Site Resources Site Resources TeraGrid: NCSA, SDSC, Caltech, Argonne www.teragrid.org c/o I. FosterBuilding platforms is the “easy” part: Building platforms is the “easy” part Algorithms must be highly concurrent and straightforward to load balance latency tolerant cache friendly (temporal and spatial locality of reference) highly scalable (in the sense of convergence) Goal for algorithmic scalability: fill up memory of arbitrarily large machines while preserving constant running times with respect to proportionally smaller problem on one processor Domain-decomposed multilevel methods “natural” for all of these Domain decomposition also “natural” for software engineeringAlgorithmic requirementsfrom architecture: Algorithmic requirements from architecture Must run on physically distributed memory units connected by message-passing network, each serving one or more processors with multiple levels of cache T3E “horizontal” aspects “vertical” aspects message passing, shared memory threads register blocking, cache blocking, prefetchingKeyword: “Optimal”: Keyword: “Optimal” Convergence rate nearly independent of discretization parameters Multilevel schemes for rapid linear convergence of linear problems Newton-like schemes for quadratic convergence of nonlinear problems Convergence rate as independent as possible of physical parameters Continuation schemes Physics-based preconditioningWhy Optimal Algorithms?: Why Optimal Algorithms? The more powerful the computer, the greater the importance of optimality Example: Suppose Alg1 solves a problem in time CN2, where N is the input size Suppose Alg2 solves the same problem in time CN Suppose that the machine on which Alg1 and Alg2 have been parallelized to run has 10,000 processors In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger Why Optimal?, cont.: Why Optimal?, cont. Alternatively, filling the machine’s memory, Alg1 requires 100X time, whereas Alg2 runs in constant time Is 10,000 processors a reasonable expectation? Yes, we have it today (ASCI White)! Could computational scientists really use 10,000X? Of course; we are approximating the continuum A grid for weather prediction allows points every 1km versus every 100km on the earth’s surface In 2D 10,000X disappears fast; in 3D even faster However, these machines are expensive (ASCI White is $100M, plus ongoing operating costs), and optimal algorithms are the only algorithms that we can afford to run on them Decomposition strategies for Lu=f in : Decomposition strategies for Lu=f in Operator decomposition Function space decomposition Domain decompositionOperator decomposition: Operator decomposition Consider ADI Iteration matrix consists of four sequential (“multiplicative”) substeps per timestep two sparse matrix-vector multiplies two sets of unidirectional bandsolves Parallelism within each substep But global data exchanges between bandsolve substeps Function space decomposition: Function space decomposition Consider a spectral Galerkin method Domain decomposition: Domain decomposition Consider restriction and extension operators for subdomains, , and for possible coarse grid, Replace discretized with Solve by a Krylov method, e.g., CG Matrix-vector multiplies with parallelism on each subdomain nearest-neighbor exchanges, global reductions possible small global system (not needed for parabolic case)Comparison: Comparison Operator decomposition (ADI) natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose) Function space decomposition (Fourier) natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform) Domain decomposition (Schwarz) natural domain-based assignment requires local (nearest neighbor) data exchanges, global reductions, and optional small global problem Theoretical scaling of domain decomposition (for three common network topologies): Theoretical scaling of domain decomposition (for three common network topologies) With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest neighbor interconnects: optimal number of processors scales linearly with problem size (“scalable”, assumes one subdomain per processor) With power-law-time (3D torus-based) global reductions and scalable nearest neighbor interconnects: optimal number of processors scales as three-fourths power of problem size (“almost scalable”) With linear-time (common bus) network: optimal number of processors scales as one-fourth power of problem size (*not* scalable) bad news for conventional Beowulf clusters, but see 2000 & 2001 Bell Prize “price-performance awards” using multiple commodity NICs per Beowulf node! Three Basic Concepts: Three Basic Concepts Iterative correction Schwarz preconditioning Schur preconditioningIterative correction: Iterative correction The most basic idea in iterative methods Evaluate residual accurately, but solve approximately, where is an approximate inverse to A A sequence of complementary solves can be used, e.g., with first and then one has Optimal polynomials of lead to various preconditioned Krylov methods Scale recurrence, e.g., with , leads to multilevel methods Multilevel Preconditioning: Finest Grid Multilevel PreconditioningSchwarz Preconditioning: Schwarz Preconditioning Given A x = b , partition x into subvectors, corresp. to subdomains of the domain of the PDE, nonempty, possibly overlapping, whose union is all of the elements of Let Boolean rectangular matrix extract the subset of : Let The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain supportIteration count estimates from the Schwarz theory: Iteration count estimates from the Schwarz theory In terms of N and P, where for d-dimensional isotropic problems, N=h-d and P=H-d, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows: Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square-root of the condition number of the Schwarz-preconditioned systemSchur Preconditioning: Schur Preconditioning Given a partition Condense: Let M be a good preconditioner for S Then is a preconditioner for A Moreover, solves with may be done approximately if all degrees of freedom are retained Schwarz polynomials: Schwarz polynomials Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively) Schwarz-on-Schur: Schwarz-on-Schur Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem “Neumann-Neumann” alg Multigrid on the Schur complement Schwarz-inside-Schur: Schwarz-inside-Schur Consider Newton’s method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization Constraint Objective Lagrangian Form the gradient of the Lagrangian with respect to each of x, u, and : Schwarz-inside-Schur: Schwarz-inside-Schur Equality constrained optimization leads to the KKT system for states x , designs u , and multipliers Then Schwarz-inside-Schur, cont.: Schwarz-inside-Schur, cont. Problems is the Jacobian of a PDE huge! involve Hessians of objective and constraints second derivatives and huge H is unreasonable to form, store, or invert Solutions Use Schur preconditioning on full system Form forward action of Hessians by automatic differentiation (vector-to-vector map) Form approximate inverse action of state Jacobian and its transpose by Schwarz Nonlinear Schwarz preconditioning: Nonlinear Schwarz preconditioning Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free It replaces with a new nonlinear system possessing the same root, Define a correction to the partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system: where is nonzero only in the components of the partition Then sum the corrections: Nonlinear Schwarz, cont.: Nonlinear Schwarz, cont. It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root then and have the same unique root To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate for any : The residual The Jacobian-vector product Remarkably, (Cai-Keyes, 2000) it can be shown that where and All required actions are available in terms of ! Experimental example of nonlinear Schwarz: Experimental example of nonlinear Schwarz “Unreasonable effectiveness” of Schwarz: “Unreasonable effectiveness” of Schwarz When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right! Good decompositions are a compromise between conditioning and parallel complexity, in practice Let be a complete set of orthonormal row eigenvectors for A : orSome anecdotal successes: Some anecdotal successes Newton-Krylov-Schwarz Computational aerodynamics (Anderson et al., NASA, ODU, ANL) FETI (Schwarz on Schur) Computational structural dynamics (Farhat & Pierson, CU-Boulder) LNKS (Schwarz inside Schur) PDE-constrained optimization (Biros, NYU & Ghattas, CMU)Newton-Krylov-Schwarz: Newton-Krylov-Schwarz Newton nonlinear solver asymptotically quadratic Krylov accelerator spectrally adaptive Schwarz preconditioner parallelizable Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994)Jacobian-Free Newton-Krylov Method: Jacobian-Free Newton-Krylov Method In the Jacobian-Free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products These are approximated by the Fréchet derivatives so that the actual Jacobian elements are never explicitly needed, where is chosen with a fine balance between approximation and floating point rounding error Schwarz preconditions, using approximate elements Computational Aerodynamics: Computational Aerodynamics mesh c/o D. Mavriplis, ICASE Implemented in PETSc www.mcs.anl.gov/petsc Transonic “Lambda” Shock, Mach contours on surfacesFixed-size Parallel Scaling Results: Fixed-size Parallel Scaling Results c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith This scaling study, featuring our widest range of processor number, was done for the incompressible case. Fixed-size Parallel Scaling Results on ASCI RedONERA M6 Wing Test Case, Tetrahedral grid of 2.8 million vertices on up to 3072 ASCI Red Nodes (Pentium Pro 333 MHz processors): Fixed-size Parallel Scaling Results on ASCI Red ONERA M6 Wing Test Case, Tetrahedral grid of 2.8 million vertices on up to 3072 ASCI Red Nodes (Pentium Pro 333 MHz processors)PDE Workingsets: PDE Workingsets Smallest: data for single stencil Largest: data for entire subdomain Intermediate: data for a neighborhood collection of stencils, reused as possibleImprovements Resulting from Locality Reordering: Improvements Resulting from Locality ReorderingCache Traffic for PDEs: Cache Traffic for PDEs As successive workingsets “drop” into a level of memory, capacity (and with effort conflict) misses disappear, leaving only compulsory, reducing demand on main memory bandwidth Slide49: FETI-DP for structural mechanics 1mdof 4mdof 9mdof 18mdof 30mdof 60mdof c/o C. Farhat and K. Pierson Numerically scalable, hardware scalable solutions for realistic solid/shell models Used in Sandia codes Salinas, Adagio, Andante PDE-constrained Optimization: PDE-constrained Optimization c/o G. Biros and O. Ghattas Lagrange-Newton-Krylov-Schur implemented in Veltisto/PETSc Optimal control of laminar viscous flow optimization variables are surface suction/injection objective is minimum drag 700,000 states; 4,000 controls 128 Cray T3E processors ~5 hrs for optimal solution (~1 hr for analysis) www.cs.nyu.edu/~biros/veltisto/Slide51: Lab-university collaborations to develop “Integrated Software Infrastructure Centers” (ISICs) and partner with application groups For FY2002, 51 new projects at $57M/year total Approximately one-third for ISICs A third for grid infrastructure and collaboratories A third for applications groups 5 Tflop/s IBM SP platforms “Seaborg” at NERSC (#3 in latest “Top 500”) and “Cheetah” at ORNL (being installed now) available for SciDAC Introducing “Terascale Optimal PDE Simulations” (TOPS) ISIC: Introducing “Terascale Optimal PDE Simulations” (TOPS) ISIC Nine institutions, $18M, five years, 24 co-PIs TOPS: TOPS Not just algorithms, but vertically integrated software suites Portable, scalable, extensible, tunable, modular implementations Starring PETSc and hypre, among other existing packages Background of PETSc Library(in which FUN3D example was implemented): Background of PETSc Library (in which FUN3D example was implemented) Developed by Balay, Gropp, McInnes & Smith (ANL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by four additional staff (Buschelman, Kaushik, Knepley, Zhang) under SciDAC Distributed data structures as fundamental objects - index sets, vectors/gridfunctions, and matrices/arrays Iterative linear and nonlinear solvers, combinable modularly and recursively, and extensibly Portable, and callable from C, C++, Fortran Uniform high-level API, with multi-layered entry Aggressively optimized: copies minimized, communication aggregated and overlapped, caches and registers reused, memory chunks preallocated, inspector-executor model for repetitive tasks (e.g., gather/scatter) See http://www.mcs.anl.gov/petscUser Code/PETSc Library Interactions: PETSc code User code Application Initialization Function Evaluation Jacobian Evaluation Post- Processing PC KSP PETSc Main Routine Linear Solvers (SLES) Nonlinear Solvers (SNES) Timestepping Solvers (TS) User Code/PETSc Library InteractionsUser Code/PETSc Library Interactions: PETSc code User code Application Initialization Function Evaluation Jacobian Evaluation Post- Processing PC KSP PETSc Main Routine Linear Solvers (SLES) Nonlinear Solvers (SNES) Timestepping Solvers (TS) User Code/PETSc Library Interactions To be AD codeBackground of Hypre Library(to be combined with PETSc under SciDAC): Background of Hypre Library (to be combined with PETSc under SciDAC) Developed by Chow, Cleary & Falgout (LLNL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments; now joined by seven additional staff (Henson, Jones, Lambert, Painter, Tong, Treadway, Yang) under ASCI and SciDAC Object-oriented design similar to PETSc Concentrates on linear problems only Richer in preconditioners than PETSc, with focus on algebraic multigrid Includes other preconditioners, including sparse approximate inverse (Parasails) and parallel ILU (Euclid) See http://www.llnl.gov/CASC/hypre/Hypre’s “Conceptual Interfaces”: Hypre’s “Conceptual Interfaces” Slide c/o E. Chow, LLNLSample of Hypre’s Scaled Efficiency: Sample of Hypre’s Scaled Efficiency Slide c/o E. Chow, LLNLScope for TOPS: Scope for TOPS Design and implementation of “solvers” Time integrators, with sens. analysis Nonlinear solvers, with sens. analysis Optimizers Linear solvers Eigensolvers Software integration Performance optimization Optimizer Linear solver Eigensolver Time integrator Nonlinear solver Indicates dependence Sens. Analyzer TOPS Philosophy on PDEs: TOPS Philosophy on PDEs Solution of a system of PDEs is rarely a goal in itself PDEs are solved to derive various outputs from specified inputs Actual goal is characterization of a response surface or a design or control strategy Together with analysis, sensitivities and stability are often desired Tools for PDE solution should also support related desires TOPS Philosophy on Operators: TOPS Philosophy on Operators A continuous operator may appear in a discrete code in many different instances Optimal algorithms tend to be hierarchical and nested iterative Processor-scalable algorithms tend to be domain-decomposed and concurrent iterative Majority of progress towards desired highly resolved, high fidelity result occurs through cost-effective low resolution, low fidelity parallel efficient stages Operator abstractions and recurrence must be supportedIt’s 2002; do you know what your solver is up to?: It’s 2002; do you know what your solver is up to? Has your solver not been updated in the past five years? Is your solver running at 1-10% of machine peak? Do you spend more time in your solver than in your physics? Is your discretization or model fidelity limited by the solver? Is your time stepping limited by stability? Are you running loops around your analysis code? Do you care how sensitive to parameters your results are? If the answer to any of these questions is “yes”, you are a potential customer!TOPS project goals/success metrics: TOPS project goals/success metrics Understand range of algorithmic options and their tradeoffs (e.g., memory vs. time, inner iteration work vs. outer) Can try all reasonable options from different sources easily without recoding or extensive recompilation Know how their solvers are performing Spend more time in their physics than in their solvers Are intelligently driving solver research, and publishing joint papers with TOPS researchers Can simulate truly new physics, as solver limits are steadily pushed back (finer meshes, higher fidelity models, complex coupling, etc.) TOPS will have succeeded if users —Conclusions: Conclusions Domain decomposition and multilevel iteration the dominant paradigm in contemporary terascale PDE simulation Several freely available software toolkits exist, and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes Concerted efforts underway to make elements of these toolkits interoperate, and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and above all — adaptive! Many challenges loom at the “next scale” of computation Undoubtedly, new theory/algorithms will be part of the solutionAcknowledgments: Acknowledgments Collaborators or Contributors: George Biros (NYU) Xiao-Chuan Cai (Univ. Colorado, Boulder) Omar Ghattas (Carnegie-Mellon) Dinesh Kaushik (ODU) Dana Knoll (LANL) Dimitri Mavriplis (ICASE) Kendall Pierson (Sandia) PETSc team at Argonne National Laboratory: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith Sponsors: DOE, NASA, NSF Computer Resources: LLNL, LANL, SNL, NERSCRelated URLs: Related URLs Personal homepage: papers, talks, etc. http://www.math.odu.edu/~keyes SciDAC initiative http://www.science.doe.gov/scidac TOPS project http://www.math.odu.edu/~keyes/scidac PETSc project http://www.mcs.anl.gov/petsc Hypre project http://www.llnl.gov/CASC/hypre ASCI platforms http://www.llnl.gov/asci/platformsBibliography: Bibliography Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, to be submitted to J. Comp. Phys. Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, to appear in SIAM J. Sci. Comp. High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362 Four Horizons for Enhancing the Performance of Parallel Simulations based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci., Springer, 1900:1-17 Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136 Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99 Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278 How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297Agenda for future research: Agenda for future research High concurrency (100,000 processors) Asynchrony Fault tolerance Automated tuning Integration of simulation with studies of sensitivity, stability, and optimization High ConcurrencyToday Future: High Concurrency Today Future 100,000 processors, in a room or as part of a grid Most phases of DD computations scale well favorable surface-to-volume comm-to-comp ratio However, latencies will nix frequent exact reductions Paradigm: extrapolate data in retarded messages; correct (if necessary) when message arrives, such as in C(p,q,j) schemes by Garbey and Tromeur-Dervout 10,000 processors in a single room with tightly coupled network DD computations scale well, when provided with network rich enough for parallel near neighbor communication fast global reductions (complexity sublinear in processor count) AsynchronyToday Future: Asynchrony Today Future Adaptivity requirements and far-flung, nondedicated networks will lead to idleness and imbalance at synchronization points Need algorithms with looser outer loops than global Newton-Krylov Can we design algorithms that are robust with respect to incomplete convergence of inner tasks, like inexact Newton? Paradigm: nonlinear Schwarz with regional (not global) nonlinear solvers where most execution time is spent A priori partitionings for quasi-static meshes provide load-balanced computational tasks between frequent synchronization points Good load balance is critical to parallel scalability on 1,000 processors and more Fault ToleranceToday Future: Fault Tolerance Today Future c/o A. Geist With 100,000 processors or worldwide networks, MTBF will be in minutes Checkpoint-restart could take longer than the time to next failure Paradigm: naturally fault tolerant algorithms, robust with respect to failure, such as a new FD algorithm at ORNL Fault tolerance is not a driver in most scientific application code projects FT handled as follows: Detection of wrong System – in hardware Framework – by runtime env Library – in math or comm lib Notification of application Interrupt – signal sent to job Error code returned by app process Recovery Restart from checkpoint Migration of task to new hardware Reassignment of work to remaining tasks Automated TuningToday Future: Automated Tuning Today Future Less knowledgeable users required to employ parallel iterative solvers in taxing applications Need safe defaults and automated tuning strategies Paradigm: parallel direct search (PDS) derivative-free optimization methods, using overall parallel computational complexity as objective function and algorithm tuning parameters as design variables, to tune solver in preproduction trial executions Knowledgeable user-developers parameterize their solvers with experience and theoretically informed intuition for: problem size/processor ratio outer solver type Krylov solver type DD preconditioner type maximum subspace dimensions overlaps fill levels inner tolerances potentially many others Integrated SoftwareToday Future: Integrated Software Today Future Analysis increasingly an “inner loop” around which more sophisticated science-driven tasks are wrapped Need PDE task functionality (e.g., residual evaluation, Jacobian evaluation, Jacobian inverse) exposed to optimization/sensitivity/stability algorithms Paradigm: integrated software based on common distributed data structures Each analysis is a “special effort” Optimization, sensitivity analysis (e.g., for uncertainty quantification), and stability analysis to fully exploit and contextualize scientific results are rare Slide75: David E. Keyes Center for Computational Science Old Dominion University Institute for Computer Applications in Science & Engineering NASA Langley Research Center Institute for Scientific Computing Research Lawrence Livermore National Laboratory Domain Decomposition in the Mainstream of Computational Science