Performance Tools

Added: November 20, 2007. All Rights Reserved.
Presentation Transcript

Performance Analysis Tools:
Rick Kufrin, NCSA
Shirley Moore, U. Tennessee
Sameer S. Shende, U. Oregon
NCSA Workshop on Effective Use of Multi-Core Technology, July 2007


Topics:
Tools status & futures on Abe
PAPI (hardware performance counters)
PerfSuite (basic measurement software)
mpiP (monitoring MPI statistics)
TAU (advanced performance analysis)


Abe Tool Status (July ’07):
No additional supported tools yet installed
Base directory will be: /usr/apps/tools
Kernel recently patched for hardware performance counter support (w/perfctr until perfmon2 stabilizes); higher-level software to follow
Initial tool selection based on prior experience and demand; feedback is welcomed
Contact consult@ncsa.uiuc.edu with inquiries; will route appropriately


PAPI (Performance Application Programming Interface):
The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
Parallel Tools Consortium project started in 1998
Developed by University of Tennessee, Knoxville
http://icl.cs.utk.edu/papi/


PAPI Counter Interfaces:
PAPI provides 3 interfaces to the underlying counter hardware:
  The low-level interface manages hardware events in user-defined groups called EventSets and provides access to advanced features.
  The high-level interface provides the ability to start, stop, and read the counters for a specified list of events.
  Graphical and end-user tools provide facile data collection and visualization.


PAPI Implementation (layer diagram, top to bottom):
3rd Party and GUI Tools
Portable Layer: PAPI High Level, PAPI Low Level
Machine Specific Layer: PAPI Machine Dependent Substrate, Kernel Extension, Operating System, Hardware Performance Counters


PAPI Hardware Events:
Preset Events
  Standard set of over 100 events for application performance tuning
  No standardization of the exact definition
  Mapped to either single or linear combinations of native events on each platform
  Use the papi_avail utility to see what preset events are available on a given platform
Native Events
  Any event countable by the CPU
  Same interface as for preset events
  Use the papi_native_avail utility to see all available native events
Use the papi_event_chooser utility to select a compatible set of events
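To make the "single or linear combination" mapping concrete, here is a minimal Python sketch. It is not PAPI code: the native event names, coefficients, and counts are all hypothetical and only illustrate how a preset value could be derived as a weighted sum of native counts.

```python
# Illustrative only: deriving a preset event from native events.
# The native event names, coefficients, and counts are hypothetical.

def derive_preset(native_counts, terms):
    """Return sum(coefficient * native count) over the preset's terms."""
    return sum(coeff * native_counts[name] for name, coeff in terms)

# Hypothetical native counter readings from one run.
native_counts = {"NATIVE_LOADS": 1200, "NATIVE_STORES": 300, "NATIVE_L1_HITS": 1350}

# A preset that maps to a single native event.
single = derive_preset(native_counts, [("NATIVE_LOADS", 1)])

# A preset that maps to a linear combination: loads + stores - hits.
combined = derive_preset(
    native_counts,
    [("NATIVE_LOADS", 1), ("NATIVE_STORES", 1), ("NATIVE_L1_HITS", -1)],
)

print(single, combined)
```

Because the mapping differs per platform, the same preset name can count slightly different things on different CPUs, which is the "no standardization of the exact definition" caveat above.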


PAPI High-level Interface:
Meant for application programmers wanting coarse-grained measurements
Calls the lower-level API
Allows only PAPI preset events
Easier to use and less setup (less additional code) than low-level
Supports 8 calls in C or Fortran


PAPI High-level Example:
#include "papi.h"
#define NUM_EVENTS 2
long_long values[NUM_EVENTS];
int Events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};
int retval;
/* Start the counters */
retval = PAPI_start_counters(Events, NUM_EVENTS);
/* What we are monitoring... */
do_work();
/* Stop counters and store results in values */
retval = PAPI_stop_counters(values, NUM_EVENTS);


Low-level Interface:
Increased efficiency and functionality over the high-level PAPI interface
Obtain information about the executable, the hardware, and the memory environment
Multiplexing
Callbacks on counter overflow
Profiling
About 60 functions


PAPI Low-level Example:
#include "papi.h"
#define NUM_EVENTS 2
int Events[NUM_EVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
int EventSet = PAPI_NULL;  /* must be PAPI_NULL before PAPI_create_eventset */
long_long values[NUM_EVENTS];
int retval;
/* Initialize the library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
/* Allocate space for the new eventset and do setup */
retval = PAPI_create_eventset(&EventSet);
/* Add flops and total cycles to the eventset */
retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);
/* Start the counters */
retval = PAPI_start(EventSet);
do_work(); /* What we want to monitor */
/* Stop counters and store results in values */
retval = PAPI_stop(EventSet, values);


Component PAPI (PAPI-C):
Goals:
  Support simultaneous access to on- and off-processor counters
  Isolate hardware-dependent code in a separable ‘substrate’ module
  Extend platform-independent code to support multiple simultaneous substrates
  Add or modify API calls to support access to any of several substrates
  Modify build environment for easy selection and configuration of multiple available substrates
Will be released as PAPI 4.0


Extension to PAPI to Support Multiple Substrates (layer diagram, top to bottom):
Portable Layer: PAPI High Level, PAPI Low Level, Hardware Independent Layer
Machine Specific Layer: PAPI Machine Dependent Substrate, Kernel Extension, Operating System, Off-Processor Hardware Counters


PAPI-C Status:
PAPI 3.9 pre-release available with documentation
Implemented Myrinet substrate (native counters)
Implemented ACPI temperature sensor substrate
Working on InfiniBand and Cray Seastar substrates (access to Seastar counters not available under Catamount but expected under CNL)
Asked by Cray engineers for input on desired metrics for next network switch
Tested on HPC Challenge benchmarks
Tested platforms include Pentium III, Pentium 4, Core2Duo, Itanium (I and II), and AMD Opteron
Installed and tested on ARL MSRC Linux clusters and ASC MSRC SGI Altix


PAPI-C New Routines:
PAPI_get_component_info()
PAPI_num_cmp_hwctrs()
PAPI_get_cmp_opt()
PAPI_set_cmp_opt()
PAPI_set_cmp_domain()
PAPI_set_cmp_granularity()


Multiple Measurements:
HPCC HPL benchmark on Opteron with 3 performance metrics: FLOPS; Temperature; Network Sends/Receives
Temperature is from an on-chip thermal diode


High-level tools that use PAPI:
TAU (U Oregon) http://www.cs.uoregon.edu/research/tau/
HPCToolkit (Rice Univ) http://hipersoft.cs.rice.edu/hpctoolkit/
KOJAK (UTK, FZ Juelich) http://icl.cs.utk.edu/kojak/
PerfSuite (NCSA) http://perfsuite.ncsa.uiuc.edu/
Titanium (UC Berkeley) http://www.cs.berkeley.edu/Research/Projects/titanium/
SCALEA (Thomas Fahringer, U Innsbruck) http://www.par.univie.ac.at/project/scalea/
Open|Speedshop (SGI) http://oss.sgi.com/projects/openspeedshop/
SvPablo (UNC Renaissance Computing Institute) http://www.renci.unc.edu/Software/Pablo/pablo.htm


PerfSuite Design Goals:
Remove the barriers to the initial steps of performance analysis (don’t make it hard)
Separate data collection from presentation
Machine-independent representation
Focus on the “Big Picture” (remember the 80/20 rule?)
A primary goal is to provide an “entry point” that can help you to decide how to proceed


PerfSuite and XML:
In PerfSuite, nearly all data (input, output, configuration, etc.) is represented as XML (eXtensible Markup Language) documents
This provides the ability to manipulate & transform the data in many ways using standard software / skills
Machine-independent (no binary files); opens the data up to the user
There are numerous high-quality XML-aware libraries, available from either compiled or interpreted languages, that can make it easy to transform the data for your needs
Web browsers (e.g. Mozilla, IE) have built-in XML capabilities


PerfSuite Counter-Related Software:
Four performance counter-related utilities:
  psconfig: configure / select performance events
  psinv: query events and machine information
  psrun: generate raw counter or statistical profiling data from an unmodified binary
  psprocess: pre- and post-process data
Four libraries (shared and static):
  libperfsuite: the “core” library that can be used standalone and will be built regardless of the availability of other software
  libpshwpc: HardWare Performance Counter library, also built regardless of other software. Without counter support, it will only perform time-based profiling through profil(). A version suitable for threaded programs is available (_r suffix).
  libpshwpc_mpi: a convenience library based on the MPI standard PMPI interface


Example XML Event Document:
You can edit this file like any text file, load it into psconfig, modify it, save it, etc.
Select for use through the environment variable PS_HWPC_CONFIG


Configuring for Profiling: Configuring for Profiling Setting up for profiling is similar to counting - all you have to do is modify the XML configuration document: The XML document “root element” is now , not You can supply an optional “threshold”, or sampling rate Only one event is allowed in the document


A Quick “Cookbook” for psrun:
# First, be sure to set all paths properly (can do in .cshrc/.profile)
% set PSDIR=/opt/perfsuite
% source $PSDIR/bin/psenv.csh
# Use psrun on your program to generate the data,
# then use psprocess to produce an HTML file
% psrun myprog
% psprocess --html myprog.12345.xml > myprog.html
# Take a look at the results
% mozilla myprog.html
# Second run, but this time profiling instead of counting
% psrun -c $PSDIR/share/perfsuite/xml/pshwpc/profil.xml myprog
% psprocess -e myprog myprog.67890.xml


psrun:
Hardware performance counting and profiling with unmodified dynamically-linked executables
Available for x86, x86-64, and ia64
POSIX threads support
Automatic multiplexing
Can be used with MPI
Optionally collects resource usage
Supports all PAPI standard events
Input/Output = XML documents (can request plain text)


PerfSuite Environment Variables:
PS_HWPC: “off” or “on”, controls whether measurement takes place at all (for API)
PS_HWPC_CONFIG: set to the name of the XML event file created with psconfig or “by hand”. A default is used if not set
PS_HWPC_FILE: controls the prefix of the XML output document (default “psrun”)
PS_HWPC_ANNOTATION: adds an arbitrary “note” to the XML output
PS_HWPC_DOMAIN: controls whether counting is at user or system level (or both)
PS_HWPC_THRESHOLD: sets threshold for profiling
PS_HWPC_FORMAT: “text” or “xml”, controls whether output is in an XML document or plain text (similar to a psprocess report)
PSRUN_DOFORK: if set (to anything), monitors child processes also


psprocess (HTML mode):
This style of output is customizable by you. By default, the information it contains and its visual appearance are based on PerfSuite-provided defaults, but these can easily be replaced to suit your needs.
This output is generated by psprocess using XML Transformations. The stylesheet is in the share/perfsuite/xml/pshwpc subdirectory, with an “xsl” file extension.


psprocess (text mode):
PerfSuite Hardware Performance Summary Report
Version    : 1.0
Created    : Mon Dec 30 11:31:53 AM Central Standard Time 2002
Generator  : psprocess 0.5
XML Source : /u/ncsa/anyuser/performance/psrun-ia64.xml
Execution Information
===========================
Date : Sun Dec 15 21:01:20 2002
Host : user01
Processor and System Information
===========================
Node CPUs    : 2
Vendor       : Intel
Family       : IPF
Model        : Itanium
CPU Revision : 6
Clock (MHz)  : 800.136
Memory (MB)  : 2007.16
Pagesize (KB): 16


psprocess (text mode, cont’d):
Cache Information
==========================
Cache levels : 3
--------------------------------
Level 1
Type         : data
Size (KB)    : 16
Linesize (B) : 32
Assoc        : 4
Type         : instruction
Size (KB)    : 16
Linesize (B) : 32
Assoc        : 4
--------------------------------
Level 2
Type         : unified
Size (KB)    : 96
Linesize (B) : 64
Assoc        : 6
The reports (text or HTML) generated by psprocess have several sections, covering:
  Report creation details
  Run details
  Machine information
  Raw counter listings
  Counter explanations and index
  Derived metrics
  Run annotation defined by you


psprocess (text mode, cont’d):
Index Description                                        Counter Value
=================================================================
    1 Conditional branch instructions mispredicted.....     4831072449
    4 Floating point instructions......................    86124489172
    5 Total cycles.....................................   594547754568
    6 Instructions completed...........................  1049339828741
Statistics
=================================================================
Graduated instructions per cycle...................     1.765
Graduated floating point instructions per cycle....     0.145
Level 3 cache miss ratio (data)....................     0.957
Bandwidth used to level 3 cache (MB/s).............   385.087
% cycles with no instruction issue.................    10.410
% cycles stalled on memory access..................    43.139
MFLOPS (cycles)....................................   115.905
MFLOPS (wallclock).................................   114.441
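Several of the derived statistics can be reproduced from the raw counters listed in this report. The Python sketch below is a worked check, not psprocess code. It assumes that "MFLOPS (cycles)" means floating point instructions divided by the time implied by the cycle count and the reported clock rate (800.136 MHz); that assumption reproduces the values shown in the report.

```python
# Raw counter values and clock rate copied from the report slides.
FP_INS = 86124489172        # Floating point instructions
TOT_CYC = 594547754568      # Total cycles
TOT_INS = 1049339828741     # Instructions completed
CLOCK_MHZ = 800.136         # Clock (MHz) from the system information section

def per_cycle(count, cycles):
    """Events per cycle, e.g. instructions per cycle."""
    return count / cycles

def mflops_from_cycles(fp_ins, cycles, clock_mhz):
    """Millions of FP instructions per second, using cycles/clock as the time."""
    seconds = cycles / (clock_mhz * 1e6)
    return fp_ins / seconds / 1e6

# The report lists 1.765, 0.145, and 115.905 for these three statistics.
print(f"Graduated instructions per cycle: {per_cycle(TOT_INS, TOT_CYC):.3f}")
print(f"Graduated FP instructions per cycle: {per_cycle(FP_INS, TOT_CYC):.3f}")
print(f"MFLOPS (cycles): {mflops_from_cycles(FP_INS, TOT_CYC, CLOCK_MHZ):.3f}")
```

The cache and stall statistics depend on counters that are not listed on the slide, so they are left out here.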


PerfSuite Library Access (API):
All of the functionality is also available from within your program (C/C++/Fortran) through a small API
Same XML documents are read, same XML documents are written, small additional functionality
Why would you want to use this? Primarily to gain finer control over where measurements are taken in your program. For example, you might defer measurement until program initialization has completed
For complex uses, you are probably better off using an “industrial-strength” performance library
The intent of the API is to “abstract out” the process of performance measurement to a very high level


libpshwpc Library Routines:
The libpshwpc API contains nine routines that you can call from your C/C++ or Fortran program. Call “init” once; call “start” and “suspend” as many times as you like. Call “stop” (supplying a file name prefix of your choice) to get the performance data XML document. Optionally, call “shutdown”.
C / C++
  ps_hwpc_init (void)
  ps_hwpc_start (void)
  ps_hwpc_suspend (void)
  ps_hwpc_read (ps_hwpc_values_t *values)
  ps_hwpc_stop (char *prefix)
  ps_hwpc_shutdown (void)
  ps_hwpc_numevents (int *numevents)
  ps_hwpc_eventnames (char ***eventnames)
  ps_hwpc_psrun (void)
Fortran subroutine equivalents add an additional “ierr” status final argument


Example Fortran API Use:
      include 'fperfsuite.h'
      call PSF_hwpc_init(ierr)
      call PSF_hwpc_start(ierr)
      do j = 1, n
         do i = 1, m
            do k = 1, l
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do
      call PSF_hwpc_stop('perf', ierr)
      call PSF_hwpc_shutdown(ierr)
% ifort -c matmult.f -I/opt/perfsuite/include
% ifort matmult.o -L/usr/apps/tools/perfsuite/lib/intel -L/usr/apps/tools/papi/lib -lpshwpc -lperfsuite -lpapi


Using Processor “Native” Events:
It’s easy to work with native events in addition to PAPI standard events by modifying the configuration file slightly. Instead of using the XML attributes type="preset" name="PAPI_EVENTNAME", use the attribute type="native" and enclose the event name as the content of the element.
Can be used with profiling configurations
Example native event names: NOPS_RETIRED, BACK_END_BUBBLE_ALL
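Because the configuration is plain XML, it can be read with any standard XML library. The Python sketch below lists the events in a configuration that mixes preset and native events. The element names "eventlist" and "event" are placeholders chosen for this illustration (the real element names did not survive in this transcript); only the type/name attribute convention and the native event names come from the slide.

```python
import xml.etree.ElementTree as ET

# Element names "eventlist" and "event" are placeholders (assumptions);
# only the type/name attribute convention comes from the slide text.
CONFIG = """
<eventlist>
  <event type="preset" name="PAPI_TOT_CYC" />
  <event type="preset" name="PAPI_FP_INS" />
  <event type="native">NOPS_RETIRED</event>
  <event type="native">BACK_END_BUBBLE_ALL</event>
</eventlist>
"""

def list_events(xml_text):
    """Return (type, event name) pairs from a configuration document."""
    events = []
    for elem in ET.fromstring(xml_text).iter("event"):
        kind = elem.get("type")
        # Preset events carry the name as an attribute; native events as text.
        name = elem.get("name") if kind == "preset" else (elem.text or "").strip()
        events.append((kind, name))
    return events

for kind, name in list_events(CONFIG):
    print(kind, name)
```

The same approach applies to the XML output documents: load, select the elements of interest, and transform them as needed.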


Advanced Use (psrun):
psrun supports a few options that can be useful in working with shared or distributed memory programs:
  -p / --pthreads uses a POSIX thread-aware variant of the library that captures thread creation and measures performance of each, depositing the results in an XML document with the thread ID embedded
  -f / --fork monitors child processes that are created. Not enabled by default.
  -a / --annotate inserts an XML “element” with a user-supplied annotation (text)


Advanced Use (psprocess):
psprocess is meant to be a “generic” processor for different XML document types generated by PerfSuite. For hardware counting, the most common type is
Individual documents can be combined into a “multi-document” with the option -c / --combine. With hardware counter data, psprocess summarizes the information contained in them with descriptive statistics (mean, max, min, sum, stddev)
-s LIST is a very useful option to be used with profiling runs. LIST is a comma-separated list of modules, files, functions, lines used to limit the amount of output
-t THRESHOLD is also helpful in limiting the output of profiling runs. THRESHOLD is a number that specifies the minimum % of samples required for a given entry to be displayed. Example: “-t 2” means “don’t show me anything that didn’t account for at least 2% of the samples collected”
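The effect of a threshold such as "-t 2" can be sketched in a few lines of Python. This is an illustration of the filtering rule described above, not psprocess itself; the sample counts are taken from the CX3D function summary that appears later in this transcript (4012 samples in total).

```python
# Function sample counts from the CX3D profiling run shown later in the deck.
FUNCTION_SAMPLES = [
    ("velo", 3182),
    ("temp", 384),
    ("testin", 164),
    ("curr", 143),
    ("gm_send_with_callback", 54),
]
TOTAL_SAMPLES = 4012

def filter_by_threshold(samples, total, threshold_pct):
    """Keep entries whose share of all samples is at least threshold_pct."""
    kept = []
    for name, count in samples:
        pct = 100.0 * count / total
        if pct >= threshold_pct:
            kept.append((name, count, round(pct, 2)))
    return kept

# Equivalent in spirit to "-t 2": drop anything under 2% of the samples.
for name, count, pct in filter_by_threshold(FUNCTION_SAMPLES, TOTAL_SAMPLES, 2.0):
    print(f"{count:6d} {pct:6.2f}% {name}")
```

With a 2% threshold, gm_send_with_callback (about 1.35% of the samples) is dropped and the four application routines remain.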


Application Example: CX3D
Fortran 90 / MPI code (Forschungszentrum Juelich) that simulates Czochralski crystal growth. Spatial decomposition across processors can be specified at runtime.
We’ll look at the steps involved in using PerfSuite on 8 processors to obtain profiling and counting information.
The application measures elapsed time internally with system_clock(). For the 8-proc run, the measured wall clock time for a 4x2 decomposition is 40.88 secs.
We can also measure parallel runs using gprof by using the environment variable GMON_OUT_PREFIX to override the default “gmon.out” filename.


Profiling Results (gprof summary):
  %   cumulative    self               self     total
 time    seconds   seconds    calls  ms/call  ms/call  name
76.79     246.25    246.25     8000    30.78    30.93  velo_
 9.01     275.15     28.90     8000     3.61     3.64  temp_
 3.74     287.14     11.99     8000     1.50     1.50  curr_
 2.04     293.68      6.54                             gmpi_net_lookup
 1.81     299.49      5.81                             gm_ntoh_u8
 1.31     303.69      4.21                             MPID_RecvComplete
 0.75     306.12      2.42                             _gm_ntoh_u8
 0.71     308.38      2.27     8008     0.28     0.32  bound_
% time attributed to the highest routine (velo) ranges from 79.21 to 74.42.
$ gprof -s cx.gprof ${GMON_OUT_PREFIX}.*
$ gprof -s cx.gprof gmon.sum
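As a reading aid for the gprof columns, the "self ms/call" figure is the self time divided by the call count. The Python sketch below recomputes it for the routines that have call counts in the summary; the numbers are copied from the table, and the helper name is just for illustration.

```python
# (name, self seconds, calls) copied from the gprof summary.
ROWS = [
    ("velo_", 246.25, 8000),
    ("temp_", 28.90, 8000),
    ("curr_", 11.99, 8000),
    ("bound_", 2.27, 8008),
]

def self_ms_per_call(self_seconds, calls):
    """Average self time per call, in milliseconds."""
    return 1000.0 * self_seconds / calls

# The summary lists 30.78, 3.61, 1.50, and 0.28 ms/call for these routines.
for name, seconds, calls in ROWS:
    print(f"{name:8s} {self_ms_per_call(seconds, calls):6.2f} ms/call")
```

Note that the seconds are summed over all 8 processes, which is why the cumulative time far exceeds the 40.88 s wall clock time.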


Profiling Results (psprocess individual):
Profile Information
========================================================================
Class      : PAPI
Event      : PAPI_TOT_CYC (Total cycles)
Period     : 30600000
Samples    : 4012
Domain     : user
Run Time   : 40.65 (seconds)
Min Self % : (all)
Module Summary
------------------------------------------------------------------------
Samples   Self %   Total %  Module
   3942   98.26%    98.26%  /u/ncsa/rkufrin/apps/cx3d/cx
     69    1.72%    99.98%  /opt/gm/lib/libgm.so.0.0.0
      1    0.02%   100.00%  /lib/tls/libpthread-0.34.so
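The module summary can be related back to the sampling parameters with a little arithmetic. The Python sketch below assumes that "Period" is the number of PAPI_TOT_CYC events between samples, so each sample stands for roughly that many cycles; the helper names are illustrative and the numbers are copied from the report.

```python
PERIOD = 30600000  # events (cycles) between samples, from the report

# (module, samples) copied from the module summary.
MODULE_SAMPLES = [
    ("/u/ncsa/rkufrin/apps/cx3d/cx", 3942),
    ("/opt/gm/lib/libgm.so.0.0.0", 69),
    ("/lib/tls/libpthread-0.34.so", 1),
]

def estimated_cycles(samples, period):
    """Approximate cycles represented by a number of samples."""
    return samples * period

def self_percent(samples, total):
    """Share of all samples attributed to one entry, in percent."""
    return 100.0 * samples / total

total = sum(count for _, count in MODULE_SAMPLES)
print("total samples:", total)
print("estimated total cycles:", estimated_cycles(total, PERIOD))
for module, count in MODULE_SAMPLES:
    print(f"{count:6d} {self_percent(count, total):6.2f}% {module}")
```

A smaller period gives finer resolution at the cost of more sampling overhead.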


Profiling Results (psprocess, cont’d):
File Summary
--------------------------------------------------------------------------------
Samples   Self %   Total %  File
   3182   79.31%    79.31%  /u/ncsa/rkufrin/apps/cx3d/velo.f
    384    9.57%    88.88%  /u/ncsa/rkufrin/apps/cx3d/temp.f
    164    4.09%    92.97%  /u/ncsa/rkufrin/apps/cx3d/testin.f
    143    3.56%    96.54%  /u/ncsa/rkufrin/apps/cx3d/curr.f
     53    1.32%    97.86%  ./include/gm_send_queue.h
     23    0.57%    98.43%  ??
     22    0.55%    98.98%  /u/ncsa/rkufrin/apps/cx3d/bound.f
     15    0.37%    99.35%  /u/ncsa/rkufrin/apps/cx3d/csendxs.f
     14    0.35%    99.70%  ./libgm/gm_send.c
     10    0.25%    99.95%  /u/ncsa/rkufrin/apps/cx3d/crecvxs.f
      1    0.02%    99.98%  ./libgm/gm_ptr_hash.c
      1    0.02%   100.00%  ./libgm/gm_hash.c
Function Summary
--------------------------------------------------------------------------------
Samples   Self %   Total %  Function
   3182   79.31%    79.31%  velo
    384    9.57%    88.88%  temp
    164    4.09%    92.97%  testin
    143    3.56%    96.54%  curr
     54    1.35%    97.88%  gm_send_with_callback


Profiling Results (psprocess, cont’d):
Function:File:Line Summary
--------------------------------------------------------------------------------
Samples   Self %   Total %  Function:File:Line
    687   17.12%    17.12%  velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:232
    535   13.33%    30.46%  velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:260
    509   12.69%    43.15%  velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:210
    378    9.42%    52.57%  velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:356
    189    4.71%    57.28%  velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:493
$ mpirun -np 8 psrun -c profile_cycles.xml ./cx
$ psprocess -e cx psrun.PID.xml
profile_cycles.xml:


Summary Information (psprocess):
Aggregate Statistics                                      Min      Max   Median     Mean   StdDev       Sum
============================================================================================
% CPU utilization.....................                  97.88    98.41    98.09    98.12     0.17    784.93
% cycles stalled on any resource......                   0.00     0.00     0.00     0.00     0.00      0.00
CPU time (seconds)....................                  39.95    40.15    39.99    40.01     0.07    320.11
Floating point operations per cycle...                   0.05     0.05     0.05     0.05     0.00      0.39
Floating point operations per graduated instruction      0.04     0.04     0.04     0.04     0.00      0.31
Graduated instructions per cycle......                   1.27     1.30     1.29     1.29     0.01     10.28
Graduated instructions per issued instruction            0.99     1.00     1.00     1.00     0.00      7.97
Issued instructions per cycle.........                   1.28     1.31     1.29     1.29     0.01     10.33
Level 2 cache hit rate (data).........                   0.96     0.97     0.97     0.97     0.00      7.74
Level 2 cache line reuse (data).......                  27.49    30.82    29.57    29.28     1.22    234.26
MFLOPS (cycles).......................                 145.53   154.10   151.18   150.40     3.63   1203.21
MFLOPS (wall clock)...................                 142.45   151.50   148.37   147.57     3.64   1180.56
MIPS (cycles).........................                3881.34  3952.56  3924.68  3922.56    28.18  31380.47
MIPS (wall clock).....................                3799.24  3877.19  3854.91  3848.68    30.42  30789.40
MVOPS (cycles)........................                   0.00     0.00     0.00     0.00     0.00      0.00
MVOPS (wall clock)....................                   0.00     0.00     0.00     0.00     0.00      0.00
Mispredicted branches per correctly predicted branch     0.00     0.01     0.01     0.01     0.00      0.05
Vector instructions per cycle.........                   0.00     0.00     0.00     0.00     0.00      0.00
Vector instructions per graduated instruction            0.00     0.00     0.00     0.00     0.00      0.00
Wall clock time (seconds).............                  40.60    40.88    40.79    40.78     0.10    326.25
$ psprocess -c cx.*.xml > combined.xml
$ psprocess combined.xml


mpiP (http://mpip.sourceforge.net/):
Lightweight, scalable profiling library for MPI applications
Collects statistical information about MPI functions
Uses communication only when reporting; less overhead
How to use
  Include mpiP library during link process, no source changes required
  Automatic profiling information gathering
  No recompiling is required
What to analyze
  Overview of application’s time in MPI
  Gathers MPI callsites’ aggregate time within the application


mpiP Output:
Report stored in file with suffix “.mpiP”
All data written from MPI task 0
mpiP:
mpiP: mpiP V3.0.0 (Build Oct 4 2006/12:40:28)
mpiP: Direct questions and errors to mpip-help@lists.sourceforge.net
mpiP:
mpiP:
mpiP: Storing mpiP output in [./9-test-mpip-time.exe.2.12390.1.mpiP].
mpiP:
@ mpiP
@ Command             : /g/g0/chcham/mpiP/devo/testing/./9-test-mpip-time.exe
@ Version             : 2.8.2
@ MPIP Build date     : Jan 10 2005, 15:15:47
@ Start time          : 2005 01 10 16:01:32
@ Stop time           : 2005 01 10 16:01:42
@ Timer Used          : gettimeofday
@ MPIP env var        : -t 10.0
@ Collector Rank      : 0
@ Collector PID       : 25972
@ Final Output Dir    : .
@ MPI Task Assignment : 0 mcr88
@ MPI Task Assignment : 1 mcr88
@ MPI Task Assignment : 2 mcr89
@ MPI Task Assignment : 3 mcr89


mpiP Summary and Callsite Index:
---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0         10   0.000243     0.00
   1         10         10    99.92
   2         10         10    99.92
   3         10         10    99.92
   *         40         30    74.94
---------------------------------------------------------------------------
@--- Callsites: 2 ---------------------------------------------------------
---------------------------------------------------------------------------
 ID Lev File/Address        Line Parent_Funct  MPI_Call
  1   0 9-test-mpip-time.c    52 main          Barrier
  2   0 9-test-mpip-time.c    61 main          Barrier
MPI Time section: per-task and aggregate time/percentage for application and MPI
Callsites section: cross-reference to callsites (for interpreting remainder of report)
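The MPI% column is the MPI time as a share of the application time. The Python sketch below recomputes it from the displayed values; because the report rounds the times it prints, the recomputed figures (100.0 and 75.0) differ slightly from the 99.92 and 74.94 that mpiP derived from unrounded timings. The helper name is illustrative.

```python
# (task, AppTime, MPITime) copied from the "MPI Time (seconds)" table.
ROWS = [
    (0, 10.0, 0.000243),
    (1, 10.0, 10.0),
    (2, 10.0, 10.0),
    (3, 10.0, 10.0),
]

def mpi_percent(app_time, mpi_time):
    """Share of application time spent inside MPI calls, in percent."""
    return 100.0 * mpi_time / app_time

total_app = sum(app for _, app, _ in ROWS)
total_mpi = sum(mpi for _, _, mpi in ROWS)

for task, app, mpi in ROWS:
    print(task, round(mpi_percent(app, mpi), 2))
# The "*" row aggregates over all tasks.
print("*", round(mpi_percent(total_app, total_mpi), 2))
```

In this test program task 0 spends almost no time in MPI while the other three tasks wait in a barrier, which is why the aggregate is close to 75%.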


Per-Site Time/Message Statistics:
---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call      Site       Time    App%    MPI%    COV
Barrier      2      3e+04   75.00  100.00   0.67
Barrier      1      0.405    0.00    0.00   0.59
---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call      Site   Count      Total    Avrg    MPI%
Send         7     320   1.92e+06   6e+03   99.96
Bcast        1      12        336      28    0.02
---------------------------------------------------------------------------
@--- Callsite Time statistics (all, milliseconds): 8 ----------------------
---------------------------------------------------------------------------
Name      Site Rank  Count     Max     Mean     Min    App%    MPI%
Barrier      1    0      1   0.107    0.107   0.107    0.00   44.03
Barrier      1    *      4   0.174    0.137   0.107    0.00    0.00
Barrier      2    0      1   0.136    0.136   0.136    0.00   55.97
Barrier      2    1      1   1e+04    1e+04   1e+04   99.92  100.00
Barrier      2    2      1   1e+04    1e+04   1e+04   99.92  100.00
Barrier      2    3      1   1e+04    1e+04   1e+04   99.92  100.00
Barrier      2    *      4   1e+04  7.5e+03   0.136   74.94  100.00


TAU Performance System:
Tuning and Analysis Utilities (15+ year project effort)
Performance system framework for HPC systems
  Integrated, scalable, flexible, and parallel
Targets a general complex system computation model
  Entities: nodes / contexts / threads
  Multi-level: system / software / parallelism
  Measurement and analysis abstraction
Integrated toolkit for performance problem solving
  Instrumentation, measurement, analysis, and visualization
  Portable performance profiling and tracing facility
  Performance data management and data mining
Partners: LLNL, ANL, LANL, Research Center Jülich


TAU Parallel Performance System Goals:
Portable (open source) parallel performance system
  Computer system architectures and operating systems
  Different programming languages and compilers
Multi-level, multi-language performance instrumentation
Flexible and configurable performance measurement
Support for multiple parallel programming paradigms
  Multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component-based
Support for performance mapping
Support for comparing performance (single-core/multi-cores)
Integration of leading performance technology
Scalable (very large) parallel performance analysis


TAU Performance System Architecture


TAU Performance System Architecture


Building Bridges to Other Tools: TAU


TAU Instrumentation Approach:
Support for standard program events
  Routines, classes and templates
  Statement-level blocks
Support for user-defined events
  Begin/End events (“user-defined timers”)
  Atomic events (e.g., size of memory allocated/freed)
  Selection of event statistics
Support for hardware performance counters (PAPI)
Support definition of “semantic” entities for mapping
Support for event groups (aggregation, selection)
Instrumentation optimization
  Eliminate instrumentation in lightweight routines


TAU Instrumentation Mechanisms:
Source code
  Manual (TAU API, TAU component API)
  Automatic (robust)
    C, C++, F77/90/95 (Program Database Toolkit (PDT))
    OpenMP (directive rewriting (Opari), POMP2 spec)
Object code
  Pre-instrumented libraries (e.g., MPI using PMPI)
  Statically-linked and dynamically-linked
Executable code
  Dynamic instrumentation (pre-execution) (DynInstAPI)
  Virtual machine instrumentation (e.g., Java using JVMPI)
TAU_COMPILER to automate instrumentation process


Using TAU: A Brief Introduction
To instrument source code using PDT, choose an appropriate TAU stub makefile in /lib:
  % setenv TAU_MAKEFILE /usr/tau-2.x/xt3/lib/Makefile.tau-mpi-pdt-pgi
  % setenv TAU_OPTIONS '-optVerbose …' (see tau_compiler.sh)
Then use tau_f90.sh, tau_cxx.sh, or tau_cc.sh as the Fortran, C++, or C compiler:
  % mpif90 foo.f90
changes to
  % tau_f90.sh foo.f90
Execute the application and analyze performance data:
  % pprof (for text-based profile display)
  % paraprof (for GUI)


TAU Measurement Mechanisms:
Parallel profiling
  Function-level, block-level, statement-level
  Supports user-defined events and mapping events
  TAU parallel profile stored (dumped) during execution
  Support for flat, callgraph/callpath, phase profiling
  Support for memory profiling (headroom, malloc/leaks)
  Support for tracking I/O (wrappers, Fortran instrumentation of read/write/print calls)
Tracing
  All profile-level events
  Inter-process communication events
  Inclusion of multiple counter data in traced events


Types of Parallel Performance Profiling:
Flat profiles
  Metric (e.g., time) spent in an event (callgraph nodes)
  Exclusive/inclusive, # of calls, child calls
Callpath profiles (Calldepth profiles)
  Time spent along a calling path (edges in callgraph)
  “main => f1 => f2 => MPI_Send” (event name)
  TAU_CALLPATH_DEPTH environment variable
Phase profiles
  Flat profiles under a phase (nested phases are allowed)
  Default “main” phase
  Supports static or dynamic (per-iteration) phases
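To illustrate how a callpath depth limit shapes event names, here is a small Python sketch. It assumes that a depth of N keeps the N routines nearest the measured event; this is an illustration of the naming only, not TAU's implementation, and the function name is hypothetical.

```python
def truncate_callpath(path, depth):
    """Keep the last `depth` routines of an 'a => b => c' callpath name."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    routines = [part.strip() for part in path.split("=>")]
    return " => ".join(routines[-depth:])

full_path = "main => f1 => f2 => MPI_Send"
for depth in (1, 2, 4):
    print(depth, truncate_callpath(full_path, depth))
```

A depth of 1 reduces to the flat profile name, while larger depths distinguish the same routine reached through different callers.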


Performance Analysis and Visualization:
Analysis of parallel profile and trace measurement
Parallel profile analysis
  ParaProf: parallel profile analysis and presentation
  ParaVis: parallel performance visualization package
  Profile generation from trace data (tau2profile)
Performance data management framework (PerfDMF)
Parallel trace analysis
  Translation to VTF (V3.0), EPILOG, OTF formats
  Integration with VNG (Technical University of Dresden)
Online parallel analysis and visualization
Integration with CUBE browser (KOJAK, UTK, FZJ)


ParaProf Parallel Performance Profile Analysis:
HPMToolkit, MpiP, TAU
Raw files
PerfDMF managed (database)
Metadata: Application, Experiment, Trial


ParaProf – Flat Profile (Miranda, BG/L)


ParaProf – Stacked View (Miranda)


ParaProf – Callpath Profile (Flash):
Flash: thermonuclear flashes, Fortran + MPI, Argonne


Comparing Effects of MultiCore Processors:
AORSA2D on 4k cores
PAPI resource stalls
Jaguar Cray XT (ORNL)
Blue is single node
Red is dual core


Comparing FLOPS: MultiCore Processors
AORSA2D on 4k cores
Jaguar Cray XT3 (ORNL)
Floating pt ins/second
Blue is dual core
Red is single node


ParaProf – Scalable Histogram View (Miranda):
8k processors
16k processors


ParaProf – 3D Full Profile (Miranda):
16k processors


ParaProf – 3D Scatterplot (S3D – XT4 only):
Each point is a “thread” of execution
A total of four metrics shown in relation
ParaVis 3D profile visualization library (JOGL)
6400 cores
I/O takes less time on one node (rank 0)
Events (exclusive time metric): MPI_Barrier(), two loops, write operation


S3D Scatter Plot: Visualizing Hybrid XT3+XT4
6400 cores
Red nodes are XT4, blue are XT3


S3D: 6400 cores on XT3+XT4 System (Jaguar)
Gap represents XT3 nodes


Visualizing S3D Profiles in ParaProf:
Gap represents XT3 nodes
MPI_Wait takes less time, other routines take more time


Profile Snapshots in ParaProf:
Profile snapshots are parallel profiles recorded at runtime
Used to highlight profile changes during execution
Initialization, Checkpointing, Finalization


Profile Snapshots in ParaProf


Profile Snapshots in ParaProf:
Breakdown as a percentage


Acknowledgements (TAU):
Dr. Allen D. Malony, Professor
Alan Morris, Senior software engineer
Wyatt Spear, Software engineer
Scott Biersdorff, Software engineer
Kevin Huck, Ph.D. student
Aroon Nataraj, Ph.D. student
Brad Davidson, Systems administrator