Performance Analysis Tools:
Rick Kufrin, NCSA
Shirley Moore, U. Tennessee
Sameer S. Shende, U. Oregon
NCSA Workshop on Effective Use of Multi-Core Technology
July 2007
Topics:
Tools status & futures on Abe
PAPI (hardware performance counters)
PerfSuite (basic measurement software)
mpiP (monitoring MPI statistics)
TAU (advanced performance analysis)
Abe Tool Status (July ’07):
No additional supported tools yet installed
Base directory will be: /usr/apps/tools
Kernel recently patched for hardware performance counter support (w/perfctr until perfmon2 stabilizes) - higher-level software to follow
Initial tool selection is based on prior experience and demand; feedback is welcome
Contact consult@ncsa.uiuc.edu with inquiries; will route appropriately
PAPI:
Performance Application Programming Interface
The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
Parallel Tools Consortium project started in 1998
Developed by University of Tennessee, Knoxville
http://icl.cs.utk.edu/papi/
PAPI Counter Interfaces:
PAPI provides 3 interfaces to the underlying counter hardware:
The low-level interface manages hardware events in user-defined groups called EventSets, and provides access to advanced features.
The high-level interface provides the ability to start, stop and read the counters for a specified list of events.
Graphical and end-user tools provide facile data collection and visualization
PAPI Implementation:
[Figure: PAPI layered architecture. Portable layer: 3rd-party and GUI tools, PAPI high-level interface, PAPI low-level interface. Machine-specific layer: PAPI machine-dependent substrate, kernel extension, operating system, hardware performance counters.]
PAPI Hardware Events:
Preset Events
Standard set of over 100 events for application performance tuning
No standardization of the exact definition
Mapped to either single or linear combinations of native events on each platform
Use papi_avail utility to see what preset events are available on a given platform
Native Events
Any event countable by the CPU
Same interface as for preset events
Use papi_native_avail utility to see all available native events
Use papi_event_chooser utility to select a compatible set of events
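The idea of a preset being "mapped to either single or linear combinations of native events" can be pictured with a small sketch. The type and function names below are hypothetical and chosen only for illustration; they are not PAPI data structures or API calls.

```c
#include <assert.h>
#include <stddef.h>

/* One term of a derived event: coefficient * native counter value.
 * Hypothetical type for illustration; not part of the PAPI API. */
typedef struct {
    int coefficient;         /* typically +1 or -1 */
    long long native_count;  /* value read from one native event */
} native_term;

/* Evaluate a preset defined as a linear combination of native events. */
long long derived_preset_value(const native_term *terms, size_t nterms)
{
    long long total = 0;
    assert(terms != NULL || nterms == 0);
    for (size_t i = 0; i < nterms; i++)
        total += terms[i].coefficient * terms[i].native_count;
    return total;
}
```

With a single term and coefficient 1 the preset is just the native count; a preset defined as the difference of two native counts (say 1000 and 250) would evaluate to 750.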
PAPI High-level Interface:
Meant for application programmers wanting coarse-grained measurements
Calls the lower level API
Allows only PAPI preset events
Easier to use and less setup (less additional code) than low-level
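Supports 8 calls in C or Fortran: PAPI_start_counters, PAPI_read_counters, PAPI_accum_counters, PAPI_stop_counters, PAPI_num_counters, PAPI_flips, PAPI_flops, PAPI_ipc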
PAPI High-level Example:
#include "papi.h"
#define NUM_EVENTS 2
void do_work(void);  /* the code being measured, defined elsewhere */
int main(void)
{
    long_long values[NUM_EVENTS];
    int Events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};
    int retval;
    /* Start the counters */
    retval = PAPI_start_counters(Events, NUM_EVENTS);
    if (retval != PAPI_OK) return 1;
    /* What we are monitoring... */
    do_work();
    /* Stop counters and store results in values */
    retval = PAPI_stop_counters(values, NUM_EVENTS);
    return retval == PAPI_OK ? 0 : 1;
}
Low-level Interface:
Increased efficiency and functionality over the high-level PAPI interface
Obtain information about the executable, the hardware, and the memory environment
Multiplexing
Callbacks on counter overflow
Profiling
About 60 functions
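PAPI Low-level Example:
#include "papi.h"
#define NUM_EVENTS 2
void do_work(void);  /* the code being measured, defined elsewhere */
int main(void)
{
    int Events[NUM_EVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
    int EventSet = PAPI_NULL;  /* must be PAPI_NULL before PAPI_create_eventset */
    long_long values[NUM_EVENTS];
    int retval;
    /* Initialize the library */
    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) return 1;
    /* Allocate space for the new eventset and do setup */
    retval = PAPI_create_eventset(&EventSet);
    if (retval != PAPI_OK) return 1;
    /* Add floating-point instructions and total cycles to the eventset */
    retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);
    if (retval != PAPI_OK) return 1;
    /* Start the counters */
    retval = PAPI_start(EventSet);
    if (retval != PAPI_OK) return 1;
    do_work(); /* What we want to monitor */
    /* Stop counters and store results in values */
    retval = PAPI_stop(EventSet, values);
    return retval == PAPI_OK ? 0 : 1;
}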
Component PAPI (PAPI-C):
Goals:
Support simultaneous access to on- and off-processor counters
Isolate hardware dependent code in a separable ‘substrate’ module
Extend platform independent code to support multiple simultaneous substrates
Add or modify API calls to support access to any of several substrates
Modify build environment for easy selection and configuration of multiple available substrates
Will be released as PAPI 4.0
Extension to PAPI to Support Multiple Substrates:
[Figure: PAPI-C layered architecture. Portable layer: PAPI high-level interface, PAPI low-level interface, hardware-independent layer. Machine-specific layer: PAPI machine-dependent substrates, kernel extension, operating system, off-processor hardware counters.]
PAPI-C Status:
PAPI 3.9 pre-release available with documentation
Implemented Myrinet substrate (native counters)
Implemented ACPI temperature sensor substrate
Working on InfiniBand and Cray SeaStar substrates (access to SeaStar counters not available under Catamount but expected under CNL)
Asked by Cray engineers for input on desired metrics for next network switch
Tested on HPC Challenge benchmarks
Tested platforms include Pentium III, Pentium 4, Core2Duo, Itanium (I and II) and AMD Opteron
Installed and tested on ARL MSRC Linux clusters and ASC MSRC SGI Altix
PAPI-C New Routines:
PAPI_get_component_info()
PAPI_num_cmp_hwctrs()
PAPI_get_cmp_opt()
PAPI_set_cmp_opt()
PAPI_set_cmp_domain()
PAPI_set_cmp_granularity()
Multiple Measurements:
HPCC HPL benchmark on Opteron with 3 performance metrics:
FLOPS; Temperature; Network Sends/Receives
Temperature is from an on-chip thermal diode
High-level tools that use PAPI:
TAU (U Oregon) http://www.cs.uoregon.edu/research/tau/
HPCToolkit (Rice Univ) http://hipersoft.cs.rice.edu/hpctoolkit/
KOJAK (UTK, FZ Juelich) http://icl.cs.utk.edu/kojak/
PerfSuite (NCSA) http://perfsuite.ncsa.uiuc.edu/
Titanium (UC Berkeley) http://www.cs.berkeley.edu/Research/Projects/titanium/
SCALEA (Thomas Fahringer, U Innsbruck) http://www.par.univie.ac.at/project/scalea/
Open|Speedshop (SGI) http://oss.sgi.com/projects/openspeedshop/
SvPablo (UNC Renaissance Computing Institute) http://www.renci.unc.edu/Software/Pablo/pablo.htm
PerfSuite:
Design Goals
Remove the barriers to the initial steps of performance analysis (don’t make it hard)
Separate data collection from presentation
Machine-independent representation
Focus on the “Big Picture” (remember the 80/20 rule?)
A primary goal is to provide an “entry point” that can help you to decide how to proceed
PerfSuite and XML:
In PerfSuite, nearly all data (input, output, configuration, etc.) is represented as XML (eXtensible Markup Language) documents
This provides the ability to manipulate & transform the data in many ways using standard software / skills
Machine-independent (no binary files)
...opens the data up to the user
There are numerous high-quality XML-aware libraries available from either compiled or interpreted languages that can make it easy to transform the data for your needs
Web browsers (e.g. Mozilla, IE) have built-in XML capabilities
PerfSuite Counter-Related Software:
Four performance counter-related utilities:
psconfig - configure / select performance events
psinv - query events and machine information
psrun - generate raw counter or statistical profiling data from an unmodified binary
psprocess - pre- and post-process data
Four libraries (shared and static)
libperfsuite – the “core” library that can be used standalone and will be built regardless of the availability of other software
libpshwpc – HardWare Performance Counter library, also built regardless of other software. Without counter support, will only perform time-based profiling through profil(). A version suitable for threaded programs is available (_r suffix).
libpshwpc_mpi – a convenience library based on the MPI standard PMPI interface.
Example XML Event Document:
You can edit this file like any text file, load it into psconfig, modify it, save it, etc.
Select it for use through the environment variable PS_HWPC_CONFIG
Configuring for Profiling:
Setting up for profiling is similar to counting; all you have to do is modify the XML configuration document:
The XML document “root element” is now <ps_hwpc_profile>, not <ps_hwpc_eventlist>
You can supply an optional “threshold”, or sampling rate
Only one event is allowed in the document
A Quick “Cookbook” for psrun:
# First, be sure to set all paths properly (can do in .cshrc/.profile)
% set PSDIR=/opt/perfsuite
% source $PSDIR/bin/psenv.csh
# Use psrun on your program to generate the data,
# then use psprocess to produce an HTML file
% psrun myprog
% psprocess --html myprog.12345.xml > myprog.html
# Take a look at the results
% mozilla myprog.html
# Second run, but this time profiling instead of counting
% psrun -c $PSDIR/share/perfsuite/xml/pshwpc/profil.xml myprog
% psprocess -e myprog myprog.67890.xml
psrun:
Hardware performance counting and profiling with unmodified dynamically-linked executables
Available for x86, x86-64, and ia64
POSIX threads support
Automatic multiplexing
Can be used with MPI
Optionally collects resource usage
Supports all PAPI standard events
Input/Output = XML documents (can request plain text)
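When more events are requested than there are physical counters, multiplexing time-slices the events, so each reported count is an estimate extrapolated from the time the event was actually being counted. A minimal sketch of that scaling, assuming simple proportional extrapolation; the function name is hypothetical and this is not code from psrun or PAPI.

```c
#include <assert.h>

/* Estimate a full-run count from a time-sliced (multiplexed) measurement.
 * raw_count: events seen while this event occupied a hardware counter
 * active_us: time the event was actually being counted (microseconds)
 * total_us:  total measured run time (microseconds) */
long long multiplex_estimate(long long raw_count,
                             long long active_us, long long total_us)
{
    assert(active_us > 0 && total_us >= active_us);
    return (long long)((double)raw_count * (double)total_us
                       / (double)active_us);
}
```

An event that was counted for a quarter of the run and saw 1000 occurrences would be reported as roughly 4000. Such estimates become less reliable for short runs, where each event receives few time slices.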
PerfSuite Environment Variables:
PS_HWPC: “off” or “on”, controls whether measurement takes place at all (for API)
PS_HWPC_CONFIG: set to the name of the XML event file created with psconfig or “by hand”. A default is used if not set
PS_HWPC_FILE: controls the prefix of the XML output document (default “psrun”)
PS_HWPC_ANNOTATION: adds an arbitrary “note” to the XML output
PS_HWPC_DOMAIN: controls whether counting takes place at user or system level (or both)
PS_HWPC_THRESHOLD: sets threshold for profiling
PS_HWPC_FORMAT: “text” or “xml”, controls whether output is in an XML document or plain text (similar to a psprocess report)
PSRUN_DOFORK: if set (to anything), monitors child processes also
psprocess (HTML mode):
This style of output is customizable by you.
By default, the information it contains and its visual appearance are based on PerfSuite-provided defaults, but these can be easily replaced to suit your needs.
This output is generated by psprocess using XSL Transformations (XSLT). The stylesheet is in the share/perfsuite/xml/pshwpc subdirectory, with an “xsl” file extension
psprocess (text mode):
PerfSuite Hardware Performance Summary Report
Version : 1.0
Created : Mon Dec 30 11:31:53 AM Central Standard Time 2002
Generator : psprocess 0.5
XML Source : /u/ncsa/anyuser/performance/psrun-ia64.xml
Execution Information
===========================
Date : Sun Dec 15 21:01:20 2002
Host : user01
Processor and System Information
===========================
Node CPUs : 2
Vendor : Intel
Family : IPF
Model : Itanium
CPU Revision : 6
Clock (MHz) : 800.136
Memory (MB) : 2007.16
Pagesize (KB): 16
psprocess (text mode, cont’d):
Cache Information
==========================
Cache levels : 3
--------------------------------
Level 1
Type : data
Size (KB) : 16
Linesize (B) : 32
Assoc : 4
Type : instruction
Size (KB) : 16
Linesize (B) : 32
Assoc : 4
--------------------------------
Level 2
Type : unified
Size (KB) : 96
Linesize (B) : 64
Assoc : 6
The reports (text or HTML) generated by psprocess have several sections, covering:
Report creation details
Run details
Machine information
Raw counter listings
Counter explanations and index
Derived metrics
Run annotation defined by you
psprocess (text mode, cont’d):
Index Description Counter Value
=================================================================
1 Conditional branch instructions mispredicted..... 4831072449
4 Floating point instructions...................... 86124489172
5 Total cycles..................................... 594547754568
6 Instructions completed........................... 1049339828741
Statistics
=================================================================
Graduated instructions per cycle................... 1.765
Graduated floating point instructions per cycle.... 0.145
Level 3 cache miss ratio (data).................... 0.957
Bandwidth used to level 3 cache (MB/s)............. 385.087
% cycles with no instruction issue................. 10.410
% cycles stalled on memory access.................. 43.139
MFLOPS (cycles).................................... 115.905
MFLOPS (wallclock)................................. 114.441
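The derived statistics in this sample report follow from the raw counters by simple arithmetic. The sketch below reproduces three of them; the formulas are an interpretation of the metric names rather than PerfSuite source code, and the function names are hypothetical. The 800.136 MHz clock rate comes from the processor section of the same sample report.

```c
#include <assert.h>

/* Events (instructions, FP instructions, ...) per cycle. */
double per_cycle(long long events, long long cycles)
{
    assert(cycles > 0);
    return (double)events / (double)cycles;
}

/* MFLOPS based on the cycle count: FP instructions divided by the run
 * time implied by the cycle count and the clock rate in MHz. */
double mflops_from_cycles(long long fp_ins, long long cycles,
                          double clock_mhz)
{
    assert(cycles > 0 && clock_mhz > 0.0);
    double seconds = (double)cycles / (clock_mhz * 1.0e6);
    return (double)fp_ins / seconds / 1.0e6;
}
```

Using counters 6 and 5, per_cycle(1049339828741, 594547754568) gives about 1.765 graduated instructions per cycle; counters 4 and 5 give about 0.145 floating point instructions per cycle and about 115.9 MFLOPS (cycles), matching the report.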
PerfSuite Library Access (API):
All of the functionality is also available from within your program (C/C++/Fortran) through a small API
Same XML documents are read, same XML documents are written, small additional functionality
Why would you want to use this?
Primarily to gain finer control over where measurements are taken in your program. For example, you might defer measurement until program initialization has completed
For complex uses, you are probably better off using an “industrial-strength” performance library
The intent of the API is to “abstract out” the process of performance measurement to a very high level
libpshwpc Library Routines:
The libpshwpc API contains nine routines that you can call from your C/C++ or Fortran program.
Call “init” once, call “start” and “suspend” as many times as you like. Call “stop” (supplying a file name prefix of your choice) to get the performance data XML document.
Optionally, call “shutdown”.
C / C++
ps_hwpc_init (void)
ps_hwpc_start (void)
ps_hwpc_suspend (void)
ps_hwpc_read (ps_hwpc_values_t *values)
ps_hwpc_stop (char *prefix)
ps_hwpc_shutdown (void)
ps_hwpc_numevents (int *numevents)
ps_hwpc_eventnames (char ***eventnames)
ps_hwpc_psrun (void)
Fortran subroutine equivalents add an additional “ierr” status final argument
Example Fortran API Use:
include 'fperfsuite.h'
call PSF_hwpc_init(ierr)
call PSF_hwpc_start(ierr)
do j = 1, n
do i = 1, m
do k = 1, l
c(i,j) = c(i,j) + a(i,k)*b(k,j)
end do
end do
end do
call PSF_hwpc_stop('perf', ierr)
call PSF_hwpc_shutdown(ierr)
% ifort -c matmult.f -I/opt/perfsuite/include
% ifort matmult.o -L/usr/apps/tools/perfsuite/lib/intel -L/usr/apps/tools/papi/lib -lpshwpc -lperfsuite -lpapi
Using Processor “Native” Events:
It’s easy to work with native events in addition to PAPI standard events by modifying the configuration file slightly.
Instead of using the XML attributes type="preset" name="PAPI_EVENTNAME", use the attribute type="native" and enclose the event name as the content of the element
Can be used with profiling configurations
Example native event names: NOPS_RETIRED, BACK_END_BUBBLE_ALL
Advanced Use (psrun):
psrun supports a few options that can be useful in working with shared or distributed memory programs:
-p / --pthreads
uses a POSIX thread-aware variant of the library that captures thread creation and measures the performance of each thread, depositing the results for each in an XML document with the thread ID embedded
-f / --fork
monitors child processes that are created. Not enabled by default.
-a / --annotate
inserts an XML “element” with a user-supplied annotation (text)
Advanced Use (psprocess):
psprocess is meant to be a “generic” processor for different XML document types generated by PerfSuite. For hardware counting, the most common type is <hwpcreport>
Individual documents can be combined into a “multi-document” with the option -c / --combine. With hardware counter data, psprocess summarizes the information contained in them with descriptive statistics (mean, max, min, sum, stddev)
-s LIST is a very useful option to be used with profiling runs. LIST is a comma-separated list of modules, files, functions, lines used to limit the amount of output
-t THRESHOLD is also helpful in limiting the output of profiling runs. THRESHOLD is a number that specifies the minimum % of samples required for a given entry to be displayed. Example: “-t 2” means “don’t show me anything that didn’t account for at least 2% of the samples collected”
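The effect of a percentage threshold such as “-t 2” can be sketched as a simple filter over sample counts. This only illustrates the rule described above and is not psprocess code; the function name and the sample counts in the usage note are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many profile entries survive a "-t THRESHOLD" style filter:
 * an entry is kept when its share of all samples is at least
 * threshold_pct percent. */
size_t entries_above_threshold(const long *samples, size_t n,
                               long total_samples, double threshold_pct)
{
    size_t kept = 0;
    assert(total_samples > 0);
    for (size_t i = 0; i < n; i++) {
        double pct = 100.0 * (double)samples[i] / (double)total_samples;
        if (pct >= threshold_pct)
            kept++;
    }
    return kept;
}
```

For hypothetical entries with 800, 150, 30, 15 and 5 samples out of 1000, a threshold of 2 keeps the first three (80%, 15%, 3%) and hides the last two (1.5%, 0.5%).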
Application Example: CX3D:
Fortran 90 / MPI code (Forschungszentrum Juelich) that simulates Czochralski crystal growth.
Spatial decomposition across processors can be specified at runtime.
We’ll look at the steps involved in using PerfSuite on 8 processors to obtain profiling and counting information.
The application measures elapsed time internally with system_clock(). For the 8-proc run, the measured wall clock time for a 4x2 decomposition is 40.88 secs.
We can also measure parallel runs using gprof by using the environment variable GMON_OUT_PREFIX to override the default “gmon.out” filename.
Profiling Results (gprof summary):
  %   cumulative     self               self     total
 time    seconds  seconds    calls  ms/call  ms/call  name
76.79     246.25   246.25     8000    30.78    30.93  velo_
 9.01     275.15    28.90     8000     3.61     3.64  temp_
 3.74     287.14    11.99     8000     1.50     1.50  curr_
 2.04     293.68     6.54                             gmpi_net_lookup
 1.81     299.49     5.81                             gm_ntoh_u8
 1.31     303.69     4.21                             MPID_RecvComplete
 0.75     306.12     2.42                             _gm_ntoh_u8
 0.71     308.38     2.27     8008     0.28     0.32  bound_
% time attributed to the highest routine (velo) ranges from 79.21 to 74.42.
$ gprof -s cx.gprof ${GMON_OUT_PREFIX}.*
$ gprof -s cx.gprof gmon.sum
Profiling Results (psprocess individual):
Profile Information
========================================================================
Class : PAPI
Event : PAPI_TOT_CYC (Total cycles)
Period : 30600000
Samples : 4012
Domain : user
Run Time : 40.65 (seconds)
Min Self % : (all)
Module Summary
------------------------------------------------------------------------
Samples Self % Total % Module
3942 98.26% 98.26% /u/ncsa/rkufrin/apps/cx3d/cx
69 1.72% 99.98% /opt/gm/lib/libgm.so.0.0.0
1 0.02% 100.00% /lib/tls/libpthread-0.34.so
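The numbers in this profile relate to each other through simple arithmetic, sketched below. It assumes that “Period” is the number of events (here cycles) between consecutive samples; the function names are hypothetical.

```c
#include <assert.h>

/* Share of all samples attributed to one entry, in percent ("Self %"). */
double self_percent(long entry_samples, long total_samples)
{
    assert(total_samples > 0);
    return 100.0 * (double)entry_samples / (double)total_samples;
}

/* Approximate event total represented by a profile: each sample stands
 * for one sampling period's worth of events. */
long long estimated_events(long samples, long long period)
{
    assert(samples >= 0 && period > 0);
    return (long long)samples * period;
}
```

For the module summary, self_percent(3942, 4012) is about 98.26. With 4012 samples at a period of 30600000, the profile represents about 122,767,200,000 cycles, which over the 40.65 second run time works out to roughly 3.0e9 cycles per second.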
Profiling Results (psprocess, cont’d):
File Summary
--------------------------------------------------------------------------------
Samples Self % Total % File
3182 79.31% 79.31% /u/ncsa/rkufrin/apps/cx3d/velo.f
384 9.57% 88.88% /u/ncsa/rkufrin/apps/cx3d/temp.f
164 4.09% 92.97% /u/ncsa/rkufrin/apps/cx3d/testin.f
143 3.56% 96.54% /u/ncsa/rkufrin/apps/cx3d/curr.f
53 1.32% 97.86% ./include/gm_send_queue.h
23 0.57% 98.43% ??
22 0.55% 98.98% /u/ncsa/rkufrin/apps/cx3d/bound.f
15 0.37% 99.35% /u/ncsa/rkufrin/apps/cx3d/csendxs.f
14 0.35% 99.70% ./libgm/gm_send.c
10 0.25% 99.95% /u/ncsa/rkufrin/apps/cx3d/crecvxs.f
1 0.02% 99.98% ./libgm/gm_ptr_hash.c
1 0.02% 100.00% ./libgm/gm_hash.c
Function Summary
--------------------------------------------------------------------------------
Samples Self % Total % Function
3182 79.31% 79.31% velo
384 9.57% 88.88% temp
164 4.09% 92.97% testin
143 3.56% 96.54% curr
54 1.35% 97.88% gm_send_with_callback
Profiling Results (psprocess, cont’d):
Function:File:Line Summary
--------------------------------------------------------------------------------
Samples Self % Total % Function:File:Line
687 17.12% 17.12% velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:232
535 13.33% 30.46% velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:260
509 12.69% 43.15% velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:210
378 9.42% 52.57% velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:356
189 4.71% 57.28% velo:/u/ncsa/rkufrin/apps/cx3d/velo.f:493
$ mpirun -np 8 psrun -c profile_cycles.xml ./cx
$ psprocess -e cx psrun.PID.xml
profile_cycles.xml:
Summary Information (psprocess):
Aggregate Statistics Min Max Median Mean StdDev Sum
============================================================================================
% CPU utilization..................... 97.88 98.41 98.09 98.12 0.17 784.93
% cycles stalled on any resource...... 0.00 0.00 0.00 0.00 0.00 0.00
CPU time (seconds).................... 39.95 40.15 39.99 40.01 0.07 320.11
Floating point operations per cycle... 0.05 0.05 0.05 0.05 0.00 0.39
Floating point operations per graduated instruction
0.04 0.04 0.04 0.04 0.00 0.31
Graduated instructions per cycle...... 1.27 1.30 1.29 1.29 0.01 10.28
Graduated instructions per issued instruction
0.99 1.00 1.00 1.00 0.00 7.97
Issued instructions per cycle......... 1.28 1.31 1.29 1.29 0.01 10.33
Level 2 cache hit rate (data)......... 0.96 0.97 0.97 0.97 0.00 7.74
Level 2 cache line reuse (data)....... 27.49 30.82 29.57 29.28 1.22 234.26
MFLOPS (cycles)....................... 145.53 154.10 151.18 150.40 3.63 1203.21
MFLOPS (wall clock)................... 142.45 151.50 148.37 147.57 3.64 1180.56
MIPS (cycles)......................... 3881.34 3952.56 3924.68 3922.56 28.18 31380.47
MIPS (wall clock)..................... 3799.24 3877.19 3854.91 3848.68 30.42 30789.40
MVOPS (cycles)........................ 0.00 0.00 0.00 0.00 0.00 0.00
MVOPS (wall clock).................... 0.00 0.00 0.00 0.00 0.00 0.00
Mispredicted branches per correctly predicted branch
0.00 0.01 0.01 0.01 0.00 0.05
Vector instructions per cycle......... 0.00 0.00 0.00 0.00 0.00 0.00
Vector instructions per graduated instruction
0.00 0.00 0.00 0.00 0.00 0.00
Wall clock time (seconds)............. 40.60 40.88 40.79 40.78 0.10 326.25
$ psprocess -c cx.*.xml > combined.xml
$ psprocess combined.xml
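The aggregate columns in a combined report are ordinary descriptive statistics over the per-process values. The sketch below shows one way to compute them; it is not psprocess code, the names are hypothetical, and it assumes the sample (n - 1) form of the variance, which the source does not specify. The StdDev column corresponds to the square root of the variance, which is left out here to avoid a dependency on the math library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    double min, max, median, mean, variance, sum;
} summary_stats;

static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Summarize n per-process values (n >= 1). Sorts the array in place.
 * variance is the sample variance (divides by n - 1); 0 when n == 1. */
summary_stats summarize(double *values, size_t n)
{
    summary_stats s = {0};
    assert(values != NULL && n >= 1);
    qsort(values, n, sizeof values[0], compare_doubles);
    s.min = values[0];
    s.max = values[n - 1];
    s.median = (n % 2) ? values[n / 2]
                       : 0.5 * (values[n / 2 - 1] + values[n / 2]);
    for (size_t i = 0; i < n; i++)
        s.sum += values[i];
    s.mean = s.sum / (double)n;
    if (n > 1) {
        double ss = 0.0;
        for (size_t i = 0; i < n; i++)
            ss += (values[i] - s.mean) * (values[i] - s.mean);
        s.variance = ss / (double)(n - 1);
    }
    return s;
}
```

As a consistency check on the report itself, the wall clock Sum of 326.25 seconds divided by the 8 processes gives the reported Mean of about 40.78 seconds.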
mpiP (http://mpip.sourceforge.net/):
Lightweight, scalable profiling library for MPI applications
Collects statistical information about MPI functions
Uses communication only when reporting; less overhead
How to use
Include mpiP library during link process, no source changes required
Automatic profiling information gathering
No recompiling is required
What to analyze
Overview of application’s time in MPI
Gathers MPI callsites’ aggregate time within the application
mpiP Output:
Report stored in file with suffix “.mpiP”
All data written from MPI task 0
mpiP:
mpiP: mpiP V3.0.0 (Build Oct 4 2006/12:40:28)
mpiP: Direct questions and errors to mpip-help@lists.sourceforge.net
mpiP:
mpiP:
mpiP: Storing mpiP output in [./9-test-mpip-time.exe.2.12390.1.mpiP].
mpiP:
@ mpiP
@ Command : /g/g0/chcham/mpiP/devo/testing/./9-test-mpip-time.exe
@ Version : 2.8.2
@ MPIP Build date : Jan 10 2005, 15:15:47
@ Start time : 2005 01 10 16:01:32
@ Stop time : 2005 01 10 16:01:42
@ Timer Used : gettimeofday
@ MPIP env var : -t 10.0
@ Collector Rank : 0
@ Collector PID : 25972
@ Final Output Dir : .
@ MPI Task Assignment : 0 mcr88
@ MPI Task Assignment : 1 mcr88
@ MPI Task Assignment : 2 mcr89
@ MPI Task Assignment : 3 mcr89
mpiP Summary and Callsite Index:
Per-task and aggregate time/percentage for application and MPI:
---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task AppTime MPITime MPI%
0 10 0.000243 0.00
1 10 10 99.92
2 10 10 99.92
3 10 10 99.92
* 40 30 74.94
Cross-reference to callsites (for interpreting the remainder of the report):
---------------------------------------------------------------------------
@--- Callsites: 2 ---------------------------------------------------------
---------------------------------------------------------------------------
ID Lev File/Address Line Parent_Funct MPI_Call
1 0 9-test-mpip-time.c 52 main Barrier
2 0 9-test-mpip-time.c 61 main Barrier
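The MPI% column is the fraction of application time spent inside MPI calls. A minimal sketch of that calculation follows; the function name is hypothetical and this is not mpiP code.

```c
#include <assert.h>

/* Percentage of application time spent inside MPI calls. */
double mpi_percent(double app_time_s, double mpi_time_s)
{
    assert(app_time_s > 0.0 && mpi_time_s >= 0.0);
    return 100.0 * mpi_time_s / app_time_s;
}
```

The times printed in the table are rounded, so feeding them back in gives slightly different results than the report: the aggregate row's 30 seconds of MPI time out of 40 yields 75.0, whereas the report shows 74.94, presumably computed from the unrounded times. Task 0's 0.000243 seconds out of 10 is about 0.002%, displayed as 0.00.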
Per-Site Time/Message Statistics:
---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call Site Time App% MPI% COV
Barrier 2 3e+04 75.00 100.00 0.67
Barrier 1 0.405 0.00 0.00 0.59
---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call Site Count Total Avrg MPI%
Send 7 320 1.92e+06 6e+03 99.96
Bcast 1 12 336 28 0.02
---------------------------------------------------------------------------
@--- Callsite Time statistics (all, milliseconds): 8 ----------------------
---------------------------------------------------------------------------
Name Site Rank Count Max Mean Min App% MPI%
Barrier 1 0 1 0.107 0.107 0.107 0.00 44.03
Barrier 1 * 4 0.174 0.137 0.107 0.00 0.00
Barrier 2 0 1 0.136 0.136 0.136 0.00 55.97
Barrier 2 1 1 1e+04 1e+04 1e+04 99.92 100.00
Barrier 2 2 1 1e+04 1e+04 1e+04 99.92 100.00
Barrier 2 3 1 1e+04 1e+04 1e+04 99.92 100.00
Barrier 2 * 4 1e+04 7.5e+03 0.136 74.94 100.00
TAU Performance System:
Tuning and Analysis Utilities (15+ year project effort)
Performance system framework for HPC systems
Integrated, scalable, flexible, and parallel
Targets a general complex system computation model
Entities: nodes / contexts / threads
Multi-level: system / software / parallelism
Measurement and analysis abstraction
Integrated toolkit for performance problem solving
Instrumentation, measurement, analysis, and visualization
Portable performance profiling and tracing facility
Performance data management and data mining
Partners: LLNL, ANL, LANL, Research Center Jülich
TAU Parallel Performance System Goals:
Portable (open source) parallel performance system
Computer system architectures and operating systems
Different programming languages and compilers
Multi-level, multi-language performance instrumentation
Flexible and configurable performance measurement
Support for multiple parallel programming paradigms
Multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component-based
Support for performance mapping
Support for comparing performance (single-core/multi-cores)
Integration of leading performance technology
Scalable (very large) parallel performance analysis
TAU Performance System Architecture:
[figure]
Building Bridges to Other Tools: TAU:
[figure]
TAU Instrumentation Approach:
Support for standard program events
Routines, classes and templates
Statement-level blocks
Support for user-defined events
Begin/End events (“user-defined timers”)
Atomic events (e.g., size of memory allocated/freed)
Selection of event statistics
Support for hardware performance counters (PAPI)
Support definition of “semantic” entities for mapping
Support for event groups (aggregation, selection)
Instrumentation optimization
Eliminate instrumentation in lightweight routines
TAU Instrumentation Mechanisms:
Source code
Manual (TAU API, TAU component API)
Automatic (robust)
C, C++, F77/90/95 (Program Database Toolkit (PDT))
OpenMP (directive rewriting (Opari), POMP2 spec)
Object code
Pre-instrumented libraries (e.g., MPI using PMPI)
Statically-linked and dynamically-linked
Executable code
Dynamic instrumentation (pre-execution) (DynInstAPI)
Virtual machine instrumentation (e.g., Java using JVMPI)
TAU_COMPILER to automate instrumentation process
Using TAU: A Brief Introduction:
To instrument source code using PDT
Choose an appropriate TAU stub makefile in <arch>/lib:
% setenv TAU_MAKEFILE /usr/tau-2.x/xt3/lib/Makefile.tau-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optVerbose …' (see tau_compiler.sh)
And use tau_f90.sh, tau_cxx.sh or tau_cc.sh as Fortran, C++ or C compilers:
% mpif90 foo.f90
changes to
% tau_f90.sh foo.f90
Execute application and analyze performance data:
% pprof (for text based profile display)
% paraprof (for GUI)
TAU Measurement Mechanisms:
Parallel profiling
Function-level, block-level, statement-level
Supports user-defined events and mapping events
TAU parallel profile stored (dumped) during execution
Support for flat, callgraph/callpath, phase profiling
Support for memory profiling (headroom, malloc/leaks)
Support for tracking I/O (wrappers, Fortran instrumentation of read/write/print calls)
Tracing
All profile-level events
Inter-process communication events
Inclusion of multiple counter data in traced events
Types of Parallel Performance Profiling:
Flat profiles
Metric (e.g., time) spent in an event (callgraph nodes)
Exclusive/inclusive, # of calls, child calls
Callpath profiles (Calldepth profiles)
Time spent along a calling path (edges in callgraph)
“main => f1 => f2 => MPI_Send” (event name)
TAU_CALLPATH_DEPTH environment variable
Phase profiles
Flat profiles under a phase (nested phases are allowed)
Default “main” phase
Supports static or dynamic (per-iteration) phases
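A callpath event name can be thought of as the tail of the current call stack joined with “ => ”. The sketch below builds such a name and limits it to a given depth; it is an illustration only, not TAU code, and it assumes that a depth limit (as set by TAU_CALLPATH_DEPTH) keeps the innermost routines. The function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a callpath event name such as "f2 => MPI_Send" from a call stack
 * (outermost routine first), keeping at most `depth` innermost routines.
 * A depth <= 0 keeps the whole path. Writes a NUL-terminated string into
 * out, truncating if out_size is too small. */
void callpath_name(const char *const *stack, int nframes, int depth,
                   char *out, size_t out_size)
{
    int first = (depth > 0 && depth < nframes) ? nframes - depth : 0;
    size_t used = 0;
    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (int i = first; i < nframes && used < out_size; i++) {
        int n = snprintf(out + used, out_size - used, "%s%s",
                         (i > first) ? " => " : "", stack[i]);
        if (n < 0)
            break;
        used += (size_t)n;
    }
}
```

With the stack main, f1, f2, MPI_Send, a depth of 4 produces “main => f1 => f2 => MPI_Send” and a depth of 2 produces “f2 => MPI_Send”.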
Performance Analysis and Visualization:
Analysis of parallel profile and trace measurement
Parallel profile analysis
ParaProf: parallel profile analysis and presentation
ParaVis: parallel performance visualization package
Profile generation from trace data (tau2profile)
Performance data management framework (PerfDMF)
Parallel trace analysis
Translation to VTF (V3.0), EPILOG, OTF formats
Integration with VNG (Technical University of Dresden)
Online parallel analysis and visualization
Integration with CUBE browser (KOJAK, UTK, FZJ)
ParaProf Parallel Performance Profile Analysis:
[Figure labels: raw files (HPMToolkit, mpiP, TAU); PerfDMF managed (database); Metadata; Application, Experiment, Trial]
ParaProf – Flat Profile (Miranda, BG/L):
[figure]
ParaProf – Stacked View (Miranda):
[figure]
ParaProf – Callpath Profile (Flash):
Flash
thermonuclear flashes
Fortran + MPI
Argonne
Comparing Effects of MultiCore Processors:
AORSA2D on 4k cores
PAPI resource stalls
Jaguar Cray XT (ORNL)
Blue is single node
Red is dual core
Comparing FLOPS: MultiCore Processors:
AORSA2D on 4k cores
Jaguar Cray XT3 (ORNL)
Floating pt ins/second
Blue is dual core
Red is single node
ParaProf – Scalable Histogram View (Miranda):
8k processors, 16k processors
ParaProf – 3D Full Profile (Miranda):
16k processors
ParaProf – 3D Scatterplot (S3D – XT4 only):
6400 cores
Each point is a “thread” of execution
A total of four metrics shown in relation
ParaVis 3D profile visualization library
JOGL
I/O takes less time on one node (rank 0)
Events (exclusive time metric):
MPI_Barrier(), two loops
write operation
S3D Scatter Plot: Visualizing Hybrid XT3+XT4:
6400 cores
Red nodes are XT4, blue are XT3
S3D: 6400 cores on XT3+XT4 System (Jaguar):
Gap represents XT3 nodes
Visualizing S3D Profiles in ParaProf:
Gap represents XT3 nodes
MPI_Wait takes less time, other routines take more time
Profile Snapshots in ParaProf:
Profile snapshots are parallel profiles recorded at runtime
Used to highlight profile changes during execution
Figure labels: Initialization, Checkpointing, Finalization
Profile Snapshots in ParaProf (cont’d):
Breakdown as a percentage
Acknowledgements (TAU):
Dr. Allen D. Malony, Professor
Alan Morris, Senior software engineer
Wyatt Spear, Software engineer
Scott Biersdorff, Software engineer
Kevin Huck, Ph.D. student
Aroon Nataraj, Ph.D. student
Brad Davidson, Systems administrator