Profiling techniques on SDSC systems
Dmitry Pekurovsky
SDSC Summer Institute
July 16, 2007

Overview of Talk: 

- Standard profiling using prof, gprof and gmon (DataStar, Blue Gene)
- IBM High Performance Computing Toolkit (IHPCT) on DataStar and Blue Gene
  - Hardware Performance Monitoring (HPM)
  - MPI Tracer/Profiler
  - Xprofiler: CPU profiling tool
  - PeekPerf: visualization of performance trace information
- Integrated Performance Monitoring (IPM) on DataStar and Blue Gene
- Suggested standard tuning procedure

Standard Profiling using prof, gprof : 

- Standard profiling (prof, gprof) is available on both DataStar and Blue Gene.
- Three levels of profiling are available with gmon, depending on the -pg and -g options on the compile and link commands:
  - Timer tick profiling information: add -pg to the link options
  - Procedure-level profiling with timer tick info: add -pg to the compile and link options
  - Full profiling (call graph info, statement-level profiling, basic block profiling, and machine instruction profiling): add -pg -g to the compile and link options
- Each task generates a gmon.out.x file, where x is the rank of the task.
- Output can be read using the gprof command (and Xprofiler, as detailed later in the talk).

Available tools: 


Example of profiling using gmon: 

Step 1: Compile using the -pg and -g options:
    mpxlf -pg -g pois-imp.f -o example1
Step 2: Run the code to produce the gmon.out.x files:
    ds100 % ls gmon.out.*
    gmon.out.0 gmon.out.1 gmon.out.2 gmon.out.3
Step 3: Use gprof to analyze the output:
    gprof -s example1 gmon.out.0 gmon.out.1 gmon.out.2 gmon.out.3
    gprof example1 gmon.sum > summary.dat
(The first command produces gmon.sum, which the second command analyzes; the output is redirected to summary.dat.)
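For runs with more than a handful of tasks, typing every gmon.out.x name by hand gets tedious. The following is a minimal Python sketch that assembles the two gprof command lines from Step 3 for a given task count. The helper name gprof_commands and its parameters are illustrative assumptions, not part of gprof; it also assumes ranks are numbered from 0, as in the listing above.

```python
import shlex

def gprof_commands(executable, num_tasks, summary_file="summary.dat"):
    """Build the merge and report command lines for per-task gmon.out.x files.

    Hypothetical helper: it only formats strings and does not run gprof.
    """
    # One profile file per MPI task, named by rank (assumed to start at 0).
    profiles = [f"gmon.out.{rank}" for rank in range(num_tasks)]
    # "gprof -s" merges the per-task profiles into gmon.sum.
    merge = shlex.join(["gprof", "-s", executable, *profiles])
    # The second call reads gmon.sum; its output is redirected to a file.
    report = shlex.join(["gprof", executable, "gmon.sum"])
    report += " > " + shlex.quote(summary_file)
    return merge, report

merge_cmd, report_cmd = gprof_commands("example1", 4)
print(merge_cmd)
print(report_cmd)
```

The two printed lines can be pasted into a shell or a batch script on the system where gprof is installed.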

Example of profiling using gmon: 

gprof output has the call graph info, flat profile and function index.
Section of sample call graph:

                                     called/total       parents
index  %time    self  descendents    called+self    name        index
                                     called/total       children

                0.00         4.13        4/4         .__start [3]
[1]     82.8    0.00         4.13        4        .poisimp [1]
                4.10         0.03        4/4         .poisson [2]
-------------------------------------------------------------------
                4.10         0.03        4/4         .poisimp [1]
[2]     82.8    4.10         0.03        4        .poisson [2]
                0.02         0.00    32000/32000     .solve [12]
                0.01         0.00    84032/84032     ._sin [26]
-------------------------------------------------------------------

Example of profiling using gmon: 

Section of sample flat profile & function index:

   %   cumulative     self               self     total
 time     seconds  seconds    calls   ms/call   ms/call  name
 82.2        4.10     4.10        4   1025.00   1032.50  .poisson [2]
  1.2        4.16     0.06                               uitrunc_const [4]
  1.0        4.21     0.05                               .lapi_recv_vec [5]
  1.0        4.26     0.05                               .shm_submit_slot [6]
  0.8        4.30     0.04                               ._Vector_dgsp_xfer [7]
  0.8        4.34     0.04                               .mpci_send [8]
  0.6        4.37     0.03                               ._lapi_shm_get [9]
  0.6        4.40     0.03                               ._mpi_allgather [10]
  0.6        4.43     0.03                               .allgather_tree_b [11]
  0.4        4.45     0.02    32000      0.00      0.00  .solve [12]
  ...
  ...

Index by function name

 [27] .LAPI_Util.GL        [9] ._lapi_shm_get     [47] .atoi
 [28] .MPI(int)           [40] ._lapi_shm_setup   [48] .fast_free
 [29] .MPID_Welcome_EA    [41] ._mem_alloc        [49] .fclose_unlocked
 [13] .MPID_msg_arrived   [10] ._mpi_allgather    [50] .fflush_unlocked
 [30] .MPI__Allgather     [17] ._null_hndlr       [51] .free.GL
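The per-call columns and percentages in these listings can be cross-checked by hand. The sketch below, in Python, redoes the arithmetic for the .poisson entry using only numbers shown in the sample output (4.10 s self, 0.03 s descendents, 4 calls, 82.2% flat time). The function name ms_per_call is an illustrative assumption, and the total run time is inferred from the percentage rather than read from the output.

```python
def ms_per_call(seconds, calls):
    """Convert a time in seconds spread over `calls` calls to milliseconds per call."""
    return 1000.0 * seconds / calls

# Values copied from the sample call graph and flat profile for .poisson.
self_s, descendents_s, calls = 4.10, 0.03, 4
flat_percent = 82.2

self_ms = ms_per_call(self_s, calls)                   # "self ms/call" column
total_ms = ms_per_call(self_s + descendents_s, calls)  # "total ms/call" column

# Total sampled time implied by the flat-profile percentage (an inference).
implied_total_s = self_s / (flat_percent / 100.0)
# Inclusive share (self + descendents), comparable to %time in the call graph.
inclusive_percent = 100.0 * (self_s + descendents_s) / implied_total_s

print(f"self ms/call  = {self_ms:.2f}")
print(f"total ms/call = {total_ms:.2f}")
print(f"inclusive %   = {inclusive_percent:.1f}")
```

The flat profile charges a routine only for its own time, while the call graph %time also includes its descendents, which is why the two percentages for the same routine differ slightly.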


Xprofiler:

- CPU profiling tool similar to gprof
- Can be used to profile both serial and parallel applications
- Uses procedure-profiling information to construct a graphical display of the functions within an application
- Provides quick access to the profiled data and helps users identify the most CPU-intensive functions
- Based on sampling (support from both compiler and kernel)
- Charges execution time to source lines and shows disassembly code
- Xprofiler is in your default path on DataStar

Running Xprofiler: 

- Compile the program with -g -pg
- Run the program
- A gmon.out file is generated (MPI applications generate gmon.out.1, …, gmon.out.n)
- Run Xprofiler. On DataStar: xprofiler a.out gmon.*

Xprofiler: Main Display: 

- Width of a bar: time including called routines
- Height of a bar: time excluding called routines
- Call arrows labeled with number of calls
- Overview window for easy navigation (View -> Overview)

Xprofiler - Disassembler Code: 



HPMCOUNT:

- hpmcount runs an application and then reports execution wall clock time, hardware performance counter information, derived hardware metrics, and resource utilization statistics (such as memory usage).
- Usage:
  - Serial jobs: hpmcount executable_name
  - Parallel jobs: poe hpmcount executable_name <poe options>
- For parallel jobs you can use the above poe line in a batch script.
- When run in parallel, there is hpm output for each task. Set MP_LABELIO=yes to identify the task ID of the output.

HPMCOUNT (cont.): 


HPM event groups: 


HPMCOUNT output example: 



LIBHPM:

- Instrumentation library
- Provides performance information for instrumented program sections
- Supports multiple (nested) instrumentation sections
- Multiple sections may have the same ID
- Run-time performance information collection
- Available on DataStar and Blue Gene


Functions:

- hpmInit( taskID, progName ) / f_hpminit( taskID, progName )
  - taskID is an integer value indicating the node ID.
  - progName is a string with the program name.
- hpmStart( instID, label ) / f_hpmstart( instID, label )
  - instID is the instrumented section ID. It should be > 0 and <= 100 (can be overridden).
  - label is a string containing a label, which is displayed by PeekPerf.
- hpmStop( instID ) / f_hpmstop( instID )
  - For each call to hpmStart, there should be a corresponding call to hpmStop with matching instID.
- hpmTerminate( taskID ) / f_hpmterminate( taskID )
  - This function generates the output. If the program exits without calling hpmTerminate, no performance information is generated.
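The pairing and range rules for instID are easy to get wrong in code with many instrumented sections. The sketch below is a small Python model of those rules only. It does not call libhpm, and check_sections, MAX_ID and the event tuples are hypothetical names introduced for illustration. It takes the order in which a program would call hpmStart and hpmStop and reports sections that break the rules stated above.

```python
from collections import Counter

MAX_ID = 100  # default upper bound from the slide; the slide notes it can be overridden

def check_sections(events):
    """Check a sequence of ("start" | "stop", inst_id) events against the slide's rules.

    Returns a list of human-readable problems; an empty list means the sequence is consistent.
    """
    open_counts = Counter()  # how many unmatched starts each ID currently has
    problems = []
    for kind, inst_id in events:
        if not 0 < inst_id <= MAX_ID:
            problems.append(f"instID {inst_id} out of range")
        elif kind == "start":
            open_counts[inst_id] += 1
        elif kind == "stop":
            if open_counts[inst_id] == 0:
                problems.append(f"stop without start for instID {inst_id}")
            else:
                open_counts[inst_id] -= 1
    # Any ID still open at the end never received its matching stop.
    problems += [f"instID {i} never stopped"
                 for i, n in sorted(open_counts.items()) if n > 0]
    return problems

# Nested sections 1 and 2, correctly paired.
nested_ok = check_sections([("start", 1), ("start", 2), ("stop", 2), ("stop", 1)])
# Section 5 is started but never stopped.
missing_stop = check_sections([("start", 5)])
print(nested_ok, missing_stop)
```

Counting open starts per ID, rather than requiring unique IDs, reflects the earlier note that multiple sections may share the same ID.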

Message-Passing Performance:

MP_Profiler Library
- Captures 'summary' data for MPI calls
- Source code traceback
- User MUST call MPI_Finalize() in order to get output files
- No changes to source code
- MUST compile with -g to obtain source line number information

MP_Tracer Library
- Captures 'timestamped' data for MPI calls
- Source traceback

Available on DataStar and Blue Gene

Trace flags: 

DataStar:
    IHPCT_BASE = /usr/local/apps/ihpct
    TRACELIB = $(IHPCT_BASE)/lib
    MPITRACE = -L$(TRACELIB) -lmpitrace
    MPIPROF = -L$(TRACELIB) -lmpiprof
    HPMINC = $(IHPCT_BASE)/include
    HPMLIB = -L$(IHPCT_BASE)/lib/pwr4 -lhpm_r -lpmapi -lm

Blue Gene (new - untested):
    IHPCT_BASE = /usr/local/apps/hpc_toolkit
    TRACELIB = $(IHPCT_BASE)/lib
    MPITRACE = -L$(TRACELIB) -lmpitrace_f (OR -lmpitrace_c)
    HPMINC = $(IHPCT_BASE)/include
    HPMLIB = -L$(IHPCT_BASE)/lib -lhpm.rts -lpmapi -lm

MP_Profiler Summary Output: 


MP_Profiler Sample Call Graph Output: 


MP_Profiler Message Size Distribution: 


Environment Flags: 

- TRACELEVEL
  - Level of traceback to the caller in the stack
  - Used to skip wrappers
  - Default: 0
- TRACE_TEXTONLY
  - If set to '1', plain text output is generated
  - Otherwise, a viz file is generated
- TRACE_PERFILE
  - If set to '1', the output is shown for each source file
  - Otherwise, the output is a summary of all source files
- TRACE_PERSIZE
  - If set to '1', the statistics for a function are shown for every message size
  - Otherwise, a summary for all message sizes is given
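These flags are ordinary environment variables, so they can be set in a batch script or by whatever launches the job. As a minimal sketch, the Python helper below builds an environment mapping with the four flags described above. The function trace_env and its keyword names are illustrative assumptions. Leaving a flag unset to get the "otherwise" behavior is also an assumption, since the slide only describes the value '1'.

```python
import os

def trace_env(trace_level=0, text_only=False, per_file=False, per_size=False, base=None):
    """Return a copy of the environment with the MPI trace flags applied.

    Hypothetical helper: `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["TRACELEVEL"] = str(trace_level)
    switches = (
        ("TRACE_TEXTONLY", text_only),  # plain text instead of a viz file
        ("TRACE_PERFILE", per_file),    # per-source-file output
        ("TRACE_PERSIZE", per_size),    # per-message-size statistics
    )
    for name, enabled in switches:
        if enabled:
            env[name] = "1"
        else:
            env.pop(name, None)  # assumption: unset selects the "otherwise" behavior
    return env

# Start from an empty base so only the trace flags appear in the result.
env = trace_env(trace_level=1, text_only=True, per_size=True, base={})
print(sorted(env.items()))
```

The returned mapping can be passed as the env argument of subprocess.run when launching the instrumented program from Python.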


PeekPerf:

PeekPerf is a viewer for data generated by HPM, Tracer and Profiling libraries, and DPOMP.

Integrated Performance Monitoring (IPM) : 

- Allows users to obtain a concise summary of the performance and communication characteristics of their codes.
- Information on use available at (link not preserved)
- On Blue Gene you need to recompile your code, linking to the IPM library by adding -L/usr/local/apps/ipm/lib/ -lipm to the link stage. For example:
    C: mpcc main.c -L/usr/local/apps/ipm/lib/ -lipm
    Fortran: mpxlf90 main.f -L/usr/local/apps/ipm/lib/ -lipm
- Run your job using poe-ipm on DataStar and mpirun-ipm on Blue Gene.
- DO NOT use together with HPMCOUNT!


IPM Output: 

- In addition to the summary, an in-depth analysis is available, including:
  - Load balancing
  - Communication pattern topology
  - Message size distribution
- A file will be produced with a name combining your username and a number generated by IPM (for example, mahidhar.1160615104.920400.0).
- To generate a Web page showing detailed analysis of your code, run the ipm_parse_sdsc command followed by the filename:

    bg-login1 0512/RUN1> /usr/local/apps/ipm/bin/ipm_parse_sdsc mahidhar.1160615104.920400.0
    IPM at SDSC - Webpage creation in progress
    Please wait - this may take several minutes.
    100..200..300..400..500..
    IPM: Data processing finished - Creating HTML output - please wait.
    The web page will be visible at:

- Note: the webpage will stay online for 30 days. It can be regenerated at any time, or a local copy can be saved using your web browser.

IPM results: Webpage snapshot: 



Standard Tuning Procedure: 

- Pick a suitable dataset (a good representation of your production runs) and an optimal processor set
- Get a rough estimate of FLOPS. Running with hpmcount or IPM (on DataStar) is the quickest way to do this. 5-15% of peak is the normal range
- Understand scaling problems by running at different processor counts
- Single processor performance profiling: gprof, Xprofiler, HPM. Identify routines or regions that dominate execution time
- Consider creating a simple kernel that manifests the same behavior, for ease of testing
- HPM: study CPU performance in detail (cache use etc.)
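Two of these steps come down to simple ratios: the fraction of peak FLOPS achieved, and how efficiently a run scales when the processor count changes. The Python sketch below shows both. All numbers in it are hypothetical placeholders rather than measurements; the peak rate per processor must come from the system documentation, and the function names are illustrative.

```python
def percent_of_peak(measured_gflops, peak_gflops):
    """Measured floating-point rate as a percentage of the theoretical peak."""
    return 100.0 * measured_gflops / peak_gflops

def parallel_efficiency(t_base, p_base, t_run, p_run):
    """Efficiency of a run on p_run processors relative to a baseline on p_base.

    1.0 means perfect scaling: processor-seconds stayed constant.
    """
    return (t_base * p_base) / (t_run * p_run)

# Hypothetical example values, not taken from any real run:
measured = 0.6   # GFLOPS per processor, e.g. as reported by hpmcount or IPM
peak = 6.0       # assumed peak GFLOPS per processor
pct = percent_of_peak(measured, peak)

# Hypothetical timings: 100 s on 16 processors, 60 s on 32 processors.
eff = parallel_efficiency(100.0, 16, 60.0, 32)

print(f"{pct:.1f}% of peak")
print(f"parallel efficiency: {eff:.2f}")
```

A percentage inside the 5-15% range quoted above would be typical; an efficiency that falls off quickly as processors are added points to a scaling problem worth investigating with the MPI profiling tools.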

MPI profiling: 



References:

- DataStar user guide
- Blue Gene user guide
- IBM HPC Toolkit
- IPM
