SDSC Blue Gene: Optimization and Debugging
Mahidhar Tatineni
SDSC, April 6, 2007
Overview of Talk
Blue Gene system overview – processor, networks
Compiler optimizations
Second FPU – restrictions, optimizations and limitations
Virtual Node (VN) & Communication Coprocessor (CO) mode
Profiling
Integrated Performance Monitoring (IPM)
Troubleshooting – common issues
Blue Gene core files – using addr2line
Standard tuning procedure
Task mapping
BG System Overview: Processor Chip
Blue Gene compute nodes
700 MHz PowerPC 440 processor
1 integer unit (FXU), 1 load/store unit, 2 FPUs
L1: 32 KB, 32-byte lines, 64-way set associative
L2: 16 lines of 128 bytes; acts as a prefetch buffer
L3: 4 MB, 35 cycles, shared
Main limitations
512 MB memory per node
MPI only, no OpenMP or pthreads
Limited system calls w/ compute node kernel
Executables must be statically linked (no shared libraries)
Blue Gene: Networks
Three-dimensional (3-D) Torus
Interconnects all compute nodes
1.4 Gb/s on each of the 6 bidirectional node links (2.1 GB/s per node)
Global Tree
Collectives functionality
2.8 Gb/s of bandwidth per link
Latency of tree traversal on the order of 5 µs
Interconnects all compute and I/O nodes
Gigabit Ethernet
Low Latency Global Barrier and Interrupt
Control Network
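The per-node torus figure follows from the per-link rate if each of the 6 bidirectional links is counted once per direction. That counting is an assumption about how the 2.1 GB/s total was derived; the slide does not state it. A minimal sketch of the arithmetic:

```python
# Rough arithmetic only; the "2 directions" factor is an assumption.
LINK_RATE_GBIT = 1.4   # Gb/s per link, per direction (from the slide)
LINKS_PER_NODE = 6     # bidirectional torus links per node (from the slide)
DIRECTIONS = 2         # send + receive counted separately (assumed)


def torus_node_bandwidth_gbyte(link_rate_gbit=LINK_RATE_GBIT,
                               links=LINKS_PER_NODE,
                               directions=DIRECTIONS):
    """Aggregate torus bandwidth per node in GB/s (8 bits per byte)."""
    return link_rate_gbit * links * directions / 8.0


if __name__ == "__main__":
    print(f"{torus_node_bandwidth_gbyte():.1f} GB/s per node")
```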
Using the compilers: Options
Compiler options
-qarch=440 uses only a single FPU per processor (minimum option)
-qarch=440d allows both FPUs per processor (alternate option)
-qtune=440 tunes for the 440 processor
-O3 gives minimal optimization with no SIMDization
-O3 -qarch=440d adds backend SIMDization
-O3 -qhot adds TPO (a high-level inter-procedural optimizer) SIMDization and more loop optimization
-O4 adds compile-time interprocedural analysis
-O5 adds link-time interprocedural analysis
(TPO SIMDization is the default with -O4 and -O5)
Current recommendation:
Start with -O3 -qarch=440d -qtune=440
Try -O4, -O5 next
Practical flags
When linking the MASS libraries
-Wl,--allow-multiple-definition
When compilation takes too long
Try compiling on bg-login4 (alternate login node)
Try the -qnoipa option
When compiling .f90 files
-qsuffix=f=f90
To obtain a detailed compilation report
-qdebug=diagnostic -qlist -qsource -qreport=hotlist
With XL compilers, you can combine optimization flags with -g
Second FPU
Generating code that takes advantage of the second FPU requires 16-byte alignment of the data and may need alignment assertions.
The easiest way to take advantage of the second FPU is to use optimized math library routines (such as MASS and ESSL).
The XL compiler has two different components that can generate SIMD code:
The back-end optimizer, with -O3 -qarch=440d
The TPO front end, with -qhot or -O4, -O5
In many applications loads and stores are the bottleneck and can saturate the bandwidth to L3 or memory, so double-FPU instructions help for data in L1 but not for data in L3 or memory.
Second FPU – Usage example
An example using alignment assertions
FORTRAN:
call alignx(16,x(1))
call alignx(16,y(1))
do i = 1, n
y(i) = a*x(i)+y(i)
enddo
C:
double *x, *y;
__alignx(16,x);
__alignx(16,y);
for (i=0; i<n; i++) y[i]=a*x[i]+y[i];
Math libraries from IBM
Engineering and Scientific Subroutine Library (ESSL)
Mathematical Acceleration Subsystem (MASS)
Mathematical Acceleration Subsystem, vector routines (MASSV)
ESSL, MASS, and MASSV are tuned specifically for Blue Gene and can significantly improve performance
Modes for running jobs – VN & CO
The default mode is the communication coprocessor (CO) mode. One of the processors on the node is the main processor and runs the compute process. The second processor behaves as an offload engine (mainly for communication functions).
In Virtual Node (VN) mode both processors are used for compute processes. In this mode the node resources (primarily the memory and the torus network) are shared by both processes. Hence, in VN mode each task has half of the node's memory, compared with all of it in CO mode.
I/O-intensive tasks that require a large amount of data interchange between compute nodes benefit from CO mode.
Applications that are primarily CPU bound and do not have a large per-node memory requirement benefit from VN mode.
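The memory side of the CO/VN trade-off can be sketched as a small helper. The 512 MB figure is the per-node memory quoted earlier in the talk. The function names are hypothetical, the kernel and MPI buffers are ignored (so the usable amount is smaller in practice), and the CPU-bound versus communication-bound consideration is not modeled.

```python
NODE_MEMORY_MB = 512  # per-node memory quoted in the talk


def memory_per_task_mb(mode, node_memory_mb=NODE_MEMORY_MB):
    """Upper bound on memory per MPI task: whole node in CO, half in VN."""
    tasks_per_node = {"CO": 1, "VN": 2}[mode]
    return node_memory_mb // tasks_per_node


def choose_mode(required_mb_per_task):
    """Pick VN when a task fits in half a node, otherwise CO.

    Hypothetical rule of thumb based on memory alone.
    """
    if required_mb_per_task <= memory_per_task_mb("VN"):
        return "VN"
    if required_mb_per_task <= memory_per_task_mb("CO"):
        return "CO"
    raise ValueError("task does not fit in the memory of one node")


if __name__ == "__main__":
    for need in (100, 300):
        print(need, "MB per task ->", choose_mode(need))
```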
Profiling your code on the Blue Gene
Standard profiling (prof, gprof) is available on the Blue Gene.
Three levels of profiling are available with gmon, depending on the -pg and -g options on the compile and link commands:
Timer tick profiling information: add -pg to the link options
Procedure-level profiling with timer tick info: add -pg to the compile and link options
Full profiling (call graph info, statement-level profiling, basic block profiling, and machine instruction profiling): add -pg -g to the compile and link options
Each task generates a gmon.out.x file where x corresponds to the rank of the task.
Output can be read using the gprof command.
Example of profiling using gmon
/bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gprof --sumbg poisson
gprof poisson gmon.sum > test.out
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 40.59     11.45    11.45        8     1.43     2.55  poisson
 31.58     20.36     8.91   204800     0.00     0.00  solve
 11.59     23.63     3.27                             cvtloop
  3.47     24.61     0.98                             BGLML_Messager_VMadvance
  1.95     25.16     0.55                             __cvt_r
  1.28     25.52     0.36                             BGLML_Messager_tree_advance
  1.28     25.88     0.36                             memcpy
  1.10     26.19     0.31                             WriteUnit
  0.96     26.46     0.27                             write
  0.71     26.66     0.20                             sinl
  0.67     26.85     0.19                             _xlfBeginIO
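A flat profile like the one above can be post-processed to list the routines that dominate run time. The sketch below is a hypothetical helper, not part of gprof. It assumes the whitespace-separated column layout shown in the talk, where rows without call counts carry only four fields.

```python
SAMPLE = """\
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 40.59     11.45    11.45        8     1.43     2.55  poisson
 31.58     20.36     8.91   204800     0.00     0.00  solve
 11.59     23.63     3.27                             cvtloop
  1.28     25.88     0.36                             memcpy
"""


def parse_flat_profile(text):
    """Return one dict per row of a gprof flat profile."""
    rows = []
    in_table = False
    for line in text.splitlines():
        tokens = line.split()
        if not in_table:
            # The second header line starts with "time" and ends with "name".
            in_table = (len(tokens) >= 2 and tokens[0] == "time"
                        and tokens[-1] == "name")
            continue
        if len(tokens) < 4:
            break  # blank line or end of the table
        try:
            percent, cumulative, self_s = (float(t) for t in tokens[:3])
        except ValueError:
            break  # no longer a data row
        # Rows for routines without call counts have fewer than 7 fields.
        calls = int(tokens[3]) if len(tokens) >= 7 else None
        rows.append({"percent": percent, "cumulative_s": cumulative,
                     "self_s": self_s, "calls": calls, "name": tokens[-1]})
    return rows


def hot_routines(rows, min_percent=10.0):
    """Names of routines that take at least min_percent of the run time."""
    return [r["name"] for r in rows if r["percent"] >= min_percent]


if __name__ == "__main__":
    print(hot_routines(parse_flat_profile(SAMPLE)))
```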
Integrated Performance Monitoring (IPM)
Integrated Performance Monitoring (IPM) is a tool that allows users to obtain a concise summary of the performance and communication characteristics of their codes.
Recompile your code, linking to the IPM library by adding -L/usr/local/apps/ipm/lib/ -lipm to the link stage. For example:
C: mpcc main.c -L/usr/local/apps/ipm/lib/ -lipm
Fortran: mpxlf90 main.f -L/usr/local/apps/ipm/lib/ -lipm
Run your job using mpirun-ipm
IPM Output
For both Blue Gene and DataStar, a report summarizing the collected data is produced at the end of your output. Additionally, a file is produced whose name contains your username and a number generated by IPM (for example mahidhar.1160615104.920400.0).
In order to generate a Web page showing the analysis of your code, run the ipm_parse command followed by the filename.
bg-login1 0512/RUN1> /usr/local/apps/ipm/bin/ipm_parse_sdsc mahidhar.1160615104.920400.0
IPM at SDSC - Webpage creation in progress
Please wait - this may take several minutes.
100..200..300..400..500..
IPM: Data processing finished - Creating HTML output - please wait.
The web page will be visible at:
http://www.sdsc.edu/us/tools/top/ipm/output/bgsn.14860.0
Note the webpage will stay online for 30 days
It can be regenerated at any time,
or a local copy can be saved using your web browser
IPM results: Webpage snapshot
Troubleshooting – Common Issues
Running out of memory
Rogue pointers: Blue Gene applications run in the same address space as the Compute Node Kernel and the communications buffers. A stray pointer can reference the area used for communications (the Compute Node Kernel itself is protected). This can lead to spurious and unpredictable communication errors, and may even cause the node to hang.
Forcing MPI to allocate too much memory through excessive buffering of messages
Using unsupported system calls (details on next slide)
Unsupported System Calls w/ CNK
The following calls are not supported by the Compute Node Kernel (CNK)
fork() and pthread_create()
system() function
gethostname() and getlogin()
signal(SIGTRAP,xl__trce) or signal(SIGNAL,xl__trbk)
usleep()
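A quick text scan can flag some of the calls listed above before a job fails on the compute nodes. The sketch below is a hypothetical helper, not an IBM or SDSC tool. It matches names followed by an opening parenthesis, so it also reports hits inside comments and strings, and it does not cover the signal handlers.

```python
import re

# Call names taken from the list above; the scan itself is a rough heuristic.
UNSUPPORTED = ("fork", "pthread_create", "system",
               "gethostname", "getlogin", "usleep")
PATTERN = re.compile(r"\b(" + "|".join(UNSUPPORTED) + r")\s*\(")


def find_unsupported_calls(source_text):
    """Return (line_number, call_name) pairs for suspicious calls."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((number, match.group(1)))
    return hits


if __name__ == "__main__":
    example = """#include <unistd.h>
int main(void) {
    char host[64];
    gethostname(host, sizeof host);
    usleep(1000);
    return 0;
}
"""
    for number, name in find_unsupported_calls(example):
        print(f"line {number}: {name}() is not supported by the CNK")
```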
Core files on the Blue Gene
The core files on the Blue Gene are in plain text. A sample is as follows:
bg-login1 /gpfs-wan/mahidhar> more core.0
Summary:
program.........................../a.out
ended with software signal.......0x00000005 (SIGTRAP - trace trap)
generated by interrupt...........0x00000006 (program interrupt)
while executing instruction at...0x0020074c
..
..
Memory:
stack top........................0x10000000
stack frame pointer..............0x0fff7ef0
end of heap......................0x00386000
start of program.................0x00200000
brk() failed w/ ENOMEM...........0 time(s)
..
..
Function Call Chain:
0x0020074c
0x002001e4
End of stack
The addresses in the function call chain can be translated to source lines using the addr2line command.
bg-login1 /gpfs-wan/mahidhar> addr2line a.out 0x0020074c
??:0
/gpfs-wan/mahidhar/sample.f:38
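Because the core files are plain text, the call-chain addresses can be pulled out with a short script and handed to addr2line in one go. The sketch below assumes the core-file layout shown in the sample and only builds the command. It names the executable with addr2line's -e option, whereas the session above relies on a.out being the default. Whether a cross-toolchain addr2line should be substituted on a given system is not covered here.

```python
import re

SAMPLE_CORE = """\
Summary:
program.........................../a.out
while executing instruction at...0x0020074c
Function Call Chain:
0x0020074c
0x002001e4
End of stack
"""


def call_chain_addresses(core_text):
    """Extract the hex addresses listed under 'Function Call Chain:'."""
    addresses = []
    in_chain = False
    for line in core_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Function Call Chain"):
            in_chain = True
        elif in_chain and stripped.startswith("End of stack"):
            break
        elif in_chain and re.fullmatch(r"0x[0-9a-fA-F]+", stripped):
            addresses.append(stripped)
    return addresses


def addr2line_command(executable, addresses):
    """Build an addr2line argument list; -e names the executable."""
    return ["addr2line", "-e", executable] + list(addresses)


if __name__ == "__main__":
    chain = call_chain_addresses(SAMPLE_CORE)
    print(" ".join(addr2line_command("./a.out", chain)))
```

The returned list can be passed to subprocess.run on a machine where the executable and a suitable addr2line are available.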
Standard Tuning Procedure
Pick a suitable dataset and an optimal processor set
Get a rough estimate of the % of peak FLOPS
5-15% range is normal
Understand scaling problems by running at different processor counts
Run using IPM to check
Communication/computation ratio
Any anomalies: too many messages, too many collectives
Large differences between the profiles of different tasks
Understand load imbalances, if any
Example: task 0 spends too much time in I/O
Task n has a very small communication time compared to the others
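The rough % of peak estimate in the procedure above can be computed as follows. The sketch assumes each 700 MHz processor can complete 4 floating-point operations per cycle (two FPUs, each issuing one fused multiply-add), and that only processors running compute processes are counted (1 per node in CO mode, 2 in VN mode). Both are assumptions, not statements from the talk, and the measured GFLOPS value is a made-up input.

```python
CLOCK_GHZ = 0.7       # 700 MHz processor clock (from the talk)
FLOPS_PER_CYCLE = 4   # assumed: 2 FPUs x 1 fused multiply-add (2 flops) each


def peak_gflops(nodes, cores_per_node):
    """Theoretical peak in GFLOPS for the processors doing computation."""
    return nodes * cores_per_node * CLOCK_GHZ * FLOPS_PER_CYCLE


def percent_of_peak(measured_gflops, nodes, cores_per_node):
    """Measured performance as a percentage of the theoretical peak."""
    return 100.0 * measured_gflops / peak_gflops(nodes, cores_per_node)


if __name__ == "__main__":
    # Hypothetical run: 64 nodes in VN mode sustaining 28 GFLOPS in total.
    value = percent_of_peak(28.0, nodes=64, cores_per_node=2)
    print(f"{value:.1f}% of peak")
```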
Task Mapping
The default task layout is XYZT. In VN mode this can lead to inefficiencies: you get two tasks on every node only if #tasks = 2*#nodes. Otherwise the XYZT layout leaves some nodes with just one task.
Set BGLMPI_MAPPING=TXYZ to ensure that you get two tasks per node when you are in VN mode and ask for fewer than 2*#nodes tasks. The TXYZ mapping puts tasks 0 and 1 on the first node, tasks 2 and 3 on the next node, and so on, with the nodes in x, y, z torus order.
A mapfile can be used to specify the mapping of tasks to nodes.
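The difference between the two layouts can be illustrated by computing where each rank lands. The sketch reads a mapping string as "the first letter varies fastest", which matches the TXYZ description above. It is an illustration of the layouts, not the MPI library's implementation, and the 2x2x2 partition is a made-up example.

```python
from collections import Counter


def coords_for_rank(rank, dims, order):
    """Torus coordinates (x, y, z, t) of an MPI rank in VN mode.

    dims  -- (X, Y, Z) size of the partition; T is 2 in VN mode
    order -- mapping string such as "XYZT" or "TXYZ"; the first letter
             is taken to be the coordinate that varies fastest (assumed)
    """
    sizes = {"X": dims[0], "Y": dims[1], "Z": dims[2], "T": 2}
    coord = {}
    for axis in order:
        rank, coord[axis] = divmod(rank, sizes[axis])
    return coord["X"], coord["Y"], coord["Z"], coord["T"]


def tasks_per_node(num_tasks, dims, order):
    """Count how many tasks land on each (x, y, z) node."""
    if num_tasks > 2 * dims[0] * dims[1] * dims[2]:
        raise ValueError("more tasks than processors in the partition")
    return Counter(coords_for_rank(r, dims, order)[:3]
                   for r in range(num_tasks))


if __name__ == "__main__":
    dims, tasks = (2, 2, 2), 12   # 8 nodes, fewer than 2*8 tasks
    for order in ("XYZT", "TXYZ"):
        counts = tasks_per_node(tasks, dims, order)
        single = sum(1 for c in counts.values() if c == 1)
        print(order, "nodes used:", len(counts), "single-task nodes:", single)
```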
Mapfile
Can be used with the -mapfile option of mpirun
The mapfile contains the information for associating torus coordinates to MPI ranks 0 to N-1
The format of the mapfile is as follows:
x0 y0 z0 t0
x1 y1 z1 t1
x2 y2 z2 t2
...
where MPI task 0 is mapped to torus coordinates x0,y0,z0 using processor t0 on that node.
The processor number, t0, is always 0 for coprocessor mode, and is either 0 or 1 for virtual node mode. There is one line in the mapping file for each MPI task, in MPI rank order.
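A mapfile in the format described above can be generated from an explicit list of placements. The sketch below is a hypothetical helper that writes one "x y z t" line per MPI task in rank order and rejects out-of-range or duplicated entries. The partition size and placements are made-up inputs, and the checks mirror only the rules stated in the talk.

```python
def mapfile_text(placements, dims, vn_mode):
    """Return mapfile contents, one 'x y z t' line per MPI rank.

    placements -- list of (x, y, z, t) tuples, index = MPI rank
    dims       -- (X, Y, Z) size of the partition
    vn_mode    -- True for virtual node mode (t may be 0 or 1),
                  False for coprocessor mode (t must be 0)
    """
    max_t = 1 if vn_mode else 0
    seen = set()
    lines = []
    for rank, (x, y, z, t) in enumerate(placements):
        if not (0 <= x < dims[0] and 0 <= y < dims[1] and 0 <= z < dims[2]):
            raise ValueError(f"rank {rank}: coordinates outside the partition")
        if not 0 <= t <= max_t:
            raise ValueError(f"rank {rank}: invalid processor number {t}")
        if (x, y, z, t) in seen:
            raise ValueError(f"rank {rank}: processor already assigned")
        seen.add((x, y, z, t))
        lines.append(f"{x} {y} {z} {t}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Three tasks on a 2x1x1 partition in VN mode.
    text = mapfile_text([(0, 0, 0, 0), (0, 0, 0, 1), (1, 0, 0, 0)],
                        dims=(2, 1, 1), vn_mode=True)
    print(text, end="")
```

The resulting text can be written to a file and passed to mpirun with the -mapfile option.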
Cartesian communicator functions
Functions to map nodes to specific hardware or processor set (pset) configurations
PMI_Cart_comm_create()
Creates a four-dimensional communicator that mimics the exact hardware on which it is run. This is a collective operation which runs on all the nodes.
PMI_Pset_same_comm_create()
Creates a set of communicators, where all the nodes in a given communicator are part of the same pset (all share the same I/O node)
Can be used to manage I/O effectively
PMI_Pset_diff_comm_create()
Creates a set of communicators, where no two nodes in the communicator are part of the same pset.
Can be used to manage I/O effectively
References
Blue Gene Web site at SDSC
http://www.sdsc.edu/us/resources/bluegene
Blue Gene Application Development Guide (IBM Redbooks)
http://www.redbooks.ibm.com/abstracts/sg247179.html
Exploiting the Dual Floating Point Units in Blue Gene/L (white paper)
http://www-1.ibm.com/support/docview.wss?uid=swg27007511&aid=1
Using the XL compilers for Blue Gene
http://www-1.ibm.com/support/docview.wss?uid=swg27007895&aid=1