SDSC BG Optimization Debug

SDSC Blue Gene: Optimization and Debugging
Mahidhar Tatineni
SDSC, April 6, 2007

Overview of Talk: 

- Blue Gene system overview: processor, networks
- Compiler optimizations
- Second FPU: restrictions, optimizations and limitations
- Virtual Node (VN) & Communication Coprocessor (CO) mode
- Profiling
- Integrated Performance Monitoring (IPM)
- Troubleshooting: common issues
- Blue Gene core files: using addr2line
- Standard tuning procedure
- Task mapping

BG System Overview: Processor Chip: 


Blue Gene compute nodes : 

- 700 MHz PowerPC processor
  - 1 integer unit (FXU), 1 load/store unit, 2 FPUs
  - L1: 32 KB, 32-byte lines, 64-way
  - L2: 16 lines of 128 bytes each, acts as a prefetch buffer
  - L3: 4 MB, 35 cycles, shared
- Main limitations
  - 512 MB memory per node
  - MPI only; no OpenMP or pthreads
  - Limited system calls with the Compute Node Kernel
  - Executables must be statically linked (no shared libraries)

Blue Gene: Networks: 

- Three-dimensional (3-D) torus
  - Interconnects all compute nodes
  - 1.4 Gb/s on each of the 6 bidirectional node links (2.1 GB/s per node, counting both directions)
- Global tree
  - Collectives functionality
  - 2.8 Gb/s of bandwidth per link
  - Latency of tree traversal on the order of 5 µs
  - Interconnects all compute and I/O nodes
- Gigabit Ethernet
- Low-latency global barrier and interrupt network
- Control network
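The per-node torus figure follows from the per-link rate. The sketch below is only a worked check of that arithmetic; the function name is hypothetical, and it assumes the 2.1 GB/s figure counts each of the 6 links in both directions.

```c
/* Aggregate torus bandwidth per node in GB/s.
   link_gbit: per-link rate in Gb/s, per direction (1.4 in the slide).
   neighbors: number of torus neighbors (6 in a 3-D torus).
   ASSUMPTION: the per-node figure counts send and receive on every link. */
double torus_node_bandwidth_gbytes(double link_gbit, int neighbors)
{
    int unidirectional_links = 2 * neighbors;      /* one in, one out per neighbor */
    return link_gbit * unidirectional_links / 8.0; /* 8 bits per byte */
}
```

Calling torus_node_bandwidth_gbytes(1.4, 6) gives approximately 2.1, matching the per-node number quoted for the torus.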

Using the compilers: Options: 

Compiler options:

    -qarch=440          uses only a single FPU per processor (minimum option)
    -qarch=440d         allows both FPUs per processor (alternate option)
    -qtune=440          tunes for the 440 processor
    -O3                 gives minimal optimization with no SIMDization
    -O3 -qarch=440d     adds back-end SIMDization
    -O3 -qhot           adds TPO (a high-level interprocedural optimizer) SIMDization and more loop optimization
    -O4                 adds compile-time interprocedural analysis
    -O5                 adds link-time interprocedural analysis

TPO SIMDization is on by default with -O4 and -O5.

Current recommendation:
- Start with -O3 -qarch=440d -qtune=440
- Try -O4, then -O5

Practical flags: 

- When linking the MASS libraries: -Wl,--allow-multiple-definition
- When compilation takes too long:
  - Try compiling on bg-login4 (alternate login node)
  - Try the -qnoipa option
- When compiling .f90 files: -qsuffix=f=f90
- To obtain a detailed compilation report: -qdebug=diagnostic -qlist -qsource -qreport=hotlist
- With the XL compilers, optimization flags can be combined with -g

Second FPU: 

- Generating code that uses the second FPU requires 16-byte alignment, and the compiler may need alignment assertions to know the data is aligned.
- The easiest way to take advantage of the second FPU is to use optimized math library routines (such as MASS and ESSL).
- The XL compiler has two different components that can generate SIMD code:
  - The back-end optimizer, with -O3 -qarch=440d
  - The TPO front end, with -qhot, -O4 or -O5
- In many applications loads and stores are the bottleneck, and the bandwidth to L3 or memory can be saturated. Double-FPU instructions therefore help for data in L1 but not for data in L3 or memory.

Second FPU – Usage example: 

An example using alignment assertions.

Fortran:

    call alignx(16, x(1))
    call alignx(16, y(1))
    do i = 1, n
       y(i) = a*x(i) + y(i)
    enddo

C:

    double *x, *y;
    __alignx(16, x);
    __alignx(16, y);
    for (i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
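The alignment assertion only tells the compiler what to expect; the arrays still have to be allocated on 16-byte boundaries. A minimal portable sketch of that side, assuming a C11 library that provides aligned_alloc, might look as follows. The helper names is_aligned16, alloc_aligned16 and daxpy are hypothetical, and the XL-specific __alignx call is left out so the code compiles anywhere.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Allocates room for n doubles on a 16-byte boundary; NULL on failure.
   The byte count is rounded up to a multiple of 16, as aligned_alloc
   expects. Release the memory with free(). */
double *alloc_aligned16(size_t n)
{
    size_t bytes = (n * sizeof(double) + 15) / 16 * 16;
    return aligned_alloc(16, bytes);
}

/* The loop from the slide: y = a*x + y. With the XL compiler one would
   add __alignx(16, x) and __alignx(16, y) before the loop. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Checking is_aligned16 on each array before relying on the assertion is a cheap way to catch a misaligned pointer, for example one that points into the middle of a larger buffer.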

Math libraries from IBM: 

- Engineering and Scientific Subroutine Library (ESSL)
- Mathematical Acceleration Subsystem (MASS)
- Mathematical Acceleration Subsystem, vector version (MASSV)

ESSL, MASS and MASSV are tuned specifically for Blue Gene and can improve performance significantly.

Modes for running jobs – VN & CO: 

- The default mode is the communication coprocessor (CO) mode. One of the processors on the node is the main processor and runs the compute process. The second processor behaves as an offload engine (mainly for communication functions).
- In the Virtual Node (VN) mode both processors are used for compute processes. In this mode the node resources (primarily the memory and the torus network) are shared by both processes, so in VN mode each task has half the memory it would have in CO mode.
- I/O-intensive tasks that require a large amount of data interchange between compute nodes benefit from the CO mode.
- Applications that are primarily CPU bound and do not have a large per-node memory requirement benefit from the VN mode.
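As a rough illustration of the memory trade-off between the two modes, the sketch below divides the node memory by the number of tasks per node. The function names are hypothetical, and the calculation ignores whatever the kernel and MPI buffers consume, so the usable figure is somewhat lower in practice.

```c
/* Memory available to each MPI task, in MB.
   node_mem_mb: memory per node (512 on the system described here).
   vn_mode: nonzero for Virtual Node mode, zero for CO mode.
   ASSUMPTION: kernel and communication-buffer overhead is ignored. */
int mem_per_task_mb(int node_mem_mb, int vn_mode)
{
    int tasks_per_node = vn_mode ? 2 : 1;
    return node_mem_mb / tasks_per_node;
}

/* Returns 1 if a task needing need_mb could run in VN mode, else 0. */
int fits_in_vn_mode(int node_mem_mb, int need_mb)
{
    return need_mb <= mem_per_task_mb(node_mem_mb, 1);
}
```

With 512 MB nodes, a task that needs 300 MB does not fit in VN mode under this estimate, while one that needs 200 MB does.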

Profiling your code on the Blue Gene: 

Standard profiling (prof, gprof) is available on the Blue Gene. Three levels of profiling are available with gmon, depending on the -pg and -g options on the compile and link commands:

- Timer tick profiling information: add -pg to the link options
- Procedure-level profiling with timer tick information: add -pg to the compile and link options
- Full profiling (call graph information, statement-level profiling, basic block profiling and machine instruction profiling): add -pg -g to the compile and link options

Each task generates a gmon.out.x file, where x is the rank of the task. The output can be read using the gprof command.

Example of profiling using gmon: 

    /bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gprof --sumbg poisson
    gprof poisson gmon.sum > test.out

Flat profile:

Each sample counts as 0.01 seconds.

      %   cumulative     self                self     total
     time    seconds  seconds     calls    s/call    s/call  name
    40.59      11.45    11.45         8      1.43      2.55  poisson
    31.58      20.36     8.91    204800      0.00      0.00  solve
    11.59      23.63     3.27                                cvtloop
     3.47      24.61     0.98                                BGLML_Messager_VMadvance
     1.95      25.16     0.55                                __cvt_r
     1.28      25.52     0.36                                BGLML_Messager_tree_advance
     1.28      25.88     0.36                                memcpy
     1.10      26.19     0.31                                WriteUnit
     0.96      26.46     0.27                                write
     0.71      26.66     0.20                                sinl
     0.67      26.85     0.19                                _xlfBeginIO

Integrated Performance Monitoring (IPM) : 

Integrated Performance Monitoring (IPM) is a tool that gives users a concise summary of the performance and communication characteristics of their codes.

- Recompile your code, linking to the IPM library by adding -L/usr/local/apps/ipm/lib/ -lipm to the link stage. For example:

    C:       mpcc main.c -L/usr/local/apps/ipm/lib/ -lipm
    Fortran: mpxlf90 main.f -L/usr/local/apps/ipm/lib/ -lipm

- Run your job using mpirun-ipm

IPM Output: 

For both Blue Gene and DataStar, a report summarizing the collected data is produced at the end of your output. In addition, a file is produced whose name contains your username and a number generated by IPM (for example mahidhar.1160615104.920400.0).

To generate a Web page showing the analysis of your code, run the ipm_parse command followed by the filename:

    bg-login1 0512/RUN1> /usr/local/apps/ipm/bin/ipm_parse_sdsc mahidhar.1160615104.920400.0
    IPM at SDSC - Webpage creation in progress
    Please wait - this may take several minutes.
    100..200..300..400..500..
    IPM: Data processing finished - Creating HTML output - please wait.
    The web page will be visible at:
    http://www.sdsc.edu/us/tools/top/ipm/output/bgsn.14860.0
    Note the webpage will stay online for 30 days
    It can be regenerated at any time, or a local copy can be saved using your web browser

IPM results: Webpage snapshot: 


Troubleshooting – Common Issues: 

- Running out of memory
- Rogue pointers: Blue Gene applications run in the same address space as the Compute Node Kernel and the communication buffers. The kernel itself is protected, but a pointer can still reference the area used for communications. This can lead to spurious and unpredictable communication errors, and may even cause the node to hang.
- Forcing MPI to allocate too much memory through excessive buffering of messages
- Using unsupported system calls (details on next slide)

Unsupported System Calls w/ CNK: 

The following calls are not supported by the Compute Node Kernel (CNK):

- fork() and pthread_create()
- system()
- gethostname() and getlogin()
- signal(SIGTRAP,xl__trcce) or signal(SIGNAL,xl__trbk)
- usleep()

Core files on the Blue Gene: 

The core files on the Blue Gene are in plain text. A sample follows:

    bg-login1 /gpfs-wan/mahidhar> more core.0
    Summary:
      program.........................../a.out
      ended with software signal.......0x00000005 (SIGTRAP - trace trap)
      generated by interrupt...........0x00000006 (program interrupt)
      while executing instruction at...0x0020074c
    .. ..
    Memory:
      stack top........................0x10000000
      stack frame pointer..............0x0fff7ef0
      end of heap......................0x00386000
      start of program.................0x00200000
      brk() failed w/ ENOMEM...........0 time(s)
    .. ..
    Function Call Chain:
      0x0020074c
      0x002001e4
    End of stack

The address can be translated using the addr2line command:

    bg-login1 /gpfs-wan/mahidhar> addr2line a.out 0x0020074c
    ??:0
    /gpfs-wan/mahidhar/sample.f:38

The ??:0 line appears because addr2line reads a.out by default and treats the a.out argument as an address; naming the executable with -e avoids it.

Standard Tuning Procedure: 

- Pick a suitable dataset and an optimal processor set
- Get a rough estimate of % of peak FLOPS; the 5-15% range is normal
- Understand scaling problems by running at different processor counts
- Run using IPM to check:
  - Communication/computation ratio
  - Any anomalies: too many messages, too many collectives
  - Large differences between the profiles of different tasks, etc.
- Understand load imbalances, if any. Examples: task 0 spends too much time in I/O; task n has a very small communication time compared to the others.
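Two of these steps reduce to simple arithmetic, sketched below. Both function names are hypothetical. The peak calculation assumes 4 floating-point operations per cycle per processor (two FPUs, each issuing a fused multiply-add); that figure is not stated in the slides and should be checked against the hardware documentation. The imbalance metric (maximum task time over mean task time) is one common choice, not something the slides prescribe.

```c
#include <stddef.h>

/* Percent of theoretical peak for a whole job.
   measured_gflops: sustained rate of the job in GFLOPS.
   cores:           number of processors doing computation.
   clock_ghz:       0.7 for the 700 MHz processor.
   flops_per_cycle: ASSUMPTION, 4.0 = 2 FPUs x fused multiply-add. */
double percent_of_peak(double measured_gflops, int cores,
                       double clock_ghz, double flops_per_cycle)
{
    double peak_gflops = cores * clock_ghz * flops_per_cycle;
    return 100.0 * measured_gflops / peak_gflops;
}

/* Load imbalance: slowest task time divided by mean task time.
   1.0 means perfectly balanced; returns 0.0 for empty or all-zero input. */
double load_imbalance(const double *task_seconds, size_t ntasks)
{
    if (ntasks == 0)
        return 0.0;
    double max = task_seconds[0];
    double sum = 0.0;
    for (size_t i = 0; i < ntasks; i++) {
        if (task_seconds[i] > max)
            max = task_seconds[i];
        sum += task_seconds[i];
    }
    if (sum <= 0.0)
        return 0.0;
    return max / (sum / (double)ntasks);
}
```

Under these assumptions, a job sustaining 28 GFLOPS on 100 processors runs at about 10% of peak, inside the range described as normal. The per-task times fed to load_imbalance could come from the IPM report.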

Task Mapping: 

- The default task layout is XYZT, which can lead to inefficiencies in VN mode: you get two tasks per node only if #tasks = 2*#nodes. Otherwise the XYZT layout leaves some nodes with just one task.
- Set BGLMPI_MAPPING=TXYZ to ensure that you get two tasks per node when you are in VN mode and are asking for fewer than 2*#nodes tasks.
- The TXYZ mapping puts tasks 0 and 1 on the first node, tasks 2 and 3 on the next node and so on, with the nodes in x, y, z torus order.
- A mapfile can be used to specify the mapping of tasks to nodes.
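The difference between the two layouts can be sketched as rank-to-coordinate arithmetic. This is an illustration under the assumption that the letters name the coordinates from fastest-varying to slowest-varying, which is consistent with the behavior described above; the Coord type and both function names are hypothetical.

```c
/* Torus coordinates plus processor index t (0 or 1 in VN mode). */
typedef struct { int x, y, z, t; } Coord;

/* XYZT order: x varies fastest, then y, then z, and t last, so the
   second processor of any node is used only after every node holds
   one task. */
Coord map_xyzt(int rank, int nx, int ny, int nz)
{
    Coord c;
    c.x = rank % nx;
    c.y = (rank / nx) % ny;
    c.z = (rank / (nx * ny)) % nz;
    c.t = rank / (nx * ny * nz);
    return c;
}

/* TXYZ order: t varies fastest, so ranks 0 and 1 share the first node,
   ranks 2 and 3 the next, with nodes visited in x, y, z order. */
Coord map_txyz(int rank, int nx, int ny, int nz)
{
    Coord c;
    int node = rank / 2;   /* two tasks per node in VN mode */
    c.t = rank % 2;
    c.x = node % nx;
    c.y = (node / nx) % ny;
    c.z = (node / (nx * ny)) % nz;
    return c;
}
```

For a 4x4x2 partition (32 nodes) and 48 tasks, map_xyzt places ranks 32 to 47 on the second processor of the first 16 nodes and leaves the other 16 nodes with one task each, whereas map_txyz fills 24 nodes with two tasks each.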

Mapfile : 

- Can be used with the -mapfile option of mpirun
- The mapfile contains the information for associating torus coordinates with MPI ranks 0 to N-1
- The format of the mapfile is as follows:

    x0 y0 z0 t0
    x1 y1 z1 t1
    x2 y2 z2 t2
    ...

  where MPI task 0 is mapped to torus coordinates x0,y0,z0 using processor t0 on that node. The processor number, t0, is always 0 for coprocessor mode, and is either 0 or 1 for virtual node mode.
- There is one line in the mapping file for each MPI task, in MPI rank order.
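A mapfile in this format is easy to generate programmatically. The following is a minimal sketch that formats one "x y z t" line per rank into a caller-supplied buffer, which can then be written to a file. The function name and its error convention are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats ntasks mapfile lines ("x y z t\n") into buf, one per MPI rank
   in rank order. coords[r] holds {x, y, z, t} for rank r.
   Returns the number of characters written (excluding the terminating
   NUL), or -1 if buf is too small. */
int format_mapfile(char *buf, size_t cap, const int coords[][4], int ntasks)
{
    size_t used = 0;
    if (cap == 0)
        return -1;
    buf[0] = '\0';
    for (int r = 0; r < ntasks; r++) {
        int n = snprintf(buf + used, cap - used, "%d %d %d %d\n",
                         coords[r][0], coords[r][1],
                         coords[r][2], coords[r][3]);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;   /* output would have been truncated */
        used += (size_t)n;
    }
    return (int)used;
}
```

For example, the coordinates {0,0,0,0}, {0,0,0,1}, {1,0,0,0} produce three lines that place ranks 0 and 1 on the two processors of node (0,0,0) and rank 2 on node (1,0,0).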

Cartesian communicator functions: 

Functions to map nodes to specific hardware or processor set (pset) configurations:

- PMI_Cart_comm_create()
  - Creates a four-dimensional communicator that mimics the exact hardware on which it is run.
  - This is a collective operation that runs on all the nodes.
- PMI_Pset_same_comm_create()
  - Creates a set of communicators, where all the nodes in a given communicator are part of the same pset (all share the same I/O node).
  - Can be used to manage I/O effectively.
- PMI_Pset_diff_comm_create()
  - Creates a set of communicators, where no two nodes in a communicator are part of the same pset.
  - Can be used to manage I/O effectively.

References : 

- Blue Gene Web site at SDSC: http://www.sdsc.edu/us/resources/bluegene
- Blue Gene Application Development Guide (IBM Redbooks): http://www.redbooks.ibm.com/abstracts/sg247179.html
- Exploiting the Dual Floating Point Units in Blue Gene/L (whitepaper): http://www-1.ibm.com/support/docview.wss?uid=swg27007511&aid=1
- Using the XL compilers for Blue Gene: http://www-1.ibm.com/support/docview.wss?uid=swg27007895&aid=1
