Presentation Transcript

Slide1: 

AMD Opteron Architecture and Software Infrastructure
Tim Wilkens, Ph.D.
Member of Technical Staff
tim.wilkens@amd.com

Slide2: 

Agenda
- Architecture: Opteron, Itanium, Xeon EMT
- ACML: AMD's superior equivalent of MKL
- Compilers: PGI performance enhancements
- Performance: delivering on the promise
- Summary and Closing Points

Architecture Agenda Opteron = Execution + Memory Access + IO : 

Customer Centric 64-bit Computing
  - Instruction Decoding
  - Processors with Artificial Intelligence
Not all x86 processors are created equal
  - RISC cores: scrupulous instruction preference
Scalable Memory Bandwidth and IO
  - physical memory scales with the number of CPUs
  - memory bandwidth scales with the number of CPUs
  - increased single-threaded memory bandwidth
  - memory latency does not scale with the number of CPUs
  - dramatically lower memory latency

Customer Centric 64-bit Computing Opteron vs Itanium : 

Progressive 64-bit approach: 32-bit instruction + prefix byte
  - leverages x86 compiler technology: reliable compilers, code ports easily
  - code size increase is minimal (~5%): large caches not required
x86 CPUs = RISC cores + CISC-to-RISC instruction decoders
  - gives x86 processors high clock frequency and legacy compatibility
  - the processor, not the compiler, manages the RISC core: recompile rarely
  - Itanium is a slave to the compiler: recompile often
Out-of-order execution and register renaming
  - Opteron manages its registers intelligently: less compiler reliant
  - Itanium requires the compiler to think for it: strong compiler reliance
Both Opteron and Itanium are RISC, but Opteron doesn't require reinventing compilers, large caches and a mint to purchase
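The "32-bit instruction + prefix byte" point can be made concrete with a register-to-register MOV. The sketch below is a minimal illustration, assuming only the 0x89 `MOV r/m, r` opcode, the REX.W prefix byte (0x48) and the eight legacy registers; the helper name `encode_mov_rr` is hypothetical and not part of any AMD tool.

```python
# Minimal sketch of "32-bit instruction + prefix byte".
# Covers only MOV between the eight legacy general-purpose registers.
# encode_mov_rr is a hypothetical helper name used for illustration.

REG_INDEX = {"ax": 0, "cx": 1, "dx": 2, "bx": 3,
             "sp": 4, "bp": 5, "si": 6, "di": 7}

REX_W = 0x48       # REX prefix with W=1: promote operand size to 64 bits
OPCODE_MOV = 0x89  # MOV r/m, r


def encode_mov_rr(dst: str, src: str, wide: bool) -> bytes:
    """Encode `mov dst, src` (wide=False: eax/ebx style, wide=True: rax/rbx style)."""
    # mod=11 selects register-direct addressing; reg field = source, r/m = destination
    modrm = 0xC0 | (REG_INDEX[src] << 3) | REG_INDEX[dst]
    body = bytes([OPCODE_MOV, modrm])
    return bytes([REX_W]) + body if wide else body


mov32 = encode_mov_rr("ax", "bx", wide=False)  # mov eax, ebx
mov64 = encode_mov_rr("ax", "bx", wide=True)   # mov rax, rbx
print(mov32.hex(), mov64.hex())
```

In this example the 64-bit form is the 32-bit encoding with one byte in front. Only instructions that need 64-bit operands or the new registers carry that byte, which is consistent with the small average code growth quoted on the slide.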

All x86 RISC Cores Aren't Created Equal Opteron vs Xeon EMT : 

Diagram: Opteron INT and FP execution units vs Xeon EMT FP execution units

                      Opteron                  Xeon EMT
Register width        80 bits                  128 bits
FP issue bandwidth    80-bit x 3 = 240 bits    128-bit x 1 = 128 bits (constriction limits performance)
Pipeline stages       12                       31

80-bit x 2 = 160 bits

Number of integer pipes and pipeline depth impact integer throughput
  - Opteron has 3 integer pipes: +50% reg,reg move throughput
  - Opteron has 3 ALU/AGU units: +50% +, -, logical, shift throughput
  - number of pipeline stages differs: shorter instruction execution latency
Different register file sizes (Opteron 80-bit, Xeon 128-bit)
  - size dictates the number of RISC ops an x86 instruction decodes into
  - instruction selection preference is different for Opteron and Xeon64
Design of FPU and issue bandwidth from FP scheduler
  - Opteron has 240 bits per clock SIMD throughput, Xeon has 128 bits
  - coupled with register file size, Opteron is a more robust engine
Though Xeon64 and Opteron are instruction compatible, Opteron doesn't require extensive compiler tuning to perform well
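The bandwidth figures on this slide are simple products of datapath width and pipe count. The sketch below reproduces that arithmetic. The widths and FP pipe counts come from the slide; the "2 integer pipes" baseline used for the +50% figure is an assumption inferred from the ratio and is not stated in the source.

```python
# Minimal sketch of the slide's bandwidth arithmetic.
# Function names are illustrative only.

def issue_bandwidth_bits(width_bits: int, pipes: int) -> int:
    """Bits issued per clock = datapath width x number of pipes."""
    return width_bits * pipes


def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old`."""
    return (new - old) / old


opteron_fp = issue_bandwidth_bits(80, 3)   # 80-bit x 3
xeon_fp = issue_bandwidth_bits(128, 1)     # 128-bit x 1
fp_ratio = opteron_fp / xeon_fp

# ASSUMPTION: the "+50%" integer claim compares 3 pipes against 2.
int_gain = relative_gain(3, 2)

print(opteron_fp, xeon_fp, fp_ratio, int_gain)
```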

AMD Opteron™, Pentium® 4 (FPU analysis) Throughput of SSE, SSE2, x87 Operations: 

AMD Opteron™, Pentium® 4 (ALU Analysis) Throughput and Latency Comparison: 

Scalable Memory Bandwidth and IO Opteron's on-die IO controller: 

4 separate IO channels per CPU: scalable SMP bandwidth
HyperTransport™ interconnect: low SMP memory latency
Commodity/high-performance SMP solution
  - presently dual-core ready: SRQ controller has a port for a 2nd core
  - fewer chips required for MP chipsets, lowering the cost of SMP systems

Slide9: 

ACML 2.1 Agenda
- Features
- BLAS, LAPACK, FFT Performance
- OpenMP Performance
- ACML 2.5 Snapshot: soon to be released

Slide10: 

Acknowledgments
Work carried out in collaboration with the Numerical Algorithms Group (NAG). Below is a list of contributors for each facet of ACML.
- NAG Project Manager: Mick Pont
- AMD Compiler/OS Support: Chip Freitag
- BLAS/LAPACK Arch/Optimization: Ed Smyth, Tim Wilkens
- FFT Arch/Optimization: Lawrence Mulholland, Tim Wilkens, Themos Tsikas

Components of ACML BLAS, LAPACK, FFTs : 

Linear Algebra (LA)
  - Basic Linear Algebra Subroutines (BLAS)
      Level 1 (vector-vector operations)
      Level 2 (matrix-vector operations)
      Level 3 (matrix-matrix operations)
      Routines involving sparse vectors
  - Linear Algebra PACKage (LAPACK)
      leverages BLAS to perform complex operations
      28 threaded LAPACK routines
Fast Fourier Transforms (FFTs)
  - 1D, 2D, single, double, r-r, r-c, c-r, c-c support
C and Fortran interfaces
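The three BLAS levels differ in the shape of their operands. The sketch below shows one reference operation per level in plain Python, only to make the vector-vector, matrix-vector and matrix-matrix distinction concrete. The `ref_*` names are hypothetical. They are not ACML's interfaces, which operate in place and take dimension and stride arguments, and nothing here reflects how ACML is optimized.

```python
# Reference semantics for one routine per BLAS level (pure Python, no tuning).
# ref_axpy / ref_gemv / ref_gemm are illustrative names, not library calls.

def ref_axpy(alpha, x, y):
    """Level 1 (vector-vector): alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def ref_gemv(alpha, a, x, beta, y):
    """Level 2 (matrix-vector): alpha*A*x + beta*y, A given as a list of rows."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]


def ref_gemm(alpha, a, b, beta, c):
    """Level 3 (matrix-matrix): alpha*A*B + beta*C."""
    b_cols = list(zip(*b))  # columns of B
    return [[alpha * sum(aik * bkj for aik, bkj in zip(row, col)) + beta * cij
             for col, cij in zip(b_cols, c_row)]
            for row, c_row in zip(a, c)]


A = [[1.0, 2.0], [3.0, 4.0]]
Z = [[0.0, 0.0], [0.0, 0.0]]
print(ref_axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
print(ref_gemv(1.0, A, [1.0, 1.0], 0.0, [0.0, 0.0]))
print(ref_gemm(1.0, A, A, 0.0, Z))
```

For n-by-n operands the three levels do on the order of n, n^2 and n^3 floating point operations respectively, which is why Level 3 routines such as DGEMM are the usual benchmark target.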

Driving Enterprise Business Market Segment Use of HPC Math Libraries: 


Connecting HPC and you How HPC impacts our daily lives : 

Gaming: real-world realism
  - water surfaces, physics gaming engines
Rendered movies
  - modeling real clothing surfaces (PDEs)
Medical procedures
  - CAT scan imaging, cancer radiation therapy
Airline flight schedules
  - minimizing equations of constraint (fuel, food, time, etc.)
National security
  - voice analysis and authentication, weapons simulation

Slide14: 

AMD Core Math Library (ACML)
Assembly Optimizations and Accuracy
- 3 ACML 32-bit binaries support AMD processors with/without SSE and SSE2
- 1 ACML 64-bit binary supports AMD Athlon 64 and AMD Opteron

64-bit BLAS Performance DGEMM (Double Precision General Matrix Multiply): 


64-bit BLAS Performance DSYMM (Double Precision Symmetric Matrix Multiply): 


64-bit LAPACK Performance DGETRF (Double Precision LU Factorization): 


64-bit LAPACK Performance DGETRS (Double Precision LU Solve): 


64-bit LAPACK Performance DPOTRF (Double Precision Cholesky Factorization): 


64-bit LAPACK Performance DGEQRF (Double Precision QR Factorization): 


64-bit FFT Performance (non-power of 2) MKL vs ACML on 2.2 GHz Opteron: 

64-bit FFT Performance (non-power of 2) 2.2 GHz Opteron vs 3.2 GHz Xeon EMT: 

Multithreaded LAPACK Performance Double Precision (LU, Cholesky, QR Factorize/Solve): 

Slide24: 

Conclusion and Closing Points
How good is our performance?
  - averaged over 70 BLAS/LAPACK/FFT routines
  - computation-weighted average
  - all measurements performed on a 4P AMD Opteron™ 844 Quartet server
ACML 32-bit is 55% faster than MKL 6.1
ACML 64-bit is 80% faster than MKL 6.1
November 14, 2007, Computation Products Group
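The slide does not define "computation weighted average". One plausible reading, sketched below as an assumption, is that each routine's speedup is weighted by the amount of computation it performs, so heavy routines count for more than light ones. The input numbers are hypothetical placeholders chosen for easy arithmetic. They are not measurements from this presentation.

```python
# Sketch of a computation-weighted average (one possible interpretation).
# All input values below are HYPOTHETICAL and purely illustrative.

def computation_weighted_average(speedups, work):
    """Average of `speedups`, each weighted by its routine's work (e.g. flop count)."""
    total_work = sum(work)
    return sum(s * w for s, w in zip(speedups, work)) / total_work


hypothetical_speedups = [2.0, 1.0]  # per-routine speedup ratios (made up)
hypothetical_work = [3.0, 1.0]      # relative computation per routine (made up)

weighted = computation_weighted_average(hypothetical_speedups, hypothetical_work)
plain = sum(hypothetical_speedups) / len(hypothetical_speedups)
print(weighted, plain)
```

With these placeholder inputs the weighted figure is pulled toward the routine that does more work, which is the point of weighting by computation.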

64-ACML 2.5 Snapshot Small Dgemm Enhancements: 


Slide26: 

Compiler Ecosystem
PGI, Pathscale, GNU, Absoft, Intel, Microsoft and Sun

Compiler Comparisons Table Critical Features Supported by x86 Compilers : 


Tuning Performance with Compilers Maintaining Stability while Optimizing: 

STEP 0: Build the application using the following procedure:
  - compile all files with the most aggressive optimization flags: -tp k8-64 -fastsse
  - if compilation fails or the application doesn't run properly, turn off vectorization: -tp k8-64 -fast -Mscalarsse
  - if problems persist, drop to the least aggressive optimization level: -tp k8-64 -O0 (or -O1)
STEP 1: Profile the binary and determine the performance-critical routines
STEP 2: Repeat STEP 0 on the performance-critical functions, one at a time, and run the binary after each step to check stability
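STEP 0 is a fallback ladder: try the most aggressive flag set first and step down until the build is stable. The sketch below captures that control flow only. The flag sets are the ones listed on the slide; `pick_flags` and the `builds_and_runs` callback are hypothetical names, and how a build is actually compiled and validated is left to the caller.

```python
# Sketch of the STEP 0 fallback ladder. The callback that compiles and
# validates a build is supplied by the caller (hypothetical interface).

FLAG_LADDER = [
    ["-tp", "k8-64", "-fastsse"],              # most aggressive
    ["-tp", "k8-64", "-fast", "-Mscalarsse"],  # vectorization turned off
    ["-tp", "k8-64", "-O0"],                   # least aggressive
]


def pick_flags(builds_and_runs, ladder=FLAG_LADDER):
    """Return the first flag set for which builds_and_runs(flags) is true, else None."""
    for flags in ladder:
        if builds_and_runs(flags):
            return flags
    return None


# Example: pretend the vectorized build misbehaves, so the ladder steps down once.
chosen = pick_flags(lambda flags: "-fastsse" not in flags)
print(" ".join(chosen))
```

For STEP 2 the same function could be applied per source file, re-running the application's own correctness check after each change.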

Tuning Memory IO Bandwidth Optimizing large streaming operations: 

2 methods of writing to memory in x86/x86-64:
Traditional memory stores cause write allocates to cache
  mov %rax,(%rdi)    movsd %xmm0,(%rdi)    movapd %xmm0,(%rdi)
  - the cache line to be modified is read into cache
  - the cache is modified, then written to memory when a new line is loaded
  - to write N bytes, 2N bytes of bandwidth are generated
Non-temporal stores bypass the cache and write directly to memory
  - no write allocate to cache; to write N bytes, N bytes of bandwidth are generated
  - data is not backed up into cache; do not use with often-reused data
Use only in functions that write L2/2 bytes of data or more (half the L2 cache size), which normally assures little cache-reuse value
Group all eligible routines into a common file so as to simplify the compilation procedure
Enable non-temporal stores in the PGI compiler with the -Mnontemporal compiler option
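The 2N versus N claim and the L2/2 rule of thumb amount to a small piece of arithmetic, sketched below. The function names are illustrative, and the 1 MiB L2 size is an assumption used for the example; substitute the cache size of the actual processor.

```python
# Sketch of the store-traffic arithmetic from the slide.
# ASSUMPTION: a 1 MiB L2 cache is used as the example size.

L2_BYTES_ASSUMED = 1024 * 1024


def store_traffic_bytes(n_bytes: int, non_temporal: bool) -> int:
    """Memory traffic generated by writing n_bytes.

    A write-allocating store first reads the line into cache and later
    writes it back (2N); a non-temporal store only writes (N).
    """
    return n_bytes if non_temporal else 2 * n_bytes


def worth_non_temporal(bytes_written: int, l2_bytes: int = L2_BYTES_ASSUMED) -> bool:
    """Slide's rule of thumb: use non-temporal stores when writing >= L2/2 bytes."""
    return bytes_written >= l2_bytes // 2


n = 8 * 1024 * 1024  # an 8 MiB streaming write
print(store_traffic_bytes(n, non_temporal=False),
      store_traffic_bytes(n, non_temporal=True),
      worth_non_temporal(n))
```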

PGI Compiler Flags Optimization Flags: 

Below are 3 different sets of recommended PGI compiler flags for flag mining application source bases:
Most aggressive: -tp k8-64 -fastsse -Mipa=fast
  - enables instruction-level tuning for Opteron, O2-level optimizations, SSE scalar and vector code generation, inter-procedural analysis, LRE optimizations and unrolling
  - strongly recommended for any single precision source code
Middle of the ground: -tp k8-64 -fast -Mscalarsse
  - enables everything in the most aggressive set except vector code generation, which can reorder loops and generate slightly different results
  - in double precision source bases a good substitute, since Opteron has the same throughput on both scalar and vector code
Least aggressive: -tp k8-64 -O0 (or -O1)
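Vector code generation can change results because floating point addition is not associative: a vectorized reduction accumulates in separate lanes and combines them at the end, which is a different summation order from the scalar loop. The sketch below imitates a two-lane reduction in plain Python. It is an illustration of the effect, not a model of what the PGI compiler emits.

```python
# Illustration: the same data summed in scalar order and in "two-lane" order.

def sequential_sum(xs):
    """Scalar loop: one accumulator, elements added in order."""
    total = 0.0
    for x in xs:
        total += x
    return total


def two_lane_sum(xs):
    """Two accumulators (even and odd elements), combined at the end."""
    lanes = [0.0, 0.0]
    for i, x in enumerate(xs):
        lanes[i % 2] += x
    return lanes[0] + lanes[1]


data = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(data), two_lane_sum(data))
```

In the scalar order the first 1.0 is absorbed by rounding when added to 1e16; in the two-lane order the large values cancel exactly before the small ones are added. Neither order is wrong, but the results differ, which is why the scalar flag set is the safer choice when bit-for-bit reproducibility matters.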

PGI Compiler Flags Functionality Flags: 

-mcmodel=medium
  use if your application statically allocates a net sum of data structures greater than 2GB
-Mlarge_arrays
  use if any array in your application is greater than 2GB
-KPIC
  use when linking to shared object (dynamically linked) libraries
-mp
  process OpenMP/SGI directives/pragmas (build multi-threaded code)
-Mconcur
  attempt auto-parallelization of your code on SMP systems with OpenMP
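Whether the two size-related flags apply is a matter of adding up data sizes. The sketch below is a small helper for that check, assuming "2GB" means 2 GiB (2 x 1024^3 bytes); the function name `pgi_size_flags` is hypothetical and the helper only suggests flags, it does not inspect a real program.

```python
# Sketch: decide which size-related PGI flags the slide's rules call for.
# ASSUMPTION: "2GB" is taken to mean 2 GiB.

TWO_GB = 2 * 1024 ** 3


def pgi_size_flags(static_bytes_total: int, largest_array_bytes: int) -> list:
    """Return the flags suggested by the slide for the given data sizes."""
    flags = []
    if static_bytes_total > TWO_GB:
        flags.append("-mcmodel=medium")
    if largest_array_bytes > TWO_GB:
        flags.append("-Mlarge_arrays")
    return flags


# Example: one statically allocated 20000 x 20000 double precision matrix.
matrix_bytes = 20000 * 20000 * 8
print(matrix_bytes, pgi_size_flags(matrix_bytes, matrix_bytes))
```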

Absoft Compiler Flags Optimization Flags: 

Below are 3 different sets of recommended Absoft compiler flags for flag mining application source bases:
Most aggressive: -O3
  - loop transformations, instruction preference tuning, cache tiling, and SIMD code generation (CG)
  - generally provides the best performance but may cause compilation failure or slow performance in some cases
  - strongly recommended for any single precision source code
Middle of the ground: -O2
  - enables most options enabled by -O3, including SIMD CG, instruction preferences, common sub-expression elimination, and pipelining and unrolling
  - in double precision source bases a good substitute, since Opteron has the same throughput on both scalar and vector code
Least aggressive: -O1

Absoft Compiler Flags Functionality Flags: 

-mcmodel=medium
  use if your application statically allocates a net sum of data structures greater than 2GB
-g77
  enables full compatibility with g77-produced objects and libraries (must use this option to link to GNU ACML libraries)
-fpic
  use when linking to shared object (dynamically linked) libraries
-safefp
  performs certain floating point operations in a slower manner that avoids overflow and underflow and assures proper handling of NaNs

Pathscale Compiler Flags Optimization Flags: 

Most aggressive: -Ofast
  - equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno
Aggressive: -O3
  - optimizations for highest quality code, enabled at the cost of compile time
  - some of the generally beneficial optimizations included may hurt performance
Reasonable: -O2
  - extensive conservative optimizations
  - optimizations almost always beneficial
  - faster compile time
  - avoids changes which affect floating point accuracy

Pathscale Compiler Flags Functionality Flags: 

-mcmodel=medium
  use if static data structures are greater than 2GB
-ffortran-bounds-check
  (Fortran) check array bounds
-shared
  generate position independent code for calling shared object libraries
Feedback Directed Optimization
  STEP 0: Compile binary with -fb_create fbdata
  STEP 1: Run code, collect data
  STEP 2: Recompile binary with -fb_opt fbdata
-march=(opteron|athlon64|athlon64fx)
  optimize code for the selected platform (Opteron is the default)

Microsoft Compiler Flags Optimization Flags: 

Recommended flags: /O2 /Ob2 /GL /fp:fast
  - /O2 turns on several general optimizations
  - /Ob2 enables inline expansion
  - /GL enables inter-procedural optimizations
  - /fp:fast allows the compiler to use a fast floating point model
Feedback Directed Optimization
  STEP 0: Compile binary with /LTCG:PGI
  STEP 1: Run code, collect data
  STEP 2: Recompile binary with /LTCG:PGO
Turn off buffer overrun checking
  - the compiler turns on /GS by default to check for buffer overruns; turning off checking by specifying /GS- may result in additional performance

Microsoft Compiler Flags Functionality Flags: 

/GR
  enables run-time type information
/GT
  supports fiber safety for data allocated using static thread-local storage
/Wp64
  detects most 64-bit portability problems
/LD
  creates a dynamic-link library
/Oa
  assumes no aliasing
/Ow
  assumes aliasing across function calls but not inside functions

64-Bit Operating Systems Recommendations and Status: 

SUSE SLES 9 with latest Service Pack available
  - has technology for supporting the latest AMD processor features
  - widest breadth of NUMA support, enabled by default
  - OProfile system profiler installable as an RPM and modularized
  - complete support for static and dynamically linked 32-bit binaries
Red Hat Enterprise Server 3.0 Service Pack 2 or later
  - NUMA feature support not as complete as that of SUSE SLES 9
  - OProfile installable as an RPM, but installation is not modularized and may require a kernel rebuild if the RPM version isn't satisfactory
  - only SP 2 or later has complete 32-bit shared object library support (a requirement to run all 32-bit binaries in 64-bit)
  - POSIX threading library changed between 2.1 and 3.0; may require users to rebuild applications

Slide39: 

AMD64 ISV Performance Fluent, LS-DYNA, STAR-CD, Ansys

64-bit Fluent v6.1.25 Serial Performance Comparison: 


64-bit Fluent v6.1.25 Serial Performance Comparison: 


64-bit Fluent v6.1.25 2P Performance Comparison: 


64-bit Fluent v6.1.25 4P Performance Comparison: 


64-bit Fluent v6.1.25 8P Performance Comparison: 


64-bit LS-DYNA v5434 Neon Model Performance: 


64-bit LS-DYNA v5434 Neon Model Performance: 


64-bit LS-DYNA v5434 3-Car Model Performance: 


64-bit LS-DYNA v5434 3-Car Model Performance: 


Slide49: 

Trademark Attribution
AMD, the AMD Arrow Logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Other product names used in this presentation are for identification purposes only and may be trademarks of their respective companies.