PAPI

Presentation Transcript

PAPI 3.0.8.1 on Blue Gene/L: 

Using network performance counters to lay out tasks for improved performance

Presentation overview: 

- Project objectives
- PAPI explanation
- Blue Gene/L explanation
- Current state of research

Project objectives: 

- Upgrade PAPI on BG/L
- Provide an interface for the network counters
- Give Lawrence Livermore National Laboratory users access to PAPI as well
- Use network counters to place tasks optimally on BG/L

PAPI – Intro: 

Courtesy of http://icl.cs.utk.edu/papi/

PAPI – Intro: 

- PAPI is useful for profiling your own programs
- Many tools are based on PAPI:
  - PapiEx – Command-line measurement tool
  - PerfSuite – Aggregate measurement and statistical profiling package and API
  - HPCToolkit – Statistical profiling package
  - Many more

PAPI – Supported platforms: 

- IBM – POWER3, 604, 604e, POWER4
- Cray – T3E, X1
- AMD – Athlon, Opteron
- Intel – P1 to P4, Itanium I and II
- UltraSparc I, II & III
- MIPS R10K, R12K, R14K
- Alpha

PAPI – Generic Interface: 

Call sequence for the generic interface:
1. PAPI_library_init – Initialize memory for PAPI's data structures
2. PAPI_create_eventset – Create an empty list of events
3. PAPI_add_event – Add events to be counted
4. PAPI_start – Begin counting all events within the specified eventset
5. PAPI_stop – Stop all counters and read their current values

PAPI – Events: Presets: 

- Presets – a list of predefined events, implemented on every system that can support them
- Not all presets are available on every architecture (e.g. BG/L has no cache lower than L3, so the L1 cache hit preset is not applicable)
- Native events form the basic building blocks for PAPI presets

PAPI – Events: Presets: 

Courtesy of http://icl.cs.utk.edu/papi/

PAPI – Events: Native: 

- In addition to the predefined preset events, the PAPI library also exposes most of the events native to each platform
- Native events can be added to eventsets in the same manner as presets

PAPI – Internals: 

- The main internal data structure is an array of eventsets

PAPI – Other features: 

- Multiplexing – used when there are not enough hardware counters for all requested events
- Thread safety – profiling is thread safe
- Overflow detection – hardware counters have a limited range
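The idea behind multiplexing can be illustrated with a toy model: the events take turns on the available counters, and each observed count is scaled up by the fraction of time the event was actually being counted. This is only a sketch of the general technique, not PAPI's implementation; the function name, the event names and the constant per-slice rates are all assumptions made for illustration.

```python
def multiplex_estimate(rates, num_counters, num_slices):
    """Toy time-slice multiplexing.

    rates: {event_name: events occurring per time slice} (hypothetical, constant)
    Returns {event_name: estimated total count over all slices}.
    """
    events = list(rates)
    observed = {e: 0 for e in events}   # counts seen while the event held a counter
    active = {e: 0 for e in events}     # number of slices the event held a counter
    per_slice = min(num_counters, len(events))
    for s in range(num_slices):
        # Rotate which group of events owns the hardware counters in this slice.
        start = (s * num_counters) % len(events)
        for k in range(per_slice):
            e = events[(start + k) % len(events)]
            observed[e] += rates[e]
            active[e] += 1
    # Scale each observed count by total time / time actually counted.
    return {e: observed[e] * num_slices / active[e] for e in events if active[e]}


estimates = multiplex_estimate(
    {"cycles": 1000, "loads": 300, "stores": 100, "branches": 50},
    num_counters=2,
    num_slices=8,
)
```

With constant rates the scaled estimate matches the true total exactly; for a program whose behaviour changes between slices it is only an approximation.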

PAPI – PAPI2 vs PAPI3: 

- PAPI 3 significantly reduced the overhead of starting, stopping and reading the counters
- Courtesy of http://icl.cs.utk.edu/papi/

PAPI – PAPI2 vs PAPI3: 

- Better native event support in PAPI3
- Better thread support in PAPI3
- Overflow and profiling enhancements in PAPI3
- Myriad bug fixes and code cleanup in PAPI3

PAPI – PAPI2 vs PAPI3: 

- Overlapping eventsets were supported in PAPI2
- Minor changes in the API – mostly in how variables are dereferenced

Blue Gene/L – Intro: 

- 65,536 nodes connected in a 64 x 32 x 32 3D torus
- Nodes are built from PowerPC 440 embedded processors
- Smaller than most supercomputers
- Consumes less power
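To make the torus geometry concrete, here is a small sketch that maps a linear node index to 3D coordinates and computes the minimum hop count between two nodes, taking the wraparound links into account. The function names and the choice of letting x vary fastest are assumptions for illustration; the actual BG/L node numbering may differ.

```python
# Torus dimensions from the slide: 64 x 32 x 32 = 65,536 nodes.
DIMS = (64, 32, 32)


def rank_to_coords(rank, dims=DIMS):
    """Map a linear node index to (x, y, z), assuming x varies fastest."""
    x_dim, y_dim, z_dim = dims
    if not 0 <= rank < x_dim * y_dim * z_dim:
        raise ValueError("rank out of range")
    x = rank % x_dim
    y = (rank // x_dim) % y_dim
    z = rank // (x_dim * y_dim)
    return (x, y, z)


def torus_hops(a, b, dims=DIMS):
    """Minimum number of hops between coordinates a and b on a torus.

    In each dimension the shorter of the direct path and the
    wraparound path is taken.
    """
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))
```

Because of the wraparound links, the two "opposite corners" (0, 0, 0) and (63, 31, 31) are only three hops apart, while the farthest pair of nodes is 32 + 16 + 16 = 64 hops apart.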

Blue Gene/L – Networks: 

- 3D torus network (node to node)
- Tree network (broadcasts)

Blue Gene/L – HW counters: 

- 48 universal performance counters
- 4 floating point unit counters
- Counters are 32 bits wide – virtual counters must be used to prevent overflow
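One common way to build a wider "virtual" counter on top of a 32-bit hardware counter is to poll the raw value periodically and accumulate the difference modulo 2^32. The sketch below shows that general technique; the class and method names are hypothetical, and it is not taken from the BG/L PAPI substrate.

```python
MASK32 = 0xFFFFFFFF  # a 32-bit hardware counter wraps around after this value


class VirtualCounter:
    """Extends a wrapping 32-bit counter to an unbounded software total."""

    def __init__(self, initial_raw=0):
        self.last_raw = initial_raw & MASK32
        self.total = 0

    def update(self, raw):
        """Feed the latest raw hardware reading; returns the running total.

        Correct only if the hardware counter advances by fewer than 2**32
        events between two consecutive calls.
        """
        raw &= MASK32
        delta = (raw - self.last_raw) & MASK32  # modular difference handles the wrap
        self.total += delta
        self.last_raw = raw
        return self.total


vc = VirtualCounter(initial_raw=0xFFFFFFF0)
total_after_wrap = vc.update(0x00000010)  # the raw counter wrapped past zero
```

The polling interval has to be short enough that the hardware counter cannot wrap more than once between readings; otherwise whole multiples of 2^32 events are silently lost.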

Research – Overall goals: 

- The network hardware counters are new
- Use network counters to determine traffic between tasks
- Try to optimize the placement of tasks to minimize communication latency
- Given counts and distances: cost = counts * distance; minimize the total over all nodes
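The cost model above can be sketched as follows: for every pair of communicating tasks, multiply the message count by the torus hop distance between the nodes they are placed on, and sum. The exhaustive search over placements is included only to show what "minimize" means on a tiny example; it is an assumption for illustration (it scales factorially) and not the method proposed in the slides. All names and the sample traffic numbers are hypothetical.

```python
from itertools import permutations


def torus_hops(a, b, dims):
    """Minimum hop count between coordinates a and b on a torus."""
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))


def placement_cost(traffic, placement, dims):
    """Sum of count * distance over all communicating task pairs.

    traffic:   {(task_i, task_j): message count}
    placement: {task: (x, y, z) node coordinates}
    """
    return sum(
        count * torus_hops(placement[i], placement[j], dims)
        for (i, j), count in traffic.items()
    )


def best_placement(tasks, nodes, traffic, dims):
    """Brute-force search; feasible only for a handful of tasks."""
    best_cost, best_map = None, None
    for perm in permutations(nodes, len(tasks)):
        candidate = dict(zip(tasks, perm))
        cost = placement_cost(traffic, candidate, dims)
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, candidate
    return best_cost, best_map


# Tiny example: four tasks on a ring of four nodes (a 4 x 1 x 1 torus).
dims = (4, 1, 1)
nodes = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
traffic = {("A", "B"): 100, ("C", "D"): 100, ("A", "C"): 1}

naive = {"A": nodes[0], "B": nodes[2], "C": nodes[1], "D": nodes[3]}
naive_cost = placement_cost(traffic, naive, dims)
optimal_cost, optimal_map = best_placement(["A", "B", "C", "D"], nodes, traffic, dims)
```

In this example the naive layout puts both heavily communicating pairs two hops apart, while the best layout places every communicating pair on adjacent nodes.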

Research – Counting: 

- The first goal is to determine what is being counted

Research – Networks: 

- For each MPI call, determine which network counters are being used
- The tree is supposed to be for broadcasts
- The torus is supposed to be for point-to-point communication
- There are ambiguities in the specification
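One simple way to find out which counters an MPI call touches is to snapshot all network counters before and after the call and look at which ones changed. The sketch below shows that before/after comparison on plain dictionaries; the counter names and the convention that a name starts with its network ("torus_" or "tree_") are assumptions for illustration, not real BG/L event names.

```python
def counters_touched(before, after):
    """Return {counter_name: delta} for every counter whose value changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if after[name] != before[name]
    }


def networks_used(before, after):
    """Names of the networks (prefix before the first underscore) that saw traffic."""
    return sorted({name.split("_", 1)[0] for name in counters_touched(before, after)})


# Hypothetical snapshots taken around a single point-to-point send.
before = {"torus_xplus_packets": 10, "torus_xminus_packets": 7, "tree_packets": 5}
after = {"torus_xplus_packets": 14, "torus_xminus_packets": 7, "tree_packets": 5}

deltas = counters_touched(before, after)
used = networks_used(before, after)
```

Repeating this for each MPI call of interest would give a table of which network each call actually exercises, which could then be compared against what the specification says.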

Research – Future decisions: 

- How to profile a target application
  - Manually insert PAPI instrumentation: a lot of work
  - Instrument binaries with counting code
- What information to store
  - All counts on each node: a lot of data
  - Sample of all nodes: not as accurate (what if the tasks behave / communicate differently?)

Research – Future decisions: 

- How to use the collected information
  - Profile an application to obtain counter feedback and determine an optimized static task layout
  - Dynamically migrate tasks in response to counters