Hwu Wen mei Faculty Summit 071707

Presentation Transcript

An IMplicitly PArallel Compiler Technology Based on Phoenix: 

For thousand-core microprocessors
Wen-mei Hwu with Ryoo, Ueng, Rodrigues, Lathara, Kelm, Gelado, Stone, Yi, Kidd, Barghsorkhi, Mahesri, Tsao, Stratton, Navarro, Lumetta, Frank, Patel
University of Illinois, Urbana-Champaign

Background: 

Academic compiler research infrastructure is a tough business
  IMPACT, Trimaran, and ORC for VLIW and Itanium processors
  Polaris and SUIF for multiprocessors
  LLVM for portability and safety
In 2001, the IMPACT team moved into many-core compilation with MARCO FCRC funding
  A new implicitly parallel programming model that balances the burden on programmers and the compiler in parallel programming
Infrastructure work has slowed down ground-breaking work
Timely visit by the Phoenix team in January 2007
  Rapid progress has since been taking place
Future IMPACT research will be built on Phoenix

The Next Software Challenge: 

Today, multi-core chips make more effective use of area and power than large ILP CPUs
Scaling from 4-core to 1000-core chips could happen in the next 15 years
All semiconductor market domains are converging to concurrent system platforms
  PCs, game consoles, mobile handsets, servers, supercomputers, networking, etc.
Big picture: we need to make these systems effectively execute valuable, demanding apps.

The Compiler Challenge: 

To meet this challenge, the compiler must
  Allow simple, effective control by programmers
  Discover and verify parallelism
  Eliminate tedious efforts in performance tuning
  Reduce testing and support cost of parallel programs
"Compilers and tools must extend the human's ability to manage parallelism by doing the heavy lifting."

An Initial Experimental Platform: 

A quiet revolution and potential build-up
  Calculation: 450 GFLOPS vs. 32 GFLOPS
  Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  Until last year, programmed through graphics API
GPU in every PC and workstation: massive volume and potential impact

GeForce 8800: 

16 highly threaded SMs, >128 FPUs, 450 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
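
The peak numbers on this slide imply how much arithmetic a kernel must do per byte fetched from DRAM before memory bandwidth stops being the limit. A minimal sketch of that arithmetic follows, using only the 450 GFLOPS and 86.4 GB/s figures from the slide. The function names and the 4-byte single-precision operand size are assumptions made for illustration; they do not come from the talk.

```cpp
#include <cassert>

// Ratio of peak compute to peak memory bandwidth, in FLOP per byte.
// A kernel that performs fewer FLOPs per byte fetched than this value
// cannot reach peak compute; it is limited by memory bandwidth.
double flops_per_byte(double peak_gflops, double bandwidth_gb_per_s) {
    assert(bandwidth_gb_per_s > 0.0);
    return peak_gflops / bandwidth_gb_per_s;
}

// The same ratio per 4-byte single-precision operand (assumed operand size).
double flops_per_float(double peak_gflops, double bandwidth_gb_per_s) {
    return 4.0 * flops_per_byte(peak_gflops, bandwidth_gb_per_s);
}
```

With the GeForce 8800 figures, `flops_per_byte(450.0, 86.4)` is about 5.2 FLOP per byte, or roughly 21 operations per float loaded. The comparison figures quoted earlier (32 GFLOPS, 8.4 GB/s) give about 3.8 FLOP per byte.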

Some Hand-code Results: 

[HKR HotChips-2007]

Computing Q: Performance: 

GPU (V8): 96 GFLOPS
446x
CPU (V6): 230 MFLOPS

Lessons Learned: 

Parallelism extraction requires global understanding
  Most programmers only understand parts of an application
Algorithms need to be re-designed
  Programmers benefit from a clear view of the algorithmic effect on parallelism
Real but rare dependencies often need to be ignored
  Error checking code, etc.; parallel code is often not equivalent to sequential code
Getting more than a small speedup over sequential code is very tricky
  ~20 versions typically experimented with for each application to move away from architecture bottlenecks
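
The point about rare dependencies and error checking can be illustrated with a small sketch. The example is hypothetical and not taken from the talk: the function names and the choice of a square-root loop are assumptions. In the first version the early exit ties every iteration to all earlier ones. In the second the iterations are independent, but the result after an error differs from the sequential result, which is the non-equivalence the slide mentions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sequential form: the early return is a real cross-iteration dependence,
// because element i may only be written if no earlier input was invalid.
// The dependence is real but exercised only on the rare error path.
bool sqrt_all_sequential(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] < 0.0) return false;   // rare error path
        out[i] = std::sqrt(in[i]);
    }
    return true;
}

// Restructured form: each iteration touches only in[i] and out[i], and the
// error flag is a simple reduction, so iterations could run in any order.
// On error it still writes elements after the bad one, unlike the version above.
bool sqrt_all_independent(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    bool ok = true;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] < 0.0) { ok = false; continue; }
        out[i] = std::sqrt(in[i]);
    }
    return ok;
}
```

For valid input both versions produce the same output. For input containing a negative value both report failure, but only the second has written the elements that follow the bad one.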

Implicitly Parallel Programming Flow: 

[Flow diagram; labels in slide order]
Stylized C/C++ or DSL w/ assertions
Concurrency discovery
Visualizable concurrent form
Human
Code-gen space exploration
Visualizable sequential assembly code with parallel annotations
Parallel HW w/ sequential state gen
Deep analysis w/ feedback assistance
Systematic search for best/correct code gen
Debugger parallel execution w/ sequential semantics
For increased composability
For increased scalability
For increased supportability

Key Ideas: 

Deep program analyses that extend programmer and DSE knowledge for parallelism discovery
  Key to reduced programmer parallelization efforts
Exclusion of infrequent but real dependences using HW STU (Speculative Threading with Undo) support
  Key to successful parallelization of many real applications
Rich program information maintained in IR for access by tools and HW
  Key to integrating multiple programming models and tools
Intuitive, visual presentation to programmers
  Key to good programmer understanding of algorithm effects
Managed parallel execution arrangement search space
  Key to reduced programmer performance tuning efforts
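
The slide describes STU as hardware support and gives no mechanism. The sketch below is only a software analogy of the speculate-then-undo idea, written under stated assumptions: the loop, the "rescale" command encoded as a negative input, and all names (`Result`, `run_sequential`, `run_speculative`) are hypothetical. It speculates that the rare dependence never occurs, and if it does, discards the speculative writes and falls back to the sequential semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Result {
    std::vector<int> out;
    int scale;
    bool rolled_back;
};

// Reference sequential semantics: a negative input is a rare "rescale"
// command that changes state seen by all later iterations.
void run_sequential(const std::vector<int>& in, std::vector<int>& out, int& scale) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] < 0) { scale = -in[i]; out[i] = 0; }
        else           { out[i] = in[i] * scale; }
    }
}

// Speculative version: assume scale never changes, so iterations are
// independent. If the assumption is violated, undo and re-execute in order.
Result run_speculative(const std::vector<int>& in, int scale) {
    Result r{std::vector<int>(in.size(), 0), scale, false};
    const std::vector<int> checkpoint = r.out;   // whole-state copy used as undo log
    bool violated = false;
    for (std::size_t i = 0; i < in.size(); ++i) {   // order-independent under speculation
        if (in[i] < 0) violated = true;             // the rare dependence occurred
        else           r.out[i] = in[i] * scale;
    }
    if (violated) {
        r.out = checkpoint;                         // undo speculative writes
        r.scale = scale;
        run_sequential(in, r.out, r.scale);         // recover sequential semantics
        r.rolled_back = true;
    }
    return r;
}
```

In the common case the speculative loop commits without rollback. When a rescale command appears, the result matches the sequential run at the cost of re-execution.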

Parallelism in Algorithms (H.263 motion estimation example): 

Parallelism in Algorithms (H.263 motion estimation example)
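
The slide's figure is not preserved in this transcript. As a rough illustration of where the parallelism in motion estimation comes from, here is a minimal sketch of textbook full-search block matching with a sum-of-absolute-differences (SAD) cost. It is not the IMPACT team's code; the `Frame` and `MotionVector` types, the function names, and the parameters `B` (block size) and `R` (search range) are assumptions. Every candidate displacement is scored independently of the others, and so is every block in the frame.

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>
#include <vector>

struct Frame {
    int width, height;
    std::vector<unsigned char> pix;   // row-major luma samples
    int at(int x, int y) const { return pix[y * width + x]; }
};

struct MotionVector { int dx, dy; long sad; };

// SAD between the BxB block at (bx, by) in cur and the block displaced
// by (dx, dy) in ref.
long block_sad(const Frame& cur, const Frame& ref,
               int bx, int by, int dx, int dy, int B) {
    long sad = 0;
    for (int y = 0; y < B; ++y)
        for (int x = 0; x < B; ++x)
            sad += std::abs(cur.at(bx + x, by + y) - ref.at(bx + dx + x, by + dy + y));
    return sad;
}

// Full search over a +/-R window. Each (dx, dy) candidate is independent;
// only the final minimum is a reduction across candidates.
MotionVector full_search(const Frame& cur, const Frame& ref,
                         int bx, int by, int B, int R) {
    assert(cur.width == ref.width && cur.height == ref.height);
    MotionVector best{0, 0, std::numeric_limits<long>::max()};
    for (int dy = -R; dy <= R; ++dy) {
        for (int dx = -R; dx <= R; ++dx) {
            // Skip candidates that fall outside the reference frame.
            if (bx + dx < 0 || by + dy < 0 ||
                bx + dx + B > ref.width || by + dy + B > ref.height) continue;
            long sad = block_sad(cur, ref, bx, by, dx, dy, B);
            if (sad < best.sad) best = MotionVector{dx, dy, sad};
        }
    }
    return best;
}
```

A parallel version would distribute the candidate loop, the per-block calls, or both across threads, keeping only the minimum selection as a reduction.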

MPEG-4 H.263 Encoder Parallelism Rediscovery: 

[Figure: panels (a) through (e)]

Code Gen Space Exploration: 

Code Gen Space Exploration

Moving an Accurate Interprocedural Analysis into Phoenix: 

[Figure panels: Unification Based, Fulcra]
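
The figure comparing the two analyses is not preserved here. For readers unfamiliar with the "unification based" label, the following is a minimal sketch of a generic Steensgaard-style points-to analysis built on union-find. It is not Fulcra and not a Phoenix API: the class name, the integer variable ids, and the two supported statement forms (`p = &x` and `p = q`) are assumptions for illustration. It shows the kind of imprecision that motivates a more accurate analysis, because merging points-to classes can make unrelated pointers appear to alias.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

class UnificationPointsTo {
public:
    explicit UnificationPointsTo(int num_vars)
        : parent_(num_vars), pointee_(num_vars, -1) {
        std::iota(parent_.begin(), parent_.end(), 0);
    }

    // Models "p = &x": the class p points to is merged with x's class.
    void address_of(int p, int x) { unify(pointee_of(p), x); }

    // Models "p = q": the classes p and q point to are merged.
    void copy(int p, int q) {
        int a = pointee_of(p);
        int b = pointee_of(q);
        unify(a, b);
    }

    // True if p and q may point into the same merged class.
    bool may_alias(int p, int q) {
        int a = pointee_[find(p)], b = pointee_[find(q)];
        return a != -1 && b != -1 && find(a) == find(b);
    }

private:
    int find(int v) {
        while (parent_[v] != v) {
            parent_[v] = parent_[parent_[v]];   // path halving
            v = parent_[v];
        }
        return v;
    }

    // Class that p's class points to; creates a fresh class if none exists.
    int pointee_of(int p) {
        int r = find(p);
        if (pointee_[r] == -1) {
            int fresh = static_cast<int>(parent_.size());
            parent_.push_back(fresh);
            pointee_.push_back(-1);
            pointee_[r] = fresh;
        }
        return find(pointee_[r]);
    }

    // Merge two classes, then recursively merge what they point to.
    void unify(int a, int b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        int pa = pointee_[a], pb = pointee_[b];
        parent_[a] = b;
        if (pb == -1) pointee_[b] = pa;
        else if (pa != -1) unify(pa, pb);
    }

    std::vector<int> parent_;    // union-find forest over classes
    std::vector<int> pointee_;   // per class: class it points to, or -1
};
```

With ids p=0, q=1, x=2, y=3, r=4: after `p = &x`, `q = &y`, `r = p`, the sketch reports that p and q do not alias. Adding `r = q` merges the classes of x and y, after which p and q are reported as possible aliases even though neither was ever assigned to the other.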

Getting Started with Phoenix: 

Meetings with Phoenix team in January 2007
Determined the set of Phoenix API routines necessary to support IMPACT analyses and transformations
Received custom build of Phoenix that supports full type information

Fulcra to Phoenix – Action!: 

Fulcra to Phoenix – Action!

Phoenix Support Wish List: 

April 16, 2007
Access to code across file boundaries
  LTCG
Access to multiple files within a pass
Full (source code level) type information
Feed results from Fulcra back to Phoenix
  Need more information on Phoenix alias representation
In the long run, we need highly extendable IR and API for Phoenix

Conclusion: 

Compiler research for many-cores will require a very high quality infrastructure with strong engineering support
  New language extensions, new user models, new functionalities, new analyses, new transformations
We chose Phoenix based on its robustness, features and engineering support
  Our current industry partners are also moving into Phoenix
  We also plan to share our advanced extensions with the other academic Phoenix users