logging in or signing up puschel spiral Pasquale Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 187 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: January 16, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript SPIRAL: Current Status: SPIRAL: Current Status José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU) Manuela Veloso (CMU) Gavin Haentjens (CMU) Pinit Kumhom (Drexel) Neungsoo Park (USC) David Sepiashvili (CMU) Bryan Singer (CMU) Yevgen Voronenko (Drexel) Edward Wertz (CMU) Jianxin Xiong (UIUC) Faculty Students http://www.ece.cmu.edu/~spiral Markus Püschel Collaborators Christoph Überhuber (TU Vienna) Franz Franchetti (TU Vienna)Sponsor: Sponsor Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.Moore’s Law and High(est) Performance Scientific Computing: Moore’s Law and High(est) Performance Scientific ComputingSPIRAL: SPIRAL Automates Implementation Platform-Adaptation Optimization of DSP algorithmsSPIRAL system: SPIRAL systemSlide6: DSP Transform Algorithm DSP Algorithms: Example 4-point DFT: DSP Algorithms: Example 4-point DFT Cooley/Tukey FFT (size 4): product of structured sparse matrices mathematical notation Fourier transform Identity Permutation Diagonal matrix (twiddles) Kronecker productDSP Algorithms: Terminology: DSP Algorithms: Terminology Transform Rule Formula parameterized matrix a breakdown strategy product of sparse matrices recursive application of rules uniquely defines an algorithm efficient representation easy manipulation Ruletree few constructs and primitives uniquely defines an algorithm can be translated into codeDSP Transforms: DSP Transforms Others: filters, discrete wavelet transforms, Haar, Hartley, … discrete Fourier transform Walsh-Hadamard transform discrete cosine and sine Transforms (16 types) modified discrete cosine transform two-dimensional transformRules = Breakdown Strategies: Rules = Breakdown Strategies base case recursive translation iterative recursive recursive recursive recursive recursive translation iterative/ recursive Formula for a DCT, size 16: Formula for a DCT, size 16Slide12: Algorithm (Formula) Implementation DSP TransformFormulas in SPL: Formulas in SPL ( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) ) • • • • • • • •Slide14: SPL Syntax (Subset) matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...) direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...)) parameterized matrices: (I n) (F n) scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 definition of new symbols: (define name formula) (template formula (i-code-list) directives for code generation #codetype real/complex #unroll on/off allows extension of SPL controls loop unrollingSPL Compiler, 4-point FFT: SPL Compiler, 4-point FFT (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)) fast algorithm as formula as SPL program #codetype complex realSPL Compiler: Summary: SPL Compiler: Summary Parsing Intermediate Code Generation Intermediate Code Restructuring Target Code Generation Symbol Table Abstract Syntax Tree I-Code I-Code C, FORTRAN function Template Table SPL Formula Template Definition Symbol Definition Optimization I-Code SPL Program Built-in optimizations: single static assignment code no reuse of temporary vars only scalar temporary vars constants precomputed limited CSE Extensible through templatesSlide17: Algorithm (Formula) Implementation DSP Transform SearchWhy Search?: Why Search? DCT, type IV, size 16 maaaany different formulas large spread in runtimes, even for modest size not due to arithmetic cost best formula is platform-dependent ~31000 formulasSearch Methods available in SPIRAL: Search Methods available in SPIRAL Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (similar to a genetic algorithm) Search over algorithm space and implementation options (degree of unrolling)STEER: STEER Population n: Population n+1: …… …… Mutation Cross-Breeding expand differently swap expansions Survival of FittestExperimental Results (C code): Experimental Results (C code) high performance code (compared with FFTW) different transforms search methods (applicable to all transforms) generated high quality codeSPIRAL System: SPIRAL System Available for download (v3.1): www.ece.cmu.edu/~spiral Easy installation (Unix: configure/make; Windows: install shield) Unix/Linux and Windows 98/ME/NT/2000/XP Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV, MDCT, Filters, Wavelets, Toeplitz, Circulants Extensible New version (4.0) in preparationSlide23: Recent WorkLearning to Generate Fast Algorithms: Learning to Generate Fast Algorithms Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy) Learns from a transform of one size, generates the best algorithm for many sizes Tested for DFT and WHTSIMD Short Vector Extensions: SIMD Short Vector Extensions + x vector length = 4 (4-way) Extension to instruction set architecture Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) Originally for multimedia (like MMX for integers) Requires fine grain parallelism Large potential speed-up SIMD instructions are architecture specific No common API (usually assembly hand coding) Performance very sensitive to memory access Automatic vectorization very limited Problems: very difficult to use Vector code generation from SPL formulas: Vector code generation from SPL formulasGenerated Vector Code DFTs: Pentium 4: Generated Vector Code DFTs: Pentium 4 gflops DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 n speedups (to C code) up to factor of 3.3 beats hand-tuned vendor library Generated Vector Code, Other Transforms: Generated Vector Code, Other Transforms normalized runtime normalized runtime transform size transform size 2-dim DCT WHT speedups up to factor of 2.5 Slide29: Flexible Loop Interleaving (Runtime Gain WHT) Alpha 21264: up to 70% UltraSparc III: up to 60% Athlon XP: up to 55% Pentium 4: up to 45%Parallel Code Generation: Example WHT: Parallel Code Generation: Example WHT WHT size log(N) 1 thread 8 threads 10 threads PowerPC RS64 III Parallelized constructs: In A, A In Code Scheduling: Code Scheduling Runtime histograms DFT, size 16 ~6500 formulas DCT4, size 16 ~16000 formulas unscheduled scheduledFilters and Wavelets: Filters and Wavelets New constructs: row/column overlapped tensor product Examples for rules:Conclusions: Conclusions Automatic code generation for the entire domain of (linear) DSP algorithms Portable high performance across platforms and across time Integration of math (high) level and implementation (low) level Intelligence through search and learning in the space of alternatives Future Plans: Future Plans Transforms: Radon, Gabor, Hankel, structured matrices, … Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs Instructions: Vector, FMAs, prefetching, OpenMP, MPI Beyond transforms: entire DSP applications Other domains amenable to SPIRAL approach?Slide35: http://www.ece.cmu.edu/~spiral Questions? You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
puschel spiral Pasquale Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 187 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: January 16, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript SPIRAL: Current Status: SPIRAL: Current Status José Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) David Padua (UIUC) Viktor Prasanna (USC) Markus Püschel (CMU) Manuela Veloso (CMU) Gavin Haentjens (CMU) Pinit Kumhom (Drexel) Neungsoo Park (USC) David Sepiashvili (CMU) Bryan Singer (CMU) Yevgen Voronenko (Drexel) Edward Wertz (CMU) Jianxin Xiong (UIUC) Faculty Students http://www.ece.cmu.edu/~spiral Markus Püschel Collaborators Christoph Überhuber (TU Vienna) Franz Franchetti (TU Vienna)Sponsor: Sponsor Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.Moore’s Law and High(est) Performance Scientific Computing: Moore’s Law and High(est) Performance Scientific ComputingSPIRAL: SPIRAL Automates Implementation Platform-Adaptation Optimization of DSP algorithmsSPIRAL system: SPIRAL systemSlide6: DSP Transform Algorithm DSP Algorithms: Example 4-point DFT: DSP Algorithms: Example 4-point DFT Cooley/Tukey FFT (size 4): product of structured sparse matrices mathematical notation Fourier transform Identity Permutation Diagonal matrix (twiddles) Kronecker productDSP Algorithms: Terminology: DSP Algorithms: Terminology Transform Rule Formula parameterized matrix a breakdown strategy product of sparse matrices recursive application of rules uniquely defines an algorithm efficient representation easy manipulation Ruletree few constructs and primitives uniquely defines an algorithm can be translated into codeDSP Transforms: DSP Transforms Others: filters, discrete wavelet transforms, Haar, Hartley, … discrete Fourier transform Walsh-Hadamard transform discrete cosine and sine Transforms (16 types) modified discrete cosine transform two-dimensional transformRules = Breakdown Strategies: Rules = Breakdown Strategies base case recursive translation iterative recursive recursive recursive recursive recursive translation iterative/ recursive Formula for a DCT, size 16: Formula for a DCT, size 16Slide12: Algorithm (Formula) Implementation DSP TransformFormulas in SPL: Formulas in SPL ( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) ) • • • • • • • •Slide14: SPL Syntax (Subset) matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...) direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...)) parameterized matrices: (I n) (F n) scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 definition of new symbols: (define name formula) (template formula (i-code-list) directives for code generation #codetype real/complex #unroll on/off allows extension of SPL controls loop unrollingSPL Compiler, 4-point FFT: SPL Compiler, 4-point FFT (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)) fast algorithm as formula as SPL program #codetype complex realSPL Compiler: Summary: SPL Compiler: Summary Parsing Intermediate Code Generation Intermediate Code Restructuring Target Code Generation Symbol Table Abstract Syntax Tree I-Code I-Code C, FORTRAN function Template Table SPL Formula Template Definition Symbol Definition Optimization I-Code SPL Program Built-in optimizations: single static assignment code no reuse of temporary vars only scalar temporary vars constants precomputed limited CSE Extensible through templatesSlide17: Algorithm (Formula) Implementation DSP Transform SearchWhy Search?: Why Search? DCT, type IV, size 16 maaaany different formulas large spread in runtimes, even for modest size not due to arithmetic cost best formula is platform-dependent ~31000 formulasSearch Methods available in SPIRAL: Search Methods available in SPIRAL Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (similar to a genetic algorithm) Search over algorithm space and implementation options (degree of unrolling)STEER: STEER Population n: Population n+1: …… …… Mutation Cross-Breeding expand differently swap expansions Survival of FittestExperimental Results (C code): Experimental Results (C code) high performance code (compared with FFTW) different transforms search methods (applicable to all transforms) generated high quality codeSPIRAL System: SPIRAL System Available for download (v3.1): www.ece.cmu.edu/~spiral Easy installation (Unix: configure/make; Windows: install shield) Unix/Linux and Windows 98/ME/NT/2000/XP Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV, MDCT, Filters, Wavelets, Toeplitz, Circulants Extensible New version (4.0) in preparationSlide23: Recent WorkLearning to Generate Fast Algorithms: Learning to Generate Fast Algorithms Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy) Learns from a transform of one size, generates the best algorithm for many sizes Tested for DFT and WHTSIMD Short Vector Extensions: SIMD Short Vector Extensions + x vector length = 4 (4-way) Extension to instruction set architecture Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) Originally for multimedia (like MMX for integers) Requires fine grain parallelism Large potential speed-up SIMD instructions are architecture specific No common API (usually assembly hand coding) Performance very sensitive to memory access Automatic vectorization very limited Problems: very difficult to use Vector code generation from SPL formulas: Vector code generation from SPL formulasGenerated Vector Code DFTs: Pentium 4: Generated Vector Code DFTs: Pentium 4 gflops DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 n speedups (to C code) up to factor of 3.3 beats hand-tuned vendor library Generated Vector Code, Other Transforms: Generated Vector Code, Other Transforms normalized runtime normalized runtime transform size transform size 2-dim DCT WHT speedups up to factor of 2.5 Slide29: Flexible Loop Interleaving (Runtime Gain WHT) Alpha 21264: up to 70% UltraSparc III: up to 60% Athlon XP: up to 55% Pentium 4: up to 45%Parallel Code Generation: Example WHT: Parallel Code Generation: Example WHT WHT size log(N) 1 thread 8 threads 10 threads PowerPC RS64 III Parallelized constructs: In A, A In Code Scheduling: Code Scheduling Runtime histograms DFT, size 16 ~6500 formulas DCT4, size 16 ~16000 formulas unscheduled scheduledFilters and Wavelets: Filters and Wavelets New constructs: row/column overlapped tensor product Examples for rules:Conclusions: Conclusions Automatic code generation for the entire domain of (linear) DSP algorithms Portable high performance across platforms and across time Integration of math (high) level and implementation (low) level Intelligence through search and learning in the space of alternatives Future Plans: Future Plans Transforms: Radon, Gabor, Hankel, structured matrices, … Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs Instructions: Vector, FMAs, prefetching, OpenMP, MPI Beyond transforms: entire DSP applications Other domains amenable to SPIRAL approach?Slide35: http://www.ece.cmu.edu/~spiral Questions?