logging in or signing up Chaver Calogera Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 291 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: January 12, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado Index: Index Motivation Experimental environment Lifting Transform Memory hierarchy exploitation SIMD optimization Conclusions Future workSlide3: MotivationSlide4: Motivation Applications based on the Wavelet Transform: JPEG-2000 MPEG-4 Usage of the lifting scheme Study based on a modern general purpose microprocessor Pentium 4 Objectives: Efficient exploitation of Memory Hierarchy Use of the SIMD ISA extensionsSlide5: Experimental EnvironmentSlide6: Experimental Environment RedHat Distribution 7.2 (Enigma) Operating System 1 GB RDRAM (PC800) Memory 512 KB, 128 Byte/Line L2 8 KB, 64 Byte/Line, Write-Through DL1 NA IL1 Cache DFI WT70-EC Motherboard Intel Pentium4 (2,4 GHz) Platform Intel ICC compiler GCC compiler CompilerSlide7: Lifting TransformSlide8: D 1st 1st 1st 1st 1st 1st Lifting Transform Original element 1st step 2nd step A D D D A A A 1st 1stSlide9: N Levels Lifting Transform 1 Level Horizontal Filtering (1D Lifting Transform) Vertical Filtering (1D Lifting Transform) Original element ApproximationSlide10: Lifting Transform Horizontal Filtering Vertical Filtering Slide11: Memory Hierarchy ExploitationSlide12: Poor data locality of one component (canonical layouts) E.g. : column-major layout processing image rows (Horizontal Filtering) Aggregation (loop tiling) Memory Hierarchy Exploitation Poor data locality of the whole transform Other layoutsSlide13: Memory Hierarchy Exploitation Horizontal Filtering Vertical Filtering Slide14: Aggregation Horizontal Filtering IMAGE Memory Hierarchy ExploitationSlide15: Memory Hierarchy Exploitation INPLACE Common implementation of the transform Memory: Only requires the original matrix For most applications needs post-processing MALLAT Memory: requires 2 matrices Stores the image in the expected order INPLACE-MALLAT Memory: requires 2 matrices Stores the image in the expected order Different studied schemesSlide16: Memory Hierarchy Exploitation O O O O O O O O O O O O O O O O MATRIX 1 logical view physical view INPLACESlide17: Memory Hierarchy Exploitation O O O O O O O O O O O O O O O O MATRIX 1 MATRIX 2 logical view physical view MALLATSlide18: Memory Hierarchy Exploitation MATRIX 1 MATRIX 2 O O O O O O O O O O O O O O O O logical view physical view INPLACE- MALLATSlide19: Memory Hierarchy Exploitation Execution time breakdown for several sizes comparing both compilers. I, IM and M denote inplace, inplace-mallat, and mallat strategies respectively. Each bar shows the execution time of each level and the post-processing step. Slide20: The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above These 2 approaches have a noticeable slowdown for the 1st level: Larger working set More complex access pattern The Inplace-Mallat version achieves the best execution time ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach Memory Hierarchy Exploitation CONCLUSIONSSlide21: SIMD OptimizationSlide22: Objective: Extract the parallelism available on the Lifting Transform Different strategies: Semi-automatic vectorization Hand-coded vectorization Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout) SIMD OptimizationSlide23: SIMD Optimization Automatic Vectorization (Intel C/C++ Compiler) Inner loops Simple array index manipulation Iterate over contiguous memory locations Global variables avoided Pointer disambiguation if pointers are employedSlide24: Original element 1st step 2nd step A D SIMD Optimization 1st 1stSlide25: SIMD Optimization Column-major layout Vectorial Horizontal filtering + a x + Horizontal filtering + a x + a a a Slide26: SIMD Optimization Column-major layout Vectorial Vertical filtering + a x + Vertical filtering + a x + a a a Slide27: for(j=2,k=1;j<(#columns-4);j+=2,k++) { #pragma vector aligned for(i=0;i<#rows;i++) { /* 1st operation */ col3=col3 + alfa*( col4+ col2); /* 2nd operation */ col2=col2 + beta*( col3+ col1); /* 3rd operation */ col1=col1 + gama*( col2+ col0); /* 4th operation */ col0 =col0 + delt*( col1+ col-1); /* Last step */ detail = col1 *phi_inv; aprox = col0 *phi; } } Horizontal Vectorial Filtering (semi-automatic) SIMD OptimizationSlide28: SIMD Optimization Hand-coded Vectorization SIMD parallelism has to be explicitly expressed Intrinsics allow more flexibility Possibility to also vectorize the vertical filtering Slide29: Horizontal Vectorial Filtering (hand) SIMD Optimization /* 1st operation */ t2 = _mm_load_ps(col2); t4 = _mm_load_ps(col4); t3 = _mm_load_ps(col3); coeff = _mm_set_ps1(alfa); t4 = _mm_add_ps(t2,t4); t4 = _mm_mul_ps(t4,coeff); t3 = _mm_add_ps(t4,t3); _mm_store_ps(col3,t3); /* 2nd operation */ /* 3rd operation */ /* 4th operation */ /* Last step */ _mm_store_ps(detail,t1); _mm_store_ps(aprox,t0); t2 t3 t4 + a x + a a a Slide30: SIMD Optimization Execution time breakdown of the horizontal filtering (10242 pixels image). I, IM and M denote inplace, inplace-mallat and mallat approaches. S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.Slide31: SIMD Optimization Speedup between 4 and 6 depending on the strategy. The reason for such a high improvement is due not only to the vectorial computations, but also to a considerable reduction in the memory accesses. The speedups achieved by the strategies with recursive layouts (i.e. inplace-mallat and mallat) are higher than the inplace version counterparts, since the computation on the latter can only be vectorized in the first level. For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer. CONCLUSIONSSlide32: SIMD Optimization Execution time breakdown of the whole transform (10242 pixels image). I, IM and M denote inplace, inplace-mallat and mallat approaches. S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.Slide33: SIMD Optimization Speedup between 1,5 and 2 depending on the strategy. For ICC the shortest execution time is reached by the mallat version. When using GCC both recursive-layout strategies obtain similar results. CONCLUSIONSSlide34: SIMD Optimization Speedup achieved by the different vectorial codes over the inplace-mallat and inplace. We show the hand-coded ICC, the automatic ICC, and the hand-coded GCC.Slide35: SIMD Optimization The speedup grows with the image size since. On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when considering it over the inplace strategy. Focusing on the compilers, ICC clearly outperforms GCC by a significant 20-25% for all the image sizes CONCLUSIONSSlide36: ConclusionsSlide37: Scalar version: We have introduced a new scheme called Inplace-Mallat, that outperforms both the Inplace implementation and the Mallat scheme. SIMD exploitation: Code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with ICC compiler: semi-automatic and intrinsic-based vectorizations. Both provide similar results. Speedup: Horizontal filtering about 4-6 (vectorization also reduces the pressure on the memory system). Whole transform around 2. The vectorial Mallat approach outperforms the other schemes and exhibits a better scalability. Most of our insights are compiler independent. ConclusionsSlide38: Future workSlide39: 4D layout for a lifting-based scheme Measurements using other platforms Intel Itanium Intel Pentium-4 with hiperthreading Parallelization using OpenMP (SMT) Future work For additional information: http://www.dacya.ucm.es/dchaver You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
Chaver Calogera Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 291 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: January 12, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions: Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions D. Chaver, C. Tenllado, L. Piñuel, M. Prieto, F. Tirado Index: Index Motivation Experimental environment Lifting Transform Memory hierarchy exploitation SIMD optimization Conclusions Future workSlide3: MotivationSlide4: Motivation Applications based on the Wavelet Transform: JPEG-2000 MPEG-4 Usage of the lifting scheme Study based on a modern general purpose microprocessor Pentium 4 Objectives: Efficient exploitation of Memory Hierarchy Use of the SIMD ISA extensionsSlide5: Experimental EnvironmentSlide6: Experimental Environment RedHat Distribution 7.2 (Enigma) Operating System 1 GB RDRAM (PC800) Memory 512 KB, 128 Byte/Line L2 8 KB, 64 Byte/Line, Write-Through DL1 NA IL1 Cache DFI WT70-EC Motherboard Intel Pentium4 (2,4 GHz) Platform Intel ICC compiler GCC compiler CompilerSlide7: Lifting TransformSlide8: D 1st 1st 1st 1st 1st 1st Lifting Transform Original element 1st step 2nd step A D D D A A A 1st 1stSlide9: N Levels Lifting Transform 1 Level Horizontal Filtering (1D Lifting Transform) Vertical Filtering (1D Lifting Transform) Original element ApproximationSlide10: Lifting Transform Horizontal Filtering Vertical Filtering Slide11: Memory Hierarchy ExploitationSlide12: Poor data locality of one component (canonical layouts) E.g. : column-major layout processing image rows (Horizontal Filtering) Aggregation (loop tiling) Memory Hierarchy Exploitation Poor data locality of the whole transform Other layoutsSlide13: Memory Hierarchy Exploitation Horizontal Filtering Vertical Filtering Slide14: Aggregation Horizontal Filtering IMAGE Memory Hierarchy ExploitationSlide15: Memory Hierarchy Exploitation INPLACE Common implementation of the transform Memory: Only requires the original matrix For most applications needs post-processing MALLAT Memory: requires 2 matrices Stores the image in the expected order INPLACE-MALLAT Memory: requires 2 matrices Stores the image in the expected order Different studied schemesSlide16: Memory Hierarchy Exploitation O O O O O O O O O O O O O O O O MATRIX 1 logical view physical view INPLACESlide17: Memory Hierarchy Exploitation O O O O O O O O O O O O O O O O MATRIX 1 MATRIX 2 logical view physical view MALLATSlide18: Memory Hierarchy Exploitation MATRIX 1 MATRIX 2 O O O O O O O O O O O O O O O O logical view physical view INPLACE- MALLATSlide19: Memory Hierarchy Exploitation Execution time breakdown for several sizes comparing both compilers. I, IM and M denote inplace, inplace-mallat, and mallat strategies respectively. Each bar shows the execution time of each level and the post-processing step. Slide20: The Mallat and Inplace-Mallat approaches outperform the Inplace approach for levels 2 and above These 2 approaches have a noticeable slowdown for the 1st level: Larger working set More complex access pattern The Inplace-Mallat version achieves the best execution time ICC compiler outperforms GCC for Mallat and Inplace-Mallat, but not for the Inplace approach Memory Hierarchy Exploitation CONCLUSIONSSlide21: SIMD OptimizationSlide22: Objective: Extract the parallelism available on the Lifting Transform Different strategies: Semi-automatic vectorization Hand-coded vectorization Only the horizontal filtering of the transform can be semi-automatically vectorized (when using a column-major layout) SIMD OptimizationSlide23: SIMD Optimization Automatic Vectorization (Intel C/C++ Compiler) Inner loops Simple array index manipulation Iterate over contiguous memory locations Global variables avoided Pointer disambiguation if pointers are employedSlide24: Original element 1st step 2nd step A D SIMD Optimization 1st 1stSlide25: SIMD Optimization Column-major layout Vectorial Horizontal filtering + a x + Horizontal filtering + a x + a a a Slide26: SIMD Optimization Column-major layout Vectorial Vertical filtering + a x + Vertical filtering + a x + a a a Slide27: for(j=2,k=1;j<(#columns-4);j+=2,k++) { #pragma vector aligned for(i=0;i<#rows;i++) { /* 1st operation */ col3=col3 + alfa*( col4+ col2); /* 2nd operation */ col2=col2 + beta*( col3+ col1); /* 3rd operation */ col1=col1 + gama*( col2+ col0); /* 4th operation */ col0 =col0 + delt*( col1+ col-1); /* Last step */ detail = col1 *phi_inv; aprox = col0 *phi; } } Horizontal Vectorial Filtering (semi-automatic) SIMD OptimizationSlide28: SIMD Optimization Hand-coded Vectorization SIMD parallelism has to be explicitly expressed Intrinsics allow more flexibility Possibility to also vectorize the vertical filtering Slide29: Horizontal Vectorial Filtering (hand) SIMD Optimization /* 1st operation */ t2 = _mm_load_ps(col2); t4 = _mm_load_ps(col4); t3 = _mm_load_ps(col3); coeff = _mm_set_ps1(alfa); t4 = _mm_add_ps(t2,t4); t4 = _mm_mul_ps(t4,coeff); t3 = _mm_add_ps(t4,t3); _mm_store_ps(col3,t3); /* 2nd operation */ /* 3rd operation */ /* 4th operation */ /* Last step */ _mm_store_ps(detail,t1); _mm_store_ps(aprox,t0); t2 t3 t4 + a x + a a a Slide30: SIMD Optimization Execution time breakdown of the horizontal filtering (10242 pixels image). I, IM and M denote inplace, inplace-mallat and mallat approaches. S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.Slide31: SIMD Optimization Speedup between 4 and 6 depending on the strategy. The reason for such a high improvement is due not only to the vectorial computations, but also to a considerable reduction in the memory accesses. The speedups achieved by the strategies with recursive layouts (i.e. inplace-mallat and mallat) are higher than the inplace version counterparts, since the computation on the latter can only be vectorized in the first level. For ICC, both vectorization approaches (i.e. automatic and hand-tuned) produce similar speedups, which highlights the quality of the ICC vectorizer. CONCLUSIONSSlide32: SIMD Optimization Execution time breakdown of the whole transform (10242 pixels image). I, IM and M denote inplace, inplace-mallat and mallat approaches. S, A and H denote scalar, automatic-vectorized and hand-coded-vectorized.Slide33: SIMD Optimization Speedup between 1,5 and 2 depending on the strategy. For ICC the shortest execution time is reached by the mallat version. When using GCC both recursive-layout strategies obtain similar results. CONCLUSIONSSlide34: SIMD Optimization Speedup achieved by the different vectorial codes over the inplace-mallat and inplace. We show the hand-coded ICC, the automatic ICC, and the hand-coded GCC.Slide35: SIMD Optimization The speedup grows with the image size since. On average, the speedup is about 1.8 over the inplace-mallat scheme, growing to about 2 when considering it over the inplace strategy. Focusing on the compilers, ICC clearly outperforms GCC by a significant 20-25% for all the image sizes CONCLUSIONSSlide36: ConclusionsSlide37: Scalar version: We have introduced a new scheme called Inplace-Mallat, that outperforms both the Inplace implementation and the Mallat scheme. SIMD exploitation: Code modifications for the vectorial processing of the lifting algorithm. Two different methodologies with ICC compiler: semi-automatic and intrinsic-based vectorizations. Both provide similar results. Speedup: Horizontal filtering about 4-6 (vectorization also reduces the pressure on the memory system). Whole transform around 2. The vectorial Mallat approach outperforms the other schemes and exhibits a better scalability. Most of our insights are compiler independent. ConclusionsSlide38: Future workSlide39: 4D layout for a lifting-based scheme Measurements using other platforms Intel Itanium Intel Pentium-4 with hiperthreading Parallelization using OpenMP (SMT) Future work For additional information: http://www.dacya.ucm.es/dchaver