Lecture materials: parallel-chapter4
Uploaded by aSGuest96031. Added: April 25, 2011.

Presentation Transcript

Parallel Processing Systems

Chapter 4 Vector Processors

Introduction
Typical operations on array-oriented data:
- one or more vectors ==> a scalar result
- two vectors ==> a vector
- a scalar and a vector ==> a vector
- a combination of the above three operations

Introduction (continued)
Three architectures suitable for vector processing environments:
- pipelined vector processors
- parallel array processors
- systolic array architectures

Introduction (continued)
- Pipelined vector processors: they utilize one or more pipelined ALUs to achieve high computation throughput.
- Parallel array processors: they adopt a multiplicity of CPUs that operate on elements of arrays in parallel.
- Systolic array architectures: they use extensive pipelining and parallel processing.

Introduction (continued)
Vector processors are supercomputers optimized for fast execution of long groups of vectorizable scientific code.
Vector processors are extensively pipelined architectures designed to operate on array-oriented data.

4.1 Vector Processor Models
Figure 4.1 shows a vector computational model.
- Start-up time: the number of clock cycles required before the first result is generated.
- Time to complete an N-element vector operation in a pipeline: start-up time + (N - 1) × initiation rate

4.1 Vector Processor Models (continued)
Note that the start-up time adds considerable overhead for small values of N, while its effect is negligible for large values of N.
Example 4.1

4.1 Vector Processor Models (continued)
Memory-oriented vector processor (Figure 4.2) versus register-oriented vector processor (Figure 4.3).
Characteristics of vector processors that contribute to high performance:
- high-speed memory
- a large number of registers
- instruction set
- multiplicity of overlapped processing levels

4.2 Memory Design Considerations
- Memory bandwidth: the average number of words that can be accessed from memory per second. It must match the demand of multiple pipelined vector processors.
- Memory system configuration: the number of memory modules, the bus width, and the address decoding structure.

4.2 Memory Design Considerations (continued)
Memory module characteristics:
- size
- access time
- cycle time

Example 4.2
Consider a vector processor with four 32-bit floating-point processors, each requiring two 32-bit operands every clock cycle and producing one 32-bit result. Assume that one 32-bit instruction is fetched for each arithmetic operation. What is the total traffic per cycle?
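The traffic question in Example 4.2 can be tallied with a few lines of arithmetic. The sketch below assumes that every processor reads its two operands, writes its result, and fetches one instruction in each cycle, all through the same memory system. The function name and its parameters are illustrative and do not come from the slides.

```python
WORD_BITS = 32  # word size stated in Example 4.2


def traffic_per_cycle(processors, operands, results, instructions,
                      word_bits=WORD_BITS):
    """Return (words, bits) moved between memory and processors per cycle.

    Assumption: operands, results and instructions all count as memory
    traffic, one word each, for every processor in every cycle.
    """
    words = processors * (operands + results + instructions)
    return words, words * word_bits


words, bits = traffic_per_cycle(processors=4, operands=2,
                                results=1, instructions=1)
print(f"{words} words, {bits} bits per cycle")  # 16 words, 512 bits per cycle
```

Under these assumptions the memory system has to deliver 16 words in every processor cycle.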
If the memory cycle time is 1.28 μs and the processor cycle time is 40 ns, how can we match the demand?

4.2 Memory Design Considerations (continued)
How to match the demand rate between the memory system and the processors:
- configure multiple memory modules that allow simultaneous access (Figure 4.4)
- insert fast intermediate memories

Example 4.3
C_i = A_i + B_i, 1 ≤ i ≤ N
Figure 4.5 shows the data structure in a memory system with 8 modules. Figure 4.6 shows the reservation table for the addition using a 3-stage pipelined adder and a memory with 8 modules.
- 1 delay on A

Example 4.4
An architecture with a 6-module memory, a 3-stage pipelined adder, and a memory access time equivalent to two processor cycle times. Figure 4.7 shows the reservation table.
- 3 delays on A
- 3 delays on output

Example 4.5
C[I] = A[I] + B[I], 1 ≤ I ≤ N
Assume N = 64 and a vector register length of 64. Floating-point addition takes six clock periods. Add one clock period to transfer data from the vector registers to the addition unit and one clock period to store the result into another vector register, for a total of eight clock periods per addition.
- In scalar mode: 64 × 8 = 512 clock periods
- In vector mode: 8 + 63 = 71 clock periods
- What if N < 64? What if N > 64?

4.2 Memory Design Considerations (continued)
Figure 4.8 shows the general structure of the vector processor with delay elements inserted in the input and output. A common method of further increasing the memory system bandwidth is to insert high-speed intermediate memory between main memory and the processor pipeline.

4.3 Architecture of the Cray Series
Cray X-MP/4 (Figure 4.9): successor of the Cray-1
- Memory: built out of several sections, each divided into banks.
- 25 to 100 Gbps
- 4 ports
- Resolving memory conflicts may require wait states to be inserted.
- A solid-state device is used as an exceptionally fast-access disk device.

4.3 Architecture of the Cray Series (continued)
Cray X-MP/4 (Figure 4.9)
- Processor interconnection: the interconnection of CPUs assumes a coarse-grained multiprocessing environment.
- Central processor (Figure 4.10): each CPU is a register-oriented vector processor. Table 4.1 shows the functional unit characteristics.
- Strip mining
- Chaining
Cray Y-MP, Cray-3, Cray-4

4.4 Two Other Architectures
Convex C series
- C1, C2, C3
- Figure 4.15 shows the architecture of the Convex C120 system.
FPS 5000 Series
- From FPS (Floating-Point Systems, Inc.)
- Figure 4.18 shows the FPS 5000 Series architecture.

4.5 Performance Evaluation
Major characteristics that affect supercomputer architecture:
- clock speed
- instruction issue rate
- size and number of registers
- memory size
- number of concurrent paths to memory
- ability to fetch/store vectors efficiently
- number of duplicate arithmetic functional units
- whether functional units can be chained together
- indirect addressing capability
- handling of conditional blocks of code

4.5 Performance Evaluation (continued)
The sustained performance depends on the following factors:
- level of vectorization
- average vector length
- possibility of vector chaining
- possible overlap of scalar, vector, and memory load/store operations
- memory contention resolution mechanism adopted

4.5 Performance Evaluation (continued)
Amdahl's law: Speed-up = 1 / ((1 - f) + f/s), where f is the fraction of the computation that is vectorized and s is the ratio of the speed of the vector unit to that of the scalar unit.
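Amdahl's law in its usual form is Speed-up = 1 / ((1 - f) + f/s). The short sketch below evaluates it for a few cases. The symbol f (the vectorized fraction of the work) is an assumption here, since the transcript only defines s; the function name is illustrative.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speed-up when a fraction f of the work runs s times faster.

    f: vectorized fraction of the computation, between 0 and 1 (assumed symbol)
    s: ratio of vector-unit speed to scalar-unit speed
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# With a vector unit 10 times faster than the scalar unit, the overall
# gain stays well below 10 unless nearly all of the code is vectorized.
for f in (0.5, 0.9, 1.0):
    print(f"f = {f:.1f}: speed-up = {amdahl_speedup(f, 10.0):.2f}")
```

The scalar remainder (1 - f) dominates quickly, which is why the level of vectorization heads the list of factors for sustained performance.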
The execution time of a vector loop with N elements is T_N = T_memory + (N - 1) T_cycle, where T_memory is the time to initialize the starting address for each vector.
R = ..., where F is the number of floating-point operations included in the loop.

4.6 Programming Vector Processors
Programming facilities:
- development of programming facilities
- development of compilers
In general, it is not possible to completely vectorize a sequential program. An algorithm that is considered efficient for scalar computation need not be efficient in a vector environment; modifications are then needed to take advantage of the vector hardware.

4.6 Programming Vector Processors (continued)
Several techniques adopted in vector processor environments:
- scalar renaming
- scalar expansion
- loop unrolling
- loop fusion (jamming)
- loop distribution
- forcing maximum work into the inner loop
- subprogram in-lining
- eliminating ambiguity using the PARAMETER statement
- positioning the most frequently executed scalar conditional block first
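As a rough illustration of two of these techniques, the sketch below applies scalar expansion followed by loop distribution to a small loop. It is written in Python only to show the shape of the transformation: the list comprehensions stand in for whole-vector instructions, and the function names are hypothetical. A vectorizing compiler would perform the equivalent rewrite on Fortran source.

```python
def scalar_loop(a, b):
    """Original form: the scalar t links the two statements in each iteration."""
    c = [0.0] * len(a)
    for i in range(len(a)):
        t = a[i] + b[i]   # scalar temporary, overwritten every iteration
        c[i] = t * t
    return c


def expanded_loop(a, b):
    """After scalar expansion and loop distribution.

    t is promoted to an array with one element per iteration, and the loop
    is split so that each statement covers the whole vector on its own.
    """
    t = [x + y for x, y in zip(a, b)]   # stands in for one vector add
    c = [x * x for x in t]              # stands in for one vector multiply
    return c


a = [1.0, 2.0, 3.0]
b = [3.0, 4.0, 5.0]
print(scalar_loop(a, b) == expanded_loop(a, b))  # True
```

The expanded version uses extra storage for t, which is the usual cost of removing the scalar from the loop body.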