logging in or signing up Sytem on chip konkati Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 263 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: March 22, 2010 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript System on Chip C(SoC-C)Efficient programming abstractions for heterogeneous multicore Systems on Chip : 1 System on Chip C(SoC-C)Efficient programming abstractions for heterogeneous multicore Systems on Chip Alastair Reid ARM Ltd Yuan Lin University of Michigan Krisztian Flautner ARM Ltd Edmund Grimley-Evans ARM Ltd Mobile Consumer Electronics Trends : 2 Mobile Consumer Electronics Trends Mobile Application Requirements Still Growing Rapidly Still cameras: 2Mpixel 10 Mpixel Video cameras: VGA HD 1080p … Video players: MPEG-2 H.264 2D Graphics: QVGA HVGA VGA FWVGA … 3D Gaming: > 30Mtriangle/s, antialiasing, … Bandwidth: HSDPA (14.4Mbps) WiMax (70Mbps) LTE (326Mbps) Feature Convergence Phone + graphics + UI + games + still camera + video camera + music + WiFi + Bluetooth + 3.5G + 3.9G + WiMax + GPS + … Trends in Mobile Processing Requirements : 3 Trends in Mobile Processing Requirements Video Encode Mobile Phone Encoding requirements for 30 frames/s VGA video 3GPP receiver requirements for different channel types Source: O. Silven and K. Jyrkka, Observation on Power-Efficiency Trends in Mobile Communication Devices, EURASIP Journal on Embedded Systems, 2007 Pocket Supercomputers : 4 Pocket Supercomputers The challenge is not processing power The challenge is energy efficiency Different Requirements : 5 Different Requirements Desktop/Laptop/Server 1-10Gop/s 10-100W Mobile Electronics 10-100Gop/s 100mW-1W 10x performance 1/100 power consumption = 1000x energy efficiency Energy Efficient Systems are “Lumpy” : 6 Energy Efficient Systems are “Lumpy” Drop Frequency 10x Desktop: 2-4GHz Mobile: 200-400MHz Increase Parallelism 100x Desktop: 1-2 cores Mobile: 32-way SIMD Instruction Set, 4-8 cores Match Processor Type to Task Desktop: homogeneous, general purpose Mobile: heterogeneous, specialised Keep Memory Local Desktop: coherent, shared memory Mobile: processor-memory clusters linked by DMA Example Architecture : 7 Example Architecture Distributed Memories Control Processor SIMD Instruction Set Data Engines Accelerators Artist’s impression How do we program AMP systems? : 8 How do we program AMP systems? C doesn’t provide language features to support Multiple processors (or multi-ISA systems) Distributed memory Multiple threads Use Indirection (Strawman #1) : 9 Use Indirection (Strawman #1) Add a layer of indirection Operating System Layer of middleware Device drivers Hardware support All impose a cost in Power/Performance/Area Raise Pain Threshold (Strawman #2) : 10 Raise Pain Threshold (Strawman #2) Write efficient code at very low level of abstraction Problems Hard, slow and expensive to write, test, debug and maintain Design intent drowns in sea of low level detail Not portable across different architectures Expensive to try different points in design space Our Response : 11 Our Response Extend C Support Asymmetric Multiprocessors SoC-C language raises level of abstraction … but take care not to hide expensive operations Overview : 12 Overview Pocket-Sized Supercomputers Energy efficient hardware is “lumpy” … and unsupported by C … but supported by SoC-C SoC-C Extensions by Example Pipeline Parallelism Code Placement Data Placement SoC-C Compilation Conclusion 3 steps in mapping an application : 13 3 steps in mapping an application Decide how to parallelize Choose processors for each pipeline stage Resolve distributed memory issues A Simple Program : 14 A Simple Program int x[100]; int y[100]; int z[100]; while (1) { get(x); foo(y,x); bar(z,y); baz(z); put(z); } Step 1: Decide how to parallelize : 15 Step 1: Decide how to parallelize int x[100]; int y[100]; int z[100]; while (1) { get(x); foo(y,x); bar(z,y); baz(z); put(z); } 50% of work 50% of work Step 1: Decide how to parallelize : 16 Step 1: Decide how to parallelize int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x); FIFO(y); bar(z,y); baz(z); put(z); } } PIPELINE indicates region to parallelize FIFO indicates boundaries between pipeline stages SoC-C Feature #1: Pipeline Parallelism : 17 SoC-C Feature #1: Pipeline Parallelism Annotations express coarse-grained pipeline parallelism PIPELINE indicates scope of parallelism FIFO indicates boundaries between pipeline stages Compiler splits into threads communicating through FIFOs Step 2: Choose Processors : 18 Step 2: Choose Processors int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x); FIFO(y); bar(z,y); baz(z); put(z); } } Step 2: Choose Processors : 19 Step 2: Choose Processors int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } @ P indicates processor to execute function SoC-C Feature #2: RPC Annotations : 20 SoC-C Feature #2: RPC Annotations Annotations express where code is to execute Behaves like Synchronous Remote Procedure Call Does not change meaning of program Bulk data is not implicitly copied to processor’s local memory Step 3: Resolve Memory Issues : 21 Step 3: Resolve Memory Issues int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } P0 uses x x must be in M0 P1 uses z z must be in M1 P0 uses y y must be in M0 P1 uses y y must be in M1 Conflict?! Hardware Cache Coherency : 22 Hardware Cache Coherency P0 $0 P1 $1 write x read x write x invalidate x copy x invalidate x Step 3: Resolve Memory Issues : 23 Step 3: Resolve Memory Issues int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } Two versions: y@M0, y@M1 write y@M0 y@M1 is invalid reads y@M1 Coherence error Step 3: Resolve Memory Issues : 24 Step 3: Resolve Memory Issues int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; SYNC(x) @ DMA; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } SYNC(x) @ P copies data from one version of x to another using processor P read y@M1 y@M1 and y@M0 are valid SoC-C Feature #3: Compile Time Coherency : 25 SoC-C Feature #3: Compile Time Coherency Variables can have multiple coherent versions Compiler uses memory topology to determine which version is being accessed Compiler applies cache coherency protocol Writing to a version makes it valid and other versions invalid Dataflow analysis propagates validity Reading from an invalid version is an error SYNC(x) copies from valid version to invalid version What SoC-C Provides : 26 What SoC-C Provides SoC-C language features Pipeline to support parallelism Coherence to support distributed memory RPC to support multiple processors/ISAs Non-features Does not choose boundary between pipeline stages Does not resolve coherence problems Does not allocate processors SoC-C is concise notation to express mapping decisions (not a tool for making them on your behalf) Compiling SoC-C : 27 Compiling SoC-C Data Placement Infer data placement Propagate coherence Split variables with multiple placement Pipeline Parallelism Identify maximal threads Split into multiple threads Apply zero copy optimization RPC (see paper for details) Step 1a: Infer Data Placement : 28 Step 1a: Infer Data Placement int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; SYNC(x) @ DMA; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } Solve Set of Constraints Step 1a: Infer Data Placement : 29 Step 1a: Infer Data Placement int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; SYNC(x) @ DMA; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } Solve Set of Constraints Memory Topology constrains where variables could live Step 1a: Infer Data Placement : 30 Solve Set of Constraints Memory Topology constrains where variables could live Step 1a: Infer Data Placement int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@?); foo(y@M0, x@M0) @ P0; SYNC(y,?,?) @ DMA; FIFO(y@?); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@?); } } Step 1b: Propagate Coherence : 31 Solve Set of Constraints Memory Topology constrains where variables could live Forwards Dataflow propagates availability of valid versions Step 1b: Propagate Coherence int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@?); foo(y@M0, x@M0) @ P0; SYNC(y,?,?) @ DMA; FIFO(y@?); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@?); } } Step 1b: Propagate Coherence : 32 Solve Set of Constraints Memory Topology constrains where variables could live Forwards Dataflow propagates availability of valid versions Step 1b: Propagate Coherence int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@?); foo(y@M0, x@M0) @ P0; SYNC(y,?,M0) @ DMA; FIFO(y@?); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@M1); } } Step 1b: Propagate Coherence : 33 Solve Set of Constraints Memory Topology constrains where variables could live Forwards Dataflow propagates availability of valid versions Backwards Dataflow propagates need for valid versions Step 1b: Propagate Coherence int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@?); foo(y@M0, x@M0) @ P0; SYNC(y,?,M0) @ DMA; FIFO(y@?); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@M1); } } Step 1b: Propagate Coherence : 34 Solve Set of Constraints Memory Topology constrains where variables could live Forwards Dataflow propagates availability of valid versions Backwards Dataflow propagates need for valid versions Step 1b: Propagate Coherence int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@M0); foo(y@M0, x@M0) @ P0; SYNC(y,M1,M0) @ DMA; FIFO(y@M1); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@M1); } } Step 1c: Split Variables : 35 Step 1c: Split Variables int x[100] @ {M0}; int y0[100] @ {M0}; int y1[100] @ {M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); } } Split variables with multiple locations Replace SYNC with memcpy Step 2: Implement Pipeline Annotation : 36 Step 2: Implement Pipeline Annotation int x[100] @ {M0}; int y0[100] @ {M0}; int y1[100] @ {M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); } } Dependency Analysis Step 2a: Identify Dependent Operations : 37 Step 2a: Identify Dependent Operations int x[100] @ {M0}; int y0[100] @ {M0}; int y1[100] @ {M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); } } Dependency Analysis Split use-def chains at FIFOs Step 2b: Identify Maximal Threads : 38 Step 2b: Identify Maximal Threads int x[100] @ {M0}; int y0[100] @ {M0}; int y1[100] @ {M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); } } Dependency Analysis Split use-def chains at FIFOs Identify Thread Operations Step 2b: Split Into Multiple Threads : 39 Step 2b: Split Into Multiple Threads int x[100] @ {M0}; int y0[100] @ {M0}; int y1a[100] @ {M1}; int y1b[100] @ {M1}; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1a,y0,…) @ DMA; fifo_put(&f, y1a); } } SECTION { while (1) { fifo_get(&f, y1b); bar(z, y1b) @ P1; baz(z) @ P1; put(z); } } } Perform Dataflow Analysis Split use-def chains at FIFOs Identify Thread Operations Split into threads Step 2c: Zero Copy Optimization : 40 Step 2c: Zero Copy Optimization int x[100] @ {M0}; int y0[100] @ {M0}; int y1a[100] @ {M1}; int y1b[100] @ {M1}; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1a,y0,…) @ DMA; fifo_put(&f, y1a); } } SECTION { while (1) { fifo_get(&f, y1b); bar(z, y1b) @ P1; baz(z) @ P1; put(z); } } } Generate Data Copy into FIFO Copy out of FIFO Consume Data Step 2c: Zero Copy Optimization : 41 Step 2c: Zero Copy Optimization int x[100] @ {M0}; int y0[100] @ {M0}; int y1a[100] @ {M1}; int y1b[100] @ {M1}; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1a,y0,…) @ DMA; fifo_put(&f, y1a); } } SECTION { while (1) { fifo_get(&f, y1b); bar(z, y1b) @ P1; baz(z) @ P1; put(z); } } } Calculate Live Range of variables passed through FIFOs Live Range of y1a Live Range of y1b Step 2c: Zero Copy Optimization : 42 Step 2c: Zero Copy Optimization int x[100] @ {M0}; int y0[100] @ {M0}; int *py1a; int *py1b; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; fifo_acquireRoom(&f, &py1a); memcpy(py1a,y0,…) @ DMA; fifo_releaseData(&f, py1a); } } SECTION { while (1) { fifo_acquireData(&f, &py1b); bar(z, py1b) @ P1; fifo_releaseRoom(&f, py1b); baz(z) @ P1; put(z); } } } Calculate Live Range of variables passed through FIFOs Transform FIFO operations to pass pointers instead of copying data Acquire empty buffer Generate data directly into buffer Pass full buffer to thread 2 Acquire full buffer from thread 1 Consume data directly from buffer Release empty buffer Order of transformations : 43 Order of transformations Dataflow-sensitive transformations go first Inferring data placement Coherence checking within threads Dependency analysis for parallelism Parallelism transformations Obscures data and control flow Thread-local optimizations go last Zero-copy optimization of FIFO operations Continuation passing thread implementation Related Work : 44 Related Work Language OpenMP: SMP data parallelism using ‘C plus annotations’ StreamIt: Pipeline parallelism using dataflow language Pipeline parallelism J.E. Smith, “Decoupled access/execute computer architectures,” Trans. Computer Systems, 2(4), 1984 Multiple independent reinventions Hardware Woh et al., “From Soda to Scotch: The Evolution of a Wireless Baseband Processor,” Proc. MICRO-41, Nov. 2008 (not cited by paper) The SoC-C Model : 45 The SoC-C Model Program as if using SMP system Single multithreaded processor: RPCs provide a “Migrating thread Model” Single memory: Compiler Managed Coherence handles “bookkeeping” Annotations change execution, not semantics Avoid need to restructure code Pipeline parallelism Compiler managed coherence Efficiency Avoid abstracting expensive operations programmer can optimize and reason about Slide 46: 46 Fin Language Design Meta Issues : 47 Language Design Meta Issues Compiler only uses simple analyses Easier to maintain consistency between different compiler versions/implementations Programmer makes the high-level decisions Code and Data Placement Inserting SYNC Load balancing Implementation by many source-source transforms Programmer can mix high- and low-level features 90-10 rule: use high-level features when you can, low-level features when you need to Step 3a: Resolve Overloaded RPC : 48 Step 3a: Resolve Overloaded RPC int x[100] @ {M0}; int y0[100] @ {M0}; int *py1a; int *py1b; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); DE32_foo(0, y0, x); fifo_acquireRoom(&f, &py1a); DMA_memcpy(py1a,y0,…); fifo_releaseData(&f, py1a); } } SECTION { while (1) { fifo_acquireData(&f, &py1b); DE32_bar(1, z, py1b); fifo_releaseRoom(&f, py1b); DE32_baz(1, z); put(z); } } } Replace RPC by architecture specific call bar(…)@P1 DE32_bar(1,…) Step 3b: Split RPCs : 49 Step 3b: Split RPCs int x[100] @ {M0}; int y0[100] @ {M0}; int *py1a; int *py1b; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); start_DE32_foo(0, y0, x); wait(semaphore_DE32[0]); fifo_acquireRoom(&f, &py1a); start_DMA_memcpy(py1a,y0,…); wait(semaphore_DMA); fifo_releaseData(&f, py1a); } } SECTION { while (1) { fifo_acquireData(&f, &py1b); start_DE32_bar(1, z, py1b); wait(semaphore_DE32[1]); fifo_releaseRoom(&f, py1b); start_DE32_baz(1, z); wait(semaphore_DE32[1]); put(z); } } } RPCs have two phases start RPC wait for RPC to complete DE32_foo(0,…); start_DE32_foo(0,…); wait(semaphore_DE32[0]); Two Ways to Exploit Parallelism : 50 Two Ways to Exploit Parallelism Perform twice as much work 2 cores can perform 2x more work Perform same work for less energy DVFS (reduce current frequency) halving frequency and doubling #cores saves ~50% energy/op Shorter pipeline (reduce peak frequency) halving frequency and doubling #cores saves ~30% energy/op Techniques can be combined to give wider range of scaling Energy savings requires performance almost linear w/ #cores Parallel Speedup : 51 Parallel Speedup Efficient Same performance as hand-written code Near Linear Speedup Very efficient use of parallel hardware DVB-T Inner Receiver More realistic OFDM receiver 20 tasks, 500-7000 cycles per function, 29000 total You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
Sytem on chip konkati Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 263 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: March 22, 2010 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript System on Chip C(SoC-C)Efficient programming abstractions for heterogeneous multicore Systems on Chip : 1 System on Chip C(SoC-C)Efficient programming abstractions for heterogeneous multicore Systems on Chip Alastair Reid ARM Ltd Yuan Lin University of Michigan Krisztian Flautner ARM Ltd Edmund Grimley-Evans ARM Ltd Mobile Consumer Electronics Trends : 2 Mobile Consumer Electronics Trends Mobile Application Requirements Still Growing Rapidly Still cameras: 2Mpixel 10 Mpixel Video cameras: VGA HD 1080p … Video players: MPEG-2 H.264 2D Graphics: QVGA HVGA VGA FWVGA … 3D Gaming: > 30Mtriangle/s, antialiasing, … Bandwidth: HSDPA (14.4Mbps) WiMax (70Mbps) LTE (326Mbps) Feature Convergence Phone + graphics + UI + games + still camera + video camera + music + WiFi + Bluetooth + 3.5G + 3.9G + WiMax + GPS + … Trends in Mobile Processing Requirements : 3 Trends in Mobile Processing Requirements Video Encode Mobile Phone Encoding requirements for 30 frames/s VGA video 3GPP receiver requirements for different channel types Source: O. Silven and K. Jyrkka, Observation on Power-Efficiency Trends in Mobile Communication Devices, EURASIP Journal on Embedded Systems, 2007 Pocket Supercomputers : 4 Pocket Supercomputers The challenge is not processing power The challenge is energy efficiency Different Requirements : 5 Different Requirements Desktop/Laptop/Server 1-10Gop/s 10-100W Mobile Electronics 10-100Gop/s 100mW-1W 10x performance 1/100 power consumption = 1000x energy efficiency Energy Efficient Systems are “Lumpy” : 6 Energy Efficient Systems are “Lumpy” Drop Frequency 10x Desktop: 2-4GHz Mobile: 200-400MHz Increase Parallelism 100x Desktop: 1-2 cores Mobile: 32-way SIMD Instruction Set, 4-8 cores Match Processor Type to Task Desktop: homogeneous, general purpose Mobile: heterogeneous, specialised Keep Memory Local Desktop: coherent, shared memory Mobile: processor-memory clusters linked by DMA Example Architecture : 7 Example Architecture Distributed Memories Control Processor SIMD Instruction Set Data Engines Accelerators Artist’s impression How do we program AMP systems? : 8 How do we program AMP systems? C doesn’t provide language features to support Multiple processors (or multi-ISA systems) Distributed memory Multiple threads Use Indirection (Strawman #1) : 9 Use Indirection (Strawman #1) Add a layer of indirection Operating System Layer of middleware Device drivers Hardware support All impose a cost in Power/Performance/Area Raise Pain Threshold (Strawman #2) : 10 Raise Pain Threshold (Strawman #2) Write efficient code at very low level of abstraction Problems Hard, slow and expensive to write, test, debug and maintain Design intent drowns in sea of low level detail Not portable across different architectures Expensive to try different points in design space Our Response : 11 Our Response Extend C Support Asymmetric Multiprocessors SoC-C language raises level of abstraction … but take care not to hide expensive operations Overview : 12 Overview Pocket-Sized Supercomputers Energy efficient hardware is “lumpy” … and unsupported by C … but supported by SoC-C SoC-C Extensions by Example Pipeline Parallelism Code Placement Data Placement SoC-C Compilation Conclusion 3 steps in mapping an application : 13 3 steps in mapping an application Decide how to parallelize Choose processors for each pipeline stage Resolve distributed memory issues A Simple Program : 14 A Simple Program int x[100]; int y[100]; int z[100]; while (1) { get(x); foo(y,x); bar(z,y); baz(z); put(z); } Step 1: Decide how to parallelize : 15 Step 1: Decide how to parallelize int x[100]; int y[100]; int z[100]; while (1) { get(x); foo(y,x); bar(z,y); baz(z); put(z); } 50% of work 50% of work Step 1: Decide how to parallelize : 16 Step 1: Decide how to parallelize int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x); FIFO(y); bar(z,y); baz(z); put(z); } } PIPELINE indicates region to parallelize FIFO indicates boundaries between pipeline stages SoC-C Feature #1: Pipeline Parallelism : 17 SoC-C Feature #1: Pipeline Parallelism Annotations express coarse-grained pipeline parallelism PIPELINE indicates scope of parallelism FIFO indicates boundaries between pipeline stages Compiler splits into threads communicating through FIFOs Step 2: Choose Processors : 18 Step 2: Choose Processors int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x); FIFO(y); bar(z,y); baz(z); put(z); } } Step 2: Choose Processors : 19 Step 2: Choose Processors int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } @ P indicates processor to execute function SoC-C Feature #2: RPC Annotations : 20 SoC-C Feature #2: RPC Annotations Annotations express where code is to execute Behaves like Synchronous Remote Procedure Call Does not change meaning of program Bulk data is not implicitly copied to processor’s local memory Step 3: Resolve Memory Issues : 21 Step 3: Resolve Memory Issues int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } P0 uses x x must be in M0 P1 uses z z must be in M1 P0 uses y y must be in M0 P1 uses y y must be in M1 Conflict?! Hardware Cache Coherency : 22 Hardware Cache Coherency P0 $0 P1 $1 write x read x write x invalidate x copy x invalidate x Step 3: Resolve Memory Issues : 23 Step 3: Resolve Memory Issues int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } Two versions: y@M0, y@M1 write y@M0 y@M1 is invalid reads y@M1 Coherence error Step 3: Resolve Memory Issues : 24 Step 3: Resolve Memory Issues int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; SYNC(x) @ DMA; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } SYNC(x) @ P copies data from one version of x to another using processor P read y@M1 y@M1 and y@M0 are valid SoC-C Feature #3: Compile Time Coherency : 25 SoC-C Feature #3: Compile Time Coherency Variables can have multiple coherent versions Compiler uses memory topology to determine which version is being accessed Compiler applies cache coherency protocol Writing to a version makes it valid and other versions invalid Dataflow analysis propagates validity Reading from an invalid version is an error SYNC(x) copies from valid version to invalid version What SoC-C Provides : 26 What SoC-C Provides SoC-C language features Pipeline to support parallelism Coherence to support distributed memory RPC to support multiple processors/ISAs Non-features Does not choose boundary between pipeline stages Does not resolve coherence problems Does not allocate processors SoC-C is concise notation to express mapping decisions (not a tool for making them on your behalf) Compiling SoC-C : 27 Compiling SoC-C Data Placement Infer data placement Propagate coherence Split variables with multiple placement Pipeline Parallelism Identify maximal threads Split into multiple threads Apply zero copy optimization RPC (see paper for details) Step 1a: Infer Data Placement : 28 Step 1a: Infer Data Placement int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; SYNC(x) @ DMA; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } Solve Set of Constraints Step 1a: Infer Data Placement : 29 Step 1a: Infer Data Placement int x[100]; int y[100]; int z[100]; PIPELINE { while (1) { get(x); foo(y,x) @ P0; SYNC(x) @ DMA; FIFO(y); bar(z,y) @ P1; baz(z) @ P1; put(z); } } Solve Set of Constraints Memory Topology constrains where variables could live Step 1a: Infer Data Placement : 30 Solve Set of Constraints Memory Topology constrains where variables could live Step 1a: Infer Data Placement int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@?); foo(y@M0, x@M0) @ P0; SYNC(y,?,?) @ DMA; FIFO(y@?); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@?); } } Step 1b: Propagate Coherence : 31 Solve Set of Constraints Memory Topology constrains where variables could live Forwards Dataflow propagates availability of valid versions Step 1b: Propagate Coherence int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@?); foo(y@M0, x@M0) @ P0; SYNC(y,?,?) @ DMA; FIFO(y@?); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@?); } } Step 1b: Propagate Coherence : 32 Solve Set of Constraints Memory Topology constrains where variables could live Forwards Dataflow propagates availability of valid versions Step 1b: Propagate Coherence int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@?); foo(y@M0, x@M0) @ P0; SYNC(y,?,M0) @ DMA; FIFO(y@?); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@M1); } } Step 1b: Propagate Coherence : 33 Solve Set of Constraints Memory Topology constrains where variables could live Forwards Dataflow propagates availability of valid versions Backwards Dataflow propagates need for valid versions Step 1b: Propagate Coherence int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@?); foo(y@M0, x@M0) @ P0; SYNC(y,?,M0) @ DMA; FIFO(y@?); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@M1); } } Step 1b: Propagate Coherence : 34 Solve Set of Constraints Memory Topology constrains where variables could live Forwards Dataflow propagates availability of valid versions Backwards Dataflow propagates need for valid versions Step 1b: Propagate Coherence int x[100] @ {M0}; int y[100] @ {M0,M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x@M0); foo(y@M0, x@M0) @ P0; SYNC(y,M1,M0) @ DMA; FIFO(y@M1); bar(z@M1, y@M1) @ P1; baz(z@M1) @ P1; put(z@M1); } } Step 1c: Split Variables : 35 Step 1c: Split Variables int x[100] @ {M0}; int y0[100] @ {M0}; int y1[100] @ {M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); } } Split variables with multiple locations Replace SYNC with memcpy Step 2: Implement Pipeline Annotation : 36 Step 2: Implement Pipeline Annotation int x[100] @ {M0}; int y0[100] @ {M0}; int y1[100] @ {M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); } } Dependency Analysis Step 2a: Identify Dependent Operations : 37 Step 2a: Identify Dependent Operations int x[100] @ {M0}; int y0[100] @ {M0}; int y1[100] @ {M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); } } Dependency Analysis Split use-def chains at FIFOs Step 2b: Identify Maximal Threads : 38 Step 2b: Identify Maximal Threads int x[100] @ {M0}; int y0[100] @ {M0}; int y1[100] @ {M1}; int z[100] @ {M1}; PIPELINE { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1,y0,…) @ DMA; FIFO(y1); bar(z, y1) @ P1; baz(z) @ P1; put(z); } } Dependency Analysis Split use-def chains at FIFOs Identify Thread Operations Step 2b: Split Into Multiple Threads : 39 Step 2b: Split Into Multiple Threads int x[100] @ {M0}; int y0[100] @ {M0}; int y1a[100] @ {M1}; int y1b[100] @ {M1}; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1a,y0,…) @ DMA; fifo_put(&f, y1a); } } SECTION { while (1) { fifo_get(&f, y1b); bar(z, y1b) @ P1; baz(z) @ P1; put(z); } } } Perform Dataflow Analysis Split use-def chains at FIFOs Identify Thread Operations Split into threads Step 2c: Zero Copy Optimization : 40 Step 2c: Zero Copy Optimization int x[100] @ {M0}; int y0[100] @ {M0}; int y1a[100] @ {M1}; int y1b[100] @ {M1}; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1a,y0,…) @ DMA; fifo_put(&f, y1a); } } SECTION { while (1) { fifo_get(&f, y1b); bar(z, y1b) @ P1; baz(z) @ P1; put(z); } } } Generate Data Copy into FIFO Copy out of FIFO Consume Data Step 2c: Zero Copy Optimization : 41 Step 2c: Zero Copy Optimization int x[100] @ {M0}; int y0[100] @ {M0}; int y1a[100] @ {M1}; int y1b[100] @ {M1}; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; memcpy(y1a,y0,…) @ DMA; fifo_put(&f, y1a); } } SECTION { while (1) { fifo_get(&f, y1b); bar(z, y1b) @ P1; baz(z) @ P1; put(z); } } } Calculate Live Range of variables passed through FIFOs Live Range of y1a Live Range of y1b Step 2c: Zero Copy Optimization : 42 Step 2c: Zero Copy Optimization int x[100] @ {M0}; int y0[100] @ {M0}; int *py1a; int *py1b; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); foo(y0, x) @ P0; fifo_acquireRoom(&f, &py1a); memcpy(py1a,y0,…) @ DMA; fifo_releaseData(&f, py1a); } } SECTION { while (1) { fifo_acquireData(&f, &py1b); bar(z, py1b) @ P1; fifo_releaseRoom(&f, py1b); baz(z) @ P1; put(z); } } } Calculate Live Range of variables passed through FIFOs Transform FIFO operations to pass pointers instead of copying data Acquire empty buffer Generate data directly into buffer Pass full buffer to thread 2 Acquire full buffer from thread 1 Consume data directly from buffer Release empty buffer Order of transformations : 43 Order of transformations Dataflow-sensitive transformations go first Inferring data placement Coherence checking within threads Dependency analysis for parallelism Parallelism transformations Obscures data and control flow Thread-local optimizations go last Zero-copy optimization of FIFO operations Continuation passing thread implementation Related Work : 44 Related Work Language OpenMP: SMP data parallelism using ‘C plus annotations’ StreamIt: Pipeline parallelism using dataflow language Pipeline parallelism J.E. Smith, “Decoupled access/execute computer architectures,” Trans. Computer Systems, 2(4), 1984 Multiple independent reinventions Hardware Woh et al., “From Soda to Scotch: The Evolution of a Wireless Baseband Processor,” Proc. MICRO-41, Nov. 2008 (not cited by paper) The SoC-C Model : 45 The SoC-C Model Program as if using SMP system Single multithreaded processor: RPCs provide a “Migrating thread Model” Single memory: Compiler Managed Coherence handles “bookkeeping” Annotations change execution, not semantics Avoid need to restructure code Pipeline parallelism Compiler managed coherence Efficiency Avoid abstracting expensive operations programmer can optimize and reason about Slide 46: 46 Fin Language Design Meta Issues : 47 Language Design Meta Issues Compiler only uses simple analyses Easier to maintain consistency between different compiler versions/implementations Programmer makes the high-level decisions Code and Data Placement Inserting SYNC Load balancing Implementation by many source-source transforms Programmer can mix high- and low-level features 90-10 rule: use high-level features when you can, low-level features when you need to Step 3a: Resolve Overloaded RPC : 48 Step 3a: Resolve Overloaded RPC int x[100] @ {M0}; int y0[100] @ {M0}; int *py1a; int *py1b; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); DE32_foo(0, y0, x); fifo_acquireRoom(&f, &py1a); DMA_memcpy(py1a,y0,…); fifo_releaseData(&f, py1a); } } SECTION { while (1) { fifo_acquireData(&f, &py1b); DE32_bar(1, z, py1b); fifo_releaseRoom(&f, py1b); DE32_baz(1, z); put(z); } } } Replace RPC by architecture specific call bar(…)@P1 DE32_bar(1,…) Step 3b: Split RPCs : 49 Step 3b: Split RPCs int x[100] @ {M0}; int y0[100] @ {M0}; int *py1a; int *py1b; int z[100] @ {M1}; PARALLEL { SECTION { while (1) { get(x); start_DE32_foo(0, y0, x); wait(semaphore_DE32[0]); fifo_acquireRoom(&f, &py1a); start_DMA_memcpy(py1a,y0,…); wait(semaphore_DMA); fifo_releaseData(&f, py1a); } } SECTION { while (1) { fifo_acquireData(&f, &py1b); start_DE32_bar(1, z, py1b); wait(semaphore_DE32[1]); fifo_releaseRoom(&f, py1b); start_DE32_baz(1, z); wait(semaphore_DE32[1]); put(z); } } } RPCs have two phases start RPC wait for RPC to complete DE32_foo(0,…); start_DE32_foo(0,…); wait(semaphore_DE32[0]); Two Ways to Exploit Parallelism : 50 Two Ways to Exploit Parallelism Perform twice as much work 2 cores can perform 2x more work Perform same work for less energy DVFS (reduce current frequency) halving frequency and doubling #cores saves ~50% energy/op Shorter pipeline (reduce peak frequency) halving frequency and doubling #cores saves ~30% energy/op Techniques can be combined to give wider range of scaling Energy savings requires performance almost linear w/ #cores Parallel Speedup : 51 Parallel Speedup Efficient Same performance as hand-written code Near Linear Speedup Very efficient use of parallel hardware DVB-T Inner Receiver More realistic OFDM receiver 20 tasks, 500-7000 cycles per function, 29000 total