logging in or signing up hpnc98b Arundel0 Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 95 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: September 27, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Bridging the Gap Between Distributed Shared Memory and Message Passing: Bridging the Gap Between Distributed Shared Memory and Message Passing Holger Karl Department of Computer Science Courant Institute of Mathematical Sciences New York University Partially supported by DARPA/Rome laboratory and Deutsche Forschungsgemeinschaft Overview: Overview Motivation Related approaches Charlotte Annotating Charlotte Some experiments SummaryMotivation: Motivation Scenario Distributed computing using the World Wide Web Use Java to overcome heterogeneity, security concerns and program installation problems Idea: volunteer computing Applets can participate in distributed applications Challenges Only make use of standard Web browsers/Java virtual machines Consider the Web’s and Java’s idiosyncrasies Inherent faultiness, machines come and go Host-of-origin imposes a star-like communication topology Provide a simple programming interface, e.g., DSM No hardware support (e.g., page protection) available Do not forget efficiency Related Approaches: Related ApproachesCharlotte - Programs: Charlotte - Programs Alternating sequential and parallel steps Sequential steps executed by a manager application Parallel steps executed by worker applets Routines are defined in parallel steps Routines are methods of objects, derived from a Charlotte class Droutine public class Matrix extends Droutine { // define routine public void drun (int n, int Id){ // do computation } … public void run () { … parBegin(); addRoutine (this, Size); parEnd(); … } } Charlotte - Memory Semantics: Charlotte - Memory Semantics Memory partitioned into private and shared parts Shared memory with CRCW-common semantics Implemented at object level Charlotte provides classes for every primitive type: int Dint, float Dfloat, etc Manager has master copy of memory Shared objects have get() and set() methods get() brings invalid data from manager if necessary, amortized by bringing “pages” of shared data from the manager set() marks data as modified to be flushed to manager at end of parallel step Updates from different routines incorporated atomically at end of parallel step Allows eager scheduling for fault tolerance Routines can be executed multiple times without violating exactly once semantics Charlotte - Matrix Multiplication: Charlotte - Matrix Multiplication public class Matrix extends Droutine { // define routine public void drun (int n, int i){ for (int j=0; j<Size; j++){ sum = 0; for (int k=0; k<Size; k++) sum += A[i][k].get() * B[k][j].get(); C[i][j].set(sum); } … } Charlotte - Pros and Cons: Charlotte - Pros and Cons Pros Simple programming model Well-defined DSM semantics Adaptive parallelism Fault tolerant with respect to worker crashes Cons Efficiency problems Shared data is accessed via method invocations Loading data from manager incurs latency Choosing good page size is difficult Sending modified data back to manager requires inspection of status flags Reason: runtime system does not know which data is read or modified by routines Make this information explicit in Charlotte programs!Annotations: Annotations Specify read and write locations for routines Use methods of class Droutine dloc_read() dloc_write() First step: Correctness-preserving annotations Only send read data at beginning of routine public class Matrix extends Droutine { public void drun (int n, int i) {…} public Locations dloc_read (int n, int i) { Locations loc = new Locations(); loc.add (B); loc.add (A[i]); return loc; } … }Relying on Annotations: Relying on Annotations Second step: Rely on annotations Data transfer between manager and worker happens only according to annotations No if-statement in get() necessary No status update in set() necessary No latencies for requesting data updates But, wrong annotations lead to wrong results Use Uint (unchecked) instead of Dint, etc. Identical interface Remaining overhead: method invocation for data access Charlotte’s distributed types (Dint) and unchecked distributed types (Uint) can be freely mixed Simple to switch back and forthSharing Primitive Types: Sharing Primitive Types Third step: Use primitive types (int) instead of objects (Uint) Possible since get() and set() are trivial for Uint Possible since annotations describe data movement completely Avoids method invocation overhead But: different interface No get()/set() syntactic changes Call-by-reference / call-by-value Particularly suited to arrays of distributed types Freely mix Dint, Uint and shared intAdditional Optimizations: Additional Optimizations Manager keeps track of workers’ valid data sets Use routines’ read sets to select routine to give to worker Choose routine with minimal amount of missing data Example: 3 routines, 2 workers Data can be cached at workers between parallel steps Requires data to be declared unmodified at the beginning of parallel step Worker 1 Worker 2 valid: 1-100 valid: 150-200 read_set: 20- 80 90- 150 160- 210Discussion: Discussion Gradual incorporation of semantic knowledge into Charlotte programs Large flexibility to mix different levels of annotation semantics in one program Easy to switch back and forth among these levels Close to message passing behavior Charlotte’s nice properties are preserved (fault tolerance, adaptive parallelism) Is it efficient?Annotating DSM Code: Annotating DSM Code Munin Annotating data with expected access patterns, system chooses appropriate consistency protocols Aurora Object-based system in C++, different consistency models are dynamically selected at runtime CRL, Cid Library of C-functions, explicit mapping of shared memory in local memory Jade Annotations used to extract parallelism from sequential code All lack flexibility to mix different guarantee levelsExperiments - Setup: Experiments - Setup Setup: Multiplication of 200x200 integer matrices PentiumPro 200 at NYU with FastEthernet (100MBit/sec) Pentium 90 at Humboldt University Berlin Kaffe Virtual Machine Version 0.92 (Java JIT compiler) under Linux Sequential runtimes: 8.1 sec. (P90), 2.3 sec. (PPro200) Ping between NYU and HUB: typically 130 msec.Experiments : Experiments Multiply two 200x200 integer matrices Runtimes for Standard Charlotte (Dint) Standard Charlotte plus correctness-preserving annotations (Dint+A) Unchecked classes (Uint) Shared primitive types (int) A message passing implementation (mes.pas.) Shared primitive types competitive with message passing Reasonable absolute speedups (compared to sequential runtime)Experiments (contd.): Experiments (contd.) Ratios between various optimizations Annotations/Presending give a factor of about three Getting rid of objects is another factor of two Altogether: up to nine times faster than standard Charlotte And better scalabilityExperiments (contd.): Experiments (contd.) Manager at NYU, worker at HU Berlin Similar behavior, if slightly smaller improvementExperiments (contd.): Experiments (contd.) Colocation and Caching Example: multiply matrix A by two different matrices B1 and B2 Small impact in a LAN environment Up to 25% improvement for high-latency connections Colocation results in smaller standard deviation of runtimesFuture Work: Future Work Investigating more elaborate examples Data with both regular and irregular access patterns Compiler-generated annotations New programming interface no nested parallelism parBegin() / parEnd() makes some annotations awkward Overlapping communication/computation Multiple worker threads in one virtual machine Applying annotations to Calypso (a page-based DSM system) Down the road: use annotations for QoSSummary: Summary Annotations to describe read and write sets of parallel routines Correctness-preserving Correctness-sensitive Or directly sharing primitive types Possible to mix these different levels in a single program and to switch back and forth between them Big flexibility for programmer Maintains Charlotte’s advantages like adaptive parallelism and fault tolerance Performance competitive with message passing programs You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
hpnc98b Arundel0 Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 95 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: September 27, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Bridging the Gap Between Distributed Shared Memory and Message Passing: Bridging the Gap Between Distributed Shared Memory and Message Passing Holger Karl Department of Computer Science Courant Institute of Mathematical Sciences New York University Partially supported by DARPA/Rome laboratory and Deutsche Forschungsgemeinschaft Overview: Overview Motivation Related approaches Charlotte Annotating Charlotte Some experiments SummaryMotivation: Motivation Scenario Distributed computing using the World Wide Web Use Java to overcome heterogeneity, security concerns and program installation problems Idea: volunteer computing Applets can participate in distributed applications Challenges Only make use of standard Web browsers/Java virtual machines Consider the Web’s and Java’s idiosyncrasies Inherent faultiness, machines come and go Host-of-origin imposes a star-like communication topology Provide a simple programming interface, e.g., DSM No hardware support (e.g., page protection) available Do not forget efficiency Related Approaches: Related ApproachesCharlotte - Programs: Charlotte - Programs Alternating sequential and parallel steps Sequential steps executed by a manager application Parallel steps executed by worker applets Routines are defined in parallel steps Routines are methods of objects, derived from a Charlotte class Droutine public class Matrix extends Droutine { // define routine public void drun (int n, int Id){ // do computation } … public void run () { … parBegin(); addRoutine (this, Size); parEnd(); … } } Charlotte - Memory Semantics: Charlotte - Memory Semantics Memory partitioned into private and shared parts Shared memory with CRCW-common semantics Implemented at object level Charlotte provides classes for every primitive type: int Dint, float Dfloat, etc Manager has master copy of memory Shared objects have get() and set() methods get() brings invalid data from manager if necessary, amortized by bringing “pages” of shared data from the manager set() marks data as modified to be flushed to manager at end of parallel step Updates from different routines incorporated atomically at end of parallel step Allows eager scheduling for fault tolerance Routines can be executed multiple times without violating exactly once semantics Charlotte - Matrix Multiplication: Charlotte - Matrix Multiplication public class Matrix extends Droutine { // define routine public void drun (int n, int i){ for (int j=0; j<Size; j++){ sum = 0; for (int k=0; k<Size; k++) sum += A[i][k].get() * B[k][j].get(); C[i][j].set(sum); } … } Charlotte - Pros and Cons: Charlotte - Pros and Cons Pros Simple programming model Well-defined DSM semantics Adaptive parallelism Fault tolerant with respect to worker crashes Cons Efficiency problems Shared data is accessed via method invocations Loading data from manager incurs latency Choosing good page size is difficult Sending modified data back to manager requires inspection of status flags Reason: runtime system does not know which data is read or modified by routines Make this information explicit in Charlotte programs!Annotations: Annotations Specify read and write locations for routines Use methods of class Droutine dloc_read() dloc_write() First step: Correctness-preserving annotations Only send read data at beginning of routine public class Matrix extends Droutine { public void drun (int n, int i) {…} public Locations dloc_read (int n, int i) { Locations loc = new Locations(); loc.add (B); loc.add (A[i]); return loc; } … }Relying on Annotations: Relying on Annotations Second step: Rely on annotations Data transfer between manager and worker happens only according to annotations No if-statement in get() necessary No status update in set() necessary No latencies for requesting data updates But, wrong annotations lead to wrong results Use Uint (unchecked) instead of Dint, etc. Identical interface Remaining overhead: method invocation for data access Charlotte’s distributed types (Dint) and unchecked distributed types (Uint) can be freely mixed Simple to switch back and forthSharing Primitive Types: Sharing Primitive Types Third step: Use primitive types (int) instead of objects (Uint) Possible since get() and set() are trivial for Uint Possible since annotations describe data movement completely Avoids method invocation overhead But: different interface No get()/set() syntactic changes Call-by-reference / call-by-value Particularly suited to arrays of distributed types Freely mix Dint, Uint and shared intAdditional Optimizations: Additional Optimizations Manager keeps track of workers’ valid data sets Use routines’ read sets to select routine to give to worker Choose routine with minimal amount of missing data Example: 3 routines, 2 workers Data can be cached at workers between parallel steps Requires data to be declared unmodified at the beginning of parallel step Worker 1 Worker 2 valid: 1-100 valid: 150-200 read_set: 20- 80 90- 150 160- 210Discussion: Discussion Gradual incorporation of semantic knowledge into Charlotte programs Large flexibility to mix different levels of annotation semantics in one program Easy to switch back and forth among these levels Close to message passing behavior Charlotte’s nice properties are preserved (fault tolerance, adaptive parallelism) Is it efficient?Annotating DSM Code: Annotating DSM Code Munin Annotating data with expected access patterns, system chooses appropriate consistency protocols Aurora Object-based system in C++, different consistency models are dynamically selected at runtime CRL, Cid Library of C-functions, explicit mapping of shared memory in local memory Jade Annotations used to extract parallelism from sequential code All lack flexibility to mix different guarantee levelsExperiments - Setup: Experiments - Setup Setup: Multiplication of 200x200 integer matrices PentiumPro 200 at NYU with FastEthernet (100MBit/sec) Pentium 90 at Humboldt University Berlin Kaffe Virtual Machine Version 0.92 (Java JIT compiler) under Linux Sequential runtimes: 8.1 sec. (P90), 2.3 sec. (PPro200) Ping between NYU and HUB: typically 130 msec.Experiments : Experiments Multiply two 200x200 integer matrices Runtimes for Standard Charlotte (Dint) Standard Charlotte plus correctness-preserving annotations (Dint+A) Unchecked classes (Uint) Shared primitive types (int) A message passing implementation (mes.pas.) Shared primitive types competitive with message passing Reasonable absolute speedups (compared to sequential runtime)Experiments (contd.): Experiments (contd.) Ratios between various optimizations Annotations/Presending give a factor of about three Getting rid of objects is another factor of two Altogether: up to nine times faster than standard Charlotte And better scalabilityExperiments (contd.): Experiments (contd.) Manager at NYU, worker at HU Berlin Similar behavior, if slightly smaller improvementExperiments (contd.): Experiments (contd.) Colocation and Caching Example: multiply matrix A by two different matrices B1 and B2 Small impact in a LAN environment Up to 25% improvement for high-latency connections Colocation results in smaller standard deviation of runtimesFuture Work: Future Work Investigating more elaborate examples Data with both regular and irregular access patterns Compiler-generated annotations New programming interface no nested parallelism parBegin() / parEnd() makes some annotations awkward Overlapping communication/computation Multiple worker threads in one virtual machine Applying annotations to Calypso (a page-based DSM system) Down the road: use annotations for QoSSummary: Summary Annotations to describe read and write sets of parallel routines Correctness-preserving Correctness-sensitive Or directly sharing primitive types Possible to mix these different levels in a single program and to switch back and forth between them Big flexibility for programmer Maintains Charlotte’s advantages like adaptive parallelism and fault tolerance Performance competitive with message passing programs