hpnc98b

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Bridging the Gap Between Distributed Shared Memory and Message Passing: 

Bridging the Gap Between Distributed Shared Memory and Message Passing Holger Karl Department of Computer Science Courant Institute of Mathematical Sciences New York University Partially supported by DARPA/Rome laboratory and Deutsche Forschungsgemeinschaft

Overview: 

Overview Motivation Related approaches Charlotte Annotating Charlotte Some experiments Summary

Motivation: 

Motivation Scenario Distributed computing using the World Wide Web Use Java to overcome heterogeneity, security concerns and program installation problems Idea: volunteer computing Applets can participate in distributed applications Challenges Only make use of standard Web browsers/Java virtual machines Consider the Web’s and Java’s idiosyncrasies Inherent faultiness, machines come and go Host-of-origin imposes a star-like communication topology Provide a simple programming interface, e.g., DSM No hardware support (e.g., page protection) available Do not forget efficiency

Related Approaches: 

Related Approaches

Charlotte - Programs: 

Charlotte - Programs Alternating sequential and parallel steps Sequential steps executed by a manager application Parallel steps executed by worker applets Routines are defined in parallel steps Routines are methods of objects, derived from a Charlotte class Droutine public class Matrix extends Droutine { // define routine public void drun (int n, int Id){ // do computation } … public void run () { … parBegin(); addRoutine (this, Size); parEnd(); … } }

Charlotte - Memory Semantics: 

Charlotte - Memory Semantics Memory partitioned into private and shared parts Shared memory with CRCW-common semantics Implemented at object level Charlotte provides classes for every primitive type: int  Dint, float  Dfloat, etc Manager has master copy of memory Shared objects have get() and set() methods get() brings invalid data from manager if necessary, amortized by bringing “pages” of shared data from the manager set() marks data as modified to be flushed to manager at end of parallel step Updates from different routines incorporated atomically at end of parallel step Allows eager scheduling for fault tolerance Routines can be executed multiple times without violating exactly once semantics

Charlotte - Matrix Multiplication: 

Charlotte - Matrix Multiplication public class Matrix extends Droutine { // define routine public void drun (int n, int i){ for (int j=0; j<Size; j++){ sum = 0; for (int k=0; k<Size; k++) sum += A[i][k].get() * B[k][j].get(); C[i][j].set(sum); } … }

Charlotte - Pros and Cons: 

Charlotte - Pros and Cons Pros Simple programming model Well-defined DSM semantics Adaptive parallelism Fault tolerant with respect to worker crashes Cons Efficiency problems Shared data is accessed via method invocations Loading data from manager incurs latency Choosing good page size is difficult Sending modified data back to manager requires inspection of status flags Reason: runtime system does not know which data is read or modified by routines Make this information explicit in Charlotte programs!

Annotations: 

Annotations Specify read and write locations for routines Use methods of class Droutine dloc_read() dloc_write() First step: Correctness-preserving annotations Only send read data at beginning of routine public class Matrix extends Droutine { public void drun (int n, int i) {…} public Locations dloc_read (int n, int i) { Locations loc = new Locations(); loc.add (B); loc.add (A[i]); return loc; } … }

Relying on Annotations: 

Relying on Annotations Second step: Rely on annotations Data transfer between manager and worker happens only according to annotations No if-statement in get() necessary No status update in set() necessary No latencies for requesting data updates But, wrong annotations lead to wrong results Use Uint (unchecked) instead of Dint, etc. Identical interface Remaining overhead: method invocation for data access Charlotte’s distributed types (Dint) and unchecked distributed types (Uint) can be freely mixed Simple to switch back and forth

Sharing Primitive Types: 

Sharing Primitive Types Third step: Use primitive types (int) instead of objects (Uint) Possible since get() and set() are trivial for Uint Possible since annotations describe data movement completely Avoids method invocation overhead But: different interface No get()/set()  syntactic changes Call-by-reference / call-by-value Particularly suited to arrays of distributed types Freely mix Dint, Uint and shared int

Additional Optimizations: 

Additional Optimizations Manager keeps track of workers’ valid data sets Use routines’ read sets to select routine to give to worker Choose routine with minimal amount of missing data Example: 3 routines, 2 workers Data can be cached at workers between parallel steps Requires data to be declared unmodified at the beginning of parallel step Worker 1 Worker 2 valid: 1-100 valid: 150-200 read_set: 20- 80 90- 150 160- 210

Discussion: 

Discussion Gradual incorporation of semantic knowledge into Charlotte programs Large flexibility to mix different levels of annotation semantics in one program Easy to switch back and forth among these levels Close to message passing behavior Charlotte’s nice properties are preserved (fault tolerance, adaptive parallelism) Is it efficient?

Annotating DSM Code: 

Annotating DSM Code Munin Annotating data with expected access patterns, system chooses appropriate consistency protocols Aurora Object-based system in C++, different consistency models are dynamically selected at runtime CRL, Cid Library of C-functions, explicit mapping of shared memory in local memory Jade Annotations used to extract parallelism from sequential code All lack flexibility to mix different guarantee levels

Experiments - Setup: 

Experiments - Setup Setup: Multiplication of 200x200 integer matrices PentiumPro 200 at NYU with FastEthernet (100MBit/sec) Pentium 90 at Humboldt University Berlin Kaffe Virtual Machine Version 0.92 (Java JIT compiler) under Linux Sequential runtimes: 8.1 sec. (P90), 2.3 sec. (PPro200) Ping between NYU and HUB: typically 130 msec.

Experiments : 

Experiments Multiply two 200x200 integer matrices Runtimes for Standard Charlotte (Dint) Standard Charlotte plus correctness-preserving annotations (Dint+A) Unchecked classes (Uint) Shared primitive types (int) A message passing implementation (mes.pas.) Shared primitive types competitive with message passing Reasonable absolute speedups (compared to sequential runtime)

Experiments (contd.): 

Experiments (contd.) Ratios between various optimizations Annotations/Presending give a factor of about three Getting rid of objects is another factor of two Altogether: up to nine times faster than standard Charlotte And better scalability

Experiments (contd.): 

Experiments (contd.) Manager at NYU, worker at HU Berlin Similar behavior, if slightly smaller improvement

Experiments (contd.): 

Experiments (contd.) Colocation and Caching Example: multiply matrix A by two different matrices B1 and B2 Small impact in a LAN environment Up to 25% improvement for high-latency connections Colocation results in smaller standard deviation of runtimes

Future Work: 

Future Work Investigating more elaborate examples Data with both regular and irregular access patterns Compiler-generated annotations New programming interface no nested parallelism parBegin() / parEnd() makes some annotations awkward Overlapping communication/computation Multiple worker threads in one virtual machine Applying annotations to Calypso (a page-based DSM system) Down the road: use annotations for QoS

Summary: 

Summary Annotations to describe read and write sets of parallel routines Correctness-preserving Correctness-sensitive Or directly sharing primitive types Possible to mix these different levels in a single program and to switch back and forth between them Big flexibility for programmer Maintains Charlotte’s advantages like adaptive parallelism and fault tolerance Performance competitive with message passing programs