A non-blocking coordinated checkpoint protocol: 

PhD Candidate: Yijian Yang
Committee: Dr. Yuan Shi (Chairman), Dr. Peter Hu, Dr. Pei Wang, Dr. Henry Sendaula


Outline:

- Overview
- Current solutions and their problems
- SPP solution
- Performance comparison
- Conclusion and future work


Overview:

What is fault tolerance?
- A property that enables an application to continue operating in the event of multiple component failures in the multiprocessor system.

Why do we need fault tolerance?
- More processors mean more points of failure.
- Low-cost, custom-assembled clusters have a higher failure rate than custom multiprocessors.
- Application running times far exceed the mean time between failures (MTBF).

Fault Tolerance Categories: 

Fault tolerance can be achieved in the following ways:
- Replicated system
- 2PC transaction
- Group communication
- Rollback recovery
  - Checkpoint based
  - Log based

Computation Models: 

Current systems: MPI (based on message passing)
- Master and worker are contained in the same (stateful) program, which is distributed to all nodes.
- Application fault tolerance must involve all nodes.

Proposed system: SPP (based on dataflow)
- Master (stateful) and worker (stateless) are separate programs.
- Only workers are automatically distributed to all nodes.
- Worker fault tolerance is done via low-cost shadow tuples.
- Application fault tolerance only needs to protect the (stateful) master(s).
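The shadow-tuple idea can be illustrated with a toy tuple space. This is a minimal sketch, not the SPP or Synergy implementation: the class `TupleSpace`, its method names, and the tuple ids are all hypothetical, and it assumes a shadow copy is kept while a worker holds a work tuple and is restored if that worker fails.

```python
class TupleSpace:
    """Toy work-tuple space with shadow tuples (illustrative assumption only)."""

    def __init__(self):
        self.visible = {}   # tuple id -> payload, available to any worker
        self.shadow = {}    # tuple id -> (payload, worker) while work is in progress

    def put(self, tid, payload):
        self.visible[tid] = payload

    def take(self, worker):
        """Hand one work tuple to a worker and keep a shadow copy of it."""
        if not self.visible:
            return None
        tid = next(iter(self.visible))
        payload = self.visible.pop(tid)
        self.shadow[tid] = (payload, worker)
        return tid, payload

    def complete(self, tid):
        """The result arrived, so the shadow copy can be dropped."""
        self.shadow.pop(tid, None)

    def worker_failed(self, worker):
        """Make every tuple held by the failed worker visible again."""
        lost = [tid for tid, (_, w) in self.shadow.items() if w == worker]
        for tid in lost:
            payload, _ = self.shadow.pop(tid)
            self.visible[tid] = payload
        return lost


space = TupleSpace()
space.put("rows 0-249", [0, 249])
space.put("rows 250-499", [250, 499])
first = space.take("worker-A")
second = space.take("worker-B")
space.complete(second[0])                    # worker-B finishes normally
recovered = space.worker_failed("worker-A")  # worker-A dies mid-task
```

Because the worker holds no state of its own, recovery here is just re-exposing the tuple; no worker checkpoint is involved, which is why only the master needs protection.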

Proposed solution for SPP: 

A system-level, checkpoint-based, non-blocking coordinated protocol for multi-master protection.
- Checkpoint based
  - No need to detect, log or replay events
  - Does not rely on the piecewise deterministic (PWD) assumption
  - Low overhead and fast recovery
- Coordinated
  - Simplifies recovery
  - Not susceptible to the domino effect
  - One permanent checkpoint, so no garbage collection is needed
- Non-blocking
  - Low overhead
- System level
  - Transparent to the programmer and automatic
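The "one permanent checkpoint" property can be sketched as a tentative file that atomically replaces the previous permanent one. This is an illustration under assumptions, not the SPP mechanism: the function names and the file name `master.ckpt` are hypothetical, and a plain dictionary stands in for the master state that a system-level checkpoint would capture.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write a tentative checkpoint, then atomically make it the permanent one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tentative = tempfile.mkstemp(dir=directory, suffix=".tentative")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())        # make sure the bytes reach the disk
    os.replace(tentative, path)     # the old permanent checkpoint is overwritten


def load_checkpoint(path):
    """Read back the single permanent checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "master.ckpt")
save_checkpoint({"step": 1}, ckpt)
save_checkpoint({"step": 2}, ckpt)
restored = load_checkpoint(ckpt)
files = os.listdir(workdir)
```

Only one checkpoint file exists after any number of saves, so there is nothing to garbage-collect.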

Current solution 1 - Blocking: 

Problem with blocking protocol: 

- Assumes that the network is FIFO.
- Flushing the network before sending the checkpoint (CP) request takes time.
- Processes are blocked during coordination.
- When a failure occurs, all processes have to roll back.
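The blocking behaviour listed above can be sketched as a generic two-phase coordinated checkpoint. This is a textbook-style illustration, not the protocol shown on the original slide: the `Process` class and `blocking_checkpoint` function are hypothetical, a counter stands in for application state, and the network flush is not modelled.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0          # application state (a simple counter here)
        self.blocked = False
        self.tentative = None
        self.permanent = None

    def work(self):
        """Advance the computation unless the protocol has blocked this process."""
        if not self.blocked:
            self.state += 1

    def on_request(self):
        """Phase 1: stop computing and save a tentative checkpoint."""
        self.blocked = True
        self.tentative = self.state
        return True             # acknowledgement sent back to the coordinator

    def on_commit(self):
        """Phase 2: make the checkpoint permanent and resume computing."""
        self.permanent = self.tentative
        self.tentative = None
        self.blocked = False

    def rollback(self):
        self.state = self.permanent


def blocking_checkpoint(processes):
    acks = [p.on_request() for p in processes]
    for p in processes:
        p.work()                # no progress: every process is blocked
    if all(acks):
        for p in processes:
            p.on_commit()


procs = [Process(f"p{i}") for i in range(3)]
for i, p in enumerate(procs):
    for _ in range(i + 1):
        p.work()                # states become 1, 2, 3
blocking_checkpoint(procs)
for p in procs:
    p.work()                    # states become 2, 3, 4
for p in procs:                 # a single failure sends every process back
    p.rollback()
states = [p.state for p in procs]
```

The work attempted between the two phases is lost time for every process, and the final loop shows that one failure undoes progress everywhere.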

Current solution 2 – Non-blocking (Chandy and Lamport algorithm): 

Problem with current non-blocking protocol: 

Message replay relies on the message sender, which requires all processes to roll back to their previous checkpoint when one process fails.
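For reference, the marker rule of the Chandy and Lamport snapshot algorithm can be simulated in a few lines. This is a sketch of the textbook algorithm over FIFO channels, not the diagram from the original slide: the `Network` and `Node` classes, the money-transfer workload, and the round-robin delivery order are all illustrative assumptions.

```python
from collections import deque

MARKER = "MARKER"


class Node:
    def __init__(self, name, balance, peers):
        self.name = name
        self.balance = balance      # application state
        self.peers = peers          # names of all other nodes
        self.snapshot = None        # recorded local state
        self.in_transit = {}        # sender -> messages recorded on that channel
        self.recording = set()      # incoming channels still being recorded


class Network:
    def __init__(self, balances):
        names = list(balances)
        self.nodes = {n: Node(n, balances[n], [p for p in names if p != n])
                      for n in names}
        # One FIFO channel per ordered pair of nodes.
        self.channels = {(a, b): deque() for a in names for b in names if a != b}

    def send(self, src, dst, amount):
        self.nodes[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def record(self, name):
        """Record local state, then send a marker on every outgoing channel."""
        node = self.nodes[name]
        node.snapshot = node.balance
        node.in_transit = {p: [] for p in node.peers}
        node.recording = set(node.peers)
        for p in node.peers:
            self.channels[(name, p)].append(MARKER)

    def deliver(self, src, dst):
        msg = self.channels[(src, dst)].popleft()
        node = self.nodes[dst]
        if msg == MARKER:
            if node.snapshot is None:
                self.record(dst)         # first marker: take the local checkpoint
            node.recording.discard(src)  # stop recording the channel src -> dst
        else:
            node.balance += msg          # normal processing never stops
            if src in node.recording:
                node.in_transit[src].append(msg)

    def run(self):
        """Deliver messages round-robin until every channel is empty."""
        while any(self.channels.values()):
            for (src, dst), queue in self.channels.items():
                if queue:
                    self.deliver(src, dst)

    def snapshot_total(self):
        total = 0
        for n in self.nodes.values():
            total += n.snapshot + sum(sum(m) for m in n.in_transit.values())
        return total


net = Network({"A": 100, "B": 100, "C": 100})
net.send("A", "B", 10)   # sent before the snapshot starts
net.record("A")          # A initiates the snapshot
net.send("B", "A", 20)   # B keeps working; it has not seen a marker yet
net.run()
print(net.snapshot_total())
```

No node ever stops processing application messages. The transfer from B to A is captured as channel state rather than in either local checkpoint, so the recorded states plus in-transit messages still add up to the initial total.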

SPP Solution: 

SPP Solution – Synergy Implementation: 

SPP Solution – Synergy Implementation (cont): 


Improvement:

- Non-blocking.
- SPP enables single-process rollback.

Performance Study: 

To examine the performance effects of blocking vs. non-blocking checkpointing, we make the following assumptions:
- The time taken for a local checkpoint is constant.
- The workload of each processor stays the same across cluster sizes.
- Both the blocking and non-blocking protocols are implemented using the same underlying library.

Experiment Environment: 

Yoda cluster
- Sun Blade 100 workstations
- 550 MHz
- 512 MB SDRAM
- 100 Mbps Ethernet

Application
- Matrix multiplication
- G = 250 (near optimal)
- Each node gets 2 chunks (500) during the computation.

Near optimal G calculation: 

Result 1: 

Result 2: 


Conclusions:

- The proposed non-blocking protocol delivered much lower overhead than the blocking protocol.
- Better performance is expected during recovery (not yet tested).
- The overall fault tolerance overhead can be significantly lower than in MPI systems.

Future Work: 

- Complete the implementation of SPP system-level checkpoint and recovery.
- Compare performance with MPI fault tolerance systems.
- Formal discussions.


Bibliography:

Shi, Y. 2004. Stateless Parallel Processing.
Chandy, K. 1985. Distributed snapshots: determining global states of distributed systems.
Elnozahy, E. 2002. A survey of rollback-recovery protocols in message passing systems.
Coti, C. 2006. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI.


