PrelimII
Added: October 07, 2007

Presentation Transcript

A non-blocking coordinated checkpoint protocol:
PhD Candidate: Yijian Yang
Committee: Dr. Yuan Shi (Chairman), Dr. Peter Hu, Dr. Pei Wang, Dr. Henry Sendaula

Outline:
- Overview
- Current solutions and their problems
- SPP solution
- Performance comparison
- Conclusion and future work

Overview:
What is fault tolerance?
- A property that enables an application to continue operating in the event of multiple component failures in the multiprocessor system.
Why do we need fault tolerance?
- More processors mean more points of failure.
- Low-cost, custom-assembled clusters imply a higher failure rate than custom multiprocessors.
- Application requirements far exceed the mean time between failures (MTBF).

Fault Tolerance Categories:
Fault tolerance can be achieved in the following ways:
- Replicated system
- 2PC transaction
- Group communication
- Rollback recovery
  - Checkpoint based
  - Log based

Computation Models:
Current systems: MPI (based on message passing)
- Master and worker are contained in the same (stateful) program, which is distributed to all nodes.
- Application fault tolerance must involve all nodes.
Proposed system: SPP (based on dataflow)
- Master (stateful) and worker (stateless) are separate programs.
- Only workers are automatically distributed to all nodes.
- Worker fault tolerance is done via low-cost shadow tuples.
- Application fault tolerance only needs to protect the (stateful) master(s).

Proposed solution for SPP:
A system-level, checkpoint-based, non-blocking coordinated protocol for multi-master protection.
Checkpoint based
- No need to detect, log, or replay events
- Doesn't rely on the PWD (piecewise deterministic) assumption
- Low overhead and fast recovery
Coordinated
- Simplifies recovery
- Not susceptible to the domino effect
- One permanent checkpoint, so no need for garbage collection
Non-blocking
- Low overhead
System level
- Transparent to the programmer and automatic

Current solution 1 - Blocking:

Problem with blocking protocol:
- Assumes that the network is FIFO
- Flushing the network before sending the CP (checkpoint) request takes time
- Processes are blocked while they coordinate
- When one process fails, all processes have to roll back

Current solution 2 – Non-blocking (Chandy and Lamport algorithm):

Problem with current non-blocking protocol:
- Message replay is done with the help of the message sender, which requires all processes to roll back to their previous checkpoint when one process fails.

SPP Solution:

SPP Solution – Synergy Implementation:

SPP Solution – Synergy Implementation (cont):

Improvement:
- Non-blocking.
- SPP enables single-process rollback.

Performance Study:
To examine the performance effects of blocking vs. non-blocking, we make the following assumptions:
- The time used for taking local checkpoints is constant.
- The workload of each processor remains the same regardless of cluster size.
- Both the blocking and non-blocking protocols are implemented using the same underlying library.
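
The marker-based idea behind the Chandy and Lamport algorithm named under Current solution 2 can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the Synergy/SPP implementation from the slides: the `Process` class, the `drain` and `run_demo` helpers, the three-node network and the "account balance" state are all hypothetical, and channels are assumed to be reliable and FIFO (which the algorithm requires). It shows that processes keep handling application messages while the snapshot is taken, and that messages in transit are captured as channel state.

```python
from collections import deque

MARKER = object()  # sentinel standing in for the snapshot marker message


class Process:
    """Hypothetical node with one FIFO inbox per sending peer."""

    def __init__(self, pid, balance, peers):
        self.pid = pid
        self.balance = balance                    # application state
        self.inbox = {p: deque() for p in peers}  # FIFO channel from each peer
        self.recorded_state = None                # local checkpoint, once taken
        self.channel_log = {}                     # in-transit messages per sender
        self.recording = set()                    # channels still being recorded

    def send(self, dest, amount, network):
        self.balance -= amount
        network[dest].inbox[self.pid].append(amount)

    def start_snapshot(self, network):
        # Record local state, then send a marker on every outgoing channel.
        self.recorded_state = self.balance
        self.channel_log = {p: [] for p in self.inbox}
        self.recording = set(self.inbox)
        for p in self.inbox:
            network[p].inbox[self.pid].append(MARKER)

    def receive(self, sender, msg, network):
        if msg is MARKER:
            if self.recorded_state is None:
                self.start_snapshot(network)      # first marker: checkpoint now
            self.recording.discard(sender)        # this channel's state is final
        else:
            self.balance += msg                   # normal work is never paused
            if sender in self.recording:
                self.channel_log[sender].append(msg)  # message was in transit


def drain(network):
    """Deliver queued messages in FIFO order until every channel is empty."""
    progress = True
    while progress:
        progress = False
        for proc in network.values():
            for sender, chan in proc.inbox.items():
                if chan:
                    proc.receive(sender, chan.popleft(), network)
                    progress = True


def run_demo():
    ids = [0, 1, 2]
    network = {i: Process(i, 100, [p for p in ids if p != i]) for i in ids}
    network[0].send(1, 10, network)     # sent before the snapshot starts
    network[0].start_snapshot(network)  # process 0 initiates the snapshot
    network[1].send(0, 20, network)     # sent before process 1 sees a marker
    drain(network)
    recorded = sum(p.recorded_state for p in network.values())
    in_transit = sum(sum(log) for p in network.values()
                     for log in p.channel_log.values())
    live_total = sum(p.balance for p in network.values())
    return recorded + in_transit, live_total, network[0].channel_log[1]


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the recorded process states plus the recorded channel contents add up to the same total as the live system, which is the consistency property a coordinated checkpoint needs. The transfer of 20 that crosses the snapshot line is captured in the channel log of process 0 rather than being lost.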
Experiment Environment:
Yoda cluster
- Sun Blade 100 workstations
- 550 MHz
- 512 MB SDRAM
- 100 Mbps Ethernet
Application
- Matrix multiplication
- G = 250 (near optimal)
- Each node gets 2 chunks (500) during the computation.

Near optimal G calculation:

Result 1:

Result 2:

Conclusions:
- The proposed non-blocking protocol delivered much lower overhead than the blocking protocol.
- Better performance is expected during recovery (not tested yet).
- The overall fault tolerance overhead can be significantly lower than in MPI systems.

Future Work:
- Complete the implementation of SPP system-level checkpoint and recovery.
- Performance comparisons with MPI fault tolerance systems.
- Formal discussions.

Bibliography:
- Shi, Y. 2004. Stateless Parallel Processing.
- Chandy, K. 1985. Distributed snapshots: determining global states of distributed systems.
- Elnozahy, E. 2002. A survey of rollback-recovery protocols in message passing systems.
- Camille, C. 2006. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI.

Slide 23:
Thanks!