FTMA latest

Presentation Transcript

Slide1: 

Fault-Tolerance Issues for Communicating Mobile Agents
Keith Marzullo
University of California, San Diego
Department of Computer Science and Engineering
… and the TACOMA group
6 October 1999

Fault-Tolerance: 

Fault-tolerance can mean different things:
- Ensuring that a failure will not be visible to the application (masking).
- Detecting when a failure has occurred (detection and recovery).
- Ensuring that a failure will not cause an inconsistent application state to arise (atomic transactions).

Roadmap: 

- Review some ideas from fault-tolerance.
- Present some issues associated with masking and active replication in mobile agent computations.
- Present a protocol for detection and recovery in mobile agent computations based on primary-backup.
- Discuss some issues associated with transactional support.
- Mention a programming issue with detection and recovery.

Masking: 

- Uses sufficient replication and voting so that (independent) failures of components do not result in an incorrect state.
- It can be supplied as a wrapper that hides the replication of the service from the clients.
- Different approaches are appropriate for different failure models, performance requirements, and underlying communication systems.
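A minimal sketch of the wrapper idea, assuming deterministic replicas that can be called as plain functions and f arbitrary failures to mask. The class name `MaskingWrapper` and the toy replicas are hypothetical, not part of the talk.

```python
from collections import Counter


class MaskingWrapper:
    """Hides a set of replicas behind a single call (hypothetical sketch)."""

    def __init__(self, replicas, f):
        # Masking f arbitrary failures by voting needs at least 2f + 1 replicas.
        if len(replicas) < 2 * f + 1:
            raise ValueError("need at least 2f + 1 replicas to mask f failures")
        self.replicas = replicas
        self.f = f

    def call(self, request):
        # Ask every replica, then vote on the answers.
        answers = [replica(request) for replica in self.replicas]
        value, votes = Counter(answers).most_common(1)[0]
        # f + 1 matching answers means at least one came from a correct replica.
        if votes < self.f + 1:
            raise RuntimeError("no answer reached f + 1 votes")
        return value


def correct(x):
    # Stand-in for a correct replica.
    return x * 2


def faulty(x):
    # Stand-in for a replica that returns a wrong answer.
    return -1


service = MaskingWrapper([correct, faulty, correct], f=1)
result = service.call(21)
```

The client only sees `service.call`; the single wrong answer is outvoted and never becomes visible.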

State Machine Approach: 

State Machine Approach

Primary-Backup Approach: 

Primary-Backup Approach

Detection and Recovery: 

- Detection requires less replication than masking:
  - 1 vs. f+1 replicas for detecting vs. masking f fail-stop crashes
  - f+1 vs. 2f+1 replicas for detecting vs. masking f arbitrary failures
- Recovery can be rollback, roll forward, or a more specific approach.
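The replica counts above can be written as a small lookup. This is only a restatement of the numbers on the slide; the function name and the string labels are made up for illustration.

```python
def replicas_needed(f, failure_model, goal):
    """Replica count from the slide for tolerating f failures.

    failure_model: "fail-stop" or "arbitrary"
    goal:          "detect" or "mask"
    """
    table = {
        ("fail-stop", "detect"): lambda f: 1,
        ("fail-stop", "mask"): lambda f: f + 1,
        ("arbitrary", "detect"): lambda f: f + 1,
        ("arbitrary", "mask"): lambda f: 2 * f + 1,
    }
    return table[(failure_model, goal)](f)


# Counts for f = 2 under each combination.
counts = {
    (model, goal): replicas_needed(2, model, goal)
    for model in ("fail-stop", "arbitrary")
    for goal in ("detect", "mask")
}
```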

Replicated Agents with Voting: 

(Diagram; labels: S, D)

Replicated Agents with Voting (2): 

(Diagram; labels: S, D)

Replicated Agents with Voting (3): 

(Diagram; labels: S, D)

Replicated Agents with Voting (4): 

Replicated Agents with Voting (4)

Replicated Agents with Voting (5): 

- Implements an architecture that can tolerate maliciously faulty landing pads.
- Rather complex and expensive.
- Perhaps best solved by the landing pad.

Primary-Backup by Application: 

- Places can crash, causing local agents to become lost.
- Agent code can be faulty, causing an agent to repeatedly fail.
- Communications can break, causing an agent’s plan to be unattainable.

Norwegian Army Protocol: 

- The protocol uses the places an agent has visited as a set of potential places for recovery code to execute.
- The linear structure of a trajectory defines a monitoring strategy.

(Diagram labels: version 1 (oldest), version 2, version 3 (youngest), version 4, current agent, rear guards)
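A rough sketch of how a linear trajectory could drive recovery. The slide does not say which rear guard takes over; this sketch assumes the youngest surviving one does, in the spirit of primary-backup. The function and place names are hypothetical.

```python
def choose_recovery_place(trajectory, crashed):
    """Pick where recovery code runs after the current place fails.

    trajectory: places visited, oldest first; the last entry hosts the
                current agent and the earlier entries act as rear guards.
    crashed:    set of places known to have crashed.
    Assumption (not from the slide): the youngest surviving rear guard
    takes over.
    """
    rear_guards = trajectory[:-1]
    for place in reversed(rear_guards):  # youngest rear guard first
        if place not in crashed:
            return place
    return None  # no rear guard survived


trajectory = ["place1", "place2", "place3", "place4"]
recovery = choose_recovery_place(trajectory, crashed={"place4", "place3"})
```

Here the current place and the youngest rear guard have both crashed, so recovery falls back to `place2`.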

Application Interaction: 

- An agent executes a fault-tolerant action at a place.
- An action completes with a move or exit.
- Regular actions have an attribute failure.
- Failure actions have attributes failedCode and failedAt.
- If a regular action r fails, then there is exactly one completed failure action f such that:
  - f.code = r.failure
  - f.failedCode = r.code
  - f.failedAt = r.place
  - f.bc = r.bc
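The four equations relating r and f can be read as a constructor for the failure action. The sketch below keeps the attribute names from the slide; the dataclasses, the `place` field on the failure action, and the helper name are assumptions for illustration, and `bc` is treated as opaque carried state.

```python
from dataclasses import dataclass


@dataclass
class RegularAction:
    code: str     # code the agent runs at this place
    place: str    # place where the action executes
    failure: str  # code to run if this action fails
    bc: dict      # carried state, named "bc" as on the slide


@dataclass
class FailureAction:
    code: str
    place: str       # where the failure action runs (assumed field)
    failedCode: str
    failedAt: str
    bc: dict


def failure_action_for(r, recovery_place):
    """Build the failure action f for a failed regular action r."""
    return FailureAction(
        code=r.failure,     # f.code = r.failure
        place=recovery_place,
        failedCode=r.code,  # f.failedCode = r.code
        failedAt=r.place,   # f.failedAt = r.place
        bc=r.bc,            # f.bc = r.bc
    )


r = RegularAction(code="install", place="hostB", failure="undo_install", bc={"v": 3})
f = failure_action_for(r, recovery_place="hostA")
```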

Fail-Stop Reliable Broadcast: 

Fail-Stop Reliable Broadcast

Failure-Free Execution: 

Failure-Free Execution

Failure Execution: 

Failure Execution

NAP Details...: 

- spawn and checkpoint operations also terminate a fault-tolerant action.
- Additional complexity arises from a mobile computation visiting the same place multiple times.
- Support for NAP can be carried along with the mobile agent.
  - scalability with respect to administrative domains

Transactions: 

- Atomicity is based on an atomic commit protocol and stable storage associated with each landing pad.
- Appears to be simple.
- Additional power comes from code mobility.
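The slide does not say which atomic commit protocol is meant. As one illustration, here is a minimal in-memory sketch of two-phase commit, a common choice, with a Python list standing in for each landing pad's stable storage. The class and function names are hypothetical, and timeouts and crash recovery are left out.

```python
class LandingPad:
    """Hypothetical participant; stable_log stands in for stable storage."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.stable_log = []
        self.state = {}

    def prepare(self, txid, updates):
        # Vote no without logging; the coordinator will send abort.
        if not self.can_commit:
            return False
        self.stable_log.append((txid, "prepared", dict(updates)))
        return True

    def commit(self, txid, updates):
        self.state.update(updates)
        self.stable_log.append((txid, "commit"))

    def abort(self, txid):
        self.stable_log.append((txid, "abort"))


def two_phase_commit(txid, work):
    """work maps each LandingPad to the updates it should apply."""
    # Phase 1: collect a vote from every pad.
    votes = [pad.prepare(txid, updates) for pad, updates in work.items()]
    # Phase 2: commit everywhere only if every pad voted yes.
    if all(votes):
        for pad, updates in work.items():
            pad.commit(txid, updates)
        return True
    for pad in work:
        pad.abort(txid)
    return False


a = LandingPad("a")
b = LandingPad("b", can_commit=False)
c = LandingPad("c")
aborted = two_phase_commit("t1", {a: {"x": 1}, b: {"y": 2}})
committed = two_phase_commit("t2", {a: {"x": 1}, c: {"y": 2}})
```

The first transaction leaves both pads unchanged because `b` votes no; the second applies its updates at both `a` and `c`.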

Transactions and Code Mobility: 

(Diagram labels: account; store 3, store 2, store 1; lock $100, lock $200, lock $160; buy X; lock $100, lock $300, lock $460)

Transactions and Code Mobility (2): 

(Diagram labels: account; store 3, store 2, store 1; lock $100, lock $200, lock $160; buy X; lock $100, lock $200, lock $200)
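The diagrams themselves are lost, so the following is only one possible reading of their labels. The second lock sequence on the first slide ($100, $300, $460) matches a running total of the three store amounts, and the one on the second slide ($100, $200, $200) matches a running maximum. That would fit an agent that buys X at only one store and therefore needs to hold only the largest price seen so far. The variable names are made up.

```python
from itertools import accumulate

# Amounts locked at the three stores, in the order the labels appear.
prices = [100, 200, 160]

# Reading of the first slide: the account lock grows by each store's amount.
locks_running_total = list(accumulate(prices))

# Reading of the second slide: the account lock only tracks the largest amount.
locks_running_max = list(accumulate(prices, max))
```

Under this reading, `locks_running_total` is `[100, 300, 460]` and `locks_running_max` is `[100, 200, 200]`, which are the two sequences shown on the slides.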

Programming for Fault-Tolerance: 

The kinds of problems we have been considering (so far) for NAP have to do with software installation and system maintenance:
- Synchronized installation of a new version of a package.
- Software license checking and upgrade.
- Specialized tool installation for distributed monitoring and testing.
All are built around some variation of agreement or reliable broadcast.

Programming for Fault-Tolerance (2): 

- A plus seems to be the separation of mobility from function.
  - Trajectory, synchronization, security and authentication are handled by mobility.
- But writing fault-tolerant actions to implement the particular version of agreement/reliable broadcast is awkward.
  - … this seems to be a good place to use a higher-level programming language, e.g., Sage.

Observations: 

- It’s hard to do fault-tolerance without knowing the failure model!
- Detection and recovery is more appropriate for mobile agent computations than masking.
- Need work by the fault-tolerance community into detection and recovery for arbitrary failures.
- System management and maintenance seems to be a very rich field for problems involving fault-tolerant mobile agent computations.

Bibliography: 

- F. B. Schneider. Towards fault-tolerant and secure agentry. In 11th International Workshop, WDAG '97, Saarbrücken, Germany, 24-26 Sept. 1997, pp. 1-14.
- Dag Johansen et al. NAP: practical fault-tolerance for itinerant computations. In Proceedings, 19th IEEE International Conference on Distributed Computing Systems, Austin, TX, USA, 31 May-4 June 1999, pp. 180-189.
- M. Strasser and K. Rothermel. Reliability concepts for mobile agents. International Journal of Cooperative Information Systems, Dec. 1998, 7(4):355-382.
- A. Ricciardi. The Sage Project: Software Engineering for Distributed Applications. The University of Texas Department of Electrical and Computer Engineering TR-1996-007, available at http://www.bell-labs.com/user/aleta/TR-PDS-1996-007.ps.gz.