Fault Tolerance
Required Readings
Summary
- Taxonomy
  - Types of Faults
  - Types of Failures
- Managing Failures
- Rollback-Recovery Idea
- Basic Mechanisms
  - Checkpoint
  - Logging
- Checkpointing Approaches
  - Uncoordinated Checkpointing
  - Coordinated Checkpointing
  - Communication-Induced
- Logging Mechanisms
Taxonomy
A fault can be activated to lead to an error, which can propagate into a failure
Types of Faults
- Transient Fault
  - appears once, then goes away
- Intermittent Fault
  - appears, goes away, and appears again; this cycle can repeat
- Permanent Fault
  - appears and never goes away
Types of Failures
- Fail-Stop
  - one or more components stop working (crash)
- Timing
  - components work, but outside of some timing expectation
- Omission
  - some actions are missing, e.g., a process fails to send some of its messages
- Byzantine
  - arbitrary, incorrect behaviour
Managing Failures
- Avoidance
  - Predictive ability
- Detection
  - Error detection codes
- Removal
  - Rollback state to before the error
- Recovery
  - Ensure correct execution regardless of failures
  - Fault-Tolerance
Rollback-Recovery Idea
If a failure is detected, roll back the state and the effects of messages to a point before the failure, then continue executing
The rollback must be to some previous consistent cut, i.e., a state the system could actually have been in
How far back do we go?
One option is to progressively roll back to earlier cuts and re-execute
This can be done by either Checkpointing or Logging
Granularity can be different:
- Transparent
  - No application modification, all system-level
  - High overheads
- Transaction
  - Relies on a transactional API
  - Overhead is reduced to groups of related operations
- Application-specific
  - Applications know best what state is needed for recovery
  - Limited applicability
Basic Mechanisms
Checkpoint
Save system or application state, flush to disk
Saving state can require a lot of I/O; this can be reduced by saving only state deltas, as in the sketch below
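A minimal sketch of delta checkpointing, assuming a toy dictionary state; the file layout and class name are illustrative, not a specific system's format:

```python
import json
import os

class IncrementalCheckpointer:
    """Persist only the keys that changed since the last checkpoint."""

    def __init__(self, directory):
        self.directory = directory
        self.last_saved = {}   # state as of the last checkpoint
        self.seq = 0           # checkpoint sequence number
        os.makedirs(directory, exist_ok=True)

    def checkpoint(self, state):
        # The delta: keys added or changed since the last checkpoint.
        delta = {k: v for k, v in state.items() if self.last_saved.get(k) != v}
        removed = [k for k in self.last_saved if k not in state]
        path = os.path.join(self.directory, f"ckpt-{self.seq}.json")
        with open(path, "w") as f:
            json.dump({"delta": delta, "removed": removed}, f)
            f.flush()
            os.fsync(f.fileno())   # flush to disk before declaring success
        self.last_saved = dict(state)
        self.seq += 1

    def recover(self):
        # Rebuild the latest state by replaying all deltas in order
        # (a real recovery would scan the directory rather than rely
        # on the in-memory sequence counter).
        state = {}
        for i in range(self.seq):
            with open(os.path.join(self.directory, f"ckpt-{i}.json")) as f:
                record = json.load(f)
            state.update(record["delta"])
            for k in record["removed"]:
                state.pop(k, None)
        return state
```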
Logging
Log each operation as it is performed
Can undo and redo
The log must live on persistent storage; it is typically smaller than the total state, but it is written on every action on the happy path
Recovery takes longer, since the log must be replayed
Dependent actions make this difficult
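As a sketch, an undo/redo log over a toy key-value state: each record stores the old value (for undo) and the new value (for redo). The class and record format are illustrative assumptions.

```python
import json
import os

class UndoRedoLog:
    """Each record carries the old value (undo) and new value (redo)."""

    def __init__(self, path):
        self.path = path

    def append(self, key, old, new):
        # Persist the record before the change is allowed to take effect.
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "old": old, "new": new}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def redo(self, state):
        # Roll forward: replay new values in log order.
        for rec in self._records():
            state[rec["key"]] = rec["new"]
        return state

    def undo(self, state):
        # Roll back: apply old values in reverse log order.
        for rec in reversed(self._records()):
            state[rec["key"]] = rec["old"]
        return state

    def _records(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]
```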
Checkpointing Approaches
Uncoordinated, Coordinated, Communication-Induced
Uncoordinated Checkpointing
Processes take checkpoints independently
We need to construct a consistent state after failure
Need to maintain dependency information
Finding a consistent cut can force rollback all the way to the beginning of execution, typically when checkpoints are taken while messages are in flight
Domino effect (see the sketch below)
Could create many useless checkpoints
Garbage Collection is needed to find obsolete checkpoints, which adds overhead
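A sketch of how a recovery line might be computed from uncoordinated checkpoints, showing how rollbacks cascade (the domino effect). The data layout is an assumption for illustration: checkpoint times per process plus a list of messages.

```python
def recovery_line(checkpoints, messages):
    """Find the latest consistent checkpoint index per process.

    checkpoints[p] is a sorted list of checkpoint times for process p,
    with index 0 the initial state at time 0 (which guarantees
    termination); messages is a list of (sender, send_t, receiver, recv_t).
    """
    line = {p: len(times) - 1 for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, send_t, receiver, recv_t in messages:
            # Orphan message: recorded as received before the receiver's
            # checkpoint but sent after the sender's -- inconsistent cut.
            if (recv_t <= checkpoints[receiver][line[receiver]]
                    and send_t > checkpoints[sender][line[sender]]):
                line[receiver] -= 1   # roll the receiver back one more
                changed = True        # ...which may orphan other messages
    return line
```

With unlucky in-flight messages, `line` can cascade all the way to index 0 on every process, i.e., a restart from the beginning.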
Coordinated Checkpointing
Processes coordinate when they checkpoint to obtain a consistent state
No longer requires dependency graph
No domino effect
Single checkpoint per process means less storage needs and no GC
How to handle delayed initiator messages?
No synchronous clock guarantee; this would be easier with bounded message delay
The initiator may tell a process to take a checkpoint even if its state never changed, causing unnecessary checkpoints
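A minimal in-process sketch of the coordination, modelled loosely on a two-phase (tentative/commit) scheme; the class and function names are assumptions for illustration:

```python
class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.tentative = None
        self.committed = None   # the single checkpoint kept per process

    def take_tentative(self):
        # Pause, snapshot local state, acknowledge to the initiator.
        self.tentative = dict(self.state)
        return "ok"

    def decide(self, decision):
        if decision == "commit":
            self.committed = self.tentative   # tentative becomes permanent
        self.tentative = None

def coordinated_checkpoint(processes):
    # Phase 1: every process takes a tentative checkpoint and acks.
    acks = [p.take_tentative() for p in processes]
    # Phase 2: commit only if all acked; otherwise abort all tentatives.
    decision = "commit" if all(a == "ok" for a in acks) else "abort"
    for p in processes:
        p.decide(decision)
    return decision
```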
Communication-Induced
Use consensus algorithms to determine when to checkpoint
All messages are blocked during the consensus process
The Chandy-Lamport algorithm, in contrast, is a non-blocking global snapshotting algorithm
A process records its state before sending markers and, on first receiving a marker, before processing any further messages on that channel
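A sketch of the per-process marker logic in Chandy-Lamport; the `send` callback, channel naming, and toy list-based state are assumptions, and message transport itself is not modelled here:

```python
MARKER = object()   # sentinel marker message

class SnapshotProcess:
    def __init__(self, name, out_channels, send):
        self.name = name
        self.out_channels = out_channels   # destinations for markers
        self.send = send                   # send(dst, msg) callback
        self.state = []                    # toy application state
        self.recorded_state = None         # local part of the snapshot
        self.channel_log = {}              # in-flight messages per channel
        self.closed = set()                # channels whose marker arrived

    def initiate_snapshot(self):
        self._record_and_broadcast()

    def deliver(self, channel, msg):
        if msg is MARKER:
            if self.recorded_state is None:
                # First marker seen: record own state, then flood markers.
                self._record_and_broadcast()
            # The marker closes this channel: its in-flight log is final.
            self.closed.add(channel)
        else:
            self.state.append(msg)   # normal application processing
            if self.recorded_state is not None and channel not in self.closed:
                # Received after our snapshot but before this channel's
                # marker: the message was in flight and belongs to the cut.
                self.channel_log.setdefault(channel, []).append(msg)

    def _record_and_broadcast(self):
        # Record local state *before* sending any markers.
        self.recorded_state = list(self.state)
        for dst in self.out_channels:
            self.send(dst, MARKER)
```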
Logging Mechanisms
Pessimistic logging := a logging strategy where every event is logged to persistent storage before it is allowed to propagate and commit
Pessimistic logging is the safest option, but incurs the highest I/O penalty
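As a sketch, assuming a toy key-value state and a JSON log record format:

```python
import json
import os

def apply_pessimistically(log_path, state, event):
    """Force the event to stable storage before its effects are applied."""
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())   # the pessimistic step: log is durable first
    # Only now may the event take effect and propagate.
    state[event["key"]] = event["value"]
```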
Optimistic logging := a logging strategy that assumes logs become persistent before a failure occurs, but makes it possible to remove the effects of events that were not persisted
Optimistic logging must track dependencies in order to remove these effects
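A sketch of the dependency tracking this requires, assuming events carry IDs and direct-dependency sets (an illustrative data layout, not a specific protocol):

```python
def orphaned_events(deps, lost):
    """Events whose effects must be undone after a failure.

    deps maps an event id to the set of event ids it directly depends
    on; lost is the set of events whose log records never reached
    persistent storage. Anything transitively depending on a lost
    event is an orphan.
    """
    bad = set(lost)
    changed = True
    while changed:
        changed = False
        for eid, direct in deps.items():
            if eid not in bad and direct & bad:
                bad.add(eid)      # depends on a lost/orphaned event
                changed = True
    return bad - set(lost)
```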
Causality-tracking logging := a logging strategy where causally related events are recorded so they can be deterministically replayed