Fault Tolerance
Required Readings
Summary
- Taxonomy
  - Types of Faults
  - Types of Failures
- Managing Failures
- Rollback-Recovery Idea
- Basic Mechanisms
  - Checkpoint
  - Logging
- Checkpointing Approaches
  - Uncoordinated Checkpointing
  - Coordinated Checkpointing
  - Communication-Induced
- Logging Mechanisms
Taxonomy
A fault can be activated to lead to an error, which can propagate into a failure
Types of Faults
- Transient Fault
  - appears once, then goes away
- Intermittent Fault
  - appears, goes away, and appears again; this cycle can repeat
- Permanent Fault
  - appears and never goes away
Types of Failures
- Fail-Stop
  - one or more components stop working (crash)
- Timing
  - components work, but outside of some timing expectation
- Omission
  - some actions are missing, e.g., a process fails to send some of its messages
- Byzantine
  - arbitrary, incorrect behaviour
Managing Failures
- Avoidance
  - Predictive ability
- Detection
  - Error detection codes
- Removal
  - Rollback state to before the error
- Recovery
  - Ensure correct execution regardless of failures
  - Fault-Tolerance
Rollback-Recovery Idea
If a failure is detected, roll back the state and the effects of messages to a point before the failure, then continue executing
The rollback must be to some previous consistent cut, i.e., a state the system could actually have been in
How far back do we go?
One option is to progressively roll back to earlier cuts and re-execute
This can be done by either Checkpointing or Logging
Granularity can be different:
- Transparent
  - No application modification, all system-level
  - High overheads
- Transaction
  - Relies on a transactional API
  - Overhead is reduced to groups of related operations
- Application-specific
  - Applications know best what state is needed for recovery
  - Limited applicability
Basic Mechanisms
Checkpoint
Save system or application state, flush to disk
Saving state can require a lot of I/O; this can be reduced by saving only state deltas, as in the sketch below
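A minimal sketch of delta checkpointing, assuming a toy dictionary state; the file layout and class name are illustrative, not a specific system's format:

```python
import json
import os

class IncrementalCheckpointer:
    """Persist only the keys that changed since the last checkpoint."""

    def __init__(self, directory):
        self.directory = directory
        self.last_saved = {}   # state as of the last checkpoint
        self.seq = 0           # checkpoint sequence number
        os.makedirs(directory, exist_ok=True)

    def checkpoint(self, state):
        # The delta: keys added or changed since the last checkpoint.
        delta = {k: v for k, v in state.items() if self.last_saved.get(k) != v}
        removed = [k for k in self.last_saved if k not in state]
        path = os.path.join(self.directory, f"ckpt-{self.seq}.json")
        with open(path, "w") as f:
            json.dump({"delta": delta, "removed": removed}, f)
            f.flush()
            os.fsync(f.fileno())   # flush to disk before declaring success
        self.last_saved = dict(state)
        self.seq += 1

    def recover(self):
        # Rebuild the latest state by replaying all deltas in order
        # (a real recovery would scan the directory rather than rely
        # on the in-memory sequence counter).
        state = {}
        for i in range(self.seq):
            with open(os.path.join(self.directory, f"ckpt-{i}.json")) as f:
                record = json.load(f)
            state.update(record["delta"])
            for k in record["removed"]:
                state.pop(k, None)
        return state
```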
Logging
Log each operation as it is performed
Can undo and redo
The log must live on persistent storage; it is typically smaller than the total state, but it is written on every action on the happy path
Recovery takes longer, since the log must be replayed
Dependent actions make this difficult
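As a sketch, an undo/redo log over a toy key-value state: each record stores the old value (for undo) and the new value (for redo). The class and record format are illustrative assumptions.

```python
import json
import os

class UndoRedoLog:
    """Each record carries the old value (undo) and new value (redo)."""

    def __init__(self, path):
        self.path = path

    def append(self, key, old, new):
        # Persist the record before the change is allowed to take effect.
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "old": old, "new": new}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def redo(self, state):
        # Roll forward: replay new values in log order.
        for rec in self._records():
            state[rec["key"]] = rec["new"]
        return state

    def undo(self, state):
        # Roll back: apply old values in reverse log order.
        for rec in reversed(self._records()):
            state[rec["key"]] = rec["old"]
        return state

    def _records(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]
```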
Checkpointing Approaches
Uncoordinated, Coordinated, Communication-Induced
Uncoordinated Checkpointing
Processes take checkpoints independently
We need to construct a consistent state after failure
Need to maintain dependency information
Finding a consistent cut can force rollback all the way to the beginning of execution, typically when checkpoints are taken while messages are in flight
Domino effect (see the sketch below)
Could create many useless checkpoints
Garbage Collection is needed to find obsolete checkpoints, which adds overhead
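A sketch of how a recovery line might be computed from uncoordinated checkpoints, showing how rollbacks cascade (the domino effect). The data layout is an assumption for illustration: checkpoint times per process plus a list of messages.

```python
def recovery_line(checkpoints, messages):
    """Find the latest consistent checkpoint index per process.

    checkpoints[p] is a sorted list of checkpoint times for process p,
    with index 0 the initial state at time 0 (which guarantees
    termination); messages is a list of (sender, send_t, receiver, recv_t).
    """
    line = {p: len(times) - 1 for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, send_t, receiver, recv_t in messages:
            # Orphan message: recorded as received before the receiver's
            # checkpoint but sent after the sender's -- inconsistent cut.
            if (recv_t <= checkpoints[receiver][line[receiver]]
                    and send_t > checkpoints[sender][line[sender]]):
                line[receiver] -= 1   # roll the receiver back one more
                changed = True        # ...which may orphan other messages
    return line
```

With unlucky in-flight messages, `line` can cascade all the way to index 0 on every process, i.e., a restart from the beginning.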
Coordinated Checkpointing
Processes coordinate when they checkpoint to obtain a consistent state
No longer requires dependency graph
No domino effect
Single checkpoint per process means less storage needs and no GC
How to handle delayed initiator messages?
No synchronous clock guarantee; this would be easier with bounded message delay
The initiator may tell a process to take a checkpoint even if its state never changed, causing unnecessary checkpoints
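A minimal in-process sketch of the coordination, modelled loosely on a two-phase (tentative/commit) scheme; the class and function names are assumptions for illustration:

```python
class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.tentative = None
        self.committed = None   # the single checkpoint kept per process

    def take_tentative(self):
        # Pause, snapshot local state, acknowledge to the initiator.
        self.tentative = dict(self.state)
        return "ok"

    def decide(self, decision):
        if decision == "commit":
            self.committed = self.tentative   # tentative becomes permanent
        self.tentative = None

def coordinated_checkpoint(processes):
    # Phase 1: every process takes a tentative checkpoint and acks.
    acks = [p.take_tentative() for p in processes]
    # Phase 2: commit only if all acked; otherwise abort all tentatives.
    decision = "commit" if all(a == "ok" for a in acks) else "abort"
    for p in processes:
        p.decide(decision)
    return decision
```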
Communication-Induced
Use consensus algorithms to determine when to checkpoint
All messages are blocked during the consensus process
The Chandy-Lamport algorithm, in contrast, is a non-blocking global snapshotting algorithm
A process records its state before sending markers and, on first receiving a marker, before processing any further messages on that channel
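A sketch of the per-process marker logic in Chandy-Lamport; the `send` callback, channel naming, and toy list-based state are assumptions, and message transport itself is not modelled here:

```python
MARKER = object()   # sentinel marker message

class SnapshotProcess:
    def __init__(self, name, out_channels, send):
        self.name = name
        self.out_channels = out_channels   # destinations for markers
        self.send = send                   # send(dst, msg) callback
        self.state = []                    # toy application state
        self.recorded_state = None         # local part of the snapshot
        self.channel_log = {}              # in-flight messages per channel
        self.closed = set()                # channels whose marker arrived

    def initiate_snapshot(self):
        self._record_and_broadcast()

    def deliver(self, channel, msg):
        if msg is MARKER:
            if self.recorded_state is None:
                # First marker seen: record own state, then flood markers.
                self._record_and_broadcast()
            # The marker closes this channel: its in-flight log is final.
            self.closed.add(channel)
        else:
            self.state.append(msg)   # normal application processing
            if self.recorded_state is not None and channel not in self.closed:
                # Received after our snapshot but before this channel's
                # marker: the message was in flight and belongs to the cut.
                self.channel_log.setdefault(channel, []).append(msg)

    def _record_and_broadcast(self):
        # Record local state *before* sending any markers.
        self.recorded_state = list(self.state)
        for dst in self.out_channels:
            self.send(dst, MARKER)
```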
Logging Mechanisms
Pessimistic logging := a logging strategy where every event is logged to persistent storage before it is allowed to propagate and commit
Pessimistic logging is the safest option, but incurs the highest I/O penalty
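As a sketch, assuming a toy key-value state and a JSON log record format:

```python
import json
import os

def apply_pessimistically(log_path, state, event):
    """Force the event to stable storage before its effects are applied."""
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())   # the pessimistic step: log is durable first
    # Only now may the event take effect and propagate.
    state[event["key"]] = event["value"]
```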
Optimistic logging := a logging strategy that assumes logs become persistent before a failure occurs, but makes it possible to remove the effects of events that were not persisted
Optimistic logging must track dependencies in order to remove these effects
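A sketch of the dependency tracking this requires, assuming events carry IDs and direct-dependency sets (an illustrative data layout, not a specific protocol):

```python
def orphaned_events(deps, lost):
    """Events whose effects must be undone after a failure.

    deps maps an event id to the set of event ids it directly depends
    on; lost is the set of events whose log records never reached
    persistent storage. Anything transitively depending on a lost
    event is an orphan.
    """
    bad = set(lost)
    changed = True
    while changed:
        changed = False
        for eid, direct in deps.items():
            if eid not in bad and direct & bad:
                bad.add(eid)      # depends on a lost/orphaned event
                changed = True
    return bad - set(lost)
```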
Causality-tracking logging := a logging strategy where causally related events are recorded so they can be deterministically replayed