Reaves.dev

v0.1.0

built using

Phoenix v1.7.12

Fault Tolerance

Stephen M. Reaves

::

2023-02-17

Notes about Lesson 7 of CS-7210

Required Readings

Summary

Taxonomy

A fault can be activated to lead to an error which can propagate into a failure

Types of Faults

Types of Failures

Managing Failures

Rollback-Recovery Idea

If failure detected, roll back state and effects of messages to before failure, then continue executing

Needs to rollback to some previous consistent cut have to be an actual state that the system has actually been in

How far back do we go?

Try to find by progressively rolling back to earlier cuts and reexecute?

This can be done by either Checkpointing or Logging

Granularity can be different:

Basic Mechanisms

Checkpoint

Save system or application state, flush to disk

Potentially a lot of I/O used to save state. This can be improved by only saving state deltas

Logging

Log information as it is performed

Can undo and redo

Log needs to be on some persistent storage, typically this is smaller than total state, but is written every action in happy path

Recovery takes longer

Dependent actions make this difficult

Checkpointing Approaches

Uncoordinated, Coordinated, Communication-Induced

Uncoordinated Checkpointing

Processes take checkpoints independently

We need to construct a consistent state after failure

Need to maintain dependency information

Consistent cuts could rollback to execution to beginning, usually if checkpoints are when messages are in-flight

Domino effect

Could create many usesless checkpoints

Garbage Collection is needed to find obsolete checkpoints, which adds overhead

Coordinated Checkpointing

processes coordinate when they checkpoint to get a consistent state

No longer requires dependency graph

No domino effect

Single checkpoint per process means less storage needs and no GC

How to handle delayed initiator messages?

No synchronous clock guarantee

This would be easier with bounded message time

Initiator could tell process to keep taking check points even if state never changed, causing unnecessary checkpoints

Communication-Induced

Use consensus algorithms to determine when to checkpoint

All messages are blocked during consensus process

Chandy-Lamport Algorithm a non-blocking global snapshotting algorithm

Checkpoints are made before sending marker and before processing receipt of marker

Logging Mechanisms

Pessimistic logging := a logging strategy where everything is logged to persistent storage, even before allowing event to propogate and commit

PL is most safe, but incurs highest IO penalty

Optimisitc logging := a logging strategy where logs will be persistent before failure, but make it possible to remove effects if aborted

OL must track dependencies in order to remove effects

Causality-tracking logging := a logging strategy where causally related events are deterministically recorded