Consensus In Distributed Systems

Stephen M. Reaves

2023-02-06

Notes about Lesson 5 of CS-7210

Required Readings

Summary

What is a Consensus
System Model
Definitions
FLP Theorem
- Proof
Is Consensus Really Impossible?

What is a Consensus

Consensus := the ability for multiple separate processes in a distributeds system to agree on something.

Consensus is needed to agree on the outcome of a transaction.

3 key properties:

Terminaion/Liveness
- All non-faulty processes eventually decide on a value
Agreement
- All processes agree on a single value
Validity
- The value that’s decided on must have been proposed by some process

System Model

System is async
- Messages may be reorded and delayed, but not corrupted
At most one faulty processor
Fail-stop failure model
- All failures are indistinguishable from a message delay

Real systems are more complex

Definitions

Admissable Run := a run with 1 faulty processor and all messages eventually delivered (matches system model)

Deciding run := Admissable run where some non-faulty processor reaches a decision

Totally Correct Consensus Protocol := a system where all admissable runs are deciding runs

Uni-valent configuration := a deciding run where only one decision is possible

FLP Theorem

Dijkstra award for Proving the impossibility of consensus using async communication.

In a system with one fault, no consensus protocol can be totally correct.

Proof

Proof in a nutshell

Start with a system where nodes can either decide on 0 or 1.

Lemma 2: There is an initial configuration for which the final decision is not predetermined, but depends on the schedule of events (initial bivalent configuration)

There must be an event that changes from a bivalent to univalent configuration.

It is possible to delay/reorder messages in such a way so that the system never goes through the valence transition.

Is Consensus Really Impossible?

Faults are inevitable

Network delays are inevitable

Cannot expect a stronger system model

=> Impossible to have a correct distributed system?

There are examples of correct distributed systems:

2 Phase Commit (2PC)
3 Phase Commit (3PC)
Paxos
Raft

These do NOT contradict the FLP result, rather they change some of the assumptions and system prooperties

Under which conditions will it provide us consensus?