Consensus In Distributed Systems
Required Readings
- Impossibility of distributed consensus with one faulty process
- Replication chapter in For Funand Profit (No Link
Summary
What is a Consensus
Consensus := the ability for multiple separate processes in a distributeds system to agree on something.
Consensus is needed to agree on the outcome of a transaction.
3 key properties:
- Terminaion/Liveness
- All non-faulty processes eventually decide on a value
- Agreement
- All processes agree on a single value
- Validity
- The value that’s decided on must have been proposed by some process
System Model
- System is async
- Messages may be reorded and delayed, but not corrupted
- At most one faulty processor
- Fail-stop failure model
- All failures are indistinguishable from a message delay
Real systems are more complex
Definitions
Admissable Run := a run with 1 faulty processor and all messages eventually delivered (matches system model)
Deciding run := Admissable run where some non-faulty processor reaches a decision
Totally Correct Consensus Protocol := a system where all admissable runs are deciding runs
Uni-valent configuration := a deciding run where only one decision is possible
FLP Theorem
Dijkstra award for Proving the impossibility of consensus using async communication.
In a system with one fault, no consensus protocol can be totally correct.
Proof
Start with a system where nodes can either decide on 0 or 1.
Lemma 2: There is an initial configuration for which the final decision is not predetermined, but depends on the schedule of events (initial bivalent configuration)
There must be an event that changes from a bivalent to univalent configuration.
It is possible to delay/reorder messages in such a way so that the system never goes through the valence transition.
Is Consensus Really Impossible?
Faults are inevitable
Network delays are inevitable
Cannot expect a stronger system model
=> Impossible to have a correct distributed system?
There are examples of correct distributed systems:
- 2 Phase Commit (2PC)
- 3 Phase Commit (3PC)
- Paxos
- Raft
These do NOT contradict the FLP result, rather they change some of the assumptions and system prooperties
Under which conditions will it provide us consensus?