PAXOS and Friends
Required Readings
Optional
- In Search of an Understandable Consensus Algorithm
- Understanding of consistency in distributed systems
- The Part-Time Parliament
- Paxos Variants
- Neat Algorithms - Paxos
- Lamport’s Publications
- Secret Lives of Data
- RAFT GitHub
Summary
- Goal of Consensus Protocol
- 2PC and 3PC
- Paxos History
- Paxos Made Simple
- Paxos Made Simple: Phases
- Paxos: Prepare Phase
- Paxos: Accept Phase
- Paxos: Learn Phase
- Paxos vs FLP
- Multi-Paxos
- Paxos in Practice
- RAFT
- RAFT Overview
- RAFT Leader Election
- RAFT Log Replication
- RAFT Safety
Goal of Consensus Protocol
Processes propose values, values are chosen, chosen values are learned
Nodes may be proposers, acceptors, or learners
Must guarantee safety and liveness.
Safety := Only a value that has been proposed is chosen, only a single value is chosen, and only a chosen value is learned
Liveness := Some proposed value is chosen, any chosen value is learned.
Per the [FLP Theorem](/posts/consensusindistributedsystems/), no asynchronous consensus protocol can guarantee both safety and liveness.
2PC and 3PC
Two-phase and three-phase commit
2PC
2PC requires a coordinator node and it is assumed that the coordinator will not fail.
Vote collection phase:
- Coordinator requests vote from each participant
- Participants send their vote (value)
Decision phase:
- Coordinator tallies the votes and sends the decision (for atomic commit, commit requires every participant to vote yes)
- Decision is acknowledged by participants
2PC blocks if the coordinator fails after votes are collected, so it does not guarantee liveness
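The two phases above can be sketched as follows (a minimal illustration with toy names; it assumes reliable delivery and a coordinator that does not fail, per the protocol's assumptions):

```python
# Minimal 2PC coordinator sketch (illustrative, not a real implementation).

class Participant:
    """Toy participant that votes based on a local check."""
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = None

    def request_vote(self, transaction):
        return self.can_commit

    def deliver_decision(self, decision):
        self.state = decision          # a real participant would ack here

def two_phase_commit(participants, transaction):
    # Phase 1: vote collection.
    votes = [p.request_vote(transaction) for p in participants]
    # Phase 2: decision -- commit only on a unanimous yes.
    decision = "commit" if all(votes) else "abort"
    for p in participants:
        p.deliver_decision(decision)
    return decision
```

Note the blocking hazard lives between the two phases: if the coordinator dies after collecting votes but before sending the decision, participants that voted yes are stuck.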
3PC
3PC addresses blocking
Soliciting votes:
- Coordinator asks each node if they can commit
- Each node responds
Commit authorized:
- Coordinator sends preCommit to each node
- Each node acks
Finalizing commit:
- Coordinator sends doCommit to each node
- Each node responds with haveCommitted
3PC assumes a fail-stop model; it has safety issues under fail-restart
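The three rounds can be sketched like this (illustrative names; timeouts and recovery, which are the point of the extra preCommit round, are omitted for brevity):

```python
# Minimal 3PC coordinator sketch (illustrative). The extra preCommit round
# is what lets participants time out and still make progress if the
# coordinator dies after everyone agreed to commit.

class Node:
    def __init__(self, ok=True):
        self.ok = ok
        self.phase = "init"

    def can_commit(self, tx):
        return self.ok                 # phase 1: node says whether it can commit

    def pre_commit(self, tx):
        self.phase = "prepared"        # phase 2: node acks the preCommit

    def do_commit(self, tx):
        self.phase = "committed"       # phase 3: node reports haveCommitted

    def abort(self, tx):
        self.phase = "aborted"

def three_phase_commit(nodes, tx):
    # Phase 1: soliciting votes.
    if not all(n.can_commit(tx) for n in nodes):
        for n in nodes:
            n.abort(tx)
        return "abort"
    # Phase 2: commit authorized.
    for n in nodes:
        n.pre_commit(tx)
    # Phase 3: finalizing commit.
    for n in nodes:
        n.do_commit(tx)
    return "commit"
```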
Paxos History
First Paxos Paper written by Leslie Lamport in 1990
- Not published until 1998
- Reviewers didn’t appreciate humorous depiction of algorithm
Parliament passes decrees, but:
- Work only part time
- Communicate by messages
- Messages may be delayed
- Parliamentarians may choose not to vote
- No one is malicious
Multi-Paxos := running Paxos decisions over multiple decrees
Paxos was very difficult for most people to understand, so Lamport wrote Paxos Made Simple in 2001
Paxos Made Simple
Asynchronous, Non-Byzantine
Agents:
- Operate at arbitrary speed
- May fail by stopping
- May restart
- Have source of persistent memory to remember info after restart
Messages:
- Can take arbitrarily long to be delivered
- Can be duplicated
- Can be lost
- Can be reordered
- Can NOT be corrupted
Each node is a replica of the same state machine following the same rules
Every decision is based on a majority quorum; any two majority quorums are guaranteed to share at least one member, so the consensus decision can be disseminated.
A majority quorum is needed to tolerate fail-restart.
Everything is timestamped so it can be ordered.
Timestamps are needed to tolerate arbitrary message delays.
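The quorum-intersection property is easy to check exhaustively for a small cluster (a quick illustration, not part of the protocol itself):

```python
# Why majorities work: any two majority quorums over the same node set must
# intersect, so a later quorum always contains at least one node that saw
# an earlier decision. Exhaustive check for a 5-node cluster:

from itertools import combinations

nodes = {1, 2, 3, 4, 5}
majority = len(nodes) // 2 + 1                    # 3 of 5

quorums = [set(q) for q in combinations(nodes, majority)]
assert all(a & b for a in quorums for b in quorums)
print("every pair of majority quorums shares a member")
```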
Paxos Made Simple: Phases
3 Phases
- Prepare
- Node proposes an agreement round
- Accept
- Gather votes whether an agreement is possible and value has been agreed upon
- Learn
- Agreed upon value can be learned by all
Paxos: Prepare Phase
Driven by proposer (leader)
Proposer selects a proposal number n and sends a prepare request with the number n to a majority of acceptors.
The number n is a member of a set that is totally ordered over all processes, i.e., no two processes can use the same n, and a process never uses n twice.
Once an acceptor receives a prepare request numbered n, it rejects all requests numbered < n and replies with the highest-numbered proposal it has already accepted, if any.
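One common way to build such a totally ordered, per-process-unique set of proposal numbers (an illustrative scheme, not the only one) is to interleave a local round counter with the node id:

```python
NUM_NODES = 5   # assumed cluster size

def next_proposal_number(round_counter, node_id):
    # Round 0 yields 0..4 (one n per node), round 1 yields 5..9, and so on:
    # no two nodes ever produce the same n, and no node reuses an n as long
    # as it increments round_counter (persisted across restarts).
    return round_counter * NUM_NODES + node_id
```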
Paxos: Accept Phase
If a proposer hears back from a majority of acceptors, it sends an accept request to each of the acceptors for proposal number n, with value v.
v is the value of the highest-numbered proposal among the responses. If the responses reported no accepted proposals, the proposer may choose v itself.
If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal, unless it has already responded to a prepare request with a number > n.
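Putting the two phases together, the acceptor's rules can be sketched as a small state machine (names are illustrative; in a real deployment this state must be persisted, since acceptors may fail and restart):

```python
# Sketch of a Paxos acceptor handling both phases.

class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest prepare number responded to
        self.accepted_n = -1      # number of the last accepted proposal
        self.accepted_v = None    # value of the last accepted proposal

    def on_prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted proposal so the proposer
            # can adopt its value.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def on_accept(self, n, v):
        # Accept unless a higher-numbered prepare has been promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("accepted",)
        return ("reject",)
```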
Paxos: Learn Phase
Prepare and Accept phases are for write operations; Learn is for reads
Once an accepted value receives a commit message, it becomes the decided value and is communicated to learners.
Learners can also send read-request to check for a value
Read requests need to be handled by a majority quorum as well, due to fail-restarts
It’s inefficient to have each acceptor notify each learner whenever it accepts a proposal.
Instead, you can choose a distinguished learner that receives accepted proposals from the acceptors.
Once the distinguished learner has received a proposal from a majority of acceptors, it informs the other learners.
A larger set of distinguished learners provides greater reliability.
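The distinguished-learner pattern can be sketched like this (illustrative names; it decides once a majority of acceptors report the same proposal, then fans the value out):

```python
from collections import Counter

class Learner:
    def __init__(self):
        self.value = None
    def learn(self, v):
        self.value = v

class DistinguishedLearner(Learner):
    def __init__(self, num_acceptors, other_learners):
        super().__init__()
        self.majority = num_acceptors // 2 + 1
        self.others = other_learners
        self.counts = Counter()

    def on_accepted(self, n, v):
        """Called by an acceptor when it accepts proposal (n, v)."""
        if self.value is not None:
            return                      # already decided
        self.counts[(n, v)] += 1
        if self.counts[(n, v)] >= self.majority:
            self.learn(v)
            for l in self.others:       # notify the remaining learners
                l.learn(v)
```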
Paxos vs FLP
Paxos has a liveness problem
Two proposers keep issuing a sequence of proposals with increasing numbers, and before the first value is learned from a majority, the next value is proposed
A workaround would be to put in random delays for retrying a new value
Could also designate a Distinguished Proposer (and some timeouts to resolve leader failure)
These make it extremely unlikely that liveness will not be reached in practice, although it’s technically still possible
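The random-delay workaround is essentially randomized exponential backoff; a sketch with purely illustrative parameters:

```python
import random

def retry_delay(attempt, base=0.05, cap=2.0):
    # Exponential backoff with jitter: a preempted proposer waits a random,
    # growing amount of time before choosing a higher n and retrying, so
    # two dueling proposers are unlikely to keep preempting each other.
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```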
Multi-Paxos
Each simple single-decree Paxos instance agrees on an individual value
Multiple Paxos protocols are executed for agreeing on the order and values of sequences of operations
Was part of the original paper
Optimizations:
- Separate executions into segments and designate a leader for current segment
- Leader is the proposer
Similar to Viewstamped Replication (VR)
Paxos in Practice
First known implementation of Paxos was at the DEC Systems Research Center
Google Chubby
Zookeeper Atomic Broadcast (ZAB)
RAFT
RAFT is more understandable
Implementation vs Specification
Broad commercial use
RAFT Overview
Distinct leader election phase.
After a leader is elected, the log replication phase begins.
Each time a new leader is elected, a new term is introduced (similar to a view).
Leader can be active for an arbitrary time and an arbitrary number of actions.
RAFT Leader Election
All nodes exchange heartbeat messages with the leader to make sure it’s still alive
Follower’s timeout triggers a leader election
To run as a candidate, a node sends its current term and info about the most recent entry in its log
All nodes vote and decision is made by majority quorum
At most 1 leader for any term
Once candidate becomes leader, new term starts
Outdated leaders cannot be elected
- nodes only vote for a candidate whose log is at least as up to date as their own
- newer == last entry has a higher term number, or the same term number but a longer log
- a losing candidate knows it’s outdated and keeps following
If tie, random timeout before revote
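The vote-granting rule can be sketched as follows (illustrative field names; a node grants at most one vote per term, never to a stale-term candidate, and only when the candidate's log is at least as up to date as its own):

```python
def log_is_up_to_date(cand_last_term, cand_log_len, my_last_term, my_log_len):
    # Higher last-entry term wins; equal terms are broken by log length.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_log_len >= my_log_len

def grant_vote(state, cand_term, cand_id, cand_last_term, cand_log_len):
    if cand_term < state["current_term"]:
        return False                        # stale-term candidate
    if cand_term > state["current_term"]:
        state["current_term"] = cand_term   # step into the new term
        state["voted_for"] = None
    if state["voted_for"] not in (None, cand_id):
        return False                        # at most one vote per term
    if not log_is_up_to_date(cand_last_term, cand_log_len,
                             state["last_log_term"], state["log_len"]):
        return False                        # candidate's log is outdated
    state["voted_for"] = cand_id
    return True
```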
RAFT Log Replication
Each node has a log of entries, each of which contains the operation and term number
Writes go to the leader, which appends to its own log, then replicates
- Leader pushes the new log entry, along with the index and term of the preceding entry, to followers during the heartbeat
- Each follower checks that it has the preceding entry and sends an ack if so
- A log entry is committed at the leader once the leader has a majority of acks; the leader then acks the client
A new leader commits uncommitted entries from previous terms only after it commits an entry in its new term
Garbage Collection = Snapshot + log truncation
Leader can send snapshot of current state (rather than log) to nodes that are sufficiently behind
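The follower-side consistency check that drives replication can be sketched like this (illustrative; the new entry is appended only if the follower's log matches the leader's preceding entry, otherwise the follower nacks and the leader backs up):

```python
def append_entries(log, prev_index, prev_term, entry):
    """Follower-side AppendEntries sketch.

    log is a list of (term, op) tuples; prev_index is 0-based, -1 means
    the leader claims the log was empty before this entry.
    """
    if prev_index >= len(log):
        return False                      # missing the preceding entry
    if prev_index >= 0 and log[prev_index][0] != prev_term:
        return False                      # preceding entry conflicts
    del log[prev_index + 1:]              # drop any conflicting suffix
    log.append(entry)
    return True
```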
RAFT Safety
Once committed, a log entry won’t be overwritten
Once a log entry is applied in a node, no other node will apply a different log entry in the same slot