Reaves.dev


PAXOS and Friends

Stephen M. Reaves

::

2023-02-19

Notes about Lesson 8 of CS-7210

Required Readings

Optional

Summary

Goal of Consensus Protocol

Processes propose values, values are chosen, chosen values are learned

Nodes may be proposers, acceptors, or learners

Must guarantee safety and liveness.

Safety := Only a value that has been proposed is chosen, only a single value is chosen, and only a chosen value is learned

Liveness := Some proposed value is chosen, any chosen value is learned.

The [FLP Theorem](/posts/consensusindistributedsystems/) shows that an asynchronous consensus protocol cannot guarantee both safety and liveness.

2PC and 3PC

Two and three phase commit

2PC

2PC requires a coordinator node and it is assumed that the coordinator will not fail.

Vote collection phase:

  1. Coordinator requests vote from each participant
  2. Participants send their vote (value)

Decision phase:

  1. Coordinator decides commit only if every participant voted commit (otherwise abort) and sends the decision
  2. Decision is acknowledged by participants

2PC blocks if the coordinator fails, so it does not guarantee liveness
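The two phases can be sketched as a coordinator loop; `request_vote` and `deliver_decision` are illustrative names, not a real API:

```python
# Minimal 2PC coordinator sketch (illustrative names).
# Standard 2PC commits only if every participant votes COMMIT.
def two_phase_commit(participants):
    # Phase 1: vote collection
    votes = [p.request_vote() for p in participants]  # "COMMIT" or "ABORT"

    # Phase 2: decision (unanimity required to commit)
    decision = "COMMIT" if all(v == "COMMIT" for v in votes) else "ABORT"
    for p in participants:
        p.deliver_decision(decision)  # participants block if coordinator dies here
    return decision
```

The blocking problem lives between the phases: a participant that has voted COMMIT cannot safely proceed until it hears the decision.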

3PC

3PC addresses blocking

Soliciting votes:

  1. Coordinator asks each node if they can commit
  2. Each node responds

Commit authorized:

  1. Coordinator sends preCommit to each node
  2. Each node acks

Finalizing commit:

  1. Coordinator sends doCommit to each node
  2. Each node responds with haveCommitted

3PC assumes fail-stop, safety issues on fail-restart
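The three phases give each participant a state progression, which can be sketched as a tiny state machine (state and message names are illustrative):

```python
# 3PC participant state machine sketch (illustrative names).
# The extra precommitted state lets survivors decide safely under fail-stop,
# but a fail-restart can still reintroduce safety violations.
TRANSITIONS = {
    ("init",         "canCommit?"): "ready",
    ("ready",        "preCommit"):  "precommitted",
    ("precommitted", "doCommit"):   "committed",
}

def step(state, msg):
    # Unknown (state, message) pairs leave the state unchanged.
    return TRANSITIONS.get((state, msg), state)

state = "init"
for msg in ["canCommit?", "preCommit", "doCommit"]:
    state = step(state, msg)
```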

Paxos History

The first Paxos paper, "The Part-Time Parliament," was written by Leslie Lamport in 1990 (it was not published until 1998)

Parliament passes decrees, but legislators may leave the chamber at any time and messengers may lose messages (modeling node failure and message loss)

Multi-Paxos := running paxos decisions on multiple decrees

Paxos was very difficult for most people to understand, so Lamport wrote Paxos Made Simple in 2001

Paxos Made Simple

Asynchronous, Non-Byzantine

Agents: proposers, acceptors, and learners

Messages: prepare requests, accept requests, commit messages, and read requests

Each node is a replica of the same state machine following the same rules

Every decision is based on a majority quorum; any two majority quorums are guaranteed to have intersecting members, so the consensus decision can be disseminated.

Majority quorum is needed so it can tolerate fail-restart

Everything is time stamped, so it can be ordered.

Time stamps are needed so it can tolerate arbitrary message delays
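The quorum-intersection claim is easy to check exhaustively for a small cluster: any two sets of size greater than N/2 drawn from N nodes must share a member, since |Q1| + |Q2| > N. A quick sketch:

```python
from itertools import combinations

# Enumerate all minimal majority quorums over a node set and verify
# that every pair of them shares at least one member.
def majorities(nodes):
    need = len(nodes) // 2 + 1          # smallest majority size
    return [set(c) for c in combinations(nodes, need)]

nodes = ["a", "b", "c", "d", "e"]
qs = majorities(nodes)
assert all(q1 & q2 for q1 in qs for q2 in qs)  # every pair intersects
```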

Paxos Made Simple: Phases

3 Phases

Paxos: Prepare Phase

Driven by proposer (leader)

Proposer selects proposal number n and sends a prepare request with the number n to a majority of acceptors.

The number n comes from a set that is totally ordered across all processes, i.e., no two processes can use the same n, and a process never uses the same n twice.

Once an acceptor receives a prepare request numbered n, it promises to reject all requests numbered < n, and it replies with the highest-numbered proposal it has already accepted, if any

Paxos: Accept Phase

If a proposer hears back from a majority of acceptors, it sends an accept request to each of those acceptors for proposal number n with value v, where v is the value of the highest-numbered proposal among the responses. If the responses contained no proposals, the proposer may choose v itself.

If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal, unless it has already responded to a prepare request with a number > n.
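The acceptor's side of both phases can be sketched as a small class; this is a toy model with illustrative field names, not Lamport's pseudocode:

```python
# Toy single-decree Paxos acceptor covering the prepare and accept phases.
class Acceptor:
    def __init__(self):
        self.promised_n = -1     # highest prepare number promised so far
        self.accepted_n = -1     # number of the last accepted proposal
        self.accepted_v = None   # value of the last accepted proposal

    def on_prepare(self, n):
        # Promise to ignore proposals numbered < n, and report any prior
        # acceptance so the proposer can adopt the highest-numbered value.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def on_accept(self, n, v):
        # Accept unless we already promised a higher-numbered proposal.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n, v)
        return ("reject",)
```

A proposer that collects promises from a majority picks the value with the highest `accepted_n` among the replies, or its own value if none was reported.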

Paxos: Learn Phase

Prepare and Accept phases are for write operations, Learn is for read

When an acceptor that has accepted a value receives a commit message, that value becomes the decided value and is communicated to learners.

Learners can also send read-request to check for a value

Read requests need to be handled by a majority quorum as well, due to fail-restarts

It’s inefficient to have each acceptor notify each learner whenever it accepts a proposal

You can choose a Distinguished Learner that receives accepted proposals from the acceptors

Once distinguished learner receives proposal from a majority of acceptors, THEN it informs other learners

A larger set of distinguished learners provides greater reliability

Paxos vs FLP

Paxos has liveness problem

Two proposers keep issuing a sequence of proposals with increasing numbers, and before the first value is learned from a majority, the next value is proposed

A workaround would be to put in random delays for retrying a new value

Could also designate a Distinguished Proposer (and some timeouts to resolve leader failure)

These make it extremely unlikely in practice that liveness fails, although it is still technically possible

Multi-Paxos

Each simple single-decree Paxos instance agrees on an individual value

Multiple Paxos protocols are executed for agreeing on the order and values of sequences of operations

Was part of the original paper

Optimizations: with a stable leader, the prepare phase can be skipped for subsequent instances

Similar to Viewstamp Replication (VR)

Paxos in Practice

The first known implementation of Paxos was at the DEC Systems Research Center

Google Chubby

Zookeeper Atomic Broadcast (ZAB)

RAFT

RAFT is more understandable

Implementation vs Specification

Broad commercial use

RAFT Overview

Distinct leader election phase.

After a leader is elected, the log replication phase begins

Each time a new leader is elected, a new term is introduced (similar to view)

Leader can be active for arbitrary time and arbitrary number of actions

RAFT Leader Election

The leader sends periodic heartbeat messages to all nodes so followers know it's still alive

Follower’s timeout triggers a leader election

To apply as a candidate, a node sends its current term and the term and index of the last entry in its log

All nodes vote and decision is made by majority quorum

At most 1 leader for any term

Once candidate becomes leader, new term starts

Outdated leaders cannot be elected

If tie, random timeout before revote
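The rule that outdated leaders cannot be elected boils down to an up-to-date log comparison when a follower decides whether to grant its vote. A minimal sketch with illustrative names (it assumes the candidate's term has already been adopted when comparing `voted_for`):

```python
# RAFT vote-granting sketch: grant a vote only if the candidate's log is
# at least as up-to-date as ours and we haven't voted for someone else.
def grant_vote(my_term, my_last_log_term, my_last_log_index,
               voted_for, candidate_id,
               cand_term, cand_last_log_term, cand_last_log_index):
    if cand_term < my_term:
        return False                          # stale candidate
    if voted_for not in (None, candidate_id):
        return False                          # at most one vote per term
    # "More up-to-date": higher last term wins; same term -> longer log wins.
    if cand_last_log_term != my_last_log_term:
        return cand_last_log_term > my_last_log_term
    return cand_last_log_index >= my_last_log_index
```

Because every committed entry lives on a majority of nodes, a candidate missing committed entries cannot gather a majority of votes under this rule.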

RAFT Log Replication

Each node has a log of entries, each of which contains the operation and term number

Writes go to the leader, which appends the entry to its own log and then replicates it

  1. Leader pushes the new log entry, along with the index and term of the previous entry, to followers during the heartbeat
  2. Each follower checks whether its log contains that previous entry and sends an ack if so
  3. The log entry is committed at the leader once it receives acks from a majority; the leader then acks the client
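The follower's consistency check in step 2 can be sketched as follows; this is illustrative, not the full AppendEntries RPC:

```python
# Follower-side log consistency check from RAFT log replication.
# Each log entry is a (term, operation) pair; prev_index/prev_term
# identify the entry that must immediately precede the new one.
def append_entry(log, prev_index, prev_term, entry):
    # Reject if our log has no matching entry at prev_index; the leader
    # will retry with an earlier prev_index until the logs line up.
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index][0] != prev_term:
            return False
    del log[prev_index + 1:]      # drop any conflicting suffix
    log.append(entry)
    return True
```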

A new leader commits leftover uncommitted entries from previous terms indirectly, by committing an entry from its own new term

Garbage Collection = Snapshot + log truncation

Leader can send snapshot of current state (rather than log) to nodes that are sufficiently behind

RAFT Safety

Once committed, a log entry won’t be overwritten

Once a log entry is applied in a node, no other node will apply a different log entry in the same slot