Distributed Transactions
Required Readings
Optional
- Amazon Aurora
- Amazon Aurora Design Considerations
- Big Table
- Megastore
- TrueTime and External Consistency
Summary
- What are Distributed Transactions
- Spanner Brief
- Spanner Stack
- Consistency Requirements
- True Time
- Ordering Write Transactions with TrueTime Timestamps
- Read Transactions
- No TrueTime
- Another Distributed DB Example: AWS Aurora
What are Distributed Transactions
Group of operations that are applied together, over multiple nodes
ACID properties
Useful for concurrency and fault tolerance
Outcomes are commit or abort
Typically done with coordinator (leader)
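The coordinator-driven commit/abort decision above can be sketched as a minimal two-phase commit. All names here (`Participant`, `prepare`, `commit`, `abort`) are illustrative, not from any specific system:

```python
# Minimal sketch of coordinator-driven atomic commit (2PC).
class Participant:
    def __init__(self, name):
        self.name = name
        self.state = "init"

    def prepare(self, txn):
        # Vote yes only if the operation can be made durable locally.
        self.state = "prepared"
        return True

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"

def run_transaction(participants, txn):
    # Phase 1: coordinator collects votes from every node.
    votes = [p.prepare(txn) for p in participants]
    # Phase 2: commit only if ALL voted yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit(txn)
        return "commit"
    for p in participants:
        p.abort(txn)
    return "abort"
```

A single "no" vote (or a crashed participant) forces the whole transaction to abort, which is what gives the group of operations its all-or-nothing property.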
Spanner Brief
Data is geographically distributed
Data is sharded within a location
Data is replicated across multiple sites within a location
Spanner Stack
Colossus FS nodes at the bottom, which are optimized for reads and appends
Bigtable on top of Colossus, which maps files into an application-specific data model
Data model is grouped into tablets, which are groups of related objects
Megastore, which is a Replicated State Machine (w/ Paxos) per tablet
- 2-phase locking for concurrency control
- 2PC for cross-replica-set transactions to decide commit/abort
Consistency Requirements
Read operations are served from a replica of the state, not the current state, to prevent blocking
External consistency := order of events in the system must match the order in which the events occurred in real (wall-clock) time
Requires strict serializability and common global clock
True Time
TT != absolute real time
TT is uncertainty around real time
You can define:
TT.after(t), true if time t has definitely passed
TT.before(t), true if time t has definitely not arrived
TT.now(), which returns TTInterval: [earliest, latest]
- Narrowest interval of what the current time could be
Periodic probing of master clock servers in datacenter
- GPS and Atomic clocks + 2*ε
- Acquire locks
- Pick time s = TT.now().latest
- Do operation
- Wait until TT.now().earliest > s
- Release locks
Commit wait may be much longer than actual operation
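The interval API and the commit-wait loop above can be sketched together. This is a toy model: it assumes a fixed uncertainty bound `EPSILON`, whereas real TrueTime derives the bound from periodic probing of GPS/atomic clock masters:

```python
import time
from dataclasses import dataclass

EPSILON = 0.007  # ~7 ms uncertainty, assumed fixed for illustration

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now():
    # Stand-in for TrueTime: the true time lies somewhere in the interval.
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def tt_after(t):
    # True if time t has definitely passed.
    return tt_now().earliest > t

def commit_with_wait(apply_operation):
    # (Locks are assumed to be held by the caller.)
    s = tt_now().latest           # pick timestamp s
    apply_operation()             # do the operation
    while not tt_after(s):        # commit wait: until TT.now().earliest > s
        time.sleep(EPSILON / 2)
    return s                      # caller releases locks after this
```

Note the wait runs for roughly 2*ε regardless of how fast the operation itself was, which is exactly why commit wait can dominate short operations.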
Ordering Write Transactions with TrueTime Timestamps
Pessimistic Locking := acquiring all locks upfront to prevent later conflicts
On a single replica set:
- Acquire Locks
- Pick TT Timestamp s
- Leader starts consensus
- Achieve consensus
- Commit wait done
- Release locks
- Notify slaves
On transactions across multiple replica sets, we add 2PC:
- Each replica acquires locks
- Each replica picks TT Timestamp s
- Transaction starts executing as before
- Each replica set starts logging updates they are to perform
- Done logging, each node sends its s to the transaction coordinator to compute the overall timestamp s for the transaction
- Commit wait
- Release locks on transaction coordinator
- Notify participants
- Participants release locks
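The coordinator's timestamp computation in the steps above can be sketched as follows; the function name and arguments are illustrative. The assumption is that the overall commit timestamp must be no earlier than any participant's prepare timestamp, so that the single commit wait on the coordinator covers every replica set:

```python
# Sketch: combine per-participant prepare timestamps into one
# commit timestamp for the whole cross-replica-set transaction.
def overall_commit_timestamp(participant_timestamps, coordinator_now_latest):
    # Must be >= every participant's s, and >= the coordinator's
    # own TT.now().latest at decision time.
    return max(max(participant_timestamps), coordinator_now_latest)
```

Picking the max keeps the per-participant orderings intact: no participant ends up with a commit timestamp earlier than a write it already logged.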
Read Transactions
2 types of transactions: Read Now and Read Later
Read Now
Still need to be externally ordered
Use leader info to determine a safe timestamp
- For single replica set, just use Paxos leader
- For multi-replica, consider all at TxMgr
Potentially delay until its safe time is reached
Prepared but not committed transactions delay reads
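The interaction between safe time and prepared-but-uncommitted transactions can be sketched like this. The function names and the exact `- 1` adjustment are a simplified model: the idea is that a replica cannot serve a read at timestamp t until it is sure no transaction could still commit at or below t:

```python
# Sketch: a replica serves a Read Now at t_read only if t_read <= safe time.
def safe_time(paxos_applied_ts, prepare_timestamps):
    # State up to the last applied Paxos write is stable, UNLESS a
    # prepared (not yet committed) transaction might still commit
    # at an earlier timestamp.
    if not prepare_timestamps:
        return paxos_applied_ts
    return min(paxos_applied_ts, min(prepare_timestamps) - 1)

def can_serve_read(t_read, paxos_applied_ts, prepare_timestamps):
    return t_read <= safe_time(paxos_applied_ts, prepare_timestamps)
```

This makes the delay concrete: a transaction prepared at timestamp 8 holds safe time back to 7, so a read at 8 blocks until that transaction commits or aborts.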
Read Later
Simply read at specific timestamp
The timestamp saves us from needing a distributed-cut algorithm
No TrueTime
GPS + atomic clocks => few ms uncertainty
- ~5–7 ms
- Just wait it out
What if we don’t have GPS + atomic clocks?
NTP => ~100ms uncertainty
Make it visible when/if external consistency is needed, only perform expensive wait when it’s needed
[CockroachDB](https://www.cockroachlabs.com/docs/stable/architecture/transaction-layer.html) does this
Snapshot isolation (or Multi-Version Concurrency Control) allows transactions to be interleaved, even when operating on the same data
To guarantee correctness:
- Transactions read from a snapshot of distributed state
- The sequence of snapshots is serializable (i.e. no cycles)
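A minimal sketch of the multi-version idea behind snapshot reads: each key keeps timestamped versions, and a read at timestamp t returns the latest version with timestamp <= t. The class and method names are illustrative, not any specific system's API:

```python
# Toy multi-version store: snapshot reads at a timestamp.
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (ts, value), sorted by ts

    def write(self, key, ts, value):
        vs = self.versions.setdefault(key, [])
        vs.append((ts, value))
        vs.sort(key=lambda p: p[0])

    def read_at(self, key, ts):
        # Latest version whose timestamp is <= ts; None if no such version.
        result = None
        for vts, val in self.versions.get(key, []):
            if vts <= ts:
                result = val
            else:
                break
        return result
```

Because every read names an explicit timestamp, readers never block writers: a write at ts=3 is simply invisible to a snapshot read at ts=2.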
Another Distributed DB Example: AWS Aurora
Follows Primary-Replica architecture
Primary handles reads and writes, multiple replicas handle reads only
Designed for availability
Replicated across 3 availability zones (AZs), 2 nodes per zone
6 replicas = I/O Amplification (each write turns into 6 writes)
Replicating only the redo log to the storage layer (rather than full data pages) solves this
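Aurora's 6-replica layout works with quorums: per the Aurora design, writes need 4 of 6 acknowledgments and reads need 3 of 6. A sketch of the quorum arithmetic (function name is illustrative):

```python
# Sketch of Aurora-style quorum safety conditions for V replicas,
# write quorum Vw, and read quorum Vr.
def quorums_are_safe(v, vw, vr):
    # A read quorum must overlap every write quorum (Vr + Vw > V),
    # and two write quorums must overlap (2 * Vw > V) so conflicting
    # writes cannot both succeed.
    return vr + vw > v and 2 * vw > v
```

With V=6, Vw=4, Vr=3 (two replicas per AZ across three AZs), writes survive the loss of an entire AZ, and reads survive an AZ plus one more node.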