Distributed Transactions
Required Readings
Optional
- Amazon Aurora
- Amazon Aurora Design Considerations
- Big Table
- Megastore
- TrueTime and External Consistency
Summary
- What are Distributed Transactions
- Spanner Brief
- Spanner Stack
- Consistency Requirements
- True Time
- Ordering Write Transactions with TrueTime Timestamps
- Read Transactions
- No TrueTime
- Another Distributed DB Example: AWS Aurora
What are Distributed Transactions
Group of operations that are applied together, over multiple nodes
ACID properties
Useful for concurrency and fault tolerance
Outcomes are commit or abort
Typically done with coordinator (leader)
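The coordinator-driven commit/abort decision above can be sketched as a minimal two-phase commit. All names here (`Participant`, `prepare`, `commit`, `abort`) are illustrative, not from any specific system:

```python
# Minimal sketch of coordinator-driven atomic commit (2PC).
class Participant:
    def __init__(self, name):
        self.name = name
        self.state = "init"

    def prepare(self, txn):
        # Vote yes only if the operation can be made durable locally.
        self.state = "prepared"
        return True

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"

def run_transaction(participants, txn):
    # Phase 1: coordinator collects votes from every node.
    votes = [p.prepare(txn) for p in participants]
    # Phase 2: commit only if ALL voted yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit(txn)
        return "commit"
    for p in participants:
        p.abort(txn)
    return "abort"
```

A single "no" vote (or a crashed participant) forces the whole transaction to abort, which is what gives the group of operations its all-or-nothing property.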
Spanner Brief
Data is geographically distributed
Data is sharded within a location
Data is replicated across multiple sites within a location
Spanner Stack
Colossus FS nodes at the bottom, which are optimized for reads and appends
Bigtable on top of Colossus, which maps files into an application-specific data model
Data model is grouped into tablets, which are groups of related objects
Megastore, which is a Replicated State Machine (w/ Paxos) per tablet
- 2-phase locking for concurrency control
- 2PC for cross-replica-set transactions to decide commit/abort
Consistency Requirements
Read operations are served from a replica of the state, not the current state, to prevent blocking
External consistency := order of events in the system must match the order in which the events occurred in real (wall-clock) time
Requires strict serializability and common global clock
True Time
TT != absolute real time
TT is uncertainty around real time
You can define:
TT.after(t), true if time t has definitely passed
TT.before(t), true if time t has definitely not arrived
TT.now(), which returns TTInterval: [earliest, latest]
- Narrowest interval of what the current time could be
Periodic probing of master clock servers in datacenter
- GPS and Atomic clocks + 2*ε
- Acquire locks
- Pick time s = TT.now().latest
- Do operation
- Wait until TT.now().earliest > s
- Release locks
Commit wait may be much longer than actual operation
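The interval API and the commit-wait loop above can be sketched together. This is a toy model: it assumes a fixed uncertainty bound `EPSILON`, whereas real TrueTime derives the bound from periodic probing of GPS/atomic clock masters:

```python
import time
from dataclasses import dataclass

EPSILON = 0.007  # ~7 ms uncertainty, assumed fixed for illustration

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now():
    # Stand-in for TrueTime: the true time lies somewhere in the interval.
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def tt_after(t):
    # True if time t has definitely passed.
    return tt_now().earliest > t

def commit_with_wait(apply_operation):
    # (Locks are assumed to be held by the caller.)
    s = tt_now().latest           # pick timestamp s
    apply_operation()             # do the operation
    while not tt_after(s):        # commit wait: until TT.now().earliest > s
        time.sleep(EPSILON / 2)
    return s                      # caller releases locks after this
```

Note the wait runs for roughly 2*ε regardless of how fast the operation itself was, which is exactly why commit wait can dominate short operations.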
Ordering Write Transactions with TrueTime Timestamps
Pessimistic Locking := acquiring all locks upfront to prevent later conflicts
On a single replica set:
- Acquire Locks
- Pick TT Timestamp s
- Leader starts consensus
- Achieve consensus
- Commit wait done
- Release locks
- Notify slaves
On transactions across multiple replica sets, we add 2PC:
- Each replica acquires locks
- Each replica picks TT Timestamp s
- Transaction starts executing as before
- Each replica set starts logging updates they are to perform
- Done logging, each node sends its s to the transaction coordinator to compute the overall timestamp s for the transaction
- Commit wait
- Release locks on transaction coordinator
- Notify participants
- Participants release locks
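The coordinator's timestamp computation in the steps above can be sketched as follows; the function name and arguments are illustrative. The assumption is that the overall commit timestamp must be no earlier than any participant's prepare timestamp, so that the single commit wait on the coordinator covers every replica set:

```python
# Sketch: combine per-participant prepare timestamps into one
# commit timestamp for the whole cross-replica-set transaction.
def overall_commit_timestamp(participant_timestamps, coordinator_now_latest):
    # Must be >= every participant's s, and >= the coordinator's
    # own TT.now().latest at decision time.
    return max(max(participant_timestamps), coordinator_now_latest)
```

Picking the max keeps the per-participant orderings intact: no participant ends up with a commit timestamp earlier than a write it already logged.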
Read Transactions
2 types of transactions: Read Now and Read Later
Read Now
Still need to be externally ordered
Use leader info to determine a safe timestamp
- For single replica set, just use Paxos leader
- For multi-replica, consider all at TxMgr
Potentially delay until its safe time is reached
Prepared but not committed transactions delay reads
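The interaction between safe time and prepared-but-uncommitted transactions can be sketched like this. The function names and the exact `- 1` adjustment are a simplified model: the idea is that a replica cannot serve a read at timestamp t until it is sure no transaction could still commit at or below t:

```python
# Sketch: a replica serves a Read Now at t_read only if t_read <= safe time.
def safe_time(paxos_applied_ts, prepare_timestamps):
    # State up to the last applied Paxos write is stable, UNLESS a
    # prepared (not yet committed) transaction might still commit
    # at an earlier timestamp.
    if not prepare_timestamps:
        return paxos_applied_ts
    return min(paxos_applied_ts, min(prepare_timestamps) - 1)

def can_serve_read(t_read, paxos_applied_ts, prepare_timestamps):
    return t_read <= safe_time(paxos_applied_ts, prepare_timestamps)
```

This makes the delay concrete: a transaction prepared at timestamp 8 holds safe time back to 7, so a read at 8 blocks until that transaction commits or aborts.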
Read Later
Simply read at specific timestamp
The timestamp saves us from needing a distributed-cut algorithm
No TrueTime
GPS + atomic clocks => few ms uncertainty
- ~5–7 ms
- Just wait it out
What if we don’t have GPS + atomic clocks?
NTP => ~100ms uncertainty
Make it visible when/if external consistency is needed, only perform expensive wait when it’s needed
[CockroachDB](https://www.cockroachlabs.com/docs/stable/architecture/transaction-layer.html) does this
Snapshot isolation (or Multi-Version Concurrency Control) allows transactions to be interleaved, even when operating on the same data
To guarantee correctness:
- Transactions read from a snapshot of distributed state
- The sequence of snapshots is serializable (i.e. no cycles)
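A minimal sketch of the multi-version idea behind snapshot reads: each key keeps timestamped versions, and a read at timestamp t returns the latest version with timestamp <= t. The class and method names are illustrative, not any specific system's API:

```python
# Toy multi-version store: snapshot reads at a timestamp.
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (ts, value), sorted by ts

    def write(self, key, ts, value):
        vs = self.versions.setdefault(key, [])
        vs.append((ts, value))
        vs.sort(key=lambda p: p[0])

    def read_at(self, key, ts):
        # Latest version whose timestamp is <= ts; None if no such version.
        result = None
        for vts, val in self.versions.get(key, []):
            if vts <= ts:
                result = val
            else:
                break
        return result
```

Because every read names an explicit timestamp, readers never block writers: a write at ts=3 is simply invisible to a snapshot read at ts=2.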
Another Distributed DB Example: AWS Aurora
Follows Primary-Replica architecture
Primary handles reads and writes, multiple replicas handle reads only
Designed for availability
Replicated across 3 availability zones (AZs), 2 nodes per zone
6 replicas = I/O Amplification (each write turns into 6 writes)
Replicating only the redo log to the storage layer (rather than full data pages) solves this
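Aurora's 6-replica layout works with quorums: per the Aurora design, writes need 4 of 6 acknowledgments and reads need 3 of 6. A sketch of the quorum arithmetic (function name is illustrative):

```python
# Sketch of Aurora-style quorum safety conditions for V replicas,
# write quorum Vw, and read quorum Vr.
def quorums_are_safe(v, vw, vr):
    # A read quorum must overlap every write quorum (Vr + Vw > V),
    # and two write quorums must overlap (2 * Vw > V) so conflicting
    # writes cannot both succeed.
    return vr + vw > v and 2 * vw > v
```

With V=6, Vw=4, Vr=3 (two replicas per AZ across three AZs), writes survive the loss of an entire AZ, and reads survive an AZ plus one more node.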