Distributed Computing Midterm Study Guide
Summary
- Readings
- 01 Intro
- 02 RPC Primer
- 03 Time
- 04 State
- 05 Consensus
- 06 Paxos and Friends
- 07 Replication
- 08 Fault Tolerance
- 09 Distributed Transactions
- Other
Readings
- Logical Time: A Way to Capture Causality in Distributed Systems
- Consistent Global States
- Distributed Snapshots
- Impossibility of distributed consensus with one faulty process
- Replication chapter in Distributed Systems for Fun and Profit
- Paxos Made Simple
- A Brief Analysis of Consensus Protocol: From Logical Clock to Raft
- Chain Replication for Supporting High Throughput and Availability
- A Survey of Rollback-Recovery Protocols in Message-Passing Systems
- Spanner
Optional
- Time, Clocks, and the Ordering of Events in a Distributed System
- In Search of an Understandable Consensus Algorithm
- The Part-Time Parliament of Paxos
- Paxos Variants
- Neat Algorithms - Paxos
- Lamport’s Publications
- Secret Lives of Data
- RAFT GitHub
- Object Storage on CRAQ
- Amazon Aurora
- Amazon Aurora Design Considerations
- Big Table
- Megastore
- TrueTime and External Consistency
01 Intro
- What are defining characteristics of a distributed system?
- A distributed system is a collection of computing resources, connected via some network, that behaves as a single coherent system
- What are some simple models to represent a Distributed System? Why would you choose to use a system model? Using a simple model, can you describe how a distributed system can be represented?
- We typically [represent distributed systems](/posts/introtodistributedsystems/) as a graph. We represent nodes as vertices and communication between nodes as edges. In more complex models, you can even represent things like state by labeling the nodes; changes in state then change the label (going from S1 to S1', for example). A tiny sketch follows below.
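As a toy illustration of this graph model (the node names, state labels, and helper function are invented for the example, not from the course material):

```python
# Toy illustration of the graph model: nodes are vertices labeled with
# their current state, channels are edges. All names are invented.

nodes = {"A": "S1", "B": "S2", "C": "S3"}          # node -> state label
channels = [("A", "B"), ("B", "C"), ("A", "C")]    # communication edges

def deliver(sender, receiver, new_state):
    """Delivering a message can change the receiver's state label
    (e.g., going from S2 to S2')."""
    assert (sender, receiver) in channels or (receiver, sender) in channels
    nodes[receiver] = new_state

deliver("A", "B", "S2'")
print(nodes)   # {'A': 'S1', 'B': "S2'", 'C': 'S3'}
```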
- What do we mean by asynchronous? What about asynchrony makes distributed
computing hard?
- Asynchronous communication simply means that messages can be sent at one time but not received or acted on until arbitrarily later; there is no bound on message delay or processing speed. This introduces uncertainty and non-determinism.
- What about failures makes distributed computing hard?
- There are many types of failures, and the message sender often cannot tell them apart: a crashed node, a slow node, and a dropped message can all look the same.
- Why is consistency a hard problem in distributed computing?
- Keeping replicas consistent requires coordination, which introduces trade-offs with respect to performance and the kinds of failures the system can tolerate.
- Pick one of the Fallacies. Can you explain why it’s not true and how does it
make distributed computing more challenging?
- The network is reliable
- Networks can and do fail. There needs to be error handling so that requests don’t stall (and consume resources indefinitely) when the network fails, AND error handling around retrying requests once the failed network comes back.
- Latency is zero
- It takes time for the bits to physically move over a wire, and that time varies for a number of reasons, even down to what kind of wire you use. If you assume latency is zero, you might conclude a message was dropped when really it’s just still in transit.
- Bandwidth is infinite
- We are limited by the cables and protocols that we use. If we do not plan for these limitations, we can create artificial bottlenecks.
- The network is secure
- Nothing is perfectly secure. We cannot rely on the security of one layer to protect all the layers above it. Bad actors would love for us to ignore this fact.
- Topology doesn’t change
- Changes in topology result in changes in latency and bandwidth, so they need to be handled carefully.
- There is one administrator
- No system at scale is realistically built and maintained by one person anymore. Adopting a set of common best practices allows the next administrator to understand the reasoning of the previous one.
- Transport cost is zero
- It’s not free to go between layers (like Application to Transport layers). Like latency, these costs need to be taken into account when designing distributed systems.
- The network is homogeneous
- You will at some point need to interoperate with multiple systems/protocols. Preparing for this ahead of time often saves time in the long run.
- Can you explain “The system gives correct answers always, regardless of
failures”? What does it mean with respect to the required properties of the
system?
- A distributed system should provide correct answers as if it were a single coherent entity, no matter which components fail. That requires both consistency (correct answers) and fault tolerance/availability (answers always, despite failures).
- How does consideration of Latency relate to the observations made by CAP?
- Latency comes into play when there is no partition: even then, there is a trade-off between latency and consistency, since stronger consistency requires waiting on coordination among replicas.
02 RPC Primer
- Can you identify and describe the main elements and steps involved in a
distributed RPC system?
- An RPC system includes an API that specifies how to call remote procedures, Stubs that handle data marshaling, and a Runtime that provides connection management and handles certain failures.
- Typically, a programmer writes a spec in an Interface Definition Language (IDL) like protobuf, then uses a tool (like protoc) to compile that IDL into stubs that are included in the source code.
- A server implements the interfaces defined by the stubs and is added to some sort of registry (normally handled by the runtime).
- A (potentially different) programmer builds a client against the same API. When the client calls a procedure, the stub builds the message and the runtime sends it (potentially over the network).
- Once the server gets the message, its stub unpacks the message and makes the local call to the correct procedure. A sketch of the client side follows below.
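To make the flow concrete, here is a hedged, minimal sketch of the client-side pieces (stub plus runtime), assuming a hypothetical `add` procedure and JSON in place of a real IDL such as protobuf:

```python
import json
import socket

# Toy client-side stub + runtime. A real system would generate the stub from
# an IDL (e.g., protobuf via protoc) and use a proper transport/registry;
# HOST, PORT, and the add procedure are made up for this example.

HOST, PORT = "localhost", 9000

def call_remote(proc_name, *args):
    """Runtime: marshal the request, send it, wait for the reply."""
    request = json.dumps({"proc": proc_name, "args": args}).encode()
    with socket.create_connection((HOST, PORT)) as sock:
        sock.sendall(request)
        reply = sock.recv(4096)
    return json.loads(reply.decode())["result"]

def add(x, y):
    """Client stub: looks like a local call, delegates to the runtime."""
    return call_remote("add", x, y)

# add(1, 2) would return 3 once a server implementing "add" listens on PORT.
```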
- Contrast exactly once, at most once, at least once, invocation semantics –
what would be required from the RPC runtime, under which assumptions would
such invocation semantics be feasible (or not)…
- At Least Once
- If a client can’t verify that a request was executed, it times out and retransmits until it gets a response.
- At Most Once
- If a server gets a request that it has already processed, it uses the cached reply instead of re-executing the request.
- Exactly Once
- Exactly once semantics require the client to implement At Least Once and the server to implement At Most once semantics.
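A rough sketch of the two halves that combine into exactly-once semantics, with invented function names and a simple in-memory reply cache:

```python
import uuid

# Sketch of the two halves of exactly-once (names and transport are invented).

# Client side: at-least-once via timeout + retransmit.
def call_at_least_once(send, request, timeout=1.0, retries=5):
    request["id"] = str(uuid.uuid4())       # unique id so the server can dedupe
    for _ in range(retries):
        reply = send(request, timeout)      # assumed to return None on timeout
        if reply is not None:
            return reply
    raise TimeoutError("no reply after retries")

# Server side: at-most-once via a cache of replies keyed by request id.
reply_cache = {}

def handle_at_most_once(request, execute):
    rid = request["id"]
    if rid in reply_cache:                  # duplicate: return the cached reply
        return reply_cache[rid]
    result = execute(request)               # execute only the first time
    reply_cache[rid] = result
    return result
```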
03 Time
- Why is time important and hard in distributed systems?
- Time is important because it helps us determine an order of events. Determining time is hard because it’s impossible to know how long network transfers will take, there’s no guarantee that nodes have synchronized clocks, and there’s no guarantee there won’t be failures.
- What is a logical clock? What are logical clocks used for? How is a logical
clock related to physical clock?
- A logical clock is a monotonic virtual timestamp generator. All it is useful for is determining order. Logical clocks don’t necessarily relate to physical clocks, but the ordering of events should be preserved between the different types of clocks.
- What is required to implement a logical clock? What properties must be
followed for this function to behave like a clock? Why/What’s the meaning of
the Clock Conditions?
- Logical clocks must be monotonically increasing; in practice they are counters incremented on each event. The Clock Condition: if an event a happened before an event b (a → b), then the timestamp of a must be less than the timestamp of b, i.e., C(a) < C(b). A minimal sketch follows below.
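A minimal sketch of a scalar (Lamport) clock that satisfies the Clock Condition; the class and method names are illustrative, not from any specific implementation:

```python
# Minimal scalar (Lamport) clock: if a -> b then C(a) < C(b).

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send_event(self):
        """Tick, then piggyback the timestamp on the outgoing message."""
        self.time += 1
        return self.time

    def receive_event(self, msg_time):
        """Advance past the sender's timestamp, then tick."""
        self.time = max(self.time, msg_time) + 1
        return self.time
```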
- Contrast Scalar, Vector, Matrix clocks? What is the cost or overhead
associated with each? What is the benefit from each of those/what is the main
challenge addressed by introducing each of these/what are the properties
provided by each clock?
- Scalar/Lamport’s
- Each node has its own clock, which is a single counter. Each node only knows the value of the timestamps it has computed or received, so scalar clocks capture the happened-before order but cannot tell concurrent events apart from causally ordered ones.
- Vector
- Each node maintains its own view of time as perceived by all nodes.
- Matrix
- Each node maintains its own view about every other node’s view of the global time.
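For comparison, a sketch of a vector clock and the comparison rule that lets it distinguish causally ordered events from concurrent ones (again, names are illustrative):

```python
# Vector clock sketch: node i keeps V where V[j] is its view of node j's
# logical time. Comparing vectors separates causal order from concurrency.

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.i = node_id
        self.v = [0] * num_nodes

    def local_event(self):
        self.v[self.i] += 1
        return list(self.v)

    def send_event(self):
        self.v[self.i] += 1
        return list(self.v)                 # piggyback on the message

    def receive_event(self, msg_vector):
        self.v = [max(a, b) for a, b in zip(self.v, msg_vector)]
        self.v[self.i] += 1
        return list(self.v)

def happened_before(va, vb):
    """True if va -> vb. If neither direction holds, the events are concurrent."""
    return all(a <= b for a, b in zip(va, vb)) and va != vb
```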
- Can you look at the example execution shown in the papers/video and explain
how these clocks are getting updated? Can you point out what you are able to
tell about some specific events based on the clock values alone?
- No.
04 State
- Understand the meaning and purpose of the concepts distributed system state,
consistent cut/snapshot, actual/observed run, …
- Process State
- Most recent event/timestamp at that node.
- Channel State
- In flight messages.
- System/Global State
- Collection of state from all of the nodes/processes and channels.
- Run
- A sub-sequence of events.
- Actual Run
- Exactly what events actually happened.
- Observed Run
- A sub-sequence of events that could have potentially happened.
- Cut
- Vertical slice of state at a given time. Snapshot.
- Prerecording Event
- Event that happens before a cut.
- Postrecording Event
- Event that happens after a cut.
- Stable Property
- A property that once it becomes true for a state, it remains true for all
reachable states.
- Deadlock
- Token loss
- Termination
- Unstable Property
- A property that makes no guarantee about truthiness across states.
- Buffer overflow
- Load spike
- Race condition
- Understand the snapshot algorithm, what are the assumptions under which it’s valid, why are those assumptions necessary/how are they reflected in the algorithm?
- Can you trace through an execution the consistent state that could be captured by the algorithm?
- By knowing the state of a property in a given state in the system, what can
we tell about that same property at the start/at the end of the execution?
Can you provide examples when this would be useful?
- If a STABLE property is true in a recorded state, then we know it’s true in the final state; likewise, if it’s false in a recorded state, then it was false in all earlier states. If an UNSTABLE property is true in a recorded state, we only know it COULD POSSIBLY have held during the actual run. This is especially useful when debugging (e.g., detecting deadlock or a possible invariant violation).
05 Consensus
- What is consensus? Explain in your own words/understand all elements of the
definition in the papers/video
- The ability for multiple separate processes in a distributed system to agree on something.
- What is the goal of the FLP work, what did they want to learn/prove? Provide
intuition about the approach they took to achieve this goal.
- The FLP result proved that, in an asynchronous system with even one (crash) faulty process, no deterministic consensus protocol can guarantee that the system will decide. They proved this by showing that if you start in a bivalent configuration (a configuration from which multiple outcomes are still possible), it is always possible to reach another bivalent configuration by delaying a message, so a decision can be postponed forever.
- State the FLP theorem and provide intuition of the proof/do you understand it?
- What’s the intuition about the significance of FLP, in light of much other
work of consensus algorithms, replication protocols, …
- In order to reach consensus in practice, you need to change the assumptions of your model (e.g., assume some synchrony or randomization, or accept that termination is only guaranteed under favorable conditions).
06 Paxos and Friends
- Main idea or 2PC and 3PC
- 2 Phase Commit
- Vote Collection Phase
- Coordinator requests votes from each participant
- Participants send their vote back
- Decision Phase
- Coordinator collects the votes, decides (for atomic commit, commit only if every participant voted to commit), and sends the decision out to participants
- Decision is acknowledged by participants
- 3 Phase Commit
- Soliciting Votes
- Coordinator asks each node if they can commit
- Each node responds
- Commit Authorized
- Coordinator sends `preCommit` to each node
- Each node acks
- Finalizing Commit
- Coordinator sends `doCommit` to each node
- Each node responds with `haveCommitted`
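A toy sketch of the 2PC flow outlined above, assuming the atomic-commit variant where the coordinator commits only on a unanimous yes vote; the participant objects and their `prepare`/`commit`/`abort` methods are placeholders:

```python
# Toy 2PC coordinator (atomic-commit variant: commit only on unanimous yes).
# Participant objects and their prepare/commit/abort methods are placeholders.

def two_phase_commit(participants, txn):
    # Phase 1: vote collection
    votes = [p.prepare(txn) for p in participants]   # each returns True/False

    # Phase 2: decision, then broadcast it
    decision = "commit" if all(votes) else "abort"
    for p in participants:
        if decision == "commit":
            p.commit(txn)
        else:
            p.abort(txn)
    return decision
```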
- Some history behind PAXOS
- First Paxos paper written by Leslie Lamport in 1990
- Not published until 1998
- Reviewers didn’t appreciate the humorous depiction of the algorithm
- Parliament passes decrees, but:
- Work only part time
- Communicate by messages
- Messages may be delayed
- Parliamentarians may choose not to vote
- No one is malicious
- First Paxos paper written by Leslie Lamport in 1990
- Description of PAXOS? What’s expected from the system (the system model) for
the protocol to hold?
- Agents
- Operate at arbitrary speed
- May fail-stop
- May fail-restart
- Have source of persistent memory to remember info after restart
- Messages
- Can take arbitrarily long to be delivered
- Can be duplicated
- Can be lost
- Can be reordered
- Can NOT be corrupted
- Everything is time stamped so it can be ordered
- 3 phases
- Prepare
- Node proposes an agreement round
- Accept
- Gather votes on whether an agreement is possible and whether a value has been agreed upon
- Learn
- Agreed upon value can be learned by all
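A sketch of the acceptor side of single-decree Paxos, covering the Prepare/Accept handling described above; this is illustrative and leaves out the persistence to stable storage that real acceptors need across restarts:

```python
# Single-decree Paxos acceptor sketch: the state machine behind the
# Prepare/Accept phases. Real acceptors must persist this state to disk.

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_value = None

    def on_prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted proposal so the proposer
            # is forced to re-propose that value.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("nack", self.promised_n, None)

    def on_accept(self, n, value):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("nack", self.promised_n)
```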
- What’s hard about determining whom to follow?
- You need to reach consensus on who will lead you to consensus.
- All nodes can try to become the leader at the same time, so proposals can keep preempting one another
- Random delay between retries helps with this
- Main ideas that PAXOS assumes: state-machine replication, majority quorum, …
- State Machine Replication
- There has to be some way for nodes to have synchronized view of global state
- Majority Quorum
- A majority (more than half of the nodes, i.e., at least ⌊N/2⌋ + 1) is needed for decisions
- Any two majority quorums must overlap in at least one member by definition, so state can persist across all transactions
- Goal and description of the phases in Paxos
- What’s the functionality Paxos provides? Why we need Multi-Paxos?
- A single Paxos instance helps us decide on one value. Multi-Paxos runs a sequence of Paxos instances to agree on the order and values of a sequence of operations.
- Motivation for Raft, key idea and differences than Paxos
- Raft was designed to be more understandable than Paxos. It separates leader election from log replication and treats them as distinct sub-problems.
- Log management in Raft – how is info used to update, discard entries,
trim/garbage collection?
- Writes go to the leader, which appends the entry to its log, then replicates it
- The leader pushes the new log entry, along with the index and term of the previous entry, to followers during the heartbeat/AppendEntries
- Each follower checks whether it has that previous entry; if yes, it appends the new entry and sends an ack
- A log entry gets committed at the leader once the leader has acks from a majority; the leader then acknowledges the client (sketched below)
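A sketch of the follower-side consistency check behind the log-management rules above (log entries follow the Raft paper's shape; RPC plumbing and commit-index handling are omitted):

```python
# Follower-side AppendEntries consistency check (log entries are
# (term, command) tuples; RPC plumbing and commit index are omitted).

def append_entries(log, prev_index, prev_term, entries):
    """Return True (ack) iff the follower's log matches at prev_index."""
    # Reject if we don't hold the entry the leader expects just before these;
    # the leader will retry with an earlier prev_index.
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index][0] != prev_term:
            return False
    # Append entries, truncating the log at the first conflicting entry.
    for offset, entry in enumerate(entries):
        idx = prev_index + 1 + offset
        if idx < len(log) and log[idx][0] != entry[0]:
            del log[idx:]            # conflict: drop it and everything after
        if idx >= len(log):
            log.append(entry)
    return True
```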
07 Replication
- Contrast active vs. Stand-by replication
- Active
- Every replica can handle requests and applies the replicated updates
- Stand-by
- Only 1 active/primary/leader
- Updates are kept consistent so fail over can be fast
- Contrast state vs. Log/RSM replication
- State
- After handling a command, the node will send its entire state to replicas
- Ensures all nodes are perfectly in-sync at all times
- Could possibly send a lot of data over the network
- Can potentially save all nodes from re-executing the same commands to get the same result
- Log/RSM
- After (or even before) handling a command, the node will send the command to all replicas
- Doesn’t have to send (potentially large) state over network, but can send (hopefully smaller) command.
- Useful for systems that have large state but low execution cost per command.
- What are the problems addressed by chain replication? How are they addressed?
- What are the problems created by chain replication? How are they addressed in
CRAQ (high level)? Can you explain the result from the experimental
comparison of CR and CRAQ?
- The main benefit is that CRAQ allows all nodes to handle read requests, not just the tail
- Each node maintains new and old values, serves the old value until the new value is fully replicated down the chain
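A rough sketch of the CRAQ read path described above; the `tail` object and its `latest_committed_version` method are placeholders for illustration:

```python
# CRAQ read path sketch. Each node keeps possibly several versions per key;
# the tail object and latest_committed_version() are placeholders.

class CraqNode:
    def __init__(self, tail):
        self.tail = tail
        self.versions = {}    # key -> {version_number: value}
        self.clean = {}       # key -> latest committed version number
        self.dirty = {}       # key -> set of newer, uncommitted version numbers

    def read(self, key):
        if not self.dirty.get(key):
            # Clean: any node in the chain can answer from its local copy.
            return self.versions[key][self.clean[key]]
        # Dirty: ask the tail which version is committed, serve that local copy.
        committed = self.tail.latest_committed_version(key)
        return self.versions[key][committed]
```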
08 Fault Tolerance
- What’s the main idea of rollback-recovery as a FT technique?
- If a failure is detected, roll back the state and the effects of messages to a point before the failure, then continue executing from there
- What are the differences and trade offs of checkpointing vs. logging as a FT
technique? What are all the different metrics you’d need to think about when
comparing the two?
- Checkpointing
- Save system state, flush to disk
- Quick recovery
- Potentially lots of I/O
- Logging
- Log each operation/message as it is performed
- Less I/O
- Longer recovery
- Dependent actions make this difficult
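A toy contrast of the two recovery styles, with an invented file name and an assumed deterministic `apply_op`:

```python
import pickle

# Toy contrast of the two recovery styles (file name, state shape, and
# apply_op are invented; apply_op is assumed to be deterministic).

def take_checkpoint(state, path="checkpoint.bin"):
    # Checkpointing: flush the whole state to stable storage.
    # More I/O up front, but recovery is a single load.
    with open(path, "wb") as f:
        pickle.dump(state, f)

def recover_from_checkpoint(path="checkpoint.bin"):
    with open(path, "rb") as f:
        return pickle.load(f)

def log_operation(log, op):
    # Logging: record each operation as it is performed.
    # Less I/O per step, but recovery must replay the whole log.
    log.append(op)

def recover_from_log(initial_state, log, apply_op):
    state = initial_state
    for op in log:
        state = apply_op(state, op)   # deterministic replay
    return state
```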
- Describe and explain the pros/cons/trade offs of coordinated, uncoordinated,
communication-induced checkpointing.
- Uncoordinated
- Processes take checkpoints independently
- Need to maintain dependency information
- Garbage collection is needed
- Coordinated
- Processes coordinate and checkpoint all at once
- Each processes keeps single checkpoint
- No need for garbage collection
- Difficult to do without synchronized clocks and with unbounded message delays.
- A process could be told to keep taking checkpoints even if its state never changed
- Communication-induced
- Use [consensus algorithms](/posts/consensusindistributedsystems/) to determine when to checkpoint
- The [Chandy-Lamport Algorithm](/posts/stateindistributedsystems/) is a non-blocking algorithm (sketched below)
- Checkpoints are made before sending marker and before processing receipt of marker
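A sketch of the marker rules of the Chandy-Lamport algorithm referenced above, assuming FIFO channels; the channel objects and their `send` method are placeholders:

```python
# Marker rules of the Chandy-Lamport snapshot algorithm (FIFO channels
# assumed; channel objects and their send() method are placeholders).

class SnapshotProcess:
    def __init__(self, state, incoming, outgoing):
        self.state = state
        self.incoming = incoming          # incoming channel ids
        self.outgoing = outgoing          # channel objects with .send(msg)
        self.recorded_state = None
        self.open_channels = set()        # channels still being recorded
        self.channel_state = {}

    def start_recording(self, skip_channel=None):
        self.recorded_state = self.state           # record own state first
        for ch in self.outgoing:
            ch.send("MARKER")                       # markers before any new messages
        self.open_channels = set(self.incoming) - {skip_channel}
        self.channel_state = {ch: [] for ch in self.open_channels}

    def on_marker(self, channel):
        if self.recorded_state is None:
            # First marker seen: this channel's recorded state is empty.
            self.start_recording(skip_channel=channel)
        else:
            # Stop recording this channel; its state is what arrived so far.
            self.open_channels.discard(channel)

    def on_message(self, channel, msg):
        # In-flight messages on still-open channels belong to channel state.
        if self.recorded_state is not None and channel in self.open_channels:
            self.channel_state[channel].append(msg)
```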
- We mention a number of factors which influence the choice of a FT technique. Can you reason through some examples, say considering change in storage cost, or system scale, or read/write ratio of the workload, and whether or how those changes would impact the winner among any two of the techniques we discussed?
09 Distributed Transactions
- What is the concept of TrueTime? How is that related to the Logical Clocks
discussed earlier in the semester? How can it be implemented?
- TrueTime (TT) represents time as an interval that bounds the uncertainty around real time
- Similar to a logical time window, but anchored to physical time
- How does the use of TrueTime change what is required from the system
(Spanner) in order to establish ordering among transactions?
- GPS and atomic clocks are heavily used to keep the uncertainty interval small
- Using TrueTime how do you establish ordering of write operations? How do you
implement a read operation? What about reading a snapshot “in the past”?
- Write
- On a single replica set:
- Acquire Locks
- Pick TT Timestamp s
- Leader starts consensus
- Achieve consensus
- Commit wait done
- Release locks
- Notify slaves
- On transactions across multiple replica sets, we add 2PC:
- Each replica acquires locks
- Each replica picks TT Timestamp s
- Transaction starts executing as before
- Each replica set starts logging updates they are to perform
- Once done logging, each node sends its s to the transaction coordinator, which computes the overall timestamp s for the transaction
- Commit wait
- Release locks on transaction coordinator
- Notify participants
- Participants release locks
- Read
- ReadNow
- Use leader info to determine a “safe” timestamp
- For single replica set, just use Paxos leader
- For multi-replica, consider all at Transaction Manager
- ReadLater
- Simply read at specific timestamp
- The timestamp saves us from having to run a distributed snapshot/cut algorithm
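A hedged sketch of TrueTime-style commit wait; `tt_now` here fakes the uncertainty interval with a fixed epsilon around the local clock, whereas Spanner derives a much tighter bound from GPS and atomic-clock time masters:

```python
import time

# TrueTime-style commit wait. tt_now() fakes an uncertainty interval of
# +/- EPSILON around the local clock; Spanner gets a much tighter bound
# from GPS/atomic-clock time masters. apply_writes is a placeholder.

EPSILON = 0.007   # assumed worst-case clock uncertainty, in seconds

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)     # (earliest, latest)

def commit(apply_writes):
    _, latest = tt_now()
    s = latest                            # commit timestamp s >= TT.now().latest
    apply_writes(s)
    # Commit wait: don't make the commit visible until s has definitely
    # passed everywhere, i.e., until TT.now().earliest > s. This is what
    # gives external consistency for the ordering of writes.
    while tt_now()[0] <= s:
        time.sleep(EPSILON / 2)
    return s
```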
- Describe at a high-level Spanner, how does it organize and replicate data,
who is involved in serving read or write operations…
- Brief
- Data is geographically distributed
- Data is sharded within a location
- Data is replicated across multiple sites within a location
- Stack
- 3 Colossus FS nodes
- Big Table on top of Colossus, which exports files into an application-specific data model grouped into tablets (groups of related objects)
- Megastore, which is a Replicated State Machine (w/ Paxos) per tablet
- 2 phase locking concurrency control
- 2PC for cross-replica set transactions to determine lock
- Describe at a high-level Aurora, how does it organize and replicate data, who
is involved in serving read or write operations…
- Follows Primary-Replica architecture
- Primary handles reads and writes, multiple replicas handle reads only
- Designed for availability
- Replicated across 3 zones, 2 nodes per zone
- 6 replicas = I/O Amplification (each write turns into 6 writes)
- Replicating only the redo log to storage (with backups to S3) addresses this
- Can you explain/justify the differences in the two systems. You can consider
a number of possible dimensions such as different design goals, underlying
assumptions, performance, …
- Spanner was designed for external consistency and performance while AWS Aurora was designed for availability
- What are some other design points that can be considered in systems which
don’t have TrueTime?
- NTP can be a substitute but has higher latency
- Typically only used if and when external consistency is needed
- Cockroach DB does this
Other
- What are the names of some of the authors of the papers we talked about? Do
you know where they’re now, what they’re famous for, what else they’ve done?
- Leslie Lamport is a genius who made LaTeX and Paxos. He also received the Turing Award for his contributions to distributed computing.
- Think of some distributed service you use – daily (cloud mail, search, social networks, some service at your work, …). Make an assumption on how they implement some aspect of the distributed system (from Time to Distributed Transactions) and think through the pros/cons of that design decision based on what you assume the system and workload look like.