Summary

01 Intro

notes

  • What are defining characteristics of a distributed system?
    • A distributed system is a collection of computing resources, connected over a network, that behaves as a single coherent system
  • What are some simple models to represent a Distributed System? Why would you choose to use a system model? Using a simple model, can you describe how the system and its state can be represented?
    • We typically represent a distributed system as a graph: nodes are vertices and communication channels between nodes are edges. In richer models you can also represent things like state by labeling the nodes, so a change in state becomes a change in label (going from S1 to S1', for example). A minimal sketch of such a model appears at the end of this section.
  • What do we mean by asynchronous? What about asynchrony makes distributed computing hard?
    • Asynchronous communication means that a message can be sent at one time but not delivered or replied to until arbitrarily later; there are no bounds on message delay or on relative processing speeds. This introduces uncertainty and non-determinism.
  • What about failures makes distributed computing hard?
    • There are many types of failures, and from the message sender’s point of view it is often impossible to tell them apart: a crashed node, a slow node, and a lost message all look the same (no reply).
  • Why is consistency a hard problem in distributed computing?
    • Keeping replicas consistent introduces trade-offs with performance (extra coordination adds latency) and with availability under various types of failures.
  • Pick one of the Fallacies. Can you explain why it’s not true and how does it make distributed computing more challenging?
    • The network is reliable
      • Networks can and do fail. There needs to be error handling around networks failing so requests don’t stall (and infinitely consume resources) AND error handling around retrying requests when failed networks come back.
    • Latency is zero
      • It takes time for the bits to physically move over a wire. That time can change for a number of reasons, even down to what kind of wire you use. If you assume latency is zero, then you could reason that a message is dropped when really it’s just taking its time to get to you.
    • Bandwidth is infinite
      • We are limited by the cables and protocols that we use. If we do not plan for these limitations, we can create artificial bottlenecks.
    • The network is secure
      • Nothing is perfectly secure. We cannot rely on the security of one layer to protect all the layers above it. Bad actors would love for us to ignore this fact.
    • Topology doesn’t change
      • Changes in topology result in changes in latency and bandwidth, so they need to be handled carefully.
    • There is one administrator
      • Nothing can realistically be made and maintained by one person at scale anymore. Adopting a set of common best practices allows the next administrator to understand the reasoning of the previous administrator.
    • Transport cost is zero
      • It’s not “free” to go between layers (like the Application to Transport layers). Like latency, these costs need to be taken into account when designing distributed systems.
    • The network is homogeneous
      • You will at some point need to inter-op with multiple systems/protocols. Preparing for this ahead of time often saves time in the long run.
  • Can you explain “The system gives correct answers always, regardless of failures”? What does it mean with respect to the required properties of the system?
    • A distributed system should provide correct answers as if it were a single coherent entity, even when individual nodes or network links fail. That implies the system must be fault tolerant and must keep its replicas/state consistent enough that failures are not visible in the answers it gives.
  • How does consideration of Latency relate to the observations made by CAP?
    • Latency becomes the relevant trade-off when there is no Partition: in that case the choice is between responding quickly and responding consistently (the “else, latency vs. consistency” refinement of CAP, often called PACELC).
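As a concrete illustration of the graph model described above, here is a minimal Python sketch (the class and method names such as `SystemModel` and `apply_event` are made up for illustration, not from the lectures): nodes are vertices carrying state labels, channels are edges, and an event simply relabels a vertex.

```python
# A minimal sketch of the "distributed system as a labeled graph" model:
# vertices are nodes, edges are channels, and each node carries a state
# label that changes as events occur.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    state: str = "S1"          # current state label

@dataclass
class SystemModel:
    nodes: dict = field(default_factory=dict)      # name -> Node
    channels: set = field(default_factory=set)     # (src, dst) edges

    def add_node(self, name):
        self.nodes[name] = Node(name)

    def connect(self, a, b):
        self.channels.add((a, b))
        self.channels.add((b, a))

    def apply_event(self, name, new_state):
        # an event at a node is modeled as relabeling that vertex
        self.nodes[name].state = new_state

model = SystemModel()
for n in ("A", "B", "C"):
    model.add_node(n)
model.connect("A", "B")
model.connect("B", "C")
model.apply_event("A", "S1'")   # A's state label changes from S1 to S1'
print({n.name: n.state for n in model.nodes.values()})
```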

02 RPC Primer

notes

  • Can you identify and describe the main elements and steps involved in a distributed RPC system?
    • An RPC system includes an API that specifies how to call remote procedures, stubs that handle data (un)marshaling, and a runtime that manages connections and masks certain failures. A typical flow:
      • A programmer writes a spec in an Interface Definition Language (IDL) like protobuf, then uses a tool (like protoc) to compile that IDL into stubs that are included in the source code
      • A server implements the interfaces defined by the stubs and is added to some sort of registry (normally handled by the runtime)
      • A (potentially different) programmer builds a client against the same API
      • When the client calls a procedure, the client stub marshals the arguments into a message and the runtime sends it (potentially over the network)
      • When the server receives the message, its stub unmarshals it and makes the local call to the correct procedure; the result travels back the same way
  • Contrast exactly once, at most once, at least once, invocation semantics – what would be required from the RPC runtime, under which assumptions would such invocation semantics be feasible (or not)…
    • At Least Once
      • If the client can’t verify that a request was executed, it times out and re-transmits until it gets a response; the server may end up executing the same request more than once.
    • At Most Once
      • If the server gets a request that it has already handled (detected e.g. by a request ID), it returns the cached reply instead of re-executing it.
    • Exactly Once
      • Requires combining At Least Once on the client with At Most Once on the server, plus persistence of the server’s duplicate-detection state so it survives failures (see the sketch after this list).
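A minimal sketch of how these semantics compose (illustrative only; the names `RequestCache` and `call_with_retry` are hypothetical, and the real stubs, runtime, marshaling, and persistence are omitted): the client retries until it hears back (at-least-once) while the server deduplicates by request ID and caches replies (at-most-once), which together approximate exactly-once behavior.

```python
# Client-side retry (at-least-once) plus server-side deduplication
# (at-most-once), the combination sketched in the notes above.
import time

class RequestCache:
    """Server side: remember replies by request id (at-most-once)."""
    def __init__(self):
        self.replies = {}

    def handle(self, request_id, fn, *args):
        if request_id in self.replies:          # duplicate -> cached reply
            return self.replies[request_id]
        result = fn(*args)                      # execute once
        self.replies[request_id] = result
        return result

def call_with_retry(send, request_id, retries=3, timeout=1.0):
    """Client side: re-transmit until a reply arrives (at-least-once)."""
    for attempt in range(retries):
        try:
            return send(request_id, timeout=timeout)
        except TimeoutError:
            time.sleep(0.1 * (attempt + 1))     # simple backoff before retrying
    raise RuntimeError("no reply after retries")
```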

03 Time

notes

  • Why is time important and hard in distributed systems?
    • Time is important because it helps us determine an order of events. Determining time is hard because it’s impossible to know how long network transfers will take, there’s no guarantee that nodes have synchronized clocks, and there’s no guarantee there won’t be failures.
  • What is a logical clock? What are logical clocks used for? How is a logical clock related to physical clock?
    • A logical clock is a monotonic virtual timestamp generator. All it is useful for is determining order. Logical clocks don’t necessarily relate to physical clocks, but the ordering of events should be preserved between the different types of clocks.
  • What is required to implement a logical clock? What properties must be followed for this function to behave like a clock? Why/What’s the meaning of the Clock Conditions?
    • A logical clock must be monotonically increasing: its value is incremented on every event. The Clock Condition says that if an event e1 “happened before” an event e2, then the timestamp of e1 must be less than the timestamp of e2: e1 -> e2 => C(e1) < C(e2). (The converse is not required.)
  • Contrast Scalar, Vector, Matrix clocks? What is the cost or overhead associated with each? What is the benefit from each of those/what is the main challenge addressed by introducing each of these/what are the properties provided by each clock?
    • Scalar/Lamport’s
      • Each node keeps its own single counter; a node only knows the timestamps it has computed itself. Cheap (one integer per timestamp), gives an ordering consistent with “happened before,” but cannot distinguish causally related events from concurrent ones.
    • Vector
      • Each node maintains its own view of time as perceived by all nodes (one entry per node). Costs O(n) per timestamp, but lets you tell causality from concurrency. (A sketch of the scalar and vector update rules follows this list.)
    • Matrix
      • Each node maintains its own view of every other node’s view of global time. Costs O(n²) per node, but lets a node reason about what every other node has already seen (useful e.g. for garbage collection).
  • Can you look at the example execution shown in the papers/video and explain how these clocks are getting updated? Can you point out what you are able to tell about some specific events based on the clock values alone?
    • No.
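A minimal sketch of the scalar (Lamport) and vector clock update rules referenced above (illustrative; real implementations piggyback these timestamps on messages):

```python
# Lamport (scalar) and vector clock update rules.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time                 # timestamp piggybacked on the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.id = node_id
        self.clock = [0] * num_nodes

    def local_event(self):
        self.clock[self.id] += 1
        return list(self.clock)

    def send(self):
        return self.local_event()        # attach a copy of the whole vector

    def receive(self, msg_clock):
        # element-wise max with the sender's vector, then tick own entry
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.id] += 1
        return list(self.clock)

# e.g., P0 sends to P1:  t = p0.send(); p1.receive(t)
```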

04 State

notes

  • Understand the meaning and purpose of the concepts distributed system state, consistent cut/snapshot, actual/observed run, …
    • Process State
      • Most recent event/timestamp at that node.
    • Channel State
      • In flight messages.
    • System/Global State
      • Collection of state from all of the nodes/processes and channels.
    • Run
      • A sub-sequence of events.
    • Actual Run
      • Exactly what events actually happened.
    • Observed Run
      • A sub-sequence of events that could have potentially happened.
    • Cut
      • “Vertical” slice of state at a given time. Snapshot.
    • Prerecording Event
      • Event that happens before a cut.
    • Postrecording Event
      • Event that happens after a cut.
    • Stable Property
      • A property that once it becomes true for a state, it remains true for all reachable states.
        • Deadlock
        • Token loss
        • Termination
    • Unstable Property
      • A property that makes no guarantee about truthiness across states.
        • Buffer overflow
        • Load spike
        • Race condition
  • Understand the snapshot algorithm, what are the assumptions under which it’s valid, why are those assumptions necessary/how are they reflected in the algorithm? (See the marker-based sketch after this list.)
  • Can you trace through an execution the consistent state that could be captured by the algorithm?
  • By knowing the state of a property in a given state in the system, what can we tell about that same property at the start/at the end of the execution? Can you provide examples when this would be useful?
    • If a STABLE property is true in the recorded (observed) state, then we know it is also true in the actual final state; likewise, if it is false in the recorded state, then it was false in the initial state and all earlier states. If an UNSTABLE property is true in the recorded state, we can only say it could possibly have been true in the actual run and could possibly still be true in future states. This is especially useful for detecting conditions like deadlock or termination, and for debugging.
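A minimal, simplified sketch of the Chandy-Lamport marker algorithm referenced above (it assumes reliable FIFO channels and that processes don’t fail while the snapshot is running, which is exactly what the algorithm requires; the class and method names are mine, not from the paper):

```python
# Simplified Chandy-Lamport snapshot: record local state on the first marker,
# record in-flight messages on each incoming channel until that channel's
# marker arrives. The snapshot = all recorded_state + all channel_state.

MARKER = object()

class Process:
    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers                  # ids of neighboring processes
        self.state = 0                      # application state (e.g., a counter)
        self.recorded_state = None
        self.channel_state = {p: [] for p in peers}   # recorded in-flight msgs
        self.recording = {p: False for p in peers}

    def start_snapshot(self, send):
        self._record(send)

    def on_message(self, src, msg, send):
        if msg is MARKER:
            if self.recorded_state is None:
                self._record(send)          # first marker: record state now
            self.recording[src] = False     # channel from src is done (empty or recorded)
        else:
            self.state += msg               # normal application message
            if self.recorded_state is not None and self.recording[src]:
                self.channel_state[src].append(msg)   # message was in flight

    def _record(self, send):
        self.recorded_state = self.state
        for p in self.peers:
            self.recording[p] = True
            send(self.pid, p, MARKER)       # marker on every outgoing channel
```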

05 Consensus

notes

  • What is consensus? Explain in your own words/understand all elements of the definition in the papers/video
    • The ability for multiple separate processes in a distributed system to agree on something.
  • What is the goal of the FLP work, what did they want to learn/prove? Provide intuition about the approach they took to achieve this goal.
    • The FLP result proved that in an asynchronous system where even one process may crash, no deterministic consensus protocol can guarantee that it will reach a decision. They proved this by showing that if you start in a bivalent configuration (a configuration from which more than one outcome is still possible), it is always possible, by delaying a message, to reach another bivalent configuration, so the protocol can be kept from ever deciding.
  • State the FLP theorem and provide intuition of the proof/do you understand it? (A compact statement is given after this list.)
  • What’s the intuition about the significance of FLP, in light of much other work of consensus algorithms, replication protocols, …
    • In order to guarantee consensus, you need to change the assumptions of your model, e.g., some synchrony (bounded delays/timeouts), randomization, or failure detectors. This is why practical protocols like Paxos and Raft always preserve safety but can only guarantee liveness under such additional assumptions.
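A compact statement of the result (my paraphrase of the Fischer-Lynch-Paterson theorem, not the paper’s exact wording):

```latex
% Paraphrase of the FLP impossibility result (Fischer, Lynch, Paterson, 1985).
\textbf{Theorem (FLP).} In an asynchronous message-passing system in which at
most one process may fail by crashing, no deterministic protocol can solve
consensus, i.e.\ simultaneously guarantee:
\begin{itemize}
  \item \emph{Agreement}: no two correct processes decide different values;
  \item \emph{Validity}: any decided value was proposed by some process;
  \item \emph{Termination}: every correct process eventually decides.
\end{itemize}
% Safety (agreement, validity) can always be preserved; it is termination
% (liveness) that cannot be guaranteed for every execution.
```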

06 Paxos and Friends

notes

  • Main idea of 2PC and 3PC
    • 2 Phase Commit
      • Vote Collection Phase
        • Coordinator requests votes from each participant
        • Participants send their value back
      • Decision Phase
        • Coordinator collects the votes, makes the decision (for atomic commit, it commits only if every participant voted to commit), and sends the decision out to participants
        • Decision is acknowledged by participants
    • 3 Phase Commit
      • Soliciting Votes
        • Coordinator asks each node if they can commit
        • Each node responds
      • Commit Authorized
        • Coordinator sends preCommit to each node
        • Each node acks
      • Finalizing Commit
        • Coordinator sends doCommit to each node
        • Each node responds with haveCommitted
  • Some history behind PAXOS
    • First Paxos paper written by Leslie Lamport in 1990
      • Not published until 1998
      • Reviewers didn’t appreciate humorous depiction of algorithm
    • Parliament passes decrees, but:
      • Work only part time
      • Communicate by messages
      • Messages may be delayed
      • Parliamentarians may choose not to vote
      • No one is malicious
  • Description of PAXOS? What’s expected from the system (the system model) for the protocol to hold?
    • Agents
      • Operate at arbitrary speed
      • May fail-stop
      • May fail-restart
      • Have source of persistent memory to remember info after restart
    • Messages
      • Can take arbitrarily long to be delivered
      • Can be duplicated
      • Can be lost
      • Can be reordered
      • Can NOT be corrupted
    • Every proposal carries a unique, increasing proposal number (timestamp) so proposals can be ordered
    • 3 phases (a sketch of the proposer/acceptor logic follows this list)
      • Prepare
        • A node proposes an agreement round: it picks a proposal number and asks the acceptors to promise not to accept lower-numbered proposals
      • Accept
        • Gather votes on whether agreement is possible in this round; a value is agreed upon once a majority of acceptors accept it
      • Learn
        • The agreed-upon value can be learned by all
  • What’s hard about determining whom to follow?
    • You need to reach consensus on who should lead you to consensus; competing proposers can repeatedly preempt each other (livelock)
    • All nodes can “sign up” at the same time
      • Random delay between retries helps with this
  • Main ideas that PAXOS assumes: state-machine replication, majority quorum, …
    • State Machine Replication
      • Every node runs the same state machine; as long as all nodes apply the same commands in the same order, they maintain a synchronized view of the global state
    • Majority Quorum
      • A majority (more than half, i.e., at least ⌊N/2⌋ + 1 of N nodes) is needed for decisions
      • Any two majority quorums must overlap in at least one member, so information about earlier decisions persists across rounds
  • Goal and description of the phases in Paxos
  • What’s the functionality Paxos provides? Why we need Multi-Paxos?
    • Single-decree Paxos helps us decide on one value. Multi-Paxos runs a sequence of Paxos instances to agree on the order and values of a sequence of operations (i.e., a replicated log).
  • Motivation for Raft, key idea and differences than Paxos
    • Raft was designed to be more understandable than Paxos. It separates leader election from log replication, uses a strong leader through which all writes flow, and uses randomized election timeouts.
  • Log management in Raft – how is info used to update, discard entries, trim/garbage collection?
    • Writes go to the leader, which appends them to its own log and then replicates (a sketch of the follower-side check follows this list):
      • The leader pushes the new log entry, along with the index/term of the previous entry, to followers (piggybacked on the heartbeat / AppendEntries)
      • Each follower checks whether it has that previous entry; if so, it appends the new entry (discarding any conflicting entries) and sends an ack
      • A log entry is committed at the leader once a majority of followers ack it; the leader then acks the client
      • Old committed entries can be trimmed via snapshotting/log compaction: the snapshot plus the remaining log reconstructs the state
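A minimal single-decree Paxos sketch for the phases above (illustrative only: message passing, persistence of acceptor state, retries, and learners are omitted, and the function/class names are mine, not from the papers):

```python
# Single-decree Paxos: acceptor logic plus the proposer's value-selection rule.

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_v = None      # value accepted with accepted_n

    def prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("nack",)

    def accept(self, n, v):
        """Phase 2b: accept unless a higher number has been promised."""
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n, v)
        return ("nack",)

def propose(acceptors, n, my_value):
    """Proposer: run phases 1 and 2 against a majority of acceptors."""
    promises = [a.prepare(n) for a in acceptors]
    oks = [p for p in promises if p[0] == "promise"]
    if len(oks) <= len(acceptors) // 2:
        return None                              # no majority promised
    # must adopt the value of the highest-numbered accepted proposal, if any
    prior = max(oks, key=lambda p: p[1])
    value = prior[2] if prior[1] >= 0 else my_value
    acks = [a.accept(n, value) for a in acceptors]
    if sum(1 for r in acks if r[0] == "accepted") > len(acceptors) // 2:
        return value                             # value is chosen
    return None

# e.g., propose([Acceptor() for _ in range(3)], n=1, my_value="X") -> "X"
```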
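And a small sketch of the Raft follower-side consistency check during log replication (simplified: term handling, commit-index advancement, and persistence are glossed over; `append_entries` here is my simplification, not the full RPC):

```python
# Follower accepts new entries only if its log already contains the entry
# preceding them; otherwise it rejects so the leader backs up and retries.

def append_entries(log, prev_index, prev_term, entries):
    """log is a list of (term, command); returns True (ack) or False (reject)."""
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index][0] != prev_term:
            return False                      # missing/mismatched previous entry
    # drop any conflicting suffix, then append the leader's entries
    del log[prev_index + 1:]
    log.extend(entries)
    return True

follower_log = [(1, "set x=1")]
ok = append_entries(follower_log, prev_index=0, prev_term=1,
                    entries=[(2, "set y=2")])
print(ok, follower_log)   # True [(1, 'set x=1'), (2, 'set y=2')]
```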

07 Replication

notes

  • Contrast active vs. Stand-by replication
    • Active
      • Every replica handles client requests, and all replicas apply the same updates so they stay in sync
    • Stand-by
      • Only 1 active/primary/leader
      • Updates are kept consistent so fail over can be fast
  • Contrast state vs. Log/RSM replication
    • State
      • After handling a command, the node will send its entire state to replicas
      • Ensures all nodes are perfectly in-sync at all times
      • Could possibly send a lot of data over the network
      • Can save the other nodes from having to re-execute the same commands to arrive at the same result
    • Log/RSM
      • After (or even before) handling a command, the node will send the command to all replicas
      • Doesn’t have to send (potentially large) state over network, but can send (hopefully smaller) command.
      • Useful for systems with large state but relatively cheap per-command execution.
  • What are the problems addressed by chain replication? How are they addressed?
  • What are the problems created by chain replication? How are they addressed in CRAQ (high level)? Can you explain the result from the experimental comparison of CR and CRAQ?
    • In basic chain replication only the tail serves reads, so read throughput doesn’t scale with chain length; the main benefit of CRAQ is that every node can handle read requests (which is why, in the experimental comparison, CRAQ’s read throughput grows with the number of chain nodes while CR’s stays flat)
    • Each node keeps both the old (clean) and new (dirty) versions of an object; if its latest version is still dirty it checks with the tail for the last committed version before answering, and once the new value is fully replicated down the chain every node serves it locally (see the sketch after this list)
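A minimal sketch of CRAQ-style per-node read handling described above (illustrative; the real protocol’s version-query messages and failure handling are omitted, and the class name `CraqNode` is mine):

```python
# CRAQ-style reads: serve a clean version locally; if the newest local version
# is still dirty, ask the tail which version has been committed.

class CraqNode:
    def __init__(self, is_tail=False, tail=None):
        self.is_tail = is_tail
        self.tail = tail                 # reference to the tail node (for queries)
        self.versions = {}               # key -> list of [version, value, clean?]

    def write_propagate(self, key, version, value):
        """Called as a write flows down the chain: store a (dirty) version."""
        self.versions.setdefault(key, []).append([version, value, self.is_tail])

    def commit(self, key, version):
        """Ack flowing back up the chain marks versions clean."""
        for v in self.versions.get(key, []):
            if v[0] <= version:
                v[2] = True

    def read(self, key):
        vs = self.versions.get(key, [])
        if not vs:
            return None
        latest = vs[-1]
        if latest[2]:                    # clean: serve locally
            return latest[1]
        committed = self.tail.latest_committed(key)   # dirty: ask the tail
        return next(v[1] for v in vs if v[0] == committed)

    def latest_committed(self, key):
        return max(v[0] for v in self.versions[key] if v[2])
```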

08 Fault Tolerance

notes

  • What’s the main idea of rollback-recovery as a FT technique?
    • If a failure is detected, roll the system back to a saved consistent state from before the failure (undoing the effects of messages sent/received since then), then continue executing from there
  • What are the differences and trade offs of checkpointing vs. logging as a FT technique? What are all the different metrics you’d need to think about when comparing the two?
    • Checkpointing
      • Save system state, flush to disk
      • Quick recovery
      • Potentially lots of I/O
    • Logging
      • Log each operation/message as it is performed so it can be replayed later
      • Less I/O during normal operation
      • Longer recovery (the log must be replayed)
      • Non-deterministic and dependent events make replay difficult (a sketch combining checkpoints with a log follows this list)
  • Describe and explain the pros/cons/trade offs of coordinated, uncoordinated, communication-induced checkpointing.
    • Uncoordinated
      • Processes take checkpoints independently
      • Need to maintain dependency information
      • Garbage collection is needed
    • Coordinated
      • Processes coordinate and checkpoint all at once
      • Each processes keeps single checkpoint
      • No need for garbage collection
      • Difficult to coordinate without synchronized clocks and with unbounded message delays
      • A process could be told to keep taking checkpoints even if its state never changed
    • Communication-induced
      • Processes take local checkpoints independently, but information piggybacked on application messages can force additional checkpoints, which avoids the domino effect without explicit global coordination
  • We mention a number of factors which influence the choice of a FT technique. Can you reason through some examples, say considering change in storage cost, or system scale, or read/write ratio of the workload, and whether or how those changes would impact the winner among any two of the techniques we discussed?
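A toy sketch of rollback-recovery combining the two techniques above (my own illustration, not from the lectures; file names like `checkpoint.json` are made up): normal operation logs each applied op and periodically checkpoints, and recovery loads the last checkpoint then replays the log.

```python
# Toy rollback-recovery: periodic checkpoints plus an operation log.
# Recovery = load latest checkpoint, then replay logged operations after it.
import json, os

STATE_FILE = "checkpoint.json"   # hypothetical file names
LOG_FILE = "oplog.jsonl"

def checkpoint(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    open(LOG_FILE, "w").close()          # checkpoint makes older log entries unnecessary

def log_op(op):
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(op) + "\n")

def apply_op(state, op):
    state[op["key"]] = op["value"]
    return state

def recover():
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    if os.path.exists(LOG_FILE):
        with open(LOG_FILE) as f:
            for line in f:
                state = apply_op(state, json.loads(line))
    return state

# normal operation: apply + log each op, checkpoint every N ops
state = recover()
for i, op in enumerate([{"key": "x", "value": i} for i in range(5)]):
    state = apply_op(state, op)
    log_op(op)
    if (i + 1) % 3 == 0:
        checkpoint(state)
```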

09 Distributed Transactions

notes

  • What is the concept of TrueTime? How is that related to the Logical Clocks discussed earlier in the semester? How can it be implemented?
    • TrueTime (TT) exposes the uncertainty around real time: instead of a single timestamp, TT.now() returns an interval [earliest, latest] guaranteed to contain the true time
    • It is similar in spirit to the logical clocks from earlier in that it is used to order events, but it is anchored to physical time rather than being purely virtual (a sketch of the commit-wait rule built on it follows this list)
  • How does the use of TrueTime change what is required from the system (Spanner) in order to establish ordering among transactions?
    • GPS receivers and atomic clocks in each datacenter keep the uncertainty interval small; in addition, a transaction must “commit wait” until the uncertainty around its chosen timestamp has passed before making its writes visible, which is what guarantees externally consistent ordering
  • Using TrueTime how do you establish ordering of write operations? How do you implement a read operation? What about reading a snapshot “in the past”?
    • Write
      • On a single replica set:
        • Acquire Locks
        • Pick TT Timestamp s
        • Leader starts consensus
        • Achieve consensus
        • Commit wait done
        • Release locks
        • Notify slaves
      • On transactions across multiple replica sets, we add 2PC:
        • Each replica acquires locks
        • Each replica picks TT Timestamp s
        • Transaction starts executing as before
        • Each replica set starts logging updates they are to perform
        • Once logging is done, each replica set sends its s to the transaction coordinator, which computes the overall timestamp s for the transaction
        • Commit wait
        • Release locks on transaction coordinator
        • Notify participants
        • Participants release locks
    • Read
      • ReadNow
        • Use leader info to determine a “safe” timestamp
          • For single replica set, just use Paxos leader
          • For multi-replica, consider all at Transaction Manager
      • ReadLater
        • Simply read at specific timestamp
        • Having a globally meaningful timestamp saves us from needing a distributed cut/snapshot algorithm to read a consistent past state
  • Describe at a high-level Spanner, how does it organize and replicate data, who is involved in serving read or write operations…
    • Brief
      • Data is geographically distributed
      • Data is sharded within a location
      • Data is replicated across multiple sites within a location
    • Stack
      • 3 Colossus FS nodes at the bottom
      • A Bigtable-like layer on top of Colossus exposes the files as an application-specific data model, grouped into tablets (groups of related objects)
      • A Megastore-like layer: a Replicated State Machine (via Paxos) per tablet
      • Two-phase-locking concurrency control
        • 2PC on top of the Paxos groups for transactions that span replica sets
  • Describe at a high-level Aurora, how does it organize and replicate data, who is involved in serving read or write operations…
    • Follows Primary-Replica architecture
    • Primary handles reads and writes, multiple replicas handle reads only
    • Designed for availability
    • Replicated across 3 availability zones, 2 copies per zone (6 replicas total)
    • 6 replicas would mean I/O amplification (each write would turn into 6 full writes)
      • Shipping only the redo log to the replicated storage layer (with backups to S3), rather than full data pages, addresses this
  • Can you explain/justify the differences in the two systems. You can consider a number of possible dimensions such as different design goals, underlying assumptions, performance, …
    • Spanner was designed for external consistency and performance while AWS Aurora was designed for availability
  • What are some other design points that can be considered in systems which don’t have TrueTime?
    • Clock uncertainty bounds derived from NTP can substitute for TrueTime, but the bounds are much larger, so waiting them out adds latency
      • Typically such waiting is only done if and when external consistency is actually needed
      • CockroachDB takes this approach (hybrid logical clocks with a configured maximum clock offset)
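A small sketch of the TrueTime interval and the commit-wait rule discussed above (the TT.now()/earliest/latest shape follows the Spanner paper; the uncertainty constant and the polling loop here are made up for illustration):

```python
# TrueTime-style commit wait: a transaction takes its commit timestamp
# s = TT.now().latest and then waits until TT.now().earliest > s before
# releasing locks, so no later transaction can get a smaller timestamp.
import time
from collections import namedtuple

TTInterval = namedtuple("TTInterval", ["earliest", "latest"])
EPSILON = 0.007   # assumed clock uncertainty in seconds (illustrative)

def tt_now():
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit_wait():
    s = tt_now().latest          # commit timestamp for the transaction
    while tt_now().earliest <= s:
        time.sleep(0.001)        # wait out the uncertainty (~2 * EPSILON)
    return s                     # now safe to release locks / make writes visible

print("committed at", commit_wait())
```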

Other

  • What are the names of some of the authors of the papers we talked about? Do you know where they’re now, what they’re famous for, what else they’ve done?
    • Leslie Lamport is a genius who made LaTeX and Paxos (and Lamport clocks, and much more). He also received the Turing Award (2013) for his contributions to distributed computing.
  • Think of some distributed service you use – daily (cloud mail, search, social networks, some service at your work, …). Make an assumption on how they implement some aspect of the distributed system (from Time to Distributed Transactions) and think through the pros/cons of that design decision based on what you assume the system and workload look like.