Dist Comp Final Study Guide
Readings
- Memcached
- COPS - From L10
- Chord
- Designing distributed algs for mobile networks
- Spark - From L12
- Google MapReduce
- Challenges and Solutions for Fast Remote Persistent Memory Access - From L13/14
- LegoOS - From L13/14
- Borg
- Gaia
- Cartel
- Byzantine Generals Problem
- pBFT
- Dahlia Malkhi’s talk from ATC’18
- The Emerging Landscape of Edge Computing and The Computing Landscape of the 21st Century
- Transactuations: Where Transactions Meet the Physical World
Summary
- Distributed Data Stores - Consistency
- Peer-to-Peer and Mobility
- Data Analytics
- Datacenter Services
- Distributed ML
- pBFT
- Edge and IoT
Distributed Data Stores - Consistency
- What is the goal of Memcached? What does it mean that it’s a “lookaside”
cache? Why did its creators choose this design point?
- Memcached is an in-memory KV store with a lookaside cache design.
- The lookaside cache design means the cache sits beside the db, not in
front.
- The client makes explicit calls to the cache first, then on a cache miss, makes explicit calls to the db. Any updates to the db must call DELETE on the cache (see the sketch below).
- Cache is clean, but non-authoritative.
- Explicit nature allows clients to decide on consistency vs speed trade off.
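A minimal sketch of the look-aside pattern described above; `cache` and `db` are toy in-process stand-ins for a memcached client and a database:

```python
# Minimal sketch of the look-aside pattern (not Facebook's actual code).
# `cache` and `db` are toy stand-ins for a memcached client and a database.
cache, db = {}, {"user:1": "alice"}

def read(key):
    value = cache.get(key)           # 1. try the cache first
    if value is None:                # 2. on a miss, fall back to the DB
        value = db[key]
        cache[key] = value           # 3. populate the cache for next time
    return value

def write(key, value):
    db[key] = value                  # writes go to the authoritative store
    cache.pop(key, None)             # then DELETE (not update) the cache entry
```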
- What is one Memcached problem addressed via Leases? Would this problem have
occurred if this weren’t a “lookaside” cache, and, if so, would Leases still
be an appropriate solution?
- Concurrent `DBRead -> DBSet -> CacheSet` sequences can arrive out of order, which can leave stale data in the cache. With leases, on a miss the cache hands out a token, and a later set is accepted only if that token is still valid (sketch below). This problem could also be avoided by using an implicit cache in front of (as opposed to beside) the db; all transactions would then go through the cache layer, which would act as an implicit lease.
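A toy sketch of the lease idea, loosely following the mechanism in the Facebook memcache paper (real leases also rate-limit token issuance):

```python
import itertools

# Toy sketch of leases (the real memcached extension is more involved).
# On a miss the cache hands out a lease token; a later set is accepted
# only if that token is still valid, and DELETE invalidates tokens.
_token = itertools.count(1)
store, leases = {}, {}          # key -> value, key -> outstanding token

def get(key):
    if key in store:
        return store[key], None
    token = next(_token)        # miss: grant a lease for this key
    leases[key] = token
    return None, token

def set_with_lease(key, value, token):
    if leases.get(key) != token:
        return False            # token stale: a DELETE raced this set
    store[key] = value
    del leases[key]
    return True

def delete(key):
    store.pop(key, None)
    leases.pop(key, None)       # invalidates any outstanding lease
```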
- Memcached relies on client-side functionality for several operations. What
are those and why is client-side processing useful?
- Routing, serialization, compression, and error handling are pushed to the client side. Keeping the servers simple and stateless with respect to each other makes them easier to scale, and it lets clients make their own decisions about tradeoffs.
- How is Memcached scaled to multiple clusters?
- Multiple Memcached clusters can be mirrored to help with performance for very popular objects and to add failure domains across clusters.
- What about across geo-distributed datacenters?
- Memcached relies on replication at the storage layer across datacenters. When a client writes in a remote (non-master) region, it sets a remote marker on the key, so the cached value is treated as deleted, then writes to the master database. Until the write has replicated from the master to the remote replica database, reads that find the marker are directed to the master region; once replication completes, reads are served from the remote database again.
- Why are invalidations or markers used in the multi-cluster or geo-distributed
configurations but not in the single cluster design?
- In multi-cluster scenarios, one large Memcached cluster is broken up into smaller clusters and requests are sharded across them. This gives the benefit of having multiple places the cache can be served from, but it also adds complexity: invalidation mechanisms are needed to keep the multiple clusters in sync. This is not needed in a single-cluster design because a cluster is always in sync with itself.
- What is the problem with the read/write ordering guaranteed by a system like
Memcached that is addressed with the COPS work?
- Causal dependencies are not visible at the system level, so writes can become visible at replicas in an order that violates causality.
- How does the COPS solution solve this problem? What is Causal+
consistency/how is this different than just Causal?
- The client library embeds ordering metadata (dependencies) on PUT requests, turning them into PUT_AFTER operations (sketch below). Causal+ is causal consistency plus convergent conflict handling: replicas that see the same conflicting writes converge to the same value.
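An illustrative sketch of the client-library dependency tracking; `put_after` follows the paper's operation name, but the single-node `ToyStore` and its method signatures are simplifications:

```python
# Illustrative sketch of COPS-style dependency tracking; put_after follows
# the paper's name but the store here is a toy single-node stand-in.
class ToyStore:
    def __init__(self):
        self.data = {}                       # key -> (value, version)
        self.clock = 0

    def get(self, key):
        return self.data.get(key, (None, 0))

    def put_after(self, key, value, deps):
        # a real COPS replica delays applying a replicated write until
        # every (key, version) in deps is locally visible; a single node
        # trivially satisfies its own dependencies
        self.clock += 1
        self.data[key] = (value, self.clock)
        return self.clock

class CopsClient:
    def __init__(self, store):
        self.store, self.deps = store, {}

    def get(self, key):
        value, version = self.store.get(key)
        self.deps[key] = version             # track what we've seen
        return value

    def put(self, key, value):
        version = self.store.put_after(key, value, dict(self.deps))
        self.deps = {key: version}           # the new write subsumes old deps
        return version
```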
Peer-to-Peer and Mobility
- What is the goal of Peer-to-Peer systems? What are some fundamental
differences with other types of distributed systems: e.g., considering a
single datacenter, or even a set of datacenters as what was described in the
Spanner and other systems discussed in L9?
- Peer-to-Peer systems were created to work without the need for a centralized server. The only thing they assume is the IP stack.
- What are the tradeoffs with possible strategies to find the appropriate peer
in a P2P system?
- Centralized Registry
- Single RTT to find correct peer IP
- Centralized trust
- Examples:
- Napster
- Flood/Gossip based protocols
- Remove centralized coordinator
- No bound on lookup time
- Examples:
- Gnutella
- Bitcoin
- Provide Structured Routing Trees
- Distributed Hash Table (DHT)
- Decentralized index
- Probabilistic lookup time
- Examples:
- Chord
- Kademlia
- DynamoDB
- What’s the basic goal and functionality of a DHT? What could be stored in the
DHT elements? (i.e., What kind of information can be stored in a DHT in a P2P
system?)
- Every client can look up which server is responsible for a key based on a hash of the request, and servers can join/leave the group at any time. The DHT elements can store data items themselves or just pointers (e.g., the IP addresses of peers holding the content).
- How does Chord operate? What is stored in the DHT? What happens when an
element and data is added to Chord? What happens when a lookup is performed?
What about when a node fails/is removed?
- The DHT is a ring. When a client wants to look up some value, it computes `NodeValue = HASH(req)` and makes the request at that position on the ring. If no node exists at `NodeValue`, the request travels clockwise (linearly) to the next node, the key's successor. (A sketch follows below.)
- What is the purpose of the fingers tables/what problem do they solve? What
information do they store, how/when are they updated, how are they used?
- Naive lookups are O(n). Finger tables give each node information about which nodes serve distant key ranges: at node `n`, the i-th finger entry starts at `(n + 2^i) mod 2^m` and covers a range of `2^i` identifiers. Each hop via a finger at least halves the remaining distance, so lookups take O(log n). (A sketch follows below.)
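Continuing the toy ring sketch above (reusing `nodes`, `M`, and `bisect_left`), a node's finger table might be built like this:

```python
# Sketch of node n's finger table in the same 2^M identifier space as
# above; successor_id is the ring lookup applied directly to an id.
def successor_id(ident):
    i = bisect_left(nodes, ident)
    return nodes[i % len(nodes)]

def finger_table(n):
    # finger i points at the first node at or after (n + 2^i) mod 2^M;
    # following fingers at least halves the remaining distance each hop,
    # which is what makes lookups O(log N)
    return [successor_id((n + 2 ** i) % (2 ** M)) for i in range(M)]

print(finger_table(nodes[0]))
```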
- What are the problems that are addressed with hierarchical designs of
distributed systems?
- Hierarchical designs are more easily able to take advantage of other features of the system. For example, if we know we are operating in multiple datacenters across a wide-area network, we may try to come up with a system that leverages the locality of nodes within a rack.
- What are the problems with using a DHT across all mobile nodes as peers for
the basic SEARCH and INSERT operations in a mobile network?
- Mobile networks typically have a central trunk of a wired network with branches of wireless networks and leaves of mobile hosts. DHT treats all nodes (and connections) as equal, so it won’t take into consideration the heterogeneity of the system.
Data Analytics
- What are the different strategies how you can scale data analytics to larger
input sizes and to larger processing complexity? What are the fundamental
trade-offs that they introduce?
- Common Techniques:
- Data Parallel
- Divide data, assign to nodes for processing
- How do you know how to load balance?
- Pipelining
- Each node only does one thing
- Data flows through a sequence of tasks
- Increases throughput
- Model Parallelism
- Divide state of application across nodes
- Each node has less to process based on its state
- Input is passed to all nodes, output is combined from all nodes
- How are these strategies combined in the original design of MapReduce
presented by Google?
- MapReduce uses a mixture of data parallelism and pipelining. The input is initially split up and mapped in parallel into intermediate KV pairs. Those intermediate pairs flow down the pipeline to the reduce workers, which combine the values for each key into the final output. The fact that the reducers each process part of the keyspace and the results are then combined is a form of model parallelism. (A toy word-count sketch follows below.)
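A minimal single-process word count that mimics MapReduce's map/shuffle/reduce structure (the real system distributes these phases across machines):

```python
from collections import defaultdict
from itertools import chain

# Toy single-process word count showing MapReduce's structure; the real
# system runs map tasks and reduce tasks on different machines.
def map_fn(doc):
    return [(word, 1) for word in doc.split()]          # intermediate KV pairs

def reduce_fn(word, counts):
    return (word, sum(counts))                          # combine per key

docs = ["the quick brown fox", "the lazy dog"]
shuffled = defaultdict(list)
for k, v in chain.from_iterable(map(map_fn, docs)):     # "shuffle" phase:
    shuffled[k].append(v)                               # group by key
print([reduce_fn(k, vs) for k, vs in shuffled.items()])
```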
- What is the goal of using intermediate files to store results from each
operation in a MapReduce pipeline? What are the pros and cons of this
decision?
- The goal of intermediate files was fault tolerance. Because these files can always be re-read, if a reduce task fails, a new reducer can spin up in its place. This introduces a requirement for persistent I/O, so the pipeline is limited by the speed of that I/O.
- How are these problems addressed by Spark? What is the key idea in Spark?
- Spark was created to provide faster analytics over various data types. The main idea is to allow in-memory data sharing to avoid the I/O problem of MapReduce.
- How do RDDs and lineage speed up processing of analytics applications?
- RDDs keep data in memory between stages, avoiding disk I/O and serialization overhead. Also, lineage is lazy: transformations aren't executed until an action needs the result.
- How do RDDs and lineage provide support for fault-tolerance?
- Given that the original data is known, the lineage (which is essentially the list of transformations applied to the data) can be replayed to regenerate the expected outcome.
- Do you understand the Spark log mining example: what are the different
constructs, what is their purpose, how are they used?…
- Yes, similar to bash pipelining; a PySpark rendition follows below.
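The paper's log mining example is written in Scala; a PySpark rendition (the `hdfs://logs` path is a placeholder) looks roughly like:

```python
from pyspark import SparkContext

# PySpark rendition of the Spark paper's log-mining example.
sc = SparkContext("local", "log-mining")
lines = sc.textFile("hdfs://logs")                      # base RDD (lazy)
errors = lines.filter(lambda l: l.startswith("ERROR"))  # transformation (lazy)
errors.cache()                                          # keep in memory once computed
# actions below trigger actual execution over the lineage
mysql_count = errors.filter(lambda l: "MySQL" in l).count()
hdfs_times = (errors.filter(lambda l: "HDFS" in l)
                    .map(lambda l: l.split("\t")[3])
                    .collect())
```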
- How does the information about or embedded in RDDs help with placement and
scheduling?
- If we know two datasets will be used together later, we can co-partition them, which ensures that the `join` operation at the end happens with matching keys already on the same machine, avoiding network shuffle overhead (sketch below).
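A hedged sketch of co-partitioning in PySpark; `sc` is the SparkContext from the previous sketch, and the data and partition count are illustrative:

```python
# Sketch of co-partitioning two pair RDDs so a later join is local.
pages = sc.parallelize([("url1", "content"), ("url2", "content")])
ranks = sc.parallelize([("url1", 1.0), ("url2", 0.5)])
pages_p = pages.partitionBy(8).cache()  # same partitioner on both sides;
ranks_p = ranks.partitionBy(8).cache()  # caching matters if they're reused
joined = pages_p.join(ranks_p)          # co-partitioned: no shuffle needed
```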
Datacenter Services
- What are some of the emerging technology trends in datacenters? What is
motivating their adoption?
- Hardware specialization has emerged to compensate for the slowing of Moore's law.
- What is RDMA, what are the benefits it provides?
- Remote Direct Memory Access lets the NIC read and write remote memory directly, bypassing the remote CPU (and the OS kernel) when communicating data over an interconnect.
- Why and how does RDMA change how one could perform RPC among a client and
server in a datacenter? What are, if any, the performance implications of
such RPC designs compared to a traditional RPC implementation?
- RDMA offers multiple communication modes (one-sided vs. two-sided, connected vs. connectionless), so the RPC API might need to change to accommodate them. Designing RPC implementations to leverage certain RDMA features (such as shared receive queues) can increase performance compared to a standard implementation.
- What is non-volatile/persistent memory (NVM), what are the benefits it
provides?
- NVM is memory that is persistent like hard drives, but performs similarly to standard DRAM. NVM also provides capacity similar to hard drives but maintains byte addressability like standard DRAM.
- Why and how does NVM change how one could implement RPC over Remote Direct
Non-Volatile Memory Access?
- With NVM, you no longer need to flush data from memory to disk to make it durable, which ultimately speeds up synchronous (persistent) RPC calls.
- What is the goal of disaggregation? What are the technology assumptions that
are required in order to make resource disaggregation possible?
- Disaggregation allows us to scale only the resources that are actually being stressed. The motivation is workload imbalance and resource inefficiency: a workload that needs more memory than a traditional server has will also acquire extra CPU and storage when it scales out, even though all it needs is RAM.
- Explain the Split-kernel design, what are the goals it aims to achieve?
- The OS functions are split into monitors, each of which runs on a different hardware device; all monitors communicate over the network. This allows specific hardware components to scale independently of other components.
- Explain the design and implementation of LegoOS, and whether/how does it meet
the split-kernel design goals?
- LegoOS was only emulated, using monolithic servers that each ignore certain resource types. Controllers/monitors were implemented as Linux kernel modules, connected via an RDMA network, and communicated using RPC over RDMA.
- Explain the result on the performance comparison of LegoOS vs. Linux vs.
Infiniswap discussed in the Lesson.
- LegoOS actually performed better than standard Linux and Infiniswap systems when under (relatively extreme) memory limitations.
- Comment on the complexity of resource management in datacenter systems, where
are some of the contributing factors?
- Datacenters can have thousands of rackmounted servers, both generalized and specialized. They are also responsible for running different types of applications (batch job, long running service, prod vs non-prod environments, multiple tenants), and some tasks within an application can be latency and/or throughput sensitive. The goal is to deploy an application that best utilizes these resources, while still meeting SLA goals.
- The design of Borg, based on Borgmaster, scheduler, and Borglet, and the
manner Borg distributes functionality across these components, helps it to
achieve greater scalability and efficiency. Explain how the design,
mechanisms and optimizations in each of these components contribute to these
goals.
- Borg groups machines into server classes and scores them on a number of factors (e.g., low latency, throughput); applications are grouped into application classes that prioritize a subset of these factors, and jobs are scheduled on Borglets whose server class ranks highly for those factors. The Borgmaster is replicated for reliability, the scheduler runs asynchronously from it, and optimizations such as score caching, per-application equivalence classes, and relaxed randomization keep scheduling scalable. Borglets are polled by the Borgmaster, so an unresponsive machine cannot block it.
- Explain the select result from the Borg paper included in the Lesson.
- By pooling resources across workload types, Borg gets by with far fewer machines; the segregated alternatives evaluated would need on the order of 20-150+% more machines.
Distributed ML
- Contrast Federated Learning to a more naïve centralized learning approach
- In the naïve centralized approach, all raw data is copied to one location for training. Federated learning leaves data where it is generated and decouples the synchronization between workers and their local parameter servers from the synchronization between parameter servers, which allows for more efficient syncing over the WAN.
- Explain the role of Parameter Server in the distributed machine learning
system
- Parameter servers hold the shared model: workers compute updates (e.g., gradients) on their data shards and push them to the parameter servers, which aggregate them and serve the refreshed parameters back between iterations.
- What are the problems with naively scaling a system such as Parameter Server,
designed for a single datacenter, to a geo-distributed setting?
- The entire system slows down: scarce, variable WAN bandwidth and high latencies between datacenters throttle the frequent parameter synchronization.
- How does Gaia address these problems? What are the main components of the
system?
- Gaia performs syncs as normal from worker nodes to parameter servers within the same datacenter, but batches and periodically performs remote syncs across datacenters.
- Uses the Approximate Synchronous Parallel (ASP) model.
- What is an Approximate Synchronous Parallel model?
- Optimizations leveraging the fact that ML models are only approximately correct anyway.
- Significance filter
- Only send an update over the WAN when it makes a significant difference (see the sketch after this list)
- Selective Barrier
- A mechanism to stall workers during remote sync
- Mirror Clock
- To keep datacenters in sync and determine RTT/speed.
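A toy sketch of the significance filter; the 1% threshold and function shape are illustrative, not Gaia's actual code:

```python
# Toy version of Gaia's significance filter: a worker accumulates local
# updates and shares one over the WAN only when it is significant.
THRESHOLD = 0.01    # illustrative 1% significance threshold

def maybe_sync(key, accumulated_delta, current_value, send):
    significance = abs(accumulated_delta) / max(abs(current_value), 1e-8)
    if significance >= THRESHOLD:
        send(key, accumulated_delta)   # significant: propagate to other DCs
        return 0.0                     # reset the local accumulator
    return accumulated_delta           # otherwise keep aggregating locally
```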
- What are the performance benefits of this model – specifically be able to
explain why, what is being eliminated in terms of costs, overheads… in order
to get to the specific performance gain.
- The Gaia model focuses on being selective about what data to send on remote syncs. This limits the amount of data being sent over WAN connections.
- Can you explain the different configurations of the experiments evaluated to
generate the data in the results shown in the Lesson?
- Multiple servers were spun up in different regions around the world. The most drastic differences were noticed when servers worked across WAN boundaries, especially moving across country boundaries: California -> Virginia wasn't as bad as Singapore -> São Paulo.
- What are some advantages of a decentralized peer-to-peer model over a
federated or a centralized approach?
- Peer-to-peer models tend to be smaller and send less data over the network.
- Adapts quickly to changes in workload
- What are some of the challenges in realizing a peer-to-peer learning system
such as Cartel?
- Less sharing of data can result in more computation duplication
- Did you take a look at Ray? Any thoughts?
- Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system’s control state.
pBFT
- Describe the Byzantine Generals problem.
- How do you reach consensus when you allow:
- failed nodes to continue participating
- potentially incorrect behavior/messages, whether malicious or arbitrary
- Why can you not solve the Byzantine Generals problem with a consensus
algorithm like Paxos?
- Neither the nodes nor the messages can be trusted; Paxos assumes nodes fail only by crashing or dropping messages, not by lying.
- Describe the pBFT algorithm. What is the goal of each of the phases in the
algorithm?
- Goals: consensus with safety, liveness, and validity, tolerating `f` Byzantine failures in an asynchronous network (with 3f+1 replicas).
- Phases: pre-prepare (the primary assigns the request a sequence number and multicasts it), prepare (replicas multicast PREPARE; collecting 2f matching prepares fixes the request's order within the current view), and commit (replicas multicast COMMIT; 2f+1 commits ensure the order survives view changes). The client accepts the result after f+1 matching replies.
- Why does Paxos need 2f+1 nodes to tolerate f failures, and pBFT needs 3f+1?
- Paxos assumes all actors are benign, so majority quorums of f+1 out of 2f+1 always intersect in at least one correct, up-to-date node. pBFT's quorums must tolerate failures AND bad actors: with n = 3f+1 and quorums of 2f+1, any two quorums overlap in at least f+1 nodes, so even if f of those are Byzantine, one honest node is in both (arithmetic below).
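The quorum-intersection arithmetic, spelled out:

```python
# Quorum-intersection arithmetic behind the 2f+1 vs. 3f+1 requirements.
f = 2

# Paxos: n = 2f+1, majority quorums of f+1 always intersect in >= 1 node.
n_paxos, q_paxos = 2 * f + 1, f + 1
assert 2 * q_paxos - n_paxos >= 1

# pBFT: n = 3f+1, quorums of 2f+1 intersect in >= f+1 nodes, so even if
# f of the overlapping nodes are Byzantine, one honest node is in both.
n_pbft, q_pbft = 3 * f + 1, 2 * f + 1
assert 2 * q_pbft - n_pbft == f + 1
```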
- Relate on a high-level a blockchain distributed ledger and a Paxos log
- Both are distributed logs of actions that need to be agreed upon by the majority of servers. A blockchain ledger must also be able to verify a correct ledger against a false ledger.
- Why is Paxos or Raft consensus not used in practice in Blockchain
technologies?
- It’s impossible to guarantee the 3f+1 fault-tolerance threshold because the number of participating nodes is not known (membership is open, and one party can pretend to be many nodes). Plus, communication costs are O(n^2) per decision for pBFT (and up to O(n^3) during view changes), which is wildly impractical at blockchain scale.
- What is the role of the PoW and cryptocurrency incentives used in Blockchain
consensus solutions?
- Proof-of-work makes it non-trivially difficult for a server to join the network, and cryptocurrency incentives reward servers for good behavior. The combination of these makes Byzantine attacks less likely to occur.
- Did you read (or at least take a look at) James Mickens’ “The Saddest Moment”?
- Yes
Edge and IoT
- In Satya et al.’s taxonomy, what are the different tiers of computing
infrastructure/systems? What are the unique characteristics of these tiers?
- Clouds
- Cloudlets
- Luggable, vehicular, mini-datacenter
- IoT
- Drones, security cameras, smartphones
- On-Person
- What is the motivation behind edge computing? What are some of the driving
use cases? What distinguishes it from current Cloud Computing or even CDNs?
- Newer workloads (HD Video, AR/VR, SmartCity & Automation) increase demand for bandwidth-intensive and latency-sensitive systems.
- Working from home also shifts connectivity needs.
- Cloud computing and CDNs rely on centralized computation with the results shared out (CDNs only cache/distribute content produced elsewhere), while edge computing focuses on performing computation where the data is collected.
- What are some assumptions of distributed systems solutions developed for
datacenters, or even geo-distributed datacenters, that are no longer valid
when considering edge computing? Why?
- Edge is not elastic.
- Typically edge nodes are single servers that are deployed and not touched until they physically break.
- Chatty protocols are inappropriate.
- Networks are not necessarily as robust with edge deployments and there is typically other traffic on those networks as well.
- Why are distributed transactions insufficient to perform consistent state
updates in distributed IoT-based systems at the edge?
- Actuations of a system interact more directly with the real world. It would be inappropriate to set off a fire alarm multiple times to ensure at-least-once semantics as we do with data center messages.
- How does the transactuation concept relate to transactions? What are some
additional features that are introduced in transactuations that make it
possible to build correct IoT-based distributed applications?
- Transactuations provide atomic durability of actuations (see the sketch after this list).
- Sensing invariant
- Transactuation executes only when staleness of its sensor reads is bounded, as per specified sensing policy.
- Sensing policy
- How much staleness is acceptable?
- How many failed sensors are acceptable?
- Example: at least one CO2 sensor must have been read within the last 5 minutes
- Actuation invariant
- When a transactuation commits its app states, enough actuations have succeeded as per actuation policy.
- Actuation policy
- How many failed actuations are acceptable?
- At least one alarm should turn on.
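A hypothetical sketch restating the two invariants as code; this is not the paper's actual API:

```python
import time

# Hypothetical sketch of a transactuation's commit rule; this is NOT the
# paper's actual API, just the two invariants restated as code.
MAX_STALENESS_S = 300    # sensing policy: reads no older than 5 minutes

def run_transactuation(sensor_reads, actuations, min_successes, commit):
    # Sensing invariant: execute only if at least one sensor read is fresh.
    now = time.time()
    if all(now - r["timestamp"] > MAX_STALENESS_S for r in sensor_reads):
        return "aborted: no fresh sensor reads"
    # Fire actuations; a real system would retry and time out per device.
    succeeded = sum(1 for actuate in actuations if actuate())
    # Actuation invariant: commit app state only if the policy is met.
    if succeeded >= min_successes:
        commit()
        return "committed"
    return "aborted: actuation policy not met"

# Example: at least one alarm must turn on before app state commits.
result = run_transactuation(
    sensor_reads=[{"timestamp": time.time()}],    # one fresh CO2 reading
    actuations=[lambda: True, lambda: False],     # one alarm works, one fails
    min_successes=1,
    commit=lambda: print("app state committed"),
)
print(result)   # -> committed
```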
- In the evaluation of the system, why did the authors pick the specific
metrics we mention in the Lesson, what did they want to demonstrate/learn
about Transactuations by using these metrics?
- They wanted to make sure their tests were sufficiently general. They picked
metrics across a few areas:
- Convenience
- Energy Efficiency
- Safety
- Security