Dist Comp Final Study Guide
Readings
- Memcached
- COPS - From L10
- Chord
- Designing distributed algs for mobile networks
- Spark - From L12
- Google MapReduce
- Challenges and Solutions for Fast Remote Persistent Memory Access - From L13/14
- LegoOS - From L13/14
- Borg
- Gaia
- Cartel
- Byzantine Generals Problem
- pBFT
- Dahlia Malkhi’s talk from ATC’18
- The Emerging Landscape of Edge Computing and The Computing Landscape of the 21st Century
- Transactuations: Where Transactions Meet the Physical World
Summary
- Distributed Data Stores - Consistency
- Peer-to-Peer and Mobility
- Data Analytics
- Datacenter Services
- Distributed ML
- pBFT
- Edge and IoT
Distributed Data Stores - Consistency
- What is the goal of Memcached? What does it mean that it’s a “lookaside”
cache? Why did its creators choose this design point?
- Memcached is an in-memory KV store with a lookaside cache design.
- The lookaside cache design means the cache sits beside the db, not in
front.
- The client makes explicit calls to the cache first, then on a cache miss, makes explicit calls to the db. Any updates to the db must call DELETE on the cache (see the sketch below).
- Cache is clean, but non-authoritative.
- Explicit nature allows clients to decide on consistency vs speed trade off.
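A minimal sketch of the look-aside pattern described above; `cache` and `db` are toy in-process stand-ins for a memcached client and a database:

```python
# Minimal sketch of the look-aside pattern (not Facebook's actual code).
# `cache` and `db` are toy stand-ins for a memcached client and a database.
cache, db = {}, {"user:1": "alice"}

def read(key):
    value = cache.get(key)           # 1. try the cache first
    if value is None:                # 2. on a miss, fall back to the DB
        value = db[key]
        cache[key] = value           # 3. populate the cache for next time
    return value

def write(key, value):
    db[key] = value                  # writes go to the authoritative store
    cache.pop(key, None)             # then DELETE (not update) the cache entry
```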
- What is one Memcached problem addressed via Leases? Would this problem have
occurred if this weren’t a “lookaside” cache, and, if so, would Leases still
be an appropriate solution?
- Concurrent `DBRead -> DBSet -> CacheSet` sequences can arrive out of order, which can leave stale data in the cache. With leases, on a miss the cache hands out a token, and a later set is accepted only if that token is still valid (sketch below). This problem could also be avoided by using an implicit cache in front of (as opposed to beside) the db; all transactions would then go through the cache layer, which would act as an implicit lease.
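A toy sketch of the lease idea, loosely following the mechanism in the Facebook memcache paper (real leases also rate-limit token issuance):

```python
import itertools

# Toy sketch of leases (the real memcached extension is more involved).
# On a miss the cache hands out a lease token; a later set is accepted
# only if that token is still valid, and DELETE invalidates tokens.
_token = itertools.count(1)
store, leases = {}, {}          # key -> value, key -> outstanding token

def get(key):
    if key in store:
        return store[key], None
    token = next(_token)        # miss: grant a lease for this key
    leases[key] = token
    return None, token

def set_with_lease(key, value, token):
    if leases.get(key) != token:
        return False            # token stale: a DELETE raced this set
    store[key] = value
    del leases[key]
    return True

def delete(key):
    store.pop(key, None)
    leases.pop(key, None)       # invalidates any outstanding lease
```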
- Memcached relies on client-side functionality for several operations. What
are those and why is client-side processing useful?
- Routing, serialization, compression, and error handling are pushed to the client side. Keeping the servers simple and stateless with respect to each other makes them easier to scale, and it lets clients make their own decisions about tradeoffs.
- How is Memcached scaled to multiple clusters?
- Multiple Memcached clusters can be mirrored to help with performance for very popular objects and to add failure domains across clusters.
- What about across geo-distributed datacenters?
- Memcached relies on replication at the storage layer across datacenters. When a client writes in a remote (non-master) region, it sets a remote marker on the key, so the cached value is treated as deleted, then writes to the master database. Until the write has replicated from the master to the remote replica database, reads that find the marker are directed to the master region; once replication completes, reads are served from the remote database again.
- Why are invalidations or markers used in the multi-cluster or geo-distributed
configurations but not in the single cluster design?
- In multi-cluster scenarios, one large Memcached cluster is broken up into smaller clusters and requests are sharded across them. This gives the benefit of having multiple places the cache can be served from, but it also adds complexity: invalidation mechanisms are needed to keep the multiple clusters in sync. This is not needed in a single-cluster design because a cluster is always in sync with itself.
- What is the problem with the read/write ordering guaranteed by a system like
Memcached that is addressed with the COPS work?
- Causal dependencies are not visible at the system level, so writes can become visible at replicas in an order that violates causality.
- How does the COPS solution solve this problem? What is Causal+
consistency/how is this different than just Causal?
- The client library embeds ordering metadata (dependencies) on PUT requests, turning them into PUT_AFTER operations (sketch below). Causal+ is causal consistency plus convergent conflict handling: replicas that see the same conflicting writes converge to the same value.
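An illustrative sketch of the client-library dependency tracking; `put_after` follows the paper's operation name, but the single-node `ToyStore` and its method signatures are simplifications:

```python
# Illustrative sketch of COPS-style dependency tracking; put_after follows
# the paper's name but the store here is a toy single-node stand-in.
class ToyStore:
    def __init__(self):
        self.data = {}                       # key -> (value, version)
        self.clock = 0

    def get(self, key):
        return self.data.get(key, (None, 0))

    def put_after(self, key, value, deps):
        # a real COPS replica delays applying a replicated write until
        # every (key, version) in deps is locally visible; a single node
        # trivially satisfies its own dependencies
        self.clock += 1
        self.data[key] = (value, self.clock)
        return self.clock

class CopsClient:
    def __init__(self, store):
        self.store, self.deps = store, {}

    def get(self, key):
        value, version = self.store.get(key)
        self.deps[key] = version             # track what we've seen
        return value

    def put(self, key, value):
        version = self.store.put_after(key, value, dict(self.deps))
        self.deps = {key: version}           # the new write subsumes old deps
        return version
```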
Peer-to-Peer and Mobility
- What is the goal of Peer-to-Peer systems? What are some fundamental
differences with other types of distributed systems: e.g., considering a
single datacenter, or even a set of datacenters as what was described in the
Spanner and other systems discussed in L9?
- Peer-to-Peer systems were created to work without the need for a centralized server. The only thing they assume is the IP stack.
- What are the tradeoffs with possible strategies to find the appropriate peer
in a P2P system?
- Centralized Registry
- Single RTT to find correct peer IP
- Centralized trust
- Examples:
- Napster
- Flood/Gossip based protocols
- Remove centralized coordinator
- No bound on lookup time
- Examples:
- Gnutella
- Bitcoin
- Provide Structured Routing Trees
- Distributed Hash Table (DHT)
- Decentralized index
- Probabilistic lookup time
- Examples:
- Chord
- Kademlia
- DynamoDB
- What’s the basic goal and functionality of a DHT? What could be stored in the
DHT elements? (i.e., What kind of information can be stored in a DHT in a P2P
system?)
- Every client can look up which server is responsible for a key based on a hash of the request, and servers can join/leave the group at any time. The DHT elements can store data items themselves or just pointers (e.g., the IP addresses of peers holding the content).
- How does Chord operate? What is stored in the DHT? What happens when an
element and data is added to Chord? What happens when a lookup is performed?
What about when a node fails/is removed?
- The DHT is a ring. When a client wants to look up some value, it computes `NodeValue = HASH(req)` and makes the request at that position on the ring. If no node exists at `NodeValue`, the request travels clockwise (linearly) to the next node, the key's successor. (A sketch follows below.)
- What is the purpose of the fingers tables/what problem do they solve? What
information do they store, how/when are they updated, how are they used?
- Naive lookups are O(n). Finger tables give each node information about which nodes serve distant key ranges: at node `n`, the i-th finger entry starts at `(n + 2^i) mod 2^m` and covers a range of `2^i` identifiers. Each hop via a finger at least halves the remaining distance, so lookups take O(log n). (A sketch follows below.)
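Continuing the toy ring sketch above (reusing `nodes`, `M`, and `bisect_left`), a node's finger table might be built like this:

```python
# Sketch of node n's finger table in the same 2^M identifier space as
# above; successor_id is the ring lookup applied directly to an id.
def successor_id(ident):
    i = bisect_left(nodes, ident)
    return nodes[i % len(nodes)]

def finger_table(n):
    # finger i points at the first node at or after (n + 2^i) mod 2^M;
    # following fingers at least halves the remaining distance each hop,
    # which is what makes lookups O(log N)
    return [successor_id((n + 2 ** i) % (2 ** M)) for i in range(M)]

print(finger_table(nodes[0]))
```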
- What are the problems that are addressed with hierarchical designs of
distributed systems?
- Hierarchical designs are more easily able to take advantage of other features of the system. For example, if we know we are operating in multiple datacenters across a wide-area network, we may try to come up with a system that leverages the locality of nodes within a rack.
- What are the problems with using a DHT across all mobile nodes as peers for
the basic SEARCH and INSERT operations in a mobile network?
- Mobile networks typically have a central trunk of a wired network with branches of wireless networks and leaves of mobile hosts. DHT treats all nodes (and connections) as equal, so it won’t take into consideration the heterogeneity of the system.
Data Analytics
- What are the different strategies how you can scale data analytics to larger
input sizes and to larger processing complexity? What are the fundamental
trade-offs that they introduce?
- Common Techniques:
- Data Parallel
- Divide data, assign to nodes for processing
- How do you know how to load balance?
- Pipelining
- Each node only does one thing
- Data flows through a sequence of tasks
- Increases throughput
- Model Parallelism
- Divide state of application across nodes
- Each node has less to process based on its state
- Input is passed to all nodes, output is combined from all nodes
- How are these strategies combined in the original design of MapReduce
presented by Google?
- MapReduce uses a mixture of data parallelism and pipelining. The input is initially split up and mapped in parallel into intermediate KV pairs. Those intermediate pairs flow down the pipeline to the reduce workers, which combine the values for each key into the final output. The fact that the reducers each process part of the keyspace and the results are then combined is a form of model parallelism. (A toy word-count sketch follows below.)
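A minimal single-process word count that mimics MapReduce's map/shuffle/reduce structure (the real system distributes these phases across machines):

```python
from collections import defaultdict
from itertools import chain

# Toy single-process word count showing MapReduce's structure; the real
# system runs map tasks and reduce tasks on different machines.
def map_fn(doc):
    return [(word, 1) for word in doc.split()]          # intermediate KV pairs

def reduce_fn(word, counts):
    return (word, sum(counts))                          # combine per key

docs = ["the quick brown fox", "the lazy dog"]
shuffled = defaultdict(list)
for k, v in chain.from_iterable(map(map_fn, docs)):     # "shuffle" phase:
    shuffled[k].append(v)                               # group by key
print([reduce_fn(k, vs) for k, vs in shuffled.items()])
```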
- What is the goal of using intermediate files to store results from each
operation in a MapReduce pipeline? What are the pros and cons of this
decision?
- The goal of intermediate files was fault tolerance. Because these files can always be re-read, if a reduce task fails, a new reducer can spin up in its place. This introduces a requirement for persistent I/O, so the pipeline is limited by the speed of that I/O.
- How are these problems addressed by Spark? What is the key idea in Spark?
- Spark was created to provide faster analytics over various data types. The main idea is to allow in-memory data sharing to avoid the I/O problem of MapReduce.
- How do RDDs and lineage speed up processing of analytics applications?
- RDDs keep data in memory between stages, avoiding disk I/O and serialization overhead. Also, lineage is lazy: transformations aren't executed until an action needs the result.
- How do RDDs and lineage provide support for fault-tolerance?
- Given that the original data is known, the lineage (which is essentially the list of transformations applied to the data) can be replayed to regenerate the expected outcome.
- Do you understand the Spark log mining example: what are the different
constructs, what is their purpose, how are they used?…
- Yes, similar to bash pipelining; a PySpark rendition follows below.
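The paper's log mining example is written in Scala; a PySpark rendition (the `hdfs://logs` path is a placeholder) looks roughly like:

```python
from pyspark import SparkContext

# PySpark rendition of the Spark paper's log-mining example.
sc = SparkContext("local", "log-mining")
lines = sc.textFile("hdfs://logs")                      # base RDD (lazy)
errors = lines.filter(lambda l: l.startswith("ERROR"))  # transformation (lazy)
errors.cache()                                          # keep in memory once computed
# actions below trigger actual execution over the lineage
mysql_count = errors.filter(lambda l: "MySQL" in l).count()
hdfs_times = (errors.filter(lambda l: "HDFS" in l)
                    .map(lambda l: l.split("\t")[3])
                    .collect())
```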
- How does the information about or embedded in RDDs help with placement and
scheduling?
- If we know two datasets will be used together later, we can co-partition them, which ensures that the `join` operation at the end happens with matching keys already on the same machine, avoiding network shuffle overhead (sketch below).
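A hedged sketch of co-partitioning in PySpark; `sc` is the SparkContext from the previous sketch, and the data and partition count are illustrative:

```python
# Sketch of co-partitioning two pair RDDs so a later join is local.
pages = sc.parallelize([("url1", "content"), ("url2", "content")])
ranks = sc.parallelize([("url1", 1.0), ("url2", 0.5)])
pages_p = pages.partitionBy(8).cache()  # same partitioner on both sides;
ranks_p = ranks.partitionBy(8).cache()  # caching matters if they're reused
joined = pages_p.join(ranks_p)          # co-partitioned: no shuffle needed
```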
Datacenter Services
- What are some of the emerging technology trends in datacenters? What is
motivating their adoption?
- Hardware specialization has emerged to compensate for the slowing of Moore's law.
- What is RDMA, what are the benefits it provides?
- Remote Direct Memory Access lets the NIC read and write remote memory directly, bypassing the remote CPU (and the OS kernel) when communicating data over an interconnect.
- Why and how does RDMA change how one could perform RPC among a client and
server in a datacenter? What are, if any, the performance implications of
such RPC designs compared to a traditional RPC implementation?
- RDMA offers multiple communication modes (one-sided vs. two-sided, connected vs. connectionless), so the RPC API might need to change to accommodate them. Designing RPC implementations to leverage certain RDMA features (such as shared receive queues) can increase performance compared to a standard implementation.
- What is non-volatile/persistent memory (NVM), what are the benefits it
provides?
- NVM is memory that is persistent like hard drives, but performs similarly to standard DRAM. NVM also provides capacity similar to hard drives but maintains byte addressability like standard DRAM.
- Why and how does NVM change how one could implement RPC over Remote Direct
Non-Volatile Memory Access?
- With NVM, you no longer need to flush data from memory to disk to make it durable, which ultimately speeds up synchronous (persistent) RPC calls.
- What is the goal of disaggregation? What are the technology assumptions that
are required in order to make resource disaggregation possible?
- Disaggregation allows us to scale only the resources that are actually being stressed. The motivation is workload imbalance and resource inefficiency: a workload that needs more memory than a traditional server has will also acquire extra CPU and storage when it scales out, even though all it needs is RAM.
- Explain the Split-kernel design, what are the goals it aims to achieve?
- The OS functions are split into monitors, each of which runs on a different hardware device; all monitors communicate over the network. This allows specific hardware components to scale independently of other components.
- Explain the design and implementation of LegoOS, and whether/how does it meet
the split-kernel design goals?
- LegoOS was only emulated, using monolithic servers that each ignore certain resource types. Controllers/monitors were implemented as Linux kernel modules, connected via an RDMA network, and communicated using RPC over RDMA.
- Explain the result on the performance comparison of LegoOS vs. Linux vs.
Infiniswap discussed in the Lesson.
- LegoOS actually performed better than standard Linux and Infiniswap systems when under (relatively extreme) memory limitations.
- Comment on the complexity of resource management in datacenter systems, where
are some of the contributing factors?
- Datacenters can have thousands of rackmounted servers, both generalized and specialized. They are also responsible for running different types of applications (batch job, long running service, prod vs non-prod environments, multiple tenants), and some tasks within an application can be latency and/or throughput sensitive. The goal is to deploy an application that best utilizes these resources, while still meeting SLA goals.
- The design of Borg, based on Borgmaster, scheduler, and Borglet, and the
manner Borg distributes functionality across these components, helps it to
achieve greater scalability and efficiency. Explain how the design,
mechanisms and optimizations in each of these components contribute to these
goals.
- Borg groups machines into server classes and scores them on a number of factors (e.g., low latency, throughput); applications are grouped into application classes that prioritize a subset of these factors, and jobs are scheduled on Borglets whose server class ranks highly for those factors. The Borgmaster is replicated for reliability, the scheduler runs asynchronously from it, and optimizations such as score caching, per-application equivalence classes, and relaxed randomization keep scheduling scalable. Borglets are polled by the Borgmaster, so an unresponsive machine cannot block it.
- Explain the select result from the Borg paper included in the Lesson.
- By pooling resources across workload types, Borg gets by with far fewer machines; the segregated alternatives evaluated would need on the order of 20-150+% more machines.
Distributed ML
- Contrast Federated Learning to a more naïve centralized learning approach
- In the naïve centralized approach, all raw data is copied to one location for training. Federated learning leaves data where it is generated and decouples the synchronization between workers and their local parameter servers from the synchronization between parameter servers, which allows for more efficient syncing over the WAN.
- Explain the role of Parameter Server in the distributed machine learning
system
- Parameter servers hold the shared model: workers compute updates (e.g., gradients) on their data shards and push them to the parameter servers, which aggregate them and serve the refreshed parameters back between iterations.
- What are the problems with naively scaling a system such as Parameter Server,
designed for a single datacenter, to a geo-distributed setting?
- The entire system slows down: scarce, variable WAN bandwidth and high latencies between datacenters throttle the frequent parameter synchronization.
- How does Gaia address these problems? What are the main components of the
system?
- Gaia performs syncs as normal from worker nodes to parameter servers within the same datacenter, but batches and periodically performs remote syncs across datacenters.
- Uses the Approximate Synchronous Parallel (ASP) model.
- What is an Approximate Synchronous Parallel model?
- Optimizations leveraging the fact that ML models are only approximately correct anyway.
- Significance filter
- Only send an update over the WAN when it makes a significant difference (see the sketch after this list)
- Selective Barrier
- A mechanism to stall workers during remote sync
- Mirror Clock
- To keep datacenters in sync and determine RTT/speed.
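A toy sketch of the significance filter; the 1% threshold and function shape are illustrative, not Gaia's actual code:

```python
# Toy version of Gaia's significance filter: a worker accumulates local
# updates and shares one over the WAN only when it is significant.
THRESHOLD = 0.01    # illustrative 1% significance threshold

def maybe_sync(key, accumulated_delta, current_value, send):
    significance = abs(accumulated_delta) / max(abs(current_value), 1e-8)
    if significance >= THRESHOLD:
        send(key, accumulated_delta)   # significant: propagate to other DCs
        return 0.0                     # reset the local accumulator
    return accumulated_delta           # otherwise keep aggregating locally
```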
- What are the performance benefits of this model – specifically be able to
explain why, what is being eliminated in terms of costs, overheads… in order
to get to the specific performance gain.
- The Gaia model focuses on being selective about what data to send on remote syncs. This limits the amount of data being sent over WAN connections.
- Can you explain the different configurations of the experiments evaluated to
generate the data in the results shown in the Lesson?
- Multiple servers were spun up in different regions around the world. The most drastic differences were noticed when servers worked across WAN boundaries, especially moving across country boundaries: California -> Virginia wasn't as bad as Singapore -> São Paulo.
- What are some advantages of a decentralized peer-to-peer model over a
federated or a centralized approach?
- Peer-to-peer models tend to be smaller and send less data over the network.
- Adapts quickly to changes in workload
- What are some of the challenges in realizing a peer-to-peer learning system
such as Cartel?
- Less sharing of data can result in more computation duplication
- Did you take a look at Ray? Any thoughts?
- Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system’s control state.
pBFT
- Describe the Byzantine Generals problem.
- How do you reach consensus when you allow:
- failed nodes to continue participating
- potentially incorrect behavior/messages, whether malicious or arbitrary
- Why can you not solve the Byzantine Generals problem with a consensus
algorithm like Paxos?
- Neither the nodes nor the messages can be trusted; Paxos assumes nodes fail only by crashing or dropping messages, not by lying.
- Describe the pBFT algorithm. What is the goal of each of the phases in the
algorithm?
- Goals: consensus with safety, liveness, and validity, tolerating `f` Byzantine failures in an asynchronous network (with 3f+1 replicas).
- Phases: pre-prepare (the primary assigns the request a sequence number and multicasts it), prepare (replicas multicast PREPARE; collecting 2f matching prepares fixes the request's order within the current view), and commit (replicas multicast COMMIT; 2f+1 commits ensure the order survives view changes). The client accepts the result after f+1 matching replies.
- Why does Paxos need 2f+1 nodes to tolerate f failures, and pBFT needs 3f+1?
- Paxos assumes all actors are benign, so majority quorums of f+1 out of 2f+1 always intersect in at least one correct, up-to-date node. pBFT's quorums must tolerate failures AND bad actors: with n = 3f+1 and quorums of 2f+1, any two quorums overlap in at least f+1 nodes, so even if f of those are Byzantine, one honest node is in both (arithmetic below).
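The quorum-intersection arithmetic, spelled out:

```python
# Quorum-intersection arithmetic behind the 2f+1 vs. 3f+1 requirements.
f = 2

# Paxos: n = 2f+1, majority quorums of f+1 always intersect in >= 1 node.
n_paxos, q_paxos = 2 * f + 1, f + 1
assert 2 * q_paxos - n_paxos >= 1

# pBFT: n = 3f+1, quorums of 2f+1 intersect in >= f+1 nodes, so even if
# f of the overlapping nodes are Byzantine, one honest node is in both.
n_pbft, q_pbft = 3 * f + 1, 2 * f + 1
assert 2 * q_pbft - n_pbft == f + 1
```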
- Relate on a high-level a blockchain distributed ledger and a Paxos log
- Both are distributed logs of actions that need to be agreed upon by the majority of servers. A blockchain ledger must also be able to verify a correct ledger against a false ledger.
- Why is Paxos or Raft consensus not used in practice in Blockchain
technologies?
- It’s impossible to guarantee the 3f+1 fault-tolerance threshold because the number of participating nodes is not known (membership is open, and one party can pretend to be many nodes). Plus, communication costs are O(n^2) per decision for pBFT (and up to O(n^3) during view changes), which is wildly impractical at blockchain scale.
- What is the role of the PoW and cryptocurrency incentives used in Blockchain
consensus solutions?
- Proof-of-work makes it non-trivially difficult for a server to join the network, and cryptocurrency incentives reward servers for good behavior. The combination of these makes Byzantine attacks less likely to occur.
- Did you read (or at least take a look at) James Mickens’ “The Saddest Moment”?
- Yes
Edge and IoT
- In Satya et al.’s taxonomy, what are the different tiers of computing
infrastructure/systems? What are the unique characteristics of these tiers?
- Clouds
- Cloudlets
- Luggable, vehicular, mini-datacenter
- IoT
- Drones, security cameras, smartphones
- On-Person
- What is the motivation behind edge computing? What are some of the driving
use cases? What distinguishes it from current Cloud Computing or even CDNs?
- Newer workloads (HD Video, AR/VR, SmartCity & Automation) increase demand for bandwidth-intensive and latency-sensitive systems.
- Working from home also shifts connectivity needs.
- Cloud computing and CDNs rely on centralized computation with the results shared out (CDNs only cache/distribute content produced elsewhere), while edge computing focuses on performing computation where the data is collected.
- What are some assumptions of distributed systems solutions developed for
datacenters, or even geo-distributed datacenters, that are no longer valid
when considering edge computing? Why?
- Edge is not elastic.
- Typically edge nodes are single servers that are deployed and not touched until they physically break.
- Chatty protocols are inappropriate.
- Networks are not necessarily as robust with edge deployments and there is typically other traffic on those networks as well.
- Why are distributed transactions insufficient to perform consistent state
updates in distributed IoT-based systems at the edge?
- Actuations of a system interact more directly with the real world. It would be inappropriate to set off a fire alarm multiple times to ensure at-least-once semantics as we do with data center messages.
- How does the transactuation concept relate to transactions? What are some
additional features that are introduced in transactuations that make it
possible to build correct IoT-based distributed applications?
- Transactuations provide atomic durability of actuations (see the sketch after this list).
- Sensing invariant
- Transactuation executes only when staleness of its sensor reads is bounded, as per specified sensing policy.
- Sensing policy
- How much staleness is acceptable?
- How many failed sensors are acceptable?
- Example: at least one CO2 sensor must have been read within the last 5 minutes
- Actuation invariant
- When a transactuation commits its app states, enough actuations have succeeded as per actuation policy.
- Actuation policy
- How many failed actuations are acceptable?
- At least one alarm should turn on.
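A hypothetical sketch restating the two invariants as code; this is not the paper's actual API:

```python
import time

# Hypothetical sketch of a transactuation's commit rule; this is NOT the
# paper's actual API, just the two invariants restated as code.
MAX_STALENESS_S = 300    # sensing policy: reads no older than 5 minutes

def run_transactuation(sensor_reads, actuations, min_successes, commit):
    # Sensing invariant: execute only if at least one sensor read is fresh.
    now = time.time()
    if all(now - r["timestamp"] > MAX_STALENESS_S for r in sensor_reads):
        return "aborted: no fresh sensor reads"
    # Fire actuations; a real system would retry and time out per device.
    succeeded = sum(1 for actuate in actuations if actuate())
    # Actuation invariant: commit app state only if the policy is met.
    if succeeded >= min_successes:
        commit()
        return "committed"
    return "aborted: actuation policy not met"

# Example: at least one alarm must turn on before app state commits.
result = run_transactuation(
    sensor_reads=[{"timestamp": time.time()}],    # one fresh CO2 reading
    actuations=[lambda: True, lambda: False],     # one alarm works, one fails
    min_successes=1,
    commit=lambda: print("app state committed"),
)
print(result)   # -> committed
```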
- In the evaluation of the system, why did the authors pick the specific
metrics we mention in the Lesson, what did they want to demonstrate/learn
about Transactuations by using these metrics?
- They wanted to make sure their tests were sufficiently general. They picked
metrics across a few areas:
- Convenience
- Energy Efficiency
- Safety
- Security