Required Readings


Communication Support Assumed So Far

We typically have used application/service-level namespace

  • Process names
  • File names
  • Object keys

We need to map these to network level

  • IP Addresses
  • Network paths through switches and routers

Introduce intermediary metadata service, called Overlay Network

Created and updated on control plane

Need to make sure things

  • Scale
  • Geo-replicate
  • Handle failures
  • Span multiple administrative domains

This may introduce overhead on extremely dynamic systems

Interconnect Support

Hardware and drivers

Assume network has support for:

  • Broadcast/multicast
  • Gather/all-reduce
  • Barrier
  • Atomics (e.g. CompareAndSwap (CAS))
  • Timing
  • Remote Direct Memory Access (RDMA)
  • Direct Cache Injection (DDIO)

Peer to Peer Systems

P2P systems only assume that systems can communicate over IP. No other assumptions are made for things like scale, structure, topology, etc.

Connectivity in P2P

How to find right peer?

  • Centralized Registry
    • Single RTT to find correct peer IP
    • Centralized trust
    • Examples:
      • Napster
  • Flood/Gossip based protocols
    • Remove centralized coordinator
    • No bound on lookup time
    • Examples:
      • Gnutella
      • Bitcoin
  • Provide Structured Routing Trees
    • Distributed Hash Table (DHT)
    • Decentralized index
    • Probabilistic lookup time
    • Examples:
      • Chord
      • Kademlia
      • DynamoDB

Distributed Hash Table

Each client in distributed system uses same hash function

DHT is like a normal HT, except keys are sharded between nodes


DHT is represented as a ring of all numbers from 0 - N-1


N must be sufficiently large to avoid collision

N is max number of nodes

NodeValue = SHA(KEY)

If node exists at NodeValue, update. Else, go to next node that does exist

Lookups needs to be need for immediate node and successors

Each node contains Finger Table:

  • at each node n, ith finger entry starts at (n + 2i) for a range of 2i elements
  • Lookup goes from O(N) to O(log(N))

Nodes joining, departing

  • Redistribute data
  • Update finger tables
  • Improve performance with additional metadata

Probabilistic guarantees about performance

Hierarchical Systems

Cost of communication vs cost of maintaining overlay

Nodes with different properties:

  • Point to Point communication
  • Stability, failure probability, Mobility
  • Number and types of nodes
  • Communication patters, locality

Consider hybrid approaches and hierarchical designs

Heterogeneous Systems A Mobile Network Example

Two types of nodes:

  • Mobile Support Stations (MSS)
    • Stationary
    • High speed wired network
    • Power availability is not a concern
  • Mobile Host (MH)
    • Belong to a cell associated with an MSS
    • Mobile
    • Low(er) speed mobile network
    • Battery considerations


  • Fast lookup of MH
  • Low overhead updates of overlay state
    • Communication overhead
    • Batter/energy/compute overhead

Alternative Algorithms


  • SEARCH(lookup) cost
  • INSERT(add/remove node)
  • Impact of update required to support mobility

Time = Number (and type) of connections that need to be traversed