What is Distributed Machine Learning

ML is not new, but hardware and software advancements make it more practical.

Distributed Machine Learning Approaches

  1. Aggregate all data to a central location, train, then send models back
    • Data movement is expensive
    • Data sovereignty is also an issue
  2. Federated models
    • Evaluate data locally
    • Periodically aggregate model/parameters centrally
    • Disseminate updates
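
The federated approach can be sketched as a minimal single-process simulation. This assumes a one-parameter linear model and weighted averaging of locally trained parameters; the names `local_update`, `federated_round`, and `node_datasets` are hypothetical and not from the lecture.

```python
def local_update(weight, data, lr=0.1, epochs=5):
    """Train y = w * x on one node's private data; raw data never leaves the node."""
    w = weight
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, node_datasets):
    """One round: local training, central weighted averaging, then dissemination."""
    local_ws = [local_update(global_w, data) for data in node_datasets]
    total = sum(len(data) for data in node_datasets)
    # Weight each node's parameter by how much data it trained on.
    return sum(w * len(data) for w, data in zip(local_ws, node_datasets)) / total


# Three nodes, each holding private samples of y = 2x (made-up data).
node_datasets = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]

global_w = 0.0
for _ in range(20):
    # The returned value is what would be sent back out to every node.
    global_w = federated_round(global_w, node_datasets)

print(global_w)
```

Only the scalar parameter crosses the network in each round, which is the point of the approach: the expensive and sovereignty-sensitive data stays where it was collected.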

Parameter Server

Training data is distributed across workers.

Model parameters are distributed across parameter servers.

  1. Workers pull their working set of parameters from the servers
  2. Workers compute gradients on their local data
  3. Workers push updates to the servers
  4. Servers synchronize and update the model
  5. Repeat until the change from one model iteration to the next falls below some threshold
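
The loop above can be sketched in a single process. This is a simplified illustration, not a real parameter-server framework: it assumes one server instead of several, synchronous workers, and a two-parameter linear model. `ParameterServer`, `worker_gradient`, and `train` are hypothetical names.

```python
class ParameterServer:
    """Holds the model parameters (a single server here, for brevity)."""

    def __init__(self, dim):
        self.params = [0.0] * dim

    def pull(self):
        return list(self.params)

    def push(self, gradients, lr):
        """Average the workers' gradients, apply them, return the largest change."""
        n = len(gradients)
        max_change = 0.0
        for i in range(len(self.params)):
            step = lr * sum(g[i] for g in gradients) / n
            self.params[i] -= step
            max_change = max(max_change, abs(step))
        return max_change


def worker_gradient(params, shard):
    """Gradient of mean squared error for y = w * x + b on one data shard."""
    w, b = params
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / len(shard)
    gb = sum(2 * (w * x + b - y) for x, y in shard) / len(shard)
    return [gw, gb]


def train(shards, lr=0.05, threshold=1e-9, max_iters=100_000):
    server = ParameterServer(dim=2)
    iterations = 0
    for _ in range(max_iters):
        iterations += 1
        params = server.pull()                                # step 1
        grads = [worker_gradient(params, s) for s in shards]  # step 2
        change = server.push(grads, lr)                       # steps 3 and 4
        if change < threshold:                                # step 5
            break
    return server.params, iterations


# Two workers, each with a shard of samples from y = 3x + 1 (made-up data).
shards = [[(0.0, 1.0), (1.0, 4.0)], [(2.0, 7.0), (3.0, 10.0)]]
params, iterations = train(shards)
print(params, iterations)
```

In a real deployment the pull and push calls are network requests, which is why the cost of each synchronization round matters so much once workers and servers sit in different datacenters.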

Geo-Distributed ML

Naive Approach

Simply deploy parameter servers and workers across each datacenter.

This works, but it is much slower, especially when synchronization crosses WAN boundaries.

Leverage Approximation

Decouple synchronization of the model within a datacenter from synchronization of the model among datacenters.

Worker nodes sync as normal with their local parameter servers. The local parameter servers then sync with remote parameter servers later.

Machine learning models are only ever approximately correct while they converge, so this delay between remote parameter servers isn’t that bad.

What and how to sync among datacenters?

  • The majority of updates are insignificant, so only the significant ones need to cross datacenter boundaries.


Gaia uses the Approximate Synchronous Parallel (ASP) model:


  • Significance filter
    • Holds back insignificant updates and sends only significant ones to remote datacenters
  • ASP selective barrier
    • A mechanism to stall workers during remote sync
  • Mirror Clock
    • To keep datacenters in sync and determine RTT and speed.
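
A significance filter can be sketched as follows. This is an illustrative assumption, not Gaia's actual implementation: it accumulates updates per parameter and releases the aggregate for remote sync only once its magnitude relative to the current parameter value reaches a threshold. The class name, the 1% threshold, and the handling of zero-valued parameters are all hypothetical.

```python
class SignificanceFilter:
    """Accumulates local updates; releases them for remote sync once significant."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold
        self.pending = {}  # parameter key -> accumulated, not-yet-sent update

    def add_update(self, key, delta, current_value):
        """Return the aggregated update to send remotely, or None to keep waiting."""
        total = self.pending.get(key, 0.0) + delta
        # Assumption: fall back to an absolute threshold when the value is zero.
        denom = abs(current_value) if current_value != 0 else 1.0
        if abs(total) / denom >= self.threshold:
            self.pending.pop(key, None)
            return total
        self.pending[key] = total
        return None


f = SignificanceFilter(threshold=0.01)
# Three small local updates to the same parameter, whose value is 1.0.
sent = [f.add_update("w0", 0.004, current_value=1.0) for _ in range(3)]
print(sent)
```

The first two updates stay local because their running sum is below 1% of the parameter value; the third pushes the sum over the threshold, so one aggregated update crosses the WAN instead of three.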

This resulted in a significant increase in training performance when running across a WAN, compared with running a system designed for a single datacenter.

  • Gaia's performance was close to LAN performance

Tradeoffs of Using Global Model

Global Models are not always needed:

  • Locality and personalization are lost
  • Tailored models are smaller and more efficient
  • Large models are complex to train and can overfit
  • Data transfer costs are still high

One alternative is isolated learning:

  • Each node builds its own custom model
  • A node may not have all the data, which can lead to an inaccurate model
  • Loss of efficiency
    • No sharing of data
    • No sharing of computation
    • Especially during concept shift

Collaborative Learning with Cartel

Developed by professor (Definitely on the test)

Maintain small customized models at each location/node/context

When a change in the environment/workload pattern occurs, find another node that has observed similar patterns and transfer knowledge from it.

Cartel provides a jump start to adapt to these changes by making it possible to react quickly, find the right peer nodes, and selectively update the model.

Relies on a (logically) centralized registry/DHT to keep metadata about logical neighbors
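
The registry lookup can be sketched as below. This is a hypothetical illustration of the idea, not Cartel's actual design: it assumes each node reports the class distribution it has recently observed as metadata, and that similarity is measured with a simple L1 distance. `Registry`, `report`, and `find_similar` are made-up names.

```python
class Registry:
    """Logically centralized store of per-node metadata."""

    def __init__(self):
        self.metadata = {}  # node id -> {class label: observed fraction}

    def report(self, node_id, class_distribution):
        self.metadata[node_id] = dict(class_distribution)

    def find_similar(self, node_id):
        """Return the other node whose reported distribution is closest, or None."""
        target = self.metadata[node_id]
        best_id, best_dist = None, float("inf")
        for other_id, other in self.metadata.items():
            if other_id == node_id:
                continue
            labels = set(target) | set(other)
            # L1 distance between the two distributions (an assumed metric).
            dist = sum(abs(target.get(l, 0.0) - other.get(l, 0.0)) for l in labels)
            if dist < best_dist:
                best_id, best_dist = other_id, dist
        return best_id


registry = Registry()
registry.report("edge-a", {"cat": 0.7, "dog": 0.3})
registry.report("edge-b", {"cat": 0.1, "dog": 0.9})
registry.report("edge-c", {"cat": 0.2, "dog": 0.2, "bird": 0.6})

# The workload at edge-a shifts: it suddenly sees mostly birds.
registry.report("edge-a", {"cat": 0.15, "dog": 0.25, "bird": 0.6})
peer = registry.find_similar("edge-a")
print(peer)
```

Here `edge-a` would ask `edge-c` for knowledge, since that node has already seen a similar mix. Only metadata goes through the registry; the model transfer itself happens between the two peers.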


  • Adapts quickly to changes in workload
    • 8x
  • Reduces total data transfer
    • 1500x
  • Enables use of smaller models
    • 3x smaller
    • 5.7x faster training

Beyond Geo-Distributed Training

Training is only one step in the ML pipeline.

Ray (and friends), from the RISELab at UC Berkeley, provides a framework to handle all steps in the pipeline.