Distributed Machine Learning
Readings
Optional
Summary
- What is Distributed Machine Learning
- Distributed Machine Learning Approaches
- Geo-Distributed ML
- Leverage Approximation
- Gaia
- Tradeoffs of Using Global Model
- Collaborative Learning with Cartel
- Beyond Geo-Distributed Training
What is Distributed Machine Learning
ML is not new, but advances in hardware and software have made it far more practical at scale.
Distributed Machine Learning Approaches
- Centralized: aggregate all data at a central location, train there, then send the trained models back
- Data movement is expensive
- Data sovereignty is also an issue
- Federated models (see the sketch after this list)
- Evaluate/train on data locally
- Periodically aggregate models/parameters centrally
- Disseminate the updates back to the nodes
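A minimal FedAvg-style sketch of one federated round; the toy least-squares model, function names, and constants are illustrative assumptions, not from the lecture:

```python
# Sketch: clients train locally; only model parameters (never raw data)
# are aggregated centrally and then disseminated back.
import numpy as np

def local_update(weights, data, labels, lr=0.1):
    """One local training step (toy least-squares gradient step)."""
    grad = data.T @ (data @ weights - labels) / len(labels)
    return weights - lr * grad

def federated_round(global_weights, clients):
    """Each client trains on its own data; the server averages the results."""
    local_models = [local_update(global_weights.copy(), X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(local_models, axis=0, weights=sizes)  # FedAvg-style aggregation

# Example: three clients whose private data never leaves the node.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(3)
for _ in range(10):
    w = federated_round(w, clients)  # server disseminates the updated model each round
```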
Parameter Server
Data is partitioned across the workers.
Model parameters are distributed across the parameter servers.
- Workers get working set of parameters
- Compute gradients
- Push parameter updates back to the servers
- Servers synchronize and update the model
- Repeat until the change between successive model versions falls below some threshold (sketched below)
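A minimal single-process sketch of this loop; the ParameterServer class, toy linear model, learning rate, and threshold are illustrative (real deployments run workers and servers as separate processes on separate machines):

```python
# Sketch of the parameter-server loop: workers pull parameters, compute
# gradients on their data shard, and push updates; repeat until convergence.
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def pull(self):
        return self.params.copy()      # worker gets its working set of parameters

    def push(self, grad, lr=0.05):
        self.params -= lr * grad       # server applies the update

def worker_gradient(params, X, y):
    """Gradient on one worker's data shard (toy linear model)."""
    return X.T @ (X @ params - y) / len(y)

rng = np.random.default_rng(1)
shards = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(4)]
server = ParameterServer(dim=4)

prev = server.pull()
while True:
    for X, y in shards:                         # each "worker" in turn (serialized here)
        server.push(worker_gradient(server.pull(), X, y))
    curr = server.pull()
    if np.linalg.norm(curr - prev) < 1e-4:      # stop once the model barely changes
        break
    prev = curr
```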
Geo-Distributed ML
Naive Approach
Simply deploy parameter servers and workers across each datacenter.
It works, but becomes increasingly slow, especially when synchronization has to cross WAN boundaries.
Leverage Approximation
Decouple the synchronization of the model within a datacenter from the synchronization of the model among datacenters.
Worker nodes synchronize as usual with their local parameter servers; those parameter servers then sync with remote parameter servers later.
Machine learning models are only ever approximately correct and still converging, so this delay between remote parameter servers is tolerable.
What and how to sync among datacenters?
- Key observation: the vast majority of updates are insignificant, so only the significant ones need to cross the WAN.
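A toy sketch of what "insignificant" could mean: accumulate local updates and only propagate them across the WAN once they change a parameter by more than a relative threshold (the 1% threshold and names are assumptions for illustration):

```python
# Sketch of a significance check for WAN synchronization.
import numpy as np

def is_significant(delta, param, threshold=0.01):
    """True if the accumulated update changes the parameter by >= threshold (relative)."""
    return abs(delta) >= threshold * max(abs(param), 1e-12)

param, pending = 2.0, 0.0
for step_update in np.random.default_rng(2).normal(scale=1e-3, size=100):
    pending += step_update                 # aggregate tiny local updates
    if is_significant(pending, param):
        param += pending                   # only now propagate to remote datacenters
        pending = 0.0
```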
Gaia
Uses the Approximate Synchronous Parallel (ASP) model.
Requirements:
- Significance filter
- ASP selective barrier
- A mechanism to stall workers during remote sync
- Mirror Clock
- Keeps the datacenters loosely in sync and helps determine RTT and link speed
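A rough sketch of the mirror-clock / selective-barrier idea: a datacenter stalls its local workers if its iteration clock runs too far ahead of the slowest mirror. The DRIFT_LIMIT constant and the toy simulation are assumptions for illustration, not Gaia's actual protocol:

```python
# Sketch: bound how far one datacenter may run ahead of its mirrors.
import itertools
import time

DRIFT_LIMIT = 2   # max iterations a datacenter may lead the slowest mirror

def maybe_stall(local_clock, get_mirror_clocks, poll=lambda: time.sleep(0.01)):
    """Block until the slowest remote mirror is within DRIFT_LIMIT iterations."""
    while local_clock - min(get_mirror_clocks()) > DRIFT_LIMIT:
        poll()   # in a real system: wait for clock/update messages from mirrors

# Toy simulation: the remote mirror advances one step each time we "wait".
remote = itertools.count(0)
mirror = [next(remote)]
def advance():                 # stands in for receiving a remote clock message
    mirror[0] = next(remote)

for clock in range(10):
    maybe_stall(clock, lambda: mirror, poll=advance)
    # ... run one local iteration, then the local clock advances ...
```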
Gaia yielded a significant increase in learning performance across WANs compared to running a system designed for a single datacenter.
- Gaia's performance was close to LAN performance
Tradeoffs of Using Global Model
Global models are not always needed:
- Locality and personalization are lost
- Tailored models are smaller and more efficient
- Large global models add training complexity and risk of overfitting
- Data transfer costs are still significant
One alternative is isolated learning:
- Each node builds its own custom model
- A node may not see all the relevant data, which can lead to an inaccurate model
- Loss of efficiency:
- No sharing of data
- No sharing of computation
- These losses hurt especially during concept shift, when a node must adapt on its own
Collaborative Learning with Cartel
Developed by the professor (definitely on the test).
Maintain small customized models at each location/node/context
When a change in the environment or workload pattern occurs, find another node that has observed similar patterns and transfer knowledge from it.
Cartel provides a jump start to adapt to these changes by making it possible to react quickly, find the right peer nodes, and selectively update the model.
Relies on a (logically) centralized registry/DHT to keep metadata about logical neighbors
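A hypothetical sketch of that registry: nodes publish a workload signature, and a node experiencing a shift asks for its most similar peer so it can pull (part of) that peer's model as a warm start. The cosine-similarity metric and the class/method names are illustrative, not Cartel's actual interface:

```python
# Sketch of a Cartel-style registry of "logical neighbors".
import numpy as np

class Registry:
    """Keeps metadata (workload signatures) about logical neighbors."""
    def __init__(self):
        self.signatures = {}                      # node_id -> workload histogram

    def publish(self, node_id, signature):
        self.signatures[node_id] = np.asarray(signature, dtype=float)

    def find_similar(self, node_id, signature):
        """Return the peer whose workload signature is closest (cosine similarity)."""
        sig = np.asarray(signature, dtype=float)
        best, best_score = None, -1.0
        for peer, other in self.signatures.items():
            if peer == node_id:
                continue
            score = sig @ other / (np.linalg.norm(sig) * np.linalg.norm(other) + 1e-12)
            if score > best_score:
                best, best_score = peer, score
        return best

registry = Registry()
registry.publish("edge-A", [0.7, 0.2, 0.1])
registry.publish("edge-B", [0.1, 0.2, 0.7])

# edge-C sees a workload shift resembling edge-B's pattern, so it would
# transfer knowledge (model weights) from edge-B as a jump start.
peer = registry.find_similar("edge-C", [0.15, 0.25, 0.6])   # -> "edge-B"
```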
Benefits:
- Adapts quickly to changes in workload (8x faster adaptation)
- Reduces total data transfer (by 1500x)
- Enables use of smaller models (3x smaller, 5.7x faster training)
Beyond Geo-Distributed Training
Training is only one step in the ML pipeline.
Ray (and related projects) from the RISELab at UC Berkeley provide a framework to handle all steps in the pipeline.
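A minimal Ray task example, assuming Ray is installed (`pip install ray`); the same remote-task API underpins the higher-level libraries for data loading, training, tuning, and serving:

```python
import ray

ray.init()

@ray.remote
def preprocess(shard):
    return [x * 2 for x in shard]          # stand-in for a real pipeline step

# Launch three tasks in parallel and gather their results.
futures = [preprocess.remote(list(range(i, i + 3))) for i in range(0, 9, 3)]
print(ray.get(futures))

ray.shutdown()
```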