Distributed Machine Learning
Readings
Optional
Summary
- What is Distributed Machine Learning
- Distributed Machine Learning Approaches
- Geo-Distributed ML
- Leverage Approximation
- Gaia
- Tradeoffs of Using Global Model
- Collaborative Learning with Cartel
- Beyond Geo-Distributed Training
What is Distributed Machine Learning
ML is not new, but advances in hardware and software have made it far more practical at scale.
Distributed Machine Learning Approaches
- Centralized: aggregate all data at a central location, train there, then send the trained models back
- Data movement is expensive
- Data sovereignty is also an issue
- Federated models (see the sketch after this list)
- Evaluate/train on data locally
- Periodically aggregate models/parameters centrally
- Disseminate the updates back to the nodes
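A minimal FedAvg-style sketch of one federated round; the toy least-squares model, function names, and constants are illustrative assumptions, not from the lecture:

```python
# Sketch: clients train locally; only model parameters (never raw data)
# are aggregated centrally and then disseminated back.
import numpy as np

def local_update(weights, data, labels, lr=0.1):
    """One local training step (toy least-squares gradient step)."""
    grad = data.T @ (data @ weights - labels) / len(labels)
    return weights - lr * grad

def federated_round(global_weights, clients):
    """Each client trains on its own data; the server averages the results."""
    local_models = [local_update(global_weights.copy(), X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(local_models, axis=0, weights=sizes)  # FedAvg-style aggregation

# Example: three clients whose private data never leaves the node.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(3)
for _ in range(10):
    w = federated_round(w, clients)  # server disseminates the updated model each round
```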
Parameter Server
Data is partitioned across the workers.
Model parameters are distributed across the parameter servers.
- Workers get working set of parameters
- Compute gradients
- Push parameter updates back to the servers
- Servers synchronize and update the model
- Repeat until the change between successive model versions falls below some threshold (sketched below)
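A minimal single-process sketch of this loop; the ParameterServer class, toy linear model, learning rate, and threshold are illustrative (real deployments run workers and servers as separate processes on separate machines):

```python
# Sketch of the parameter-server loop: workers pull parameters, compute
# gradients on their data shard, and push updates; repeat until convergence.
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def pull(self):
        return self.params.copy()      # worker gets its working set of parameters

    def push(self, grad, lr=0.05):
        self.params -= lr * grad       # server applies the update

def worker_gradient(params, X, y):
    """Gradient on one worker's data shard (toy linear model)."""
    return X.T @ (X @ params - y) / len(y)

rng = np.random.default_rng(1)
shards = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(4)]
server = ParameterServer(dim=4)

prev = server.pull()
while True:
    for X, y in shards:                         # each "worker" in turn (serialized here)
        server.push(worker_gradient(server.pull(), X, y))
    curr = server.pull()
    if np.linalg.norm(curr - prev) < 1e-4:      # stop once the model barely changes
        break
    prev = curr
```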
Geo-Distributed ML
Naive Approach
Simply deploy parameter servers and workers across each datacenter.
It works, but becomes increasingly slow, especially when synchronization has to cross WAN boundaries.
Leverage Approximation
Decouple the synchronization of the model within a datacenter from the synchronization of the model among datacenters.
Worker nodes synchronize as usual with their local parameter servers; those parameter servers then sync with remote parameter servers later.
Machine learning models are only ever approximately correct and still converging, so this delay between remote parameter servers is tolerable.
What and how to sync among datacenters?
- Key observation: the vast majority of updates are insignificant, so only the significant ones need to cross the WAN.
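A toy sketch of what "insignificant" could mean: accumulate local updates and only propagate them across the WAN once they change a parameter by more than a relative threshold (the 1% threshold and names are assumptions for illustration):

```python
# Sketch of a significance check for WAN synchronization.
import numpy as np

def is_significant(delta, param, threshold=0.01):
    """True if the accumulated update changes the parameter by >= threshold (relative)."""
    return abs(delta) >= threshold * max(abs(param), 1e-12)

param, pending = 2.0, 0.0
for step_update in np.random.default_rng(2).normal(scale=1e-3, size=100):
    pending += step_update                 # aggregate tiny local updates
    if is_significant(pending, param):
        param += pending                   # only now propagate to remote datacenters
        pending = 0.0
```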
Gaia
Uses the Approximate Synchronous Parallel (ASP) model.
Requirements:
- Significance filter
- ASP selective barrier
- A mechanism to stall workers during remote sync
- Mirror Clock
- Keeps the datacenters loosely in sync and helps determine RTT and link speed
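A rough sketch of the mirror-clock / selective-barrier idea: a datacenter stalls its local workers if its iteration clock runs too far ahead of the slowest mirror. The DRIFT_LIMIT constant and the toy simulation are assumptions for illustration, not Gaia's actual protocol:

```python
# Sketch: bound how far one datacenter may run ahead of its mirrors.
import itertools
import time

DRIFT_LIMIT = 2   # max iterations a datacenter may lead the slowest mirror

def maybe_stall(local_clock, get_mirror_clocks, poll=lambda: time.sleep(0.01)):
    """Block until the slowest remote mirror is within DRIFT_LIMIT iterations."""
    while local_clock - min(get_mirror_clocks()) > DRIFT_LIMIT:
        poll()   # in a real system: wait for clock/update messages from mirrors

# Toy simulation: the remote mirror advances one step each time we "wait".
remote = itertools.count(0)
mirror = [next(remote)]
def advance():                 # stands in for receiving a remote clock message
    mirror[0] = next(remote)

for clock in range(10):
    maybe_stall(clock, lambda: mirror, poll=advance)
    # ... run one local iteration, then the local clock advances ...
```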
Gaia yielded a significant increase in learning performance across WANs compared to running a system designed for a single datacenter.
- Gaia's performance was close to LAN performance
Tradeoffs of Using Global Model
Global models are not always needed:
- Locality and personalization are lost
- Tailored models are smaller and more efficient
- Large global models add training complexity and risk of overfitting
- Data transfer costs are still significant
One alternative is isolated learning:
- Each node builds its own custom model
- A node may not see all the relevant data, which can lead to an inaccurate model
- Loss of efficiency:
- No sharing of data
- No sharing of computation
- These losses hurt especially during concept shift, when a node must adapt on its own
Collaborative Learning with Cartel
Developed by the professor (definitely on the test).
Maintain small customized models at each location/node/context
When a change in the environment or workload pattern occurs, find another node that has observed similar patterns and transfer knowledge from it.
Cartel provides a jump start to adapt to these changes by making it possible to react quickly, find the right peer nodes, and selectively update the model.
Relies on a (logically) centralized registry/DHT to keep metadata about logical neighbors
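A hypothetical sketch of that registry: nodes publish a workload signature, and a node experiencing a shift asks for its most similar peer so it can pull (part of) that peer's model as a warm start. The cosine-similarity metric and the class/method names are illustrative, not Cartel's actual interface:

```python
# Sketch of a Cartel-style registry of "logical neighbors".
import numpy as np

class Registry:
    """Keeps metadata (workload signatures) about logical neighbors."""
    def __init__(self):
        self.signatures = {}                      # node_id -> workload histogram

    def publish(self, node_id, signature):
        self.signatures[node_id] = np.asarray(signature, dtype=float)

    def find_similar(self, node_id, signature):
        """Return the peer whose workload signature is closest (cosine similarity)."""
        sig = np.asarray(signature, dtype=float)
        best, best_score = None, -1.0
        for peer, other in self.signatures.items():
            if peer == node_id:
                continue
            score = sig @ other / (np.linalg.norm(sig) * np.linalg.norm(other) + 1e-12)
            if score > best_score:
                best, best_score = peer, score
        return best

registry = Registry()
registry.publish("edge-A", [0.7, 0.2, 0.1])
registry.publish("edge-B", [0.1, 0.2, 0.7])

# edge-C sees a workload shift resembling edge-B's pattern, so it would
# transfer knowledge (model weights) from edge-B as a jump start.
peer = registry.find_similar("edge-C", [0.15, 0.25, 0.6])   # -> "edge-B"
```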
Benefits:
- Adapts quickly to changes in workload (8x faster adaptation)
- Reduces total data transfer (by 1500x)
- Enables use of smaller models (3x smaller, 5.7x faster training)
Beyond Geo-Distributed Training
Training is only one step in the ML pipeline.
Ray (and related projects) from the RISELab at UC Berkeley provide a framework to handle all steps in the pipeline.
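A minimal Ray task example, assuming Ray is installed (`pip install ray`); the same remote-task API underpins the higher-level libraries for data loading, training, tuning, and serving:

```python
import ray

ray.init()

@ray.remote
def preprocess(shard):
    return [x * 2 for x in shard]          # stand-in for a real pipeline step

# Launch three tasks in parallel and gather their results.
futures = [preprocess.remote(list(range(i, i + 3))) for i in range(0, 9, 3)]
print(ray.get(futures))

ray.shutdown()
```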