Distributed Machine Learning

Stephen M. Reaves :: 2023-04-26

Notes about Lesson 15 of CS-7210

Readings

Optional

Summary

What is Distributed Machine Learning

ML is not new, but hardware and software advancements make it more practical.

Distributed Machine Learning Approaches

  1. Aggregate all data to a central location, train, then send models back
  2. Federated models

Parameter Server

Data distributed to workers

Model parameters distributed across parameter servers

  1. Workers get a working set of parameters
  2. Workers compute gradients on their local data
  3. Workers push updated parameters back to the servers
  4. Servers synchronize and update the model
  5. Repeat until the change from one model to the next is smaller than some threshold (see the sketch below)
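
A minimal sketch of this loop, with the "parameter server" collapsed to an in-memory dict and a toy linear model. The names, the squared-error loss, and the stopping rule are illustrative, not from the lecture:

```python
import numpy as np

# Illustrative stand-in for a parameter-server shard: just a dict of arrays.
params = {"w": np.zeros(10), "b": np.zeros(1)}

def compute_gradients(w, b, X, y):
    # Gradient of mean squared error for a toy linear model.
    err = X @ w + b - y
    return {"w": X.T @ err / len(y), "b": np.array([err.mean()])}

def train_worker(X, y, lr=0.1, tol=1e-4, max_steps=1000):
    for _ in range(max_steps):
        # 1. Pull the working set of parameters from the server.
        w, b = params["w"].copy(), params["b"].copy()
        # 2. Compute gradients on this worker's shard of the data.
        grads = compute_gradients(w, b, X, y)
        # 3./4. Push updates; the "server" applies them to the shared model.
        params["w"] -= lr * grads["w"]
        params["b"] -= lr * grads["b"]
        # 5. Stop once the update is smaller than the threshold.
        if lr * np.abs(grads["w"]).max() < tol:
            break
```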

Geo-Distributed ML

Naive Approach

Simply deploy parameter servers and workers across each datacenter.

This works, but becomes increasingly slow, especially when synchronization crosses WAN boundaries.

Leverage Approximation

Decouple the synchronization of the model within a datacenter from the synchronization of the model among datacenters.

Worker nodes sync as normal with local parameter servers. Then, parameter servers perform a remote sync to remote parameter servers later.

Machine learning models are always only approximately correct and still converging, so this delay between remote parameter servers isn’t that bad.
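
A rough sketch of that decoupling, assuming a single remote datacenter and a caller-supplied gradient callback; the names and the sync interval are made up for illustration:

```python
import numpy as np

def geo_train(local_grad_fn, steps=1000, remote_sync_every=50, lr=0.1):
    """Two-tier sync: workers sync with the local server every step,
    but the WAN sync to a remote datacenter happens only periodically."""
    local_params = np.zeros(10)          # held by the in-datacenter parameter server
    remote_params = np.zeros(10)         # stale copy held by a remote datacenter
    last_shipped = local_params.copy()   # snapshot taken at the last WAN sync

    for step in range(steps):
        # Eager local sync: apply each worker's gradient immediately.
        for grad in local_grad_fn(local_params):
            local_params -= lr * grad
        # Lazy remote sync: ship the accumulated delta across the WAN occasionally.
        if step % remote_sync_every == 0:
            remote_params += local_params - last_shipped
            last_shipped = local_params.copy()
    return local_params, remote_params
```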

What and how to sync among datacenters?

Gaia

Uses the Approximate Synchronous Parallel (ASP) model.

Requirements:

This resulted in a significant increase in learning performance when running across WANs, compared to running a system designed for a single datacenter.
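
The key mechanism in ASP is a significance filter: an accumulated update only crosses the WAN once it is large relative to the current parameter value. A minimal sketch, with an assumed 1% threshold and illustrative names:

```python
import numpy as np

def wan_worthy(params, pending_update, threshold=0.01):
    """Mark parameters whose pending (accumulated) update is significant,
    i.e. at least `threshold` of the current value; only those are shipped
    to remote datacenters, the rest keep accumulating locally."""
    return np.abs(pending_update) >= threshold * np.abs(params)

params = np.array([10.0, 0.5, -2.0])
pending = np.array([0.05, 0.2, -0.01])
print(wan_worthy(params, pending))  # [False  True False]
```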

Tradeoffs of Using Global Model

Global Models are not always needed:

One alternative is isolated learning:

Collaborative Learning with Cartel

Developed by the professor (definitely on the test).

Maintain small customized models at each location/node/context

When a change in the environment/workload pattern occurs, find another node that has observed similar patterns and transfer knowledge from it.

Cartel provides a jump start to adapt to these changes by making it possible to react quickly, find the right peer nodes, and selectively update the model.

Relies on a (logically) centralized registry/DHT to keep metadata about logical neighbors
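
A hedged sketch of that reaction path: ask the registry for the peer with the most similar workload signature, then pull and merge its model. The `Registry` class, `fetch_model`, and `merge_model` below are assumptions for illustration, not Cartel's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Registry:
    """Logically centralized registry keeping workload-signature metadata per node."""
    signatures: dict = field(default_factory=dict)   # node_id -> signature vector

    def register(self, node_id, signature):
        self.signatures[node_id] = signature

    def find_similar(self, signature, exclude):
        # Return the logical neighbor with the closest workload signature.
        others = {n: s for n, s in self.signatures.items() if n != exclude}
        return min(others, key=lambda n: sum((a - b) ** 2 for a, b in zip(others[n], signature)))

def on_workload_change(node_id, my_signature, registry, fetch_model, merge_model):
    # Find a peer that has already observed a similar pattern...
    peer = registry.find_similar(my_signature, exclude=node_id)
    # ...and selectively fold its knowledge into the local model (the "jump start").
    merge_model(fetch_model(peer))
```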

Benefits:

Beyond Geo-Distributed Training

Training is only one step in the ML pipeline.

Ray (and related projects) from RISELab at UC Berkeley provide a framework to handle all steps in the pipeline.
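
For context, Ray's core abstraction turns Python functions into remote tasks scheduled across the cluster, and each pipeline stage can be written this way. The canonical minimal example:

```python
import ray

ray.init()  # start or connect to a Ray cluster

@ray.remote
def square(x):
    # Any pipeline stage (preprocessing, training, evaluation, serving glue)
    # can be expressed as a remote task like this.
    return x * x

# Tasks run in parallel across the cluster; ray.get collects the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```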