Reaves.dev

Datacenter Based Distributed Management

Stephen M. Reaves :: 2023-03-06

Notes about Lesson 14 of CS-7210

Required Readings

Optional

Summary

Management Stacks in Datacenters

Datacenters hold thousands of servers mounted on racks

Hardware is a mix of general-purpose and specialized machines

Hyperscalers operate at tremendous scale

What is the task?

Meet Service Level Objectives (SLO)

Datacenter Management at Scale

Borg -> Omega -> Kubernetes

Cell := a collection of machines

Cell is the basic unit of management in Borg

Machines in a cell belong to a single cluster

A cluster lives inside a building

A site might have multiple buildings

Borg performs the duties of an OS scheduler, but across multiple machines

Overview of Borg Operations

One Borgmaster per cell

Each machine in the cell runs a worker agent called a Borglet

Scheduler:

  1. Scans tasks from high to low priority
  2. Runs feasibility check to determine the set of machines on which a task can be run
  3. Scores the feasible machines based on best fit
  4. Submits machine assignment to Borgmaster
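The four scheduler steps above can be sketched as follows. This is a minimal illustration, not Borg's actual code: the `Machine` and `Task` types, the two-resource model, and the leftover-capacity score are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram: float


@dataclass
class Task:
    name: str
    priority: int
    cpu: float
    ram: float


def feasible(task, machines):
    # Step 2: keep only machines with enough free resources for the task.
    return [m for m in machines
            if m.free_cpu >= task.cpu and m.free_ram >= task.ram]


def best_fit_score(task, machine):
    # Step 3 (assumed scoring): less leftover capacity means a tighter fit,
    # so the score is the negated leftover. Summing CPU and RAM is a crude
    # simplification chosen only to keep the example short.
    leftover = (machine.free_cpu - task.cpu) + (machine.free_ram - task.ram)
    return -leftover


def schedule(pending, machines):
    assignments = {}
    # Step 1: scan tasks from high to low priority.
    for task in sorted(pending, key=lambda t: t.priority, reverse=True):
        candidates = feasible(task, machines)
        if not candidates:
            continue  # no feasible machine: the task stays pending
        best = max(candidates, key=lambda m: best_fit_score(task, m))
        # Step 4: record the assignment (in Borg this goes to the Borgmaster).
        best.free_cpu -= task.cpu
        best.free_ram -= task.ram
        assignments[task.name] = best.name
    return assignments
```

With two machines of different sizes, the higher-priority task is placed first and takes the tightest fit, leaving the larger machine for the task that follows.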

Borgmaster can preempt tasks based on priority
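A rough sketch of priority-based preemption on a single machine, assuming one resource (CPU) and evicting victims from lowest to highest priority until the new task fits. The function name and the tuple layout are hypothetical.

```python
def preempt_for(task_priority, task_cpu, free_cpu, running):
    """Pick tasks to evict so that a new task fits.

    running: list of (name, priority, cpu) tuples for tasks on the machine.
    Returns the names to evict (possibly empty), or None if the task cannot
    fit even after evicting every lower-priority task.
    """
    # Only tasks with strictly lower priority may be preempted.
    victims = sorted((r for r in running if r[1] < task_priority),
                     key=lambda r: r[1])
    evicted = []
    for name, _priority, cpu in victims:
        if free_cpu >= task_cpu:
            break  # enough room already, stop evicting
        free_cpu += cpu
        evicted.append(name)
    return evicted if free_cpu >= task_cpu else None
```

Evicted tasks are not lost in this model; they would return to the pending queue and be rescheduled elsewhere.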

Achieving Scalability

The Borgmaster is replicated five times

Borg automatically reschedules evicted tasks

It reduces correlated failures by spreading the tasks of a job across failure domains

It limits the allowed rate of task disruption
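Spreading the tasks of a job across failure domains can be illustrated with a simple least-loaded placement. This is only a sketch: the domain names are made up, and a real scheduler would combine this with the feasibility and scoring steps.

```python
from collections import Counter


def spread(num_tasks, domains):
    """Place num_tasks tasks so that no failure domain holds many more
    tasks of the job than any other."""
    load = Counter({d: 0 for d in domains})
    placement = []
    for _ in range(num_tasks):
        # Always pick the domain currently holding the fewest tasks.
        target = min(domains, key=lambda d: load[d])
        load[target] += 1
        placement.append(target)
    return placement
```

If one rack or power domain fails, only the tasks placed there are lost, so a job with five tasks over three domains loses at most two.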

Decouple the scheduler from task assignment

Make communication efficient

Optimize the scoring of machine/task pairs

Avoid segregation of production and non-production workloads

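Score caching: scores for a machine/task pair are cached until the machine's or the task's properties change

Equivalence classes: tasks with identical requirements are grouped, so feasibility and scoring are computed once per class

Relaxed randomization: machines are examined in random order until enough feasible candidates are found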
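The three scoring optimizations can be combined in a small sketch. It assumes tasks and machines are plain dictionaries, that a task's equivalence class is defined by its CPU, RAM, and priority, and that each machine carries a hypothetical `version` counter bumped whenever its state changes; none of these details come from Borg itself.

```python
import random


def equivalence_key(task):
    # Tasks with identical requirements share one key (equivalence class).
    return (task["cpu"], task["ram"], task["priority"])


class CachingScorer:
    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.cache = {}
        self.calls = 0  # how often the real scoring function ran

    def score(self, task, machine):
        # A cached score is reused until the machine's version changes.
        key = (equivalence_key(task), machine["name"], machine["version"])
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.score_fn(task, machine)
        return self.cache[key]


def sample_feasible(task, machines, enough, rng):
    # Relaxed randomization: visit machines in random order and stop as
    # soon as `enough` feasible candidates have been found.
    order = list(machines)
    rng.shuffle(order)
    found = []
    for m in order:
        if m["free_cpu"] >= task["cpu"] and m["free_ram"] >= task["ram"]:
            found.append(m)
            if len(found) >= enough:
                break
    return found
```

Scoring two tasks from the same class against the same unchanged machine runs the scoring function once; the second lookup is served from the cache.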

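Borg maintains application classes (latency-sensitive versus batch)

Compressible resources, such as CPU cycles, can be throttled without killing a task; non-compressible resources, such as memory, cannot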
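One way to picture the difference between resource kinds: when a machine is overloaded on a compressible resource, low-priority tasks can be throttled, whereas an overload on a non-compressible resource forces tasks to be killed. The sketch below assumes a made-up `COMPRESSIBLE` set and a simple lowest-priority-first policy.

```python
# Assumed classification for the example: rate-based resources are
# compressible; memory and disk space are not.
COMPRESSIBLE = {"cpu", "disk_io"}


def resolve_overload(resource, tasks, capacity):
    """tasks: list of (name, priority, usage) tuples.

    Returns (throttled, killed): a dict of task name -> reduced usage,
    and a list of killed task names.
    """
    excess = sum(usage for _, _, usage in tasks) - capacity
    throttled, killed = {}, []
    # Lowest-priority tasks give up resources first.
    for name, _priority, usage in sorted(tasks, key=lambda t: t[1]):
        if excess <= 0:
            break
        if resource in COMPRESSIBLE:
            cut = min(usage, excess)
            throttled[name] = usage - cut  # slow the task down, keep it alive
            excess -= cut
        else:
            killed.append(name)  # usage cannot be shrunk, so the task dies
            excess -= usage
    return throttled, killed
```

The same overload therefore costs only some speed for a batch task when the resource is CPU, but costs the whole task when the resource is memory.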

Experimental Results

Is pooling resources a good idea?

Does it save resources over segregated configurations?

Pooling needs fewer machines
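A toy calculation shows one reason pooling can need fewer machines: segregated cells each round their demand up to whole machines, while a pooled cell rounds up only once. The core counts below are invented for illustration and are not results from the lesson or the paper.

```python
import math


def machines_needed(demand_cores, cores_per_machine):
    # Machines come in whole units, so round the demand up.
    return math.ceil(demand_cores / cores_per_machine)


def compare(prod_cores, nonprod_cores, cores_per_machine):
    segregated = (machines_needed(prod_cores, cores_per_machine)
                  + machines_needed(nonprod_cores, cores_per_machine))
    pooled = machines_needed(prod_cores + nonprod_cores, cores_per_machine)
    return segregated, pooled


# Hypothetical demand: 25 production cores, 15 non-production cores,
# 10 cores per machine.
segregated, pooled = compare(25, 15, 10)
print(segregated, pooled)
```

This captures only the fragmentation effect; pooling also lets non-production work use capacity that production tasks reserve but leave idle.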