Datacenter Based Distributed Management

Stephen M. Reaves

2023-03-06

Notes about Lesson 14 of CS-7210

Required Readings

Borg

Optional

Summary

Management Stacks in Datacenters
Datacenter Management at Scale
Overview of Borg Operations
Achieving Scalability
Experimental Results

Management Stacks in Datacenters

Thousands of servers on racks

Generalized and specialized

Hyperscaler size is tremendous

What is the task?

Application
- Batch job
- Long running service
- Prod vs non-prod
- Multi-tenancy
Processes/Tasks
- Latency sensitive
- Throughput sensitive
Allocation decisions for hardware

Meet Service Level Objectives (SLO)

Defined in Service Level Agreement (SLA)

Datacenter Management at Scale

Borg -> Omega -> Kubernetes

Cell := a collection of machines

Cell is the basic unit of management in Borg

Machines in a cell belong to a single cluster

A cluster lives inside a building

A site might have multiple buildings

Borg performs the duties of an OS scheduler, but across multiple machines

Overview of Borg Operations

1 Borgmaster per cell

Maintains entire state of cell in memory

Workers are called borglets

Scheduler:

Scans tasks from high to low priority
Runs feasibility check to determine the set of machines on which a task can be run
Scores the feasible machines based on best fit
Submits machine assignment to Borgmaster
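
The four scheduler steps above can be sketched roughly as follows. This is a minimal illustration, not Borg's actual code: the Machine and Task classes, the two-resource model (CPU and memory), and the leftover-capacity scoring function are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_mem: float


@dataclass
class Task:
    name: str
    priority: int
    cpu: float
    mem: float


def feasible(task, machines):
    """Feasibility check: machines with enough free CPU and memory."""
    return [m for m in machines
            if m.free_cpu >= task.cpu and m.free_mem >= task.mem]


def best_fit_score(task, machine):
    """Best fit: less leftover capacity after placement scores higher.

    Adding CPU and memory leftovers directly is a simplification
    chosen for this sketch.
    """
    leftover = (machine.free_cpu - task.cpu) + (machine.free_mem - task.mem)
    return -leftover


def schedule(pending, machines):
    """Return {task name: machine name} for every task that could be placed."""
    assignments = {}
    # Step 1: scan tasks from high to low priority.
    for task in sorted(pending, key=lambda t: t.priority, reverse=True):
        # Step 2: feasibility check.
        candidates = feasible(task, machines)
        if not candidates:
            continue  # task stays pending
        # Step 3: score the feasible machines and take the best.
        best = max(candidates, key=lambda m: best_fit_score(task, m))
        # Step 4: record the assignment (here we also update free capacity).
        best.free_cpu -= task.cpu
        best.free_mem -= task.mem
        assignments[task.name] = best.name
    return assignments


machines = [Machine("m1", 4.0, 8.0), Machine("m2", 2.0, 4.0)]
pending = [Task("batch", 1, 2.0, 4.0), Task("web", 10, 2.0, 4.0)]
print(schedule(pending, machines))  # {'web': 'm2', 'batch': 'm1'}
```

The high-priority task is placed first and lands on the tightest-fitting machine, leaving the larger machine free for later work.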

Borgmaster can preempt tasks based on priority
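
A rough sketch of priority-based preemption on a single machine. The function name, the CPU-only resource model, and the rule "only strictly lower-priority tasks may be evicted" are assumptions for illustration; the real policy has more cases.

```python
def pick_victims(running, needed_cpu, free_cpu, new_priority):
    """Choose tasks to evict so that a new task fits.

    running: list of (name, priority, cpu) tuples on one machine.
    Returns the names to evict, or None if the task cannot fit
    even after evicting every lower-priority task.
    """
    victims = []
    # Consider the lowest-priority tasks first.
    for name, priority, cpu in sorted(running, key=lambda t: t[1]):
        if free_cpu >= needed_cpu:
            break  # enough room already
        if priority >= new_priority:
            break  # assumption: never evict equal or higher priority
        victims.append(name)
        free_cpu += cpu
    return victims if free_cpu >= needed_cpu else None


running = [("a", 5, 1.0), ("b", 1, 1.0), ("c", 9, 2.0)]
print(pick_victims(running, needed_cpu=2.0, free_cpu=0.5, new_priority=8))
```

Evicted tasks would go back to the pending queue rather than being lost.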

Achieving Scalability

Borgmaster is replicated 5 times

1 master is elected and acquires a Chubby lock
Only master mutates the state of the cell
Elected master serves as leader for writing to the Paxos store
Each replica also saves the state of a cell to a Paxos store
Failover ~10s

Automatically reschedule evicted tasks

Reduces correlated failures by spreading tasks of a job across failure domains
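
One way to picture spreading a job across failure domains: for each task, pick the machine whose domain currently holds the fewest tasks of that job. The function and the rack names below are hypothetical; the notes do not specify the actual placement heuristic.

```python
from collections import Counter


def spread_tasks(task_ids, domain_of):
    """Place tasks so they are spread across failure domains.

    domain_of: {machine name: failure domain (e.g. rack)}.
    Returns {task id: machine name}.
    """
    per_domain = Counter()   # tasks of this job per failure domain
    per_machine = Counter()  # tasks of this job per machine
    placement = {}
    for task in task_ids:
        # Prefer the least-loaded domain, then the least-loaded machine.
        machine = min(
            sorted(domain_of),
            key=lambda m: (per_domain[domain_of[m]], per_machine[m]),
        )
        placement[task] = machine
        per_domain[domain_of[machine]] += 1
        per_machine[machine] += 1
    return placement


domain_of = {"m1": "rackA", "m2": "rackA", "m3": "rackB", "m4": "rackC"}
print(spread_tasks([0, 1, 2, 3], domain_of))
```

With three racks, the first three tasks land on three different racks, so losing any one rack takes out at most part of the job.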

Limits the allowed rate of task disruption

Decoupling scheduler from task assignment

Async update/read from pending queue
Integrate different schedulers
- Like TetriSched
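
A toy sketch of the decoupling idea: the scheduler works on a snapshot of the pending queue and proposes assignments, while the master alone mutates cell state and may reject a stale proposal. The Master class, its method names, and first_fit_scheduler are all hypothetical stand-ins, and the real system does this asynchronously across processes.

```python
from collections import deque


class Master:
    """Owns cell state; the only component that mutates it."""

    def __init__(self, free_cpu):
        self.free_cpu = dict(free_cpu)  # machine -> free CPU
        self.pending = deque()          # (task, cpu) waiting for placement
        self.placed = {}                # task -> machine

    def submit(self, task, cpu):
        self.pending.append((task, cpu))

    def snapshot(self):
        """Copy of state for a scheduler to read without locking the master."""
        return dict(self.free_cpu), list(self.pending)

    def apply(self, task, cpu, machine):
        """Accept a proposal only if it is still valid against live state."""
        if (task, cpu) not in self.pending:
            return False
        if self.free_cpu.get(machine, 0) < cpu:
            return False
        self.pending.remove((task, cpu))
        self.free_cpu[machine] -= cpu
        self.placed[task] = machine
        return True


def first_fit_scheduler(free_cpu, pending):
    """A pluggable policy; any function with this shape could replace it."""
    proposals = []
    for task, cpu in pending:
        for machine in sorted(free_cpu):
            if free_cpu[machine] >= cpu:
                free_cpu[machine] -= cpu  # mutates only the snapshot copy
                proposals.append((task, cpu, machine))
                break
    return proposals


master = Master({"m1": 2.0})
master.submit("a", 1.5)
master.submit("b", 1.0)
state, pending = master.snapshot()
for proposal in first_fit_scheduler(state, pending):
    print(proposal, master.apply(*proposal))
```

Because the policy is just a function over a snapshot, a different scheduler can be swapped in without touching the master.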

Make communication efficient

Borgmaster uses separate threads for read-only RPCs and talking to borglets
Use link shards to summarize info collected from borglets

Optimizations around scoring for machine/task pairs

Avoid segregation of production and non-production workloads

Score caching:

Scores for task assignment are cached until machine/task properties change
Small changes in resource quantities on machines are ignored
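
Score caching might look something like the sketch below. The ScoreCache class, the single "free" resource number, and the 5% tolerance are assumptions chosen for the example; the notes only say that small changes are ignored.

```python
class ScoreCache:
    """Cache scores per (task, machine) until free resources change enough."""

    def __init__(self, score_fn, tolerance=0.05):
        self.score_fn = score_fn    # expensive scoring function
        self.tolerance = tolerance  # relative change that is ignored
        self.cache = {}             # (task, machine) -> (free, score)
        self.misses = 0             # number of real score computations

    def score(self, task, machine, free):
        entry = self.cache.get((task, machine))
        if entry is not None:
            cached_free, cached_score = entry
            # Small changes in resource quantity reuse the cached score.
            if abs(free - cached_free) <= self.tolerance * max(cached_free, 1e-9):
                return cached_score
        self.misses += 1
        value = self.score_fn(task, free)
        self.cache[(task, machine)] = (free, value)
        return value


cache = ScoreCache(lambda task, free: -free)
cache.score("t", "m1", 10.0)   # computed
cache.score("t", "m1", 10.2)   # within 5%: served from cache
cache.score("t", "m1", 12.0)   # changed too much: recomputed
print(cache.misses)
```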

Equivalence classes:

Groups of tasks with identical requirements and constraints are put into one class
Scores computed once per class
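
A small sketch of the saving from equivalence classes. In the Borg paper an equivalence class groups tasks with identical requirements, so scoring runs once per class instead of once per task. The function name and the tuple-of-requirements representation are assumptions for illustration.

```python
from collections import defaultdict


def score_once_per_class(tasks, machines, score_fn):
    """Score each equivalence class once and share the result.

    tasks: {task name: requirements tuple, e.g. (cpu, mem)}.
    Returns ({task name: {machine: score}}, number of score_fn calls).
    """
    classes = defaultdict(list)
    for name, requirements in tasks.items():
        classes[requirements].append(name)

    scores = {}
    evaluations = 0
    for requirements, members in classes.items():
        class_scores = {}
        for machine in machines:
            class_scores[machine] = score_fn(requirements, machine)
            evaluations += 1
        for name in members:
            scores[name] = class_scores  # shared by the whole class
    return scores, evaluations


tasks = {"web-0": (1.0, 2.0), "web-1": (1.0, 2.0), "db-0": (4.0, 16.0)}
scores, evaluations = score_once_per_class(
    tasks, ["m1", "m2"], lambda req, machine: req[0]
)
print(evaluations)  # 4 evaluations instead of 6
```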

Relaxed Randomization

Scheduler examines machines in a random order until it has found enough feasible machines to score
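
Relaxed randomization, sketched under the assumption that feasibility is a simple predicate and "enough" is a fixed count. The function name and parameters are hypothetical.

```python
import random


def sample_feasible(machines, is_feasible, enough, rng=None):
    """Examine machines in random order until `enough` feasible ones are found.

    Returns (feasible machines found, number of machines examined).
    """
    if rng is None:
        rng = random.Random()
    order = list(machines)
    rng.shuffle(order)

    found = []
    examined = 0
    for machine in order:
        examined += 1
        if is_feasible(machine):
            found.append(machine)
            if len(found) >= enough:
                break  # stop early: no need to look at the whole cell
    return found, examined


found, examined = sample_feasible(
    range(100), lambda m: m % 2 == 0, enough=3, rng=random.Random(0)
)
print(found, examined)
```

The scheduler then scores only the machines in found, trading a possibly slightly worse placement for far fewer feasibility checks.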

Borg maintains application classes

Latency-sensitive
Batch

Compressible Resources

Rate-based resources (CPU cycles, disk I/O bandwidth) can be throttled without killing the container; memory and disk space are non-compressible

Experimental Results

Is pooling resources a good idea?

Does it save resources over segregated configurations?

Pooling needs fewer machines

20-150%