Required Readings



Management Stacks in Datacenters

Thousands of servers on racks

Generalized and specialized

Hyperscaler size is tremendous

What is the task?

  • Application
    • Batch job
    • Long running service
    • Prod vs non-prod
    • Multi-tenancy
  • Processes/Tasks
    • Latency sensitive
    • Throughput sensitive
  • Allocation decisions for hardware

Meet Service Level Objectives (SLO)

  • Defined in Service Level Agreement (SLA)

Datacenter Management at Scale

Omega -> Borg -> Kubernetes

Cell := a collection of machines

Cell is the basic unit of management in Borg

Machines in a cell belong to a single cluster

A cluster lives inside a building

A site might have multiple buildings

Perform duties of an OS scheduler, but across multiple machines

Overview of Borg Operations

1 Borgmaster per cell

  • Maintains entire state of cell in memory

Workers are called borglets


  1. Scans tasks from high to low priority
  2. Runs feasibility check to determine the set of machines on which a task can be run
  3. Scores the feasible machines based on best fit
  4. Submits machine assignment to Borgmaster

Borgmaster can preempt tasks based on priority

Achieving Scalability

Borgmaster is replicated 5 times

  • 1 master is elected and acquires a Chubby lock
  • Only master mutates the state of the cell
  • Elected master servers as leader for writing to the Paxos store
  • Each replica also saves the state of a cell to a Paxos store
  • Failover ~10s

Automatically reschedule evicted tasks

Reduces correlated failures by spreading tasks of a job across failure domains

Limits the allowed rate of task disruption

Decoupling scheduler from task assignment

  • Async update/read from pending queue
  • Integrate different schedules
    • Like tetrisched

Make communication efficient

  • Borgmaster uses separate threads for read-only RPCs and talking to borglets
  • Use link shards to summarize info collected from borglets

Optimizations around scoring for machine/task pairs

Avoid segregation of production and non-production workloads

Score caching:

  • Scores for task assignment are cached until machine/task properties change
  • Small changes in resource quantities on machines are ignored

Equivalence classes:

  • Groups of similar machines are put into one class
  • Scores computed once per class

Relaxed Randomization

  • Scheduler examines machines in a random order until it has found “enough” feasible machines to score

Borg maintains application classes

  • Latency
  • Batch

Compressible Resources

  • CPU/Memory speed can be decreased without killing container

Experimental Results

Is pooling resources a good idea?

Does it save resources over “segregated” configurations?

Pooling needs fewer machines

  • 20-150%