Datacenter Based Distributed Management
Required Readings
Optional
Summary
- Management Stacks in Datacenters
- Datacenter Management at Scale
- Overview of Borg Operations
- Achieving Scalability
- Experimental Results
Management Stacks in Datacenters
Thousands of servers on racks
Generalized and specialized
Hyperscaler size is tremendous
What is the task?
- Application
- Batch job
- Long running service
- Prod vs non-prod
- Multi-tenancy
- Processes/Tasks
- Latency sensitive
- Throughput sensitive
- Allocation decisions for hardware
Meet Service Level Objectives (SLO)
- Defined in Service Level Agreement (SLA)
Datacenter Management at Scale
Omega -> Borg -> Kubernetes
Cell := a collection of machines
Cell is the basic unit of management in Borg
Machines in a cell belong to a single cluster
A cluster lives inside a building
A site might have multiple buildings
Perform duties of an OS scheduler, but across multiple machines
Overview of Borg Operations
1 Borgmaster per cell
- Maintains entire state of cell in memory
Workers are called borglets
Scheduler:
- Scans tasks from high to low priority
- Runs feasibility check to determine the set of machines on which a task can be run
- Scores the feasible machines based on best fit
- Submits machine assignment to Borgmaster
Borgmaster can preempt tasks based on priority
Achieving Scalability
Borgmaster is replicated 5 times
- 1 master is elected and acquires a Chubby lock
- Only master mutates the state of the cell
- Elected master servers as leader for writing to the Paxos store
- Each replica also saves the state of a cell to a Paxos store
- Failover ~10s
Automatically reschedule evicted tasks
Reduces correlated failures by spreading tasks of a job across failure domains
Limits the allowed rate of task disruption
Decoupling scheduler from task assignment
- Async update/read from pending queue
- Integrate different schedules
- Like tetrisched
Make communication efficient
- Borgmaster uses separate threads for read-only RPCs and talking to borglets
- Use link shards to summarize info collected from borglets
Optimizations around scoring for machine/task pairs
Avoid segregation of production and non-production workloads
Score caching:
- Scores for task assignment are cached until machine/task properties change
- Small changes in resource quantities on machines are ignored
Equivalence classes:
- Groups of similar machines are put into one class
- Scores computed once per class
Relaxed Randomization
- Scheduler examines machines in a random order until it has found enough feasible machines to score
Borg maintains application classes
- Latency
- Batch
Compressible Resources
- CPU/Memory speed can be decreased without killing container
Experimental Results
Is pooling resources a good idea?
Does it save resources over segregated configurations?
Pooling needs fewer machines
- 20-150%