Required Readings

Optional

Summary

The slowdown of Moore’s law leads to specialization and heterogeneous compute

  • GPUs
  • TPUs
  • New Memory/Storage classes

Disaggregation := independently scaled tiers of different resources

What is RDMA

Remote Direct Memory Access

Bypasses CPU involvement in data access over the interconnect

Higher bandwidth and lower latency, but higher cost per port

Two-sided RDMA

  • Traditional send/recv semantics like sockets

One-sided RDMA

  • Accesses remote memory directly, without involving the remote CPU
  • Memory must be pinned so that it is not swapped out to disk
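The difference between the two modes can be sketched with a toy in-process model. This is not real RDMA: `Node`, `two_sided_send`, `two_sided_recv`, and `one_sided_write` are hypothetical names invented for illustration, and the "NIC" is just a Python function. The point it tries to show is which side's CPU has to act, and why unpinned memory cannot be targeted.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy host: a buffer standing in for its memory, plus a CPU-work counter."""
    memory: bytearray = field(default_factory=lambda: bytearray(64))
    pinned: bool = False                      # one-sided access needs pinned memory
    cpu_events: int = 0                       # times this host's CPU had to act
    recv_queue: deque = field(default_factory=deque)


def two_sided_send(dst: Node, payload: bytes) -> None:
    """send/recv semantics: the message is queued until the receiver handles it."""
    dst.recv_queue.append(payload)


def two_sided_recv(dst: Node, offset: int) -> None:
    """The receiver's CPU dequeues the message and copies it into place."""
    payload = dst.recv_queue.popleft()
    dst.cpu_events += 1
    dst.memory[offset:offset + len(payload)] = payload


def one_sided_write(dst: Node, offset: int, payload: bytes) -> None:
    """Simulated RDMA write: data is placed directly, remote CPU is not involved."""
    if not dst.pinned:
        raise PermissionError("remote memory is not pinned")
    dst.memory[offset:offset + len(payload)] = payload


server = Node()

# Two-sided: the server CPU participates in every transfer.
two_sided_send(server, b"hello")
two_sided_recv(server, offset=0)

# One-sided: only possible once the target memory is pinned.
server.pinned = True
one_sided_write(server, offset=8, payload=b"world")

print(server.cpu_events)  # counts only the two-sided transfer
```

In this model the one-sided path never touches `cpu_events`, which is the property the notes attribute to one-sided RDMA.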

RDMA Specialized RPC

One-sided RDMA is faster, but applications need to be redesigned to use it

  • Plus RPC typically has a service that needs to be accessed anyway

Instead we can create a new type of RPC that leverages RDMA features

  • Connectionless protocols
  • Shared receive queues
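A shared receive queue can be sketched as one pool of receive buffers that serves requests from any client, rather than a separate set of buffers per connection. The model below is a simplified assumption, not a verbs API: `SharedReceiveQueue`, `deliver`, `repost`, and `serve` are hypothetical names.

```python
from collections import deque


class SharedReceiveQueue:
    """One pool of receive buffers shared by all clients (toy model)."""

    def __init__(self, num_buffers: int, buffer_size: int) -> None:
        self.buffer_size = buffer_size
        self.free = deque(bytearray(buffer_size) for _ in range(num_buffers))

    def deliver(self, client_id: int, request: bytes):
        """Place a request from any client into the next free buffer."""
        if len(request) > self.buffer_size:
            raise ValueError("request larger than receive buffer")
        if not self.free:
            return None                       # no buffer posted; sender must retry
        buf = self.free.popleft()
        buf[:len(request)] = request
        return client_id, buf, len(request)

    def repost(self, buf: bytearray) -> None:
        """Return a buffer to the pool once the request has been handled."""
        self.free.append(buf)


def serve(srq: SharedReceiveQueue, client_id: int, request: bytes, handler):
    """Run one RPC: receive into a shared buffer, call the handler, repost."""
    slot = srq.deliver(client_id, request)
    if slot is None:
        return None
    _, buf, length = slot
    reply = handler(bytes(buf[:length]))
    srq.repost(buf)
    return reply


# Five clients share two receive buffers instead of needing buffers each.
srq = SharedReceiveQueue(num_buffers=2, buffer_size=16)
replies = [serve(srq, cid, b"ping", bytes.upper) for cid in range(5)]
print(replies)
```

The design choice being illustrated: receive-buffer memory is sized for the expected number of in-flight requests, not for the number of clients.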

What If Memory Is Persistent

Intel Optane was Byte-Addressable Persistent Memory (PMEM)

Persistent data operations require flush to persistent memory

  • Must complete before client is acknowledged
  • Removes the advantage of RDMA over send/recv RPC
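A back-of-the-envelope latency model can make this concrete. It assumes a simplification that is not stated in the notes: a one-sided write is acknowledged before the data is known to be persistent, so the client issues a second request asking the server to flush. All function names and the numbers in microseconds are placeholders for illustration, not measurements.

```python
def one_sided_write_volatile(rtt_us: float) -> float:
    """Volatile memory: one round trip, no remote CPU work."""
    return rtt_us


def one_sided_write_persistent(rtt_us: float, cpu_us: float, flush_us: float) -> float:
    """Assumed model: the write, then a second request that makes the server flush."""
    return rtt_us + (rtt_us + cpu_us + flush_us)


def rpc_write_persistent(rtt_us: float, cpu_us: float, flush_us: float) -> float:
    """send/recv RPC: the server handles the request and flushes before replying."""
    return rtt_us + cpu_us + flush_us


# Placeholder costs in microseconds (assumptions, not measured values).
RTT_US, CPU_US, FLUSH_US = 2.0, 1.0, 1.0

volatile = one_sided_write_volatile(RTT_US)
persistent_one_sided = one_sided_write_persistent(RTT_US, CPU_US, FLUSH_US)
persistent_rpc = rpc_write_persistent(RTT_US, CPU_US, FLUSH_US)

print(volatile, persistent_one_sided, persistent_rpc)
```

Under these assumptions the one-sided path is fastest only when no flush is required; once the flush must complete before the acknowledgement, the RPC path is no slower.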

Disaggregation

Pools of network attached but independently scaled resources

Not new, but easier with faster interconnects and smarter devices
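Why independent scaling matters can be shown with simple provisioning arithmetic. The sketch below compares buying whole servers against buying CPU and memory units separately; the function names, demand figures, and per-unit capacities are hypothetical.

```python
import math


def monolithic_servers(cores: int, mem_gb: int,
                       cores_per_server: int, gb_per_server: int) -> int:
    """Whole servers needed: the scarcer resource dictates the count."""
    return max(math.ceil(cores / cores_per_server),
               math.ceil(mem_gb / gb_per_server))


def disaggregated_units(cores: int, mem_gb: int,
                        cores_per_unit: int, gb_per_unit: int) -> tuple[int, int]:
    """CPU units and memory units are provisioned independently."""
    return (math.ceil(cores / cores_per_unit),
            math.ceil(mem_gb / gb_per_unit))


# Hypothetical memory-heavy workload: 64 cores, 2048 GB of memory.
servers = monolithic_servers(64, 2048, cores_per_server=32, gb_per_server=256)
idle_cores = servers * 32 - 64          # cores bought only to get their memory
cpu_units, mem_units = disaggregated_units(64, 2048, 32, 256)

print(servers, idle_cores, cpu_units, mem_units)
```

In the monolithic case the memory demand forces the purchase of cores that sit idle; with separate pools each tier is sized to its own demand.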

Disaggregating CPU and Memory with LegoOS

Move the virtual memory system (MMU, page tables) to the memory component instead of the CPU

Cache misses now have to go over the network, which has much higher latency than local memory

  • Mitigated by adding a large “Extended Cache” (ExCache) on the CPU side
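The effect of an extended cache can be sketched with a small simulation that counts how many accesses have to cross the network. This uses a plain LRU policy as a stand-in; it is not LegoOS's actual ExCache design, and `ExtendedCache`, `fetch_remote`, and the access trace are hypothetical.

```python
from collections import OrderedDict


class ExtendedCache:
    """Toy LRU page cache on the CPU side; misses go to remote memory."""

    def __init__(self, capacity_pages: int, fetch_remote) -> None:
        self.capacity = capacity_pages
        self.fetch_remote = fetch_remote      # callable: page number -> data
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0                       # each miss is one network fetch

    def read(self, page: int):
        if page in self.pages:
            self.pages.move_to_end(page)      # mark as most recently used
            self.hits += 1
            return self.pages[page]
        self.misses += 1
        data = self.fetch_remote(page)
        self.pages[page] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict least recently used page
        return data


def remote_memory(page: int) -> bytes:
    """Stand-in for a network fetch from the memory component."""
    return page.to_bytes(4, "little")


trace = [0, 1, 0, 2, 0, 1]

small = ExtendedCache(capacity_pages=2, fetch_remote=remote_memory)
large = ExtendedCache(capacity_pages=3, fetch_remote=remote_memory)
for page in trace:
    small.read(page)
    large.read(page)

print(small.misses, large.misses)  # network fetches for each cache size
```

With this trace the larger cache avoids one network fetch, which is the trade-off the extended cache is meant to exploit: more local capacity, fewer slow remote accesses.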

LegoOS Selected Experimental Results

Prototype implemented in emulation

Components emulated with monolithic servers, each using only some of its resources and ignoring the rest

  • So “network-attached hard drives” were regular servers that didn’t utilize CPU, RAM, etc.

Controllers implemented in Linux

Connected via RDMA network, communicated via RPC

Actual system designs:

  • HPE “The Machine”
  • Interconnects for Fabric Attached Memory (OpenFAM)
  • Berkeley FireBox system

Baseline Comparisons:

  • Linux with SSD Swap
  • Linux with Ramdisk Swap
  • Linux with Infiniswap
  • LegoOS

Workloads:

  • Unmodified TensorFlow running CIFAR-10
    • Working set: 0.9 GB
    • 4 threads

LegoOS performed much better than the Linux swap baselines