Memory Systems Refresher
Summary
- Naive Memory Model
- Cache Motivation
- Data Costs
- Memory Hierarchy
- Locality and Cache Blocks
- Direct Mapping
- How many bits
- Set Associative Mapping
- Fully Associative Mapping
- Write Policy
- Virtual Memory Abstraction
- Virtually Indexed, Physically Tagged Caches
Naive Memory Model
The CPU can talk to main memory or to disk.
Memory is faster than disk, but both are slow relative to the CPU.
Cache Motivation
```c
for (i = 0; i < N; i++)
    sum += a[i];
```
sum is a local variable kept in a CPU register, so it is fast to access.
a[i] may live in main memory, which is slower to access:
- 100 CPU cycle latency
- 2 bytes/cycle
We can add a faster cache closer to the CPU and store a[i] there:
- 10 CPU cycle latency
- 256 bytes/cycle
If the data is in the cache, we call it a “hit”; otherwise it’s a “miss”
Data Costs
Suppose retrieving data costs
- 4 cycles on a hit
- 100 cycles on a miss
How many cycles does a process with a 99% hit rate spend per access, on average?
- 0.99 * 4 + 0.01 * 100 = 4.96
And with a 90% hit rate?
- 0.90 * 4 + 0.10 * 100 = 13.6
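A minimal sketch (function name is ours) of the expected-cost arithmetic above:

```python
def avg_access_cycles(hit_rate, hit_cost=4, miss_cost=100):
    # Expected cycles per access: weight each outcome by its probability.
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

print(round(avg_access_cycles(0.99), 2))  # 4.96
print(round(avg_access_cycles(0.90), 2))  # 13.6
```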
Memory Hierarchy
There’s always a trade-off between speed and capacity: registers are the fastest and smallest, followed by cache, main memory, and disk, each level larger and slower than the one above it.
Locality and Cache Blocks
Locality assumes that if a program needs one particular memory address, it will need the same or nearby addresses soon.
Temporal locality refers to the tendency to access the same memory address close in time.
- Keep data in the cache right after you use it, in case you need it again soon
Spatial locality refers to the tendency to access memory addresses that are close together.
- Put the whole block of data in the cache, not just the bytes you accessed (see the sketch after this list)
- The higher-order bits of an address tell us quickly which block we’re in
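A sketch (ours, with an illustrative matrix) of spatial locality at work: row-major traversal touches consecutive addresses and uses every byte of each fetched block, while column-major traversal jumps a whole row between accesses and wastes most of each block.

```python
N = 1024
a = [[1] * N for _ in range(N)]  # stand-in for a contiguous row-major array

# Row-major: consecutive accesses hit consecutive addresses,
# so each cache block brought in is fully used before eviction.
total = 0
for i in range(N):
    for j in range(N):
        total += a[i][j]

# Column-major: consecutive accesses are N elements apart, so each
# fetched block typically contributes only one element before eviction.
total = 0
for j in range(N):
    for i in range(N):
        total += a[i][j]
```

(Python's nested lists aren't truly contiguous; in C the same two traversals show a dramatic runtime gap.)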
Direct Mapping
Suppose a 5-bit address space split into a 2-bit tag, 2-bit index, and 1-bit offset. Given the cache state below, fill in the table:

Index | Tag | Offset 0 | Offset 1 |
---|---|---|---|
00 | 01 | 0xFC | 0x31 |
01 | 10 | 0x12 | 0x82 |
10 | 10 | 0x57 | 0x48 |
11 | 00 | 0x63 | 0x99 |

Address | Hit/Miss | Value |
---|---|---|
10001 | | |
00111 | | |
10100 | | |
01111 | | |
Answer:

Address | Hit/Miss | Value | Reason |
---|---|---|---|
10001 | Miss | | Index of 00 holds tag 01, not 10 |
00111 | Hit | 0x99 | Index of 11, tag of 00, offset of 1 |
10100 | Hit | 0x57 | Index of 10, tag of 10, offset of 0 |
01111 | Miss | | Index of 11 holds tag 00, not 01 |
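A sketch (ours) of the lookup logic for this example: split the 5-bit address into tag, index, and offset, then compare tags.

```python
# Cache state from the table: index -> (tag, (byte at offset 0, byte at offset 1))
cache = {
    0b00: (0b01, (0xFC, 0x31)),
    0b01: (0b10, (0x12, 0x82)),
    0b10: (0b10, (0x57, 0x48)),
    0b11: (0b00, (0x63, 0x99)),
}

def lookup(addr):
    # 5-bit address: 2-bit tag | 2-bit index | 1-bit offset
    offset = addr & 0b1
    index = (addr >> 1) & 0b11
    tag = (addr >> 3) & 0b11
    stored_tag, block = cache[index]
    if stored_tag == tag:
        return ("hit", hex(block[offset]))
    return ("miss", None)

print(lookup(0b00111))  # ('hit', '0x99')
print(lookup(0b10001))  # ('miss', None)
```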
How many bits
22-bit address space
512-byte cache
64-byte block size
How many bits are used for the index?
- 512 B / 64 B = 8 blocks, so 3 index bits
How many bits are used for the tag?
- A 64 B block needs 6 offset bits, so 22 - 3 - 6 = 13 tag bits
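The same arithmetic as a sketch (variable names are ours):

```python
from math import log2

addr_bits = 22
cache_bytes = 512
block_bytes = 64

offset_bits = int(log2(block_bytes))                # 6: byte within a block
index_bits = int(log2(cache_bytes // block_bytes))  # 3: which of the 8 blocks
tag_bits = addr_bits - index_bits - offset_bits     # 13: the rest of the address

print(offset_bits, index_bits, tag_bits)  # 6 3 13
```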
Set Associative Mapping
Set associative mapping associates each memory address with more than one block in the cache.
A 2-way cache ignores the most significant bit of the index (it becomes part of the tag), halving the number of sets.
This allows a cache block to be stored in either of two possible places:
- Requires checking two tags to see if we hit
- Makes a miss less likely (see the sketch below)
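A sketch (ours, with illustrative contents) extending the direct-mapped example to 2-way: the same 5-bit addresses now map to 2 sets, each holding two (tag, block) ways, and the tag grows to 3 bits.

```python
# 5-bit address, 2-way: 3-bit tag | 1-bit index | 1-bit offset
sets = {
    0b0: [(0b001, (0xFC, 0x31)), (0b101, (0x12, 0x82))],
    0b1: [(0b101, (0x57, 0x48)), (0b000, (0x63, 0x99))],
}

def lookup_2way(addr):
    offset = addr & 0b1
    index = (addr >> 1) & 0b1
    tag = (addr >> 2) & 0b111
    # Two tag comparisons per lookup: check both ways in the set.
    for stored_tag, block in sets[index]:
        if stored_tag == tag:
            return ("hit", hex(block[offset]))
    return ("miss", None)
```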
Fully Associative Mapping
Any memory block can be stored in any cache block: no index bits at all, just tag and offset.
- Requires checking every tag in the cache on each lookup (more comparison hardware)
- Gives the most placement freedom, so it is the least likely to miss
Write Policy
Hit:
- write-through
  - Write to the cache, then to main memory
- write-back
  - Only write to the cache; the dirty block is flushed to main memory when it is evicted
  - Useful if no other process needs the updated version from main memory right away
Miss:
- write-allocate
  - Treat it as a read miss followed by a write hit: load the block into the cache, then write to it
- no-write-allocate
  - Don’t bother with the cache; write directly to main memory
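A sketch (ours; cache, memory, and dirty are illustrative stand-ins keyed by address) contrasting the two hit policies:

```python
cache, memory, dirty = {}, {}, set()

def write_through(addr, value):
    # Write hit: update the cache and main memory together.
    cache[addr] = value
    memory[addr] = value

def write_back(addr, value):
    # Write hit: update only the cache and mark the block dirty.
    cache[addr] = value
    dirty.add(addr)

def evict(addr):
    # Write-back defers the memory update until eviction.
    if addr in dirty:
        memory[addr] = cache[addr]
        dirty.discard(addr)
    del cache[addr]
```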
Virtual Memory Abstraction
The key benefit is that each process is tricked into thinking it has a memory space all to itself.
Address Translation
A layer of indirection keeps processes from overwriting each other’s variables.
It also allows for more virtual memory than physical memory, by spilling to disk.
Paging
Paging is the granularity at which we grant memory to processes.
Consider a 16-bit virtual address space and a 512-byte page size: how many virtual page numbers are there?
- A 512 B page needs 9 bits for the offset, leaving 16 - 9 = 7 bits for the page number: 2^7 = 128 virtual pages
Page Table Implementation
Page tables are layered in a hierarchy of page tables, with the top one called the page directory, to save memory: tables for unused regions of the address space never have to be allocated. A walk of the hierarchy is sketched below.
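A sketch (ours) of a two-level walk, assuming a 32-bit address with 4 KB pages: 10 bits index the page directory, 10 bits index a page table, and 12 bits are the page offset.

```python
def translate(virtual_addr, page_directory):
    # 32-bit address: 10-bit directory index | 10-bit table index | 12-bit offset
    dir_index = (virtual_addr >> 22) & 0x3FF
    table_index = (virtual_addr >> 12) & 0x3FF
    offset = virtual_addr & 0xFFF

    page_table = page_directory[dir_index]
    if page_table is None:    # unused 4 MB region: no page table was ever allocated
        raise LookupError("page fault")
    frame = page_table[table_index]
    if frame is None:         # page mapped but not present in physical memory
        raise LookupError("page fault")
    return (frame << 12) | offset
```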
Accelerating Address Translation
The Translation Lookaside Buffer (TLB) is a special cache on the CPU that holds page table entries.
- Most systems flush the TLB on a context switch, which is part of why context switches are so costly
- Some systems tag entries with an address space ID to avoid flushing
- Since the kernel address space is constant, its entries are not flushed
Page Fault
If a process tries to access a page that’s not in physical memory, a page fault happens
- The scheduler sets the process back to its state just before the faulting access
- The OS loads the page into memory
- The scheduler replays the faulting access
Putting it all together
Pseudocode:

```python
def load(virtual_addr):
    # Translate first; on a page fault, handle it and retry the load.
    try:
        physical_addr = translate_addr(virtual_addr)
    except PageFault:
        handle_page_fault(virtual_addr)
        return load(virtual_addr)
    # With a physical address in hand, go through the cache hierarchy.
    if is_in_cache(physical_addr):
        return read_from_cache(physical_addr)
    else:
        return read_from_memory(physical_addr)

def translate_addr(virtual_addr):
    if is_in_tlb(virtual_addr):
        return read_from_tlb(virtual_addr)  # fast path: no page table walk
    elif is_access_violation(virtual_addr):
        raise AccessViolation
    elif is_present(virtual_addr):
        return read_from_page_table(virtual_addr)
    else:
        raise PageFault
```
Virtually Indexed, Physically Tagged Caches
Suppose 32-bit virtual and physical address spaces. Pages are 4KB and cache blocks are 64B. What’s the maximum number of entries in a virtually indexed, physically tagged cache?
The cache index must come entirely from bits that are unchanged by address translation, i.e. from within the page offset:
- 4 KB page size => 12-bit page offset
- 64 B block size => 6-bit block offset
- That leaves 12 - 6 = 6 bits for the index, so 2^6 = 64 entries