Basic Model Of Locality

Stephen M. Reaves

2024-08-19

Notes about Lecture 1 for CS-6220

Required Readings

Summary

A Basic Model
- Two Rules
- Costs
Quizzes
Algorithmic Design Goals
Intensity, Balance, and Time

A Basic Model

Fast memory is small, can only hold Z words

Two Rules

Processor may only compute on data in fast memory
- “Local data rule”
Slow-fast transfers happen in blocks of $L$ words
- “Block transfer rule”
- If you want to move $x$ words (with $x < L$), you still pay for a whole block, i.e. an additional $L - x$ words
  - Where $x$ sits within its block of $L$ words is called data alignment

Costs

Work
- $W(n)$
- # of computations the processor needs to do
Transfers
- $Q(n,Z,L)$
- # of L-sized slow-fast transfers
  - “load and stores”
- “I/O” complexity

Quizzes

How many transfers are needed in the worst case, assuming nothing about alignment, to sum an array of numbers?

$\lceil n/L \rceil + 1$
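The effect of alignment on this count can be checked by brute force. The sketch below is only an illustration: `blocks_touched` and `worst_case_blocks` are hypothetical helper names, and it assumes the array is contiguous and that block boundaries fall at multiples of $L$.

```python
from math import ceil

def blocks_touched(n: int, L: int, offset: int) -> int:
    """Count the L-word blocks overlapped by n contiguous words
    whose first word sits at word address `offset`."""
    first_block = offset // L
    last_block = (offset + n - 1) // L
    return last_block - first_block + 1

def worst_case_blocks(n: int, L: int) -> int:
    """Maximum block count over every possible alignment within a block."""
    return max(blocks_touched(n, L, s) for s in range(L))

n, L = 8, 4
print(blocks_touched(n, L, 0))   # array starts on a block boundary
print(worst_case_blocks(n, L))   # array starts mid-block
print(ceil(n / L) + 1)           # bound from the quiz answer
```

For $n = 8$ and $L = 4$ an aligned array needs 2 transfers, while a misaligned one needs 3, which matches the bound. No alignment ever exceeds the bound.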

Give a simple (trivial) lower bound on the asymptotic number of transfers when sorting an array

$W(n) = \Omega(n\log{n})$

$Q(n,Z,L) = \Omega(\lceil n/L \rceil)$

$Q(n,Z,L) = \Omega\left(\frac{\frac{n}{L}\log{\frac{n}{L}}}{\log{\frac{Z}{L}}}\right)$
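To get a feel for how far apart the two transfer bounds are, they can be evaluated for sample sizes. This is a rough sketch: the function names and the values of `n`, `Z`, and `L` are made up for illustration, and the constants hidden by $\Omega$ are ignored.

```python
from math import ceil, log

def trivial_sort_transfers(n: int, L: int) -> int:
    """Trivial bound: every word must be read at least once."""
    return ceil(n / L)

def tighter_sort_transfers(n: int, Z: int, L: int) -> float:
    """(n/L) * log(n/L) / log(Z/L), with constant factors dropped."""
    return (n / L) * log(n / L) / log(Z / L)

# Hypothetical sizes, in words: 2^30 items, 2^20-word fast memory, 8-word blocks.
n, Z, L = 2**30, 2**20, 2**3
trivial = trivial_sort_transfers(n, L)
tighter = tighter_sort_transfers(n, Z, L)
print(trivial, tighter, tighter / trivial)
```

With these sizes the ratio is $\log(n/L) / \log(Z/L) = 27/17$, so the tighter bound is only modestly larger than the trivial one.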

What is the minimum number of transfers required to multiply two $n \times n$ matrices?

$W(n) = O(n^3)$

$Q(n,Z,L) = \Omega(\frac{n^2}{L})$

$Q(n,Z,L) = \Omega(\frac{n^3}{L\times\sqrt{Z}})$
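The two matrix-multiply bounds can be compared the same way. The sketch below uses hypothetical function names and sample sizes, and again drops the constants hidden by $\Omega$.

```python
from math import sqrt

def trivial_matmul_transfers(n: int, L: int) -> float:
    """n^2 / L: each matrix element must be moved at least once."""
    return n**2 / L

def tighter_matmul_transfers(n: int, Z: int, L: int) -> float:
    """n^3 / (L * sqrt(Z)), with constant factors dropped."""
    return n**3 / (L * sqrt(Z))

# Hypothetical sizes, in words: 4096 x 4096 matrices, 2^20-word fast memory, 8-word blocks.
n, Z, L = 4096, 2**20, 8
ratio = tighter_matmul_transfers(n, Z, L) / trivial_matmul_transfers(n, L)
print(ratio)  # equals n / sqrt(Z)
```

The ratio of the two bounds is $n / \sqrt{Z}$, so the second bound only exceeds the first when $n^2 > Z$, that is, when a matrix no longer fits in fast memory.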

Algorithmic Design Goals

Work Optimality
- Accounting for the memory hierarchy should not make the work asymptotically worse than that of the best algorithm that ignores it
High Computational Intensity
- Maximize: $I(n,Z,L) \equiv \frac{W(n)}{L \times Q(n,Z,L)}$
  - operations per word
  - data reuse

Given two algorithms, one with

$W(n) = \Theta(n), Q(n,Z,L) = \Theta(\frac{n}{L})$

and

$W(n) = \Theta(n\log{n}), Q(n,Z,L) = \Theta(\frac{n}{L\log{Z}})$

which is better?

Algorithm 1 does less work, but algorithm 2 has higher intensity, so it's a draw unless we know more about the problem.
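The intensity of each algorithm follows directly from the definition $I = \frac{W}{L \times Q}$. The subscripts 1 and 2 are introduced here only as labels for the two algorithms:

$I_1 = \frac{\Theta(n)}{L \times \Theta(\frac{n}{L})} = \Theta(1)$

$I_2 = \frac{\Theta(n\log{n})}{L \times \Theta(\frac{n}{L\log{Z}})} = \Theta(\log{n}\log{Z})$

Algorithm 2 therefore does a factor of $\log{n}$ more work, but performs $\Theta(\log{n}\log{Z})$ times more operations per word moved.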

Intensity, Balance, and Time

Assume it takes $\tau$ time per compute operation and $\alpha$ (amortized) time to load one word. The time spent computing ( $T_{comp}$ ) is then $\tau W$ , while the time spent moving data to and from slow memory ( $T_{mem}$ ) is $\alpha L Q$ .

The total time therefore satisfies $T \ge \max(T_{comp}, T_{mem})$ , with equality when computation and data movement overlap perfectly. In that case:

$T = \tau W \times \max(1, \frac{\alpha/\tau}{W/(LQ)})$

$\tau W$ is an ideal computation time, assuming data movement is free

$\max(1, \frac{\alpha/\tau}{W/(LQ)})$ is the cost/penalty of moving data

$W/(LQ)$ is just the intensity $I$ , in operations per word

$\alpha/\tau$ is how many operations can be executed in the time it takes to move one word, also called the machine balance, $B$

The minimum possible execution time in terms of the balance of the machine and the intensity of the algorithm is $\tau W \times \max(1, \frac{B}{I})$
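As a quick numerical check of this formula, here is a small sketch. The function name `min_time` and every machine and algorithm parameter below are made-up values for illustration, not measurements.

```python
from math import isclose

def min_time(W: float, Q: float, L: float, tau: float, alpha: float) -> float:
    """tau * W * max(1, B / I), with I = W / (L * Q) and B = alpha / tau."""
    intensity = W / (L * Q)   # operations per word moved
    balance = alpha / tau     # operations per word-transfer time
    return tau * W * max(1.0, balance / intensity)

# Hypothetical algorithm: 1e9 operations, 1e7 block transfers of 8 words each.
W, Q, L = 1e9, 1e7, 8
tau = 1e-9  # assumed seconds per operation

compute_bound = min_time(W, Q, L, tau, alpha=1e-8)  # B = 10 < I = 12.5
memory_bound = min_time(W, Q, L, tau, alpha=4e-8)   # B = 40 > I = 12.5
print(compute_bound, memory_bound)
```

When $B < I$ the time is just $\tau W$; when $B > I$ it equals $\alpha L Q$, the memory time.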

Normalizing the performance against the work of the best algorithm in the RAM model ( $W_\star$ ) gives you $R \equiv \frac{\tau W_\star}{T} \le \frac{W_\star}{W} \times \min(1, \frac{I}{B})$
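The bound on $R$ can be explored numerically as well. In this sketch the function name and the input values are illustrative assumptions only.

```python
from math import isclose

def normalized_performance_bound(W_star: float, W: float,
                                 intensity: float, balance: float) -> float:
    """Upper bound on R: (W_star / W) * min(1, I / B)."""
    return (W_star / W) * min(1.0, intensity / balance)

# Work-optimal algorithm whose intensity exceeds the machine balance.
best_case = normalized_performance_bound(W_star=1e9, W=1e9, intensity=20.0, balance=10.0)

# Twice the optimal work, and intensity at half the machine balance.
penalized = normalized_performance_bound(W_star=1e9, W=2e9, intensity=5.0, balance=10.0)
print(best_case, penalized)
```

The bound reaches 1 only when the algorithm is work-optimal and $I \ge B$. Extra work and low intensity each scale it down multiplicatively.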