
Distributed Dense Matrix Multiply

Stephen M. Reaves


2024-11-06

Notes about Lecture 8 for CS-6220

Required Readings

Summary

Matrix Multiply Basics

Definitions

Given an m by k matrix A, a k by n matrix B, and an m by n matrix C, compute the dot product of each row of A with each column of B and accumulate it into the corresponding entry of C:

C <- C + AB

$$\forall i,j:\; C_{ij} \leftarrow C_{ij} + \sum_{l=1}^{k} A_{il}B_{lj}$$

For example, take the row $(3, 4, -2)$ of $A$, the column $(1, 2, 5)^T$ of $B$, and the current value $C_{ij} = 6$:

$$C_{ij} \leftarrow 6 + (3)(1) + (4)(2) + (-2)(5) = 7$$

for i <- 1 to m do
  for j <- 1 to n do
    for l <- 1 to k do
      C[i,j] <- C[i,j] + A[i, l] * B[l, j]
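
As a concrete (if slow) reference, here is the same triple loop as a runnable NumPy sketch; the function name and sample sizes are illustrative only:

import numpy as np

def matmul_accumulate(A, B, C):
    # C <- C + A @ B via the naive triple loop
    m, k = A.shape
    _, n = B.shape
    for i in range(m):
        for j in range(n):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]
    return C

# example: m = 4, k = 3, n = 5
A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
C = np.zeros((4, 5))
assert np.allclose(matmul_accumulate(A, B, C), A @ B)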

The top two loops can be parfors, and the innermost loop can be replaced by a reduction:

parfor i <- 1 to m do
  parfor j <- 1 to n do
    let T[1:k] = temp_array
    parfor l <- 1 to k do
      T[l] <- A[i, l] * B[l, j]
    C[i,j] <- C[i,j] + reduce(T[:])
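
The same structure, sketched sequentially in NumPy for a single (i, j) pair: the temporary T plays the role of T[1:k], and the sum is the reduction (function and variable names are mine, not the lecture's):

import numpy as np

def update_cell(A, B, C, i, j):
    T = A[i, :] * B[:, j]   # parfor over l: T[l] = A[i, l] * B[l, j]
    C[i, j] += T.sum()      # reduce(T[:]); a parallel reduction has O(log k) depth
    return C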

$$W(n) = O(n^3)$$

$$D(n) = O(\log(n))$$

[Figure: matrix multiply]

[Figure: matrix multiply, geometric view]

Algorithms

1D Algorithms

Start by distributing work across block rows to the P processors (ranks 0 through P-1): each processor gets n/P consecutive rows of each matrix operand.

The most work any one processor can do on its own is to update its entire block row of C using one block row of B and the matching subblock of its rows of A. We then shift which block row of B each processor holds, so it can be used with another subblock of A.

[Figure: block row distribution]
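
A single-process simulation of this 1D scheme, assuming n is divisible by P; the block rows of B are indexed directly instead of being shifted over a network, and all names are illustrative:

import numpy as np

def one_d_block_row(A, B, P):
    # Each "processor" p owns rows p*b:(p+1)*b of A, B, and C (b = n/P).
    # In step t, processor p holds block row (p + t) % P of B and multiplies it
    # by the matching n/P-wide subblock of its own rows of A.
    n = A.shape[0]
    b = n // P
    C = np.zeros((n, n))
    for t in range(P):
        for p in range(P):
            q = (p + t) % P                      # which block row of B p holds now
            A_sub = A[p*b:(p+1)*b, q*b:(q+1)*b]
            C[p*b:(p+1)*b, :] += A_sub @ B[q*b:(q+1)*b, :]
        # on a real machine the block rows of B would be circularly shifted here
    return C

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(one_d_block_row(A, B, P=4), A @ B)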


Parallel efficiency
speedup / P (see the formula below). A parallel system is efficient if the parallel efficiency is a constant; higher constants are better.
Isoefficiency function
The function of P that n has to satisfy in order to maintain constant parallel efficiency.
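
Spelled out (with $T^*(n)$ the best sequential time and $T_P(n)$ the parallel time on $P$ processors, notation assumed here rather than taken from the lecture):

$$E(n, P) = \frac{S(n, P)}{P} = \frac{T^*(n)}{P \cdot T_P(n)}$$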

2D Algorithms

SUMMA

Assume a 2D mesh network. Operands are distributed in blocks as usual, and each node is responsible for updating the part of the C matrix that it owns.

for (l = 0; l < n; l += s) do
  broadcast(A[:, l:l+s], owner)   // owner of this s-wide column panel sends it along its processor row
  broadcast(B[l:l+s, :], owner)   // owner of this s-wide row panel sends it along its processor column
  localUpdate(...)
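
A single-process sketch of SUMMA on a $\sqrt{P} \times \sqrt{P}$ grid, assuming $n$ is divisible by both $\sqrt{P}$ and the panel width $s$; the two broadcasts are simulated by reading the owner's panels directly, and all names are illustrative:

import numpy as np

def summa(A, B, sqrtP, s):
    # Grid node (r, c) owns the (n/sqrtP) x (n/sqrtP) block of C and, per step,
    # receives an s-wide column panel of A (broadcast along its processor row)
    # and an s-wide row panel of B (broadcast along its processor column).
    n = A.shape[0]
    b = n // sqrtP
    C = np.zeros((n, n))
    for l in range(0, n, s):                 # n/s steps
        A_panel = A[:, l:l+s]
        B_panel = B[l:l+s, :]
        for r in range(sqrtP):
            for c in range(sqrtP):
                C[r*b:(r+1)*b, c*b:(c+1)*b] += (
                    A_panel[r*b:(r+1)*b, :] @ B_panel[:, c*b:(c+1)*b]
                )
    return C

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(summa(A, B, sqrtP=2, s=2), A @ B)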

$$T_{summa}(n;P,s) = \text{FLOP time} + \text{communication time}$$

$$\phantom{T_{summa}(n;P,s)} = \frac{n}{s}\times\left(2\tau\frac{n^2 s}{P}\right) + \left(\alpha\frac{n}{s}\times\log(P) + \beta\frac{n^2}{\sqrt{P}}\times\log(P)\right)$$

$$\phantom{T_{summa}(n;P,s)} = \frac{2\tau n^3}{P} + \left(\alpha\frac{n}{s}\times\log(P) + \beta\frac{n^2}{\sqrt{P}}\times\log(P)\right)$$
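
As a quick way to play with the trade-off between n, P, and s, here is a throwaway cost-model function for the expression above; the log base and the sample parameter values are assumptions, not lecture values:

import math

def t_summa(n, P, s, tau, alpha, beta):
    # FLOP time + communication time, per the formula above
    flop_time = 2 * tau * n**3 / P
    comm_time = alpha * (n / s) * math.log2(P) + beta * (n**2 / math.sqrt(P)) * math.log2(P)
    return flop_time + comm_time

# e.g. t_summa(n=4096, P=64, s=64, tau=1e-10, alpha=1e-6, beta=1e-9)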

The 2D algorithm is asymptotically better than the 1D algorithm; a tree broadcast is slightly better than a bucket broadcast.

Memory for the 1D algorithm: $4\frac{n^2}{P}$

Memory for the 2D algorithm: $3\frac{n^2}{P} + 2\frac{n}{\sqrt{P}}s$, where $1 \le s \le \frac{n}{\sqrt{P}}$

$$M_{summa} \begin{cases} < 4\frac{n^2}{P} & \text{when } s < \frac{1}{2}\frac{n}{\sqrt{P}} \\ = 4\frac{n^2}{P} & \text{when } s = \frac{1}{2}\frac{n}{\sqrt{P}} \\ > 4\frac{n^2}{P} & \text{when } s > \frac{1}{2}\frac{n}{\sqrt{P}} \end{cases}$$
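
Plugging the balance point $s = \frac{1}{2}\frac{n}{\sqrt{P}}$ into the 2D expression confirms the middle case:

$$3\frac{n^2}{P} + 2\frac{n}{\sqrt{P}} \cdot \frac{1}{2}\frac{n}{\sqrt{P}} = 3\frac{n^2}{P} + \frac{n^2}{P} = 4\frac{n^2}{P}$$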

Lower bound on Communication

Imagine a machine with P nodes, connected by some topology, where each node has M words of main memory. Over the entire distributed matrix multiply, a given node does W multiplications.

How many words MUST this node send or receive?

Imagine a timeline where a node's actions alternate between computation and communication. That timeline can be broken up into L phases in which the node sends/receives exactly M words (except the last phase, which might have fewer than M).

What is the largest number of multiplies that each phase can possibly do?

$$\text{let } S_A \equiv \text{set of elements of } A \text{ seen in this phase}$$

(and define $S_B$ and $S_C$ similarly)

By Loomis–Whitney:

$$\text{Max number of multiplies per phase} \le \sqrt{|S_A| \cdot |S_B| \cdot |S_C|}$$

It's possible that a node already has M words in its memory at the start of a phase, does some computation, evicts those M words, then receives M more words and does more computation. So each set can contain at most 2M elements:

$$|S_A| \le 2M \quad (\text{and likewise for } |S_B|, |S_C|)$$

$$\text{Max number of multiplies per phase} \le \sqrt{|S_A| \cdot |S_B| \cdot |S_C|} \le \sqrt{(2M)^3} = 2\sqrt{2}\,M^{3/2}$$

How many phases are there?

$$L \ge \text{number of full phases} \ge \left\lfloor\frac{W}{\text{max multiplies per phase}}\right\rfloor$$

$$\phantom{L \ge \text{number of full phases}} \ge \left\lfloor\frac{W}{2\sqrt{2}\,M^{3/2}}\right\rfloor$$

$$\phantom{L \ge \text{number of full phases}} \ge \frac{W}{2\sqrt{2}\,M^{3/2}} - 1$$

$$\text{Number of words communicated by one node} \ge \text{number of full phases} \times M$$

$$\phantom{\text{Number of words communicated by one node}} \ge \left(\frac{W}{2\sqrt{2}\,M^{3/2}} - 1\right) \times M$$

$$\phantom{\text{Number of words communicated by one node}} \ge \frac{W}{2\sqrt{2}\,\sqrt{M}} - M$$

If the matrices have dimensions $m$, $n$, and $k$, at least one node has to do $\frac{m \cdot n \cdot k}{P}$ multiplications, so there is some node where $W \ge \frac{m \cdot n \cdot k}{P}$.

$$\text{Number of words communicated by one node} \ge \frac{m \cdot n \cdot k}{2\sqrt{2}\, P \sqrt{M}} - M$$

If the operands are distributed evenly across all nodes (and assuming $m = n = k$):

$$\text{Distributed operands} \implies M = \Theta\left(\frac{n^2}{P}\right)$$

$$\text{Number of words communicated by one node} = \Omega\left(\frac{n^2}{\sqrt{P}}\right)$$
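
To see where this comes from, substitute $W \ge \frac{n^3}{P}$ and $M = \Theta\left(\frac{n^2}{P}\right)$ into the bound above:

$$\frac{W}{2\sqrt{2}\,\sqrt{M}} - M = \Omega\left(\frac{n^3/P}{\sqrt{n^2/P}}\right) - \Theta\left(\frac{n^2}{P}\right) = \Omega\left(\frac{n^2}{\sqrt{P}}\right)$$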

Total time for communication

$$T_{net}(n;P) = \Omega\left(\alpha \cdot \left[\,???\,\right] + \beta \cdot \frac{n^2}{\sqrt{P}}\right)$$

Since the largest message a node can send is of size M,

$$\text{Number of messages} \ge \Theta\left(\frac{n^2/\sqrt{P}}{M(n;P)}\right) = \Theta\left(\sqrt{P}\right)$$

$$T_{net}(n;P) = \Omega\left(\alpha \cdot \sqrt{P} + \beta \cdot \frac{n^2}{\sqrt{P}}\right)$$

Beating the Lower Bound

Increasing memory from $\frac{n^2}{P}$ to $\frac{n^3}{P}$ allows you to use a 3D algorithm that broadcasts volumes of computation rather than surfaces.

The 2.5D algorithm will be on the exam.