
Distributed Dense Matrix Multiply

Stephen M. Reaves


2024-11-06

Notes about Lecture 8 for CS-6220

Required Readings

Summary

Matrix Multiply Basics

Definitions

Given an m by k matrix A, a k by n matrix B, and an m by n matrix C, compute the dot product of each row of A with each column of B and accumulate it into the corresponding entry of C:

C <- C + AB

$$\forall i,j:\; C_{ij} \leftarrow C_{ij} + \sum_{l=1}^{k} A_{il}B_{lj}$$

For example, take the row $(3, 4, -2)$ of $A$, the column $(1, 2, 5)^T$ of $B$, and the current value $C_{ij} = 6$:

$$C_{ij} \leftarrow 6 + (3)(1) + (4)(2) + (-2)(5) = 7$$

for i <- 1 to m do
  for j <- 1 to n do
    for l <- 1 to k do
      C[i,j] <- C[i,j] + A[i, l] * B[l, j]
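
As a concrete (if slow) reference, here is the same triple loop as a runnable NumPy sketch; the function name and sample sizes are illustrative only:

import numpy as np

def matmul_accumulate(A, B, C):
    # C <- C + A @ B via the naive triple loop
    m, k = A.shape
    _, n = B.shape
    for i in range(m):
        for j in range(n):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]
    return C

# example: m = 4, k = 3, n = 5
A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
C = np.zeros((4, 5))
assert np.allclose(matmul_accumulate(A, B, C), A @ B)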

The top two loops can be parfors, and the innermost loop can be replaced by a reduction:

parfor i <- 1 to m do
  parfor j <- 1 to n do
    let T[1:k] = temp_array
    parfor l <- 1 to k do
      T[l] <- A[i, l] * B[l, j]
    C[i,j] <- C[i,j] + reduce(T[:])
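
The same structure, sketched sequentially in NumPy for a single (i, j) pair: the temporary T plays the role of T[1:k], and the sum is the reduction (function and variable names are mine, not the lecture's):

import numpy as np

def update_cell(A, B, C, i, j):
    T = A[i, :] * B[:, j]   # parfor over l: T[l] = A[i, l] * B[l, j]
    C[i, j] += T.sum()      # reduce(T[:]); a parallel reduction has O(log k) depth
    return C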

$$W(n) = O(n^3)$$

$$D(n) = O(\log(n))$$

[Figure: matrix multiply]

[Figure: matrix multiply, geometric view]

Algorithms

1D Algorithms

Start by distributing work across block rows to the P processors (ranks 0 through P-1): each processor gets n/P consecutive rows of each matrix operand.

The most work any one processor can do on its own is to update its entire block row of C using one block row of B and the matching subblock of its rows of A. We then shift which block row of B each processor holds, so it can be used with another subblock of A.

[Figure: block row distribution]
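
A single-process simulation of this 1D scheme, assuming n is divisible by P; the block rows of B are indexed directly instead of being shifted over a network, and all names are illustrative:

import numpy as np

def one_d_block_row(A, B, P):
    # Each "processor" p owns rows p*b:(p+1)*b of A, B, and C (b = n/P).
    # In step t, processor p holds block row (p + t) % P of B and multiplies it
    # by the matching n/P-wide subblock of its own rows of A.
    n = A.shape[0]
    b = n // P
    C = np.zeros((n, n))
    for t in range(P):
        for p in range(P):
            q = (p + t) % P                      # which block row of B p holds now
            A_sub = A[p*b:(p+1)*b, q*b:(q+1)*b]
            C[p*b:(p+1)*b, :] += A_sub @ B[q*b:(q+1)*b, :]
        # on a real machine the block rows of B would be circularly shifted here
    return C

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(one_d_block_row(A, B, P=4), A @ B)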


Parallel efficiency
speedup / P (see the formula below). A parallel system is efficient if the parallel efficiency is a constant; higher constants are better.
Isoefficiency function
The function of P that n has to satisfy in order to maintain constant parallel efficiency.
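
Spelled out (with $T^*(n)$ the best sequential time and $T_P(n)$ the parallel time on $P$ processors, notation assumed here rather than taken from the lecture):

$$E(n, P) = \frac{S(n, P)}{P} = \frac{T^*(n)}{P \cdot T_P(n)}$$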

2D Algorithms

SUMMA

Assume a 2D mesh network. Operands are distributed in blocks as usual, and each node is responsible for updating the part of the C matrix that it owns.

for (l = 0; l < n; l += s) do
  broadcast(A[:, l:l+s], owner)   // owner of this s-wide column panel sends it along its processor row
  broadcast(B[l:l+s, :], owner)   // owner of this s-wide row panel sends it along its processor column
  localUpdate(...)
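
A single-process sketch of SUMMA on a $\sqrt{P} \times \sqrt{P}$ grid, assuming $n$ is divisible by both $\sqrt{P}$ and the panel width $s$; the two broadcasts are simulated by reading the owner's panels directly, and all names are illustrative:

import numpy as np

def summa(A, B, sqrtP, s):
    # Grid node (r, c) owns the (n/sqrtP) x (n/sqrtP) block of C and, per step,
    # receives an s-wide column panel of A (broadcast along its processor row)
    # and an s-wide row panel of B (broadcast along its processor column).
    n = A.shape[0]
    b = n // sqrtP
    C = np.zeros((n, n))
    for l in range(0, n, s):                 # n/s steps
        A_panel = A[:, l:l+s]
        B_panel = B[l:l+s, :]
        for r in range(sqrtP):
            for c in range(sqrtP):
                C[r*b:(r+1)*b, c*b:(c+1)*b] += (
                    A_panel[r*b:(r+1)*b, :] @ B_panel[:, c*b:(c+1)*b]
                )
    return C

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(summa(A, B, sqrtP=2, s=2), A @ B)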

$$T_{summa}(n;P,s) = \text{FLOP time} + \text{communication time}$$

$$\phantom{T_{summa}(n;P,s)} = \frac{n}{s}\times\left(2\tau\frac{n^2 s}{P}\right) + \left(\alpha\frac{n}{s}\times\log(P) + \beta\frac{n^2}{\sqrt{P}}\times\log(P)\right)$$

$$\phantom{T_{summa}(n;P,s)} = \frac{2\tau n^3}{P} + \left(\alpha\frac{n}{s}\times\log(P) + \beta\frac{n^2}{\sqrt{P}}\times\log(P)\right)$$
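
As a quick way to play with the trade-off between n, P, and s, here is a throwaway cost-model function for the expression above; the log base and the sample parameter values are assumptions, not lecture values:

import math

def t_summa(n, P, s, tau, alpha, beta):
    # FLOP time + communication time, per the formula above
    flop_time = 2 * tau * n**3 / P
    comm_time = alpha * (n / s) * math.log2(P) + beta * (n**2 / math.sqrt(P)) * math.log2(P)
    return flop_time + comm_time

# e.g. t_summa(n=4096, P=64, s=64, tau=1e-10, alpha=1e-6, beta=1e-9)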

The 2D algorithm is asymptotically better than the 1D algorithm; a tree broadcast is slightly better than a bucket broadcast.

Memory for the 1D algorithm: $4\frac{n^2}{P}$

Memory for the 2D algorithm: $3\frac{n^2}{P} + 2\frac{n}{\sqrt{P}}s$, where $1 \le s \le \frac{n}{\sqrt{P}}$

$$M_{summa} \begin{cases} < 4\frac{n^2}{P} & \text{when } s < \frac{1}{2}\frac{n}{\sqrt{P}} \\ = 4\frac{n^2}{P} & \text{when } s = \frac{1}{2}\frac{n}{\sqrt{P}} \\ > 4\frac{n^2}{P} & \text{when } s > \frac{1}{2}\frac{n}{\sqrt{P}} \end{cases}$$
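
Plugging the balance point $s = \frac{1}{2}\frac{n}{\sqrt{P}}$ into the 2D expression confirms the middle case:

$$3\frac{n^2}{P} + 2\frac{n}{\sqrt{P}} \cdot \frac{1}{2}\frac{n}{\sqrt{P}} = 3\frac{n^2}{P} + \frac{n^2}{P} = 4\frac{n^2}{P}$$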

Lower bound on Communication

Imagine a machine with P nodes, connected by some topology, where each node has M words of main memory. Over the entire distributed matrix multiply, a given node does W multiplications.

How many words MUST this node send or receive?

Imagine a timeline where a node's actions alternate between computation and communication. That timeline can be broken up into L phases in which the node sends/receives exactly M words (except the last phase, which might have fewer than M).

What is the largest number of multiplies that each phase can possibly do?

$$\text{let } S_A \equiv \text{set of elements of } A \text{ seen in this phase}$$

(and define $S_B$ and $S_C$ similarly)

By Loomis–Whitney:

$$\text{Max number of multiplies per phase} \le \sqrt{|S_A| \cdot |S_B| \cdot |S_C|}$$

It's possible that a node already has M words in its memory at the start of a phase, does some computation, evicts those M words, then receives M more words and does more computation. So each set can contain at most 2M elements:

$$|S_A| \le 2M \quad (\text{and likewise for } |S_B|, |S_C|)$$

$$\text{Max number of multiplies per phase} \le \sqrt{|S_A| \cdot |S_B| \cdot |S_C|} \le \sqrt{(2M)^3} = 2\sqrt{2}\,M^{3/2}$$

How many phases are there?

$$L \ge \text{number of full phases} \ge \left\lfloor\frac{W}{\text{max multiplies per phase}}\right\rfloor$$

$$\phantom{L \ge \text{number of full phases}} \ge \left\lfloor\frac{W}{2\sqrt{2}\,M^{3/2}}\right\rfloor$$

$$\phantom{L \ge \text{number of full phases}} \ge \frac{W}{2\sqrt{2}\,M^{3/2}} - 1$$

$$\text{Number of words communicated by one node} \ge \text{number of full phases} \times M$$

$$\phantom{\text{Number of words communicated by one node}} \ge \left(\frac{W}{2\sqrt{2}\,M^{3/2}} - 1\right) \times M$$

$$\phantom{\text{Number of words communicated by one node}} \ge \frac{W}{2\sqrt{2}\,\sqrt{M}} - M$$

If the matrices have dimensions $m$, $n$, and $k$, at least one node has to do $\frac{m \cdot n \cdot k}{P}$ multiplications, so there is some node where $W \ge \frac{m \cdot n \cdot k}{P}$.

$$\text{Number of words communicated by one node} \ge \frac{m \cdot n \cdot k}{2\sqrt{2}\, P \sqrt{M}} - M$$

If the operands are distributed evenly across all nodes (and assuming $m = n = k$):

$$\text{Distributed operands} \implies M = \Theta\left(\frac{n^2}{P}\right)$$

$$\text{Number of words communicated by one node} = \Omega\left(\frac{n^2}{\sqrt{P}}\right)$$
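
To see where this comes from, substitute $W \ge \frac{n^3}{P}$ and $M = \Theta\left(\frac{n^2}{P}\right)$ into the bound above:

$$\frac{W}{2\sqrt{2}\,\sqrt{M}} - M = \Omega\left(\frac{n^3/P}{\sqrt{n^2/P}}\right) - \Theta\left(\frac{n^2}{P}\right) = \Omega\left(\frac{n^2}{\sqrt{P}}\right)$$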

Total time for communication

$$T_{net}(n;P) = \Omega\left(\alpha \cdot \left[\,???\,\right] + \beta \cdot \frac{n^2}{\sqrt{P}}\right)$$

Since the largest message a node can send is of size M,

$$\text{Number of messages} \ge \Theta\left(\frac{n^2/\sqrt{P}}{M(n;P)}\right) = \Theta\left(\sqrt{P}\right)$$

$$T_{net}(n;P) = \Omega\left(\alpha \cdot \sqrt{P} + \beta \cdot \frac{n^2}{\sqrt{P}}\right)$$

Beating the Lower Bound

Increasing memory from $\frac{n^2}{P}$ to $\frac{n^3}{P}$ allows you to use a 3D algorithm that broadcasts volumes of computation rather than surfaces.

The 2.5D algorithm will be on the exam.