Final Exam Study Guide

Stephen M. Reaves


2024-12-07

Study guide for the final exam for CS-6220

Summary

Homework 5

Homework 5 with solutions

Definitions

Work Optimal
Does the parallel algorithm do the same amount of work as the sequential algorithm?

$$\text{optimal} = \frac{W_*}{W_p}$$

Speedup
S = sequential time over parallel time

$$S = \frac{T_1}{T_p}$$

Parallel Efficiency
Speedup over number of processors

$$E = \frac{S}{p}$$
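As a quick worked example with hypothetical numbers: if the sequential run takes $T_1 = 100$ seconds and the same problem runs in $T_p = 10$ seconds on $p = 16$ processors, then

$$S = \frac{T_1}{T_p} = \frac{100}{10} = 10, \qquad E = \frac{S}{p} = \frac{10}{16} \approx 0.63$$

so the parallel run achieves a speedup of 10 at roughly 63% efficiency.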

Isoefficiency
How fast n must grow as a function of p to keep efficiency constant.

Parallel Efficiency

Question A

If the work done by the parallel algorithm is the same as (or better than) the work of the best sequential algorithm, then we say it is work optimal. In this case, each of the $p$ processors makes $\sqrt{p}$ calls to FOO on $\frac{n}{p}$ data, which works out to be the same amount of work as the sequential algorithm.

$$
\begin{aligned}
W_p(n) &= \text{processor count} \times \text{loop count} \times \text{work per loop} \\
&= p \times \sqrt{p} \times W_{*}\left(\frac{n}{p}\right) \\
&= p \times \sqrt{p} \times \frac{2n^{3/2}}{p^{3/2}} \\
&= 2n^{3/2} \\
&= W_*(n)
\end{aligned}
$$

Question B

The speedup is

S=T1Tp S = \frac{T_1}{T_p}

S=2n3/2τ2n3/2pτ+p(α+npβ) \phantom{S} = \frac{2n^{3/2}\tau}{\frac{2n^{3/2}}{p}\tau + \sqrt{p}\left(\alpha + \frac{n}{p}\beta\right)}

Question C

The efficiency is

$$
\begin{aligned}
E &= \frac{S}{p} \\
&= \frac{2n^{3/2}\tau}{\frac{2n^{3/2}}{p}\tau + \sqrt{p}\left(\alpha + \frac{n}{p}\beta\right)} \times \frac{1}{p} \\
&= \frac{2n^{3/2}\tau}{2n^{3/2}\tau + p^{3/2}\left(\alpha + \frac{n}{p}\beta\right)} \\
&= \frac{1}{1 + \frac{\alpha}{2\tau}\left(\frac{p}{n}\right)^{3/2} + \frac{\beta}{2\tau}\sqrt{\frac{p}{n}}}
\end{aligned}
$$

Question D

To find the isoefficiency, we can ignore the constants. Looking at the alpha term, $\left(\frac{p}{n}\right)^{3/2}$ stays constant only if $n$ grows linearly with $p$. Looking at the beta term, $\sqrt{\frac{p}{n}}$ likewise stays constant only if $n$ grows linearly with $p$. Both terms impose the same requirement, so the isoefficiency function is $n = \Omega(p)$.
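As a quick check, setting each overhead term of the efficiency expression to a constant and solving for $n$ gives

$$\frac{\alpha}{2\tau}\left(\frac{p}{n}\right)^{3/2} = \Theta(1) \implies n = \Theta(p), \qquad \frac{\beta}{2\tau}\sqrt{\frac{p}{n}} = \Theta(1) \implies n = \Theta(p)$$

so holding $E$ constant indeed requires $n = \Omega(p)$.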

Polynomial evaluation with Horner’s Rule

$$P(x) = a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1})))$$

$$
\begin{aligned}
P(x) &= x^0(a_0 + x^2(a_2 + x^2(a_4 + \cdots + x^2(a_{n-2})))) \\
&\quad + x^1(a_1 + x^2(a_3 + x^2(a_5 + \cdots + x^2(a_{n-1}))))
\end{aligned}
$$

$$
\begin{aligned}
P(x) &= x^0(a_0 + x^p(a_{0+p} + x^p(a_{0+2p} + \cdots + x^p(a_{n-p+0})))) \\
&\quad + x^1(a_1 + x^p(a_{1+p} + x^p(a_{1+2p} + \cdots + x^p(a_{n-p+1})))) \\
&\quad + x^2(a_2 + x^p(a_{2+p} + x^p(a_{2+2p} + \cdots + x^p(a_{n-p+2})))) \\
&\quad + x^3(a_3 + x^p(a_{3+p} + x^p(a_{3+2p} + \cdots + x^p(a_{n-p+3})))) \\
&\quad \vdots \\
&\quad + x^{p-1}(a_{p-1} + x^p(a_{p-1+p} + x^p(a_{p-1+2p} + \cdots + x^p(a_{n-1}))))
\end{aligned}
$$

  1. All $P$ processors run an exclusive prefix sum over the input value $x$ with multiply as the operator. After this step, each processor holds $x^p$, where $p$ is the rank of that processor.
  2. The last processor, with rank $P-1$, multiplies its local value $x^{P-1}$ by $x$ again to obtain $x^P$.
  3. Broadcast $x^P$ to all processors.
  4. Compute the local evaluation serially using the received value $x^P$ and the local $x^p$, and store it in a local partial evaluation.
  5. All-reduce the local partial sums with addition as the operator.
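A sketch of those five steps in MPI-flavored C, assuming a cyclic coefficient distribution (rank $r$ owns $a_r, a_{r+P}, a_{r+2P}, \ldots$) and that $n$ is a multiple of $P$; `parallel_horner` and `local_a` are illustrative names, not part of the original solution:

```c
/* Sketch of the parallel Horner evaluation above. Assumes a cyclic layout:
 * rank r owns local_a[j] = a_{r + j*P}, and n is a multiple of P. */
#include <mpi.h>
#include <stdio.h>

double parallel_horner(const double *local_a, int n_local, double x, MPI_Comm comm) {
  int rank, P;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &P);

  /* Step 1: exclusive prefix product of x, so rank r ends up holding x^r. */
  double x_r = 1.0;
  MPI_Exscan(&x, &x_r, 1, MPI_DOUBLE, MPI_PROD, comm);
  if (rank == 0) x_r = 1.0;              /* Exscan leaves rank 0 undefined */

  /* Steps 2-3: the last rank forms x^P and broadcasts it to everyone. */
  double x_P = x_r * x;                  /* only meaningful on rank P-1 */
  MPI_Bcast(&x_P, 1, MPI_DOUBLE, P - 1, comm);

  /* Step 4: serial Horner evaluation in powers of x^P, scaled by x^rank. */
  double partial = 0.0;
  for (int j = n_local - 1; j >= 0; j--)
    partial = partial * x_P + local_a[j];
  partial *= x_r;

  /* Step 5: all-reduce the partial evaluations with addition. */
  double result = 0.0;
  MPI_Allreduce(&partial, &result, 1, MPI_DOUBLE, MPI_SUM, comm);
  return result;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Toy run: every coefficient is 1, so P(x) is a geometric series. */
  double local_a[4] = {1.0, 1.0, 1.0, 1.0};
  double val = parallel_horner(local_a, 4, 0.5, MPI_COMM_WORLD);
  if (rank == 0) printf("P(0.5) = %f\n", val);

  MPI_Finalize();
  return 0;
}
```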

Analysis:

The overall asymptotic runtime is $\Theta\left(\frac{N}{P} + (\alpha + \beta)\log{P}\right)$, where $\alpha$ is the latency cost and $\beta$ is the bandwidth cost of inter-rank communication.

Cartesian Reduce

Question A

Do $d$ rounds of allreduce over the $p$ processors.

Question B

Tree-based:

$$O\left(\log{p}\left(\alpha + n\beta\right)\right) \times d$$

Bandwidth-optimal:

$$O\left(p\alpha + n\beta\right) \times d$$

Question C

You can now do each allreduce in parallel, leading to a factor of $d$ improvement in the latency term.
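One way to realize "each allreduce in parallel" with MPI, assuming the $d$ reductions are independent (one per dimension of the Cartesian topology) and operate on separate buffers, is to issue non-blocking allreduces on per-dimension sub-communicators and wait on them together. This is only a sketch of that idea, not the graded solution; the buffer layout is an assumption:

```c
/* Hedged sketch: d concurrent allreduces, one per dimension of a Cartesian
 * communicator, using non-blocking collectives (MPI-3). The buffer layout
 * (d slices of n doubles each) is an assumption, not from the problem. */
#include <mpi.h>
#include <stdlib.h>

void cartesian_allreduce(double *buf, int n, int d, MPI_Comm comm) {
  int p;
  MPI_Comm_size(comm, &p);

  /* Build a d-dimensional periodic Cartesian topology over all ranks. */
  int *dims = calloc(d, sizeof(int));
  int *periods = malloc(d * sizeof(int));
  for (int i = 0; i < d; i++) periods[i] = 1;
  MPI_Dims_create(p, d, dims);
  MPI_Comm cart;
  MPI_Cart_create(comm, d, dims, periods, 0, &cart);

  MPI_Request *reqs = malloc(d * sizeof(MPI_Request));
  MPI_Comm *sub = malloc(d * sizeof(MPI_Comm));
  for (int i = 0; i < d; i++) {
    int *remain = calloc(d, sizeof(int));
    remain[i] = 1;                       /* keep only dimension i */
    MPI_Cart_sub(cart, remain, &sub[i]);
    free(remain);
    /* One independent allreduce per dimension, launched without blocking. */
    MPI_Iallreduce(MPI_IN_PLACE, buf + i * n, n, MPI_DOUBLE,
                   MPI_SUM, sub[i], &reqs[i]);
  }
  MPI_Waitall(d, reqs, MPI_STATUSES_IGNORE);   /* latency terms overlap */

  for (int i = 0; i < d; i++) MPI_Comm_free(&sub[i]);
  MPI_Comm_free(&cart);
  free(dims); free(periods); free(reqs); free(sub);
}
```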

MPI Barrier

Question A

// Every rank posts a nonblocking send and receive to every other rank
// (including itself) and waits for all of them to complete.
int MPI_Barrier(MPI_Comm comm) {
  int P, magic = 454;
  MPI_Comm_size(comm, &P);
  int magic_recv[P];
  MPI_Request requests[2 * P];
  for (int i = 0; i < P; i++) {
    MPI_Isend(&magic, 1, MPI_INT, i, magic, comm, &requests[i]);
    MPI_Irecv(&magic_recv[i], 1, MPI_INT, i, magic, comm, &requests[i + P]);
  }
  MPI_Waitall(2 * P, requests, MPI_STATUSES_IGNORE);
  return MPI_SUCCESS;
}

Every process needs to send to and receive from every other process and wait until those communications complete before continuing. The sends and receives are issued asynchronously so that their latencies overlap.

Question B

Since it doesn’t matter what data we are sending, we are only bound by latency.

$$T_{msg} = O(\alpha)$$

Question C

Because all of the sends and recvs happen asynchronously, they overlap and we only have to wait for the cost of one message. Since we are sending a fixed (potentially small) amount of data and we are looking asymptotically, we can ignore the bandwidth cost.

Dense Matrix Algorithm

Question A

  1. Each processor computes partial local dot products from its local $A$ and $x$ values and stores them in a local vector $y'$. Computation: $O\left(\frac{n^2}{p}\right)$.
  2. Using an all-to-all reduction, the final distributed $y$ vector is computed from the local $y'$ vectors. Computation: $O(n)$. Communication: $O\left(\tau\log{p} + \mu\frac{n}{p}P\right) = O(\tau\log{p} + \mu n)$.

Total runtime $= O\left(\frac{n^2}{p} + \tau\log{p} + \mu n\right)$
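A minimal sketch of those two steps, assuming $A$ is distributed by block columns (each rank holds the $n \times \frac{n}{p}$ column panel matching its $\frac{n}{p}$ entries of $x$) and that a reduce-scatter plays the role of the all-to-all reduction; names and layout are assumptions:

```c
/* Sketch: y = A*x with A split into column panels. Each rank forms a
 * full-length partial y' from its panel, then a reduce-scatter sums the
 * partials and leaves each rank with its n/p entries of y.
 * Assumes n = p * n_local and row-major storage of the panel. */
#include <mpi.h>
#include <stdlib.h>

void matvec_colpanel(int n, int n_local, const double *A_panel, /* n x n_local */
                     const double *x_local, double *y_local, MPI_Comm comm) {
  /* Step 1: local partial products, y'[i] = sum_j A[i][j_local] * x_local[j]. */
  double *y_partial = calloc(n, sizeof(double));
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n_local; j++)
      y_partial[i] += A_panel[i * n_local + j] * x_local[j];

  /* Step 2: sum the partial vectors across ranks and leave each rank
   * with its own n/p entries of the final y. */
  MPI_Reduce_scatter_block(y_partial, y_local, n_local, MPI_DOUBLE, MPI_SUM, comm);
  free(y_partial);
}
```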

Question B

In an $n \times n$ torus, the maximum distance $D_{(i,j)\rightarrow(j,i)}$ an element $A[i,j]$ has to travel to its destination $(j,i)$ is $D_{(i,j)\rightarrow(j,i)} = 2 \times \min\left(\left(j-i\right)\bmod{n}, \left(i - j\right)\bmod{n}\right)$ because of the wraparound connections. The factor of 2 appears because the distance must be covered twice, once along rows and once along columns. This means the number of steps to transpose any element is bounded by $2 \times \left\lfloor\frac{n}{2}\right\rfloor = O(n)$.

Note that $(i - j) \bmod{n} + (j - i) \bmod{n} \le n$, so the algorithm terminates in $n$ steps. Each processor receives at most 4 elements (one from each direction) before passing elements along, so each processor holds only a constant number of matrix elements.

BFS Cache Analysis

Question A

According to the planar separator theorem, the size of a frontier can’t be larger than O(n) O(\sqrt{n}) . So we need enough cachelines to hold a frontier.

$$\sqrt{n} = O\left(\frac{Z}{L}\right)$$

$$n = O\left(\left(\frac{Z}{L}\right)^2\right)$$

Question B

If the graph were 3-D, then the separator would scale like $O\left(n^{2/3}\right)$; setting $n^{2/3} = O\left(\frac{Z}{L}\right)$ gives $n = O\left(\left(\frac{Z}{L}\right)^{3/2}\right)$.

Distributed 3D FFTs

Question A

Simply arithmetic work divided by machine peak:

$$T_{comp} = O\left(\frac{n^3\log{n}}{R_{max}}\right)$$

Question B

If the peak is $R_{max}$, then $P = \frac{R_{max}}{R_0}$. Thus,

$$
\begin{aligned}
T_{mem} &= O\left( \frac{n^3\log{n}}{P\beta_{mem}\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{ \frac{R_{max}}{R_0} \beta_{mem}\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{ R_{max} \frac{1}{R_0} \beta_{mem}\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{ R_{max} \frac{\beta_{mem}}{R_0} \log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{ R_{max} \frac{1}{B_0}\log{Z}} \right) \\
&= O\left( \frac{B_0}{\log{Z}} \cdot \frac{n^3\log{n}}{R_{max}} \right) \\
&= O\left( \frac{B_0}{\log{Z}} \cdot T_{comp} \right)
\end{aligned}
$$

where $B_0 = \frac{R_0}{\beta_{mem}}$.

Question C

This is operations divided by network bandwidth, which basically tells you how many operations you can compute in the time it takes to move data over the network.

Question D

$$B_{net} = \frac{R_0^{2/3}}{\beta_{link}}$$

$$
\begin{aligned}
T_{net} &= O\left( \frac{n^3}{P^{2/3}\beta_{link}} \right) \\
&= O\left( \frac{n^3}{\left(\frac{R_{max}}{R_0}\right)^{2/3}\beta_{link}} \right) \\
&= O\left( \frac{n^3}{R_{max}^{2/3} \cdot \frac{1}{R_0^{2/3}} \cdot \beta_{link}} \right) \\
&= O\left( \frac{n^3}{R_{max}^{2/3} \cdot \frac{1}{B_{net}}} \right) \\
&= O\left( B_{net} \cdot \frac{n^3}{R_{max}^{2/3}} \right)
\end{aligned}
$$

Question E

Since communication dominates and both $T_{mem} \propto B_0$ and $T_{net} \propto B_{net}$, the systems with the lower execution time will have smaller $B_0$ and $B_{net}$ values. We should actually use the mobile phone systems.

Six Degrees of HPC

Question A

Red vs Blue

Communication Madness

Question A

We need to split the problem up into on-chiplet communication and ring communication between chiplets. Since we have a lot of nodes, a broadcast/scatter won't work. We don't need to do ring communication for every node, only once per chiplet.

point-to-point exchange within each chiplet    // 6 messages among the 4 local ranks
for i = 1 to P/4                               // rotate around the ring of chiplets
  shift data from chiplet to chiplet (ranks 0 -> 4, 4 -> 8, ...)
point-to-point exchange within each chiplet

Question B

$$O\left( \alpha\frac{P}{4} + \beta\frac{P}{4}n + 6\left( \alpha_{chip} + \beta_{chip}\right) \right)$$

Projection

Question A

$$4nk - n - k$$

Question B

Using the SUMMA algorithm, we divide the $n \times n$ grid among $p$ processors arranged in a $\sqrt{p} \times \sqrt{p}$ grid, so each processor is responsible for an $\frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}}$ block. During iteration $i$, the processors in the $i$th column broadcast their blocks of $A$ to every processor in their row, and the processors in the $i$th row broadcast their blocks of $B$ to every processor in their column. Then local computation updates each processor's block of $C$.
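A compact sketch of that SUMMA loop, under assumptions not stated in the answer: $p$ is a perfect square, the block size $b = \frac{n}{\sqrt{p}}$ divides evenly, and blocks are stored row-major in local arrays; `summa`, `Abuf`, and `Bbuf` are illustrative names:

```c
/* Hedged SUMMA sketch: q x q process grid, each rank owns one b x b block
 * of A, B, and C. At step k, the owners of block-column k of A and
 * block-row k of B broadcast, and every rank does a local update. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

void summa(int b, const double *A, const double *B, double *C, MPI_Comm comm) {
  int rank, p;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &p);
  int q = (int)lround(sqrt((double)p));        /* process grid is q x q */
  int row = rank / q, col = rank % q;

  MPI_Comm row_comm, col_comm;
  MPI_Comm_split(comm, row, col, &row_comm);   /* same block-row of the grid */
  MPI_Comm_split(comm, col, row, &col_comm);   /* same block-column */

  double *Abuf = malloc((size_t)b * b * sizeof(double));
  double *Bbuf = malloc((size_t)b * b * sizeof(double));
  memset(C, 0, (size_t)b * b * sizeof(double));

  for (int k = 0; k < q; k++) {
    /* Owner of the k-th block-column of A broadcasts along its row. */
    if (col == k) memcpy(Abuf, A, (size_t)b * b * sizeof(double));
    MPI_Bcast(Abuf, b * b, MPI_DOUBLE, k, row_comm);
    /* Owner of the k-th block-row of B broadcasts down its column. */
    if (row == k) memcpy(Bbuf, B, (size_t)b * b * sizeof(double));
    MPI_Bcast(Bbuf, b * b, MPI_DOUBLE, k, col_comm);
    /* Local update of this rank's block of C. */
    for (int i = 0; i < b; i++)
      for (int l = 0; l < b; l++)
        for (int j = 0; j < b; j++)
          C[i * b + j] += Abuf[i * b + l] * Bbuf[l * b + j];
  }
  free(Abuf); free(Bbuf);
  MPI_Comm_free(&row_comm);
  MPI_Comm_free(&col_comm);
}
```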

Question C

$$\frac{T(n,1)}{pT(n,p)} = \Theta(1) \implies \frac{\Theta(nk)}{\Theta(nk + kp\log{p})} = \Theta(1)$$

$$\frac{\Theta(nk)}{\Theta(nk + kp\log{p})} = \Theta(1) \implies \Theta(p\log{p}) \in \Theta(n)$$

$$\Theta(p\log{p}) \in \Theta(n) \implies p \in \Theta\left( \frac{n}{\log{n}} \right)$$

Sort and Solve

Question A

Just follow example

Question B

WLOG, assume $m \ge n$. Reversing the second sorted sequence results in a bitonic sequence of length $m + n$. To reverse the second sorted sequence, each processor needs to communicate with at most 2 processors, and the communication time will be $O\left( \tau + \mu(m+n) \right)$.

Question C

Bitonic merging of a sequence of length $m + n$ using $p$ processors takes $O\left((\tau + \mu(m+n)/p)\log{p}\right)$ communication time (span) and $O\left((n + m)/\log{p}\right)$ computation time (work).
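To make the reverse-then-merge idea concrete, here is a small serial illustration (not the distributed, per-processor version): two sorted runs are concatenated with the second one reversed, which makes the whole array bitonic, and a standard bitonic merge then sorts it. It assumes the combined length is a power of two:

```c
/* Serial illustration of merging two sorted runs via a bitonic merge. */
#include <stdio.h>

static void swap(double *a, double *b) { double t = *a; *a = *b; *b = t; }

/* Standard bitonic merge: v[lo .. lo+len-1] must be bitonic, len a power of two. */
static void bitonic_merge(double *v, int lo, int len) {
  if (len <= 1) return;
  int half = len / 2;
  for (int i = lo; i < lo + half; i++)
    if (v[i] > v[i + half]) swap(&v[i], &v[i + half]);  /* compare-exchange */
  bitonic_merge(v, lo, half);
  bitonic_merge(v, lo + half, half);
}

int main(void) {
  /* First run {1,4,6,9} ascending, second run {2,3,7,8} stored reversed,
   * so the whole array rises then falls: a bitonic sequence. */
  double v[8] = {1, 4, 6, 9, 8, 7, 3, 2};
  bitonic_merge(v, 0, 8);
  for (int i = 0; i < 8; i++) printf("%g ", v[i]);   /* 1 2 3 4 6 7 8 9 */
  printf("\n");
  return 0;
}
```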

Parallel Distributed Scan

Question A

  1. Each node does a linear scan of its data.
  2. Each node performs the recursive scan, accumulating partial sums until we hit the base case. One processor on each node now has the accumulated sum, $\sigma_i$, of all of its local elements.
  3. Perform a scan across all the nodes. At the end of this step, each node has the +-scan, $S_i$, up to and including itself. $S_i - \sigma_i$ is the prefix sum of the values on the nodes prior to node $i$.
  4. Continue recursing on each node, applying the second half of the local scan.
  5. In parallel, each processor applies the node-wise prefix sum to its partition of the local data.
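A flattened sketch of this scheme, treating each MPI rank as one node and collapsing the intra-node recursion of steps 2 and 4 into a serial pass; the data layout is an assumption:

```c
/* Flattened sketch of the distributed +-scan: each rank plays the role of a
 * node, and the intra-node recursive scan is collapsed into a serial pass. */
#include <mpi.h>

void distributed_prefix_sum(double *local, int n_local, MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);

  /* Steps 1-2: serial inclusive scan of the local data; the last element
   * is sigma_i, the sum of everything this rank owns. */
  for (int i = 1; i < n_local; i++)
    local[i] += local[i - 1];
  double sigma = (n_local > 0) ? local[n_local - 1] : 0.0;

  /* Step 3: exclusive scan of the per-node totals across ranks;
   * offset = S_i - sigma_i, the sum of everything on lower-ranked nodes. */
  double offset = 0.0;
  MPI_Exscan(&sigma, &offset, 1, MPI_DOUBLE, MPI_SUM, comm);
  if (rank == 0) offset = 0.0;   /* Exscan leaves rank 0 undefined */

  /* Steps 4-5: apply the node-wise prefix to every local element. */
  for (int i = 0; i < n_local; i++)
    local[i] += offset;
}
```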

Question B

On each node, the addScan takes time $T_{P_l} = O\left( \frac{n}{P_d P_l} + \log^2{P_l} \right)$. The node-wise scan takes time $O\left( \log^2{P_d} \right)$, so the overall computation time is $T_{\text{comp}}\left( n, P_d, P_l \right) = \tau \cdot O\left( \frac{n}{P_d P_l} + \log^2{P_d} + \log^2{P_l} \right)$.

Communication only happens during the node-wise scan in the middle. Each round, half of the nodes send a message to, and later receive a message from, the other half of the nodes. The maximum number of messages any node sends (only two nodes make it to the final round) is therefore $2\log{P_d}$. Similarly, since each message is a single number, the number of words sent is also $2\log{P_d}$. The overall communication time is then $T_{\text{comm}}(n, P_d, P_l) = 2\alpha\log{P_d} + 2\beta\log{P_d}$.

Question C

Look at the paper

Question D

Since the linear network can be logically embedded in a 2d grid, the algorithm (and its timings) does not change.

BeastMode Returns

Question A

You don't need to sort and communicate everything, just the keywords and their counts.

  1. Across P nodes, collect keyword counts locally
  2. Do sample sort to distribute keywords and their counts across all nodes.
  3. Local scan to find the most common values.

Alternatively, tree-like accumulation of keywords/counts

You can’t just reduce all keywords onto the root as there is too much data

Question B

$$\alpha P + \beta(P \times k)$$

Analyzing a Distributed Memory Problem

Question A

1D:

$$\frac{N}{P} \times N$$

2D:

$$\frac{N}{\sqrt{P}} \times \frac{N}{\sqrt{P}}$$

Question B

We can ignore the boundary elements, since they don't need to communicate with neighbors on every side and won't dominate the time. For the 1D partitioning, we send $N$-element blocks to each neighbor (one on either side), so the communication time will be $T_{\text{comm}} = 2(\alpha + \beta N)$. For the 2D partitioning, $\frac{N}{\sqrt{P}}$ elements need to be communicated to each of the 4 neighboring blocks, so the total communication time will be $T_{\text{comm}} = 4\left( \alpha + \beta\frac{N}{\sqrt{P}} \right)$.

The computation time will be the same for both partitionings: assuming constant time for the divisions and multiplications, each process updates its $\frac{N^2}{P}$ elements.
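A sketch of the 1D exchange described above, assuming each rank stores its $\frac{N}{P}$ owned rows between two ghost rows in one contiguous buffer; the names and layout are illustrative:

```c
/* Hedged sketch of the 1D halo exchange: each rank owns n_local_rows rows of
 * the N x N grid plus one ghost row above and one below, and trades its
 * boundary rows with its up/down neighbors each iteration. */
#include <mpi.h>

void exchange_halo_1d(double *rows, int n_local_rows, int N, MPI_Comm comm) {
  int rank, P;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &P);
  int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
  int down = (rank < P - 1) ? rank + 1 : MPI_PROC_NULL;

  /* rows[0]             : ghost row received from 'up'
   * rows[1 .. n_local]  : owned rows
   * rows[n_local + 1]   : ghost row received from 'down'                  */
  double *first_owned = rows + N;
  double *last_owned  = rows + (size_t)n_local_rows * N;
  double *ghost_up    = rows;
  double *ghost_down  = rows + (size_t)(n_local_rows + 1) * N;

  /* Two N-element messages per rank per iteration: T_comm = 2(alpha + beta*N). */
  MPI_Sendrecv(first_owned, N, MPI_DOUBLE, up,   0,
               ghost_down,  N, MPI_DOUBLE, down, 0, comm, MPI_STATUS_IGNORE);
  MPI_Sendrecv(last_owned,  N, MPI_DOUBLE, down, 1,
               ghost_up,    N, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);
}
```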

Question C

$$\frac{S_{1D}}{Pn} = \Omega(p)$$