Final Exam Study Guide

Stephen M. Reaves


2024-12-07

Study guide for the final exam for CS-6220

Summary

Homework 5

Homework 5 with solutions

Definitions

Work Optimal
Does the parallel algorithm do the same amount of work as the sequential algorithm?

$$\text{optimal} = \frac{W_*}{W_p}$$

Speedup
S = sequential time over parallel time

$$S = \frac{T_1}{T_p}$$

Parallel Efficiency
Speedup over number of processors

$$E = \frac{S}{p}$$
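As a quick worked example with hypothetical numbers: if the sequential run takes $T_1 = 100$ seconds and the same problem runs in $T_p = 10$ seconds on $p = 16$ processors, then

$$S = \frac{T_1}{T_p} = \frac{100}{10} = 10, \qquad E = \frac{S}{p} = \frac{10}{16} \approx 0.63$$

so the parallel run achieves a speedup of 10 at roughly 63% efficiency.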

Isoefficiency
How fast n must grow as a function of p to keep efficiency constant.

Parallel Efficiency

Question A

If the work done by the parallel algorithm is the same as (or better than) the work of the best sequential algorithm, then we say it is work optimal. In this case, each of the $p$ processors makes $\sqrt{p}$ calls to FOO on $\frac{n}{p}$ data, which works out to be the same amount of work as the sequential algorithm.

$$
\begin{aligned}
W_p(n) &= \text{processor count} \times \text{loop count} \times \text{work per loop} \\
&= p \times \sqrt{p} \times W_{*}\left(\frac{n}{p}\right) \\
&= p \times \sqrt{p} \times \frac{2n^{3/2}}{p^{3/2}} \\
&= 2n^{3/2} \\
&= W_*(n)
\end{aligned}
$$

Question B

The speedup is

S=T1Tp S = \frac{T_1}{T_p}

S=2n3/2τ2n3/2pτ+p(α+npβ) \phantom{S} = \frac{2n^{3/2}\tau}{\frac{2n^{3/2}}{p}\tau + \sqrt{p}\left(\alpha + \frac{n}{p}\beta\right)}

Question C

The efficiency is

$$
\begin{aligned}
E &= \frac{S}{p} \\
&= \frac{2n^{3/2}\tau}{\frac{2n^{3/2}}{p}\tau + \sqrt{p}\left(\alpha + \frac{n}{p}\beta\right)} \times \frac{1}{p} \\
&= \frac{2n^{3/2}\tau}{2n^{3/2}\tau + p^{3/2}\left(\alpha + \frac{n}{p}\beta\right)} \\
&= \frac{1}{1 + \frac{\alpha}{2\tau}\left(\frac{p}{n}\right)^{3/2} + \frac{\beta}{2\tau}\sqrt{\frac{p}{n}}}
\end{aligned}
$$

Question D

To find the isoefficiency, we can ignore the constants. Looking at the alpha term, $\left(\frac{p}{n}\right)^{3/2}$ stays constant only if $n$ grows linearly with $p$. Looking at the beta term, $\sqrt{\frac{p}{n}}$ likewise stays constant only if $n$ grows linearly with $p$. Both terms impose the same requirement, so the isoefficiency function is $n = \Omega(p)$.
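As a quick check, setting each overhead term of the efficiency expression to a constant and solving for $n$ gives

$$\frac{\alpha}{2\tau}\left(\frac{p}{n}\right)^{3/2} = \Theta(1) \implies n = \Theta(p), \qquad \frac{\beta}{2\tau}\sqrt{\frac{p}{n}} = \Theta(1) \implies n = \Theta(p)$$

so holding $E$ constant indeed requires $n = \Omega(p)$.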

Polynomial evaluation with Horner’s Rule

$$P(x) = a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1})))$$

$$
\begin{aligned}
P(x) &= x^0(a_0 + x^2(a_2 + x^2(a_4 + \cdots + x^2(a_{n-2})))) \\
&\quad + x^1(a_1 + x^2(a_3 + x^2(a_5 + \cdots + x^2(a_{n-1}))))
\end{aligned}
$$

$$
\begin{aligned}
P(x) &= x^0(a_0 + x^p(a_{0+p} + x^p(a_{0+2p} + \cdots + x^p(a_{n-p+0})))) \\
&\quad + x^1(a_1 + x^p(a_{1+p} + x^p(a_{1+2p} + \cdots + x^p(a_{n-p+1})))) \\
&\quad + x^2(a_2 + x^p(a_{2+p} + x^p(a_{2+2p} + \cdots + x^p(a_{n-p+2})))) \\
&\quad + x^3(a_3 + x^p(a_{3+p} + x^p(a_{3+2p} + \cdots + x^p(a_{n-p+3})))) \\
&\quad \vdots \\
&\quad + x^{p-1}(a_{p-1} + x^p(a_{p-1+p} + x^p(a_{p-1+2p} + \cdots + x^p(a_{n-1}))))
\end{aligned}
$$

  1. All $P$ processors run an exclusive prefix sum over the input value $x$ with multiply as the operator. After this step, each processor holds $x^p$, where $p$ is the rank of that processor.
  2. The last processor, with rank $P-1$, multiplies its local value $x^{P-1}$ by $x$ again to obtain $x^P$.
  3. Broadcast $x^P$ to all processors.
  4. Compute the local evaluation serially using the received value $x^P$ and the local $x^p$, and store it in a local partial evaluation.
  5. All-reduce the local partial sums with addition as the operator.
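A sketch of those five steps in MPI-flavored C, assuming a cyclic coefficient distribution (rank $r$ owns $a_r, a_{r+P}, a_{r+2P}, \ldots$) and that $n$ is a multiple of $P$; `parallel_horner` and `local_a` are illustrative names, not part of the original solution:

```c
/* Sketch of the parallel Horner evaluation above. Assumes a cyclic layout:
 * rank r owns local_a[j] = a_{r + j*P}, and n is a multiple of P. */
#include <mpi.h>
#include <stdio.h>

double parallel_horner(const double *local_a, int n_local, double x, MPI_Comm comm) {
  int rank, P;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &P);

  /* Step 1: exclusive prefix product of x, so rank r ends up holding x^r. */
  double x_r = 1.0;
  MPI_Exscan(&x, &x_r, 1, MPI_DOUBLE, MPI_PROD, comm);
  if (rank == 0) x_r = 1.0;              /* Exscan leaves rank 0 undefined */

  /* Steps 2-3: the last rank forms x^P and broadcasts it to everyone. */
  double x_P = x_r * x;                  /* only meaningful on rank P-1 */
  MPI_Bcast(&x_P, 1, MPI_DOUBLE, P - 1, comm);

  /* Step 4: serial Horner evaluation in powers of x^P, scaled by x^rank. */
  double partial = 0.0;
  for (int j = n_local - 1; j >= 0; j--)
    partial = partial * x_P + local_a[j];
  partial *= x_r;

  /* Step 5: all-reduce the partial evaluations with addition. */
  double result = 0.0;
  MPI_Allreduce(&partial, &result, 1, MPI_DOUBLE, MPI_SUM, comm);
  return result;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Toy run: every coefficient is 1, so P(x) is a geometric series. */
  double local_a[4] = {1.0, 1.0, 1.0, 1.0};
  double val = parallel_horner(local_a, 4, 0.5, MPI_COMM_WORLD);
  if (rank == 0) printf("P(0.5) = %f\n", val);

  MPI_Finalize();
  return 0;
}
```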

Analysis:

The overall asymptotic runtime is $\Theta\left(\frac{N}{P} + (\alpha + \beta)\log{P}\right)$, where $\alpha$ is the latency cost and $\beta$ is the bandwidth cost of inter-rank communication.

Cartesian Reduce

Question A

Do $d$ rounds of allreduce over the $p$ processors.

Question B

Tree-based:

$$O\left(\log{p}\left(\alpha + n\beta\right)\right) \times d$$

Bandwidth-optimal:

$$O\left(p\alpha + n\beta\right) \times d$$

Question C

You can now do each allreduce in parallel, leading to a factor of $d$ improvement in the latency term.
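One way to realize "each allreduce in parallel" with MPI, assuming the $d$ reductions are independent (one per dimension of the Cartesian topology) and operate on separate buffers, is to issue non-blocking allreduces on per-dimension sub-communicators and wait on them together. This is only a sketch of that idea, not the graded solution; the buffer layout is an assumption:

```c
/* Hedged sketch: d concurrent allreduces, one per dimension of a Cartesian
 * communicator, using non-blocking collectives (MPI-3). The buffer layout
 * (d slices of n doubles each) is an assumption, not from the problem. */
#include <mpi.h>
#include <stdlib.h>

void cartesian_allreduce(double *buf, int n, int d, MPI_Comm comm) {
  int p;
  MPI_Comm_size(comm, &p);

  /* Build a d-dimensional periodic Cartesian topology over all ranks. */
  int *dims = calloc(d, sizeof(int));
  int *periods = malloc(d * sizeof(int));
  for (int i = 0; i < d; i++) periods[i] = 1;
  MPI_Dims_create(p, d, dims);
  MPI_Comm cart;
  MPI_Cart_create(comm, d, dims, periods, 0, &cart);

  MPI_Request *reqs = malloc(d * sizeof(MPI_Request));
  MPI_Comm *sub = malloc(d * sizeof(MPI_Comm));
  for (int i = 0; i < d; i++) {
    int *remain = calloc(d, sizeof(int));
    remain[i] = 1;                       /* keep only dimension i */
    MPI_Cart_sub(cart, remain, &sub[i]);
    free(remain);
    /* One independent allreduce per dimension, launched without blocking. */
    MPI_Iallreduce(MPI_IN_PLACE, buf + i * n, n, MPI_DOUBLE,
                   MPI_SUM, sub[i], &reqs[i]);
  }
  MPI_Waitall(d, reqs, MPI_STATUSES_IGNORE);   /* latency terms overlap */

  for (int i = 0; i < d; i++) MPI_Comm_free(&sub[i]);
  MPI_Comm_free(&cart);
  free(dims); free(periods); free(reqs); free(sub);
}
```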

MPI Barrier

Question A

// Every rank posts a nonblocking send and receive to every other rank
// (including itself) and waits for all of them to complete.
int MPI_Barrier(MPI_Comm comm) {
  int P, magic = 454;
  MPI_Comm_size(comm, &P);
  int magic_recv[P];
  MPI_Request requests[2 * P];
  for (int i = 0; i < P; i++) {
    MPI_Isend(&magic, 1, MPI_INT, i, magic, comm, &requests[i]);
    MPI_Irecv(&magic_recv[i], 1, MPI_INT, i, magic, comm, &requests[i + P]);
  }
  MPI_Waitall(2 * P, requests, MPI_STATUSES_IGNORE);
  return MPI_SUCCESS;
}

Every process needs to send to and receive from every other process and wait until those communications complete before continuing. The sends and receives are issued asynchronously so that their latencies overlap.

Question B

Since it doesn’t matter what data we are sending, we are only bound by latency.

$$T_{msg} = O(\alpha)$$

Question C

Because all of the sends and recvs happen asynchronously, they overlap and we only have to wait for the cost of one message. Since we are sending a fixed (potentially small) amount of data and we are looking asymptotically, we can ignore the bandwidth cost.

Dense Matrix Algorithm

Question A

  1. Each processor computes partial local dot products from its local $A$ and $x$ values and stores them in a local vector $y'$. Computation: $O\left(\frac{n^2}{p}\right)$.
  2. Using an all-to-all reduction, the final distributed $y$ vector is computed from the local $y'$ vectors. Computation: $O(n)$. Communication: $O\left(\tau\log{p} + \mu\frac{n}{p}P\right) = O(\tau\log{p} + \mu n)$.

Total runtime $= O\left(\frac{n^2}{p} + \tau\log{p} + \mu n\right)$
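A minimal sketch of those two steps, assuming $A$ is distributed by block columns (each rank holds the $n \times \frac{n}{p}$ column panel matching its $\frac{n}{p}$ entries of $x$) and that a reduce-scatter plays the role of the all-to-all reduction; names and layout are assumptions:

```c
/* Sketch: y = A*x with A split into column panels. Each rank forms a
 * full-length partial y' from its panel, then a reduce-scatter sums the
 * partials and leaves each rank with its n/p entries of y.
 * Assumes n = p * n_local and row-major storage of the panel. */
#include <mpi.h>
#include <stdlib.h>

void matvec_colpanel(int n, int n_local, const double *A_panel, /* n x n_local */
                     const double *x_local, double *y_local, MPI_Comm comm) {
  /* Step 1: local partial products, y'[i] = sum_j A[i][j_local] * x_local[j]. */
  double *y_partial = calloc(n, sizeof(double));
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n_local; j++)
      y_partial[i] += A_panel[i * n_local + j] * x_local[j];

  /* Step 2: sum the partial vectors across ranks and leave each rank
   * with its own n/p entries of the final y. */
  MPI_Reduce_scatter_block(y_partial, y_local, n_local, MPI_DOUBLE, MPI_SUM, comm);
  free(y_partial);
}
```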

Question B

In an $n \times n$ torus, the maximum distance $D_{(i,j)\rightarrow(j,i)}$ an element $A[i,j]$ has to travel to its destination $(j,i)$ is $D_{(i,j)\rightarrow(j,i)} = 2 \times \min\left(\left(j-i\right)\bmod{n}, \left(i - j\right)\bmod{n}\right)$ because of the wraparound connections. The factor of 2 appears because the distance must be covered twice, once along rows and once along columns. This means the number of steps to transpose any element is bounded by $2 \times \left\lfloor\frac{n}{2}\right\rfloor = O(n)$.

Note that $(i - j) \bmod{n} + (j - i) \bmod{n} \le n$, so the algorithm terminates in $n$ steps. Each processor receives at most 4 elements (one from each direction) before passing elements along, so each processor holds only a constant number of matrix elements.

BFS Cache Analysis

Question A

According to the planar separator theorem, the size of a frontier can’t be larger than O(n) O(\sqrt{n}) . So we need enough cachelines to hold a frontier.

$$\sqrt{n} = O\left(\frac{Z}{L}\right)$$

$$n = O\left(\left(\frac{Z}{L}\right)^2\right)$$

Question B

If the graph were 3-D, then the separator would scale like $O\left(n^{2/3}\right)$; setting $n^{2/3} = O\left(\frac{Z}{L}\right)$ gives $n = O\left(\left(\frac{Z}{L}\right)^{3/2}\right)$.

Distributed 3D FFTs

Question A

Simply arithmetic work divided by machine peak:

$$T_{comp} = O\left(\frac{n^3\log{n}}{R_{max}}\right)$$

Question B

If the peak is $R_{max}$, then $P = \frac{R_{max}}{R_0}$. Thus,

$$
\begin{aligned}
T_{mem} &= O\left( \frac{n^3\log{n}}{P\beta_{mem}\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{ \frac{R_{max}}{R_0} \beta_{mem}\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{ R_{max} \frac{1}{R_0} \beta_{mem}\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{ R_{max} \frac{\beta_{mem}}{R_0} \log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{ R_{max} \frac{1}{B_0}\log{Z}} \right) \\
&= O\left( \frac{B_0}{\log{Z}} \cdot \frac{n^3\log{n}}{R_{max}} \right) \\
&= O\left( \frac{B_0}{\log{Z}} \cdot T_{comp} \right)
\end{aligned}
$$

where $B_0 = \frac{R_0}{\beta_{mem}}$.

Question C

This is operations divided by network bandwidth, which basically tells you how many operations you can compute in the time it takes to move data over the network.

Question D

$$B_{net} = \frac{R_0^{2/3}}{\beta_{link}}$$

$$
\begin{aligned}
T_{net} &= O\left( \frac{n^3}{P^{2/3}\beta_{link}} \right) \\
&= O\left( \frac{n^3}{\left(\frac{R_{max}}{R_0}\right)^{2/3}\beta_{link}} \right) \\
&= O\left( \frac{n^3}{R_{max}^{2/3} \cdot \frac{1}{R_0^{2/3}} \cdot \beta_{link}} \right) \\
&= O\left( \frac{n^3}{R_{max}^{2/3} \cdot \frac{1}{B_{net}}} \right) \\
&= O\left( B_{net} \cdot \frac{n^3}{R_{max}^{2/3}} \right)
\end{aligned}
$$

Question E

Since communication dominates and both $T_{mem} \propto B_0$ and $T_{net} \propto B_{net}$, the systems with the lower execution time will have smaller $B_0$ and $B_{net}$ values. We should actually use the mobile phone systems.

Six Degrees of HPC

Question A

Red vs Blue

Communication Madness

Question A

We need to split the problem up into on-chiplet communication and ring communication between chiplets. Since we have a lot of nodes, a broadcast/scatter won't work. We don't need to do ring communication for every node, only once per chiplet.

point-to-point exchange within each chiplet    // 6 messages among the 4 local ranks
for i = 1 to P/4                               // rotate around the ring of chiplets
  shift data from chiplet to chiplet (ranks 0 -> 4, 4 -> 8, ...)
point-to-point exchange within each chiplet

Question B

$$O\left( \alpha\frac{P}{4} + \beta\frac{P}{4}n + 6\left( \alpha_{chip} + \beta_{chip}\right) \right)$$

Projection

Question A

$$4nk - n - k$$

Question B

Using the SUMMA algorithm, we divide the $n \times n$ grid among $p$ processors arranged in a $\sqrt{p} \times \sqrt{p}$ grid, so each processor is responsible for an $\frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}}$ block. During iteration $i$, the processors in the $i$th column broadcast their blocks of $A$ to every processor in their row, and the processors in the $i$th row broadcast their blocks of $B$ to every processor in their column. Then local computation updates each processor's block of $C$.
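A compact sketch of that SUMMA loop, under assumptions not stated in the answer: $p$ is a perfect square, the block size $b = \frac{n}{\sqrt{p}}$ divides evenly, and blocks are stored row-major in local arrays; `summa`, `Abuf`, and `Bbuf` are illustrative names:

```c
/* Hedged SUMMA sketch: q x q process grid, each rank owns one b x b block
 * of A, B, and C. At step k, the owners of block-column k of A and
 * block-row k of B broadcast, and every rank does a local update. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

void summa(int b, const double *A, const double *B, double *C, MPI_Comm comm) {
  int rank, p;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &p);
  int q = (int)lround(sqrt((double)p));        /* process grid is q x q */
  int row = rank / q, col = rank % q;

  MPI_Comm row_comm, col_comm;
  MPI_Comm_split(comm, row, col, &row_comm);   /* same block-row of the grid */
  MPI_Comm_split(comm, col, row, &col_comm);   /* same block-column */

  double *Abuf = malloc((size_t)b * b * sizeof(double));
  double *Bbuf = malloc((size_t)b * b * sizeof(double));
  memset(C, 0, (size_t)b * b * sizeof(double));

  for (int k = 0; k < q; k++) {
    /* Owner of the k-th block-column of A broadcasts along its row. */
    if (col == k) memcpy(Abuf, A, (size_t)b * b * sizeof(double));
    MPI_Bcast(Abuf, b * b, MPI_DOUBLE, k, row_comm);
    /* Owner of the k-th block-row of B broadcasts down its column. */
    if (row == k) memcpy(Bbuf, B, (size_t)b * b * sizeof(double));
    MPI_Bcast(Bbuf, b * b, MPI_DOUBLE, k, col_comm);
    /* Local update of this rank's block of C. */
    for (int i = 0; i < b; i++)
      for (int l = 0; l < b; l++)
        for (int j = 0; j < b; j++)
          C[i * b + j] += Abuf[i * b + l] * Bbuf[l * b + j];
  }
  free(Abuf); free(Bbuf);
  MPI_Comm_free(&row_comm);
  MPI_Comm_free(&col_comm);
}
```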

Question C

$$\frac{T(n,1)}{pT(n,p)} = \Theta(1) \implies \frac{\Theta(nk)}{\Theta(nk + kp\log{p})} = \Theta(1)$$

$$\frac{\Theta(nk)}{\Theta(nk + kp\log{p})} = \Theta(1) \implies \Theta(p\log{p}) \in \Theta(n)$$

$$\Theta(p\log{p}) \in \Theta(n) \implies p \in \Theta\left( \frac{n}{\log{n}} \right)$$

Sort and Solve

Question A

Just follow example

Question B

WLOG, assume $m \ge n$. Reversing the second sorted sequence results in a bitonic sequence of length $m + n$. To reverse the second sorted sequence, each processor needs to communicate with at most 2 processors, and the communication time will be $O\left( \tau + \mu(m+n) \right)$.

Question C

Bitonic merging of a sequence of length $m + n$ using $p$ processors takes $O\left((\tau + \mu(m+n)/p)\log{p}\right)$ communication time (span) and $O\left((n + m)/\log{p}\right)$ computation time (work).
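To make the reverse-then-merge idea concrete, here is a small serial illustration (not the distributed, per-processor version): two sorted runs are concatenated with the second one reversed, which makes the whole array bitonic, and a standard bitonic merge then sorts it. It assumes the combined length is a power of two:

```c
/* Serial illustration of merging two sorted runs via a bitonic merge. */
#include <stdio.h>

static void swap(double *a, double *b) { double t = *a; *a = *b; *b = t; }

/* Standard bitonic merge: v[lo .. lo+len-1] must be bitonic, len a power of two. */
static void bitonic_merge(double *v, int lo, int len) {
  if (len <= 1) return;
  int half = len / 2;
  for (int i = lo; i < lo + half; i++)
    if (v[i] > v[i + half]) swap(&v[i], &v[i + half]);  /* compare-exchange */
  bitonic_merge(v, lo, half);
  bitonic_merge(v, lo + half, half);
}

int main(void) {
  /* First run {1,4,6,9} ascending, second run {2,3,7,8} stored reversed,
   * so the whole array rises then falls: a bitonic sequence. */
  double v[8] = {1, 4, 6, 9, 8, 7, 3, 2};
  bitonic_merge(v, 0, 8);
  for (int i = 0; i < 8; i++) printf("%g ", v[i]);   /* 1 2 3 4 6 7 8 9 */
  printf("\n");
  return 0;
}
```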

Parallel Distributed Scan

Question A

  1. Each node does a linear scan of its data.
  2. Each node performs the recursive scan, accumulating partial sums until we hit the base case. One processor on each node now has the accumulated sum, $\sigma_i$, of all of its local elements.
  3. Perform a scan across all the nodes. At the end of this step, each node has the +-scan, $S_i$, up to and including itself. $S_i - \sigma_i$ is the prefix sum of the values on the nodes prior to node $i$.
  4. Continue recursing on each node, applying the second half of the local scan.
  5. In parallel, each processor applies the node-wise prefix sum to its partition of the local data.
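A flattened sketch of this scheme, treating each MPI rank as one node and collapsing the intra-node recursion of steps 2 and 4 into a serial pass; the data layout is an assumption:

```c
/* Flattened sketch of the distributed +-scan: each rank plays the role of a
 * node, and the intra-node recursive scan is collapsed into a serial pass. */
#include <mpi.h>

void distributed_prefix_sum(double *local, int n_local, MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);

  /* Steps 1-2: serial inclusive scan of the local data; the last element
   * is sigma_i, the sum of everything this rank owns. */
  for (int i = 1; i < n_local; i++)
    local[i] += local[i - 1];
  double sigma = (n_local > 0) ? local[n_local - 1] : 0.0;

  /* Step 3: exclusive scan of the per-node totals across ranks;
   * offset = S_i - sigma_i, the sum of everything on lower-ranked nodes. */
  double offset = 0.0;
  MPI_Exscan(&sigma, &offset, 1, MPI_DOUBLE, MPI_SUM, comm);
  if (rank == 0) offset = 0.0;   /* Exscan leaves rank 0 undefined */

  /* Steps 4-5: apply the node-wise prefix to every local element. */
  for (int i = 0; i < n_local; i++)
    local[i] += offset;
}
```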

Question B

On each node, the addScan takes time $T_{P_l} = O\left( \frac{n}{P_d P_l} + \log^2{P_l} \right)$. The node-wise scan takes time $O\left( \log^2{P_d} \right)$, so the overall computation time is $T_{\text{comp}}\left( n, P_d, P_l \right) = \tau \cdot O\left( \frac{n}{P_d P_l} + \log^2{P_d} + \log^2{P_l} \right)$.

Communication only happens during the node-wise scan in the middle. Each round, half of the nodes send a message to, and later receive a message from, the other half of the nodes. The maximum number of messages any node sends (only two nodes make it to the final round) is therefore $2\log{P_d}$. Similarly, since each message is a single number, the number of words sent is also $2\log{P_d}$. The overall communication time is then $T_{\text{comm}}(n, P_d, P_l) = 2\alpha\log{P_d} + 2\beta\log{P_d}$.

Question C

Look at the paper

Question D

Since the linear network can be logically embedded in a 2d grid, the algorithm (and its timings) does not change.

BeastMode Returns

Question A

You don't need to sort and communicate everything, just the keywords and their counts.

  1. Across P nodes, collect keyword counts locally
  2. Do sample sort to distribute keywords and their counts across all nodes.
  3. Local scan to find the most common values.

Alternatively, tree-like accumulation of keywords/counts

You can’t just reduce all keywords onto the root as there is too much data

Question B

$$\alpha P + \beta(P \times k)$$

Analyzing a Distributed Memory Problem

Question A

1D:

$$\frac{N}{P} \times N$$

2D:

$$\frac{N}{\sqrt{P}} \times \frac{N}{\sqrt{P}}$$

Question B

We can ignore the boundary elements, since they don't need to communicate with neighbors on every side and won't dominate the time. For the 1D partitioning, we send $N$-element blocks to each neighbor (one on either side), so the communication time will be $T_{\text{comm}} = 2(\alpha + \beta N)$. For the 2D partitioning, $\frac{N}{\sqrt{P}}$ elements need to be communicated to each of the 4 neighboring blocks, so the total communication time will be $T_{\text{comm}} = 4\left( \alpha + \beta\frac{N}{\sqrt{P}} \right)$.

The computation time will be the same for both partitionings: assuming constant time for the divisions and multiplications, each process updates its $\frac{N^2}{P}$ elements.
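A sketch of the 1D exchange described above, assuming each rank stores its $\frac{N}{P}$ owned rows between two ghost rows in one contiguous buffer; the names and layout are illustrative:

```c
/* Hedged sketch of the 1D halo exchange: each rank owns n_local_rows rows of
 * the N x N grid plus one ghost row above and one below, and trades its
 * boundary rows with its up/down neighbors each iteration. */
#include <mpi.h>

void exchange_halo_1d(double *rows, int n_local_rows, int N, MPI_Comm comm) {
  int rank, P;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &P);
  int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
  int down = (rank < P - 1) ? rank + 1 : MPI_PROC_NULL;

  /* rows[0]             : ghost row received from 'up'
   * rows[1 .. n_local]  : owned rows
   * rows[n_local + 1]   : ghost row received from 'down'                  */
  double *first_owned = rows + N;
  double *last_owned  = rows + (size_t)n_local_rows * N;
  double *ghost_up    = rows;
  double *ghost_down  = rows + (size_t)(n_local_rows + 1) * N;

  /* Two N-element messages per rank per iteration: T_comm = 2(alpha + beta*N). */
  MPI_Sendrecv(first_owned, N, MPI_DOUBLE, up,   0,
               ghost_down,  N, MPI_DOUBLE, down, 0, comm, MPI_STATUS_IGNORE);
  MPI_Sendrecv(last_owned,  N, MPI_DOUBLE, down, 1,
               ghost_up,    N, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);
}
```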

Question C

$$\frac{S_{1D}}{Pn} = \Omega(p)$$