Study guide for the final exam for CS-6220
Homework 5
Homework 5 with solutions
Work Optimal: Does the parallel algorithm do the same amount of work as the sequential algorithm?

$$\text{optimal} = \frac{W_*}{W_p}$$

Speedup: sequential time over parallel time.

$$S = \frac{T_1}{T_p}$$

Parallel Efficiency: speedup over number of processors.

$$E = \frac{S}{p}$$

Isoefficiency: how fast $n$ must grow as a function of $p$ to keep efficiency constant.
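As a quick worked example of these metrics (a hypothetical algorithm, not one from the homework, with $T_1 = n$ and $T_p = \frac{n}{p} + \log{p}$):

$$
S = \frac{T_1}{T_p} = \frac{n}{\frac{n}{p} + \log{p}},
\qquad
E = \frac{S}{p} = \frac{1}{1 + \frac{p\log{p}}{n}}
$$

Keeping $E$ constant requires $\frac{p\log{p}}{n}$ to stay constant, so the isoefficiency function for this toy algorithm is $n = \Omega(p\log{p})$.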
If the work done by the parallel algorithm is the same as (or better than) the work of the best sequential algorithm, then we say it is work optimal. In this case, each of the $p$ processors makes $\sqrt{p}$ calls to FOO on $\frac{n}{p}$ data, which works out to be the same amount of work as the sequential algorithm.
$$
\begin{aligned}
W_p(n) &= \text{processor count} \times \text{loop count} \times \text{work per loop} \\
&= p \times \sqrt{p} \times W_{*}\!\left(\frac{n}{p}\right) \\
&= p \times \sqrt{p} \times \frac{2n^{3/2}}{p^{3/2}} \\
&= 2n^{3/2} \\
&= W_*(n)
\end{aligned}
$$
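To make the work count concrete, here is a minimal sketch of the per-rank loop structure described above, assuming a hypothetical `foo` kernel whose sequential work on $m$ items is $W_*(m) = 2m^{3/2}$ (the placeholder body below just stands in for it):

```c
#include <math.h>
#include <mpi.h>

// Stand-in for the homework's FOO kernel; the real one does
// W_*(m) = 2 m^(3/2) work on m items.
static void foo(double *chunk, int m) {
    for (int i = 0; i < m; i++) chunk[i] += 1.0;  // placeholder work
}

// Each of the p ranks makes sqrt(p) calls to FOO on its n/p local items,
// so the total work is p * sqrt(p) * 2 (n/p)^(3/2) = 2 n^(3/2) = W_*(n).
void parallel_foo(double *local, int n, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);
    int m = n / p;                        // local items per rank
    int rounds = (int)sqrt((double)p);    // sqrt(p) calls per rank
    for (int r = 0; r < rounds; r++) {
        foo(local, m);
        // The real algorithm also communicates (alpha + (n/p) beta) per
        // round, which shows up in T_p but not in the work count W_p(n).
    }
}
```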
The speedup is
S=T1Tp S = \frac{T_1}{T_p}
S=2n3/2τ2n3/2pτ+p(α+npβ) \phantom{S} = \frac{2n^{3/2}\tau}{\frac{2n^{3/2}}{p}\tau + \sqrt{p}\left(\alpha + \frac{n}{p}\beta\right)}
The efficiency is
$$
\begin{aligned}
E &= \frac{S}{p} \\
&= \frac{2n^{3/2}\tau}{\frac{2n^{3/2}}{p}\tau + \sqrt{p}\left(\alpha + \frac{n}{p}\beta\right)} \times \frac{1}{p} \\
&= \frac{2n^{3/2}\tau}{2n^{3/2}\tau + p^{3/2}\left(\alpha + \frac{n}{p}\beta\right)} \\
&= \frac{1}{1 + \frac{\alpha}{2\tau}\left(\frac{p}{n}\right)^{3/2} + \frac{\beta}{2\tau}\left(\frac{p}{n}\right)^{1/2}}
\end{aligned}
$$
To find the isoefficiency, we can ignore the constants. Both the alpha term and the beta term depend only on the ratio $\frac{p}{n}$, so they stay bounded exactly when $n$ grows at least linearly with $p$. The isoefficiency function is therefore $n = \Omega(p)$.
$$P(x) = a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1})))$$
$$
\begin{aligned}
P(x) &= x^0\left(a_0 + x^2(a_2 + x^2(a_4 + \cdots + x^2(a_{n-2})))\right) \\
&\quad + x^1\left(a_1 + x^2(a_3 + x^2(a_5 + \cdots + x^2(a_{n-1})))\right)
\end{aligned}
$$
$$
\begin{aligned}
P(x) &= x^0\left(a_0 + x^p(a_{0+p} + x^p(a_{0+2p} + \cdots + x^p(a_{n-p+0})))\right) \\
&\quad + x^1\left(a_1 + x^p(a_{1+p} + x^p(a_{1+2p} + \cdots + x^p(a_{n-p+1})))\right) \\
&\quad + x^2\left(a_2 + x^p(a_{2+p} + x^p(a_{2+2p} + \cdots + x^p(a_{n-p+2})))\right) \\
&\quad + x^3\left(a_3 + x^p(a_{3+p} + x^p(a_{3+2p} + \cdots + x^p(a_{n-p+3})))\right) \\
&\quad \vdots \\
&\quad + x^{p-1}\left(a_{p-1} + x^p(a_{p-1+p} + x^p(a_{p-1+2p} + \cdots + x^p(a_{n-1})))\right)
\end{aligned}
$$
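A sequential sketch of this $p$-way split (function and variable names here are illustrative, not from the homework): each stride-$p$ sub-polynomial is evaluated with Horner's rule in $y = x^p$, and because the $p$ sub-evaluations are independent, each one could be handed to a different processor.

```c
// Evaluate P(x) = a[0] + a[1] x + ... + a[n-1] x^(n-1) via the p-way split:
// P(x) = sum_{r=0}^{p-1} x^r * Q_r(x^p), where Q_r uses coefficients a[r], a[r+p], ...
double poly_eval_split(const double *a, int n, double x, int p) {
    double y = 1.0;                       // y = x^p
    for (int k = 0; k < p; k++) y *= x;

    double result = 0.0;
    double xr = 1.0;                      // x^r
    for (int r = 0; r < p; r++) {
        if (r >= n) break;                // fewer coefficients than p
        // Horner's rule on the stride-p sub-polynomial Q_r(y); in the
        // parallel version each r is handled by a different processor.
        double q = 0.0;
        for (int i = ((n - 1 - r) / p) * p + r; i >= r; i -= p)
            q = q * y + a[i];
        result += xr * q;
        xr *= x;
    }
    return result;
}
```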
Analysis:
Overall asymptotic runtime is $\Theta\left(\frac{N}{P} + (\alpha + \beta)\log{P}\right)$, where $\alpha$ is the latency cost and $\beta$ is the bandwidth cost of inter-rank communication.
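One common pattern with exactly this cost (a hedged sketch, not necessarily the exact algorithm being analyzed here) is a local pass over the $\frac{N}{P}$ owned elements followed by a tree-based allreduce of a constant-size result, whose $\log{P}$ rounds each cost $\alpha + \beta$:

```c
#include <mpi.h>

// Theta(N/P) local work followed by a tree-based allreduce of one double:
// Theta(N/P + (alpha + beta) log P) overall.
double sum_then_allreduce(const double *local, long local_n, MPI_Comm comm) {
    double partial = 0.0, total = 0.0;
    for (long i = 0; i < local_n; i++)    // Theta(N/P) local computation
        partial += local[i];
    // log P rounds, each moving a single word: (alpha + beta) log P
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
    return total;
}
```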
Do $d$ rounds of allreduce over $p$ processors.

Tree-based:

$$O\left(\log{p}\left(\alpha + n\beta\right)\right) \times d$$

Bandwidth-optimal:

$$O\left(p\alpha + n\beta\right) \times d$$
You can now do each allreduce in parallel (see the sketch below), leading to a factor of $d$ improvement in the latency term.
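One way to overlap the $d$ allreduces in MPI (a sketch; the homework may instead intend a hand-rolled schedule) is MPI-3 nonblocking collectives, so the $d$ latency terms overlap instead of being paid back-to-back:

```c
#include <mpi.h>

// Launch d independent allreduces at once and wait for all of them,
// instead of paying the log(p) latency term d times sequentially.
void allreduce_rounds(double *sendbufs[], double *recvbufs[],
                      int n, int d, MPI_Comm comm) {
    MPI_Request reqs[d];
    for (int r = 0; r < d; r++)
        MPI_Iallreduce(sendbufs[r], recvbufs[r], n, MPI_DOUBLE,
                       MPI_SUM, comm, &reqs[r]);
    MPI_Waitall(d, reqs, MPI_STATUSES_IGNORE);
}
```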
```c
#include <mpi.h>

// Every rank exchanges a small "magic" token with every other rank and waits
// for all of the nonblocking operations before continuing. (Named
// my_MPI_Barrier here only to avoid clashing with the library symbol.)
int my_MPI_Barrier(MPI_Comm comm) {
    int P;
    MPI_Comm_size(comm, &P);
    int magic = 454;
    int magic_recv[P];
    MPI_Request requests[2 * P];
    for (int i = 0; i < P; i++) {
        MPI_Isend(&magic, 1, MPI_INT, i, magic, comm, &requests[i]);
        MPI_Irecv(&magic_recv[i], 1, MPI_INT, i, magic, comm, &requests[i + P]);
    }
    MPI_Waitall(2 * P, requests, MPI_STATUSES_IGNORE);
    return MPI_SUCCESS;
}
```
All processes need to send to and recv from every other process and wait until those communications are complete before continuing. We send/recv asynchronously to reduce network latency costs.
Since it doesn’t matter what data we are sending, we are only bound by latency.
$$T_{msg} = O(\alpha)$$
Because all of the sends and recvs happen asynchronously, they overlap and we only have to wait for the cost of one message. Since we are sending a fixed (potentially small) amount of data and we are looking asymptotically, we can ignore the bandwidth cost.
Total runtime $= O\left(\frac{n^2}{p} + \tau\log{p} + \mu n\right)$
In an $n \times n$ torus, the maximum distance $D_{(i,j)\rightarrow(j,i)}$ an element $A[i,j]$ has to travel to its destination $(j,i)$ is $D_{(i,j)\rightarrow(j,i)} = 2 \times \min\left((j-i) \bmod n,\ (i-j) \bmod n\right)$: thanks to the wraparound connections the offset can be covered in whichever direction is shorter, and it must be covered twice, once along rows and once along columns. This means the number of steps to transpose any element is bounded by $2 \times \left\lfloor\frac{n}{2}\right\rfloor = O(n)$.
Note that $(i - j) \bmod n + (j - i) \bmod n \le n$, so the algorithm terminates in $n$ steps. Each processor receives at most 4 elements (one from each direction) before passing elements along, so each processor holds a constant number of matrix elements.
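An illustrative helper that computes just the wraparound distance used above (the routing itself is not shown):

```c
// Hops for element A[i,j] to reach (j,i) on an n x n torus: the row offset and
// the column offset are both min((j-i) mod n, (i-j) mod n), since each can use
// the wraparound link in whichever direction is shorter.
int torus_transpose_distance(int i, int j, int n) {
    int fwd = ((j - i) % n + n) % n;   // (j - i) mod n, kept non-negative
    int bwd = ((i - j) % n + n) % n;   // (i - j) mod n
    int step = fwd < bwd ? fwd : bwd;
    return 2 * step;                   // row move plus column move
}
```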
According to the planar separator theorem, the size of a frontier can't be larger than $O(\sqrt{n})$. So we need enough cache lines to hold a frontier.
$$
\begin{aligned}
\sqrt{n} &= O\left(\frac{Z}{L}\right) \\
n &= O\left(\left(\frac{Z}{L}\right)^2\right)
\end{aligned}
$$
If the graph were 3-D, the separator would scale like $O\left(n^{2/3}\right)$, so fitting a frontier in cache would require $n = O\left(\left(\frac{Z}{L}\right)^{3/2}\right)$.
This is simply the arithmetic work divided by the machine peak:
$$T_{comp} = O\left(\frac{n^3\log{n}}{R_{max}}\right)$$
If the peak is $R_{max}$, then $P = \frac{R_{max}}{R_0}$. Thus,
$$
\begin{aligned}
T_{mem} &= O\left( \frac{n^3\log{n}}{P\,\beta_{mem}\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{\frac{R_{max}}{R_0}\,\beta_{mem}\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{R_{max}\,\frac{1}{R_0}\,\beta_{mem}\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{R_{max}\,\frac{\beta_{mem}}{R_0}\,\log{Z}} \right) \\
&= O\left( \frac{n^3\log{n}}{R_{max}\,\frac{1}{B_0}\,\log{Z}} \right) \\
&= O\left( \frac{B_0}{\log{Z}} \cdot \frac{n^3\log{n}}{R_{max}} \right) \\
&= O\left( \frac{B_0}{\log{Z}} \cdot T_{comp} \right)
\end{aligned}
$$
This is operations divided by network bandwidth, which basically tells you how many operations you can compute in the time it takes to move data over the network.
$$B_{net} = \frac{R_0^{2/3}}{\beta_{link}}$$
$$
\begin{aligned}
T_{net} &= O\left( \frac{n^3}{P^{2/3}\beta_{link}} \right) \\
&= O\left( \frac{n^3}{\left(\frac{R_{max}}{R_0}\right)^{2/3}\beta_{link}} \right) \\
&= O\left( \frac{n^3}{R_{max}^{2/3} \cdot \frac{1}{R_0^{2/3}} \cdot \beta_{link}} \right) \\
&= O\left( \frac{n^3}{R_{max}^{2/3} \cdot \frac{1}{B_{net}}} \right) \\
&= O\left( B_{net} \cdot \frac{n^3}{R_{max}^{2/3}} \right)
\end{aligned}
$$
Since communication dominates, and both $T_{mem} \propto B_0$ and $T_{net} \propto B_{net}$, the system with the lower execution time will be the one with the smaller $B_0$ and $B_{net}$ values. We should actually use the mobile phone systems.
Need to split the problem up based on on-chiplet and ring communication. Since we have a lot of nodes, broadcast/scatter won't work. We don't need to do ring communication for each node, only for each chiplet.
```
point-to-point on chiplet
for i = 1 to P/4:
    shift data from 0 -> 4, 4 -> 8, ...
    point-to-point on chiplet
```
$$O\left( \alpha\frac{P}{4} + \beta\frac{P}{4}n + 6\left( \alpha_{chip} + \beta_{chip}\right) \right)$$
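A heavily hedged MPI sketch of that structure, assuming 4 ranks per chiplet, a placeholder on-chiplet exchange, and a ring shift between corresponding ranks of neighboring chiplets (the real topology and data movement depend on the problem statement):

```c
#include <mpi.h>

// Placeholder for the problem's on-chiplet point-to-point step among the
// 4 local ranks; the real exchange depends on the homework's setup.
static void exchange_on_chiplet(double *buf, int n, MPI_Comm chiplet_comm) {
    (void)buf; (void)n; (void)chiplet_comm;  // left abstract here
}

void chiplet_ring(double *buf, int n, MPI_Comm comm) {
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    // Ranks 4c..4c+3 live on chiplet c; split out a per-chiplet communicator.
    MPI_Comm chiplet_comm;
    MPI_Comm_split(comm, rank / 4, rank % 4, &chiplet_comm);

    exchange_on_chiplet(buf, n, chiplet_comm);
    for (int i = 1; i < P / 4; i++) {
        // Shift between chiplets: rank r sends to r+4, receives from r-4 (mod P).
        int dst = (rank + 4) % P, src = (rank - 4 + P) % P;
        MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, dst, 0, src, 0,
                             comm, MPI_STATUS_IGNORE);
        exchange_on_chiplet(buf, n, chiplet_comm);
    }
    MPI_Comm_free(&chiplet_comm);
}
```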
$$4nk - n - k$$
Using the SUMMA algorithm, we divide the $n \times n$ matrices among $p$ processors arranged in a $\sqrt{p} \times \sqrt{p}$ grid, so each processor is responsible for an $\frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}}$ block. During iteration $i$, the processors holding the $i$th block column of A broadcast their blocks across their processor rows, and the processors holding the $i$th block row of B broadcast their blocks down their processor columns. Then each processor does a local multiplication to update its block of C (sketched below).
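A hedged sketch of that loop structure, assuming square $b \times b$ local blocks with $b = \frac{n}{\sqrt{p}}$ already distributed, row/column communicators built with `MPI_Comm_split`, C initialized to zero, and a plain triple loop standing in for the local multiply:

```c
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

// C += A * B with SUMMA on a q x q process grid (q = sqrt(p)).
// A, B, C are the local b x b blocks (row-major). Assumes row_comm ranks
// equal mycol and col_comm ranks equal myrow, e.g. from
// MPI_Comm_split(comm, myrow, mycol, &row_comm) and
// MPI_Comm_split(comm, mycol, myrow, &col_comm).
void summa(const double *A, const double *B, double *C, int b,
           int myrow, int mycol, int q,
           MPI_Comm row_comm, MPI_Comm col_comm) {
    double *Abuf = malloc((size_t)b * b * sizeof *Abuf);
    double *Bbuf = malloc((size_t)b * b * sizeof *Bbuf);

    for (int k = 0; k < q; k++) {
        // Owners of the k-th block column of A broadcast along their row;
        // owners of the k-th block row of B broadcast down their column.
        if (mycol == k) memcpy(Abuf, A, (size_t)b * b * sizeof *Abuf);
        if (myrow == k) memcpy(Bbuf, B, (size_t)b * b * sizeof *Bbuf);
        MPI_Bcast(Abuf, b * b, MPI_DOUBLE, k, row_comm);
        MPI_Bcast(Bbuf, b * b, MPI_DOUBLE, k, col_comm);

        // Local update: C += Abuf * Bbuf.
        for (int i = 0; i < b; i++)
            for (int l = 0; l < b; l++)
                for (int j = 0; j < b; j++)
                    C[i * b + j] += Abuf[i * b + l] * Bbuf[l * b + j];
    }
    free(Abuf);
    free(Bbuf);
}
```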
$$
\begin{aligned}
\frac{T(n,1)}{p\,T(n,p)} = \Theta(1) &\implies \frac{\Theta(nk)}{\Theta(nk + kp\log{p})} = \Theta(1) \\
&\implies \Theta(p\log{p}) \in \Theta(n) \\
&\implies p \in \Theta\left( \frac{n}{\log{n}} \right)
\end{aligned}
$$
Just follow the example.
WLOG, $m \ge n$. Reversing the second sorted sequence results in a bitonic sequence of length $m + n$. To reverse the second sorted sequence, each processor needs to communicate with at most 2 processors, and the communication time will be $O\left( \tau + \mu(m+n) \right)$.
Bitonic merging of a sequence of length $m + n$ using $p$ processors takes $O\left(\left(\tau + \mu(m+n)/p\right)\log{p}\right)$ communication time (span) and $O\left((nm)/\log{p}\right)$ computation time (work).
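For reference, a sketch of the standard sequential bitonic-merge kernel each processor would run on its local data (assuming the sequence length is a power of two and already bitonic; the distributed version spreads these compare-exchange stages across processors):

```c
// Merge a bitonic sequence a[lo .. lo+n-1] into ascending (dir = 1) or
// descending (dir = 0) order. Assumes n is a power of two.
void bitonic_merge(double *a, int lo, int n, int dir) {
    if (n <= 1) return;
    int m = n / 2;
    for (int i = lo; i < lo + m; i++) {
        if ((a[i] > a[i + m]) == dir) {       // compare-exchange across halves
            double t = a[i];
            a[i] = a[i + m];
            a[i + m] = t;
        }
    }
    bitonic_merge(a, lo, m, dir);
    bitonic_merge(a, lo + m, m, dir);
}
```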
On each node, the addScan takes time $T_{P_l} = O\left( \frac{n}{P_d P_l} + \log^2{P_l} \right)$. The node-wise scan takes time $O\left( \log^2{P_d} \right)$, so the time taken overall is $T_{\text{comp}}\left( n, P_d, P_l \right) = \tau \cdot O\left( \frac{n}{P_d P_l} + \log^2{P_d} + \log^2{P_l} \right)$.
Communication only happens during the node-wise scan in the middle. Each round, half of the nodes send a message to, and later receive a message from, the other half of the nodes. The overall maximum number of messages sent by any node (only two nodes make it to the final round) is $2\log{P_d}$. Similarly, since each message is a single number, the final number of words sent is $2\log{P_d}$. The overall communication time is then $T_{\text{comm}}(n, P_d, P_l) = 2\alpha\log{P_d} + 2\beta\log{P_d}$.
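A hedged sketch of the overall two-level structure (local add-scan, then a node-wise exclusive scan of the per-rank totals, then adding the offset back in). Here `MPI_Exscan` stands in for the hand-rolled node-wise scan whose $2\log{P_d}$ messages are counted above:

```c
#include <mpi.h>

// In-place prefix sum (add-scan) across all ranks: local scan, exclusive
// scan of the per-rank totals, then shift every local element by the offset.
void distributed_add_scan(double *x, long n_local, MPI_Comm comm) {
    // 1. Local inclusive scan: O(n / (P_d * P_l)) work per rank.
    for (long i = 1; i < n_local; i++)
        x[i] += x[i - 1];

    // 2. Exclusive scan of the local totals across ranks; a tree-based
    //    implementation sends/receives O(log P_d) one-word messages.
    double total = (n_local > 0) ? x[n_local - 1] : 0.0;
    double offset = 0.0;
    MPI_Exscan(&total, &offset, 1, MPI_DOUBLE, MPI_SUM, comm);

    // 3. Rank 0's recv buffer is undefined for MPI_Exscan; treat it as 0.
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) offset = 0.0;

    for (long i = 0; i < n_local; i++)
        x[i] += offset;
}
```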
Look at the paper
Since the linear network can be logically embedded in a 2D grid, the algorithm (and its timing) does not change.
You don't need to sort and communicate everything, just the keywords and their counts.
Alternatively, tree-like accumulation of keywords/counts
You can’t just reduce all keywords onto the root as there is too much data
$$\alpha P + \beta(P \times k)$$
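A hedged sketch matching that cost, assuming each rank has already reduced its local text to $k$ fixed-size (keyword, count) pairs: the root collects one message of $k$ pairs from each of the $P$ ranks, which for a flat gather is roughly $\alpha P + \beta P k$ (a real `MPI_Gather` may use a tree internally, changing the constants):

```c
#include <mpi.h>

#define KEYWORD_LEN 32

typedef struct {
    char word[KEYWORD_LEN];
    long count;
} KeywordCount;

// Root collects k fixed-size (keyword, count) pairs from each of the P ranks.
// 'all' must have room for P * k entries on the root; it may be NULL elsewhere.
void gather_top_k(const KeywordCount *local, KeywordCount *all,
                  int k, MPI_Comm comm) {
    int bytes = k * (int)sizeof(KeywordCount);
    MPI_Gather(local, bytes, MPI_BYTE, all, bytes, MPI_BYTE, 0, comm);
}
```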
1D:

$$\frac{N}{P} \times N$$

2D:

$$\frac{N}{\sqrt{P}} \times \frac{N}{\sqrt{P}}$$
We can ignore the boundary elements since they don't have neighbors on every side and won't dominate the time. For 1D partitioning, each processor sends a block of $N$ elements to each neighbor (one on either side), so the communication time will be $T_{\text{comm}} = 2(\alpha + \beta N)$. For 2D partitioning, $\frac{N}{\sqrt{P}}$ elements need to be communicated to each of the 4 neighbors, so the total communication time will be $T_{\text{comm}} = 4\left( \alpha + \beta\frac{N}{\sqrt{P}} \right)$. A sketch of the 1D exchange follows.
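A hedged sketch of the 1D halo exchange, assuming each rank owns $\frac{N}{P}$ contiguous rows plus two ghost rows and exchanges one full row of $N$ elements with each neighbor, matching $T_{\text{comm}} = 2(\alpha + \beta N)$:

```c
#include <mpi.h>

// u holds (rows + 2) x N doubles: row 0 and row rows+1 are ghost rows.
// Exchange boundary rows with the ranks above and below (non-periodic).
void halo_exchange_1d(double *u, int rows, int N, MPI_Comm comm) {
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < P - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send my top interior row up, receive my bottom ghost row from below.
    MPI_Sendrecv(&u[1 * N],          N, MPI_DOUBLE, up,   0,
                 &u[(rows + 1) * N], N, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    // Send my bottom interior row down, receive my top ghost row from above.
    MPI_Sendrecv(&u[rows * N],       N, MPI_DOUBLE, down, 1,
                 &u[0],              N, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}
```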
The computation time will be the same for both. The number of elements per process, assuming constant time for divisions and multiplications, is $\frac{N^2}{P}$.
$$\frac{S_{1D}}{Pn} = \Omega(p)$$