Reaves.dev

v0.1.0

built using

Phoenix v1.7.12

Work Span Model

Stephen M. Reaves

::

2024-08-29

Notes about Lecture 3 for CS-6220

Required Readings

Optional

Summary

Multithreaded DAG Model

Each vertex is an operation, each edge is a dependency.

We assume there is a PRAM model with P processors with shared memory. We can assign up to P operations to complete at a time, assuming all the dependencies are met.

Our cost model assumes:

Example Sample Reduction

let A = array of length n

s <- 0
for i <- 1 to n do
  s <- s + A[i]
Ga1A[1]p1+a1->p1p2+p1->p2a2A[2]a2->p2p3+p2->p3a3A[3]a3->p3pn+p3->pnanA[n]an->pn
Time to execute this DAG? T(p)nT(p) \ge n
Ga1A[1]p1+a1->p1a2A[2]a2->p1p12+p1->p12a3A[3]p2+a3->p2a4A[4]a4->p2p2->p12an1A[n-1]pn+an1->pnanA[n]an->pnpn1+pn->pn1p+p12->ppn1->pi->pn1
Minimum time to execute this DAG on a PRAM with p=n procs? T(p)=O(log(n))T(p) = O(log(n))

Work and Span

Work := total # of nodes

Span := # of vertices on the longest path

Given work and span, what can we say?

Average available parallelism := W(n)D(n) \frac{W(n)}{D(n)}

Tp(n)max{D(n),W(n)p}T_p(n) \ge max \left{ D(n), \left\lceil\frac{W(n)}{p}\right\rceil \right}

Brent’s Theorem

Break execution into phases:

For any phase (k), W(k) is the total amount of work for that phase.

k=1DWk=W\sum_{k=1}^{D}W_k = W

How long will it take to execute phase k? (tk t_k )?

tk=wkp    Tp=k=1Dtk t_k = \left\lceil\frac{w_k}{p}\right\rceil \implies T_p = \sum_{k=1}^{D}t_k

tk=wkp    Tp=k=1Dwkp \phantom{t_k = \left\lceil\frac{w_k}{p}\right\rceil} \implies T_p = \sum_{k=1}^{D}\left\lceil\frac{w_k}{p}\right\rceil

tk=wkp    Tp=k=1Dwk1p+1 \phantom{t_k = \left\lceil\frac{w_k}{p}\right\rceil} \implies T_p = \sum_{k=1}^{D}\left\lfloor\frac{w_k - 1}{p}\right\rfloor + 1

tk=wkp    Tpk=1Dwk1p+1 \phantom{t_k = \left\lceil\frac{w_k}{p}\right\rceil} \implies T_p \le \sum_{k=1}^{D}\frac{w_k - 1}{p} + 1

tk=wkp    TpWDP+D \phantom{t_k = \left\lceil\frac{w_k}{p}\right\rceil} \implies T_p \le \frac{W-D}{P} + D

The time to execute a DAG is no more than the time to execute the critical path (D) plus the time to execute everything off of the critical path using the P processors
Brent's Thoerem

max{D,WP}TpWDP+Dmax\left{D, \left\lceil\frac{W}{P}\right\rceil\right} \le T_p \le \frac{W-D}{P} + D

Speedup

Speedup := Best sequential time divided by the best parallel time

Sp(n)T(n)Tp(n) Sp(n) \equiv \frac{T_*(n)}{T_p(n)}

Sp(n)W(n)f(w,d,n,p) \phantom{Sp(n)} \equiv \frac{W_*(n)}{f(w,d,n,p)}

Sp(n)W(n)f(w,d,n,p)WWDP+D \phantom{Sp(n)} \equiv \frac{W_* (n)}{f(w,d,n,p)} \ge \frac{W_*}{\frac{W-D}{P} + D}

Sp(n)W(n)f(w,d,n,p)PWW+P1W/D \phantom{Sp(n) \equiv \frac{W_* (n)}{f(w,d,n,p)}} \ge \frac{P}{\frac{W}{W_* } + \frac{P-1}{W_* / D}}

Ideal speedup is linear in P. In order for speedup to be linear, I need the denominator to be a constant. For that to happen the work per processor has to grow as some function of n.

Sp(n)T(n)Tp(n)=Θ(p)Sp(n) \equiv \frac{T_* (n)}{Tp(n)} = \Theta(p)
Speedup
W(n)=O(W(n))W(n) = O(W_* (n))
Work optimality
p=O(WD) or Wp=Ω(D)p = O\left(\frac{W_* }{D}\right) \text{ or } \frac{W_* }{p} = \Omega(D)
Weak scalability

Basic Concurrency Primitives

Spawn and Sync

reduce(A[0:n-1])
  if n > 2 then
    a <- reduce(A[0:n/2-1])
    b <- reduce(A[n/2:n-1])
    return a+b
  // else n = 1
  return A[0]
reduce(A[0:n-1])
  if n > 2 then
    a <- spawn reduce(A[0:n/2-1])
    b <- reduce(A[n/2:n-1])
    sync
    return a+b
  // else n = 1
  return A[0]

Par-for

par-for i <- 1 to n do
  foo(i)

That’s O(n)

ParForT(foo, a, b)
  let n = b - a + 1
  if n == 1 then foo(a)
  else
    let m = a + n/2 // floor/integer division
    spawn ParForT(foo, a, m-1)
    ParForT(foo, m, b)
    sync

D(n) = O(log(n)) (span)