GrdrBeam · Phoenix Framework

Required Readings

Optional

Summary

Multithreaded DAG Model
Example Sample Reduction
Work and Span
Brent’s Theorem
Speedup
Basic Concurrency Primitives
- Spawn and Sync
- Par-for

Multithreaded DAG Model

Each vertex is an operation, each edge is a dependency.

We assume a PRAM (parallel random-access machine) with $p$ processors that share one memory. In each time step we can execute up to $p$ operations, provided all of their dependencies have already completed.

Our cost model assumes:

All processors run at the same speed
1 op = 1 unit of time
No edge cost
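To make the cost model concrete, here is a minimal sketch of a greedy scheduler that counts unit time steps for a DAG on $p$ processors. The function name `greedy_schedule_time` and the dictionary encoding of the DAG (vertex mapped to the list of vertices it depends on) are assumptions made for illustration, not part of the notes.

```python
def greedy_schedule_time(deps, p):
    """Count unit time steps needed to run a DAG on p processors.

    deps maps each vertex to the list of vertices it depends on
    (a hypothetical encoding chosen for this sketch).
    Each step runs up to p vertices whose dependencies are all done.
    """
    remaining = {v: set(d) for v, d in deps.items()}
    done = set()
    steps = 0
    while remaining:
        # A vertex is ready once every dependency has finished.
        ready = [v for v, d in remaining.items() if d <= done]
        if not ready:
            raise ValueError("dependency cycle or unknown vertex")
        batch = ready[:p]  # at most p operations per time step
        for v in batch:
            del remaining[v]
        done.update(batch)
        steps += 1  # 1 op = 1 unit of time, no edge cost
    return steps


# Diamond-shaped DAG: a -> {b, c} -> d
diamond = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(greedy_schedule_time(diamond, 1))
print(greedy_schedule_time(diamond, 2))
```

With one processor the diamond takes one step per vertex; with two processors `b` and `c` share a step.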

Example Sample Reduction

let A = array of length n

s <- 0
for i <- 1 to n do
  s <- s + A[i]

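Time to execute this DAG on $p$ processors?

$T_p(n) \ge n$

Each addition depends on the previous one, so the DAG is a single chain and extra processors do not help.

Minimum time on a PRAM with $p = n$ processors, if the additions are regrouped into a balanced tree (addition is associative)?

$T_p(n) = O(\log n)$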
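The logarithmic bound comes from combining elements pairwise, level by level. The sketch below simulates that tree-shaped reduction sequentially and counts the rounds; the name `tree_reduce_rounds` is hypothetical, and it assumes there are enough processors to perform every addition of a round in one time step.

```python
def tree_reduce_rounds(values):
    """Sum values by pairwise combining; return (total, rounds).

    Every addition inside one round is independent of the others,
    so with enough processors a round costs one time step.
    """
    level = list(values)
    if not level:
        return 0, 0
    rounds = 0
    while len(level) > 1:
        # Combine neighbours (0,1), (2,3), ...
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        rounds += 1
    return level[0], rounds


total, rounds = tree_reduce_rounds(range(1, 9))
print(total, rounds)
```

For 8 elements the level sizes go 8, 4, 2, 1, so the sum is found in 3 rounds.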

Work and Span

Work := total # of nodes

Span := # of vertices on the longest path

Given work and span, what can we say?

Given only 1 processor, time should equal work
- $T_1(n) = W(n)$
Given infinitely many processors, time should equal span
- $T_\infty(n) = D(n)$

Average available parallelism := $\frac{W(n)}{D(n)}$

$T_p(n) \ge \max\left\{ D(n), \left\lceil\frac{W(n)}{p}\right\rceil \right\}$
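As a rough illustration, work and span can be computed directly from a DAG and plugged into the lower bound. The helper names `work_and_span` and `time_lower_bound`, and the dictionary encoding of the DAG, are assumptions for this sketch; the recursive depth computation is only suitable for small graphs.

```python
from functools import lru_cache


def work_and_span(deps):
    """Return (W, D) for a DAG given as vertex -> list of dependencies."""
    work = len(deps)  # work = total number of vertices

    @lru_cache(maxsize=None)
    def depth(v):
        # Number of vertices on the longest path ending at v.
        return 1 + max((depth(u) for u in deps[v]), default=0)

    span = max((depth(v) for v in deps), default=0)
    return work, span


def time_lower_bound(work, span, p):
    """max{D, ceil(W / p)} using exact integer arithmetic."""
    return max(span, (work + p - 1) // p)


diamond = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
W, D = work_and_span(diamond)
print(W, D, W / D, time_lower_bound(W, D, 2))
```

For the diamond, the work is 4 and the span is 3, so the average available parallelism is 4/3 and no schedule on 2 processors can finish in fewer than 3 steps.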

Brent’s Theorem

Break execution into phases:

Each phase has 1 critical path vertex
All non-critical path vertices are independent
- vertices in the same phase cannot depend on each other
Every vertex should appear in some phase

For any phase $k$, let $w_k$ be the total amount of work in that phase.

$\sum_{k=1}^{D} w_k = W$

How long does it take to execute phase $k$ (call it $t_k$)?

$t_k = \left\lceil\frac{w_k}{p}\right\rceil \implies T_p = \sum_{k=1}^{D}t_k$

$\phantom{t_k = \left\lceil\frac{w_k}{p}\right\rceil} \implies T_p = \sum_{k=1}^{D}\left\lceil\frac{w_k}{p}\right\rceil$

$\phantom{t_k = \left\lceil\frac{w_k}{p}\right\rceil} \implies T_p = \sum_{k=1}^{D}\left(\left\lfloor\frac{w_k - 1}{p}\right\rfloor + 1\right)$

$\phantom{t_k = \left\lceil\frac{w_k}{p}\right\rceil} \implies T_p \le \sum_{k=1}^{D}\left(\frac{w_k - 1}{p} + 1\right)$

$\phantom{t_k = \left\lceil\frac{w_k}{p}\right\rceil} \implies T_p \le \frac{W-D}{p} + D$

The time to execute a DAG is no more than the time to execute the critical path ($D$) plus the time to execute everything off the critical path using the $p$ processors.

Brent's Theorem

$\max\left\{D, \left\lceil\frac{W}{p}\right\rceil\right\} \le T_p \le \frac{W-D}{p} + D$
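A small numeric check can make the two bounds tangible. The sketch below takes the per-phase work values $w_k$, computes the phase-by-phase time $\sum_k \lceil w_k / p \rceil$, and compares it with both sides of the inequality. The function name `brent_bounds` is hypothetical, and the example phases (8, 4, 2, 1 additions) are assumed to come from a balanced tree reduction over 16 elements.

```python
from fractions import Fraction


def brent_bounds(phase_work, p):
    """Return (lower, phase_time, upper) for per-phase work values w_k.

    Assumes every phase has at least one vertex (w_k >= 1), so the
    number of phases equals the span D.
    """
    W = sum(phase_work)
    D = len(phase_work)
    lower = max(D, -(-W // p))                        # max{D, ceil(W/p)}
    phase_time = sum(-(-w // p) for w in phase_work)  # sum of ceil(w_k/p)
    upper = Fraction(W - D, p) + D                    # (W - D)/p + D
    return lower, phase_time, upper


lower, t, upper = brent_bounds([8, 4, 2, 1], p=4)
print(lower, t, upper)
assert lower <= t <= upper
```

Using `Fraction` keeps the upper bound exact, so the comparison does not depend on floating-point rounding.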

Speedup

Speedup := Best sequential time divided by the best parallel time

$S_p(n) \equiv \frac{T_*(n)}{T_p(n)}$

$\phantom{S_p(n)} = \frac{W_*(n)}{f(W,D,n,p)}$

$\phantom{S_p(n)} = \frac{W_*(n)}{f(W,D,n,p)} \ge \frac{W_*}{\frac{W-D}{p} + D}$

$\phantom{S_p(n) = \frac{W_*(n)}{f(W,D,n,p)}} \ge \frac{p}{\frac{W}{W_*} + \frac{p-1}{W_* / D}}$

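Ideal speedup is linear in $p$. For the speedup to be linear, the denominator $\frac{W}{W_*} + \frac{p-1}{W_* / D}$ must be bounded by a constant. The first term is constant when the parallel algorithm does no more than a constant factor of extra work. The second term stays constant only if $W_* / D$ grows at least as fast as $p$, so the work per processor has to grow as some function of $n$.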

Weak scalability
- If you want good scaling, you might need to increase the problem size

Speedup: $S_p(n) \equiv \frac{T_*(n)}{T_p(n)} = \Theta(p)$

Work optimality: $W(n) = O(W_*(n))$

Weak scalability: $p = O\left(\frac{W_*}{D}\right) \text{ or } \frac{W_*}{p} = \Omega(D)$
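To see the weak-scalability condition numerically, the sketch below evaluates the speedup lower bound for a sum reduction. It assumes $W = W_* = n - 1$ additions and $D = \log_2 n$ for a balanced tree with $n$ a power of two; the function name `speedup_lower_bound` is hypothetical.

```python
def speedup_lower_bound(w_star, w, d, p):
    """p / (W/W_* + (p - 1)/(W_*/D)), the bound that follows from Brent's theorem."""
    return p / (w / w_star + (p - 1) / (w_star / d))


n = 1024
w_star = w = n - 1  # assumed: n - 1 additions, sequential and parallel
d = 10              # assumed: log2(1024) levels in a balanced tree

for p in (8, 64, 1024):
    print(p, round(speedup_lower_bound(w_star, w, d, p), 2))

# Same value written in the unsimplified form W_* / ((W - D)/p + D)
direct = w_star / ((w - d) / 8 + d)
assert abs(direct - speedup_lower_bound(w_star, w, d, 8)) < 1e-9
```

With $p = 8$ the bound stays close to 8, but with $p = 1024$ it falls far below $p$, because $p$ is then much larger than $W_* / D \approx 102$. Growing $n$ restores the gap between $W_* / D$ and $p$.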

Basic Concurrency Primitives

Spawn and Sync

reduce(A[0:n-1])
  if n >= 2 then
    a <- reduce(A[0:n/2-1])
    b <- reduce(A[n/2:n-1])
    return a+b
  // else n = 1
  return A[0]

reduce(A[0:n-1])
  if n >= 2 then
    a <- spawn reduce(A[0:n/2-1])
    b <- reduce(A[n/2:n-1])
    sync
    return a+b
  // else n = 1
  return A[0]
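The spawn/sync pattern can be imitated with ordinary threads: starting a thread plays the role of spawn and joining it plays the role of sync. This is a minimal sketch, not a real work-stealing runtime; the function name `reduce_sum` and the `cutoff` parameter (which switches to a sequential sum for small ranges to limit thread count) are assumptions added for illustration.

```python
import threading


def reduce_sum(a, lo=0, hi=None, cutoff=1024):
    """Sum a[lo:hi] with a spawn/sync-style recursive split."""
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n <= cutoff:
        return sum(a[lo:hi])  # sequential base case

    mid = lo + n // 2
    result = {}

    def left_half():
        result["a"] = reduce_sum(a, lo, mid, cutoff)

    t = threading.Thread(target=left_half)
    t.start()                           # spawn: left half runs concurrently
    b = reduce_sum(a, mid, hi, cutoff)  # this thread handles the right half
    t.join()                            # sync: wait for the spawned child
    return result["a"] + b


data = list(range(10000))
print(reduce_sum(data, cutoff=1000))
```

In standard CPython builds the global interpreter lock keeps these threads from executing Python bytecode simultaneously, so the example shows the control structure rather than an actual speedup.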

Par-for

par-for i <- 1 to n do
  foo(i)

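Implemented as a sequential loop that spawns one iteration at a time, the span is $O(n)$ even when each call to foo is $O(1)$, because the $n$ spawns are issued one after another.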

ParForT(foo, a, b)
  let n = b - a + 1
  if n == 1 then foo(a)
  else
    let m = a + n/2 // floor/integer division
    spawn ParForT(foo, a, m-1)
    ParForT(foo, m, b)
    sync

With this divide-and-conquer version the span is $D(n) = O(\log n)$, assuming each call to foo is $O(1)$.
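The recursive par-for translates almost line for line into threads, again using thread start for spawn and join for sync. The name `par_for` is hypothetical, and the sketch creates one thread per split, which is only reasonable for small ranges.

```python
import threading


def par_for(foo, a, b):
    """Call foo(i) for every i in a..b (inclusive) by recursive halving."""
    n = b - a + 1
    if n <= 0:
        return
    if n == 1:
        foo(a)
        return
    m = a + n // 2  # integer (floor) division
    t = threading.Thread(target=par_for, args=(foo, a, m - 1))
    t.start()           # spawn the lower half a..m-1
    par_for(foo, m, b)  # handle the upper half m..b in this thread
    t.join()            # sync


squares = [0] * 16


def store_square(i):
    squares[i] = i * i  # each index is written by exactly one call


par_for(store_square, 0, 15)
print(squares)
```

Each split halves the range, so the chain of spawns leading to any single call of `foo` has length about $\log_2 n$, which is where the logarithmic span comes from.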
