I/O Avoiding Algorithms

Stephen M. Reaves


2024-08-22

Notes about Lecture 2 for CS-6220

Required Readings

Summary

Sense of Scale

Given:

  • Volume of data to sort:
    • r \cdot n = 1 \text{ PiB } (2^{50} \text{ B})
  • Record (item) size:
    • r = 256 \text{ B } (2^{8} \text{ B})
  • Fast memory size:
    • r \cdot Z = 64 \text{ GiB } (2^{36} \text{ B})
  • Memory transfer size:
    • r \cdot L = 32 \text{ KiB } (2^{15} \text{ B})

Calculate the number of transfer operations for each of the following expressions:

Scratch space:

n = \frac{r \cdot n}{r} = \frac{2^{50}}{2^8} = 2^{42} \approx 4.4 \times 10^{12}

Z = \frac{r \cdot Z}{r} = \frac{2^{36}}{2^8} = 2^{28} \approx 2.68 \times 10^{8}

L = \frac{r \cdot L}{r} = \frac{2^{15}}{2^8} = 2^7 = 128

| Expression | Value (trillions) |
| --- | --- |
| n \cdot \log_2 n | 185 |
| n \cdot \log_2 \frac{n}{L} | 154 |
| n | 4.40 |
| \frac{n}{L} \cdot \log_2 \frac{n}{L} | 1.203 |
| \frac{n}{L} \cdot \log_2 \frac{n}{Z} | 0.481 |
| \frac{n}{L} \cdot \log_{\frac{Z}{L}} \frac{n}{L} | 0.0572 |

Lower is better (since this is the required number of transfers)

  • Big speed ups come from:
    • Multiplying by \frac{n}{L}
      • This corresponds to processing data in L-sized chunks as much as possible
    • Taking \log_{\frac{Z}{L}} instead of \log_2
      • This corresponds to using as much of the capacity of fast memory as possible
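A quick way to reproduce the table above (a minimal Python sketch assuming the sizes given at the top of the section; the helper name trillions is just illustrative):

```python
# Sketch: reproduce the "Sense of Scale" table from the given sizes.
# All quantities are measured in items (records), not bytes.
import math

r = 2**8            # record size: 256 B
n = 2**50 // r      # items to sort: 1 PiB / 256 B = 2^42
Z = 2**36 // r      # fast-memory capacity in items: 2^28
L = 2**15 // r      # items per transfer: 2^7 = 128

def trillions(x):
    """Scale a count to units of 10^12, matching the table."""
    return round(x / 1e12, 4)

print(trillions(n * math.log2(n)))                  # ~185
print(trillions(n * math.log2(n / L)))              # ~154
print(trillions(n))                                 # ~4.40
print(trillions((n / L) * math.log2(n / L)))        # ~1.20
print(trillions((n / L) * math.log2(n / Z)))        # ~0.481
print(trillions((n / L) * math.log(n / L, Z / L)))  # ~0.0572
```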

External Memory Merge Sort

Assuming a slow/fast memory hierarchy

Phase 1

  • Partition the input into \frac{n}{fZ} chunks of fZ items each (f \le 1, so a chunk fits in fast memory)
  • foreach chunk i <- 1 to \frac{n}{fZ} do
    • read chunk i
    • sort it into a (sorted) run
    • write run i
  • Transfers = \frac{n}{L}
  • Comparisons = n\log_2(Z)
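A minimal sketch of Phase 1 in Python, treating a plain list as slow memory and each chunk as what gets sorted in fast memory (the function name make_runs is an illustrative choice, not from the lecture):

```python
def make_runs(data, Z):
    """Phase 1: produce sorted runs of at most Z items each.

    `data` plays the role of slow memory; each `chunk` is small enough
    to sit entirely in fast memory while it is sorted.
    """
    runs = []
    for start in range(0, len(data), Z):
        chunk = data[start:start + Z]  # read one chunk into fast memory
        chunk.sort()                   # sort entirely in fast memory
        runs.append(chunk)             # write the sorted run back out
    return runs
```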

Phase 2

  • Read L-sized blocks of A, B -> A’, B’
  • while any unmerged items in A & B do
    • Merge A’, B’ -> C’ as far as possible
    • if A’ or B’ is empty, then read more
    • if C’ is full, then flush it
  • Flush any remaining unmerged items in A or B
  • Transfers = 2\frac{n}{L}\log_2\frac{n}{Z}
  • Comparisons = \Theta\left(n\log_2\frac{n}{Z}\right)
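A sketch of one 2-way merge with L-sized buffers along the lines of the pseudocode above, simulating slow memory with Python lists and counting block transfers (merge_runs and refill are illustrative names):

```python
def merge_runs(A, B, L):
    """Merge sorted runs A and B using L-item buffers A', B', C'."""
    C, Cbuf = [], []          # merged output (slow memory) and output buffer C'
    ia = ib = 0               # offset of the next unread block in A and B
    transfers = 0

    def refill(run, offset):
        """Read the next L-sized block of a run into fast memory."""
        nonlocal transfers
        block = list(run[offset:offset + L])
        if block:
            transfers += 1
        return block, offset + L

    Abuf, ia = refill(A, ia)  # A'
    Bbuf, ib = refill(B, ib)  # B'
    while Abuf or Bbuf:
        # Move the smaller front element into the output buffer.
        if Abuf and (not Bbuf or Abuf[0] <= Bbuf[0]):
            Cbuf.append(Abuf.pop(0))
            if not Abuf:
                Abuf, ia = refill(A, ia)   # A' empty -> read more
        else:
            Cbuf.append(Bbuf.pop(0))
            if not Bbuf:
                Bbuf, ib = refill(B, ib)   # B' empty -> read more
        if len(Cbuf) == L:                 # C' full -> flush
            C.extend(Cbuf)
            Cbuf = []
            transfers += 1
    if Cbuf:                               # flush the final partial block
        C.extend(Cbuf)
        transfers += 1
    return C, transfers
```

A full Phase 2 would pair runs up and repeat this merge until a single sorted run of n items remains, which takes \log_2\frac{n}{Z} passes.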

Overall

  • Transfers = \frac{n}{L}\log_2\frac{n}{Z}
  • Comparisons = n\log_2(n)

Quiz

Given:

# of transfers in external memory merge sort with 2-way merging:

Q(n,Z,L) = O\left(\frac{n}{L}\log_2\frac{n}{Z}\right) = O\left(\frac{n}{L}\left[\log_2\frac{n}{L} - \log_2\frac{Z}{L}\right]\right)

Lower bound:

Q(n,Z,L) = \Omega\left(\frac{n}{L}\log_{\frac{Z}{L}}\frac{n}{Z}\right) = \Omega\left(\frac{n}{L}\cdot\frac{\log_2\frac{n}{L}}{\log_2\frac{Z}{L}}\right)

Why doesn't 2-way merge do better?

2-way merge “under-utilizes” Z, i.e. fast memory

External Memory Multiway Merge Sort

2-way merge only uses 3 L-sized blocks, but fast memory can hold up to \frac{Z}{L} blocks of size L. A k-way merge needs (k+1)L \le Z.

Use k input buffers; the (k+1)-th buffer is the output buffer. For each input buffer, keep track of the next element available to be read. Choose the smallest of those available elements and copy it into the output buffer. All of the full/empty buffer semantics from 2-way merging carry over.

Choosing which buffer contains the next smallest element can be tricky. If k is small, a linear search will do; if k is large, you may need something like a min-heap.

  • Min heaps have the following cost
    • Build: O(k)
    • ExtractMin: O(log(k))
    • Insert: O(log(k))

Cost of 1 k-way merge (assuming a min-heap):

  • Transfers:
    • 2ks/L, where s is the number of items in each input run
      • You only ever read input blocks once, and only ever write output blocks once.
  • Comparisons
    • O(k + (ks)log(k))
      • Cost to build the heap, then each of the ks merged items is inserted into and extracted from the heap
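A sketch of the heap bookkeeping inside one k-way merge, using Python's heapq as the min-heap; buffer and transfer handling is omitted, so this only illustrates the comparison costs listed above:

```python
import heapq

def kway_merge(runs):
    """Merge k sorted runs (lists) into one sorted run with a min-heap."""
    # Build the heap from each run's first element: O(k).
    heap = [(run[0], j, 0) for j, run in enumerate(runs) if run]
    heapq.heapify(heap)

    out = []
    while heap:
        value, j, i = heapq.heappop(heap)      # extract-min: O(log k)
        out.append(value)
        if i + 1 < len(runs[j]):
            # Insert the next element from the same run: O(log k).
            heapq.heappush(heap, (runs[j][i + 1], j, i + 1))
    return out

# Example: merging k = 3 sorted runs.
print(kway_merge([[1, 4, 9], [2, 3, 8], [5, 6, 7]]))  # [1, 2, ..., 9]
```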

Lower Bound on External Memory Sorting

Merge sort with \Theta\left(\frac{Z}{L}\right)-way merges:

Q(n,Z,L) = \Theta\left(\frac{n}{L}\log_{\frac{Z}{L}}\frac{n}{L}\right)

K(t) = # of possible orderings still consistent with what has been seen after t transfers

K(0) = n!

K(t) \ge \frac{K(t-1)}{{Z \choose L} \cdot L!}

\phantom{K(t)} \ge \frac{n!}{\left[{Z \choose L} \cdot L!\right]^t}

The L! factor above is only available if we've never seen any of the block's items before, and we can only perform \frac{n}{L} reads of never-before-seen items, so:

\phantom{K(t)} \ge \frac{n!}{{Z \choose L}^t \cdot \left(L!\right)^{\frac{n}{L}}}

When does the right-hand side drop to 1?

\phantom{K(t) \ge} \frac{n!}{{Z \choose L}^t \cdot \left(L!\right)^{\frac{n}{L}}} \le 1

The smallest value t such that this happens is the lower bound on the number of transfers
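Taking logarithms of both sides and applying Stirling-style approximations makes the step explicit (a sketch, assuming \log_2 n! \approx n\log_2 n, \log_2 L! \approx L\log_2 L, and \log_2{Z \choose L} \approx L\log_2\frac{Z}{L}):

\log_2 n! \le t\log_2{Z \choose L} + \frac{n}{L}\log_2 L!

n\log_2 n \lesssim t \cdot L\log_2\frac{Z}{L} + n\log_2 L

t \gtrsim \frac{n\left(\log_2 n - \log_2 L\right)}{L\log_2\frac{Z}{L}} = \frac{n}{L}\cdot\frac{\log_2\frac{n}{L}}{\log_2\frac{Z}{L}}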

t \gtrsim \frac{n}{L}\log_{\frac{Z}{L}}\frac{n}{L}

Binary Search Trees

Given:

  • a sorted array (A) of n elements in slow memory
  • a target value v
  • a binary-search algorithm trying to find the largest i such that A[i] <= v

How many transfers does the binary search do?

O\left(\log_2\frac{n}{L}\right)

Size of index i = log(n) + 1 bits

Let x(L) = maximum number of bits “learned” per L-sized read

Q_{search}(n,Z,L) = \Omega\left(\frac{\log_2(n)}{x(L)}\right)

What is x?

log(L)

Q_{search}(n,Z,L) = \Omega\left(\frac{\log_2(n)}{\log_2(L)}\right)

Q_{\text{binary search}}(n,Z,L) = \Omega\left(\log_2\frac{n}{L}\right) = \Omega\left(\log_2(n) - \log_2(L)\right)

There is roughly a log(L)-factor gap between binary search and the theoretical lower bound.

B-trees can be I/O-efficient if their branching factor is made equal to L (the transfer size).
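A small Python sketch that makes the log(L) gap concrete: it counts the distinct L-sized blocks a textbook binary search touches and compares that with the roughly \log_L(n) probes a B-tree-style (L-ary) search would need (the parameter values are only an illustration):

```python
import math

def binary_search_blocks(A, v, L):
    """Count distinct L-sized blocks touched while binary searching for v."""
    blocks = set()
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        blocks.add(mid // L)     # reading A[mid] pulls in its whole block
        if A[mid] <= v:
            lo = mid + 1
        else:
            hi = mid - 1
    return len(blocks)           # lo - 1 is the largest i with A[i] <= v

n, L = 2**20, 2**7
A = list(range(n))
print(binary_search_blocks(A, 123_456, L))  # roughly log2(n/L) = 13 blocks
print(math.ceil(math.log(n, L)))            # roughly log_L(n) = 3 probes for a B-tree-style search
```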