I/O Avoiding Algorithms

Stephen M. Reaves


2024-08-22

Notes about Lecture 2 for CS-6220

Required Readings

Summary

Sense of Scale

Given:

  • Volume of data to sort:
    • r \cdot n = 1 \text{ PiB } (2^{50} \text{ B})
  • Record (item) size:
    • r = 256 \text{ B } (2^{8} \text{ B})
  • Fast memory size:
    • r \cdot Z = 64 \text{ GiB } (2^{36} \text{ B})
  • Memory transfer size:
    • r \cdot L = 32 \text{ KiB } (2^{15} \text{ B})

Calculate the number of transfer operations for each of the following expressions:

Scratch space:

n = \frac{r \cdot n}{r} = \frac{2^{50}}{2^8} = 2^{42} \approx 4.4 \times 10^{12}

Z = \frac{r \cdot Z}{r} = \frac{2^{36}}{2^8} = 2^{28} \approx 2.68 \times 10^{8}

L = \frac{r \cdot L}{r} = \frac{2^{15}}{2^8} = 2^7 = 128

| Expression | Value (trillions) |
| --- | --- |
| n \cdot \log_2 n | 185 |
| n \cdot \log_2 \frac{n}{L} | 154 |
| n | 4.40 |
| \frac{n}{L} \cdot \log_2 \frac{n}{L} | 1.203 |
| \frac{n}{L} \cdot \log_2 \frac{n}{Z} | 0.481 |
| \frac{n}{L} \cdot \log_{\frac{Z}{L}} \frac{n}{L} | 0.0572 |

Lower is better (since this is the required number of transfers)

  • Big speed ups come from:
    • Multiplying by \frac{n}{L}
      • This corresponds to processing data in L-sized chunks as much as possible
    • Taking \log_{\frac{Z}{L}} instead of \log_2
      • This corresponds to using as much of the capacity of fast memory as possible
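A quick way to reproduce the table above (a minimal Python sketch assuming the sizes given at the top of the section; the helper name trillions is just illustrative):

```python
# Sketch: reproduce the "Sense of Scale" table from the given sizes.
# All quantities are measured in items (records), not bytes.
import math

r = 2**8            # record size: 256 B
n = 2**50 // r      # items to sort: 1 PiB / 256 B = 2^42
Z = 2**36 // r      # fast-memory capacity in items: 2^28
L = 2**15 // r      # items per transfer: 2^7 = 128

def trillions(x):
    """Scale a count to units of 10^12, matching the table."""
    return round(x / 1e12, 4)

print(trillions(n * math.log2(n)))                  # ~185
print(trillions(n * math.log2(n / L)))              # ~154
print(trillions(n))                                 # ~4.40
print(trillions((n / L) * math.log2(n / L)))        # ~1.20
print(trillions((n / L) * math.log2(n / Z)))        # ~0.481
print(trillions((n / L) * math.log(n / L, Z / L)))  # ~0.0572
```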

External Memory Merge Sort

Assuming a slow/fast memory hierarchy

Phase 1

  • Partition the input into \frac{n}{fZ} chunks of fZ items each (f \le 1, so a chunk fits in fast memory)
  • foreach chunk i <- 1 to \frac{n}{fZ} do
    • read chunk i
    • sort it into a (sorted) run
    • write run i
  • Transfers = \frac{n}{L}
  • Comparisons = n\log_2(Z)
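A minimal sketch of Phase 1 in Python, treating a plain list as slow memory and each chunk as what gets sorted in fast memory (the function name make_runs is an illustrative choice, not from the lecture):

```python
def make_runs(data, Z):
    """Phase 1: produce sorted runs of at most Z items each.

    `data` plays the role of slow memory; each `chunk` is small enough
    to sit entirely in fast memory while it is sorted.
    """
    runs = []
    for start in range(0, len(data), Z):
        chunk = data[start:start + Z]  # read one chunk into fast memory
        chunk.sort()                   # sort entirely in fast memory
        runs.append(chunk)             # write the sorted run back out
    return runs
```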

Phase 2

  • Read L-sized blocks of A, B -> A’, B’
  • while any unmerged items in A & B do
    • Merge A’, B’ -> C’ as far as possible
    • if A’ or B’ is empty, then read more
    • if C’ is full, then flush it
  • Flush any remaining unmerged items in A or B
  • Transfers = 2\frac{n}{L}\log_2\frac{n}{Z}
  • Comparisons = \Theta\left(n\log_2\frac{n}{Z}\right)
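A sketch of one 2-way merge with L-sized buffers along the lines of the pseudocode above, simulating slow memory with Python lists and counting block transfers (merge_runs and refill are illustrative names):

```python
def merge_runs(A, B, L):
    """Merge sorted runs A and B using L-item buffers A', B', C'."""
    C, Cbuf = [], []          # merged output (slow memory) and output buffer C'
    ia = ib = 0               # offset of the next unread block in A and B
    transfers = 0

    def refill(run, offset):
        """Read the next L-sized block of a run into fast memory."""
        nonlocal transfers
        block = list(run[offset:offset + L])
        if block:
            transfers += 1
        return block, offset + L

    Abuf, ia = refill(A, ia)  # A'
    Bbuf, ib = refill(B, ib)  # B'
    while Abuf or Bbuf:
        # Move the smaller front element into the output buffer.
        if Abuf and (not Bbuf or Abuf[0] <= Bbuf[0]):
            Cbuf.append(Abuf.pop(0))
            if not Abuf:
                Abuf, ia = refill(A, ia)   # A' empty -> read more
        else:
            Cbuf.append(Bbuf.pop(0))
            if not Bbuf:
                Bbuf, ib = refill(B, ib)   # B' empty -> read more
        if len(Cbuf) == L:                 # C' full -> flush
            C.extend(Cbuf)
            Cbuf = []
            transfers += 1
    if Cbuf:                               # flush the final partial block
        C.extend(Cbuf)
        transfers += 1
    return C, transfers
```

A full Phase 2 would pair runs up and repeat this merge until a single sorted run of n items remains, which takes \log_2\frac{n}{Z} passes.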

Overall

  • Transfers = \frac{n}{L}\log_2\frac{n}{Z}
  • Comparisons = n\log_2(n)

Quiz

Given:

# of transfers in external memory merge sort with 2-way merging:

Q(n,Z,L) = O\left(\frac{n}{L}\log_2\frac{n}{Z}\right) = O\left(\frac{n}{L}\left[\log_2\frac{n}{L} - \log_2\frac{Z}{L}\right]\right)

Lower bound:

Q(n,Z,L) = \Omega\left(\frac{n}{L}\log_{\frac{Z}{L}}\frac{n}{Z}\right) = \Omega\left(\frac{n}{L}\cdot\frac{\log_2\frac{n}{L}}{\log_2\frac{Z}{L}}\right)

Why doesn't 2-way merge do better?

2-way merge “under-utilizes” Z, i.e. fast memory

External Memory Multiway Merge Sort

2-way merge only uses 3 L-sized blocks, but fast memory can hold up to \frac{Z}{L} blocks of size L. A k-way merge needs (k+1)L \le Z.

Use k input buffers; the (k+1)-th buffer is the output buffer. For each input buffer, keep track of the next element available to be read. Choose the smallest of those available elements and copy it into the output buffer. All of the full/empty buffer semantics from 2-way merging carry over.

Choosing which buffer contains the next smallest element can be tricky. If k is small, a linear search will do; if k is large, you may need something like a min-heap.

  • Min heaps have the following cost
    • Build: O(k)
    • ExtractMin: O(log(k))
    • Insert: O(log(k))

Cost of 1 k-way merge (assuming a min-heap):

  • Transfers:
    • 2ks/L, where s is the number of items in each input run
      • You only ever read input blocks once, and only ever write output blocks once.
  • Comparisons
    • O(k + (ks)log(k))
      • Cost to build the heap, then each of the ks merged items is inserted into and extracted from the heap
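A sketch of the heap bookkeeping inside one k-way merge, using Python's heapq as the min-heap; buffer and transfer handling is omitted, so this only illustrates the comparison costs listed above:

```python
import heapq

def kway_merge(runs):
    """Merge k sorted runs (lists) into one sorted run with a min-heap."""
    # Build the heap from each run's first element: O(k).
    heap = [(run[0], j, 0) for j, run in enumerate(runs) if run]
    heapq.heapify(heap)

    out = []
    while heap:
        value, j, i = heapq.heappop(heap)      # extract-min: O(log k)
        out.append(value)
        if i + 1 < len(runs[j]):
            # Insert the next element from the same run: O(log k).
            heapq.heappush(heap, (runs[j][i + 1], j, i + 1))
    return out

# Example: merging k = 3 sorted runs.
print(kway_merge([[1, 4, 9], [2, 3, 8], [5, 6, 7]]))  # [1, 2, ..., 9]
```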

Lower Bound on External Memory Sorting

Merge sort with \Theta\left(\frac{Z}{L}\right)-way merges:

Q(n,Z,L) = \Theta\left(\frac{n}{L}\log_{\frac{Z}{L}}\frac{n}{L}\right)

K(t) = # of possible orderings still consistent with what has been seen after t transfers

K(0) = n!

K(t) \ge \frac{K(t-1)}{{Z \choose L} \cdot L!}

\phantom{K(t)} \ge \frac{n!}{\left[{Z \choose L} \cdot L!\right]^t}

The L! factor above is only available if we've never seen any of the block's items before, and we can only perform \frac{n}{L} reads of never-before-seen items, so:

\phantom{K(t)} \ge \frac{n!}{{Z \choose L}^t \cdot \left(L!\right)^{\frac{n}{L}}}

When does the right-hand side drop to 1?

\phantom{K(t) \ge} \frac{n!}{{Z \choose L}^t \cdot \left(L!\right)^{\frac{n}{L}}} \le 1

The smallest value t such that this happens is the lower bound on the number of transfers
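Taking logarithms of both sides and applying Stirling-style approximations makes the step explicit (a sketch, assuming \log_2 n! \approx n\log_2 n, \log_2 L! \approx L\log_2 L, and \log_2{Z \choose L} \approx L\log_2\frac{Z}{L}):

\log_2 n! \le t\log_2{Z \choose L} + \frac{n}{L}\log_2 L!

n\log_2 n \lesssim t \cdot L\log_2\frac{Z}{L} + n\log_2 L

t \gtrsim \frac{n\left(\log_2 n - \log_2 L\right)}{L\log_2\frac{Z}{L}} = \frac{n}{L}\cdot\frac{\log_2\frac{n}{L}}{\log_2\frac{Z}{L}}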

t \gtrsim \frac{n}{L}\log_{\frac{Z}{L}}\frac{n}{L}

Binary Search Trees

Given:

  • a sorted array (A) of n elements in slow memory
  • a target value v
  • a binary-search algorithm trying to find the largest i such that A[i] <= v

How many transfers does the binary search do?

O\left(\log_2\frac{n}{L}\right)

Size of index i = log(n) + 1 bits

Let x(L) = maximum number of bits “learned” per L-sized read

Q_{search}(n,Z,L) = \Omega\left(\frac{\log_2(n)}{x(L)}\right)

What is x?

log(L)

Q_{search}(n,Z,L) = \Omega\left(\frac{\log_2(n)}{\log_2(L)}\right)

Q_{\text{binary search}}(n,Z,L) = \Omega\left(\log_2\frac{n}{L}\right) = \Omega\left(\log_2(n) - \log_2(L)\right)

There is roughly a log(L)-factor gap between binary search and the theoretical lower bound.

B-trees can be I/O-efficient if their branching factor is made equal to L (the transfer size).
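A small Python sketch that makes the log(L) gap concrete: it counts the distinct L-sized blocks a textbook binary search touches and compares that with the roughly \log_L(n) probes a B-tree-style (L-ary) search would need (the parameter values are only an illustration):

```python
import math

def binary_search_blocks(A, v, L):
    """Count distinct L-sized blocks touched while binary searching for v."""
    blocks = set()
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        blocks.add(mid // L)     # reading A[mid] pulls in its whole block
        if A[mid] <= v:
            lo = mid + 1
        else:
            hi = mid - 1
    return len(blocks)           # lo - 1 is the largest i with A[i] <= v

n, L = 2**20, 2**7
A = list(range(n))
print(binary_search_blocks(A, 123_456, L))  # roughly log2(n/L) = 13 blocks
print(math.ceil(math.log(n, L)))            # roughly log_L(n) = 3 probes for a B-tree-style search
```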