GrdrBeam · Phoenix Framework

Summary

Computer Architecture
Why should you care
Disk
Records
Assumptions
File Organization

Computer Architecture

Main memory is fast, volatile, small, and (relatively) expensive

Disks are permanent, slow, big, and cheap

Why should you care

Time:

Main memory access time is 30ns ( $0.3\times10^{-7}s$ )
Disk access time is 10ms ( $1\times10^{-2}s$ )

That would be like reading a Phonebook:

Read a page in 1 minute
Open a page in 200 days

Cost cmputation:

Only I/O cost counts
CPU cost is ignored

Disk

Typical high-end pack capacity:

4 platters
8 heads
~150K tracks/surface
~1000KB/track
1,2000GB
512 bytes/sector
4, 8, or 16KB/block
600MB/s transfer rate
10,000 RPMs
3-4ms latency

Records

RegularUser(
  Email varchar(50),
  Sex char(1),
  Birthdate datetime,
  CurrentCity varchar(50),
  Hometown varchar(50),
);

Storage used for one record:

Record size: 159 bytes
block size: 4k
filled: 80%
records/block: ~20
num of records: 4 million
num of blocks: ~200,000
file size: ~800MB

Assumptions

Page fault:

Seek time 3-8ms
Rotational delay: 2-3ms
Transfer time: 0.5-1ms
Total: 5-12 ms

Bulk/Extent transfers (e.g. 250 blocks) save seek time and rotational delay, but require more buffer space.

Page faults of 10 ms each would cost about 0.260s
- compared to 2.5s of individual transfers

buffer needs to be managed

LRU strategies
- excellent for merge joins
- kill nested loop joins

File Organization

Heap

Unsorted File

Lookup time:

N/2
- where N = num of data blocks
- 200,000/2 * 0.01s = 16.6 min

Block pointer: 4 bytes

num of records: 4 million

num of data blocks: ~200,000

file size: ~800MB

Sorting

Sorting helps

Allows us to use binary search

$log_2(N)$
18 * 0.01s = 0.18s

Primary Index

Skip list holding pointers to the beginning of blocks

This reduces the size of data we search in.

a dense key means all the key values in the data are in the index.

a sparse index means only a subset are.

Lookup time:

$log_2(n) +1$
- where n = num of index blocks
- +1 is to move from index to data block

Secondary Index

Same thing as a primary index, but values in the data are not necessarily sorted

This means all secondary indexes are dense indexes

This is tricky for non-unique fields

Multi-leveled Index

Lookup Time:

$log_k(n)+1$
- where n is num of index values
- where k is num of layers (3 in this case)

Multi-level Index B-Tree

Multi-level indexes can be arranged in a B-Tree.

Insertion, deletion, and udpate operations are implemented to keep tree balanced.

Only rarely does an overflow at a lower level of the tree propagate more than 1-2 levels up.

Static Hashing

Instead of indexing, we can have a hash table