Db Index
Summary
Computer Architecture
Main memory is fast, volatile, small, and (relatively) expensive
Disks are permanent, slow, big, and cheap
Why should you care
Time:
- Main memory access time is 30ns ()
- Disk access time is 10ms ()
That would be like reading a Phonebook:
- Read a page in 1 minute
- Open a page in 200 days
Cost cmputation:
- Only I/O cost counts
- CPU cost is ignored
Disk
Typical high-end pack capacity:
- 4 platters
- 8 heads
- ~150K tracks/surface
- ~1000KB/track
- 1,2000GB
- 512 bytes/sector
- 4, 8, or 16KB/block
- 600MB/s transfer rate
- 10,000 RPMs
- 3-4ms latency
Records
RegularUser(
Email varchar(50),
Sex char(1),
Birthdate datetime,
CurrentCity varchar(50),
Hometown varchar(50),
);
Storage used for one record:
- Record size: 159 bytes
- block size: 4k
- filled: 80%
- records/block: ~20
- num of records: 4 million
- num of blocks: ~200,000
- file size: ~800MB
Assumptions
Page fault:
- Seek time 3-8ms
- Rotational delay: 2-3ms
- Transfer time: 0.5-1ms
- Total: 5-12 ms
Bulk/Extent transfers (e.g. 250 blocks) save seek time and rotational delay, but require more buffer space.
- Page faults of 10 ms each would cost about 0.260s
- compared to 2.5s of individual transfers
buffer needs to be managed
- LRU strategies
- excellent for merge joins
- kill nested loop joins
File Organization
Heap
Unsorted File
Lookup time:
- N/2
- where N = num of data blocks
- 200,000/2 * 0.01s = 16.6 min
Block pointer: 4 bytes
num of records: 4 million
num of data blocks: ~200,000
file size: ~800MB
Sorting
Sorting helps
Allows us to use binary search
- 18 * 0.01s = 0.18s
Primary Index
Skip list holding pointers to the beginning of blocks
This reduces the size of data we search in.
a dense
key means all the key values in the data are in the index.
a sparse
index means only a subset are.
Lookup time:
-
- where n = num of index blocks
+1
is to move from index to data block
Secondary Index
Same thing as a primary index, but values in the data are not necessarily sorted
- This means all secondary indexes are dense indexes
This is tricky for non-unique fields
Multi-leveled Index
Lookup Time:
-
- where n is num of index values
- where k is num of layers (3 in this case)
Multi-level Index B-Tree
Multi-level indexes can be arranged in a B-Tree.
Insertion, deletion, and udpate operations are implemented to keep tree balanced.
Only rarely does an overflow at a lower level of the tree propagate more than 1-2 levels up.
Static Hashing
Instead of indexing, we can have a hash table
Lookup Time:
- Constant if bucket directory is in main memory
- 1-2 * 0.01s
- Dynamic hashing expands bucket directory when needed