We can't find the internet
Attempting to reconnect
Something went wrong!
Hang in there while we get back on track
Db Index
Summary
Computer Architecture
Main memory is fast, volatile, small, and (relatively) expensive
Disks are permanent, slow, big, and cheap
Why should you care
Time:
- Main memory access time is 30ns ()
- Disk access time is 10ms ()
That would be like reading a Phonebook:
- Read a page in 1 minute
- Open a page in 200 days
Cost cmputation:
- Only I/O cost counts
- CPU cost is ignored
Disk
Typical high-end pack capacity:
- 4 platters
- 8 heads
- ~150K tracks/surface
- ~1000KB/track
- 1,2000GB
- 512 bytes/sector
- 4, 8, or 16KB/block
- 600MB/s transfer rate
- 10,000 RPMs
- 3-4ms latency
Records
RegularUser(
Email varchar(50),
Sex char(1),
Birthdate datetime,
CurrentCity varchar(50),
Hometown varchar(50),
);
Storage used for one record:
- Record size: 159 bytes
- block size: 4k
- filled: 80%
- records/block: ~20
- num of records: 4 million
- num of blocks: ~200,000
- file size: ~800MB
Assumptions
Page fault:
- Seek time 3-8ms
- Rotational delay: 2-3ms
- Transfer time: 0.5-1ms
- Total: 5-12 ms
Bulk/Extent transfers (e.g. 250 blocks) save seek time and rotational delay, but require more buffer space.
- Page faults of 10 ms each would cost about 0.260s
- compared to 2.5s of individual transfers
buffer needs to be managed
- LRU strategies
- excellent for merge joins
- kill nested loop joins
File Organization
Heap
Unsorted File
Lookup time:
- N/2
- where N = num of data blocks
- 200,000/2 * 0.01s = 16.6 min
Block pointer: 4 bytes
num of records: 4 million
num of data blocks: ~200,000
file size: ~800MB
Sorting
Sorting helps
Allows us to use binary search
- 18 * 0.01s = 0.18s
Primary Index
Skip list holding pointers to the beginning of blocks
This reduces the size of data we search in.
a dense
key means all the key values in the data are in the index.
a sparse
index means only a subset are.
Lookup time:
-
- where n = num of index blocks
+1
is to move from index to data block
Secondary Index
Same thing as a primary index, but values in the data are not necessarily sorted
- This means all secondary indexes are dense indexes
This is tricky for non-unique fields
Multi-leveled Index
Lookup Time:
-
- where n is num of index values
- where k is num of layers (3 in this case)
Multi-level Index B-Tree
Multi-level indexes can be arranged in a B-Tree.
Insertion, deletion, and udpate operations are implemented to keep tree balanced.
Only rarely does an overflow at a lower level of the tree propagate more than 1-2 levels up.
Static Hashing
Instead of indexing, we can have a hash table
Lookup Time:
- Constant if bucket directory is in main memory
- 1-2 * 0.01s
- Dynamic hashing expands bucket directory when needed