MapReduce
Summary
MapReduce
- Input and Output to each of map and reduce
- <key, value> pairs
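To make the <key, value> interface concrete, here is a minimal sketch of the two functions a domain expert might write, using word count as an assumed example. The names `map_fn` and `reduce_fn` and their exact signatures are illustrative, not taken from the notes.

```python
from typing import Iterator, List, Tuple

def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Input pair: (document name, document text). Both are assumed for illustration.

    Emits one intermediate <word, 1> pair per word occurrence.
    """
    for word in value.split():
        yield (word.lower(), 1)

def reduce_fn(key: str, values: List[int]) -> Iterator[Tuple[str, int]]:
    """Input: a word and every count emitted for it. Emits <word, total>."""
    yield (key, sum(values))
```

Both functions consume and produce <key, value> pairs, which is what lets a runtime chain them without knowing what the keys mean.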
Why MapReduce?
- Several processing steps in giant-scale services are expressible as map and reduce
- Domain Expert only writes two functions
- Runtime does the rest
- Instantiating number of mappers/reducers
- Data movement
Heavy Lifting done by runtime
- Map Phase
- Read Local Disk
- Parse
- Call map
- Intermediate files on local disk
- Waits until all maps are done
- Reduce Phase
- Remote read
- Sort
- Call reduce
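The two phases above can be sketched as a single-process toy runtime. This is only an illustration under stated assumptions: in-memory lists stand in for intermediate files on local disk, a plain loop stands in for parallel workers, and the names `run_mapreduce`, `partition`, `word_map`, and `word_reduce` are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter

def partition(key, num_reducers):
    # Assumed partitioning rule: a stable hash of the key picks the reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_mapreduce(splits, map_fn, reduce_fn, num_reducers=2):
    # Map phase: read each input split, call map, and write the
    # intermediate pairs into one bucket per reducer. The buckets
    # stand in for intermediate files on the mapper's local disk.
    intermediate = []  # intermediate[m][r] is a list of (key, value) pairs
    for name, text in splits:
        buckets = [[] for _ in range(num_reducers)]
        for k, v in map_fn(name, text):
            buckets[partition(k, num_reducers)].append((k, v))
        intermediate.append(buckets)

    # Barrier: the loop above has finished, so all maps are done.

    # Reduce phase: each reducer gathers its bucket from every mapper
    # (the "remote read"), sorts by key, and calls reduce once per key.
    output = {}
    for r in range(num_reducers):
        pairs = [kv for buckets in intermediate for kv in buckets[r]]
        pairs.sort(key=itemgetter(0))
        for k, group in groupby(pairs, key=itemgetter(0)):
            for out_k, out_v in reduce_fn(k, [v for _, v in group]):
                output[out_k] = out_v
    return output

def word_map(name, text):
    for w in text.split():
        yield w.lower(), 1

def word_reduce(word, counts):
    yield word, sum(counts)
```

A usage example: `run_mapreduce([("d1", "a b a"), ("d2", "b c")], word_map, word_reduce)` counts words across both splits. The sort step matters because `groupby` only groups adjacent equal keys.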
Issues to be Handled by the Runtime
Master data structures:
- Location of files created by completed mappers
- Scoreboard of mapper/reducer assignment
- Fault tolerance
- Start new instances if no timely response
- Handle completion messages from redundant stragglers (ignore duplicates)
- Locality management
- Task granularity
- Backup tasks
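The master's bookkeeping listed above might look roughly like the following sketch. It is an assumption-laden illustration, not the actual runtime: the `Task` and `Master` classes, the state names, the timeout policy, and the location strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Task:
    task_id: int
    state: str = "idle"                     # idle -> in_progress -> completed
    worker: Optional[str] = None            # scoreboard: who is running this task
    started_at: float = 0.0
    output_location: Optional[str] = None   # where the completed output lives

class Master:
    def __init__(self, num_tasks: int, timeout: float):
        self.tasks: Dict[int, Task] = {i: Task(i) for i in range(num_tasks)}
        self.timeout = timeout

    def assign(self, task_id: int, worker: str, now: float) -> None:
        # Also used to start a new (backup) instance of a slow task.
        t = self.tasks[task_id]
        t.state, t.worker, t.started_at = "in_progress", worker, now

    def find_stragglers(self, now: float) -> List[int]:
        # Tasks with no timely response are candidates for a new instance.
        return [t.task_id for t in self.tasks.values()
                if t.state == "in_progress" and now - t.started_at > self.timeout]

    def complete(self, task_id: int, worker: str, location: str) -> bool:
        t = self.tasks[task_id]
        if t.state == "completed":
            # A redundant instance finished later: ignore its message.
            return False
        t.state, t.worker, t.output_location = "completed", worker, location
        return True
```

In this sketch the first completion message wins and records the output location; a later message for the same task, from the original straggler or its backup, is dropped.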