Distributed Data Analytics
Summary
- Data Processing at Scale
- MapReduce Brief
- MapReduce Design Decisions
- Limitations of MapReduce
- Spark
- Resilient Distributed Datasets
- RDDs Through Example
- RDD Transformations
- Did Spark Achieve Its Goal
Data Processing at Scale
Common Techniques:
- Data Parallel
- Divide data, assign to nodes for processing
- How do you know how to load balance?
- Pipelining
- Each node only does one thing
- Data flows through a sequence of tasks
- Increases throughput
- Model Parallelism
- Divide state of application across nodes
- Each node has less to process based on its state
- Input is passed to all nodes, output is combined from all nodes
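A minimal sketch of the data-parallel pattern above (chunk count, names, and the summing workload are all illustrative): split the input, let each "node" process its own chunk, then combine the partial results.

// Data parallelism sketch: divide data, process chunks concurrently, combine.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DataParallelSketch {
  def main(args: Array[String]): Unit = {
    val data   = (1 to 1000000).toVector
    val chunks = data.grouped(data.size / 4).toVector          // static split across 4 "nodes"

    // Each "node" processes only its own chunk.
    val partials: Vector[Future[Long]] =
      chunks.map(chunk => Future(chunk.map(_.toLong).sum))

    // Combine output from all nodes.
    val total = Await.result(Future.sequence(partials), 1.minute).sum
    println(total)   // 500000500000

    // The load-balancing question: equal-sized chunks are only fair
    // when every record costs the same amount of work.
  }
}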
MapReduce Brief
Input is split into chunks of KV pairs
Each worker gets a map function:
- Input:
- Unique KV pairs
- Output:
- Intermediate new KV pair
Reduce function:
- Input:
- Intermediate KV pair
- Output:
- Combined KV pair
Reduce runs once per intermediate key, merging its values into the final reduced output
Orchestrated by master process
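A hypothetical word-count sketch of that contract in plain Scala (names invented; this is not Google's actual API): map turns one input KV pair into intermediate (word, 1) pairs, the shuffle groups them by key, and reduce merges each key's values.

// Word-count sketch of the map/reduce contract (toy, single-process).
object WordCountSketch {
  // Map: one input KV pair -> intermediate KV pairs
  def map(docName: String, contents: String): Seq[(String, Int)] =
    contents.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce: one intermediate key plus all its values -> combined output pair
  def reduce(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val intermediate = map("doc1", "to be or not to be")
    val grouped      = intermediate.groupBy(_._1)            // what the shuffle does
    val output       = grouped.map { case (w, kvs) => reduce(w, kvs.map(_._2)) }
    println(output)  // word counts, e.g. to -> 2, be -> 2, or -> 1, not -> 1
  }
}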
MapReduce Design Decisions
Master data structures:
- For tracking progress
Locality:
- Scheduling
- Placement of intermediate data
Task granularity:
- Fine -> more flexibility
- Coarse -> lower overhead
Fault Tolerance:
- Master -> Standby replication?
- Worker -> Detect failure/straggler?
Semantics in the presence of failures
Backup tasks
Paper describes these decisions and Google’s optimizations
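One of those optimizations, backup tasks, can be mimicked with futures (a toy stand-in, not the paper's scheduler): near the end of a phase the master launches a duplicate of each remaining task and takes whichever copy finishes first, so a straggler cannot stall the whole job.

// Backup (speculative) task sketch: duplicate a slow task, keep the first result.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BackupTaskSketch {
  def runTask(delayMs: Long, result: String): Future[String] =
    Future { Thread.sleep(delayMs); result }

  def main(args: Array[String]): Unit = {
    val primary = runTask(delayMs = 5000, result = "primary")  // straggling worker
    val backup  = runTask(delayMs = 100,  result = "backup")   // backup copy on a healthy worker
    val winner  = Future.firstCompletedOf(Seq(primary, backup))
    println(Await.result(winner, 10.seconds))                  // prints "backup"
  }
}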
Limitations of MapReduce
Depends on persistent IO, including for intermediate data
This does allow pipelines to be resumed rather than restarted, but it ties every step to disk
Every operation has to pay a (de)serialization cost
Iterative executions mean intermediate data may need to be read from storage multiple times
Replication needs to be at the storage level
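A sketch of how these limitations compound for iterative jobs (the storage helpers are hypothetical stand-ins for HDFS access): every pass re-reads, re-deserializes, re-serializes, and re-replicates the working set.

// Iterative job on MapReduce: each pass round-trips through stable storage.
object IterativeJobSketch {
  // Hypothetical stand-ins for replicated HDFS reads/writes of serialized records.
  def readFromStableStorage(path: String): Seq[String] = Seq.empty      // disk read + deserialize
  def writeToStableStorage(path: String, data: Seq[String]): Unit = ()  // serialize + replicated write

  def main(args: Array[String]): Unit = {
    var input = "hdfs://input"
    for (i <- 1 to 10) {                            // e.g. 10 passes of an iterative algorithm
      val records = readFromStableStorage(input)    // re-read + re-deserialize every pass
      val updated = records.map(identity)           // the actual per-pass computation
      val output  = s"hdfs://iter-$i"
      writeToStableStorage(output, updated)         // re-serialize + re-replicate every pass
      input = output
    }
  }
}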
Spark
Targets ~10x speedup over Hadoop
Supports multiple workload types (e.g. iterative and interactive jobs)
Goal of Spark is to allow in-memory data sharing across computations
Provides a different fault tolerance mechanism (lineage rather than replication)
Resilient Distributed Datasets
Read-only, partitioned collection of records that can be spread across multiple machines
RDDs are created only via transformations
- deterministic, lazy operations on data in stable storage or on other RDDs
Used via actions
- count
- collect
- save
RDDs are aware of lineage
- This makes recomputing easier
RDDs Through Example
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.count()
// Return the time fields of errors mentioning HDFS as an array (assuming time
// is field number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
Lineage = lines -> errors -> HDFS errors -> time fields
RDD Transformations
Data dependencies are implied by transformations
Actions are executed as a DAG of stages
- Each stage contains as many narrow dependencies as possible
- 1:1 partition mappings (narrow) as opposed to N:1 (wide, which require a shuffle)
- Achieves parallelism and limits IO contention
Tasks are assigned to machines based on data locality
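A small Spark sketch of those dependency kinds (input path and job are illustrative): flatMap, filter, and map are narrow 1:1 transformations that stay within one stage, while reduceByKey is a wide dependency that forces a shuffle and starts a new stage.

// Narrow vs. wide dependencies: narrow ops pipeline in one stage,
// the shuffle at reduceByKey starts a new stage.
import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[*]"))

    val words = sc.textFile("hdfs://...")        // base RDD
      .flatMap(_.split("\\s+"))                  // narrow: 1:1 per partition
      .filter(_.nonEmpty)                        // narrow: same stage
      .map(word => (word, 1))                    // narrow: same stage

    val counts = words.reduceByKey(_ + _)        // wide: shuffle -> new stage

    counts.collect().foreach(println)            // action: executes the DAG of stages
    sc.stop()
  }
}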
Did Spark Achieve Its Goal
Once data is brought into memory, there are two ways to make it fault tolerant:
- Distributed shared memory runtime: just track (fine-grained) data updates
- RDDs: log the coarse-grained operations applied to all elements of an RDD (its lineage); a toy sketch follows the pros/cons below
Pros:
- Less data to persist in execution critical path
- Input data can be read as little as once
- Lower storage IO costs
- More control over locality
Cons:
- Longer recovery time (lost partitions must be recomputed from lineage)
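A toy sketch of that trade-off (types and data invented, not Spark internals): the log is just the chain of coarse-grained transformations, and recovering a lost partition means replaying that chain over the partition's source data.

// Lineage sketch: log coarse-grained operations, replay them to recompute.
object LineageSketch {
  // The "log" is the chain of transformations, not per-element updates.
  val lineage: Seq[Seq[String] => Seq[String]] = Seq(
    _.filter(_.startsWith("ERROR")),      // errors = lines.filter(...)
    _.filter(_.contains("HDFS")),         // HDFS errors
    _.map(_.split('\t')(3))               // time fields
  )

  // Recovery: re-read the lost partition's source and replay the lineage.
  def recomputePartition(sourcePartition: Seq[String]): Seq[String] =
    lineage.foldLeft(sourcePartition)((data, op) => op(data))

  def main(args: Array[String]): Unit = {
    val source = Seq("ERROR\tdisk\tfull\t12:00\tHDFS", "INFO\tok\tok\t12:01\tnone")
    println(recomputePartition(source))   // List(12:00)
  }
}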
RDDs best for batch workloads
Spark can be up to 2x faster than Hadoop
- then another 2x faster when optimized for things like locality