Distributed Data Analytics
Summary
- Data Processing at Scale
- MapReduce Brief
- MapReduce Design Decisions
- Limitations of MapReduce
- Spark
- Resilient Distributed Datasets
- RDDs Through Example
- RDD Transformations
- Did Spark Achieve Its Goal
Data Processing at Scale
Common Techniques:
- Data Parallel
- Divide data, assign to nodes for processing
- How do you know how to load balance?
- Pipelining
- Each node only does one thing
- Data flows through a sequence of tasks
- Increases throughput
- Model Parallelism
- Divide state of application across nodes
- Each node has less to process based on its state
- Input is passed to all nodes, output is combined from all nodes
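A minimal sketch of the data-parallel pattern above (chunk count, names, and the summing workload are all illustrative): split the input, let each "node" process its own chunk, then combine the partial results.

// Data parallelism sketch: divide data, process chunks concurrently, combine.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DataParallelSketch {
  def main(args: Array[String]): Unit = {
    val data   = (1 to 1000000).toVector
    val chunks = data.grouped(data.size / 4).toVector          // static split across 4 "nodes"

    // Each "node" processes only its own chunk.
    val partials: Vector[Future[Long]] =
      chunks.map(chunk => Future(chunk.map(_.toLong).sum))

    // Combine output from all nodes.
    val total = Await.result(Future.sequence(partials), 1.minute).sum
    println(total)   // 500000500000

    // The load-balancing question: equal-sized chunks are only fair
    // when every record costs the same amount of work.
  }
}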
MapReduce Brief
Input is split into chunks of KV pairs
Each worker gets a map function:
- Input:
- Unique KV pairs
- Output:
- Intermediate new KV pair
Reduce function:
- Input:
- Intermediate KV pair
- Output:
- Combined KV pair
Reduce runs once per intermediate key, merging its values into the final reduced output
Orchestrated by master process
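A hypothetical word-count sketch of that contract in plain Scala (names invented; this is not Google's actual API): map turns one input KV pair into intermediate (word, 1) pairs, the shuffle groups them by key, and reduce merges each key's values.

// Word-count sketch of the map/reduce contract (toy, single-process).
object WordCountSketch {
  // Map: one input KV pair -> intermediate KV pairs
  def map(docName: String, contents: String): Seq[(String, Int)] =
    contents.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce: one intermediate key plus all its values -> combined output pair
  def reduce(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val intermediate = map("doc1", "to be or not to be")
    val grouped      = intermediate.groupBy(_._1)            // what the shuffle does
    val output       = grouped.map { case (w, kvs) => reduce(w, kvs.map(_._2)) }
    println(output)  // word counts, e.g. to -> 2, be -> 2, or -> 1, not -> 1
  }
}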
MapReduce Design Decisions
Master data structures:
- For tracking progress
Locality:
- Scheduling
- Placement of intermediate data
Task granularity:
- Fine -> more flexibility
- Coarse -> lower overhead
Fault Tolerance:
- Master -> Standby replication?
- Worker -> Detect failure/straggler?
Semantics in the presence of failures
Backup tasks
Paper describes these decisions and Google’s optimizations
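One of those optimizations, backup tasks, can be mimicked with futures (a toy stand-in, not the paper's scheduler): near the end of a phase the master launches a duplicate of each remaining task and takes whichever copy finishes first, so a straggler cannot stall the whole job.

// Backup (speculative) task sketch: duplicate a slow task, keep the first result.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BackupTaskSketch {
  def runTask(delayMs: Long, result: String): Future[String] =
    Future { Thread.sleep(delayMs); result }

  def main(args: Array[String]): Unit = {
    val primary = runTask(delayMs = 5000, result = "primary")  // straggling worker
    val backup  = runTask(delayMs = 100,  result = "backup")   // backup copy on a healthy worker
    val winner  = Future.firstCompletedOf(Seq(primary, backup))
    println(Await.result(winner, 10.seconds))                  // prints "backup"
  }
}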
Limitations of MapReduce
Depends on persistent IO, including for intermediate data
This does allow pipelines to be resumed rather than restarted, but it ties every step to disk
Every operation has to pay a (de)serialization cost
Iterative executions mean intermediate data may need to be read from storage multiple times
Replication needs to be at the storage level
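A sketch of how these limitations compound for iterative jobs (the storage helpers are hypothetical stand-ins for HDFS access): every pass re-reads, re-deserializes, re-serializes, and re-replicates the working set.

// Iterative job on MapReduce: each pass round-trips through stable storage.
object IterativeJobSketch {
  // Hypothetical stand-ins for replicated HDFS reads/writes of serialized records.
  def readFromStableStorage(path: String): Seq[String] = Seq.empty      // disk read + deserialize
  def writeToStableStorage(path: String, data: Seq[String]): Unit = ()  // serialize + replicated write

  def main(args: Array[String]): Unit = {
    var input = "hdfs://input"
    for (i <- 1 to 10) {                            // e.g. 10 passes of an iterative algorithm
      val records = readFromStableStorage(input)    // re-read + re-deserialize every pass
      val updated = records.map(identity)           // the actual per-pass computation
      val output  = s"hdfs://iter-$i"
      writeToStableStorage(output, updated)         // re-serialize + re-replicate every pass
      input = output
    }
  }
}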
Spark
Targets ~10x speedup over Hadoop
Supports multiple workload types (e.g. iterative and interactive jobs)
Goal of Spark is to allow in-memory data sharing across computations
Provides a different fault tolerance mechanism (lineage rather than replication)
Resilient Distributed Datasets
Read-only, partitioned collection of records that can be spread across multiple machines
RDDs are created only via transformations
- deterministic, lazy operations on data in stable storage or on other RDDs
Used via actions
- count
- collect
- save
RDDs are aware of lineage
- This makes recomputing easier
RDDs Through Example
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.count()
// Return the time fields of errors mentioning HDFS as an array (assuming time
// is field number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
Lineage = lines -> errors -> HDFS errors -> time fields
RDD Transformations
Data dependencies are implied by transformations
Actions are executed as a DAG of stages
- Each stage contains as many narrow dependencies as possible
- 1:1 partition mappings (narrow) as opposed to N:1 (wide, which require a shuffle)
- Achieves parallelism and limits IO contention
Tasks are assigned to machines based on data locality
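A small Spark sketch of those dependency kinds (input path and job are illustrative): flatMap, filter, and map are narrow 1:1 transformations that stay within one stage, while reduceByKey is a wide dependency that forces a shuffle and starts a new stage.

// Narrow vs. wide dependencies: narrow ops pipeline in one stage,
// the shuffle at reduceByKey starts a new stage.
import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[*]"))

    val words = sc.textFile("hdfs://...")        // base RDD
      .flatMap(_.split("\\s+"))                  // narrow: 1:1 per partition
      .filter(_.nonEmpty)                        // narrow: same stage
      .map(word => (word, 1))                    // narrow: same stage

    val counts = words.reduceByKey(_ + _)        // wide: shuffle -> new stage

    counts.collect().foreach(println)            // action: executes the DAG of stages
    sc.stop()
  }
}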
Did Spark Achieve Its Goal
Once data is brought into memory, there are two ways to make it fault tolerant:
- Distributed shared memory runtime: just track (fine-grained) data updates
- RDDs: log the coarse-grained operations applied to all elements of an RDD (its lineage); a toy sketch follows the pros/cons below
Pros:
- Less data to persist in execution critical path
- Input data can be read as little as once
- Lower storage IO costs
- More control over locality
Cons:
- Longer recovery time (lost partitions must be recomputed from lineage)
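A toy sketch of that trade-off (types and data invented, not Spark internals): the log is just the chain of coarse-grained transformations, and recovering a lost partition means replaying that chain over the partition's source data.

// Lineage sketch: log coarse-grained operations, replay them to recompute.
object LineageSketch {
  // The "log" is the chain of transformations, not per-element updates.
  val lineage: Seq[Seq[String] => Seq[String]] = Seq(
    _.filter(_.startsWith("ERROR")),      // errors = lines.filter(...)
    _.filter(_.contains("HDFS")),         // HDFS errors
    _.map(_.split('\t')(3))               // time fields
  )

  // Recovery: re-read the lost partition's source and replay the lineage.
  def recomputePartition(sourcePartition: Seq[String]): Seq[String] =
    lineage.foldLeft(sourcePartition)((data, op) => op(data))

  def main(args: Array[String]): Unit = {
    val source = Seq("ERROR\tdisk\tfull\t12:00\tHDFS", "INFO\tok\tok\t12:01\tnone")
    println(recomputePartition(source))   // List(12:00)
  }
}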
RDDs best for batch workloads
Spark can be up to 2x faster than Hadoop
- then another 2x faster when optimized for things like locality