Distributed Data Analytics

Stephen M. Reaves :: 2023-03-05

Notes about Lesson 12 of CS-7210

Required Readings

Optional

Summary

Data Processing at Scale

Common Techniques:

MapReduce Brief

Input is split into chunks of KV pairs

Each worker gets a map function:

Reduce function:

Keep running Reduce function until you get final reduced output

Orchestrated by master process
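To make the flow concrete, here is a minimal, single-process word-count sketch of the map/shuffle/reduce pipeline in Scala. It is an illustration only, not Hadoop's API: the mapFn/reduceFn names, the in-memory shuffle, and the hard-coded chunks are assumptions, and the comments mirror the paper's map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v2) signatures.

// Minimal single-process sketch of the MapReduce data flow (word count).
// Signatures from the paper:
//   map(k1, v1)          -> list(k2, v2)
//   reduce(k2, list(v2)) -> list(v2)
object WordCountSketch {
  // Map: emit (word, 1) for every word in an input chunk
  def mapFn(chunkId: String, text: String): Seq[(String, Int)] =
    text.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce: sum all counts emitted for a given word
  def reduceFn(word: String, counts: Seq[Int]): Int = counts.sum

  def main(args: Array[String]): Unit = {
    // Input split into chunks of KV pairs (chunkId -> contents)
    val chunks = Seq("c0" -> "the quick brown fox", "c1" -> "the lazy dog the end")

    // Map phase: each "worker" applies mapFn to its chunk
    val intermediate = chunks.flatMap { case (id, text) => mapFn(id, text) }

    // Shuffle: group intermediate pairs by key (done by the framework/master)
    val grouped = intermediate.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

    // Reduce phase: produce the final reduced output
    val result = grouped.map { case (word, counts) => (word, reduceFn(word, counts)) }
    result.toSeq.sortBy(_._1).foreach { case (w, c) => println(s"$w -> $c") }
  }
}

In a real deployment the master process assigns map and reduce tasks to workers and the shuffle happens across machines rather than in one process.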

MapReduce Design Decisions

Master data structures:

Locality:

Task granularity:

Fault Tolerance:

Semantics in the presence of failures

Backup tasks

Paper describes these decisions and Google’s optimizations

Limitations of MapReduce

Depends on persistent IO, including for intermediate data

Allows pipelines to be resumed, rather than restarted

Every operation has to pay a (de)serialization cost

Iterative executions mean intermediate data might need to be read multiple times

Replication needs to be at the storage level

Spark

Up to 10x faster than Hadoop

Multiple workloads

Goal of Spark is to allow in-memory data sharing

Provides different fault tolerance mechanisms

Resilient Distributed Datasets

Read-only, partitioned collection of records that can be spread across multiple machines

RDDs are created only via transformations

Used via actions

RDDs are aware of lineage

RDDs Through Example

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

errors.count()

// Return the time fields of errors mentioning HDFS as an array (assuming time
// is field number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

Lineage = lines -> errors -> HDFS errors -> time fields

RDD Transformations

Data dependencies are implied by transformations

Actions are executed as a DAG of stages

Tasks are assigned to machines based on data locality
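As a sketch of how lazy transformations build a lineage that an action then executes as a DAG of stages, the snippet below continues the log example above. The spark handle and HDFS path come from that example; the assumption that field 1 holds an error code is purely illustrative.

// Transformations are lazy: they only record lineage/dependencies.
val lines  = spark.textFile("hdfs://...")                  // base RDD from storage
val errors = lines.filter(_.startsWith("ERROR"))           // narrow dependency
val byCode = errors.map(line => (line.split('\t')(1), 1))  // narrow dependency (field 1 as error code is assumed)
val counts = byCode.reduceByKey(_ + _)                     // wide (shuffle) dependency -> stage boundary

// The action forces Spark to turn the lineage into a DAG of stages and
// schedule tasks, preferring machines that already hold the relevant partitions.
counts.collect()

Narrow dependencies like filter and map stay within one stage; the shuffle introduced by reduceByKey starts a new one.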

Did Spark Achieve Its Goal?

Once data is brought into memory:

Pros:

RDDs best for batch workloads

Spark can be up to 2x faster than Hadoop