Required Readings

Optional

Summary

Data Processing at Scale

Common Techniques:

  • Data Parallel
    • Divide data, assign to nodes for processing (see the sketch after this list)
    • How do you know how to load balance?
  • Pipelining
    • Each node only does one thing
    • Data flows through a sequence of tasks
    • Increases throughput
  • Model Parallelism
    • Divide state of application across nodes
    • Each node has less to process based on its state
    • Input is passed to all nodes, output is combined from all nodes
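
A minimal, single-process Scala sketch of the data-parallel pattern (the chunking and the per-"node" work are illustrative stand-ins; in a real system each chunk would be shipped to a different machine, and load balancing would decide the split):

object DataParallelSketch {
  def main(args: Array[String]): Unit = {
    val data = (1 to 1000000).toVector
    val numNodes = 4

    // Static, even split of the data across "nodes"
    val chunkSize = math.ceil(data.size / numNodes.toDouble).toInt
    val chunks = data.grouped(chunkSize).toVector

    // Each "node" computes a partial result over its own chunk independently
    val partials = chunks.map(chunk => chunk.map(_.toLong).sum)

    // Combine the partial results into the final answer
    println(s"total = ${partials.sum}")
  }
}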

MapReduce Brief

Input is split into chunks of KV pairs

Each worker gets a map function:

  • Input:
    • Unique KV pairs
  • Output:
    • Intermediate new KV pair

Reduce function:

  • Input:
    • Intermediate KV pair
  • Output:
    • Combined KV pair

Keep running Reduce function until you get final reduced output

Orchestrated by master process
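
A local, single-process Scala sketch of the map/reduce contract using the classic word-count example (mapFn, reduceFn, and the in-memory "shuffle" below are illustrative stand-ins for what the workers and the framework actually do):

object WordCountSketch {
  // Map: one input KV pair (docName -> contents) => intermediate KV pairs (word -> 1)
  def mapFn(docName: String, contents: String): Seq[(String, Int)] =
    contents.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Reduce: intermediate key + all of its values => combined KV pair (word -> count)
  def reduceFn(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val input = Seq("doc1" -> "the quick brown fox", "doc2" -> "the lazy brown dog")

    // Map phase: each worker would run mapFn over its own chunk in parallel
    val intermediate = input.flatMap { case (name, text) => mapFn(name, text) }

    // Shuffle: group intermediate pairs by key (done by the framework)
    val grouped = intermediate.groupBy(_._1).view.mapValues(_.map(_._2)).toMap

    // Reduce phase: combine all values for each key into the final output
    val output = grouped.map { case (word, counts) => reduceFn(word, counts) }
    output.foreach { case (word, count) => println(s"$word -> $count") }
  }
}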

Design Decisions MapReduce

Master data structures:

  • For tracking progress

Locality:

  • Scheduling
  • Placement of intermediate data

Task granularity:

  • Fine -> more flexibility
  • Coarse -> lower overhead

Fault Tolerance:

  • Master -> Standby replication?
  • Worker -> Detect failure/straggler?

Semantics in the presence of failures

Backup tasks

Paper describes these decisions and Google’s optimizations
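
A toy Scala sketch of the master-side bookkeeping these decisions imply: per-task state tracking, re-scheduling after a worker failure, and picking stragglers for backup execution (the types, fields, and threshold here are made up for illustration, not the paper's actual data structures):

sealed trait TaskState
case object Idle extends TaskState
case class InProgress(worker: String, startedAtMs: Long) extends TaskState
case object Completed extends TaskState

final class ToyMaster(numTasks: Int) {
  private val state = Array.fill[TaskState](numTasks)(Idle)

  // Hand an idle task to a worker; a locality-aware scheduler would prefer a task
  // whose input split is stored on or near this worker.
  def assign(worker: String, nowMs: Long): Option[Int] =
    state.indexWhere(_ == Idle) match {
      case -1 => None
      case i  => state(i) = InProgress(worker, nowMs); Some(i)
    }

  def complete(task: Int): Unit = state(task) = Completed

  // On worker failure, mark its in-progress work idle again so it is re-executed
  // elsewhere (completed map tasks would also be redone, since their output lives
  // on the failed worker's local disk).
  def workerFailed(worker: String): Unit =
    for (i <- state.indices) state(i) match {
      case InProgress(w, _) if w == worker => state(i) = Idle
      case _                               => ()
    }

  // Backup tasks: near the end of a phase, find slow tasks and schedule duplicates;
  // whichever copy finishes first wins.
  def stragglers(nowMs: Long, thresholdMs: Long): Seq[Int] =
    state.indices.filter { i =>
      state(i) match {
        case InProgress(_, startedAt) => nowMs - startedAt > thresholdMs
        case _                        => false
      }
    }
}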

Limitations of MapReduce

Depends on persistent IO, including for intermediate data

Persisting intermediate data does allow pipelines to be resumed rather than restarted, but it comes at a cost:

Every operation has to pay a (de)serialization cost

Iterative executions mean intermediate data might need to be read multiple times

Replication needs to be at the storage level
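
A small Scala sketch of the iterative cost (local files stand in for a distributed file system, and the computation is made up): every pass pays serialization and storage IO on the way out, and deserialization and IO again on the way back in:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object IterativeOnStorageSketch {
  def main(args: Array[String]): Unit = {
    var data: Vector[Double] = Vector.fill(1000)(1.0)

    for (iter <- 1 to 5) {
      // One compute pass (stand-in for a whole MapReduce job)
      val next = data.map(_ * 0.9)

      // Serialize and write the intermediate result to "stable storage"...
      val path = Paths.get(s"iter-$iter.txt")
      Files.write(path, next.mkString("\n").getBytes(StandardCharsets.UTF_8))

      // ...then read it back and deserialize it before the next pass
      data = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
        .split("\n").map(_.toDouble).toVector
    }

    println(data.head)
  }
}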

Spark

Up to 10x faster than Hadoop

Supports multiple kinds of workloads (batch, iterative, interactive)

Goal of Spark is to allow in-memory data sharing

Provides a different fault tolerance mechanism (recompute from lineage rather than replicate)

Resilient Distributed Datasets

Read-only, partitioned collection of records that can be spread across multiple machines

Records are created only via “transformations”

  • deterministic, “lazy” operations on data in stable storage or on other RDDs

Used via actions

  • count
  • collect
  • save

RDDs are aware of lineage

  • This makes recomputing easier

RDDs Through Example

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

errors.count()

// Return the time fields of errors mentioning HDFS as an array (assuming time
// is field number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
    .map(_.split('\t')(3))
    .collect()

Lineage = lines -> errors -> HDFS errors -> time fields

RDD Transformations

Data dependencies are implied by transformations

Actions are executed as a DAG of stages

  • Each with as many narrow dependencies as possible
    • 1:1 mappings, as opposed to N:1 (wide) dependencies that require a shuffle
  • Achieve parallelism, limit IO contention

Tasks are assigned to machines based on data locality
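
A short Spark sketch of the narrow vs. wide distinction, reusing the paper-style spark context from the example above (standard RDD operations; the data and keys are made up):

// filter and map are narrow (1:1 per partition), so they are pipelined in one stage;
// reduceByKey is wide, so it introduces a shuffle and a new stage in the DAG.
val nums  = spark.parallelize(1 to 1000000, 8)
val keyed = nums.filter(_ % 2 == 0)      // narrow
    .map(n => (n % 10, n))               // narrow: same stage, same machine
val sums  = keyed.reduceByKey(_ + _)     // wide: shuffle, new stage boundary
sums.collect()                           // action: triggers execution of the DAG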

Did Spark Achieve Its Goal?

Once data is brought into memory, you effectively have a distributed shared memory runtime:

  • One option is to just track individual data updates (fine-grained writes)
  • RDDs instead log the coarse-grained operations applied to all elements of the RDD

Pros:

  • Less data to persist on the execution critical path
  • Read input data as little as once
    • Lower storage IO costs
  • More control over locality

Cons:

  • Longer recovery time (lost partitions are recomputed from lineage)

RDDs best for batch workloads

Spark can be up to 2x faster than Hadoop

  • then another 2x faster when optimized for things like locality
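
For contrast with the iterative storage-bound sketch in the MapReduce limitations section, a minimal Spark-style sketch of the in-memory sharing behind these speedups (again assuming the paper-style spark context; the computation is made up):

// The working set is loaded and persisted in memory once, then reused by every
// iteration instead of being re-read and re-deserialized from storage each pass.
val points = spark.parallelize(Vector.fill(1000)(1.0), 8).persist()

var w = 0.0
for (_ <- 1 to 5) {
  // Each iteration is just an action over the same cached RDD; only w changes
  w += points.map(p => p * w + 1.0).sum() / points.count()
}
println(w)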