Jes2ica.IO

coding, trading, reading

I took CS 245 this January, taught by Matei Zaharia, the creator of Apache Spark. Recently I needed to use Spark for a project, so I revisited this paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Here are some notes:

  • Resilient Distributed Datasets (RDD):
    • Read-only, partitioned collection of records.
    • Can be only created through deterministic operations on either
      (1) data in stable storage or (2) other RDDs. [Transformation - map, filter, join]
    • User can control: (1) persistence and (2) partitioning. (Sketched in Scala below.)
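A minimal Scala sketch of the two creation paths and the two user controls. The master URL and HDFS paths are placeholders, not from the paper:

```scala
import org.apache.spark.{SparkContext, HashPartitioner}

val sc = new SparkContext("local[*]", "rdd-notes")   // placeholder master URL

// (1) From data in stable storage:
val lines = sc.textFile("hdfs://host/path/to/log")   // placeholder path

// (2) From other RDDs, via deterministic transformations:
val errors = lines.filter(_.contains("ERROR"))              // filter
val counts = errors.map(line => (line.split('\t')(0), 1))   // map

// User-controlled persistence and partitioning:
errors.persist()                                     // keep in memory after first computation
val byKey = counts.partitionBy(new HashPartitioner(8))
```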
  • Life of an RDD:
    • Creation (through coarse-grained transformations)
    • Perform actions (e.g. count, collect)
    • Persist (in memory or on disk); an end-to-end sketch follows this list.
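Putting those stages together, a hypothetical pass through that life cycle (transformations are lazy, so nothing runs until an action is called):

```scala
// Creation through coarse-grained transformations (lazy: only builds lineage).
val words = sc.textFile("hdfs://host/path/to/corpus")   // placeholder path
  .flatMap(_.split(" "))

// Ask Spark to keep the computed partitions in memory.
words.persist()

// Actions trigger the actual computation.
val total = words.count()     // first action: runs the job and caches `words`
val all   = words.collect()   // later actions reuse the cached partitions
```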
  • Advantages:
    • RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineage. Only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.
    • Its immutable nature lets a system mitigate slow nodes by running backup copies of slow tasks.
    • In bulk operations, a runtime can schedule tasks based on data locality to improve performance.
    • RDDs degrade gracefully when there is not enough memory to store them: partitions that do not fit in RAM can be stored on disk.
  • Applications:
    • Batch applications that apply the same operation to all elements of a dataset (not suitable for applications that make asynchronous, fine-grained updates to shared state).
  • Usage: Write driver program that connects to a cluster of workers
    • Driver: defines one or more RDDs, tracks the RDDs’ lineage.
    • Worker: long-lived processes that can store RDD partitions in RAM across operations.
    • Users provide arguments to RDD operations like map by passing closures (function literals)
      • Scala represents each closure as a Java object
      • The object can be serialized and loaded on another node to pass the closure across the network.
      • Variables bound in the closure are stored as fields in the object.
    • During execution:
      • Driver finds variables attached to closure, wraps them into an object, then serializes the object.
      • The object will then be passed to worker nodes.
      • Worker deserializes and executes the closure. (A small example follows this list.)
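A small, hypothetical illustration of the closure mechanics described above; `threshold` is the free variable that ends up as a field on the serialized closure object:

```scala
case class Reading(station: String, temperature: Double)   // case classes are Serializable

val readings = sc.parallelize(Seq(
  Reading("a", 35.0), Reading("b", 42.5), Reading("c", 47.1)))

// `threshold` is bound in the closure below. Scala compiles the closure to a
// Java object carrying `threshold` as a field; the driver serializes that
// object, and each worker deserializes it and runs it on its partitions.
val threshold = 40.0
val hot = readings.filter(r => r.temperature > threshold)

hot.collect().foreach(println)   // Reading(b,42.5), Reading(c,47.1)
```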
  • Example applications:
    • Logistic Regression (sketched in code after this list)
    • PageRank
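The paper's logistic regression example shows why in-memory reuse pays off: the `points` RDD is persisted once and rescanned on every iteration. A simplified, runnable variant of it, with toy data and plain arrays in place of the paper's Vector type:

```scala
import scala.math.exp

case class Point(x: Array[Double], y: Double)   // y is the label: +1 or -1

// Toy data; the paper reads points from a text file instead.
val points = sc.parallelize(Seq(
  Point(Array(1.0, 2.0),   1.0),
  Point(Array(-1.5, -0.5), -1.0))).persist()    // reused across iterations

val dims = 2
var w = Array.fill(dims)(0.0)   // the paper starts from a random vector

for (_ <- 1 to 10) {
  // Each iteration rescans the cached `points` RDD to compute the gradient.
  val gradient = points.map { p =>
    val margin = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
    val scale  = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
    p.x.map(_ * scale)
  }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}
```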
  • Representing RDDs
    • Common interface with 5 pieces of information (a rough Scala rendering follows this list):
      • partitions()
      • preferredLocations(p)
      • dependencies()
      • iterator(p, parentIters)
      • partitioner()
    • Dependencies:
      • Narrow vs. wide: a dependency is narrow when each partition of the parent RDD is used by at most one partition of the child RDD, and wide otherwise.
      • Differences:
        • Narrow dependencies allow for pipelined execution on one cluster node.
        • Recovery after a node failure is more efficient with a narrow dependency.
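A rough Scala rendering of that common interface; the trait and type names here are illustrative, not Spark's actual internals:

```scala
trait Partition            // an atomic piece of the dataset
trait Partitioner          // hash/range partitioning metadata

sealed trait Dependency { def parent: RddLike[_] }
case class Narrow(parent: RddLike[_]) extends Dependency // parent partition feeds at most one child partition (map, filter)
case class Wide(parent: RddLike[_])   extends Dependency // parent partition feeds many child partitions (groupByKey)

trait RddLike[T] {
  def partitions(): Seq[Partition]                    // the set of partitions
  def preferredLocations(p: Partition): Seq[String]   // locality hints (e.g. HDFS block hosts)
  def dependencies(): Seq[Dependency]                 // lineage edges to parent RDDs
  def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[T]  // compute one partition from its parents
  def partitioner(): Option[Partitioner]              // Some(_) if hash/range partitioned
}
```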
  • Implementation
    • Job scheduling:
      • Similar to Dryad’s, but it also takes into account which RDDs are available in memory.
      • The scheduler assigns tasks to machines based on data locality, using delay scheduling.
    • Interpreter Integration
      • Class shipping: the interpreter serves the classes it generates to workers over HTTP
      • Modified code generation: generated code references each line’s singleton object instance directly
    • Memory Management
      • 3 options for storage of persistent RDDs:
        • in-memory storage as deserialized Java objects: fastest performance
        • in-memory storage as serialized data: more memory efficient
        • on-disk storage: for large RDDs
      • LRU eviction: when memory fills up, a partition from the least recently accessed RDD is evicted. (The three options map onto Spark’s storage levels; see the snippet below.)
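In today's Spark API the three storage options correspond to storage levels. A small illustration, using three placeholder RDDs since a level can only be assigned once per RDD:

```scala
import org.apache.spark.storage.StorageLevel

val logs    = sc.textFile("hdfs://host/a")   // placeholder RDDs
val events  = sc.textFile("hdfs://host/b")
val archive = sc.textFile("hdfs://host/c")

logs.persist(StorageLevel.MEMORY_ONLY)        // deserialized Java objects: fastest access
events.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized bytes in memory: more compact
archive.persist(StorageLevel.DISK_ONLY)       // on disk: for RDDs too large for RAM
```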
  • Evaluation
    • Outperforms Hadoop by up to 20x (The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects).
    • Can recover quickly by rebuilding only the lost RDD partitions.
    • Can query a 1 TB dataset interactively with 5-7 second latencies.
  • User applications
    • In-memory Analytics
    • Traffic Modeling
    • Twitter Spam Classification