A look under the hood at Apache Spark's API and engine evolutions

Reynold Xin (@rxin), 2017-02-08, Amsterdam Meetup


Page 1

A look under the hood at Apache Spark's API and engine evolutions

Reynold Xin (@rxin)
2017-02-08, Amsterdam Meetup

Page 2

About Databricks

Founded by creators of Spark

Cloud data platform:
- Spark
- Interactive analysis
- Cluster management
- Production pipelines
- Data governance & security

Page 3

Databricks Amsterdam R&D Center

Started in January 2017

Hiring distributed systems & database engineers!

Email me: [email protected]

Page 4

[Spark stack diagram: SQL, Streaming, MLlib, and GraphX libraries on top of Spark Core (RDD).]

Page 5

[Spark stack diagram (a different take): a frontend (user-facing APIs) on top of a backend (execution).]

Page 6

[Spark stack diagram (a different take): frontend (RDD, DataFrame, ML pipelines, …) on top of backend (scheduler, shuffle, operators, …).]

Page 7

Today’s Talk

Some archaeology:
- IMS, relational databases
- MapReduce
- data frames

Last 6 years of Spark evolution

Page 8

Databases

Page 9

IBM IMS hierarchical database (1966)

Image from https://stratechery.com/2016/oracles-cloudy-future/

Page 10

Hierarchical Database

- Improvement over file system: query language & catalog

- Lack of flexibility
- Difficult to query items in different parts of the hierarchy
- Relationships are pre-determined and difficult to change

Page 11

Page 12

“Future users of large data banks must be protected from having to know how the data is organized in the machine. …

most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

(E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, 1970)

Page 13

Era of relational databases (late 60s)

Two “new” important ideas

Physical Data Independence: The ability to change the physical data layout without having to change the logical schema.

Declarative Query Language: The programmer specifies “what” rather than “how”.
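For a concrete (hypothetical) illustration of “what” versus “how”: the declarative query SELECT name FROM users WHERE age > 21 states only the result wanted, leaving the engine free to pick an index scan or a full scan, while a procedural version in ordinary Scala fixes the access path:

case class User(name: String, age: Int)

// Procedural: we have chosen the "how" ourselves:
// iterate over everything, in this order, with no index.
def adults(users: Seq[User]): Seq[String] =
  users.filter(_.age > 21).map(_.name)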

Page 14

Why?

Business applications outlive the environments they were created in:
- New requirements might surface
- Underlying hardware might change
- Require physical layout changes (indexing, different storage medium, etc.)

Enabled a tremendous amount of innovation:
- Indexes, compression, column stores, etc.

Page 15

Relational Database Pros vs Cons

Pros:
- Declarative and data independent
- SQL is the universal interface everybody knows

Cons:
- Difficult to compose & build complex applications
- Too opinionated and inflexible
- Require data modeling before putting any data in
- SQL is the only programming language

Page 16

Big Data, MapReduce, Hadoop

Page 17

Challenges Google faced

Data size growing (volume & velocity)
- Processing has to scale out over large clusters

Complexity of analysis increasing (variety)
- Massive ETL (web crawling)
- Machine learning, graph processing

Page 18

The Big Data Problem

Semi-/Un-structured data doesn’t fit well with databases

A single machine can no longer process or even store all the data!

The only solution is to distribute general storage & processing over clusters.

Page 19

Google Datacenter

How do we program this thing?


Page 20

Data-Parallel Models

Restrict the programming interface so that the system can do more automatically

“Here’s an operation, run it on all of the data”
- I don’t care where it runs (you schedule that)
- In fact, feel free to run it twice on different nodes
- Similar to “declarative programming” in databases

Page 21

Page 22

MapReduce Pros vs Cons

Pros:
- Massively parallel
- Very flexible programming model & schema-on-read

Cons:
- Extremely verbose & difficult to learn
- Most real applications require multiple MR steps
  - 21 MR steps -> 21 mapper and reducer classes
  - Lots of boilerplate code per step
- Bad performance

Page 23

R, Python, data frames

Page 24

Data frames in R / Python

> head(filter(df, df$waiting < 50)) # an example in R
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48

Developed by the stats community; concise syntax for ad-hoc analysis

Procedural (not declarative)

Page 25

R data frames Pros and Cons

Pros:
- Easy to learn
- Pretty fast on a laptop (or one server)

Cons:
- No parallelism & doesn’t work well on big data
- Lacks sophisticated query optimization

Page 26

“Are you going to talk about Spark at all tonight!?”

Page 27

Which one is better? Databases, R, MapReduce? Declarative, procedural, data independence?

Page 28

Spark’s initial focus: a better MapReduce

Language-integrated API (RDD): similar to Scala’s collection library using functional programming; incredibly powerful and composable

lines = spark.textFile("hdfs://...") // RDD[String]

points = lines.map(line => parsePoint(line)) // RDD[Point]

points.filter(p => p.x > 100).count()

Better performance: through a more general DAG abstraction, faster scheduling, and in-memory caching
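A minimal sketch of how that in-memory caching composes with the API above (reusing the hypothetical points RDD; cache() only marks the RDD, the first action materializes it):

val cached = points.cache()           // marked for caching, nothing computed yet
cached.filter(p => p.x > 100).count() // first action computes and caches the data
cached.filter(p => p.x < 10).count()  // second action reuses the in-memory copy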

Page 29

Programmability

WordCount in 50+ lines of Java MR

WordCount in 3 lines of Spark
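The three-line version referred to above: a sketch of the canonical Spark WordCount, assuming a SparkContext named sc and an HDFS input path:

val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts per word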

Page 30

Challenge with Functional API

Looks high-level, but hides many semantics of computation:
- Functions are arbitrary blocks of Java bytecode
- Data stored is arbitrary Java objects

Users can mix APIs in suboptimal ways

Page 31

Which Operator Causes Most Tickets?

[Slide: a cloud of RDD operators (map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, …), with groupByKey highlighted.]

Page 32

Example Problem

pairs = data.map(word => (word, 1))

groups = pairs.groupByKey()

groups.map { case (k, vs) => (k, vs.sum) }

Physical API: materializes all groups as Seq[Int] objects, then promptly aggregates them.
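For contrast, a sketch of the same logical intent written with reduceByKey, which lets the engine combine values map-side instead of materializing every group:

// Same result, but no intermediate Seq[Int] per key is ever built.
val counts = data.map(word => (word, 1)).reduceByKey(_ + _)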

Page 33

Challenge: Data Representation

Java objects are often many times larger than their underlying fields

class User(name: String, friends: Array[Int])

new User("Bobby", Array(1, 2))

[Diagram: the JVM object layout for the User above: a User object with a header and two pointers, one to a java.lang.String (which itself wraps a char[] holding "Bobby") and one to an int[] holding 1 and 2; object headers and pointers take many times more space than the raw field data.]

Page 34

Recap: two primary issues

1. Many APIs specify the “physical” behavior rather than the “logical” intent, i.e. they are not declarative enough.

2. Closures (user-defined functions and types) are opaque to the engine, and as a result there is little room for optimization.

Page 35

Sort Benchmark

Originally sponsored by Jim Gray in 1987 to measure advancements in software and hardware

Participants often used purpose-built hardware/software to compete:
- Large companies: IBM, Microsoft, Yahoo, …
- Academia: UC Berkeley, UCSD, MIT, …

Page 36

Sort Benchmark

Past winners: Microsoft, Yahoo, Samsung, UCSD, …

1MB -> 100MB -> 1TB (1998) -> 100TB (2009)

Page 37

Winning Attempt

Built on the low-level Spark API and:

- Put all data in off-heap memory using sun.misc.Unsafe
- Use tight low-level while loops rather than iterators

~3,000 lines of low-level code on Spark, written by Reynold
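The record code itself is not shown in the talk; the following is a minimal, hypothetical sketch of the two techniques named above: off-heap allocation via sun.misc.Unsafe (fetched reflectively, since its constructor is private) and a tight while loop instead of iterators.

import sun.misc.Unsafe

// Unsafe's constructor is private; grab the singleton reflectively.
val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

val size = 1024L * 1024L               // a 1 MB off-heap buffer (illustrative size)
val addr = unsafe.allocateMemory(size) // raw memory, invisible to the GC
var offset = 0L
while (offset < size) {                // tight low-level loop, no iterator overhead
  unsafe.putLong(addr + offset, 0L)    // zero the buffer 8 bytes at a time
  offset += 8
}
unsafe.freeMemory(addr)                // off-heap memory must be freed manually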

Page 38

On-Disk Sort Record: Time to sort 100TB

- 2013 Record: Hadoop, 2,100 machines, 72 minutes
- 2014 Record: Spark, 207 machines, 23 minutes

Spark also sorted 1PB in 4 hours.

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 39

How do we enable average users to win a world record using a few lines of code?

Page 40

Goals of the last two years’ API evolution

1. Simpler APIs bridging the gap between big data engineering and data science.

2. Higher level, declarative APIs that are future proof (engine can optimize programs automatically).

Taking the best ideas from databases, big data, and data science

Page 41

Structured APIs: DataFrames + Spark SQL

Page 42

DataFrames and Spark SQL

Efficient library for structured data (data with a known schema)
- Two interfaces: SQL for analysts + apps, DataFrames for programmers

Optimized computation and storage, similar to RDBMS

SIGMOD 2015

Page 43

Execution Steps

[Diagram: SQL and DataFrames both produce a Logical Plan, which is resolved against the Catalog, optimized by the Optimizer into a Physical Plan, and compiled by the Code Generator into RDDs, reading input through the Data Source API.]

Page 44

DataFrame API

DataFrames hold rows with a known schema and offer relational operations on them through a DSL

val users = spark.sql("select * from users")

val massUsers = users.filter(users("country") === "NL")

massUsers.count()

massUsers.groupBy("name").avg("age")

(The === comparison builds an expression AST the engine can inspect, rather than an opaque function.)
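As noted earlier, SQL is the second interface onto the same engine; a hedged sketch of roughly the same aggregation, assuming the users table is registered in the catalog:

spark.sql("SELECT name, AVG(age) FROM users WHERE country = 'NL' GROUP BY name")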

Page 45

Spark RDD Execution

[Diagram: with RDDs, the Java/Scala frontend runs on the JVM backend, but the Python frontend requires its own Python backend, because closures (user-defined functions) are opaque to the engine.]

Page 46

Spark DataFrame Execution

[Diagram: the DataFrame frontend produces a Logical Plan, an intermediate representation for computation, which the Catalyst optimizer compiles to physical execution.]
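A minimal sketch of inspecting that intermediate representation, reusing the massUsers DataFrame from the earlier slide:

// explain(true) prints the parsed and optimized logical plans
// as well as the physical plan Catalyst selected.
massUsers.groupBy("name").avg("age").explain(true)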

Page 47

Spark DataFrame Execution

[Diagram: Python, Java/Scala, and R DataFrames are simple wrappers that create the same Logical Plan, the shared intermediate representation, which the Catalyst optimizer compiles to physical execution.]

Page 48

Structured API Example

events = spark.read.json("/logs")

stats = events.join(users).groupBy("loc", "status").avg("duration")

errors = stats.where(stats.status == "ERR")

DataFrame API -> Optimized Plan -> Specialized Code

Optimized plan: SCAN logs, SCAN users -> JOIN -> AGG -> FILTER

Specialized code:

while (logs.hasNext) {
  e = logs.next
  if (e.status == "ERR") {
    u = users.get(e.uid)
    key = (u.loc, e.status)
    sum(key) += e.duration
    count(key) += 1
  }
}
...

Page 49

Benefit of Logical Plan: Simpler Frontend

Python: ~2000 lines of code (built over a weekend)

R: ~1000 lines of code

i.e. much easier to add new language bindings (Julia, Clojure, …)

Page 50

Performance

[Chart: runtime for an example aggregation workload using the RDD API; Python is markedly slower than Java/Scala.]

Page 51

Benefit of Logical Plan: Performance Parity Across Languages

[Chart: runtime for an example aggregation workload (secs); with the RDD API, Python is far slower than Java/Scala, while with the DataFrame API, Python, Java/Scala, R, and SQL all perform equally.]

Page 52

What are Spark’s structured APIs?

Combination of:
- data frames from R as the “interface” -- easy to learn
- declarativity & data independence from databases -- easy to optimize & future-proof
- flexibility & parallelism from MapReduce -- massively scalable & flexible

Page 53

Future possibilities

Spark as a fast, multi-core data collection library

Spark as a performant streaming engine

Spark as a GPU computation framework

All using the same API

Page 54

[Diagram: Unified API, One Engine, Automatically Optimized. Language frontends (Python, Java/Scala, R, SQL, …) all produce the DataFrame Logical Plan, which the Tungsten backend can execute on the JVM, via LLVM, with SIMD, or on GPUs.]

Page 55

Recap

We learn from previous-generation systems to understand what works and what can be improved, and evolve Spark accordingly.

Latest APIs take the best ideas from earlier systems:
- data frames from R as the “interface” -- easy to learn
- declarativity & data independence from databases -- easy to optimize & future-proof
- flexibility & parallelism from MapReduce -- massively scalable & flexible

Page 56

Dank je wel! (Thank you!)

@rxin