What's New in the Berkeley Data Analytics Stack
Tathagata Das, Reynold Xin (AMPLab, UC Berkeley)
Hadoop Summit 2013
Berkeley Data Analytics Stack
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Today’s Talk
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Project History
2010: Spark (core execution engine) open sourced
2012: Shark open sourced
Feb 2013: Spark Streaming alpha open sourced
Jun 2013: Spark entered Apache Incubator
Community
3000+ people online training
800+ meetup members
60+ developers contributing
17 companies contributing
Hadoop and continuous computing: looking beyond MapReduce
Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms, Yahoo!
Hadoop Summit 2013 Keynote
2012 Hadoop Summit
2012 Hadoop Summit (Future of Apache Hadoop)
2013 Hadoop Summit
2013 Hadoop Summit (Hadoop Economics)
Today’s Talk
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Spark
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk)
Often 5× less code
Why a New Framework?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects
» Can optionally be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure
Programming interface
» Functional APIs in Scala, Java, Python
» Interactive use from the Scala and Python shells
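The RDD idea above (lazy transformations, recomputation from lineage) can be sketched in plain Python. This is a toy illustration, not Spark: `LazyRDD` is a hypothetical class that records how to compute its data rather than the data itself, so the data can always be recomputed after a failure:

```python
# Toy sketch of the RDD idea: transformations build a lineage graph lazily;
# data is only materialized when an action (collect/count) runs, and can be
# recomputed from lineage at any time -- this is what makes RDDs "resilient".
class LazyRDD:
    def __init__(self, compute):
        self._compute = compute          # zero-arg function, re-runnable on failure

    @staticmethod
    def parallelize(data):
        return LazyRDD(lambda: list(data))

    def map(self, f):                    # transformation: nothing runs yet
        return LazyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, p):                 # transformation: nothing runs yet
        return LazyRDD(lambda: [x for x in self._compute() if p(x)])

    def collect(self):                   # action: materializes the lineage
        return self._compute()

    def count(self):                     # action: materializes the lineage
        return len(self._compute())

lines = LazyRDD.parallelize(["ERROR foo", "INFO ok", "ERROR bar"])
errors = lines.filter(lambda s: s.startswith("ERROR"))  # no work done yet
print(errors.count())  # lineage evaluated here -> 2
```

Real RDDs partition the collection across workers and cache partitions in memory, but the lazy lineage graph is the same idea.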
Example: Log Mining
Exposes RDDs through a functional API in Java, Python, Scala

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
errors.persist()
errors.filter(_.contains("foo")).count()      // action
errors.filter(_.contains("bar")).count()

[Diagram: the master sends tasks to workers; each worker scans its block of the file, caches its partition of the errors RDD in memory, and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: 1 TB data in 5 sec (vs 170 sec for on-disk data)
Spark: Expressive API
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
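As a rough illustration of what a couple of these operators compute, here are plain-Python analogues of `reduceByKey` and `cogroup` over lists of key-value pairs (toy code, not Spark's distributed implementation):

```python
from collections import defaultdict

def reduce_by_key(pairs, f):
    """Plain-Python analogue of Spark's reduceByKey: merge values per key."""
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

def cogroup(a, b):
    """Plain-Python analogue of cogroup: group both datasets' values by key."""
    groups = defaultdict(lambda: ([], []))
    for k, v in a:
        groups[k][0].append(v)
    for k, v in b:
        groups[k][1].append(v)
    return sorted(groups.items())

# Word-count-style aggregation, then a two-sided grouping:
counts = reduce_by_key([("a", 1), ("b", 1), ("a", 1)], lambda x, y: x + y)
print(counts)                                    # [('a', 2), ('b', 1)]
print(cogroup([("a", 1)], [("a", 2), ("c", 3)]))
```

In Spark these run in parallel across partitions, with a shuffle moving each key's values to one machine first.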
Machine Learning Algorithms
Time per iteration (s):
Logistic Regression: Hadoop MR 110, Spark 0.96
K-Means Clustering: Hadoop MR 155, Spark 4.1
Spark in Java and Python

Python API:
lines = spark.textFile(…)
errors = lines.filter(lambda s: "ERROR" in s)
errors.count()

Java API:
JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("ERROR"); }
  });
errors.count();
Projects Building on Spark
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
GraphX
Combining data-parallel and graph-parallel
» Run graph analytics and ETL in the same engine
» Consume graph computation output in Spark
» Interactive shell
Programmability
» Support GraphLab / Pregel APIs in 20 LOC
» Implement PageRank in 5 LOC
Coming this summer as a Spark module
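The slide claims PageRank in a handful of lines; outside GraphX, the same iteration can be sketched in plain Python (the three-node graph and the conventional 0.85 damping factor are assumptions for illustration):

```python
def pagerank(links, iters=50, d=0.85):
    """links: dict node -> list of out-neighbours. Returns a rank per node.
    Each iteration, every page splits its rank among its out-links, and
    ranks are refreshed as (1 - d)/N + d * (received contributions)."""
    nodes = list(links)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        contribs = {n: 0.0 for n in nodes}
        for n, outs in links.items():
            for m in outs:                       # page n shares its rank
                contribs[m] += ranks[n] / len(outs)
        ranks = {n: (1 - d) / len(nodes) + d * c for n, c in contribs.items()}
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c" receives links from both a and b
```

GraphX expresses the same loop as message passing over a distributed graph, which is why it fits in a few lines there too.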
Scalable Machine Learning
Build a classifier for X
What you want to do vs. what you have to do:
• Learn the internals of ML classification algorithms, sampling, feature selection, cross-validation, …
• Potentially learn Spark/Hadoop/…
• Implement 3-4 algorithms
• Implement grid search to find the right algorithm parameters
• Implement validation algorithms
• Experiment with different sample sizes, algorithms, features
• …
… and in the end, ask for help
MLBase
Making large scale machine learning easy
» User specifies the task (e.g. “classify this dataset”)
» MLBase picks the best algorithm and best parameters for the task
Develop scalable, high-quality ML algorithms
» Naïve Bayes
» Logistic / Least Squares Regression (L1/L2 regularization)
» Matrix Factorization (ALS, CCD)
» K-Means & DP-Means
» Optimization library: SGD, FISTA, ADMM
First release (summer): collection of scalable algorithms
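Of the algorithms listed, K-Means is simple enough to sketch in plain Python (Lloyd's iterations on one-dimensional toy data; this is an illustration of the algorithm, not the MLBase implementation):

```python
def kmeans(points, centers, iters=10):
    """One-dimensional Lloyd's algorithm: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Recompute centers; keep a center unchanged if it got no points.
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans(data, [0.0, 10.0]))  # converges to centers near 1.0 and 9.0
```

A scalable version distributes the assignment step across the cluster and aggregates per-center sums, which is a natural fit for Spark's reduceByKey.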
Today’s Talk
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Shark
Hive compatible: HiveQL, UDFs, metadata, etc.
» Works in existing Hive warehouses without changing queries or data!
Fast execution engine
» Uses Spark as the underlying execution engine
» Low-latency, interactive queries
» Scales out and tolerates worker failures
Easy to combine with Spark
» Process data with SQL queries as well as raw Spark ops
Real-world Performance
1.7 TB Real Warehouse Data on 100 EC2 nodes
Comparison: http://tinyurl.com/bigdata-benchmark
Today’s Talk
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Spark Streaming
Extends Spark for large scale stream processing
» Receive data directly from Kafka, Flume, Twitter, etc.
» Fast, scalable, and fault-tolerant
Simple, yet rich batch-like API
» Easy to express your complex streaming computation
» Fault-tolerant, stateful stream processing out of the box
» Easy to inter-mix batch and stream processing
Extends Spark for doing large scale stream processing
Scales to 100s of nodes and achieves second scale latencies
Efficient and fault-tolerant stateful stream processing
Integrates with Spark’s batch and interactive processing
Provides a simple batch-like API for implementing complex algorithms
Motivation
Many important applications must process large streams of live data and provide results in near-real-time
» Social network trends
» Website statistics
» Intrusion detection systems
» Etc.
Challenges
Require large clusters
Require latencies of few seconds
Require fault-tolerance
Require integration with batch processing
Integration with Batch Processing
Many environments require processing same data in live streaming as well as batch post-processing
Hard for any existing single framework to achieve both
» Provide low latency for streaming workloads
» Handle large volumes of data for batch workloads
Extremely painful to maintain two stacks
» Different programming models
» Double the implementation effort
» Double the number of bugs
Existing Streaming Systems
Storm – limited fault-tolerance guarantees
» Replays records if not processed
» Processes each record at least once
» May double-count events!
» Mutable state can be lost due to failure!
Trident – uses transactions to update state
» Processes each record exactly once
» Per-state transaction to an external database is slow
Neither integrates well with batch processing systems
Spark Streaming
Discretized Stream Processing - run a streaming computation as a series of very small, deterministic batch jobs:
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

• Batch sizes as low as ½ second, latency ~1 second
• Potential for combining batch processing and streaming processing in the same system
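The discretized-stream idea (chop the live stream into batches and run the same batch code on each) can be mimicked in a few lines of plain Python; the timestamps and the one-second batch interval here are invented for illustration:

```python
def discretize(events, batch_interval=1.0):
    """Group (timestamp, record) events into consecutive micro-batches,
    like Spark Streaming chopping a live stream into batches of X seconds."""
    batches = {}
    for t, record in events:
        batches.setdefault(int(t // batch_interval), []).append(record)
    return [batch for _, batch in sorted(batches.items())]

def process(batch):
    """The same 'batch job' runs on every micro-batch (here: count ERRORs)."""
    return sum(1 for r in batch if r.startswith("ERROR"))

events = [(0.1, "ERROR a"), (0.7, "INFO b"), (1.2, "ERROR c"), (2.5, "ERROR d")]
print([process(b) for b in discretize(events)])  # one result per micro-batch
```

Because each micro-batch is an ordinary deterministic batch job, a lost batch can simply be recomputed, which is where the fault-tolerance story comes from.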
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
DStream: a sequence of RDDs representing a stream of data
[Diagram: the tweets DStream as a sequence of batches (@ t, t+1, t+2) fed by the Twitter Streaming API, each batch stored in memory as an RDD (immutable, distributed)]
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap applied to each batch of the tweets DStream creates new RDDs for every batch of the new hashTags DStream: [#cat, #dog, …]]
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage

[Diagram: flatMap runs on every batch of the tweets DStream, and every resulting hashTags batch is saved to HDFS]
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { … })

foreach: do whatever you want with the processed data

[Diagram: foreach runs on each batch of the hashTags DStream: write to a database, update an analytics UI, whatever you want]
Window-based Transformations
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

[Diagram: a sliding window over the DStream, with window length 1 minute and sliding interval 5 seconds]
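As a plain-Python sketch of the windowed count above, with a window of 3 batches sliding 1 batch at a time standing in for `window(Minutes(1), Seconds(5))` (toy data, not Spark):

```python
from collections import Counter

def windowed_counts(batches, window_len=3, slide=1):
    """For each window position, count tag occurrences across the last
    `window_len` batches, advancing by `slide` batches each step --
    a plain-Python analogue of window(...).countByValue()."""
    results = []
    for end in range(window_len, len(batches) + 1, slide):
        window = batches[end - window_len:end]
        results.append(Counter(tag for batch in window for tag in batch))
    return results

batches = [["#cat"], ["#dog", "#cat"], ["#cat"], ["#dog"]]
for counts in windowed_counts(batches):
    print(dict(counts))  # one count table per window position
```

Spark Streaming computes these windows incrementally (adding the new batch, subtracting the expired one) instead of recounting the whole window each time.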
Arbitrary Stateful Computations
Specify function to generate new state based on previous state and new data
»Example: Maintain per-user mood as state, and update it with their tweets
updateMood(newTweets, lastMood) => newMood
moods = tweets.updateStateByKey(tweets => updateMood(tweets))
» Exactly-once semantics even under worker failures
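The shape of this per-key state update can be sketched in plain Python; `update_state` mirrors the updateStateByKey contract (new values plus previous state per key), while the mood scoring itself is invented for illustration:

```python
def update_state(state, new_values_by_key, update_fn):
    """For every key with new values, compute new state from
    (new values, previous state) -- the updateStateByKey contract."""
    out = dict(state)
    for key, values in new_values_by_key.items():
        out[key] = update_fn(values, state.get(key))
    return out

def update_mood(new_tweets, last_mood):
    """Toy mood score: previous mood plus +1/-1 per happy/other tweet."""
    score = last_mood or 0
    for t in new_tweets:
        score += 1 if "happy" in t else -1
    return score

state = {}
state = update_state(state, {"alice": ["so happy today"]}, update_mood)
state = update_state(state, {"alice": ["sad news"], "bob": ["happy!"]}, update_mood)
print(state)  # per-user mood carried across batches
```

In Spark Streaming the state itself is an RDD, so it is partitioned across the cluster and recoverable from lineage, which is what makes the exactly-once claim possible.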
Arbitrary Combination of Batch and Streaming Computations
Inter-mix RDD and DStream operations!
» Example: Join incoming tweets with a spam HDFS file to filter out bad tweets
tweets.transform(tweetsRDD => {
tweetsRDD.join(spamHDFSFile).filter(...)
})
DStream Input Sources
Out of the box we provide:
» Kafka
» Twitter
» HDFS
» Flume
» Raw TCP sockets
Very simple API to write a receiver for your own data source!
Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
» Tested with 100 text streams on 100 EC2 instances with 4 cores each

[Charts: cluster throughput (GB/s) vs. # nodes in cluster: Grep reaches ~6 GB/s at a 1 sec batch size; WordCount reaches ~3.5 GB/s at 1 sec and 2 sec batch sizes. High throughput and low latency.]
Comparison with Storm
Higher throughput than Storm
» Spark Streaming: 670k records/second/node
» Storm: 115k records/second/node

[Charts: throughput per node (MB/s) vs. record size (100 and 10000 bytes) for Grep and WordCount: Spark outperforms Storm at both record sizes.]
Fast Fault Recovery
Recovers from faults/stragglers within 1 sec
Real Applications: Traffic Sensing
Traffic transit time estimation using online machine learning on GPS observations
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive, requires dozens of machines for useful computation
• Scales linearly with cluster size
[Chart: GPS observations processed per second vs. # nodes in cluster, scaling linearly to ~2000 observations/sec at 80 nodes]
Unifying Batch and Stream Models
Spark program on a Twitter log file using RDDs:
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")

Spark Streaming program on the Twitter stream using DStreams:
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

The same code base works for both batch processing and stream processing
Conclusion
Berkeley Data Analytics Stack
» Next generation data analytics stack with speed and functionality
More information: www.spark-project.org
Hands-on Tutorials: ampcamp.berkeley.edu
» Video tutorials, EC2 exercises
» AMP Camp 2 – August 29-30, 2013