What's New in the Berkeley Data Analytics Stack

What's New in the Berkeley Data Analytics Stack
Tathagata Das, Reynold Xin (AMPLab, UC Berkeley)
Hadoop Summit 2013

Description

The Berkeley Data Analytics Stack (BDAS) aims to address emerging challenges in data analysis through a set of systems, including Spark, Shark and Mesos, that enable faster and more powerful analytics. In this talk, we'll cover two recent additions to BDAS:

* Spark Streaming is an extension of Spark that enables high-speed, fault-tolerant stream processing through a high-level API. It uses a new processing model called "discretized streams" to enable fault-tolerant stateful processing with exactly-once semantics, without the costly transactions required by existing systems. This lets applications process much higher rates of data per node. It also makes programming streaming applications easier by providing a set of high-level operators on streams (e.g. maps, filters, and windows) in Java and Scala.

* Shark is a Spark-based data warehouse system compatible with Hive. It can answer HiveQL queries up to 100 times faster than Hive without modification to existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions. It employs a number of novel and traditional database optimization techniques, including column-oriented storage and mid-query replanning, to efficiently execute SQL on top of Spark. The system is in early use at companies including Yahoo! and Conviva.

Transcript of What's New in the Berkeley Data Analytics Stack

Page 1: What's New in the Berkeley Data Analytics Stack

What's New in the Berkeley Data Analytics Stack
Tathagata Das, Reynold Xin (AMPLab, UC Berkeley)

Hadoop Summit 2013

Page 2: What's New in the Berkeley Data Analytics Stack

Berkeley Data Analytics Stack

Shark (SQL) | Spark Streaming | GraphX | MLBase

Spark

Mesos / YARN Resource Manager

HDFS / Hadoop Storage

Page 3: What's New in the Berkeley Data Analytics Stack

Today’s Talk

Shark (SQL) | Spark Streaming | GraphX | MLBase

Spark

Mesos / YARN Resource Manager

HDFS / Hadoop Storage

Page 4: What's New in the Berkeley Data Analytics Stack

Project History

2010: Spark (core execution engine) open sourced

2012: Shark open sourced

Feb 2013: Spark Streaming alpha open sourced

Jun 2013: Spark entered Apache Incubator

Page 5: What's New in the Berkeley Data Analytics Stack

Community

3000+ people online training

800+ meetup members

60+ developers contributing

17 companies contributing

Page 6: What's New in the Berkeley Data Analytics Stack

Hadoop and continuous computing: looking beyond MapReduce
Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms, Yahoo!
Hadoop Summit 2013 Keynote

Page 7: What's New in the Berkeley Data Analytics Stack

2012 Hadoop Summit

Page 8: What's New in the Berkeley Data Analytics Stack

2012 Hadoop Summit (Future of Apache Hadoop)

Page 9: What's New in the Berkeley Data Analytics Stack

2012 Hadoop Summit (Future of Apache Hadoop)

2013 Hadoop Summit

Page 10: What's New in the Berkeley Data Analytics Stack

2012 Hadoop Summit (Future of Apache Hadoop)

2013 Hadoop Summit (Hadoop Economics)

Page 11: What's New in the Berkeley Data Analytics Stack

Today’s Talk

Shark (SQL) | Spark Streaming | GraphX | MLBase

Spark

Mesos / YARN Resource Manager

HDFS / Hadoop Storage

Page 12: What's New in the Berkeley Data Analytics Stack

Spark

Fast and expressive cluster computing system interoperable with Apache Hadoop

Improves efficiency through:
» In-memory computing primitives
» General computation graphs

Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell

Up to 100× faster (2-10× on disk)

Often 5× less code

Page 13: What's New in the Berkeley Data Analytics Stack

Why a New Framework?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:

»More complex, multi-pass analytics (e.g. ML, graph)

» More interactive ad-hoc queries
» More real-time stream processing

Page 14: What's New in the Berkeley Data Analytics Stack

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)

» Distributed collections of objects
» Can optionally be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure

Programming interface:
» Functional APIs in Scala, Java, Python
» Interactive use from Scala and Python shells
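The recompute-on-failure idea above can be illustrated with a toy, single-machine sketch in plain Python. This is NOT the Spark API; `ToyRDD` is a made-up class that only shows how remembering lineage (the parent dataset plus the transformation) lets a lost in-memory copy be rebuilt deterministically:

```python
# Toy sketch of the RDD idea: each dataset remembers the lineage used to
# build it, so lost data is recomputed rather than recovered from a log.

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute      # lineage: how to (re)build the data
        self._cache = None           # optionally materialized in memory

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    def persist(self):
        self._cache = self._compute()
        return self

    def collect(self):
        # If the cached copy was lost (e.g. node failure), recompute
        # deterministically from lineage.
        return self._cache if self._cache is not None else self._compute()

lines = ToyRDD(lambda: ["ERROR foo", "INFO ok", "ERROR bar"])
errors = lines.filter(lambda s: s.startswith("ERROR")).persist()

errors._cache = None          # simulate losing the in-memory copy
print(errors.collect())       # recomputed from lineage: ['ERROR foo', 'ERROR bar']
```

The real system does this per partition across a cluster; the sketch only captures the lineage-based recovery, not distribution or scheduling.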

Page 15: What's New in the Berkeley Data Analytics Stack

Example: Log Mining

Exposes RDDs through a functional API in Java, Python, Scala

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
errors.persist()                               // cache in memory

errors.filter(_.contains("foo")).count()       // action
errors.filter(_.contains("bar")).count()

[Diagram: the master sends tasks to workers; each worker scans its cached errors partition (Errors 1-3, built from HDFS Blocks 1-3) and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: 1 TB data in 5 sec (vs 170 sec for on-disk data)

Page 16: What's New in the Berkeley Data Analytics Stack

Spark: Expressive API

map

filter

groupBy

sort

union

join

leftOuterJoin

rightOuterJoin

reduce

count

fold

reduceByKey

groupByKey

cogroup

cross

zip

sample

take

first

partitionBy

mapWith

pipe

save

...

Page 17: What's New in the Berkeley Data Analytics Stack

Machine Learning Algorithms

Time per iteration (s):

Logistic Regression: Hadoop MR 110, Spark 0.96

K-Means Clustering: Hadoop MR 155, Spark 4.1

Page 18: What's New in the Berkeley Data Analytics Stack

Spark in Java and Python

Python API:

lines = spark.textFile(…)
errors = lines.filter(lambda s: "ERROR" in s)
errors.count()

Java API:

JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("ERROR"); }
  });
errors.count();

Page 19: What's New in the Berkeley Data Analytics Stack

Projects Building on Spark

Shark (SQL) | Spark Streaming | GraphX | MLBase

Spark

Mesos / YARN Resource Manager

HDFS / Hadoop Storage

Page 20: What's New in the Berkeley Data Analytics Stack

GraphX

Combining data-parallel and graph-parallel

»Run graph analytics and ETL in the same engine

»Consume graph computation output in Spark

»Interactive shell

Programmability:
» Support GraphLab / Pregel APIs in 20 LOC
» Implement PageRank in 5 LOC

Coming this summer as a Spark module

Page 21: What's New in the Berkeley Data Analytics Stack

Scalable Machine Learning

Build a Classifier for X

What you want to do vs. what you have to do:

• Learn the internals of ML: classification algorithms, sampling, feature selection, cross-validation, ...
• Potentially learn Spark/Hadoop/...
• Implement 3-4 algorithms
• Implement grid search to find the right algorithm parameters
• Implement validation algorithms
• Experiment with different sampling sizes, algorithms, features
• ...

... and in the end: ask for help

Page 22: What's New in the Berkeley Data Analytics Stack

MLBase

Making large scale machine learning easy

»User specifies the task (e.g. “classify this dataset”)

»MLBase picks the best algorithm and best parameters for the task

Develop scalable, high-quality ML algorithms

» Naïve Bayes
» Logistic / Least Squares Regression (L1/L2 regularization)
» Matrix Factorization (ALS, CCD)
» K-Means & DP-Means
» Optimization library: SGD, FISTA, ADMM

First release (summer): collection of scalable algorithms
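As a rough illustration of what the SGD entry in the optimization library above refers to, here is stochastic gradient descent for logistic regression in plain single-machine Python. This is not MLBase code; the function name, data layout, and hyperparameters are all invented for the sketch, and the real system would run the updates over distributed data:

```python
import math, random

# Plain-Python sketch of SGD for logistic regression (illustrative only).

def sgd_logreg(data, dim, lr=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:                 # y in {0, 1}
            margin = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-margin))   # predicted probability
            g = p - y                     # gradient of log-loss wrt margin
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

# Tiny linearly separable example: label is 1 iff the second feature is large.
data = [([0.0, 1.0], 1), ([1.0, 0.0], 0), ([0.2, 0.9], 1), ([0.9, 0.1], 0)]
w = sgd_logreg(data, dim=2)
print(w[1] > w[0])   # the learned weights favor the second feature
```

Grid search over `lr` and regularization strength, plus validation, is exactly the boilerplate the previous slide says MLBase aims to automate.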

Page 23: What's New in the Berkeley Data Analytics Stack

Today’s Talk

Shark (SQL) | Spark Streaming | GraphX | MLBase

Spark

Mesos / YARN Resource Manager

HDFS / Hadoop Storage

Page 24: What's New in the Berkeley Data Analytics Stack

Shark

Hive compatible: HiveQL, UDFs, metadata, etc.

»Works in existing Hive warehouses without changing queries or data!

Fast execution engine:
» Uses Spark as the underlying execution engine
» Low-latency, interactive queries
» Scales out and tolerates worker failures

Easy to combine with Spark:
» Process data with SQL queries as well as raw Spark ops

Page 25: What's New in the Berkeley Data Analytics Stack

Real-world Performance

1.7 TB Real Warehouse Data on 100 EC2 nodes

Page 26: What's New in the Berkeley Data Analytics Stack

Comparison: http://tinyurl.com/bigdata-benchmark

Page 27: What's New in the Berkeley Data Analytics Stack

Today’s Talk

Shark (SQL) | Spark Streaming | GraphX | MLBase

Spark

Mesos / YARN Resource Manager

HDFS / Hadoop Storage

Page 28: What's New in the Berkeley Data Analytics Stack

Spark Streaming

Extends Spark for large scale stream processing

»Receive data directly from Kafka, Flume, Twitter, etc.

»Fast, scalable, and fault-tolerant

Simple, yet rich batch-like API:
» Easy to express your complex streaming computation
» Fault-tolerant, stateful stream processing out of the box
» Easy to inter-mix batch and stream processing

Scales to 100s of nodes and achieves second-scale latencies

Efficient and fault-tolerant stateful stream processing

Integrates with Spark’s batch and interactive processing

Provides a simple batch-like API for implementing complex algorithms

Page 29: What's New in the Berkeley Data Analytics Stack

Motivation

Many important applications must process large streams of live data and provide results in near-real-time

» Social network trends
» Website statistics
» Intrusion detection systems
» Etc.

Page 30: What's New in the Berkeley Data Analytics Stack

Challenges

Require large clusters

Require latencies of a few seconds

Require fault-tolerance

Require integration with batch processing

Page 31: What's New in the Berkeley Data Analytics Stack

Integration with Batch Processing

Many environments require processing the same data both in live streaming and in batch post-processing

Hard for any single existing framework to achieve both:
» Provide low latency for streaming workloads
» Handle large volumes of data for batch workloads

Extremely painful to maintain two stacks:
» Different programming models
» Double the implementation effort
» Double the number of bugs

Page 32: What's New in the Berkeley Data Analytics Stack

Existing Streaming Systems

Storm – limited fault-tolerance guarantees:
» Replays records if not processed
» Processes each record at least once
» May double count events!
» Mutable state can be lost due to failure!

Trident – uses transactions to update state:
» Processes each record exactly once
» Per-state transaction to an external database is slow

Neither integrates well with batch processing systems

Page 33: What's New in the Berkeley Data Analytics Stack

Spark Streaming

• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Discretized Stream Processing: run a streaming computation as a series of very small, deterministic batch jobs
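The chop-into-batches step can be sketched in a few lines of plain Python (not the Spark API; here a micro-batch is a fixed record count rather than a time interval, purely for illustration):

```python
import itertools

# Toy illustration of discretized streams: chop an incoming record stream
# into micro-batches and run the same deterministic batch function on each.

def micro_batches(stream, batch_size):
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

def process(batch):
    # any deterministic batch computation, e.g. count ERROR records
    return sum(1 for rec in batch if "ERROR" in rec)

stream = ["ERROR a", "ERROR b", "ok", "ok", "ERROR c", "ok"]
results = [process(b) for b in micro_batches(stream, batch_size=2)]
print(results)   # → [2, 0, 1], one result per micro-batch
```

Because each micro-batch job is deterministic, rerunning a failed batch yields the same result, which is the basis for the exactly-once semantics discussed later.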

Page 34: What's New in the Berkeley Data Analytics Stack

Spark Streaming

Discretized Stream Processing: run a streaming computation as a series of very small, deterministic batch jobs
• Batch sizes as low as ½ second, latency ~1 second
• Potential for combining batch processing and streaming processing in the same system

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Page 35: What's New in the Berkeley Data Analytics Stack

Example: Get Twitter Hashtags

val tweets = ssc.twitterStream(<username>, <password>)

DStream: a sequence of RDDs representing a stream of data

[Diagram: the Twitter Streaming API feeds the tweets DStream; each batch (@ t, @ t+1, @ t+2) is stored in memory as an immutable, distributed RDD]

Page 36: What's New in the Berkeley Data Analytics Stack

Example: Get Twitter Hashtags

val tweets = ssc.twitterStream(<username>, <password>)

val hashTags = tweets.flatMap (status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap runs on every batch of the tweets DStream, creating a new RDD for each batch of the new hashTags DStream (e.g. [#cat, #dog, ...])]

Page 37: What's New in the Berkeley Data Analytics Stack

Example: Get Twitter Hashtags

val tweets = ssc.twitterStream(<username>, <password>)

val hashTags = tweets.flatMap (status => getTags(status))

hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage

[Diagram: flatMap runs on every batch of the tweets DStream; each resulting batch of the hashTags DStream is saved to HDFS]

Page 38: What's New in the Berkeley Data Analytics Stack

Example: Get Twitter Hashtags

val tweets = ssc.twitterStream(<username>, <password>)

val hashTags = tweets.flatMap (status => getTags(status))

hashTags.foreach(hashTagRDD => { … })

foreach: do whatever you want with the processed data

[Diagram: flatMap runs on every batch of the tweets DStream; foreach runs on every batch of the hashTags DStream]

Write to a database, update an analytics UI, do whatever you want

Page 39: What's New in the Berkeley Data Analytics Stack

Window-based Transformations

val tweets = ssc.twitterStream(<username>, <password>)

val hashTags = tweets.flatMap (status => getTags(status))

val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

[Diagram: a sliding window operation over the DStream, parameterized by window length and sliding interval]
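The window/countByValue combination above can be mimicked with a toy plain-Python sketch (not the Spark API). Here the window length is measured in micro-batches and the sliding interval is one batch, both simplifying assumptions:

```python
from collections import Counter, deque

# Toy sketch of a windowed count over micro-batches:
# window length = 3 batches, sliding interval = 1 batch.

def windowed_counts(batches, window_len):
    window = deque(maxlen=window_len)   # holds the last window_len batches
    for batch in batches:
        window.append(batch)            # slide: oldest batch drops out
        # countByValue over everything currently inside the window
        yield Counter(tag for b in window for tag in b)

batches = [["#cat"], ["#dog", "#cat"], ["#dog"], ["#cat"]]
for counts in windowed_counts(batches, window_len=3):
    print(dict(counts))
```

A real implementation would avoid recounting the whole window on every slide by adding the entering batch and subtracting the leaving one; the sketch keeps the semantics, not the optimization.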

Page 40: What's New in the Berkeley Data Analytics Stack

Arbitrary Stateful Computations

Specify function to generate new state based on previous state and new data

»Example: Maintain per-user mood as state, and update it with their tweets

updateMood(newTweets, lastMood) => newMood

moods = tweets.updateStateByKey(tweets => updateMood(tweets))

»Exactly-once semantics even under worker failures
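A toy plain-Python version of this updateStateByKey-style pattern (not the Spark API; the running-count example and all names are invented for the sketch) shows why the update function takes both the new values and the previous state:

```python
# Toy sketch of per-key stateful stream processing: state for each key is
# produced by a pure function of (new values in this batch, previous state),
# so rerunning a failed batch deterministically yields the same state.

def update_state(batches, update_fn):
    state = {}
    for batch in batches:               # each batch: list of (key, value)
        by_key = {}
        for k, v in batch:
            by_key.setdefault(k, []).append(v)
        for k, vs in by_key.items():
            state[k] = update_fn(vs, state.get(k))
        yield dict(state)               # state after this batch

# Example: running count of tweets per user.
def update_count(new_vals, last):
    return (last or 0) + len(new_vals)

batches = [[("alice", "hi"), ("bob", "yo")], [("alice", "again")]]
print(list(update_state(batches, update_count))[-1])   # {'alice': 2, 'bob': 1}
```

Because the state transition is a deterministic function of its inputs, recomputing a lost batch from lineage reproduces the same state, without the per-record transactions the previous slide criticized.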

Page 41: What's New in the Berkeley Data Analytics Stack

Arbitrary Combination of Batch and Streaming Computations

Inter-mix RDD and DStream operations!

»Example: Join incoming tweets with a spam HDFS file to filter out bad tweets

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
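In toy plain-Python form (not the Spark API), the same idea is a per-batch join against a static dataset; `spam_users` stands in for the spam file on HDFS and all names here are invented for the sketch:

```python
# Toy sketch of transform-style batch/stream mixing: each streaming
# micro-batch is checked against a static "spam" dataset and filtered.

spam_users = {"spammer1", "spammer2"}        # stands in for the HDFS spam file

def filter_spam(batch):
    # keep tweets whose author does not appear in the spam set
    return [(user, text) for user, text in batch if user not in spam_users]

batch = [("alice", "hello"), ("spammer1", "buy now"), ("bob", "hi")]
print(filter_spam(batch))   # → [('alice', 'hello'), ('bob', 'hi')]
```

The point of transform is exactly this: any batch (RDD) operation, including joins with datasets produced offline, can run inside the streaming computation.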

Page 42: What's New in the Berkeley Data Analytics Stack

DStream Input Sources

Out of the box we provide:
» Kafka
» Twitter
» HDFS
» Flume
» Raw TCP sockets

Very simple API to write a receiver for your own data source!

Page 43: What's New in the Berkeley Data Analytics Stack

Performance

Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency

» Tested with 100 text streams on 100 EC2 instances with 4 cores each

[Charts: cluster throughput (GB/s) vs. # nodes in cluster, for Grep (1 sec batches) and WordCount (1 sec and 2 sec batches)]

High Throughput and Low Latency

Page 44: What's New in the Berkeley Data Analytics Stack

Comparison with Storm

Higher throughput than Storm:
» Spark Streaming: 670k records/second/node
» Storm: 115k records/second/node

[Charts: throughput per node (MB/s) vs. record size (bytes), for Grep and WordCount, Spark vs. Storm]

Page 45: What's New in the Berkeley Data Analytics Stack

Fast Fault Recovery

Recovers from faults/stragglers within 1 sec

Page 46: What's New in the Berkeley Data Analytics Stack

Real Applications: Traffic Sensing

Traffic transit time estimation using online machine learning on GPS observations

• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive, requires dozens of machines for useful computation
• Scales linearly with cluster size

[Chart: GPS observations processed per second vs. # nodes in cluster]

Page 47: What's New in the Berkeley Data Analytics Stack

Unifying Batch and Stream Models

Spark program on a Twitter log file using RDDs:

val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")

Spark Streaming program on a Twitter stream using DStreams:

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

The same code base works for both batch processing and stream processing

Page 48: What's New in the Berkeley Data Analytics Stack

Conclusion

Berkeley Data Analytics Stack

» Next-generation data analytics stack combining speed and rich functionality

More information: www.spark-project.org

Hands-on Tutorials: ampcamp.berkeley.edu

» Video tutorials, EC2 exercises
» AMP Camp 2: August 29-30, 2013