What's New in the Berkeley Data Analytics Stack
Tathagata Das, Reynold Xin (AMPLab, UC Berkeley)
Hadoop Summit 2013
Berkeley Data Analytics Stack
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Today’s Talk
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Project History
2010: Spark (core execution engine) open sourced
2012: Shark open sourced
Feb 2013: Spark Streaming alpha open sourced
Jun 2013: Spark entered Apache Incubator
Community
3000+ people online training
800+ meetup members
60+ developers contributing
17 companies contributing
Hadoop and continuous computing: looking beyond MapReduce
Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms, Yahoo!
Hadoop Summit 2013 Keynote
2012 Hadoop Summit
2012 Hadoop Summit (Future of Apache Hadoop)
2013 Hadoop Summit
2013 Hadoop Summit (Hadoop Economics)
Today’s Talk
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Spark
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk)
Often 5× less code
Why a New Framework?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects
» Can optionally be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure
Programming interface
» Functional APIs in Scala, Java, Python
» Interactive use from the Scala and Python shells
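The RDD idea above (lazy transformations, recomputation from lineage) can be sketched in plain Python. This is a toy illustration, not Spark: `LazyRDD` is a hypothetical class that records how to compute its data rather than the data itself, so the data can always be recomputed after a failure:

```python
# Toy sketch of the RDD idea: transformations build a lineage graph lazily;
# data is only materialized when an action (collect/count) runs, and can be
# recomputed from lineage at any time -- this is what makes RDDs "resilient".
class LazyRDD:
    def __init__(self, compute):
        self._compute = compute          # zero-arg function, re-runnable on failure

    @staticmethod
    def parallelize(data):
        return LazyRDD(lambda: list(data))

    def map(self, f):                    # transformation: nothing runs yet
        return LazyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, p):                 # transformation: nothing runs yet
        return LazyRDD(lambda: [x for x in self._compute() if p(x)])

    def collect(self):                   # action: materializes the lineage
        return self._compute()

    def count(self):                     # action: materializes the lineage
        return len(self._compute())

lines = LazyRDD.parallelize(["ERROR foo", "INFO ok", "ERROR bar"])
errors = lines.filter(lambda s: s.startswith("ERROR"))  # no work done yet
print(errors.count())  # lineage evaluated here -> 2
```

Real RDDs partition the collection across workers and cache partitions in memory, but the lazy lineage graph is the same idea.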
Example: Log Mining
Exposes RDDs through a functional API in Java, Python, Scala

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
errors.persist()
errors.filter(_.contains("foo")).count()      // action
errors.filter(_.contains("bar")).count()

[Diagram: the master sends tasks to workers; each worker scans its block of the file, caches its partition of the errors RDD in memory, and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: 1 TB data in 5 sec (vs 170 sec for on-disk data)
Spark: Expressive API
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
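As a rough illustration of what a couple of these operators compute, here are plain-Python analogues of `reduceByKey` and `cogroup` over lists of key-value pairs (toy code, not Spark's distributed implementation):

```python
from collections import defaultdict

def reduce_by_key(pairs, f):
    """Plain-Python analogue of Spark's reduceByKey: merge values per key."""
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

def cogroup(a, b):
    """Plain-Python analogue of cogroup: group both datasets' values by key."""
    groups = defaultdict(lambda: ([], []))
    for k, v in a:
        groups[k][0].append(v)
    for k, v in b:
        groups[k][1].append(v)
    return sorted(groups.items())

# Word-count-style aggregation, then a two-sided grouping:
counts = reduce_by_key([("a", 1), ("b", 1), ("a", 1)], lambda x, y: x + y)
print(counts)                                    # [('a', 2), ('b', 1)]
print(cogroup([("a", 1)], [("a", 2), ("c", 3)]))
```

In Spark these run in parallel across partitions, with a shuffle moving each key's values to one machine first.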
Machine Learning Algorithms
Time per iteration (s):
Logistic Regression: Hadoop MR 110, Spark 0.96
K-Means Clustering: Hadoop MR 155, Spark 4.1
Spark in Java and Python

Python API:
lines = spark.textFile(…)
errors = lines.filter(lambda s: "ERROR" in s)
errors.count()

Java API:
JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("ERROR"); }
  });
errors.count();
Projects Building on Spark
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
GraphX
Combining data-parallel and graph-parallel
» Run graph analytics and ETL in the same engine
» Consume graph computation output in Spark
» Interactive shell
Programmability
» Support GraphLab / Pregel APIs in 20 LOC
» Implement PageRank in 5 LOC
Coming this summer as a Spark module
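The slide claims PageRank in a handful of lines; outside GraphX, the same iteration can be sketched in plain Python (the three-node graph and the conventional 0.85 damping factor are assumptions for illustration):

```python
def pagerank(links, iters=50, d=0.85):
    """links: dict node -> list of out-neighbours. Returns a rank per node.
    Each iteration, every page splits its rank among its out-links, and
    ranks are refreshed as (1 - d)/N + d * (received contributions)."""
    nodes = list(links)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        contribs = {n: 0.0 for n in nodes}
        for n, outs in links.items():
            for m in outs:                       # page n shares its rank
                contribs[m] += ranks[n] / len(outs)
        ranks = {n: (1 - d) / len(nodes) + d * c for n, c in contribs.items()}
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c" receives links from both a and b
```

GraphX expresses the same loop as message passing over a distributed graph, which is why it fits in a few lines there too.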
Scalable Machine Learning
Build a classifier for X
What you want to do vs. what you have to do:
• Learn the internals of ML classification algorithms, sampling, feature selection, cross-validation, …
• Potentially learn Spark/Hadoop/…
• Implement 3-4 algorithms
• Implement grid search to find the right algorithm parameters
• Implement validation algorithms
• Experiment with different sample sizes, algorithms, features
• …
… and in the end, ask for help
MLBase
Making large scale machine learning easy
» User specifies the task (e.g. “classify this dataset”)
» MLBase picks the best algorithm and best parameters for the task
Develop scalable, high-quality ML algorithms
» Naïve Bayes
» Logistic / Least Squares Regression (L1/L2 regularization)
» Matrix Factorization (ALS, CCD)
» K-Means & DP-Means
» Optimization library: SGD, FISTA, ADMM
First release (summer): collection of scalable algorithms
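Of the algorithms listed, K-Means is simple enough to sketch in plain Python (Lloyd's iterations on one-dimensional toy data; this is an illustration of the algorithm, not the MLBase implementation):

```python
def kmeans(points, centers, iters=10):
    """One-dimensional Lloyd's algorithm: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Recompute centers; keep a center unchanged if it got no points.
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans(data, [0.0, 10.0]))  # converges to centers near 1.0 and 9.0
```

A scalable version distributes the assignment step across the cluster and aggregates per-center sums, which is a natural fit for Spark's reduceByKey.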
Today’s Talk
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Shark
Hive compatible: HiveQL, UDFs, metadata, etc.
» Works in existing Hive warehouses without changing queries or data!
Fast execution engine
» Uses Spark as the underlying execution engine
» Low-latency, interactive queries
» Scales out and tolerates worker failures
Easy to combine with Spark
» Process data with SQL queries as well as raw Spark ops
Real-world Performance
1.7 TB Real Warehouse Data on 100 EC2 nodes
Comparison: http://tinyurl.com/bigdata-benchmark
Today’s Talk
Spark Streaming | Shark (SQL) | GraphX | MLBase
Spark
Mesos / YARN (Resource Manager)
HDFS / Hadoop Storage
Spark Streaming
Extends Spark for large scale stream processing
» Receive data directly from Kafka, Flume, Twitter, etc.
» Fast, scalable, and fault-tolerant
Simple, yet rich batch-like API
» Easy to express your complex streaming computation
» Fault-tolerant, stateful stream processing out of the box
» Easy to inter-mix batch and stream processing
Extends Spark for doing large scale stream processing
Scales to 100s of nodes and achieves second scale latencies
Efficient and fault-tolerant stateful stream processing
Integrates with Spark’s batch and interactive processing
Provides a simple batch-like API for implementing complex algorithms
Motivation
Many important applications must process large streams of live data and provide results in near-real-time
» Social network trends
» Website statistics
» Intrusion detection systems
» Etc.
Challenges
Require large clusters
Require latencies of few seconds
Require fault-tolerance
Require integration with batch processing
Integration with Batch Processing
Many environments require processing same data in live streaming as well as batch post-processing
Hard for any existing single framework to achieve both
» Provide low latency for streaming workloads
» Handle large volumes of data for batch workloads
Extremely painful to maintain two stacks
» Different programming models
» Double the implementation effort
» Double the number of bugs
Existing Streaming Systems
Storm – limited fault-tolerance guarantees
» Replays records if not processed
» Processes each record at least once
» May double-count events!
» Mutable state can be lost due to failure!
Trident – uses transactions to update state
» Processes each record exactly once
» Per-state transaction to an external database is slow
Neither integrates well with batch processing systems
Spark Streaming
Discretized Stream Processing - run a streaming computation as a series of very small, deterministic batch jobs:
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

• Batch sizes as low as ½ second, latency ~1 second
• Potential for combining batch processing and streaming processing in the same system
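The discretized-stream idea (chop the live stream into batches and run the same batch code on each) can be mimicked in a few lines of plain Python; the timestamps and the one-second batch interval here are invented for illustration:

```python
def discretize(events, batch_interval=1.0):
    """Group (timestamp, record) events into consecutive micro-batches,
    like Spark Streaming chopping a live stream into batches of X seconds."""
    batches = {}
    for t, record in events:
        batches.setdefault(int(t // batch_interval), []).append(record)
    return [batch for _, batch in sorted(batches.items())]

def process(batch):
    """The same 'batch job' runs on every micro-batch (here: count ERRORs)."""
    return sum(1 for r in batch if r.startswith("ERROR"))

events = [(0.1, "ERROR a"), (0.7, "INFO b"), (1.2, "ERROR c"), (2.5, "ERROR d")]
print([process(b) for b in discretize(events)])  # one result per micro-batch
```

Because each micro-batch is an ordinary deterministic batch job, a lost batch can simply be recomputed, which is where the fault-tolerance story comes from.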
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
DStream: a sequence of RDDs representing a stream of data
[Diagram: the tweets DStream as a sequence of batches (@ t, t+1, t+2) fed by the Twitter Streaming API, each batch stored in memory as an RDD (immutable, distributed)]
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap applied to each batch of the tweets DStream creates new RDDs for every batch of the new hashTags DStream: [#cat, #dog, …]]
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage

[Diagram: flatMap runs on every batch of the tweets DStream, and every resulting hashTags batch is saved to HDFS]
Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { … })

foreach: do whatever you want with the processed data

[Diagram: foreach runs on each batch of the hashTags DStream: write to a database, update an analytics UI, whatever you want]
Window-based Transformations
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

[Diagram: a sliding window over the DStream, with window length 1 minute and sliding interval 5 seconds]
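As a plain-Python sketch of the windowed count above, with a window of 3 batches sliding 1 batch at a time standing in for `window(Minutes(1), Seconds(5))` (toy data, not Spark):

```python
from collections import Counter

def windowed_counts(batches, window_len=3, slide=1):
    """For each window position, count tag occurrences across the last
    `window_len` batches, advancing by `slide` batches each step --
    a plain-Python analogue of window(...).countByValue()."""
    results = []
    for end in range(window_len, len(batches) + 1, slide):
        window = batches[end - window_len:end]
        results.append(Counter(tag for batch in window for tag in batch))
    return results

batches = [["#cat"], ["#dog", "#cat"], ["#cat"], ["#dog"]]
for counts in windowed_counts(batches):
    print(dict(counts))  # one count table per window position
```

Spark Streaming computes these windows incrementally (adding the new batch, subtracting the expired one) instead of recounting the whole window each time.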
Arbitrary Stateful Computations
Specify function to generate new state based on previous state and new data
»Example: Maintain per-user mood as state, and update it with their tweets
updateMood(newTweets, lastMood) => newMood
moods = tweets.updateStateByKey(tweets => updateMood(tweets))
» Exactly-once semantics even under worker failures
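The shape of this per-key state update can be sketched in plain Python; `update_state` mirrors the updateStateByKey contract (new values plus previous state per key), while the mood scoring itself is invented for illustration:

```python
def update_state(state, new_values_by_key, update_fn):
    """For every key with new values, compute new state from
    (new values, previous state) -- the updateStateByKey contract."""
    out = dict(state)
    for key, values in new_values_by_key.items():
        out[key] = update_fn(values, state.get(key))
    return out

def update_mood(new_tweets, last_mood):
    """Toy mood score: previous mood plus +1/-1 per happy/other tweet."""
    score = last_mood or 0
    for t in new_tweets:
        score += 1 if "happy" in t else -1
    return score

state = {}
state = update_state(state, {"alice": ["so happy today"]}, update_mood)
state = update_state(state, {"alice": ["sad news"], "bob": ["happy!"]}, update_mood)
print(state)  # per-user mood carried across batches
```

In Spark Streaming the state itself is an RDD, so it is partitioned across the cluster and recoverable from lineage, which is what makes the exactly-once claim possible.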
Arbitrary Combination of Batch and Streaming Computations
Inter-mix RDD and DStream operations!
» Example: Join incoming tweets with a spam HDFS file to filter out bad tweets
tweets.transform(tweetsRDD => {
tweetsRDD.join(spamHDFSFile).filter(...)
})
DStream Input Sources
Out of the box we provide:
» Kafka
» Twitter
» HDFS
» Flume
» Raw TCP sockets
Very simple API to write a receiver for your own data source!
Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
» Tested with 100 text streams on 100 EC2 instances with 4 cores each

[Charts: cluster throughput (GB/s) vs. # nodes in cluster: Grep reaches ~6 GB/s at a 1 sec batch size; WordCount reaches ~3.5 GB/s at 1 sec and 2 sec batch sizes. High throughput and low latency.]
Comparison with Storm
Higher throughput than Storm
» Spark Streaming: 670k records/second/node
» Storm: 115k records/second/node

[Charts: throughput per node (MB/s) vs. record size (100 and 10000 bytes) for Grep and WordCount: Spark outperforms Storm at both record sizes.]
Fast Fault Recovery
Recovers from faults/stragglers within 1 sec
Real Applications: Traffic Sensing
Traffic transit time estimation using online machine learning on GPS observations
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive, requires dozens of machines for useful computation
• Scales linearly with cluster size
[Chart: GPS observations processed per second vs. # nodes in cluster, scaling linearly to ~2000 observations/sec at 80 nodes]
Unifying Batch and Stream Models
Spark program on a Twitter log file using RDDs:
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")

Spark Streaming program on the Twitter stream using DStreams:
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

The same code base works for both batch processing and stream processing
Conclusion
Berkeley Data Analytics Stack
» Next generation data analytics stack with speed and functionality
More information: www.spark-project.org
Hands-on Tutorials: ampcamp.berkeley.edu
» Video tutorials, EC2 exercises
» AMP Camp 2 – August 29-30, 2013