Lecture 11: Introduction to BigData Analysis Stack
Jeremy Wei, Center for HPC, SJTU
May 8th, 2017
Outline
• What is BigData?
• Hadoop Ecosystem
• Spark’s Improvement over Hadoop
What is Big Data?
• Volume: exceeds the limits of traditional row-and-column relational databases; constantly growing. Requires vertical scalability: the ability to grow storage to accommodate new 'records'.
• Velocity: arrives rapidly, often in real time. Requires data streaming: real-time processing, analysis and transformation.
• Variety: does not have a standard structure, e.g. text, images. Requires horizontal scalability: the ability to add additional data structures.

"Big data are high volume, high velocity, and high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (Gartner 2012)
How is big data generated?
Social media: posts, pictures and
videos
Purchase transaction records
Mobile phone GPS signals
High volume administrative
& transactional records
Sensors gathering information: e.g.
climate, traffic, etc.
Digital satellite images
Pilot 1: Smart meter project
[Figure: Irish smart meter pilot study: total daily electricity consumption (kWh) for a single meter, by day, from July 2009 to December 2010. Annotations mark Christmas 2009, Christmas 2010, and consecutive days with low consumption, possibly a week away.]
Pilot 1: Smart meter project
Research Question: Investigate the potential of smart meter electricity data (high frequency: 30-minute readings) to identify household occupancy levels, and potentially household structure.
• England and Ireland both conducted
pilots of rollout in 2009-2010 – data now
available for research
• Southampton University commissioned
by Beyond 2011 to conduct preliminary
research (due mid Feb 2014)
Pilot 2: Mobile Phone Project
• 4 pilot projects:
– Smart meter
– Mobile Phones
– Prices
– Twitter
• RD&I Research Innovation Labs
Pilot 2: Mobile Phone Project
Research Question: To investigate the possibility of using mobile phone data to model population flows, e.g. travel-to-work statistics
• Location data:
– Telefonica proposal to provide aggregate
data on origin-destination flows
• Requirement to engage with GDS
before proceeding further
Pilot 3: Prices Project
Research Question: To investigate how we can scrape price data from the internet and how this data could be used within price statistics
• 2-day workshop held with big data
experts from Statistics Netherlands
• Focus on groceries
• Early prototype code in place
• Engagement with Billion Prices Project
Pilot 3: Prices by webscraping
Rendered webpage:
HTML code:......
</div><div class="productLists" id="endFacets-1"><ul class="cf products line"><li id="p-254942348-3" class=" first"><div
class="desc"><h3 class="inBasketInfoContainer"><a id="h-254942348" href="/groceries/Product/Details/?id=254942348" class="si_pl_254942348-title"><span class="image"><img
src="http://img.tesco.com/Groceries/pi/121\5010044000121\IDShot_90x90.jpg" alt="" /><!----></span>Warburtons Toastie Sliced
White Bread 800G</a></h3><p class="limitedLife"><a href="http://www.tesco.com/groceries/zones/default.aspx?name=quality-and-
freshness">Delivering the freshest food to your door- Find out more ></a></p><div class="descContent"><!----><div
class="promo"><a href="/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31234788" title="All products available for this offer" id="flyout-254942348-promo-A31234788--pos" class="promoFlyout"><span class="promoImgBox"><img
src="/Groceries/UIAssets/I/Sites/Retail/Superstore/Online/Product/pos/2for.png" class="promoFlyout promo" alt="Special Offer"
id="flyout-254942348-promo-A31234788--posimg" /></span><em>Any 2 for £2.00</em></a><span> valid from 21/1/2014 until
10/2/2014</span></div><div class="tools"><div class="moreInfo"><a href="/groceries/Product/Details/?id=254942348"
class="midiFlyout" id="flyout-254942348-midi-0-"><img class="midiFlyout hd" src="http://ui.tescoassets.com/groceries/UIAssets/I/../Compressed/I_635209615845382232/Sites/Retail/Superstore/Online/Product/i
nfoBlue.gif" alt="" title="View product information" id="flyout-254942348-midi-1-" /></a></div><!----><div
class="links"><ul><li><a
href="http://www.tesco.com/groceries/product/browse/default.aspx?notepad=white%20sliced%20loaf%20800g&N=4294793217"
class="shelfFlyout active plaintooltip" id="s-tt-254942348" title="Premium White Bread"> Rest of <span class="hide">Premium White Bread <!----></span>shelf </a></li></ul></div></div></div></div><div class="quantity"><div class="content addToBasket"><p
class="price"><span class="linePrice">£1.45<!----></span><span class="linePriceAbbr"> (£0.18/100g)</span></p><h4
class="hide">Add to basket</h4><form method="post" id="fMultisearch-254942348"
.....
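To make the scraping step concrete, here is a minimal sketch (Python standard library only) of pulling a product name and price out of markup like the snippet above. The embedded `html` string is an abridged, hand-trimmed fragment of that snippet, and the regular expressions are illustrative assumptions about this particular markup, not production-grade parsing.

```python
import re

# Abridged fragment modeled on the HTML snippet above (not the full page).
html = (
    '<span class="image"><img src="IDShot_90x90.jpg" alt="" /><!----></span>'
    'Warburtons Toastie Sliced White Bread 800G</a></h3>'
    '<p class="price"><span class="linePrice">£1.45<!----></span>'
    '<span class="linePriceAbbr"> (£0.18/100g)</span></p>'
)

# Product name: the text between the closing image span and the closing anchor.
name_match = re.search(r'</span>([^<]+)</a>', html)

# Price: the number following the pound sign inside the linePrice span.
price_match = re.search(r'class="linePrice">£([\d.]+)', html)

name = name_match.group(1) if name_match else None
price = float(price_match.group(1)) if price_match else None

print(name)   # Warburtons Toastie Sliced White Bread 800G
print(price)  # 1.45
```

In practice a real scraper would use an HTML parser and handle pagination and site changes; the point here is only that the price and description are recoverable from the raw markup.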
Pilot 3: The Billion Prices Project @ MIT
[Figure: Daily Online Price Index (United States); annotation marks Lehman Brothers filing for bankruptcy (15 Sept 2008).]
Pilot 4: Twitter Project
Research Question: To investigate how to capture geo-located tweets from Twitter and how this data might provide insights on commuting patterns and internal migration
• Opportunity to start experimenting early on with big
data technologies
• Pilot work has successfully harvested geo-located
tweets from the live Twitter feed using Python and
Twitter API
• Need to determine whether planned application will
exceed rate-limits
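A minimal sketch of the filtering step behind such a pilot, assuming tweet records shaped like Twitter API v1.1 statuses (a "coordinates" field holding a GeoJSON point as [longitude, latitude]). The live-stream connection itself is omitted, and the sample records are invented for illustration.

```python
import json
from collections import Counter

# Sample records shaped like Twitter API v1.1 statuses (only the fields used here).
raw_tweets = [
    '{"id": 1, "text": "morning commute", "coordinates": {"type": "Point", "coordinates": [-0.1276, 51.5072]}}',
    '{"id": 2, "text": "no location here", "coordinates": null}',
    '{"id": 3, "text": "lunch break", "coordinates": {"type": "Point", "coordinates": [-0.1280, 51.5069]}}',
]

def geo_located(records):
    """Yield (lon, lat) for records that carry a GeoJSON point."""
    for line in records:
        tweet = json.loads(line)
        coords = tweet.get("coordinates")
        if coords and coords.get("type") == "Point":
            yield tuple(coords["coordinates"])

# Bin points to a coarse grid (~0.01 degree) as a sketch of a density count.
grid = Counter((round(lon, 2), round(lat, 2)) for lon, lat in geo_located(raw_tweets))
print(grid)
```

Aggregating geotagged points onto a grid like this is one simple way the pilot's commuting-pattern question could be explored once tweets are harvested.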
Pilot 4: Twitter Project
[Figure: Temporal patterns of international mobility by selected country.]

Pilot 4: Mobility patterns
[Figure: Mobility patterns between Dover and Calais.]
Implications of Big Data
• Analyze the whole collection, not just samples.
• Long-tailed, low-density, heterogeneous data.
• Correlation, not causation.
Architecture of Big Data Analytics
Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications

• Big Data Sources (raw data): internal and external; multiple formats, multiple locations, multiple applications
• Big Data Transformation (transformed data): middleware; Extract, Transform, Load; data warehouse; traditional formats (CSV, tables)
• Big Data Platforms & Tools: Hadoop, MapReduce, Pig, Hive, Jaql, Zookeeper, HBase, Cassandra, Oozie, Avro, Mahout, and others
• Big Data Analytics Applications: queries, reports, OLAP, data mining
Hadoop = HDFS (storage) + MapReduce (processing)
Source: http://hadoop.apache.org/
Why Hadoop?
MapReduce Workflow
MapReduce
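Since the original slides illustrate MapReduce with figures, the model can be sketched as a toy single-process word count; the three phases below stand in for what the framework distributes across machines (a sketch of the programming model, not of Hadoop's implementation).

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key (done by the framework in real MapReduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big stack", "data stack"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'stack': 2}
```

The shuffle is where the synchronization barrier discussed later arises: every reducer must wait until all mappers have emitted their pairs.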
HBase: Hadoop Database
Source: Shiva Achari (2015), Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, Packt Publishing

Pig: High-level scripts for Hadoop

Sqoop: SQL importer for Hadoop

Hive: SQL on MapReduce

Impala: High-performance SQL on MapReduce

Hadoop Ecosystem
Limitations of Hadoop
• Needs synchronization before the shuffle, in which the slowest task may become the bottleneck.
• Needs serialization between iterations, which makes it unfit for iterative jobs.
• Batch jobs are NOT optimized for streaming or interactive queries.
• Does NOT utilize memory efficiently.
Next-gen Big Data Processing Goals
• Low latency (interactive) queries on
historical data: enable faster decisions
– E.g., identify why a site is slow and fix it
• Low latency queries on live data (streaming):
enable decisions on real-time data
– E.g., detect & block worms in real time (a
worm may infect 1 million hosts in 1.3 seconds)
• Sophisticated data processing: enable
"better" decisions
– E.g., anomaly detection, trend analysis
Today’s Open Analytics Stack…
• …mostly focused on large on-disk datasets: great for batch but slow

[Stack layers: Application, Data Processing, Storage, Infrastructure]

Goal: one stack to rule them all! Batch, interactive, and streaming.
§ Easy to combine batch, streaming, and interactive computations
§ Easy to develop sophisticated algorithms
§ Compatible with existing open source ecosystem (Hadoop/HDFS)
Support Interactive and Streaming Comp.
• Aggressive use of memory
• Why?
1. Memory transfer rates >> disk or SSDs
2. Many datasets already fit into memory
• Inputs of over 90% of jobs in
Facebook, Yahoo!, and Bing clusters
fit into memory
• e.g., 1TB = 1 billion records @ 1KB
each
3. Memory density (still) grows with Moore’s law
• RAM/SSD hybrid memories on the horizon

High-end datacenter node:
• 16 cores
• 128-512 GB RAM (40-60 GB/s)
• 1-4 TB SSD (1-4 GB/s, x4 disks)
• 10-30 TB disk (0.2-1 GB/s, x10 disks)
• 10 Gbps network
Support Interactive and Streaming Comp.
• Increase parallelism
• Why?
– Reduce work per node → improve latency
• Techniques:
– Low latency parallel scheduler
that achieve high locality
– Optimized parallel communication
patterns (e.g., shuffle, broadcast)
– Efficient recovery from failures
and straggler mitigation
[Figure: higher parallelism shrinks the time to produce a result from T to Tnew (< T).]
Support Interactive and Streaming Comp.
• Trade between result accuracy and response time
• Why?
– In-memory processing does not guarantee interactive query processing
• e.g., ~10s of seconds just to scan 512 GB of RAM!
• The gap between memory capacity and transfer rate is increasing: capacity (128-512 GB) doubles every 18 months, while bandwidth (40-60 GB/s) doubles every 36 months
• Challenges:
– accurately estimate error and running time for…
– … arbitrary computations
Spark
In-Memory Cluster Computing for
Iterative and Interactive Applications
UC Berkeley
Background
• Commodity clusters have become an important computing
platform for a variety of applications
– In industry: search, machine translation, ad targeting, …
– In research: bioinformatics, NLP, climate simulation, …
• High-level cluster programming models like MapReduce
power many of these apps
• Theme of this work: provide similarly powerful abstractions
for a broader class of applications
Motivation
Current popular programming models for clusters transform data flowing from stable storage to stable storage
e.g., MapReduce:
[Figure: data flows from input in stable storage through Map tasks, then Reduce tasks, to output in stable storage.]
Motivation
• Acyclic data flow is a powerful abstraction, but is
not efficient for applications that repeatedly reuse a
working set of data:
– Iterative algorithms (many in machine learning)
– Interactive data mining tools (R, Excel, Python)
• Spark makes working sets a first-class concept to
efficiently support these apps
Spark Goal
• Provide distributed memory abstractions for clusters to
support apps with working sets
• Retain the attractive properties of MapReduce:
– Fault tolerance (for crashes & stragglers)
– Data locality
– Scalability
Solution: augment data flow model with
“resilient distributed datasets” (RDDs)
Programming Model
• Resilient distributed datasets (RDDs)
– Immutable collections partitioned across cluster that can be rebuilt if a partition is lost
– Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
– Can be cached across parallel operations
• Parallel operations on RDDs
– Reduce, collect, count, save, …
• Restricted shared variables
– Accumulators, broadcast variables
Example: Log Mining
•Load error messages from a log into memory,
then interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver ships tasks to the workers; each worker loads its HDFS block into the base RDD, applies the transformations to produce the transformed RDD, caches the resulting partition (cached RDD), and returns the result of each parallel operation to the driver.]
Result: full-text search of
Wikipedia in <1 sec (vs 20 sec
for on-disk data)
RDDs in More Detail
• An RDD is an immutable, partitioned, logical collection of records
– Need not be materialized, but rather contains information to rebuild a dataset from stable storage
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse
RDD Operations
• Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
• Parallel operations / Actions (return a result to the driver): reduce, collect, count, save, lookupKey, …
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to
reconstruct lost partitions
• e.g.:
cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
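The lineage idea can be illustrated with a toy Python class (a hypothetical sketch, not Spark's implementation): each derived dataset records only its parent and the transformation that produced it, so a lost cached partition can be recomputed by replaying the chain from stable storage.

```python
class LineageRDD:
    """Toy dataset that records how it was derived, so it can be rebuilt."""
    def __init__(self, source=None, parent=None, func=None):
        self.source = source      # stable-storage data (for the base RDD)
        self.parent = parent      # upstream LineageRDD (for derived RDDs)
        self.func = func          # transformation applied to the parent
        self.cache = None         # materialized data; may be "lost"

    def map(self, f):
        return LineageRDD(parent=self, func=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return LineageRDD(parent=self, func=lambda data: [x for x in data if pred(x)])

    def compute(self):
        """Return cached data, or replay the lineage chain from stable storage."""
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            return list(self.source)
        return self.func(self.parent.compute())

base = LineageRDD(source=["ERROR\tdisk\tfull", "INFO\tok", "ERROR\tnet\tdown"])
errors = base.filter(lambda line: line.startswith("ERROR"))
fields = errors.map(lambda line: line.split("\t")[2])

fields.cache = fields.compute()   # materialize, like .cache()
print(fields.cache)               # ['full', 'down']

fields.cache = None               # simulate losing the cached partition
print(fields.compute())           # rebuilt from lineage: ['full', 'down']
```

Because the lineage is just a chain of deterministic functions, no replication of the data itself is needed for fault tolerance, which is the key economy of the RDD design.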
Example 1: Logistic Regression
• Goal: find best line separating two sets of points
target
random initial line
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
Logistic Regression Performance
• Hadoop: 127 s / iteration
• Spark: 174 s for the first iteration, 6 s for further iterations
Example 2: MapReduce
• MapReduce data flow can be expressed using RDD
transformations
res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
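As an illustration, the same pipeline can be mimicked over plain Python lists, instantiating the slide's placeholder myMapFunc/myReduceFunc as a hypothetical word count; flat_map and group_by_key below are toy stand-ins for the RDD operations, not Spark APIs.

```python
from itertools import groupby

def flat_map(data, f):
    """Apply f to each record and flatten the resulting lists of pairs."""
    return [y for x in data for y in f(x)]

def group_by_key(pairs):
    """Collect all values sharing a key (sorting plays the role of the shuffle)."""
    keyfunc = lambda kv: kv[0]
    return [(k, [v for _, v in g]) for k, g in groupby(sorted(pairs, key=keyfunc), keyfunc)]

# The slide's placeholders, instantiated as word count for illustration.
def my_map_func(rec):
    return [(word, 1) for word in rec.split()]

def my_reduce_func(key, vals):
    return (key, sum(vals))

data = ["spark makes working sets", "working sets"]
res = [my_reduce_func(k, vs) for k, vs in group_by_key(flat_map(data, my_map_func))]
print(res)  # [('makes', 1), ('sets', 2), ('spark', 1), ('working', 2)]
```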
Other Spark Applications
• Twitter spam classification (Justin Ma)
• EM alg. for traffic prediction (Mobile Millennium)
• K-means clustering
• Alternating Least Squares matrix factorization
• In-memory OLAP aggregation on Hive data
• SQL on Spark (future work)
Berkeley Data Analytics Stack (BDAS)
• Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
• Data Management: efficient data sharing across frameworks
• Data Processing: in-memory processing; trade between time, quality, and cost
• Application: new apps such as AMP-Genomics, Carat, …
Berkeley AMPLab
• "Launched" January 2011: 6-year plan
– 8 CS faculty
– ~40 students
– 3 software engineers
• Organized for collaboration across Algorithms, Machines, and People
Berkeley AMPLab
• Funding:
– XData, CISE Expedition Grant
– Industrial, founding sponsors
– 18 other sponsors, including
Goal: Next Generation of Analytics Data Stack for Industry &
Research:
• Berkeley Data Analytics Stack (BDAS)
• Release as Open Source
Berkeley Data Analytics Stack (BDAS)
• Existing stack components:
– Data Processing: Hadoop, HIVE, Pig, HBase, Storm, MPI, …
– Data Management: HDFS
– Resource Management: (no shared layer yet)
Mesos
• Management platform that allows multiple frameworks to share a cluster
• Compatible with existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers
Spark
• In-memory framework for interactive and iterative computations
– Resilient Distributed Dataset (RDD): fault-tolerant, in-memory storage abstraction
• Scala interface, Java and Python APIs
Spark Streaming [Alpha Release]
• Large-scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark → unifies batch, interactive, and streaming computations!
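The micro-batch idea behind this unification can be sketched in a few lines of Python (a toy illustration, not Spark's D-Stream implementation): slice the stream into small batches and apply the same function you would run on historical data.

```python
# Micro-batch sketch: slice an incoming stream into fixed-size batches and
# apply the same batch function to each slice.

def micro_batches(stream, batch_size):
    """Group a (possibly unbounded) iterator into fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def batch_count_errors(batch):
    """The same function could run on historical data or on each micro-batch."""
    return sum(1 for line in batch if line.startswith("ERROR"))

stream = ["INFO a", "ERROR b", "ERROR c", "INFO d", "ERROR e"]
counts_per_batch = [batch_count_errors(b) for b in micro_batches(stream, 2)]
print(counts_per_batch)  # [1, 1, 1]
```

Because each micro-batch is an ordinary deterministic computation over an immutable slice, the same lineage-based recovery used for RDDs also gives the stream its exactly-once behavior.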
Shark → Spark SQL
• HIVE over Spark: SQL-like interface (supports Hive 0.9)
– up to 100x faster for in-memory data, and 5-10x for data on disk
• Tested on clusters of hundreds of nodes
Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Support for Spark and Hadoop
BlinkDB
• Large-scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype being tested at Facebook
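The accuracy/time trade-off can be illustrated with simple uniform sampling in Python. This is a toy sketch of the idea only; BlinkDB's actual machinery (stratified samples, error estimation for complex queries) is far more involved.

```python
import random
import statistics

random.seed(42)

# A "table" too big to scan interactively (here just a list of values).
population = [random.gauss(100.0, 15.0) for _ in range(200_000)]

def approx_mean(values, sample_size):
    """Estimate the mean from a random sample, with a rough 95% error bound."""
    sample = random.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / sample_size ** 0.5
    return mean, 1.96 * stderr   # (estimate, +/- error at ~95% confidence)

estimate, error = approx_mean(population, 10_000)
exact = statistics.fmean(population)
print(f"estimate {estimate:.2f} +/- {error:.2f} (exact {exact:.2f})")
```

Scanning 5% of the rows yields an answer within a fraction of a percent of the exact mean, which is the bargain an approximate query engine offers when the user can state an acceptable error bound.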
Spark Graph
• GraphLab API and toolkits on top of Spark
• Fault tolerance by leveraging Spark
MLlib / MLbase
• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts
Compatible with Open Source Ecosystem
• Support existing interfaces whenever possible:
– GraphLab API
– Hive interface and shell
– HDFS API
– Compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos
Compatible with Open Source Ecosystem
• Use existing interfaces whenever possible:
– Support HDFS API, S3 API, and Hive metadata
– Support Hive API
– Accept inputs from Kafka, Flume, Twitter, TCP sockets, …
Summary
• Support interactive and streaming computations
– In-memory, fault-tolerant storage abstraction, low-latency scheduling, ...
• Easy to combine batch, streaming, and interactive computations
– Spark execution engine supports all computation models
• Easy to develop sophisticated algorithms
– Scala interface, APIs for Java, Python, Hive QL, …
– New frameworks targeted at graph-based and ML algorithms
• Compatible with existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high-quality software
– Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)
[Figure: Spark supports batch, interactive, and streaming workloads in one engine.]
Conclusion
• By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics
• RDDs provide:
– Lineage info for fault recovery and debugging
– Adjustable in-memory caching
– Locality-aware parallel operations
• We plan to make Spark the basis of a suite of batch and interactive data analysis tools