Big Data Trend and Open Data
Jongwook Woo
HiPIC
CalStateLA
UKC 2016
Dallas, TX, Aug 12 2016
Jongwook Woo, PhD, [email protected]
High-Performance Information Computing Center (HiPIC), California State University Los Angeles
Big Data Trend and Open Data
High Performance Information Computing CenterJongwook Woo
CalStateLA
Contents
Myself
Introduction To Big Data
Introduction To Spark
Spark and Hadoop
Open Data and Use Cases
Hadoop Spark Training
Myself
Experience:
Since 2002: Professor at California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present: Big Data academic partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors
Myself
Experience (Cont’d):
Bringing Big Data R&D and training to Korea since 2009
Collaborating with the City of Los Angeles in 2016
– Collect, search, and analyze city data
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop Big Data and education to universities and research centers
• Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg
Experience in Big Data
Collaboration
– Council Member of IBM Spark Technology Center
– City of Los Angeles for OpenHub and Open Data
– Startup companies in Los Angeles
– External collaborator and advisor in Big Data
• IMSC of USC
• Pennsylvania State University
Grants
– IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
Partnership
– Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata
Data Issues
Large-scale data
– Terabyte (10^12 bytes), Petabyte (10^15 bytes)
• Because of the web
• Sensor data (IoT), bioinformatics, social computing, streaming data, smart phones, online games…
Cannot be handled with the legacy approach
– Too big
– Non-/semi-structured data
– Too expensive
Need new systems
– Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
– How to store Big Data
• GFS
• Distributed systems on inexpensive commodity computers
– How to compute Big Data
• MapReduce
• Parallel computing with inexpensive computers
– Built its own supercomputers
– Published papers in 2003, 2004
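The MapReduce idea named on this slide can be illustrated with a minimal single-process sketch in plain Python. It only mimics the map, shuffle, and reduce phases on an in-memory list; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are illustrative assumptions, not the API of Google's system or of Hadoop.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values of one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical sample input
counts = word_count(["big data", "big compute"])
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; the sketch keeps only the data flow.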
What is Hadoop?
Hadoop Founder: Doug Cutting
– Apache Committer: Lucene, Nutch, …
Definition: Big Data
Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive supercomputer
– More publicly accessible than traditional supercomputers
• You can store and process your applications in your university labs, small companies, and research centers
Hadoop Cluster: Logical Diagram
[Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari). Each node runs a monitoring agent alongside Hadoop and HDFS. Services shown: Hive, ZooKeeper, Impala.]
Alternative to Hadoop MapReduce
Limitations of MapReduce
– Hard to program in Java
– Batch processing
• Not interactive
– Disk storage for intermediate data
• Performance issue
Spark by UC Berkeley AMPLab
– In-memory storage for intermediate data
– 20 ~ 100 times faster than network and disk (MapReduce)
Spark
In-memory data computing
– Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystem
– HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming model with faster data sharing
– Good for complex multi-stage applications
• Iterative graph algorithms, machine learning
– Interactive query
Spark
[Diagram: the Spark stack. Spark Core (RDDs, transformations, and actions) underlies the other components.]
– Spark Streaming (real-time): DStreams, streams of RDDs
– Spark SQL: SchemaRDDs, DataFrames
– MLlib / ML (machine learning): RDD-based matrices
– GraphX (graph): RDD-based matrices
– SparkR: RDD-based matrices
RDD Operations
Transformations
– Define new RDDs from the current one
• Lazy: not computed immediately
– map(), filter(), join()
Actions
– Return values
– count(), collect(), take(), save()
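The difference between lazy transformations and actions can be sketched without a cluster. The toy class below is a plain-Python stand-in written for illustration; `LazyDataset` is a hypothetical name and is not part of Spark. It records `map` and `filter` calls and only runs them when an action such as `collect()` is called.

```python
import itertools

class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source    # a re-iterable collection such as a list
        self._ops = tuple(ops)   # pending transformations, not yet executed

    # Transformations: return a new dataset and compute nothing
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        """Chain the recorded operations into one lazy iterator."""
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a value
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))

calls = []

def square(x):
    calls.append(x)  # record every real invocation
    return x * x

squares = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet
evens = squares.collect()         # the action runs the whole pipeline
```

Because nothing runs until an action is called, a real engine is free to plan the whole chain of transformations before executing any of it.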
Programming in Spark
Scala
– Functional programming
• The fundamental unit of programming is the function
• Input/output is a function
– No side effects
• No state
Python
– Legacy, large libraries
Java
Spark
Spark SQL
– DataFrame
• Turning an RDD into a relation
– Querying using SQL
Spark Streaming
– DStream
• RDD in streaming
– Windows
• To select a DStream from streaming data
MLlib, ML
– Sparse vector support, decision trees, linear/logistic regression, PCA
– Pipeline
Spark
Spark
– File system: Tachyon
– Resource manager: Mesos
But Hadoop has been dominating the market
Integrating Spark into Hadoop clusters
– Cloud computing
• Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
– Hadoop vendors
• HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystem
Spark with Hadoop YARN
ResourceManager (RM): per cluster
– Creates the Spark AM and allocates containers for the Spark AM
NodeManager (NM): per node
– Spark workers
ApplicationMaster (AM): per application
– Containers for Spark executors
[Diagram: a Spark client, the ResourceManager on the master node, and NodeManagers on the slave nodes hosting the Spark AM and the containers that run Spark executors.]
Big Data Analysis Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI tools (Datameer, Qlik, …)
Data Visualization
– Qlik, Datameer, Excel PowerView
Databricks cluster at CalStateLA
Open Data
US government
– Federal, state, and city governments
– Expose data to the public
US businesses
– Twitter, Yelp, …
– Expose data to the public with APIs
• Some restrictions on downloads
City governments
– New York
• Taxi, Uber, …
– Los Angeles
• Open Data, Open Hub with geo info
Open Big Data Analysis in CalStateLA
Social Media Data Analysis
– Twitter sentiment analysis for Alphago
Open Data from Government
– Airline data analysis
– Crime data analysis
Web Service API
– Business data analysis from Yelp and Google Places API
Data from Industry: Twitter Data
Systems
– Azure HDInsight Spark
– 8 nodes
• 40 cores: 2.4GHz Intel Xeon
• Memory per node: 28 GB
Data Source
– Keyword ‘alphago’ from Twitter via Apache NiFi
Data Size
– 63,193 tweets
Real-Time Data Collection Period
– 03/12 – 03/17/2016
• No data collected on 03/13
Top 10 Countries that Tweet “Alphago”
Top 10 Countries
# of Tweets per Country
– USA: > 11,000
– Japan: > 9,000
– Korea: > 1,900
– Russia, UK: > 1,600
– Thailand, France: > 1,000
– Netherlands, Spain, Ukraine: > 600
Top 10 Countries Sentiment
[Chart: positive and negative tweets for each of the top 10 countries]
Top 10 Countries
Countries with the most tweets
– Korea, Japan, USA
All countries show more positive tweets
Country Positive Negative
USA 5070 3567
Japan 8118 217
…
Korea 1053 407
…
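As a small worked example, the share of positive tweets can be computed from the three rows shown in the table. The counts are copied from the slide; the helper name `positive_share` is an illustrative assumption, and tweets with neither label are not part of this ratio.

```python
# (positive, negative) tweet counts copied from the table on this slide
sentiment = {"USA": (5070, 3567), "Japan": (8118, 217), "Korea": (1053, 407)}

def positive_share(positive, negative):
    """Fraction of positive tweets among those labeled positive or negative."""
    return positive / (positive + negative)

shares = {
    country: round(positive_share(pos, neg), 3)
    for country, (pos, neg) in sentiment.items()
}
# shares -> {"USA": 0.587, "Japan": 0.974, "Korea": 0.721}
```

All three shares are above 0.5, which matches the observation that every listed country shows more positive than negative tweets.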
Daily Tweets in 03/12 – 03/17/2016
[Chart: Alphago vs Lee Sedol, number of tweets per day from 3/12/2016 to 3/17/2016, y-axis 0 to 50,000]
Game 3: Mar 12
Game 4: Mar 13, Lee Se-Dol wins
Game 5: Mar 15
Ngram words
Three words in a row right after the Go champion's name, “sedol” or “se-dol”
3-grams             Frequency
Again-to-win        1,187
Is-something-I’ll   369
Is-something-i      199
In-go-tournament    168
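The counting behind this table can be sketched in plain Python: find each occurrence of the name and count the three tokens that follow it. This is a minimal sketch under assumptions; the tokenizer, the function name `trigrams_after`, and the sample tweets are hypothetical and are not the Spark job used for the slide.

```python
import re
from collections import Counter

def trigrams_after(tweets, names=("sedol", "se-dol")):
    """Count the three-word sequences that directly follow any of the given names."""
    counts = Counter()
    for text in tweets:
        # Lowercase and keep letters, digits, apostrophes, and hyphens as tokens
        tokens = re.findall(r"[a-z0-9'-]+", text.lower())
        for i, token in enumerate(tokens):
            following = tokens[i + 1:i + 4]
            if token in names and len(following) == 3:
                counts["-".join(following)] += 1
    return counts

# Hypothetical sample tweets
sample = [
    "Lee Sedol again to win?",
    "Can Se-dol again to win game 5",
    "sedol lost",
]
top = trigrams_after(sample).most_common(1)
```

Occurrences with fewer than three following words, such as the last sample tweet, are skipped.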
Sentiment Map of Alphago
[Map: tweet sentiment by location; legend: Positive, Negative]
Sentiment Map of Lee Se-Dol vs Alphago
YouTube video: “alphago sentiment” by Google
The sentiment of the world in geo and time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a
Federal Government: Airline Data Set
Government Open Data
– Airline data set, 2012 – 2014
• US Dept of Transportation
Cluster by Nillohit at HiPIC, CalStateLA
– Microsoft Azure using Hive and Spark SQL
– Number of data nodes: 4
• CPU: 4 cores; memory: 7 GB
• Windows Server 2012 R2 Datacenter
Airline Data Set
City Government: Crime Data Set
Open Data in City of Los Angeles
– Crime data set, 2012-2015
– File size: 151 MB
– Total number of offenses: 8.94 million
Ram Dharan and Sridhar Reddy at HiPIC, CalStateLA
– Microsoft Azure using Hive and Spark SQL
– Number of data nodes: 4
• CPU: 4 cores; memory: 14 GB
• Windows Server 2012 R2 Datacenter
• Extending to the last 10 years of the data set
Projection of Raw Data
[Chart: number of crimes by category (assault, criminal, traffic, vandalism, others, theft), y-axis 0 to 90,000; series: 2012, 2013, 2014, 2015]
Total No. of Crimes in 2012-15
[Chart: number of crimes per month (1 to 12), y-axis 0 to 25,000; series: 2012, 2013, 2014, 2015]
Map of Crimes that Occurred within 5 Miles of CalStateLA, UCLA, and USC in 2015
No. of Crimes for Every 5 Miles from CalStateLA
[Chart: number of crimes by distance from CalStateLA in 5-mile bands (0-5, 5-10, 10-15, 15-20, 20-25, 25-30, 30-35, >35), y-axis 0 to 90,000; series: 2012, 2013, 2014, 2015]
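Grouping records into 5-mile distance bands around a campus can be sketched with the haversine great-circle formula. This is only an illustration of the idea, not the Hive or Spark SQL query used for the slides; the function names are hypothetical and the campus coordinates are approximate values assumed for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def distance_band(miles, width=5, last_start=35):
    """Label a distance with its band, e.g. '0-5', '5-10', ..., '>35'."""
    if miles >= last_start:
        return f">{last_start}"
    start = int(miles // width) * width
    return f"{start}-{start + width}"

# Approximate campus coordinates, assumed for illustration only
CALSTATELA = (34.0668, -118.1684)
UCLA = (34.0689, -118.4452)

calstatela_to_ucla = haversine_miles(*CALSTATELA, *UCLA)
band = distance_band(calstatela_to_ucla)
```

Applying `distance_band` to the distance between each crime location and a campus, then counting rows per band and year, would yield the kind of series plotted in these charts.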
No. of Crimes for Every 5 Miles from UCLA
[Chart: number of crimes by distance from UCLA in 5-mile bands (0-5 through 35-40, >40), y-axis 0 to 120,000; series: 2012, 2013, 2014, 2015]
No. of Crimes for Every 5 Miles from USC
[Chart: number of crimes by distance from USC in 5-mile bands (0-5 through 35-40, >40), y-axis 0 to 120,000; series: 2012, 2013, 2014, 2015]
Comparison of Crimes for Every 5 Miles from CalStateLA, UCLA, and USC in 2015
[Chart: number of crimes by distance band in miles (0-5 through 35-40, 40-50, >50), y-axis 0 to 120,000; series: CalStateLA 2015, UCLA 2015, USC 2015]
No. of Crimes per Area in LA
[Chart: number of crimes per area (77th Street, Newton, Southwest, Van Nuys, Central, Foothill, Hollenbeck, N Hollywood, West Valley, Olympic, West LA), y-axis 0 to 18,000; series: 2012, 2013, 2014, 2015]
Total No. of Crimes for Every 2 Hours in LA
[Chart: y-axis 0 to 18,000; series: 2012, 2013, 2014, 2015. The extracted axis labels list areas: 77th Street, Newton, Southwest, Van Nuys, Central, Foothill, Hollenbeck, N Hollywood, West Valley, Olympic, West LA]
No. of Crimes for Every 2 Hours within 5 Miles of CalStateLA, UCLA, and USC in 2015
[Chart: number of crimes per 2-hour interval (00:00-02:00 through 22:00-24:00), axis 0 to 12,000; series: USC, UCLA, CSULA]
BUSINESS DATA ANALYSIS
DATA SET DETAILS
• Yelp Review Data: 1.9 GB
• Business Data: 500 MB
• Web Service API from Yelp and Google Places
[Diagram: the Yelp Challenge data set, Yelp data, and Google Places data are joined for analysis]
Top 10 businesses within 5 miles from CalStateLA (with 5 or 4 star ratings)
[Chart: business count per category, y-axis 0 to 40; bar values: 34, 31, 29, 26, 19, 19, 15, 15, 15; categories in the legend: Hair Salons, Auto Repair, General Dentistry, Insurance, Churches, Skin Care, Chiropractors, Barbers, Elementary Schools]
• Hair Salons and Insurance are popular qualified business categories
Businesses Popular within 5 Miles of CalStateLA, USC, and UCLA
Number of Food Businesses within a 0-25 Mile Radius of CalStateLA, USC, and UCLA
CalStateLA has more food businesses within 5 miles than UCLA and USC
[Chart: number of food businesses by distance band in miles (0-5, 5-10, 10-15, 15-20, 20-25), y-axis 0 to 600; series: CSULA, USC, UCLA]
Hydrogen Gas Power Plant Prediction Model
The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.
Hydrogen Gas Power Plant Prediction Model
The station produces hydrogen for hydrogen vehicles
Cal State L.A. Hydrogen Research and Fueling Facility
– The first station in the nation to sell hydrogen fuel to the public
– Hyundai, Toyota
Hydrogen Gas Power Plant Prediction Model
Workflow
Hydrogen Gas Power Plant Prediction Model
Model by Manvi Chandra
Hydrogen Gas Power Plant Prediction Model
Results and observations
Hydrogen Gas Power Plant Prediction Model
Results and observations
Can predict vehicle pressure
– Pressure of hydrogen gas within the vehicle's hydrogen storage system
– Using our model in Azure Visual Studio ML
– Building Spark ML
Decision forest regression
– Constructs a multitude of decision trees at training time
– Outputs
• the mode of the classes (classification)
• the mean prediction (regression) of the individual trees
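How a decision forest combines its trees can be shown with a short sketch: the mean of the per-tree outputs for regression and the most common vote for classification. The per-tree values below are hypothetical numbers chosen for illustration, not results from the hydrogen station model, and the function names are assumptions.

```python
from collections import Counter
from statistics import mean

def forest_regress(tree_predictions):
    """Regression: average the numeric outputs of the individual trees."""
    return mean(tree_predictions)

def forest_classify(tree_votes):
    """Classification: return the most common class among the trees' votes."""
    return Counter(tree_votes).most_common(1)[0][0]

# Hypothetical per-tree outputs for one input sample
pressure_estimate = forest_regress([348.0, 352.0, 350.0])
label = forest_classify(["high", "low", "high"])
```

Averaging over many trees reduces the variance of any single tree, which is the usual motivation for using a forest.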
Spark Big Data Training and R&D
HiPIC, California State University Los Angeles
Supported by
– Databricks and its cloud computing services
– Amazon AWS, IBM Bluemix, MS Azure
– Hortonworks, Cloudera
– Datameer
Databricks Partners
Training Hadoop and Spark
Cloudera visits to interview Jongwook Woo
Training Hadoop on IBM Bluemix at California State Univ. Los Angeles
Question?
Scheduling Process
rdd1.join(rdd2).groupBy(…).filter(…)
RDD Objects / Optimizer
– Builds the operator DAG
DAGScheduler
– Splits the graph into stages of tasks
– Submits each stage as ready
– Agnostic to operators
TaskScheduler
– Launches tasks via the cluster manager
– Retries failed or straggling tasks
– Doesn't know about stages
Worker
– Executes tasks in threads
– Stores and serves blocks through the block manager
[Diagram flow: RDD objects, then DAG, then DAGScheduler, then TaskSet, then TaskScheduler, then Task, then Worker, with a "stage failed" path back to the DAGScheduler]
Spark Components
Your program
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…).count()
...
Spark Driver/Client (app master)
– RDD graph
– Scheduler
– Block tracker
– Shuffle tracker
Cluster manager
Spark worker(s)
– Task threads
– Block manager
Storage: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …
Dependency Types
“Narrow” deps: a stage pipeline to be run on the same node
– map, filter
– union
– join with inputs co-partitioned
“Wide” (shuffle) deps: boundary of stages
– groupByKey
– join with inputs not co-partitioned
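The rule that wide dependencies mark stage boundaries can be sketched for a linear chain of operations. This is a simplified illustration, not Spark's DAGScheduler: the operation labels and the function `split_into_stages` are hypothetical, and a real scheduler works on a graph of RDDs rather than a flat list.

```python
# Operation labels taken from the slide; the exact strings are illustrative
NARROW = {"map", "filter", "union", "join (co-partitioned)"}
WIDE = {"groupByKey", "join (not co-partitioned)"}

def split_into_stages(ops):
    """Cut a linear chain of operations into stages at each wide (shuffle) dependency."""
    stages = [[]]
    for op in ops:
        if op not in NARROW and op not in WIDE:
            raise ValueError(f"unknown operation: {op}")
        if op in WIDE and stages[-1]:
            stages.append([])  # shuffle boundary: later work starts a new stage
        stages[-1].append(op)
    return stages

stages = split_into_stages(["map", "filter", "groupByKey", "map"])
# stages -> [["map", "filter"], ["groupByKey", "map"]]
```

Narrow operations stay in the same stage and can be pipelined on one node, while each wide operation forces a shuffle before the next stage can start.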
Scheduler Optimizations
Pipelines within a stage
– Stage 2: map, union
Stage 3: join algorithms based on partitioning (minimize shuffles)
[Diagram: RDDs A through G connected by groupBy, map, union, and join across Stage 1, Stage 2, and Stage 3; the legend marks previously computed partitions and tasks]
Scheduler Optimizations
Conceptually
Stage 1: 3 tasks
Stage 2: 4 tasks
Stage 3: 3 tasks
Total: 3 stages, 10 tasks
[Diagram: RDDs A through G connected by groupBy, map, union, and join across Stage 1, Stage 2, and Stage 3; the legend marks previously computed partitions and tasks]