Big Data Trend and Open Data

Jongwook Woo

HiPIC

CalStateLA

UKC 2016

Dallas, TX, Aug 12, 2016

Jongwook Woo, PhD

High-Performance Information Computing Center (HiPIC), California State University Los Angeles


Contents

Myself
Introduction to Big Data
Introduction to Spark
Spark and Hadoop
Open Data and Use Cases
Hadoop Spark Training


Myself

Experience:
Since 2002: Professor at California State University Los Angeles
  – PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
  – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
  – Information search and integration with FAST, Lucene/Solr, Sphinx
  – Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 - Present: Big Data academic partnerships
  – For Big Data research and training
    • Amazon AWS, Microsoft Azure, IBM Bluemix
    • Databricks, Hadoop vendors


Myself

Experience (Cont’d):
Bringing Big Data R&D and training to Korea since 2009
Collaborating with the City of Los Angeles in 2016
  – Collect, search, and analyze city data
    • Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
  – Introducing Hadoop Big Data and education to universities and research centers
    • Yonsei, Gachon
    • US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
    • Europe: Univ of Luxembourg


Experience in Big Data

Collaboration
  Council Member of IBM Spark Technology Center
  City of Los Angeles for OpenHub and Open Data
  Startup companies in Los Angeles
  External collaborator and advisor in Big Data
    – IMSC of USC
    – Pennsylvania State University
Grants
  IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
Partnership
  Academic education partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata



Data Issues

Large-scale data
  Terabyte (10^12 bytes), Petabyte (10^15 bytes)
    – Because of the web
    – Sensor data (IoT), bioinformatics, social computing, streaming data, smart phones, online games…
Cannot be handled with the legacy approach
  Too big
  Non-/semi-structured data
  Too expensive
Need new systems
  Inexpensive


Two Cores in Big Data

How to store Big Data
How to compute Big Data
Google
  How to store Big Data
    – GFS
    – Distributed systems on inexpensive commodity computers
  How to compute Big Data
    – MapReduce
    – Parallel computing with inexpensive computers
  Own supercomputers
  Published papers in 2003, 2004
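The MapReduce idea named above can be sketched on a single machine. This is a minimal illustration in plain Python, not Google's or Hadoop's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are made up for this example, and a real cluster would run the map and reduce calls on many machines in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # On a cluster, each line (or block of lines) would be mapped on a different node.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input, only for illustration.
    sample = ["big data needs big storage", "big data needs parallel computing"]
    print(word_count(sample))
```

The map and reduce functions never share state, which is what allows a framework to spread them across inexpensive commodity computers.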


What is Hadoop?

Hadoop founder: Doug Cutting
  – Apache committer: Lucene, Nutch, …


Definition: Big Data

Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
  – Inexpensive supercomputer
  – More accessible to the public than traditional supercomputers
    • You can store and process your applications
      – In your university labs, small companies, research centers


Hadoop Cluster: Logical Diagram

[Diagram: a web browser reaches the cluster monitor (CM/Ambari) over HTTP(S); each Hadoop node runs an agent and HDFS; services shown include Hive, ZooKeeper, and Impala.]



Alternative to Hadoop MapReduce

Limitations of MapReduce
  Hard to program in Java
  Batch processing
    – Not interactive
  Disk storage for intermediate data
    – Performance issue
Spark by UC Berkeley AMP Lab
  In-memory storage for intermediate data
  20 ~ 100 times faster than network and disk
    – MapReduce


Spark

In-memory data computing
  Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
  HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming with faster data sharing
  Good in complex multi-stage applications
    – Iterative graph algorithms, machine learning
  Interactive query


Spark

[Diagram: Spark Core (RDDs, transformations, and actions) underneath the libraries Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, DataFrames), MLlib / ML (machine learning; RDD-based matrices), GraphX (graph), and SparkR.]


RDD Operations

Transformations
  Define new RDDs from the current one
    – Lazy: not computed immediately
  map(), filter(), join()
Actions
  Return values
  count(), collect(), take(), save()
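The difference between lazy transformations and actions can be imitated in plain Python. This is a toy sketch, not the Spark API: `LazyDataset` is a hypothetical class invented for this example. It only records `map` and `filter` calls and does no work until an action (`collect`, `count`, `take`) is called.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, ops=()):
        self._source = source   # any re-iterable collection
        self._ops = tuple(ops)  # recorded transformations, not yet applied

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def _iterate(self):
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())

    def take(self, n):
        return list(islice(self._iterate(), n))


if __name__ == "__main__":
    evaluated = []

    def square(x):
        evaluated.append(x)  # record when the function really runs
        return x * x

    odd_squares = LazyDataset(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
    print(evaluated)              # [] because nothing has been computed yet
    print(odd_squares.collect())  # [1, 9, 25]
```

Because nothing runs until an action is called, a scheduler is free to plan the whole chain of transformations before executing any of it.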


Programming in Spark

Scala
  Functional programming
    – The fundamental unit of programming is the function
      • Input/output is a function
  No side effects
    – No state
Python
  Legacy, large libraries
Java


Spark

Spark SQL
  DataFrame
    – Turning an RDD into a relation
  Querying using SQL
Spark Streaming
  DStream
    – RDD in streaming
    – Windows
      • To select DStream from streaming data
MLlib, ML
  Sparse vector support, decision trees, linear/logistic regression, PCA
  Pipeline



Spark

Spark
  File system: Tachyon
  Resource manager: Mesos
But Hadoop has been dominating the market
  Integrating Spark into the Hadoop cluster
  Cloud computing
    – Amazon AWS, Azure HDInsight, IBM Bluemix
      • Object Storage, S3
  Hadoop vendors
    – HDP, CDH
  Databricks: Spark on AWS
    – No Hadoop ecosystems


Spark with Hadoop YARN

ResourceManager (RM), per cluster: creates the Spark AM and allocates containers for the Spark AM
NodeManager (NM), per node: Spark workers
ApplicationMaster (AM), per application: containers for Spark executors

[Diagram: a Spark client, a master node running the ResourceManager, and slave nodes running NodeManagers that host the Spark AM and the containers with Spark executors.]


Big Data Analysis Flow

Data Collection
  Batch API: Yelp, Google
  Streaming: Twitter, Apache NiFi, Kafka, Storm
  Open Data: Government
Data Storage
  HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
  Hive, Pig
Data Analysis and Science
  Hive, Pig, Spark, BI tools (Datameer, Qlik, …)
Data Visualization
  Qlik, Datameer, Excel PowerView



Databricks cluster at CalStateLA




Open Data

USA government
  Federal, state, and city governments
  Expose data to the public
USA business
  Twitter, Yelp, …
  Expose data to the public with APIs
    – Some restrictions on downloads
City government
  New York
    – Taxi, Uber, …
  Los Angeles
    – Open Data, Open Hub with geo info


Open Big Data Analysis at CalStateLA

Social media data analysis
  Twitter sentiment analysis for AlphaGo
Open data from government
  Airline data analysis
  Crime data analysis
Web service API
  Business data analysis from Yelp and Google Places API


Data from Industry: Twitter Data

Systems
  Azure HDInsight Spark
  8 nodes
    – 40 cores: 2.4GHz Intel Xeon
    – Memory, each node: 28 GB
Data source
  Keyword ‘alphago’ from Twitter via Apache NiFi
Data size
  63,193 tweets
Real-time data collection period
  03/12 – 03/17/2016
    – No data collected on 03/13


Top 10 Countries that Tweet “AlphaGo”


Top 10 Countries

# of tweets per country
  USA: > 11,000
  Japan: > 9,000
  Korea: > 1,900
  Russia, UK: > 1,600
  Thailand, France: > 1,000
  Netherlands, Spain, Ukraine: > 600


Top 10 Countries Sentiment

[Chart: positive and negative tweet counts for the top 10 countries.]


Top 10 Countries

Most tweeted countries
  All countries show more positive tweets
    – Korea, Japan, USA

Country   Positive   Negative
USA       5070       3567
Japan     8118       217
…
Korea     1053       407
…
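To compare countries with very different tweet volumes, the counts in the table can be turned into a positive share. The sketch below uses only the three rows listed in the table. The helper name `positive_share` is made up for this example, and the share is computed over positive plus negative tweets only, which is an assumption about how one might normalize.

```python
# Positive and negative tweet counts copied from the table above.
sentiment_counts = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Fraction of polarized tweets (positive + negative) that are positive."""
    return positive / (positive + negative)


shares = {
    country: positive_share(pos, neg)
    for country, (pos, neg) in sentiment_counts.items()
}

if __name__ == "__main__":
    for country, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{country}: {share:.1%} positive")
```

With these counts the shares come out at roughly 97.4% for Japan, 72.1% for Korea, and 58.7% for the USA.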


Daily Tweets in 03/12 – 03/17/2016

[Chart: number of tweets per day from 3/12/2016 to 3/17/2016 (y-axis 0 to 50,000) for AlphaGo vs Lee Sedol, annotated with Game 3 (Mar 12), Game 4 (Mar 13, Lee Se-Dol wins), and Game 5 (Mar 15).]


N-gram Words

Three words in a row right after the Go champion’s name, “sedol” or “se-dol”

3-grams             Frequency
Again-to-win        1,187
Is-something-I’ll   369
Is-something-i      199
In-go-tournament    168
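Counting the three words that follow a keyword can be sketched as below. This is an illustration under assumptions, not the code used for the study: the tokenizer regex, the function names, and the sample tweets are hypothetical, and the original analysis ran on Spark rather than plain Python.

```python
import re
from collections import Counter

KEYWORDS = {"sedol", "se-dol"}


def trigrams_after(tweet, keywords=KEYWORDS):
    """Yield the three words that directly follow each keyword occurrence."""
    # Lowercase and keep letters, digits, apostrophes, and hyphens inside words.
    words = re.findall(r"[a-z0-9'-]+", tweet.lower())
    for i, word in enumerate(words):
        following = words[i + 1:i + 4]
        if word in keywords and len(following) == 3:
            yield "-".join(following)


def count_trigrams(tweets):
    """Count every 3-gram that follows a keyword across all tweets."""
    return Counter(gram for tweet in tweets for gram in trigrams_after(tweet))


if __name__ == "__main__":
    # Hypothetical tweets, only for illustration.
    tweets = [
        "Lee Sedol again to win against AlphaGo",
        "Lee Se-dol again to win",
        "sedol in go tournament today",
    ]
    for gram, freq in count_trigrams(tweets).most_common():
        print(gram, freq)
```

Occurrences with fewer than three following words are skipped, so every counted item is a full 3-gram.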


Sentiment Map of AlphaGo

[Map: positive and negative sentiment of AlphaGo tweets by location.]


Sentiment Map of Lee Se-Dol vs AlphaGo

YouTube video: “alphago sentiment” by Google
The sentiment of the world in geo and time:
https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a





Federal Government: Airline Data Set

Government open data
  Airline data set, 2012 – 2014
    – US Dept of Transportation
Cluster by Nillohit at HiPIC, CalStateLA
  Microsoft Azure using Hive and Spark SQL
  Number of data nodes: 4
    – CPU: 4 cores; memory: 7 GB
    – Windows Server 2012 R2 Datacenter



Airline Data Set


City Government: Crime Data Set

Open data in the City of Los Angeles
  Crime data set, 2012-2015
  File size: 151 MB
  Total number of offenses: 8.94 million
Ram Dharan and Sridhar Reddy at HiPIC, CalStateLA
  Microsoft Azure using Hive and Spark SQL
  Number of data nodes: 4
    – CPU: 4 cores; memory: 14 GB
    – Windows Server 2012 R2 Datacenter
    – Extending to the last 10 years of the data set


Projection of Raw Data

[Chart: number of crimes by category (assault, criminal, traffic, vandalism, others, theft) for 2012 to 2015; y-axis 0 to 90,000.]


Total No. of Crimes in 2012-15

[Chart: total number of crimes per month (1 to 12) for 2012 to 2015; y-axis 0 to 25,000.]


Map of Crimes that Occurred within 5 Miles of CalStateLA, UCLA, and USC in 2015


No. of Crimes for Every 5 Miles from CalStateLA

[Chart: number of crimes by distance band from CalStateLA (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, >35 miles) for 2012 to 2015; y-axis 0 to 90,000.]
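Grouping crimes into 5-mile bands around a campus needs a distance between two latitude/longitude points. One common choice is the haversine formula, sketched below. This is an assumption about how such bands could be computed, not the Hive or Spark SQL queries used in the study. The campus coordinates are approximate and the crime locations are invented for illustration.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_band(miles, width=5, max_edge=35):
    """Return a label such as '0-5' or '>35' for a distance in miles."""
    if miles >= max_edge:
        return f">{max_edge}"
    low = int(miles // width) * width
    return f"{low}-{low + width}"


def crimes_per_band(campus, crime_locations):
    """Count crime locations per distance band around a campus (lat, lon)."""
    lat0, lon0 = campus
    return Counter(
        distance_band(haversine_miles(lat0, lon0, lat, lon))
        for lat, lon in crime_locations
    )


if __name__ == "__main__":
    calstatela = (34.0668, -118.1684)  # approximate campus location (assumption)
    # Hypothetical crime locations, only for illustration.
    crimes = [(34.07, -118.17), (34.15, -118.25), (34.60, -118.40)]
    print(crimes_per_band(calstatela, crimes))
```

The same function can be reused for UCLA and USC by passing a different campus coordinate.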


No. of Crimes for Every 5 Miles from UCLA

[Chart: number of crimes by distance band from UCLA (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40 miles) for 2012 to 2015; y-axis 0 to 120,000.]


No. of Crimes for Every 5 Miles from USC

[Chart: number of crimes by distance band from USC (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40 miles) for 2012 to 2015; y-axis 0 to 120,000.]


Comparison of Crimes for Every 5 Miles from CalStateLA, UCLA, and USC in 2015

[Chart: number of crimes by distance band (0-5, 5-10, 11-15, 15-20, 20-25, 25-30, 30-35, 35-40, 40-50, >50 miles) for CalStateLA, UCLA, and USC in 2015; y-axis 0 to 120,000.]


No. of Crimes per Area in LA

[Chart: number of crimes per area (77th Street, Newton, Southwest, Van Nuys, Central, Foothill, Hollenbeck, N Hollywood, West Valley, Olympic, West LA) for 2012 to 2015; y-axis 0 to 18,000.]


Total No. of Crimes for Every 2 Hours in LA

[Chart: number of crimes for 2012 to 2015 across 77th Street, Newton, Southwest, Van Nuys, Central, Foothill, Hollenbeck, N Hollywood, West Valley, Olympic, and West LA; y-axis 0 to 18,000.]


No. of Crimes for Every 2 Hours within 5 Miles of CalStateLA, UCLA, and USC in 2015

[Chart: number of crimes per 2-hour slot (00:00-02:00 through 22:00-24:00) for USC, UCLA, and CalStateLA; x-axis 0 to 12,000.]
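Assigning each crime to one of the twelve 2-hour slots used in the chart is a small integer-division step. The sketch below shows one way to do it, assuming each record carries a timestamp. The function names and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def two_hour_slot(timestamp):
    """Map a datetime to a slot label such as '08:00-10:00'."""
    start = (timestamp.hour // 2) * 2  # 0, 2, 4, ..., 22
    return f"{start:02d}:00-{start + 2:02d}:00"


def crimes_per_slot(timestamps):
    """Count records per 2-hour slot."""
    return Counter(two_hour_slot(t) for t in timestamps)


if __name__ == "__main__":
    # Hypothetical occurrence times, only for illustration.
    times = [
        datetime(2015, 3, 1, 0, 30),
        datetime(2015, 3, 1, 9, 15),
        datetime(2015, 3, 1, 23, 59),
    ]
    print(crimes_per_slot(times))
```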


BUSINESS DATA ANALYSIS

DATA SET DETAILS
• Yelp review data: 1.9 GB
• Business data: 500 MB
• Web service API from Yelp and Google Places

Analysis
[Diagram: the Yelp Challenge data set, Google Places, and Yelp data are joined for analysis.]


Top 10 Businesses within 5 Miles of CalStateLA (with 5 or 4 Star Ratings)

[Chart: bar chart of business counts by category, with bars of 34, 31, 29, 26, 19, 19, 15, 15, and 15; legend categories: Hair Salons, Auto Repair, General Dentistry, Insurance, Churches, Skin Care, Chiropractors, Barbers, Elementary Schools.]

• Hair Salons and Insurance are popular categories among businesses with 4 or 5 star ratings


Popular Businesses within 5 Miles of CalStateLA, USC, and UCLA


Number of Food Businesses within 0-25 Miles of CalStateLA, USC, and UCLA

CalStateLA has more food businesses within 5 miles than UCLA and USC

[Chart: number of food businesses by distance band (0-5, 5-10, 10-15, 15-20, 20-25 miles) for CSULA, USC, and UCLA; y-axis 0 to 600.]



Hydrogen Gas Power Plant Prediction Model

The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.


The Station Producing Hydrogen for Hydrogen Vehicles

The Cal State L.A. Hydrogen Research and Fueling Facility is the first station in the nation to sell hydrogen fuel to the public.
Hyundai, Toyota




Workflow




Model by Manvi Chandra




Results and observations


Results and observations

Can predict vehicle pressure
  – Pressure of hydrogen gas within the vehicle’s hydrogen storage system
  – Using our model in Azure Visual Studio ML
  – Building Spark ML
Decision forest regression
  – Constructs a multitude of decision trees at training time and outputs
    • the mode of the classes (classification)
    • the mean prediction (regression) of the individual trees
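How a decision forest combines its trees can be shown without training anything. In the sketch below each "tree" is a hand-written function standing in for a trained decision tree. The feature names (`fill_time_s`, `ambient_temp_c`) and all numbers are hypothetical and do not come from the H2 Station model.

```python
from collections import Counter
from statistics import mean


def forest_classify(trees, x):
    """Classification: return the mode (most common class) of the tree votes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def forest_regress(trees, x):
    """Regression: return the mean of the individual tree predictions."""
    return mean([tree(x) for tree in trees])


# Hand-written stand-ins for trained regression trees (hypothetical thresholds and values).
regression_trees = [
    lambda x: 350.0 if x["fill_time_s"] < 120 else 700.0,
    lambda x: 400.0 if x["ambient_temp_c"] < 20 else 650.0,
    lambda x: 300.0 if x["fill_time_s"] < 200 else 720.0,
]

# Hand-written stand-ins for trained classification trees.
classification_trees = [
    lambda x: "high" if x["fill_time_s"] >= 120 else "low",
    lambda x: "high" if x["ambient_temp_c"] >= 20 else "low",
    lambda x: "high" if x["fill_time_s"] >= 200 else "low",
]

if __name__ == "__main__":
    sample = {"fill_time_s": 150, "ambient_temp_c": 25}
    print(forest_regress(regression_trees, sample))       # mean of 700.0, 650.0, 300.0
    print(forest_classify(classification_trees, sample))  # majority of high, high, low
```

Averaging many trees tends to smooth out the errors of any single tree, which is the usual motivation for using a forest instead of one deep tree.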



Spark Big Data Training and R&D

HiPIC, California State University Los Angeles
Supported by
  – Databricks and its cloud computing services
  – Amazon AWS, IBM Bluemix, MS Azure
  – Hortonworks, Cloudera
  – Datameer



Databricks Partners


Training Hadoop and Spark

Cloudera visits to interview Jongwook Woo



Training Hadoop on IBM Bluemix at California State Univ. Los Angeles


Questions?



Scheduling Process

RDD Objects
  rdd1.join(rdd2).groupBy(…).filter(…)
  Optimizer: build operator DAG
  Passes the DAG to the DAGScheduler
DAGScheduler
  Split graph into stages of tasks
  Submit each stage as ready
  Agnostic to operators!
  Passes each TaskSet to the TaskScheduler
TaskScheduler
  Launch tasks via cluster manager
  Retry failed or straggling tasks
  Doesn’t know about stages
  Reports "stage failed" back to the DAGScheduler
Worker
  Execute tasks (task threads)
  Store and serve blocks (block manager)


Spark Components

Your program
  sc = new SparkContext
  f = sc.textFile("…")
  f.filter(…).count()
  ...
Spark Driver/Client (app master)
  RDD graph, scheduler, block tracker, shuffle tracker
Cluster manager
Spark worker(s)
  Task threads, block manager
HDFS, HBase, Amazon S3, Couchbase, Cassandra, …


Dependency Types

“Narrow” deps: a stage pipeline to be run on the same node
  map, filter
  union
  join with inputs co-partitioned
“Wide” (shuffle) deps: boundary of stages
  groupByKey
  join with inputs not co-partitioned
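The rule that wide dependencies mark stage boundaries can be sketched for a simple linear chain of operators. This is a simplification for illustration: Spark's DAGScheduler works on a graph of RDDs, not a list of names. The operator labels `join_copartitioned` and `join_not_copartitioned` are made up here to tell the two join cases apart.

```python
NARROW_OPS = {"map", "filter", "union", "join_copartitioned"}
WIDE_OPS = {"groupByKey", "join_not_copartitioned"}


def split_into_stages(ops):
    """Split a linear chain of operator names into stages.

    Narrow operators are pipelined into the current stage; a wide (shuffle)
    operator starts a new stage.
    """
    stages, current = [], []
    for op in ops:
        if op not in NARROW_OPS and op not in WIDE_OPS:
            raise ValueError(f"unknown operator: {op}")
        if op in WIDE_OPS and current:
            stages.append(current)  # the shuffle closes the previous stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


if __name__ == "__main__":
    chain = ["map", "filter", "groupByKey", "map", "join_not_copartitioned", "filter"]
    for number, stage in enumerate(split_into_stages(chain), start=1):
        print(f"Stage {number}: {stage}")
```

A chain with only narrow operators stays in a single stage, which is why co-partitioned joins are cheaper than joins that need a shuffle.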




Scheduler Optimizations

Pipelines within a stage
  – Stage 2: map, union
Stage 3: join algorithms based on partitioning (minimize shuffles)

[Diagram: RDDs A through G connected by groupBy, map, union, and join and grouped into Stage 1, Stage 2, and Stage 3; legend: previously computed partition, task.]


Scheduler Optimizations

Conceptually
  Stage 1: 3 tasks
  Stage 2: 4 tasks
  Stage 3: 3 tasks
  Total: 3 stages, 10 tasks

[Diagram: RDDs A through G connected by groupBy, map, union, and join and grouped into Stage 1, Stage 2, and Stage 3; legend: previously computed partition, task.]
