Big Data Trend with Open Platform

Jongwook Woo, HiPIC, CalStateLA. SWRC 2017, San Diego, CA, Feb 25 2017. Jongwook Woo, PhD, [email protected], High-Performance Information Computing Center (HiPIC), California State University Los Angeles. Big Data Trend with Open Platform

Transcript of Big Data Trend with Open Platform

Page 1: Big Data Trend with Open Platform

Jongwook Woo

HiPIC

CalStateLA

SWRC 2017

San Diego, CA, Feb 25 2017

Jongwook Woo, PhD, [email protected]

High-Performance Information Computing Center (HiPIC)
California State University Los Angeles

Big Data Trend with Open Platform

Page 2: Big Data Trend with Open Platform

High Performance Information Computing Center, Jongwook Woo

CalStateLA

Contents

Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend

Page 3: Big Data Trend with Open Platform

Myself

Experience:
  Since 2002: Professor at California State University, Los Angeles
    – PhD in 2001: Computer Science and Engineering at USC
  Since 1998: R&D consulting in Hollywood
    – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
    – Information search and integration with FAST, Lucene/Solr, Sphinx
    – Implemented eBusiness applications using J2EE and middleware
  Since 2007: Exposed to Big Data at CitySearch.com
  2012 - Present: Big Data academic partnerships
    – For Big Data research and training
      • Amazon AWS, Microsoft Azure, IBM Bluemix
      • Databricks, Hadoop vendors

Page 4: Big Data Trend with Open Platform

Myself

Experience (Cont’d):
  Bringing Big Data R&D and training to Korea since 2009
  Collaborating with the City of Los Angeles in 2016
    – Collect, search, and analyze city data
      • Hadoop, Solr, Java, Cloudera
  Sept 2013: Samsung Advanced Technology Training Institute
  Since 2008
    – Introducing Hadoop Big Data and education to universities and research centers
      • Yonsei, Gachon
      • US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
      • Europe: Univ of Luxembourg

Page 5: Big Data Trend with Open Platform

Experience in Big Data

Collaboration
  Council Member of IBM Spark Technology Center
  City of Los Angeles for OpenHub and Open Data
  Startup companies in Los Angeles
  External Collaborator and Advisor in Big Data
    – IMSC of USC
    – Pennsylvania State University
Grants
  IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
Partnership
  Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata

Page 6: Big Data Trend with Open Platform

Contents

Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend

Page 7: Big Data Trend with Open Platform

Data Issues

Large-scale data
  Terabytes (10^12 bytes), petabytes (10^15 bytes)
    – Because of the web
    – Sensor data (IoT), bioinformatics, social computing, streaming data, smartphones, online games…
Cannot be handled with the legacy approach
  Too big
  Non-/semi-structured data
  Too expensive
Need new systems
  Inexpensive

Page 8: Big Data Trend with Open Platform

Two Cores in Big Data

How to store Big Data
How to compute Big Data
Google
  How to store Big Data
    – GFS
    – Distributed systems on inexpensive commodity computers
  How to compute Big Data
    – MapReduce
    – Parallel computing with inexpensive computers
  Own supercomputers
  Published papers in 2003, 2004
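
The MapReduce idea named above can be sketched in a few lines of plain Python. This is a single-machine illustration only, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are made up for this note. On a real cluster the map and reduce calls would run in parallel on many inexpensive machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical two-line input, just to exercise the three phases.
counts = word_count(["big data with Hadoop", "big data with Spark"])
```

Each map call only sees one line and each reduce call only sees one key, which is what lets a framework spread the work across machines.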

Page 9: Big Data Trend with Open Platform

What is Hadoop?

Hadoop Founder:
  Doug Cutting
Apache Committer: Lucene, Nutch, …

Page 10: Big Data Trend with Open Platform

Super Computer vs Hadoop

Parallel vs. Distributed file systems by Michael Malak

[Figure labels: Cluster for Compute; Cluster for Store; Cluster for Compute/Store]

Page 11: Big Data Trend with Open Platform

Definition: Big Data

Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
    – Inexpensive supercomputer
    – More public than traditional supercomputers
      • You can store and process your applications
        – In your university labs, small companies, research centers

Page 12: Big Data Trend with Open Platform

Hadoop Cluster: Logical Diagram

[Diagram: a web browser connects over HTTP(S) to the cluster monitor (CM/Ambari); the cluster monitor manages nodes that each run an agent, Hadoop, and HDFS; services shown: Hive, ZooKeeper, Impala]

Page 13: Big Data Trend with Open Platform

Hadoop Ecosystems

http://dawn.dbsdataprojects.com/tag/hadoop/

Page 14: Big Data Trend with Open Platform

Contents

Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend

Page 15: Big Data Trend with Open Platform

Alternative to Hadoop MapReduce

Limitations of MapReduce
  Hard to program in Java
  Only Map and Reduce
    – Limited parallelization
  Batch processing
    – Not interactive
  Disk storage for intermediate data
    – Performance issue

Spark by UC Berkeley AMPLab
  In-memory storage for intermediate data
  20 ~ 100 times faster than network (N/W) and disk based processing
    – MapReduce

Page 16: Big Data Trend with Open Platform

Spark

In-memory data computing
  Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
  HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming with faster data sharing
  Good in complex multi-stage applications
    – Iterative graph algorithms, machine learning
  Interactive query

Page 17: Big Data Trend with Open Platform

Spark
RDDs, Transformations, and Actions

[Diagram labels: Spark Core; Spark Streaming (real-time), DStreams: streams of RDDs; Spark SQL, SchemaRDDs, DataFrames; ML / MLlib (machine learning), RDD-based matrices; GraphX (graph); SparkR]

Page 18: Big Data Trend with Open Platform

Spark Drivers and Workers

Drivers
  Client
    – with SparkContext
      • Communicate with Spark workers
Workers
  Spark Executor
  Run on cluster nodes
    – Production
  Run in local threads
    – Development and test

Page 19: Big Data Trend with Open Platform

RDD

Resilient Distributed Dataset (RDD)
  Distributed collections of objects
    – that can be cached in memory
  Immutable
    – RDD, DStream, SchemaRDD, PairRDD
  Lineage
    – History of the objects
    – Automatically and efficiently recompute lost data

Page 20: Big Data Trend with Open Platform

RDD and DataFrame Operations

Transformations
  Define new RDDs and DataFrames from the current ones
    – Lazy: not computed immediately
  map(), filter(), join(), select(), groupBy()
Actions
  Return values
  count(), collect(), take(), save()
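
The difference between lazy transformations and actions can be illustrated with a toy model in plain Python. This is a hedged sketch, not the Spark API: MiniRDD is a hypothetical class invented for this note that borrows the method names map, filter, collect, count, and take. Transformations only record a step in the lineage; an action replays the lineage over the source data.

```python
import itertools


class MiniRDD:
    """Toy stand-in for an RDD: immutable, lazy, and rebuilt from its lineage."""

    def __init__(self, source, lineage=()):
        self._source = source      # original data, never modified
        self._lineage = lineage    # recorded (kind, function) steps

    # Transformations: return a new MiniRDD, compute nothing yet.
    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        """Replay every recorded step over the source data."""
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def take(self, n):
        return list(itertools.islice(self._compute(), n))


calls = []  # records when the mapped function really runs


def square(x):
    calls.append(x)
    return x * x


numbers = MiniRDD([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
nothing_ran_yet = (calls == [])   # True: only the lineage was recorded
result = even_squares.collect()   # the action runs the whole chain
```

Because every action recomputes from the recorded steps, the same mechanism also gives a rough picture of how lost data can be rebuilt from lineage.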

Page 21: Big Data Trend with Open Platform

Programming in Spark

Scala
  Functional programming
    – The fundamental unit of programming is the function
      • Input/output is a function
  No side effects
    – No state
Python
  Legacy, large libraries
Java
R

Page 22: Big Data Trend with Open Platform

Page 23: Big Data Trend with Open Platform

Spark

Spark SQL
  Querying using SQL, HiveQL
  DataFrame
ML
  Machine learning on DataFrames, pipelining
  MLlib
    – On RDD
    – Sparse vector support, decision trees, linear/logistic regression, PCA, SVM
Spark Streaming
  DStream
    – RDD in streaming
    – Windows
      • To select DStream from streaming data
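
The window idea can be pictured with a small plain-Python model. It is a sketch under assumptions, not Spark Streaming code: the stream is represented as a list of micro-batches, and sliding_windows with its window_length and slide_interval parameters (both counted in batches) is a hypothetical helper written for this note.

```python
from collections import deque


def sliding_windows(batches, window_length, slide_interval):
    """Every `slide_interval` batches, yield the items of the last `window_length` batches."""
    recent = deque(maxlen=window_length)  # keeps only the newest batches
    for index, batch in enumerate(batches, start=1):
        recent.append(batch)
        if index % slide_interval == 0:
            yield [item for b in recent for item in b]


# Four hypothetical micro-batches arriving one after another.
batches = [["a"], ["b", "c"], ["d"], ["e"]]
windows = list(sliding_windows(batches, window_length=3, slide_interval=2))
```

Here a window covering three batches is emitted after every second batch, so consecutive windows overlap.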

Page 24: Big Data Trend with Open Platform

Scheduling Process

rdd1.join(rdd2)
    .groupBy(…)
    .filter(…)

RDD Objects / Optimizer
  Build operator DAG
DAGScheduler
  Split graph into stages of tasks
  Submit each stage as ready
  Agnostic to operators
TaskScheduler
  Launch tasks via cluster manager
  Retry failed or straggling tasks
  Doesn’t know about stages
Worker
  Execute tasks
  Store and serve blocks
  Threads, block manager

[Diagram flow: RDD Objects pass the DAG to the DAGScheduler, which passes TaskSets to the TaskScheduler, which sends Tasks to Workers through the cluster manager; a failed stage is reported back to the DAGScheduler]
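
The DAGScheduler step, splitting the graph into stages, can be pictured with a toy model in plain Python. This is a simplification under stated assumptions, not Spark's scheduler: the job is a linear chain of operator names, and join and groupBy are assumed to always need a shuffle (the WIDE set below), whereas real Spark decides from the dependencies between RDD partitions. The helper split_into_stages is hypothetical.

```python
# Assumption of this sketch: these operators always require a shuffle.
WIDE = frozenset({"join", "groupBy"})


def split_into_stages(operators, wide=WIDE):
    """Group a linear chain of operators into stages, cutting at each shuffle."""
    stages = [[]]
    for op in operators:
        if op in wide and stages[-1]:
            stages.append([])  # shuffle boundary: a new stage starts here
        stages[-1].append(op)
    return stages


# The chain from the slide, preceded by a hypothetical map step.
stages = split_into_stages(["map", "join", "groupBy", "filter"])
```

Operators that do not shuffle, such as filter, stay in the stage of the operator before them, so they can run as one pipelined task per partition.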

Page 25: Big Data Trend with Open Platform

During Scheduling Process

https://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797

Page 26: Big Data Trend with Open Platform

Contents

Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend

Page 27: Big Data Trend with Open Platform

Spark

Spark
  File systems: Tachyon
  Resource manager: Mesos
But Hadoop has been dominating the market
  Integrating Spark into Hadoop clusters
  Cloud computing
    – Amazon AWS, Azure HDInsight, IBM Bluemix
      • Object Storage, S3
  Hadoop vendors
    – HDP, CDH
  Databricks: Spark on AWS
    – No Hadoop ecosystems

Page 28: Big Data Trend with Open Platform

Spark Components

Your program:
sc = new SparkContext
f = sc.textFile("…")
f.filter(…)
 .count()
...

[Diagram: the Spark driver/client (app master) holds the RDD graph, scheduler, block tracker, and shuffle tracker; it reaches the Spark workers through the cluster manager; each Spark worker runs task threads and a block manager; data sources: HDFS, HBase, Amazon S3, Couchbase, Cassandra, …]

Page 29: Big Data Trend with Open Platform

Spark with Hadoop YARN

ResourceManager (RM): per cluster; creates the Spark AM and allocates containers for the Spark AM
NodeManager (NM): per node; Spark workers
ApplicationMaster (AM): per application; containers for Spark executors

[Diagram: Spark client; master node with the ResourceManager; slave nodes, each with a NodeManager, hosting the Spark AM and containers that run Spark executors]

Page 30: Big Data Trend with Open Platform

Databricks cluster at CalStateLA

Page 31: Big Data Trend with Open Platform

Contents

Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend

Page 32: Big Data Trend with Open Platform

Open Platform

Open Source
Open Conference
Open Data
  Public Data

Page 33: Big Data Trend with Open Platform

Open Source

Hadoop
  http://hadoop.apache.org/
Spark
  http://spark.apache.org/
NoSQL
  http://hbase.apache.org/
Search Engine
  http://lucene.apache.org/solr/

Page 34: Big Data Trend with Open Platform

Open Conference

Hadoop Summit
  Live streaming
    – http://siliconangle.tv/hadoop-summit-2016/
Spark Summit
  https://spark-summit.org/east-2017/
  Live streaming
    – http://go.spark-summit.org/east-2017/live-stream?_ga=1.62160364.1150099959.1484851457

Page 35: Big Data Trend with Open Platform

Open Data

USA government
  Federal, state, and city governments
  Expose data to the public
USA business
  Twitter, Yelp, …
  Expose data to the public with APIs
    – Some restrictions on downloads
City government
  New York
    – Taxi, Uber, …
  Los Angeles
    – Open Data, Open Hub with geo info

Page 36: Big Data Trend with Open Platform

Contents

Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend

Page 37: Big Data Trend with Open Platform

Databricks Partners

Page 38: Big Data Trend with Open Platform

Industrial Collaboration

Cloudera visits to interview Jongwook Woo

Page 39: Big Data Trend with Open Platform

Industrial Collaboration: IBM Bluemix at CalStateLA

Page 40: Big Data Trend with Open Platform

Big Data Analysis and Prediction Flow

Data Collection
  Batch API: Yelp, Google
  Streaming: Twitter, Apache NiFi, Kafka, Storm
  Open Data: Government
Data Storage
  HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
  Hive, Pig
Data Analysis and Science
  Hive, Pig, Spark, BI tools (Tableau, Qlik, …)
Data Visualization
  Qlik, Datameer, Excel PowerView

Page 41: Big Data Trend with Open Platform

Databricks cluster at CalStateLA

Page 42: Big Data Trend with Open Platform

LOCAL BUSINESS DATA ANALYSIS

Yashaswi Ananth
Ruchi Singh
Mahsa Tayer Farahani

Page 43: Big Data Trend with Open Platform

LOCAL BUSINESS DATA ANALYSIS

Using local business data
  From Yelp and Google Local
Grad students at CalStateLA Symposium, Feb 24 2017
  Yashaswi Ananth
  Ruchi Singh
  Mahsa Tayer Farahani

Page 44: Big Data Trend with Open Platform

REVIEW COUNT FOR BUSINESS TYPES

• Food
• Services
• Entertainment
• Shopping
• Medical

Page 45: Big Data Trend with Open Platform

TOP BUSINESS IN THE SIX CATEGORIES

Page 46: Big Data Trend with Open Platform

Review count of popular sub-categories of business

Page 47: Big Data Trend with Open Platform

Sentiment Analysis of Services category

Page 48: Big Data Trend with Open Platform

Top business

Top 5 most popular local businesses on Yelp between 2006 and 2016 in the selected cities

Page 49: Big Data Trend with Open Platform

Popular businesses within 5 miles of CalStateLA, USC, UCLA

Page 50: Big Data Trend with Open Platform

Historical Analysis of College Scorecard

CalStateLA Symposium, Feb 24 2017

Kunal Pritwani
Atinder Singh
Dharmesh Soni
Mounika Vallabhaneni

Page 51: Big Data Trend with Open Platform

Specification of Data Set

Data is collected from https://www.kaggle.com/kaggle/college-scorecard
We have historical data of over 100,000 colleges in the US spanning over 14 years.
Data size: 1.33 GB
File format: CSV (comma-separated values)

Page 52: Big Data Trend with Open Platform

Mean Income

Medical College of Wisconsin: 250K
Upstate Medical University: 152.7K
CalTech: 103K
Washington and Lee University: 100K

Page 53: Big Data Trend with Open Platform

Comparing Average Net Price of Two States (Annual Tuition)

UCLA: $13,817
CalStateLA: $4,370
Fashion Inst of Tech: $11.5K
CUNY: $5K

Page 54: Big Data Trend with Open Platform

SAT Scores in Different Colleges

Math (Blue), Verbal (Orange), Mean Earning (Purple)
• CalTech: 800, 778.9, $98.7K
• MIT: 800, 764.4, $124.4K
• Harvard: 791, 795.6, $133K
• Princeton: 793, 791, $115.6K
• Yale: 788, 794.4, $97.8K

Page 55: Big Data Trend with Open Platform

Comparing Average Undergraduates Receiving PELL GRANT

Universal Career Community College: 100% PELL grant scholarship

Page 56: Big Data Trend with Open Platform

Average Undergraduates Receiving PELL GRANT in Each College

East Georgia State College: $2,854 Avg. PELL grant: 97.285%

Page 57: Big Data Trend with Open Platform

Alphago vs Lee using Twitter Data

Systems
  Azure HDInsight Spark
  8 nodes
    – 40 cores: 2.4GHz Intel Xeon
    – Memory, each node: 28 GB
Data source
  Keyword ‘alphago’ from Twitter via Apache NiFi
Data size
  63,193 tweets
Real-time data collection period
  03/12 – 03/17/2016
    – No data collected on 03/13

Page 58: Big Data Trend with Open Platform

Top 10 Countries that Tweet “Alphago”

Page 59: Big Data Trend with Open Platform

Top 10 Countries

# of tweets per country
  USA: > 11,000
  Japan: > 9,000
  Korea: > 1,900
  Russia, UK: > 1,600
  Thailand, France: > 1,000
  Netherlands, Spain, Ukraine: > 600

Page 60: Big Data Trend with Open Platform

Top 10 Countries Sentiment

Positive Negative

Page 61: Big Data Trend with Open Platform

Top 10 Countries

Most tweeted countries
  All countries show more positive tweets
    – Korea, Japan, USA

Country   Positive   Negative
USA       5070       3567
Japan     8118       217
Korea     1053       407
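
As a quick check of the "more positive tweets" observation, the share of positive tweets per country can be computed from the three rows above. The counts are taken from the table; the helper positive_share and the one-decimal rounding are choices made for this note.

```python
# (positive, negative) tweet counts per country, as listed in the table.
tweets = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    """Percentage of positive tweets among positive plus negative tweets."""
    return 100.0 * positive / (positive + negative)


shares = {
    country: round(positive_share(pos, neg), 1)
    for country, (pos, neg) in tweets.items()
}
# shares == {"USA": 58.7, "Japan": 97.4, "Korea": 72.1}
```

All three shares are above 50%, with Japan the most strongly positive.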

Page 62: Big Data Trend with Open Platform

Daily Tweets in 03/12 – 03/17/2016

[Chart: Alphago vs Lee Sedol, tweets per day from 3/12/2016 to 3/17/2016; y-axis from 0 to 50,000; annotations: Game 3: Mar 12; Game 4: Mar 13, Lee Se-Dol win; Game 5: Mar 15]

Page 63: Big Data Trend with Open Platform

Ngram words

3 words in a row right after the Go champion’s name, “sedol” and “se-dol”

3-grams             Frequency
Again-to-win        1,187
Is-something-I’ll   369
Is-something-i      199
In-go-tournament    168
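
Counting the three words that follow a keyword can be sketched in plain Python as below. The slides do not show how the analysis was implemented, so this is only one possible approach: count_following_ngrams is a hypothetical helper, tokenization is a simple lowercase whitespace split, and the sample tweets are invented for illustration.

```python
from collections import Counter


def count_following_ngrams(tweets, keywords=("sedol", "se-dol"), n=3):
    """Count the n words that directly follow each occurrence of a keyword."""
    counts = Counter()
    for text in tweets:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            following = tokens[i + 1 : i + 1 + n]
            # Skip occurrences with fewer than n words after the keyword.
            if token in keywords and len(following) == n:
                counts["-".join(following)] += 1
    return counts


# Invented sample tweets, not taken from the collected data set.
sample = [
    "Lee Sedol again to win today",
    "lee sedol again to win",
    "Lee Se-dol in go tournament now",
]
counts = count_following_ngrams(sample)
```

On the full tweet collection the same counting would run per partition and the counters would be merged, which is the kind of aggregation Spark handles well.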

Page 64: Big Data Trend with Open Platform

Sentiment Map of Alphago

[Map legend: Positive, Negative]

Page 65: Big Data Trend with Open Platform

Sentiment Map of Lee Se-Dol vs Alphago

YouTube video: “alphago sentiment” by Google The sentiment of the World in Geo and Time:

https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a

Page 66: Big Data Trend with Open Platform

Airline Data Set

Government open data
  Airline data set in 2012 – 2014
    – US Dept of Transportation
Cluster by Nillohit at HiPIC, CSULA
  Microsoft Azure using Hive and Spark SQL
  Number of data nodes: 4
    – CPU: 4 cores; memory: 7 GB
    – Windows Server 2012 R2 Datacenter

Page 67: Big Data Trend with Open Platform

Airline Data Set

Page 68: Big Data Trend with Open Platform

Airline Data Set

Page 69: Big Data Trend with Open Platform

Airline Data Set

Page 70: Big Data Trend with Open Platform

City Government: Crime Data Set

Open data in City of Los Angeles
  Crime data set in 2014
Ram Dharan and Sridhar Reddy at HiPIC, CSULA
  Microsoft Azure using Hive and Spark SQL
  Number of data nodes: 4
    – CPU: 4 cores; memory: 14 GB
    – Windows Server 2012 R2 Datacenter
    – Extending to last 10 years of data set

Page 71: Big Data Trend with Open Platform

Crime Data
Los Angeles 2014

[Pie chart: total occurrences of each crime; categories: criminal, vandalism, others, burglary, assault, traffic, theft; shares shown: 2%, 8%, 9%, 12%, 17%, 19%, 33%]

Page 72: Big Data Trend with Open Platform

Total No. of Crimes in 2014

[Bar chart: No. of crimes per month, months 1 to 12; y-axis from 0 to 25,000; data labels as extracted: 19169, 17384, 19730, 19413, 20645, 20494, 21480, 21280, 21287, 21669, 19844, 21355]
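
The twelve data labels on this chart (19169, 17384, 19730, 19413, 20645, 20494, 21480, 21280, 21287, 21669, 19844, 21355) can be summarized with a few lines of Python. This sketch assumes the labels appear in month order, January to December, which the extracted text suggests but does not guarantee; the variable names are chosen for this note.

```python
# Monthly crime counts for 2014, assumed to be in order for months 1..12.
monthly_crimes = [
    19169, 17384, 19730, 19413, 20645, 20494,
    21480, 21280, 21287, 21669, 19844, 21355,
]

total = sum(monthly_crimes)
average = total / len(monthly_crimes)

# Months are numbered from 1, so shift the list index by one.
busiest_month = monthly_crimes.index(max(monthly_crimes)) + 1
quietest_month = monthly_crimes.index(min(monthly_crimes)) + 1
```

Under that ordering assumption, the yearly total is 243,750, the busiest month is month 10, and the quietest is month 2.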

Page 73: Big Data Trend with Open Platform

Raw Data Projection on Map

Page 74: Big Data Trend with Open Platform

Mapping of Crimes Occurred within 5 miles of CalStateLA
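
A "within 5 miles" filter like the one behind this map can be sketched with the haversine great-circle distance. The slides do not say how the distance was computed, so this is an assumption, as are the approximate campus coordinates and the two sample records; haversine_miles and within_radius are hypothetical helper names.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(records, center, radius_miles=5.0):
    """Keep the records whose location is within radius_miles of center."""
    lat0, lon0 = center
    return [
        r for r in records
        if haversine_miles(lat0, lon0, r["lat"], r["lon"]) <= radius_miles
    ]


CAMPUS = (34.0667, -118.1684)  # approximate CalStateLA location (assumption)

# Invented sample records, not taken from the crime data set.
records = [
    {"id": "near", "lat": 34.0700, "lon": -118.1700},
    {"id": "far", "lat": 34.0689, "lon": -118.4452},
]
nearby = within_radius(records, CAMPUS)
```

The same predicate could be applied as a filter over the full crime data set, once per campus.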

Page 75: Big Data Trend with Open Platform

Mapping of Crimes Occurred within 5 miles of UCLA

Page 76: Big Data Trend with Open Platform

Mapping of Crimes Occurred within 5 miles of USC

Page 77: Big Data Trend with Open Platform

Mapping of Crimes Occurred within 5 miles of CalStateLA, UCLA and USC in 2015

Page 78: Big Data Trend with Open Platform

No. of crimes within 5 miles of CSULA, UCLA and USC by crime type

[Bar chart: crime types: assault, burglary, criminal, theft, traffic, vandalism, others; y-axis from 0 to 30,000; series: CSULA, UCLA, USC]

Page 79: Big Data Trend with Open Platform

Contents

Myself
Big Data
Spark
Spark and Hadoop
Open Platform
Use Cases
Future Trend

Page 80: Big Data Trend with Open Platform

Future Research Trend

Deep Learning
  TensorFlow and Spark
    – Yahoo, Intel, Google
    – Image recognition, prediction analysis
ChatBot
  Amazon Alexa API
  IBM Watson ChatBot API
  Google Home API
More into in-memory processing
    – Spark DataFrame, Dataset, ML
Cloud computing
    – IBM Bluemix, MS Azure, Google Cloud, Amazon AWS

Page 81: Big Data Trend with Open Platform

Questions?