Big Data Trend and Open Data

Page 1: Big Data Trend and Open Data

Jongwook Woo

HiPIC

CalStateLA

UKC 2016

Dallas, TX, Aug 12, 2016

Jongwook Woo, PhD, [email protected]

High-Performance Information Computing Center (HiPIC), California State University Los Angeles

Big Data Trend and Open Data

Page 2: Big Data Trend and Open Data

High Performance Information Computing Center, Jongwook Woo

CalStateLA

Contents

Myself
Introduction to Big Data
Introduction to Spark
Spark and Hadoop
Open Data and Use Cases
Hadoop Spark Training

Page 3: Big Data Trend and Open Data

Myself

Experience
Since 2002: Professor at California State Univ Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Since 1998: R&D consulting in Hollywood
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
– Information search and integration with FAST, Lucene/Solr, Sphinx
– Implemented eBusiness applications using J2EE and middleware
Since 2007: Exposed to Big Data at CitySearch.com
2012 – Present: Big Data academic partnerships
– For Big Data research and training
• Amazon AWS, Microsoft Azure, IBM Bluemix
• Databricks, Hadoop vendors

Page 4: Big Data Trend and Open Data

Myself

Experience (Cont'd)
Bringing Big Data R&D and training to Korea since 2009
Collaborating with LA city in 2016
– Collect, search, and analyze city data
• Hadoop, Solr, Java, Cloudera
Sept 2013: Samsung Advanced Technology Training Institute
Since 2008
– Introducing Hadoop Big Data and education to universities and research centers
• Yonsei, Gachon
• US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
• Europe: Univ of Luxembourg

Page 5: Big Data Trend and Open Data

Experience in Big Data

Collaboration
– Council Member of IBM Spark Technology Center
– City of Los Angeles for OpenHub and Open Data
– Startup companies in Los Angeles
– External Collaborator and Advisor in Big Data
• IMSC of USC
• Pennsylvania State University
Grants
– IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
Partnership
– Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata

Page 6: Big Data Trend and Open Data

Contents

Myself
Introduction to Big Data
Introduction to Spark
Spark and Hadoop
Open Data and Use Cases
Hadoop Spark Training

Page 7: Big Data Trend and Open Data

Data Issues

Large-scale data
– Terabyte (10^12), Petabyte (10^15)
• Because of the web
• Sensor data (IoT), bioinformatics, social computing, streaming data, smart phones, online games…
Cannot be handled with the legacy approach
– Too big
– Non-/semi-structured data
– Too expensive
Need new systems
– Inexpensive

Page 8: Big Data Trend and Open Data

Two Cores in Big Data

How to store Big Data
How to compute Big Data
Google
– How to store Big Data
• GFS
• Distributed systems on inexpensive commodity computers
– How to compute Big Data
• MapReduce
• Parallel computing with inexpensive computers
– Own supercomputers
– Published papers in 2003, 2004
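The MapReduce idea named on this slide can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration, and a real cluster would run the map and reduce calls in parallel on different machines.

```python
from collections import defaultdict


def map_phase(document):
    # Emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group emitted values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Sum the counts collected for one word.
    return key, sum(values)


def word_count(documents):
    # Run map over every document, shuffle, then reduce each key group.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up input: each string stands for one input split.
counts = word_count(["big data", "open data", "big open data"])
print(counts)
```

Because each `map_phase` call only sees its own document and each `reduce_phase` call only sees one key, both phases can be spread over many inexpensive machines.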

Page 9: Big Data Trend and Open Data

What is Hadoop?

Hadoop founder: Doug Cutting
– Apache committer: Lucene, Nutch, …

Page 10: Big Data Trend and Open Data

Definition: Big Data

Inexpensive frameworks that can store large-scale data and process it faster in parallel
Hadoop
– Inexpensive supercomputer
– More accessible to the public than traditional supercomputers
• You can store and process your applications
• In your university labs, small companies, research centers

Page 11: Big Data Trend and Open Data

Hadoop Cluster: Logical Diagram

[Figure: logical diagram of a Hadoop cluster. Labels: web browser of the cluster monitor (CM/Ambari), HTTP(S), Cluster Monitor, Agent and Hadoop on each node, HDFS on each node, Hive, ZooKeeper, Impala]

Page 12: Big Data Trend and Open Data

Contents

Myself
Introduction to Big Data
Introduction to Spark
Spark and Hadoop
Open Data and Use Cases
Hadoop Spark Training

Page 13: Big Data Trend and Open Data

Alternative to Hadoop MapReduce

Limitations of MapReduce
– Hard to program in Java
– Batch processing
• Not interactive
– Disk storage for intermediate data
• Performance issue
Spark by UC Berkeley AMP Lab
– In-memory storage for intermediate data
– 20 ~ 100 times faster than MapReduce, which goes through network and disk

Page 14: Big Data Trend and Open Data

Spark

In-memory data computing
– Faster than Hadoop MapReduce
Can integrate with Hadoop and its ecosystems
– HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase…
New programming with faster data sharing
– Good in complex multi-stage applications
• Iterative graph algorithms, machine learning
– Interactive query

Page 15: Big Data Trend and Open Data

Spark

Spark Core: RDDs, transformations, and actions
Spark Streaming (real-time): DStreams, streams of RDDs
Spark SQL: SchemaRDDs, DataFrames
MLlib, ML (machine learning): RDD-based matrices
GraphX (graph): RDD-based matrices
SparkR: RDD-based matrices

Page 16: Big Data Trend and Open Data

RDD Operations

Transformations
– Define new RDDs from the current one
• Lazy: not computed immediately
– map(), filter(), join()
Actions
– Return values
– count(), collect(), take(), save()
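The difference between lazy transformations and actions can be shown without a cluster. The class below is a toy stand-in written in plain Python, not the Spark API: `LazyDataset` and its internals are hypothetical names, and only `map`, `filter`, `collect`, `count`, and `take` are mimicked. Transformations just record a step; nothing runs until an action is called.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset description, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Apply the recorded steps to each element, one element at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return values.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


calls = []


def square(x):
    calls.append(x)  # record every real invocation
    return x * x


evens = LazyDataset(range(10)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # still zero: nothing has been computed yet
result = evens.collect()          # the action runs the recorded steps
```

In this sketch every action recomputes the whole chain, which is also why caching matters for datasets that are reused.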

Page 17: Big Data Trend and Open Data

Programming in Spark

Scala
– Functional programming
• The fundamental unit of programming is the function
• Input/output is a function
• No side effects
• No states
Python
– Legacy, large libraries
Java

Page 18: Big Data Trend and Open Data

Spark

Spark SQL
– DataFrame
• Turning an RDD into a relation
– Querying using SQL
Spark Streaming
– DStream
• RDD in streaming
• Windows
• To select DStream from streaming data
MLlib, ML
– Sparse vector support, decision trees, linear/logistic regression, PCA
– Pipeline
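The windowing idea for streams can be sketched in plain Python. This is not Spark Streaming code: `windowed_counts`, `window_length`, and `slide_interval` are illustrative names, each inner list stands for one micro-batch, and the data is made up. The sketch counts the records in the last `window_length` batches every `slide_interval` batches.

```python
from collections import deque


def windowed_counts(batches, window_length, slide_interval):
    # Keep only the most recent `window_length` micro-batches.
    window = deque(maxlen=window_length)
    results = []
    for index, batch in enumerate(batches, start=1):
        window.append(batch)
        # Emit one result every `slide_interval` batches.
        if index % slide_interval == 0:
            results.append(sum(len(b) for b in window))
    return results


# Made-up stream: five micro-batches of records.
batches = [["a", "b"], ["c"], [], ["d", "e", "f"], ["g"]]
every_batch = windowed_counts(batches, window_length=3, slide_interval=1)
every_other = windowed_counts(batches, window_length=3, slide_interval=2)
```

A longer window smooths the counts, while a longer slide interval produces results less often.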

Page 19: Big Data Trend and Open Data

Contents

Myself
Introduction to Big Data
Introduction to Spark
Spark and Hadoop
Open Data and Use Cases
Hadoop Spark Training

Page 20: Big Data Trend and Open Data

Spark

Spark
– File system: Tachyon
– Resource manager: Mesos
But Hadoop has been dominating the market
– Integrating Spark into Hadoop cluster
– Cloud computing
• Amazon AWS, Azure HDInsight, IBM Bluemix
• Object Storage, S3
– Hadoop vendors
• HDP, CDH
Databricks: Spark on AWS
– No Hadoop ecosystems

Page 21: Big Data Trend and Open Data

Spark with Hadoop YARN

ResourceManager (RM)
– Per cluster
– Creates Spark AM and allocates containers for Spark AM
NodeManager (NM)
– Per node
– Spark workers
ApplicationMaster (AM)
– Per application
– Containers for Spark Executors

[Figure: Spark on YARN. Labels: Spark Client; Master Node with ResourceManager; Slave Nodes with NodeManager (three shown), Spark AM, and Container: Spark Executor]

Page 22: Big Data Trend and Open Data

Big Data Analysis Flow

Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase)…
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI tools (Datameer, Qlik, …)
Data Visualization
– Qlik, Datameer, Excel PowerView

Page 23: Big Data Trend and Open Data

Databricks cluster at CalStateLA

Page 24: Big Data Trend and Open Data

Contents

Myself
Introduction to Big Data
Introduction to Spark
Spark and Hadoop
Open Data and Use Cases
Use Cases
Hadoop Spark Training

Page 25: Big Data Trend and Open Data

Open Data

USA government
– Federal, state, city governments
– Expose data to the public
USA business
– Twitter, Yelp, …
– Expose data to the public with APIs
• Some restrictions on downloads
City government
– New York
• Taxi, Uber, …
– Los Angeles
• Open Data, Open Hub with geo info

Page 26: Big Data Trend and Open Data

Open Big Data Analysis in CalStateLA

Social media data analysis
– Twitter sentiment analysis for AlphaGo
Open data from government
– Airline data analysis
– Crime data analysis
Web service API
– Business data analysis from Yelp and Google Places API

Page 27: Big Data Trend and Open Data

Data from Industry: Twitter Data

Systems
– Azure HDInsight Spark
– 8 nodes
• 40 cores: 2.4GHz Intel Xeon
• Memory, each node: 28 GB
Data source
– Keyword ‘alphago’ from Twitter via Apache NiFi
Data size
– 63,193 tweets
Real-time data collection period
– 03/12 – 03/17/2016
• No data collected on 03/13

Page 28: Big Data Trend and Open Data

Top 10 Countries that Tweet “AlphaGo”

Page 29: Big Data Trend and Open Data

Top 10 Countries

# of tweets per country
– USA: > 11,000
– Japan: > 9,000
– Korea: > 1,900
– Russia, UK: > 1,600
– Thailand, France: > 1,000
– Netherlands, Spain, Ukraine: > 600

Page 30: Big Data Trend and Open Data

Top 10 Countries Sentiment

[Figure: sentiment chart for the top 10 countries; legend: Positive, Negative]

Page 31: Big Data Trend and Open Data

Top 10 Countries

Most tweeted countries
– All countries show more positive tweets
• Korea, Japan, USA

Country   Positive   Negative
USA       5070       3567
Japan     8118       217
Korea     1053       407
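The positive and negative counts in the table can be turned into a positive share per country, which makes the three countries easier to compare. The snippet below is a small worked calculation in Python; the counts come from the table, while the names `sentiment`, `positive_share`, and `shares` are illustrative, and the share ignores any tweets that were classified as neither positive nor negative.

```python
# (positive, negative) tweet counts per country, copied from the table.
sentiment = {
    "USA": (5070, 3567),
    "Japan": (8118, 217),
    "Korea": (1053, 407),
}


def positive_share(positive, negative):
    # Fraction of the classified tweets that are positive.
    return positive / (positive + negative)


shares = {
    country: round(positive_share(pos, neg), 3)
    for country, (pos, neg) in sentiment.items()
}
print(shares)
```

On these numbers the positive share is roughly 59% for the USA, 97% for Japan, and 72% for Korea.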

Page 32: Big Data Trend and Open Data

Daily Tweets in 03/12 – 03/17/2016

[Figure: daily tweet counts from 3/12/2016 to 3/17/2016; y-axis from 0 to 50,000. Annotations: AlphaGo vs Lee Sedol; Game 3: Mar 12; Game 4: Mar 13, Lee Se-Dol wins; Game 5: Mar 15]

Page 33: Big Data Trend and Open Data

Ngram words

3 words in a row right after the Go champion's name, “sedol” and “se-dol”

3-grams             Frequency
Again-to-win        1,187
Is-something-I’ll   369
Is-something-i      199
In-go-tournament    168
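Counting the three words that follow a keyword can be sketched as below. This is an assumed reconstruction of the idea in plain Python, not the code used for the slide: `trigrams_after` is a hypothetical helper, tokenization is a simple lowercase whitespace split, and the sample tweets are invented.

```python
from collections import Counter


def trigrams_after(tweets, keywords):
    # Count the 3 words that directly follow any of the keywords.
    counts = Counter()
    for tweet in tweets:
        words = tweet.lower().split()
        for i, word in enumerate(words):
            following = words[i + 1:i + 4]
            if word in keywords and len(following) == 3:
                counts["-".join(following)] += 1
    return counts


# Invented sample tweets, for illustration only.
sample = [
    "Lee Sedol again to win game four",
    "sedol again to win",
    "AlphaGo beats se-dol in go tournament today",
]
top = trigrams_after(sample, {"sedol", "se-dol"}).most_common(2)
print(top)
```

Real tweets would need more careful tokenization (punctuation, hashtags, retweet markers) before the counts are meaningful.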

Page 34: Big Data Trend and Open Data

Sentiment Map of AlphaGo

[Figure: sentiment map; legend: Positive, Negative]

Page 35: Big Data Trend and Open Data

Sentiment Map of Lee Se-Dol vs AlphaGo

YouTube video: “alphago sentiment” by Google
The sentiment of the world in geo and time:

https://youtu.be/vAzdnj4fkOg?list=PLaEg1tCLuW0BYLqVS5RTbToiB8wQ2w14a

Page 36: Big Data Trend and Open Data

Federal Government: Airline Data Set

Government open data
– Airline data set in 2012 – 2014
• US Dept of Transportation
Cluster by Nillohit at HiPIC, CalStateLA
– Microsoft Azure using Hive and Spark SQL
– Number of data nodes: 4
• CPU: 4 cores; memory: 7 GB
• Windows Server 2012 R2 Datacenter

Page 37: Big Data Trend and Open Data

Airline Data Set

Page 38: Big Data Trend and Open Data

Airline Data Set

Page 39: Big Data Trend and Open Data

Airline Data Set

Page 40: Big Data Trend and Open Data

City Government: Crime Data Set

Open data in City of Los Angeles
– Crime data set in 2012-2015
– File size: 151 MB
– Total number of offenses: 8.94 million
Ram Dharan and Sridhar Reddy at HiPIC, CalStateLA
– Microsoft Azure using Hive and Spark SQL
– Number of data nodes: 4
• CPU: 4 cores; memory: 14 GB
• Windows Server 2012 R2 Datacenter
• Extending to the last 10 years of the data set

Page 41: Big Data Trend and Open Data

Projection of Raw Data

[Figure: number of crimes by category (assault, criminal, traffic, vandalism, others, theft) for 2012 to 2015; y-axis from 0 to 90,000]

Page 42: Big Data Trend and Open Data

Total No. of Crimes in 2012-15

[Figure: total number of crimes per month (months 1 to 12) for 2012 to 2015; y-axis from 0 to 25,000]

Page 43: Big Data Trend and Open Data

Mapping of Crimes That Occurred within 5 Miles of CalStateLA, UCLA, and USC in 2015

Page 44: Big Data Trend and Open Data

No. of Crimes for Every 5 Miles from CalStateLA

[Figure: number of crimes in each 5-mile band from CalStateLA (0-5, 5-10, 10-15, 15-20, 20-25, 25-30, 30-35, >35 miles) for 2012 to 2015; y-axis from 0 to 90,000]
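Grouping crimes into 5-mile bands around a campus needs a distance between two coordinates and a rule for mapping that distance to a band. The sketch below is one way to do it in plain Python, offered as an assumption rather than the method used for the charts: it uses the haversine great-circle formula, `haversine_miles` and `distance_band` are illustrative names, and the campus coordinates are approximate.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points in miles.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_band(miles, width=5, last_edge=35):
    # Map a distance to a label such as "0-5" or "5-10"; far points share one band.
    if miles >= last_edge:
        return f">{last_edge}"
    low = int(miles // width) * width
    return f"{low}-{low + width}"


# Approximate campus location (assumption, for illustration only).
CAMPUS = (34.0668, -118.1684)

# A point 0.1 degrees of latitude north of a reference point is about 6.9 miles away.
d = haversine_miles(34.0, -118.0, 34.1, -118.0)
band = distance_band(d)
```

Applied to every crime record, `distance_band` gives the key for a group-by count per band and year.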

Page 45: Big Data Trend and Open Data

No. of Crimes for Every 5 Miles from UCLA

[Figure: number of crimes in each 5-mile band from UCLA (0-5, 5-10, 10-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40 miles) for 2012 to 2015; y-axis from 0 to 120,000]

Page 46: Big Data Trend and Open Data

No. of Crimes for Every 5 Miles from USC

[Figure: number of crimes in each 5-mile band from USC (0-5, 5-10, 10-15, 15-20, 20-25, 25-30, 30-35, 35-40, >40 miles) for 2012 to 2015; y-axis from 0 to 120,000]

Page 47: Big Data Trend and Open Data

Comparison of Crimes for Every 5 Miles from CalStateLA, UCLA, and USC in 2015

[Figure: number of crimes in each distance band (0-5, 5-10, 10-15, 15-20, 20-25, 25-30, 30-35, 35-40, 40-50, >50 miles) for csula_2015, ucla_2015, and usc_2015; y-axis from 0 to 120,000]

Page 48: Big Data Trend and Open Data

No. of Crimes per Area in LA

[Figure: number of crimes per area (77th Street, Newton, Southwest, Van Nuys, Central, Foothill, Hollenbeck, N Hollywood, West Valley, Olympic, West LA) for 2012 to 2015; y-axis from 0 to 18,000]

Page 49: Big Data Trend and Open Data

Total No. of Crimes for Every 2 Hours in LA

[Figure: chart for 2012 to 2015; y-axis from 0 to 18,000. Extracted axis labels: 77th Street, Newton, Southwest, Van Nuys, Central, Foothill, Hollenbeck, N Hollywood, West Valley, Olympic, West LA]

Page 50: Big Data Trend and Open Data

No. of Crimes for Every 2 Hours within 5 Miles of CalStateLA, UCLA, and USC in 2015

[Figure: number of crimes per 2-hour bin (00:00-02:00 through 22:00-24:00) for usc, ucla, and csula; count axis from 0 to 12,000]
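The 2-hour bins on the chart axis can be derived from the hour of each crime record. The following is a minimal Python sketch under the assumption that each record carries an hour of day from 0 to 23; `two_hour_bin` and `crimes_per_bin` are illustrative names and the sample hours are made up.

```python
from collections import Counter


def two_hour_bin(hour):
    # Map an hour of day (0-23) to a label such as "12:00-14:00".
    start = (hour // 2) * 2
    return f"{start:02d}:00-{start + 2:02d}:00"


def crimes_per_bin(hours):
    # Count how many records fall into each 2-hour bin.
    return Counter(two_hour_bin(hour) for hour in hours)


# Made-up hours of occurrence, for illustration only.
sample_hours = [0, 1, 13, 23, 22, 12]
bin_counts = crimes_per_bin(sample_hours)
print(bin_counts)
```

The same label can serve as a group-by key in Hive or Spark SQL once the hour has been extracted from the timestamp column.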

Page 51: Big Data Trend and Open Data

BUSINESS DATA ANALYSIS

DATA SET DETAILS
• Yelp Review Data: 1.9 GB
• Business Data: 500 MB
• Web Service API from Yelp and Google Places

[Figure: data flow. Labels: Yelp Challenge Data Set, Google Places, Yelp Data, Join, Analysis]

Page 52: Big Data Trend and Open Data

Top 10 businesses within 5 miles from CalStateLA (with 5 or 4 star ratings)

[Figure: bar chart of business counts by category; y-axis from 0 to 40; bar values 34, 31, 29, 26, 19, 19, 15, 15, 15. Categories in legend order: Hair Salons, Auto Repair, General Dentistry, Insurance, Churches, Skin Care, Chiropractors, Barbers, Elementary Schools]

• Hair Salons and Insurance are popular qualified business categories

Page 53: Big Data Trend and Open Data

Businesses popular within 5 miles of CalStateLA, USC, and UCLA

Page 54: Big Data Trend and Open Data

Number of food businesses within a 0-25 mile radius of CalStateLA, USC, and UCLA

CalStateLA has more food businesses within 5 miles than UCLA and USC

[Figure: number of food businesses per distance band (0-5, 5-10, 10-15, 15-20, 20-25 miles) for CSULA, USC, and UCLA; y-axis from 0 to 600]

Page 55: Big Data Trend and Open Data

Hydrogen Gas Power Plant Prediction Model

The Cal State L.A. Hydrogen Research and Fueling Facility (H2 Station) opened on May 7, 2014.

Page 56: Big Data Trend and Open Data

Hydrogen Gas Power Plant Prediction Model

The station produces hydrogen for hydrogen vehicles.

The Cal State L.A. Hydrogen Research and Fueling Facility is the first station in the nation to sell hydrogen fuel to the public.

Hyundai, Toyota

Page 57: Big Data Trend and Open Data

Hydrogen Gas Power Plant Prediction Model

Workflow

Page 58: Big Data Trend and Open Data

Hydrogen Gas Power Plant Prediction Model

Model by Manvi Chandra

Page 59: Big Data Trend and Open Data

Hydrogen Gas Power Plant Prediction Model

Results and observations

Page 60: Big Data Trend and Open Data

Hydrogen Gas Power Plant Prediction Model

Results and observations
– Can predict vehicle pressure
• Pressure of hydrogen gas within the vehicle hydrogen storage system
• Using our model in Azure Visual Studio ML
• Building Spark ML
Decision forest regression
– Constructs a multitude of decision trees at training time
– The output is
• the mode of the classes (classification)
• the mean prediction (regression) of the individual trees
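How a decision forest combines its trees can be shown with a few hand-written stand-ins for trained trees. This Python sketch only illustrates the aggregation step (mean for regression, mode for classification); it is not the Azure or Spark ML model from the talk, and the feature names `temp` and `flow`, the thresholds, and the predicted values are all invented.

```python
from collections import Counter
from statistics import mean


def forest_regress(trees, x):
    # Regression: average the predictions of the individual trees.
    return mean(tree(x) for tree in trees)


def forest_classify(trees, x):
    # Classification: return the most common class among the trees' votes.
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical one-split "trees"; a real forest learns these from training data.
regression_trees = [
    lambda x: 300.0 if x["temp"] < 20 else 350.0,
    lambda x: 310.0 if x["flow"] < 1.0 else 340.0,
    lambda x: 320.0,
]
classification_trees = [
    lambda x: "high" if x["temp"] >= 20 else "low",
    lambda x: "low" if x["flow"] < 1.0 else "high",
    lambda x: "high",
]

reading = {"temp": 25, "flow": 0.5}  # invented sensor reading
pressure = forest_regress(regression_trees, reading)
label = forest_classify(classification_trees, reading)
```

Averaging many imperfect trees tends to reduce the variance of the prediction compared with any single tree.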

Page 61: Big Data Trend and Open Data

Contents

Myself
Introduction to Big Data
Introduction to Spark
Spark and Hadoop
Open Data and Use Cases
Hadoop Spark Training

Page 62: Big Data Trend and Open Data

Spark Big Data Training and R&D

HiPIC, California State University Los Angeles
Supported by
– Databricks and its cloud computing services
– Amazon AWS, IBM Bluemix, MS Azure
– Hortonworks, Cloudera
– Datameer

Page 63: Big Data Trend and Open Data

Databricks Partners

Page 64: Big Data Trend and Open Data

Training Hadoop and Spark

Cloudera visits to interview Jongwook Woo

Page 65: Big Data Trend and Open Data

Training Hadoop on IBM Bluemix at California State Univ. Los Angeles

Page 66: Big Data Trend and Open Data

Question?

Page 67: Big Data Trend and Open Data

References

Hadoop, http://hadoop.apache.org
Apache Spark Word Count Example (http://spark.apache.org)
Databricks (http://www.databricks.com)
“Market Basket Analysis using Spark”, Jongwook Woo, in Journal of Science and Technology, April 2015, Volume 5, No 4, pp207-209, ISSN 2225-7217, ARPN
https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles

Page 68: Big Data Trend and Open Data

References

Introduction to Big Data with Apache Spark, Databricks
Stanford Spark Class (http://stanford.edu/~rezab)
Cornell University, CS5304
DS320: DataStax Enterprise Analytics with Spark
Cloudera, http://www.cloudera.com
Hortonworks, http://www.hortonworks.com
Spark 3 Use Cases, http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/

Page 69: Big Data Trend and Open Data

Scheduling Process

RDD Objects
– rdd1.join(rdd2).groupBy(…).filter(…)
– Optimizer: build operator DAG
DAGScheduler (receives the DAG)
– Split graph into stages of tasks
– Submit each stage as ready
– Agnostic to operators!
TaskScheduler (receives TaskSets)
– Launch tasks via cluster manager
– Retry failed or straggling tasks
– Doesn’t know about stages
Worker (receives tasks)
– Execute tasks (threads)
– Store and serve blocks (block manager)

[Figure: pipeline from RDD Objects to DAGScheduler to TaskScheduler to Worker, connected by DAG, TaskSet, cluster manager, and Task arrows, with a “stage failed” arrow]

Page 70: Big Data Trend and Open Data

Spark Components

Your program
sc = new SparkContext
f = sc.textFile("…")
f.filter(…).count()
...

Spark Driver/Client (app master)
– RDD graph
– Scheduler
– Block tracker
– Shuffle tracker
Cluster manager
Spark worker(s)
– Task threads
– Block manager
Storage
– HDFS, HBase, Amazon S3, Couchbase, Cassandra, …

Page 71: Big Data Trend and Open Data

Dependency Types

“Narrow” deps: a stage pipeline to be run on the same node
– map, filter
– union
– join with inputs co-partitioned
“Wide” (shuffle) deps: boundary of stages
– groupByKey
– join with inputs not co-partitioned
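The rule that wide dependencies mark stage boundaries can be sketched for a simple linear chain of operations. This is a deliberately simplified Python illustration, not how Spark's DAGScheduler is implemented: it ignores branching DAGs, `split_into_stages` is a hypothetical helper, and the label `join_not_copartitioned` is a made-up name for a join whose inputs are not co-partitioned.

```python
# Operations that need a shuffle; every other operation is treated as narrow.
WIDE = {"groupByKey", "join_not_copartitioned"}


def split_into_stages(operations):
    # Start a new stage whenever a wide (shuffle) dependency is reached;
    # consecutive narrow operations are pipelined inside one stage.
    stages = [[]]
    for op in operations:
        if op in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return [stage for stage in stages if stage]


pipeline = ["map", "filter", "groupByKey", "map", "union"]
stages = split_into_stages(pipeline)
print(stages)
```

Here `map` and `filter` run together in the first stage, and the shuffle required by `groupByKey` opens the second one.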

Page 73: Big Data Trend and Open Data

Scheduler Optimizations

Pipelines within a stage
– Stage 2: map, union
Join algorithms based on partitioning (minimize shuffles)
– Stage 3: join

[Figure: RDDs A to G grouped into Stage 1, Stage 2, and Stage 3, with groupBy, map, union, and join operators. Legend: previously computed partition; task]

Page 74: Big Data Trend and Open Data

Scheduler Optimizations

Conceptually
– Stage 1: 3 tasks
– Stage 2: 4 tasks
– Stage 3: 3 tasks
– Total: 3 stages, 10 tasks

[Figure: RDDs A to G grouped into Stage 1, Stage 2, and Stage 3, with groupBy, map, union, and join operators. Legend: previously computed partition; task]