Donghui Zhang [email protected] 2017-5 … · IBM ung Intel SAP osoft Apple acle...

54
Donghui Zhang [email protected] 2017-5-4 Host: NECINA DIG Co-Host: MIT CSSA

Transcript of Donghui Zhang [email protected] 2017-5 … · IBM ung Intel SAP osoft Apple acle...

Page 1: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Donghui Zhang [email protected]

2017-5-4

Host: NECINA DIG Co-Host: MIT CSSA

Page 2: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Your Background Familiar with big-data analytics?

Value = show you what’s “under the hood”.

Familiar with big-data platform?

Mostly review; Value = think about my opinions.

Just curious?

Value = general awareness.

Not interested in big data?

You are in the wrong room.

http://BigAnalyticsPlatform.com 2 (C) 2017 Donghui Zhang

Page 3: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Disclaimer The opinions expressed on this site are mine and

do not necessarily represent those of my employer.

BigAnalyticsPlatform.com is my personal blogging site. I currently work at Facebook.

3 http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 4: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Content Why

What

History

Technical How-Tos

Career Advice

Conclusions

http://BigAnalyticsPlatform.com 4 (C) 2017 Donghui Zhang

Page 6: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Why Big-Data Platform? Platform can be a competitive advantage.

Enable junior developers to quickly create robust applications.

Google thinks of itself as a systems engineering company.

6

Quote source: Todd Hoff. “Google Architecture”. http://highscalability.com/google-architecture

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 7: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

7

Data source: Yahoo Finance on 1/3/2017.

159 208 174

106

504

616

156

357

547

234 222

338

0100200300400500600700

IBM

Sam

sun

g

Inte

l

SA

P

Mic

roso

ft

Ap

ple

Ora

cle

Am

azo

n

Go

og

le

Ten

cen

t

Ali

bab

a

Fac

ebo

ok

1911 193819681972 1975 1976 197719941998199819992004

Ma

rke

t ca

p (

bil

lio

n$)

Company + year founded

All biggies have big data platforms

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 8: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

8

Data source: Yahoo Finance on 1/3/2017.

159 208 174

106

504

616

156

357

547

234 222

338

0100200300400500600700

IBM

Sam

sun

g

Inte

l

SA

P

Mic

roso

ft

Ap

ple

Ora

cle

Am

azo

n

Go

og

le

Ten

cen

t

Ali

bab

a

Fac

ebo

ok

1911 193819681972 1975 1976 197719941998199819992004

Ma

rke

t ca

p (

bil

lio

n$)

Company + year founded

All biggies have big data platforms

top 3 cloud service providers

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 9: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

9

Data source: Yahoo Finance on 1/3/2017.

159 208 174

106

504

616

156

357

547

234 222

338

0100200300400500600700

IBM

Sam

sun

g

Inte

l

SA

P

Mic

roso

ft

Ap

ple

Ora

cle

Am

azo

n

Go

og

le

Ten

cen

t

Ali

bab

a

Fac

ebo

ok

1911 193819681972 1975 1976 197719941998199819992004

Ma

rke

t ca

p (

bil

lio

n$)

Company + year founded

All biggies have big data platforms

Larry Ellison: “Amazon’s lead is

over”

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 10: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

10

Data source: Yahoo Finance on 1/3/2017.

159 208 174

106

504

616

156

357

547

234 222

338

0100200300400500600700

IBM

Sam

sun

g

Inte

l

SA

P

Mic

roso

ft

Ap

ple

Ora

cle

Am

azo

n

Go

og

le

Ten

cen

t

Ali

bab

a

Fac

ebo

ok

1911 193819681972 1975 1976 197719941998199819992004

Ma

rke

t ca

p (

bil

lio

n$)

Company + year founded

All biggies have big data platforms

Apple “Pie”

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 11: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

11

Data source: Yahoo Finance on 1/3/2017.

159 208 174

106

504

616

156

357

547

234 222

338

0100200300400500600700

IBM

Sam

sun

g

Inte

l

SA

P

Mic

roso

ft

Ap

ple

Ora

cle

Am

azo

n

Go

og

le

Ten

cen

t

Ali

bab

a

Fac

ebo

ok

1911 193819681972 1975 1976 197719941998199819992004

Ma

rke

t ca

p (

bil

lio

n$)

Company + year founded

All biggies have big data platforms

Samsung bought Joyant

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 12: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

12

Data source: Yahoo Finance on 1/3/2017.

159 208 174

106

504

616

156

357

547

234 222

338

0100200300400500600700

IBM

Sam

sun

g

Inte

l

SA

P

Mic

roso

ft

Ap

ple

Ora

cle

Am

azo

n

Go

og

le

Ten

cen

t

Ali

bab

a

Fac

ebo

ok

1911 193819681972 1975 1976 197719941998199819992004

Ma

rke

t ca

p (

bil

lio

n$)

Company + year founded

All biggies have big data platforms

Alibaba 2015: 377 sec (3,377 nodes Apsara) Tencent 2016: 134 sec (512 nodes OpenPower) Gray sort. See http://sortbenchmark.org

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 13: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Content Why

What

History

Technical How-Tos

Career Advice

Conclusions

http://BigAnalyticsPlatform.com 13 (C) 2017 Donghui Zhang

Page 14: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

What is Big Data? Big data sets

e.g. “This year our users uploaded 10X more videos; we have big data now.”

big volume, big variety, or big velocity

exceed existing data processing capabilities

Big data analytics

e.g. “We use big data to predict stock trends.”

Big data stack

software

platform

infrastructure

14 http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 15: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

The Big Data Stack

15

Analytics

Infrastructure Think IaaS such as AWS EC2. Networked VMs.

Platform Think PaaS such as Google App Engine. A platform for developing software.

Analytics Software Think SaaS such as Microsoft Office 365. Software that Data Scientists can use.

Reports, docs, ad hoc scripts...

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 16: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Google Stack

16

Infrastructure

Platform

Products

Custom-built machines; RedHat Linux

GFS/Colossus, BigTable, Spanner, MapReduce/Cloud Dataflow, Chubby,

Borg/Omega

search, advertising, gmail, docs, maps, youtube, cloud platform, …

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 17: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Sample Open-Source Stack

17

Infrastructure

Platform

Analytics Software

Analytics

VMs

Spark on YARN with Hive

Tableau, scikit-learn

Python scripts

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 19: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

5 V’s of Big Data Volume

Variety

Velocity

Value

Veracity

19

“Your small data can be my big data!”

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 20: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

5 V’s of Big Data Volume

Variety

Velocity

Value

Veracity

20

Lessons

• A key feature missing in RDBMS is variety.

RDBMS guru: “Put you data in a database!” Scientist: “My data is not relational.” RDBMS guru: “Make your data relational!” Scientist: “But it is not relational!”

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 21: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

5 V’s of Big Data Volume

Variety

Velocity

Value

Veracity

21

Streaming. ETL ELT: Load first, transform later.

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 22: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

5 V’s of Big Data Volume

Variety

Velocity

Value

Veracity

22

Lessons

• Do big data for increasing business value, not for tech.

• Read a book on building a startup.

http://BigAnalyticsPlatform.com

Source: Frank McSherry. “Scalability! But at what COST?” http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html

If you are going to use a big data system for yourself,

see if it is faster than your laptop.

Frank McSherry

(C) 2017 Donghui Zhang

Page 23: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

5 V’s of Big Data Volume

Variety

Velocity

Value

Veracity

23

Source: Philip Russom. “Best Practices for Data Lake Management”. https://tdwi.org/research/2016/10/checklist-data-lake-management.aspx

Lessons

• Use Data Lakes, not Data Swamps.

• Read Russom’s “Best Practices for Data lake Management”.

Data scientist: “My analysis suggested this billion-dollar action.” Manager: “Where was the data from?”

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 24: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Content Why

What

History

Technical How-Tos

Career Advice

Conclusions

http://BigAnalyticsPlatform.com 24 (C) 2017 Donghui Zhang

Page 25: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Big Data History

25

What goes around comes around.

Mike Stonebraker

Everything has prior art.

David DeWitt

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 26: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Big Data History 1969: relational model (Edgar F. Codd*)

1976: System R by IBM (Jim Gray*; transactions)

1986: Postgres (Mike Stonebraker*; ADT)

1990: Gamma (David DeWitt; shared nothing)

2004: MapReduce (Jeff Dean; flexibility)

2005: “One size doesn’t fit all” (Mike Stonebraker)

2006: Hadoop (Doug Cutting)

2011: Spark (Matei Zaharia)

2017: Death of shared nothing (David DeWitt)

26

* Turing Award Winners (1981, 1998, 2014). http://amturing.acm.org/byyear.cfm

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 27: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Big Data History

27

Lessons

• Don’t reinvent the wheels. • Read the editors’ intro for “the red book”. • Read "Architecture of a Database System". • Study favorite posts on HighScalability.

The red book: Bailis, Hellerstein, Stonebraker. “Readings in Database Systems”, 5th Ed. http://www.redbook.io HighScalability: http://highscalability.com/all-time-favorites

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 28: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Content Why

What

History

Technical How-Tos

Career Advice

Conclusions

http://BigAnalyticsPlatform.com 28 (C) 2017 Donghui Zhang

Page 29: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

How to Scale to Many Servers?

29

When your data is small

http://BigAnalyticsPlatform.com

clients

server

(C) 2017 Donghui Zhang

Page 30: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

How to Scale to Many Servers?

30

Use a load balancer

http://BigAnalyticsPlatform.com

clients

LB

servers

(C) 2017 Donghui Zhang

Page 31: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

How to Scale to Many Servers? Round-Robin DNS, Point of Presence, multi-level LB.

http://BigAnalyticsPlatform.com 31

LB clients

servers

POP

POP

POP

POP

POP

(C) 2017 Donghui Zhang

Page 33: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

33

Google Belgium Data Center

Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 34: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

34

Image source: Malte Schwarzkopf. "What does it take to make Google work at scale". https://docs.google.com/presentation/d/1OvJStE8aohGeI3y5BcYX8bBHwoHYCPu99A3KTTZElr0

Google Belgium Data Center

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 35: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Google Data Centers About 40 data centers

About 2 million machines

Machines are organized in containers each having 1,160 machines

30 racks of 40 machines

Sometimes double stacked

35

Data sources: James Pearn, “How many servers does Google have?” https://plus.google.com/+JamesPearn/posts/VaQu9sNxJuY “Learn How Google Works: in Gory Detail”. http://www.ppcblog.com/how-google-works

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 37: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

How to Evaluate a Distributed System Well-known goals

Useful (solve your business need)

Performant (high throughput, low latency)

Elastic (you may add/remove nodes)

Scalable (adding nodes improves performance)

Fault tolerant (deal with failures)

In addition, I’d advocate

Flexible (scaling, model, interface, architecture)

http://BigAnalyticsPlatform.com 37 (C) 2017 Donghui Zhang

Page 38: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Shared Nothing Shared Storage

http://BigAnalyticsPlatform.com 38

Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program

For 30 years, DW were shared nothing.

Now they are all shared storage.

Gamma Teradata Netezza Vertica

DB2/PE SQL Server PDW

Greenplum Asterdata

SciDB

Redshift Spectrum Snowflake Microsoft SQL DW Google BigQuery

(C) 2017 Donghui Zhang

Page 39: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Why Shared Storage? Flexible Scaling!

http://BigAnalyticsPlatform.com 39

Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program

in minutes

(C) 2017 Donghui Zhang

Page 40: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Case Study: Snowflake (flexible scaling)

S3 DATA

STORAGE

COMPUTE

LAYER

VIRTUAL

WAREHOUSE

N

1

N

2

N

3

N

4 CLUSTER OF EC2 INSTANCES

DATA CACHE

VIRTUAL

WAREHOUSE

N

1

N

2

VIRTUAL

WAREHOUSE

N

1

N

2

N

3

N

4

N

5

N

6

N

7

N

8

CLOUD

SERVICES

AUTHENTICATION & ACCESS CONTROL

QUERY

OPTIMIZER

TRANSACTION

MANAGER

INFRASTRUCTURE

MANAGER SECURITY

METADATA

STORAGE

Database tables stored here

These disks are strictly used as

caches

40

Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 41: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Case Study: Spark

http://BigAnalyticsPlatform.com 41

SparkSQL ML Streaming GraphX

Spark Core

RDD API DataFrame API

Standalone YARN MESOS Local

Java/Scala/Python/R shell/script

(C) 2017 Donghui Zhang

Page 42: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Case Study: Spark (Flexible Model)

http://BigAnalyticsPlatform.com 42

SparkSQL ML Streaming GraphX

Spark Core

RDD API DataFrame API

Standalone YARN MESOS Local

Java/Scala/Python/R shell/script

Not only SQL, but also ML, streaming, graph.

(C) 2017 Donghui Zhang

Page 43: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Case Study: Spark (Flexible Interface)

http://BigAnalyticsPlatform.com 43

SparkSQL ML Streaming GraphX

Spark Core

RDD API DataFrame API

Standalone YARN MESOS Local

Java/Scala/Python/R shell/script

You could access Spark using traditional JDBC.

Also, interactive session (in multiple languages).

Also, submit a script as a task.

(C) 2017 Donghui Zhang

Page 44: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Case Study: Spark (Flexible Architecture)

http://BigAnalyticsPlatform.com 44

SparkSQL ML Streaming GraphX

Spark Core

RDD API DataFrame API

Standalone YARN MESOS Local

Java/Scala/Python/R shell/script

May deploy on top of existing YARN or MESOS.

Could also be standalone.

Possible to add components.

(C) 2017 Donghui Zhang

Page 45: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

How to Evaluate a Distributed System

http://BigAnalyticsPlatform.com 45

Lessons

• Flexibility is an important metric. • Spark is a flexible system. • Cloud DW: shared storage.

(C) 2017 Donghui Zhang

In addition to well-known goals

Useful, Performant, Elastic, Scalable, Fault tolerant

I’d advocate

Flexible (scaling, model, interface, architecture)

Page 46: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Content Why

What

History

Technical How-Tos

Career Advice

Conclusions

http://BigAnalyticsPlatform.com 46 (C) 2017 Donghui Zhang

Page 47: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Growing Need for Big Data Jobs

47

Source: https://www.indeed.com/jobtrends

10X in 5 years

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 48: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Big Data Roles Chief Data Officer

Data Scientist

Data Engineer

Solutions Architect

Big Data Strategist

...... at least 15 more

48

Source: “Top 20 Big Data jobs and their responsibilities”. http://bigdata-madesimple.com/top-20-big-data-jobs-and-their-responsibilities

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 49: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

If You Want to Do Analytics Python

Numpy, Jupyter Notebook

Machine Learning

Scikit-learn

Practice at http://DrivenData.org

http://BigAnalyticsPlatform.com 49 (C) 2017 Donghui Zhang

Page 50: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

If You Want to Do Big Data Platform Only for senior engineers

Practice at http://LeetCode.com

Embrace open source

Assemble a solution; don’t build from scratch

Consulting business: target medium-sized companies

http://BigAnalyticsPlatform.com 50 (C) 2017 Donghui Zhang

Page 51: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

If You Want to Build A Startup Read some books about building a startup

Don’t assume you know users’ pain point

Throw away prototype code

Three key people must have good working relationship: What-To-Do, How-To-Do, and When-To-Do

When in doubt, keep it simple

Strive for a clean API (external and internal)

Do one thing really well first

http://BigAnalyticsPlatform.com 51 (C) 2017 Donghui Zhang

Page 52: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Stonebraker’s Startup Loop while (true)

{

1. Talk with users to find their pain;

2. Brainstorm with professors;

3. Recruit students to build a prototype;

4. Draw a quadrant; E.g.

5. Co-found a VC-backed startup;

6. Play banjo; write papers; give talks; receive awards;

}

E.g. Streambase, Vertica, VoltDB, Paradigm4, Tamr, …

E.g. Received ACM Turing Award 2014

52

Small Big

Simple

Complex

http://BigAnalyticsPlatform.com (C) 2017 Donghui Zhang

Page 53: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Content Why

What

History

Technical How-Tos

Career Advice

Conclusions

http://BigAnalyticsPlatform.com 53 (C) 2017 Donghui Zhang

Page 54: Donghui Zhang dzhang@BigAnalyticsPlatform.com 2017-5 … · IBM ung Intel SAP osoft Apple acle Amazon le Tencent a Facebook ... The red book: Bailis, Hellerstein, ... Netezza Vertica

Conclusions All “biggies” have big-data platform

Shared nothing shared storage

Leverage on open source: pick/compose/expand

Flexibility is a key metric for distributed systems

http://BigAnalyticsPlatform.com 54 (C) 2017 Donghui Zhang