Ncku csie talk about Spark

My talk at NCKU CSIE.

Page 1: Ncku csie talk about Spark

Introduction to Spark

Wisely Chen ([email protected])

Sr. Engineer at Yahoo

Page 2: Ncku csie talk about Spark

Agenda

• Big data will change the world?

• What is Spark?

• Demo (start a Spark cluster / word count)

• Break: 10 min

• Spark concepts

• Demo (ETL / MLlib)

• Q&A

Page 3: Ncku csie talk about Spark

Who am I?

• Wisely Chen ([email protected])

• Sr. Engineer on the Yahoo! Taiwan data team

• Loves to promote open source tech

• Hadoop Summit 2013 San Jose

• Jenkins Conf 2013 Palo Alto

• Spark Summit 2014 San Francisco

• COSCUP 2006, 2012, 2013, OSDC 2007, WebConf 2013, PHPConf 2012, RubyConf 2012

Page 4: Ncku csie talk about Spark

Taiwan Data Team

Data Highway

BI Report

Serving API

Data Mart

ETL / Forecast

Machine Learning

Page 5: Ncku csie talk about Spark

Big data will change the world?

Page 6: Ncku csie talk about Spark

Sensor

Data

Machine Learning

Robot

Page 7: Ncku csie talk about Spark

More Sensor

Data

Machine Learning

Robot

Page 8: Ncku csie talk about Spark

Human Action Data

Machine Learning

Robot

What is a sensor?

Page 9: Ncku csie talk about Spark

Internet of Things

Page 10: Ncku csie talk about Spark

More Sensors

• 2000

• Browser

• Digital camera (photos)

• 2000–2014

• Browser

• Mobile (GPS, more photos, video)

• Wearable devices (pulse)

• Google Glass (more video)

• Internet of Things (………)

Page 11: Ncku csie talk about Spark

More Sensor

Bigger Data

Machine Learning

Robot

Page 12: Ncku csie talk about Spark

More Sensors

• 2000

• Browser

• Digital camera (photos)

• 2000–2014

• Browser

• Mobile (GPS, more photos, video)

• Wearable devices (pulse)

• Google Glass (more video)

• Internet of Things (………)

Page 13: Ncku csie talk about Spark

Technology Improves

• The Sloan Digital Sky Survey (SDSS) collected more data in its first few weeks than had been amassed in the entire history of astronomy.

• The Large Synoptic Survey Telescope in Chile, due to come on stream in 2016, will acquire that quantity of data every five days.

Page 14: Ncku csie talk about Spark

New Areas

A 30-minute zebrafish experiment = 1 TB (http://research.janelia.org/zebrafish/)

Page 15: Ncku csie talk about Spark

Hadoop handles big data well

• 18M Hadoop-related jobs on the Yahoo Grid

• Yahoo handles over 440 PB of data daily

• Most jobs are ETL / SQL / BI

Page 16: Ncku csie talk about Spark

eBay's data volume: 2015: 130 EB; 2020: 4,000 ZB

Vadim Kutsyy, "Data Science Empowering Personalization", Big Data Innovation Summit 2014, Boston

Page 17: Ncku csie talk about Spark

Data is not only bigger

• We have more areas of data

• More sensors

• Sensor technology improves

• New areas

Page 18: Ncku csie talk about Spark

More Sensor

Bigger Data

Better Machine Learning

Robot

Page 19: Ncku csie talk about Spark

Word Grammar Check

• MS researchers Michele and Eric tried to improve a grammar-check algorithm

• They took four algorithms and fed in 10M, 100M, and 1B words

• At 10M words, the sophisticated algorithm (86%) worked better than the simpler algorithm (75%)

• At 1B words, the simpler algorithm (95%+) improved a lot, even beating the sophisticated algorithm (94%)

Page 20: Ncku csie talk about Spark

–Google AI guru Peter Norvig, "The Unreasonable Effectiveness of Data”

“Simple models and a lot of data trump more elaborate models based on less data”

In the area of translation

Page 21: Ncku csie talk about Spark

Different types of data

• In a Harvard data mining class, two teams did the Netflix recommendation challenge

• Team A came up with a very sophisticated algorithm using the Netflix data

• Team B used a very simple algorithm, but added in additional data beyond the Netflix set

• Team B got much better results, close to the best results on the Netflix leaderboard

Page 22: Ncku csie talk about Spark

Taiwan Shopping User Analysis

Men tend to view underwear, but they don't buy it

Male users' top 5 viewed categories: 1. Computers 2. Cameras 3. ………. 4. ………. 5. Women's underwear

Page 23: Ncku csie talk about Spark

2 types of data

Traffic data: user's views / clicks; "weak intention"; large amount

Transaction data: user's checkouts; "strong intention"; small amount

Page 24: Ncku csie talk about Spark

Small data + sophisticated algo → OK result

Big data + simple algo → better result

Data set 1 + data set 2 + a smart model that leverages more areas of data → best result

Page 25: Ncku csie talk about Spark

More Sensor

Bigger Data

Better Machine Learning

Helpful Robot

Page 26: Ncku csie talk about Spark

Robots

• Foxconn's robots can replace 70% of workers

• Google's driverless car / Big Dog

• Amazon's warehouse robots

Page 27: Ncku csie talk about Spark

More Sensor

Bigger Data

Better Machine Learning

Helpful Robot

Page 28: Ncku csie talk about Spark

It is not a movie; it is happening

Page 29: Ncku csie talk about Spark

Sensor

Data

Machine Learning

Robot

User Behavior

Recommendation Algorithm

Recommendation to user

(1/3 of sales come from the recommendation module)

Amazon laid off its editorial team and replaced it with the recommendation algorithm

Page 30: Ncku csie talk about Spark

Sensor

Data

Machine Learning

Robot

Weather, humidity, sun, …

Give more water to area A

Page 31: Ncku csie talk about Spark

Sensor

Data

Machine Learning

Robot

DNNresearch, Behavio, Wavii, Flutter, Autofuss, DeepMind, spider.io, Adometry, Quest Visual, Jetpac, Talaria, Stackdriver, SCHAFT, Industrial Perception, Redwood Robotics, Meka Robotics, Holomni, Bot & Dolly, Boston Dynamics, Titan Aerospace, Nest Labs, MyEnergy, Skybox Imaging, Dropcam

Google bought 47 companies in 2013–14

IoT

Google is the #1 leader in big data

24 of those companies are on the ring

Page 32: Ncku csie talk about Spark

Sensor

Data

Machine Learning

Robot

Page 33: Ncku csie talk about Spark

Sensor

Data

Machine Learning

Robot

Be part of it!!!

Page 34: Ncku csie talk about Spark

The ring will change the world, and big data is the core of the ring

Page 35: Ncku csie talk about Spark

Sensor

Data

Machine Learning

Robot

Hadoop

Page 36: Ncku csie talk about Spark

Sensor

Data

Machine Learning

Robot

Hadoop

Not so well

Page 37: Ncku csie talk about Spark

Hadoop is not good at machine learning

• Efficiency

• Difficult

• Data Engineer

• Data Scientist

• Data Analyst

• Algorithm: it is not so easy to parallelize your algorithm

Page 38: Ncku csie talk about Spark

Sensor

Data

Machine Learning

Robot

Page 39: Ncku csie talk about Spark

Hadoop is not good at machine learning

• Efficiency

• Difficult: data scientists don't know how to do it

• Algorithm: it is not so easy to parallelize your algorithm

Page 40: Ncku csie talk about Spark

3x–25x faster than the MapReduce framework!

From Matei's paper: http://0rz.tw/VVqgP

Running time (s):

Logistic regression: MR 76 | Spark 3

KMeans: MR 106 | Spark 33

PageRank: MR 171 | Spark 23

Page 41: Ncku csie talk about Spark

Hadoop is not good at machine learning

• Efficiency

• Difficult

• Algorithm: it is not so easy to parallelize your algorithm

Page 42: Ncku csie talk about Spark

Data Analyst

Data Engineer

Data Scientist

Page 43: Ncku csie talk about Spark

Language Support

• Python: Data Scientist, Data Engineer

• Java: Data Engineer

• Scala: Data Engineer

• SQL: Data Scientist, Data Analyst, Data Engineer

• R: Data Scientist, Data Analyst

• (R will be officially supported in 1.2)

Page 44: Ncku csie talk about Spark

Python Word Count

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Access data via the Spark API

Process via Python
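For readers without a cluster handy, the same flatMap / map / reduceByKey pipeline can be sketched in plain Python; this is a stand-in for the Spark API, not Spark itself:

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split every line into words
    words = (w for line in lines for w in line.split(" ") if w)
    # map + reduceByKey: emit (word, 1) pairs and sum the 1s per key
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)

print(word_count(["hello spark", "hello world"]))
# {'hello': 2, 'spark': 1, 'world': 1}
```

On a cluster, Spark does the same thing per partition and merges the per-key sums in the shuffle.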

Page 45: Ncku csie talk about Spark

Scala Word Count

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 46: Ncku csie talk about Spark

Java Word Count

JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");

Page 47: Ncku csie talk about Spark

Highly Recommended

• Scala: latest API features, stable

• Python

• a very familiar language

• native libs: NumPy, SciPy

Page 48: Ncku csie talk about Spark

What is Spark?

• From the UC Berkeley AMP Lab

• Apache Spark™ is a very fast and general engine for large-scale data processing

• The most active big data open source project since Hadoop

Page 49: Ncku csie talk about Spark

Community

Page 50: Ncku csie talk about Spark

Where is Spark?

Page 51: Ncku csie talk about Spark

Hadoop 2.0

HDFS

YARN

MapReduce | Storm | HBase | Others

Page 52: Ncku csie talk about Spark

Hadoop Architecture

Storage: HDFS

Resource Management: YARN

Computing Engine: MapReduce

SQL: Hive

Page 53: Ncku csie talk about Spark

Hadoop vs. Spark

Storage: HDFS

Resource Management: YARN

Computing Engine: MapReduce vs. Spark

SQL: Hive vs. Shark/SparkSQL

Page 54: Ncku csie talk about Spark

More than MapReduce

Spark Core: MapReduce | SparkSQL: Hive | GraphX: Pregel | MLlib: Mahout | Streaming: Storm

Resource Management System (YARN, Mesos)

HDFS

Page 55: Ncku csie talk about Spark

How to use it?

• 1. go to https://spark.apache.org/

• 2. Download and unzip it

• 3. ./sbin/start-all.sh or ./bin/spark-shell

Page 56: Ncku csie talk about Spark

EC2

• ./ec2/spark-ec2 -k xxx -i xxx -s 3 launch CLUSTERNAME


• http://spark.apache.org/docs/latest/ec2-scripts.html

Page 57: Ncku csie talk about Spark

DEMO

Page 58: Ncku csie talk about Spark

Python Word Count

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Page 59: Ncku csie talk about Spark

BREAK

Page 60: Ncku csie talk about Spark

Spark Concept

Page 61: Ncku csie talk about Spark

Why is Spark so fast?

Page 62: Ncku csie talk about Spark

Most machine learning algorithms need iterative computing

Page 63: Ncku csie talk about Spark

PageRank

[figure: a four-node graph (a, b, c, d); all ranks start at 1.0. Rank / tmp-result values per iteration: 1st iter 1.0, 1.0, 1.0, 1.0; 2nd iter 1.85, 1.0, 0.58, 0.58; 3rd iter 1.31, 1.72, 0.39, 0.58]
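The iterative computation above can be reproduced in a few lines of plain Python. The link structure below is an assumption for illustration (the slide does not show the edges), so the exact rank values differ from the figure:

```python
def pagerank(links, iters=3, d=0.85):
    """One rank update per iteration: each page splits its rank over its out-links."""
    ranks = {p: 1.0 for p in links}
    for _ in range(iters):
        contribs = {p: 0.0 for p in links}
        for p, outs in links.items():
            for q in outs:
                contribs[q] += ranks[p] / len(outs)
        # damping: keep a (1 - d) base rank plus d times the received contributions
        ranks = {p: (1 - d) + d * c for p, c in contribs.items()}
    return ranks

# hypothetical 4-node graph
links = {"a": ["b", "d"], "b": ["a"], "c": ["a"], "d": ["a", "c"]}
print(pagerank(links))
```

Each iteration needs the full rank table from the previous one, which is exactly why keeping that intermediate result in memory (Spark) beats writing it to HDFS (MapReduce).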

Page 64: Ncku csie talk about Spark

HDFS is 100x slower than memory

MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → … → Iter N

Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → … → Iter N

Page 65: Ncku csie talk about Spark

PageRank algorithm on 1 billion URL records:

First iteration (HDFS) takes 200 sec

2nd iteration (mem) takes 7.4 sec

3rd iteration (mem) takes 7.7 sec

Page 66: Ncku csie talk about Spark

Memory Size Problem

Cache storage in local disk (2 sec)

Cache storage in memory (2 sec)

Network transfer (30 sec)

Page 67: Ncku csie talk about Spark

Just memory?

• From Matei's paper: http://0rz.tw/VVqgP

• HBM: stores data in an in-memory HDFS instance

• SP: Spark

• HBM'1, SP'1: first run

• Storage: HDFS with 256 MB blocks

• Nodes: m1.xlarge EC2 instances, 4 cores, 15 GB of RAM

Page 68: Ncku csie talk about Spark

100 GB of data on a 100-node cluster; running time (s):

Logistic regression: HBM'1 139 | HBM 62 | SP'1 46 | SP 3

KMeans: HBM'1 182 | HBM 87 | SP'1 82 | SP 33

Page 69: Ncku csie talk about Spark

MapReduce

Input (HDFS) → map → shuffle → reduce → Output (HDFS)

Page 70: Ncku csie talk about Spark

MapReduce

Input (HDFS) → map (Map phase) → shuffle → reduce (Reduce phase) → Output (HDFS)

Page 71: Ncku csie talk about Spark

Map (narrow) dependencies: map, filter, union, join with co-partitioned inputs

Reduce (wide) dependencies: groupBy on non-partitioned data, join with inputs not co-partitioned

Page 72: Ncku csie talk about Spark

DAG Engine

groupBy

map

union

join

Page 73: Ncku csie talk about Spark

Hadoop (4 MR jobs)

[the groupBy, map, union, and join each run as a separate MapReduce job: MR1–MR4]

Page 74: Ncku csie talk about Spark

Spark (2 MR jobs, 1 map)

[the same DAG runs as MR1, one map-only stage, and a final MR job]

Page 75: Ncku csie talk about Spark

MapReduce: Input (HDFS) → MR1 → Tmp (HDFS) → MR2 → Tmp (HDFS) → MR3 → Tmp (HDFS) → MR4 → Output (HDFS)

Spark: Input (HDFS) → MR1 → Tmp (MEM) → MAP → Tmp (MEM) → MR4 → Output (HDFS)

Page 76: Ncku csie talk about Spark

CACHE

[the groupBy / map / union / join DAG is split into Stage 1 and Stage 2, with intermediate results cached between stages]

Page 77: Ncku csie talk about Spark

RDD

• Resilient Distributed Dataset

• An interface to data, stored in RAM or on disk

• Built through parallel transformations

Page 78: Ncku csie talk about Spark

RDD

val a = sc.textFile("hdfs://....")                 // RDD a
val b = a.filter(line => line.contains("Spark"))   // RDD b (transformation)
val c = b.count()                                  // Value c (action)
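The key point is that transformations are lazy and only an action triggers the work. That behavior can be imitated with Python generators (a toy illustration, not the Spark scheduler):

```python
reads = []

def text_file(lines):
    # generator: yields lazily, recording when a line is actually read
    for line in lines:
        reads.append(line)
        yield line

a = text_file(["Spark is fast", "plain Hadoop", "Spark SQL"])
b = (line for line in a if "Spark" in line)  # "transformation": nothing runs yet
assert reads == []                           # no line has been read so far
c = sum(1 for _ in b)                        # "action": forces the whole pipeline
print(c, len(reads))
# 2 3
```

Laziness lets Spark see the whole chain of transformations before running anything, which is what makes DAG-level optimizations possible.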

Page 79: Ncku csie talk about Spark

Log mining

a = sc.textFile("hdfs://aaa.com/a.txt")
err = a.filter(lambda t: "ERROR" in t) \
       .filter(lambda t: "2014" in t)
err.cache()
err.count()
m = err.filter(lambda t: "MYSQL" in t).count()
a = err.filter(lambda t: "APACHE" in t).count()

[Driver dispatches a Task to each of the three Workers]

Page 80: Ncku csie talk about Spark

Log mining (same code as Page 79)

[each Worker loads its HDFS block (Block1, Block2, Block3) into a partition of RDD a]

Page 81: Ncku csie talk about Spark

Log mining (same code as Page 79)

[each Worker filters its block (Block1–3) into a partition of RDD err]

Page 82: Ncku csie talk about Spark

Log mining (same code as Page 79)

[the action runs over the partitions of RDD err on each Worker (Block1–3)]

Page 83: Ncku csie talk about Spark

Log mining (same code as Page 79)

[after err.cache(), the filtered partitions are materialized as Cache1, Cache2, Cache3]

Page 84: Ncku csie talk about Spark

Log mining (same code as Page 79)

[RDD m (the MYSQL filter) is computed from the cached partitions Cache1–3]

Page 85: Ncku csie talk about Spark

Log mining (same code as Page 79)

[RDD a (the APACHE filter) likewise reuses the cached partitions Cache1–3]

Page 86: Ncku csie talk about Spark

RDD Cache

1st iteration (no cache) takes the same time; with cache, it takes 7 sec
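The effect of err.cache() can be mimicked with simple memoization: the first action pays for the full scan, and later actions reuse the materialized result. A toy sketch (the sleep stands in for an HDFS read, not real Spark behavior):

```python
import time

LINES = ["2014 ERROR MYSQL down", "2014 INFO ok", "2014 ERROR APACHE 500"]

def full_scan():
    time.sleep(0.05)            # stands in for reading the file from HDFS
    return [l for l in LINES if "ERROR" in l]

_cache = None
def errors():                   # cached-RDD analogue
    global _cache
    if _cache is None:
        _cache = full_scan()    # the first action materializes the data
    return _cache

t0 = time.time(); n = len(errors()); first = time.time() - t0
t0 = time.time(); m = len([l for l in errors() if "MYSQL" in l]); second = time.time() - t0
print(n, m, second < first)
# 2 1 True
```

This mirrors the log-mining demo: count() pays the scan once, and the later MYSQL / APACHE counts run against the cached partitions.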

Page 87: Ncku csie talk about Spark

RDD Cache

• Data locality

• Cache

A big shuffle takes 20 min; after cache, it takes only 265 ms (self-join of 5 billion records)

Page 88: Ncku csie talk about Spark

DEMO

Page 89: Ncku csie talk about Spark

Log Mining

Page 90: Ncku csie talk about Spark

Page Rank

Page 91: Ncku csie talk about Spark

PageRank

[figure: a four-node graph (a, b, c, d); all ranks start at 1.0. Rank / tmp-result values per iteration: 1st iter 1.0, 1.0, 1.0, 1.0; 2nd iter 1.85, 1.0, 0.58, 0.58; 3rd iter 1.31, 1.72, 0.39, 0.58]

Page 92: Ncku csie talk about Spark

SparkSQL

Page 93: Ncku csie talk about Spark

Recommendation

Page 94: Ncku csie talk about Spark
Page 95: Ncku csie talk about Spark

MLlib

• Data: [(36, 2802, 4.0), (36, 256, 4.0), …] (user, movie, rating tuples)

• rank and numIter are ints; lambda is the regularization parameter

• candidates: [(0, 2), (0, 3), (0, 4), …] (user, movie pairs to score)

• model = ALS.train(data, rank, numIter, lambda)

• model.predictAll(candidates)
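ALS learns low-rank user and item factor vectors, and a predicted rating is just their dot product. A minimal sketch with hand-picked (not trained) factors; the IDs echo the slide's data, but the factor values are invented:

```python
def predict(user_vec, item_vec):
    # predicted rating = dot product of the two latent-factor vectors
    return sum(u * i for u, i in zip(user_vec, item_vec))

user_factors = {36: [1.2, 0.8]}                     # hypothetical rank-2 factors
item_factors = {2802: [2.0, 1.5], 256: [1.0, 0.5]}

candidates = [(36, 2802), (36, 256)]
preds = sorted(((predict(user_factors[u], item_factors[m]), u, m)
                for u, m in candidates), reverse=True)
print(preds)
```

ALS.train's job is to find factor matrices like these that reproduce the known ratings; predictAll then scores every candidate pair with this same dot product.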

Page 96: Ncku csie talk about Spark

Homework

• 1. Install Spark and run word count (50%)

• Data: http://www.gutenberg.org/ebooks/5000

• Output: total word count

• 2. Write a movie recommendation (50%)

• Training data: http://arbor.ee.ntu.edu.tw/~wisely/data/lesson.tgz

• Input: 10 ratings (1–5) on 10 movies

• Example: movie 123's rating is 3, movie 45's is 5

• Output: top 10 recommended movies

• Any algorithm is OK

Page 97: Ncku csie talk about Spark
Page 98: Ncku csie talk about Spark

BI (SparkSQL)

Streaming (SparkStreaming)

Machine Learning (MLlib)

Spark

Page 99: Ncku csie talk about Spark

Background Knowledge

• Tweet real-time data is stored into a SQL database

• Spark MLlib uses Wikipedia data to train a TF-IDF model

• SparkSQL selects tweets and filters them with the TF-IDF model

• Generates a live BI report

Page 100: Ncku csie talk about Spark

Code

val wiki = sql("select text from wiki")
val model = new TFIDF()
model.train(wiki)
registerFunction("similarity", model.similarity _)

select tweet from tweet where similarity(tweet, "$search") > 0.01
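The TFIDF class is not spelled out on the slide; a rough pure-Python sketch of TF-IDF weighting plus cosine similarity (the names and formula details here are assumptions) shows the idea behind the similarity() function used in the query:

```python
import math
from collections import Counter

def tfidf(doc, corpus):
    """TF-IDF vector for one document, scored against a small corpus."""
    tf = Counter(doc.lower().split())
    n = len(corpus)
    df = lambda w: sum(1 for d in corpus if w in d.lower().split())
    # weight = term frequency * log(N / document frequency)
    return {w: c * math.log(n / df(w)) for w, c in tf.items() if df(w) > 0}

def similarity(a, b):
    # cosine similarity between two sparse vectors
    num = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

wiki = ["spark is a cluster engine", "cats like milk", "spark sql runs sql"]
query = tfidf("spark engine", wiki)
scores = [similarity(query, tfidf(d, wiki)) for d in wiki]
print(scores)
```

In the demo, the trained model plays the role of tfidf() here, and the registered SQL function applies this kind of score to each tweet against the search term.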

Page 101: Ncku csie talk about Spark

DEMO

http://youtu.be/dJQ5lV5Tldw?t=39m30s

Page 102: Ncku csie talk about Spark

Q & A