Osd ctw spark
![Page 1: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/1.jpg)
Spark: Next-generation cloud computing engine
Wisely Chen
![Page 2: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/2.jpg)
Agenda
• What is Spark?
• Next big thing
• How to use Spark?
• Demo
• Q&A
![Page 3: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/3.jpg)
Who am I?
• Wisely Chen ( [email protected] )
• Sr. Engineer on the Yahoo! Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013, PHPConf 2012, RubyConf 2012
![Page 4: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/4.jpg)
Taiwan Data Team
Data Highway
BI Report
Serving API
Data Mart
ETL / Forecast
Machine Learning
![Page 5: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/5.jpg)
Machine Learning
Distributed Computing
Big Data
![Page 6: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/6.jpg)
Recommendation
Forecast
![Page 7: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/7.jpg)
HADOOP
![Page 8: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/8.jpg)
Faster ML
Distributed Computing
Bigger Big Data
![Page 9: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/9.jpg)
Opinion from Cloudera
• The leading candidate for “successor to MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.
• From http://0rz.tw/y3OfM
![Page 10: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/10.jpg)
What is Spark
• From UC Berkeley AMP Lab
• The most active big data open source project since Hadoop
![Page 11: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/11.jpg)
Where is Spark?
![Page 12: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/12.jpg)
Hadoop 2.0
• HDFS
• YARN
• MapReduce, Storm, HBase, Others
![Page 13: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/13.jpg)
Hadoop Architecture
• Storage: HDFS
• Resource Management: YARN
• Computing Engine: MapReduce
• SQL: Hive
![Page 14: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/14.jpg)
Hadoop vs Spark
• Storage: HDFS
• Resource Management: YARN
• Computing Engine: MapReduce vs Spark
• SQL: Hive vs Shark
![Page 15: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/15.jpg)
Spark vs Hadoop
• Spark runs on YARN, Mesos, or in standalone mode
• Spark’s main concept is based on MapReduce
• Spark can read from
• HDFS: data locality
• HBase
• Cassandra
![Page 16: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/16.jpg)
More than MapReduce
• Spark Core: MapReduce
• Shark: Hive / GraphX: Pregel / MLlib: Mahout / Streaming: Storm
• Storage: HDFS
• Resource Management System (YARN, Mesos)
![Page 17: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/17.jpg)
Why Spark?
![Page 18: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/18.jpg)
天下武功,無堅不破,惟快不破 (No martial art is unbreakable; only speed cannot be beaten.)
![Page 19: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/19.jpg)
3X~25X faster than the MapReduce framework!
From Matei's paper: http://0rz.tw/VVqgP
[Bar charts, running time (s), MR vs Spark. Logistic regression: MR 76, Spark 3. KMeans: MR 106, Spark 33. PageRank: MR 171, Spark 23.]
![Page 20: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/20.jpg)
What is Spark
• Apache Spark™ is a very fast and general engine for large-scale data processing
![Page 21: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/21.jpg)
Why is Spark so fast?
![Page 22: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/22.jpg)
HDFS
• ~100X slower than memory
• Stores data to network + disk
• Network speed is ~100X slower than memory
• Implements fault tolerance
![Page 23: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/23.jpg)
MapReduce PageRank

    .....readInputFromHDFS...
    for (int runs = 0; runs < iter_runnumber; runs++) {
      ..............
      isCompleted = runRankCalculation(inPath, lastResultPath);
      ............
    }
    .....writeOutputToHDFS....
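The iterative structure above is just PageRank's power iteration. A minimal single-machine sketch in Python (hypothetical, not the actual Hadoop job; `pagerank` and its parameters are invented for illustration) makes the loop concrete:

```python
# Minimal PageRank power iteration (illustrative sketch, not the Hadoop job).
# links maps each node to its list of outgoing neighbors.
def pagerank(links, iterations=10, d=0.85):
    nodes = list(links)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):          # each pass = one "runRankCalculation"
        contribs = {n: 0.0 for n in nodes}
        for node, outs in links.items():
            for out in outs:             # spread rank over outgoing links
                contribs[out] += ranks[node] / len(outs)
        ranks = {n: (1 - d) / len(nodes) + d * c for n, c in contribs.items()}
    return ranks

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

In MapReduce, every pass through this loop writes its intermediate ranks back to HDFS before the next pass can start; the following slides show how Spark keeps them in memory instead.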
![Page 24: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/24.jpg)
Workflow

MapReduce: Input HDFS → Iter 1 RunRank → Tmp HDFS → Iter 2 RunRank → Tmp HDFS → ... → Iter N RunRank
Spark: Input HDFS → Iter 1 RunRank → Tmp Mem → Iter 2 RunRank → Tmp Mem → ... → Iter N RunRank
![Page 25: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/25.jpg)
PageRank on 1 billion URL records: the first iteration takes 200 sec; the 2nd and 3rd iterations take 20 sec each.
![Page 26: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/26.jpg)
RDD
• Resilient Distributed Dataset
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
![Page 27: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/27.jpg)
Fault Tolerance
天下武功,無堅不破,惟快不破 (No martial art is unbreakable; only speed cannot be beaten.)
![Page 28: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/28.jpg)
RDD
RDD a → RDD b (Transformation), RDD b → Value c (Action)

    val a = sc.textFile("hdfs://...")
    val b = a.filter( line => line.contains("Spark") )   // Transformation
    val c = b.count()                                    // Action
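The transformation/action split can be mimicked on one machine with a toy class (a sketch; `MiniRDD` is invented here and is nothing like Spark's real internals):

```python
# Toy stand-in for an RDD, to show lazy transformations vs. eager actions.
class MiniRDD:
    def __init__(self, gen_fn):
        self._gen_fn = gen_fn            # a recipe, not materialized data

    def filter(self, pred):              # transformation: returns a new recipe
        return MiniRDD(lambda: (x for x in self._gen_fn() if pred(x)))

    def count(self):                     # action: actually runs the pipeline
        return sum(1 for _ in self._gen_fn())

lines = ["Spark is fast", "Hadoop MapReduce", "Spark on YARN"]
a = MiniRDD(lambda: iter(lines))
b = a.filter(lambda line: "Spark" in line)   # nothing is computed yet
c = b.count()                                # the pipeline executes here
```

Calling `filter` only records what to do; work happens when `count` demands a value, which is the shape of Spark's lazy evaluation.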
![Page 29: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/29.jpg)
Log mining
    val a = sc.textFile("hdfs://aaa.com/a.txt")
    val err = a.filter( t => t.contains("ERROR") )
               .filter( t => t.contains("2014") )
    err.cache()
    err.count()
    val m = err.filter( t => t.contains("MYSQL") ).count()
    val a = err.filter( t => t.contains("APACHE") ).count()

[Diagram: the driver dispatches tasks to three workers]
![Page 30: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/30.jpg)
Log mining
(Same log-mining code as on the previous slide.)

[Diagram: each worker reads one HDFS block (Block1-Block3) into RDD a]
![Page 31: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/31.jpg)
Log mining
(Same log-mining code as on the previous slide.)

[Diagram: each worker filters its block into RDD err]
![Page 32: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/32.jpg)
Log mining
(Same log-mining code as on the previous slide.)

[Diagram: RDD err on each worker, built from Block1-Block3]
![Page 33: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/33.jpg)
Log mining
(Same log-mining code as on the previous slide.)

[Diagram: err.cache() keeps RDD err in memory on each worker (Cache1-Cache3)]
![Page 34: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/34.jpg)
Log mining
(Same log-mining code as on the previous slide.)

[Diagram: RDD m is computed from the cached partitions]
![Page 35: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/35.jpg)
Log mining
(Same log-mining code as on the previous slide.)

[Diagram: the APACHE count is likewise computed from the cached partitions]
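The point of err.cache() in this walkthrough: without it, the MYSQL and APACHE counts would each re-read and re-filter the source; with it, the filtered data is reused. A plain-Python sketch of that difference (`read_log` and `err_lines` are hypothetical stand-ins for the HDFS read and the filter lineage):

```python
# Sketch: recomputation vs. caching. We count how often the "expensive"
# source is re-read with and without a cache.
reads = {"n": 0}

def read_log():                      # stands in for reading the HDFS file
    reads["n"] += 1
    return ["2014 ERROR MYSQL down", "2013 INFO ok", "2014 ERROR APACHE 500"]

def err_lines():                     # the filter lineage, re-run on demand
    return [t for t in read_log() if "ERROR" in t and "2014" in t]

# Without cache: every action re-runs the whole lineage from the source.
m = len([t for t in err_lines() if "MYSQL" in t])
a = len([t for t in err_lines() if "APACHE" in t])
uncached_reads = reads["n"]          # the source was read twice

# With cache: materialize err once, then reuse it for both counts.
reads["n"] = 0
err = err_lines()                    # like err.cache(); err.count()
m = len([t for t in err if "MYSQL" in t])
a = len([t for t in err if "APACHE" in t])
cached_reads = reads["n"]            # the source was read once
```

The same trade-off drives the iteration timings on the next slide: only the first pass pays the full input cost.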
![Page 36: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/36.jpg)
RDD Cache: the 1st iteration (no cache) takes the same time; with cache, it takes 7 sec.
![Page 37: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/37.jpg)
RDD Cache
• Data locality
• Cache: a self-join on 5 billion records is a big shuffle that takes 20 min; after caching, it takes only 265 ms
![Page 38: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/38.jpg)
Easy to use
• Interactive shell
• Multi-language API
• JVM: Scala, Java
• PySpark: Python
![Page 39: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/39.jpg)
Scala Word Count

    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
![Page 40: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/40.jpg)
Step by Step
• file.flatMap(line => line.split(" ")) => (aaa, bb, cc)
• .map(word => (word, 1)) => ((aaa,1), (bb,1), ...)
• .reduceByKey(_ + _) => ((aaa,123), (bb,23), ...)
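The three steps can be reproduced over an in-memory list in plain Python (a local sketch, not PySpark; the sample lines are invented):

```python
from collections import Counter
from itertools import chain

lines = ["aaa bb", "aaa cc", "bb aaa"]

# flatMap: split every line into words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with 1, e.g. ('aaa', 1), ('bb', 1), ...
pairs = [(word, 1) for word in words]

# reduceByKey(_ + _): sum the 1s per key
counts = Counter()
for word, one in pairs:
    counts[word] += one
```

Each stage here corresponds line for line to the Scala version; only the distribution across a cluster is missing.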
![Page 41: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/41.jpg)
Java Wordcount

    JavaRDD<String> file = spark.textFile("hdfs://...");
    JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
    });
    JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
    });
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) { return a + b; }
    });
    counts.saveAsTextFile("hdfs://...");
![Page 42: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/42.jpg)
Java vs Scala
• Scala: file.flatMap(line => line.split(" "))
• Java version:

    JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    });
![Page 43: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/43.jpg)
Python

    file = spark.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")
![Page 44: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/44.jpg)
Highly Recommended
• Scala: latest API features, stable
• Python
  • very familiar language
  • native libs: NumPy, SciPy
![Page 45: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/45.jpg)
FYI
• Combiner: reduceByKey(_ + _)
• Typical WordCount:

    groupByKey().mapValues { arr =>
      var r = 0; arr.foreach { i => r += i }; r
    }
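The difference can be simulated locally: with groupByKey every (word, 1) pair crosses the network, while reduceByKey combines within each partition first (a map-side combine). A plain-Python sketch (the two-partition split is invented for illustration):

```python
from collections import defaultdict

pairs = [("aaa", 1), ("bb", 1), ("aaa", 1), ("aaa", 1), ("bb", 1)]
partitions = [pairs[:3], pairs[3:]]      # pretend these live on two nodes

# groupByKey-style: every pair is shipped over the "network" as-is
shuffled_group = [p for part in partitions for p in part]

# reduceByKey-style: combine within each partition before shipping
def local_combine(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

shuffled_reduce = [p for part in partitions for p in local_combine(part)]

def merge(shuffled):                     # the reduce side after the shuffle
    acc = defaultdict(int)
    for k, v in shuffled:
        acc[k] += v
    return dict(acc)
```

Both routes produce the same counts, but the reduceByKey route ships fewer records, which is what the next slide's diagram contrasts.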
![Page 46: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/46.jpg)
WordCount
• reduceByKey: reduces a lot on the map side
• Hadoop-style shuffle: sends a lot of data over the network
![Page 47: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/47.jpg)
DEMO
![Page 48: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/48.jpg)
• Check in on FB at the Yahoo! recruiting post and get a Yahoo! rubber duck
• Check in on FB saying "Yahoo! APP is awesome!!" with a screenshot of the Shopping Mall or News APP, and use the check-in record to get a duck wrist rest or a shopping bag
![Page 49: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/49.jpg)
Just memory?
• From Matei's paper: http://0rz.tw/VVqgP
• HBM: stores data in an in-memory HDFS instance
• SP: Spark
• HBM'1, SP'1: first run
• Storage: HDFS with 256 MB blocks
• Nodes: m1.xlarge EC2 instances, 4 cores and 15 GB of RAM each
![Page 50: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/50.jpg)
100GB data on a 100-node cluster
[Bar charts, running time (s), for HBM'1, HBM, SP'1, SP. Logistic regression: Spark (SP) takes 3 s, the slowest bar (HBM'1) 139 s. KMeans: Spark takes 33 s, the slowest bar 182 s.]
![Page 51: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/51.jpg)
There is more
• General DAG scheduler
• Control over partition shuffle
• Fast driver RPC to launch tasks
• For more info, check http://0rz.tw/jwYwI
![Page 52: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/52.jpg)
![Page 53: Osd ctw spark](https://reader033.fdocuments.in/reader033/viewer/2022050903/53fde6b38d7f72a81c8b4bb2/html5/thumbnails/53.jpg)