OCF.tw's talk about "Introduction to spark"
Transcript of OCF.tw's talk about "Introduction to spark"
![Page 1: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/1.jpg)
Introduction to Spark
Wisely Chen (aka thegiive)
Sr. Engineer at Yahoo
![Page 2: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/2.jpg)
Agenda
• What is Spark? (Easy)
• Spark Concept (Medium)
• Break: 10 min
• Spark EcoSystem (Easy)
• Spark Future (Medium)
• Q&A
![Page 3: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/3.jpg)
Who am I?
• Wisely Chen
• Sr. Engineer on the Yahoo Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Spark Summit 2014 San Francisco
• Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013, PHPConf 2012, RubyConf 2012
![Page 4: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/4.jpg)
Taiwan Data Team
Data Highway
BI Report
Serving API
Data Mart
ETL / Forecast
Machine Learning
![Page 5: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/5.jpg)
![Page 6: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/6.jpg)
Recommendation
Forecast
![Page 7: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/7.jpg)
HADOOP
![Page 8: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/8.jpg)
Opinion from Cloudera
• The leading candidate for “successor to MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason. !
• From http://0rz.tw/y3OfM
![Page 9: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/9.jpg)
What is Spark
• From UC Berkeley AMP Lab
• The most active big data open source project since Hadoop
![Page 10: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/10.jpg)
Community
![Page 11: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/11.jpg)
Community
![Page 12: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/12.jpg)
Where is Spark?
![Page 13: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/13.jpg)
HDFS
YARN
MapReduce
Hadoop 2.0
Storm HBase Others
![Page 14: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/14.jpg)
HDFS
YARN
MapReduce
Hadoop Architecture
Hive
Storage
Resource Management
Computing Engine
SQL
![Page 15: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/15.jpg)
HDFS
YARN
MapReduce
Hadoop vs Spark
Spark
Hive Shark/SparkSQL
![Page 16: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/16.jpg)
Spark vs Hadoop
• Spark runs on YARN, on Mesos, or in standalone mode
• Spark’s main concept is based on MapReduce
• Spark can read from
• HDFS: data locality
• HBase
• Cassandra
![Page 17: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/17.jpg)
More than MapReduce
HDFS
Spark Core : MapReduce
Shark: Hive | GraphX: Pregel | MLlib: Mahout | Streaming: Storm
Resource Management System(Yarn, Mesos)
![Page 18: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/18.jpg)
Why Spark?
![Page 19: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/19.jpg)
Of all the martial arts under heaven, no strength is unbreakable; only speed cannot be beaten.
![Page 20: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/20.jpg)
3X~25X faster than the MapReduce framework
From Matei’s paper: http://0rz.tw/VVqgP
Running time in seconds (bar charts, MR vs Spark):
• Logistic regression: MR 76, Spark 3
• KMeans: MR 106, Spark 33
• PageRank: MR 171, Spark 23
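The "3X~25X" headline can be checked against the bar values with a few divisions. This is a minimal sketch: it assumes the numbers recovered from the chart are read correctly (the smaller value in each pair is taken to be Spark), and the name `RUNNING_TIME_S` is made up for the illustration.

```python
# Running times in seconds, as read from the slide's bar charts.
# Assumption: in each pair the smaller number is the Spark bar.
RUNNING_TIME_S = {
    "Logistic regression": {"MR": 76, "Spark": 3},
    "KMeans": {"MR": 106, "Spark": 33},
    "PageRank": {"MR": 171, "Spark": 23},
}


def speedups(times):
    """Return MapReduce time divided by Spark time for each workload."""
    return {name: t["MR"] / t["Spark"] for name, t in times.items()}


if __name__ == "__main__":
    for name, ratio in speedups(RUNNING_TIME_S).items():
        print(f"{name}: {ratio:.1f}x")
```

Under that reading, KMeans gives the low end of the range (about 3x) and logistic regression the high end (about 25x).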
![Page 21: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/21.jpg)
What is Spark
• Apache Spark™ is a very fast and general engine for large-scale data processing
![Page 22: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/22.jpg)
Language Support
• Python
• Java
• Scala
![Page 23: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/23.jpg)
Python Word Count
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Access data via the Spark API
Process via Python
![Page 24: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/24.jpg)
What is Spark
• Apache Spark™ is a very fast and general engine for large-scale data processing
![Page 25: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/25.jpg)
Why is Spark so fast?
![Page 26: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/26.jpg)
Most machine learning algorithms need iterative computing
![Page 27: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/27.jpg)
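PageRank
(Figure: a four-node graph a, b, c, d shown at the 1st, 2nd, and 3rd iterations; each iteration writes a temporary rank result that feeds the next.)
1st iteration: every node has rank 1.0
2nd iteration: a = 1.85; the other nodes show 1.0, 0.58, 0.58
3rd iteration: a = 1.31; the other nodes show 1.72, 0.39, 0.58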
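The iterative structure of PageRank can be sketched in plain Python. The update rule below (rank = 0.15 + 0.85 × received contributions) is the one used in Spark's own PageRank example. The link graph `LINKS` is an assumption: the slide's edges are not recoverable from the transcript, so these edges were chosen only because they reproduce the rank values shown on the slide.

```python
def pagerank(links, iterations, damping=0.85):
    """Run a fixed number of PageRank iterations over an adjacency dict."""
    ranks = {node: 1.0 for node in links}  # every node starts at 1.0
    for _ in range(iterations):
        # Each node splits its current rank evenly among its out-links.
        contribs = {node: 0.0 for node in links}
        for node, outs in links.items():
            share = ranks[node] / len(outs)
            for dest in outs:
                contribs[dest] += share
        # The output of this iteration is the input of the next one.
        ranks = {n: (1 - damping) + damping * c for n, c in contribs.items()}
    return ranks


# Hypothetical edges, picked so the ranks match the numbers on the slide.
LINKS = {"a": ["b"], "b": ["a", "d"], "c": ["a"], "d": ["a", "c"]}

if __name__ == "__main__":
    for i in (1, 2):
        print(i, {n: round(r, 3) for n, r in pagerank(LINKS, i).items()})
```

Every iteration reads the full result of the previous one, which is why the cost of storing that intermediate result dominates the running time.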
![Page 28: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/28.jpg)
HDFS is 100x slower than memory
MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → ... → Iter N
Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → ... → Iter N
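The two data flows can be imitated on one machine. This is only a toy sketch of the idea: a local temporary file stands in for HDFS, and `step` is a made-up iteration function. It shows where the intermediate result lives in each style and is not a benchmark.

```python
import json
import os
import tempfile


def step(values):
    """One toy iteration: halve every value and add 1."""
    return [v / 2 + 1 for v in values]


def iterate_via_disk(values, iterations):
    """MapReduce style: write each intermediate result to a file, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tmp.json")  # stand-in for a Tmp (HDFS) file
        for _ in range(iterations):
            with open(path, "w") as f:
                json.dump(step(values), f)
            with open(path) as f:
                values = json.load(f)
    return values


def iterate_in_memory(values, iterations):
    """Spark style: the intermediate result stays in memory between iterations."""
    for _ in range(iterations):
        values = step(values)
    return values


if __name__ == "__main__":
    print(iterate_via_disk([0.0, 2.0, 4.0], 3))
    print(iterate_in_memory([0.0, 2.0, 4.0], 3))
```

Both functions compute the same answer; the only difference is the write and read of the intermediate result on every iteration, which is the cost the slide attributes to HDFS.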
![Page 29: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/29.jpg)
PageRank algorithm on 1 billion URL records
First iteration (HDFS) takes 200 sec
2nd iteration (mem) takes 7.4 sec
3rd iteration (mem) takes 7.7 sec
![Page 30: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/30.jpg)
Spark Concept
![Page 31: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/31.jpg)
Shuffle
Map Reduce
![Page 32: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/32.jpg)
DAG Engine
![Page 33: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/33.jpg)
DAG Engine
![Page 34: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/34.jpg)
RDD
• Resilient Distributed Dataset
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
![Page 35: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/35.jpg)
Fault Tolerance
Of all the martial arts under heaven, no strength is unbreakable; only speed cannot be beaten.
![Page 36: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/36.jpg)
RDD
RDD a → RDD b → Value c
val a = sc.textFile("hdfs://....")                 // RDD a
val b = a.filter(line => line.contains("Spark"))   // Transformation: RDD a → RDD b
val c = b.count()                                  // Action: RDD b → Value c
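The split between transformations and actions can be imitated in a few lines of plain Python. This is a toy sketch, not Spark's implementation: `LazyDataset`, `read_lines`, and the sample lines are all hypothetical names and data made up for the illustration.

```python
class LazyDataset:
    """A toy stand-in for an RDD: a data source plus a list of pending filters."""

    def __init__(self, source, filters=()):
        self._source = source            # callable that returns the raw records
        self._filters = tuple(filters)

    def filter(self, predicate):
        # Transformation: returns a new dataset and does no work yet.
        return LazyDataset(self._source, self._filters + (predicate,))

    def count(self):
        # Action: only now is the source read and the filters applied.
        return sum(
            1 for rec in self._source() if all(p(rec) for p in self._filters)
        )


reads = []  # records every time the source is actually read


def read_lines():
    reads.append(1)
    return iter(["Spark is fast", "Hadoop MapReduce", "Spark on YARN"])


a = LazyDataset(read_lines)
b = a.filter(lambda line: "Spark" in line)  # transformation: nothing is read
reads_before_action = len(reads)
c = b.count()                               # action: the read happens here
```

Here `reads_before_action` stays at zero, and the source is read once when `count()` runs, which mirrors how `a.filter(...)` only describes RDD b while `b.count()` triggers the computation.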
![Page 37: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/37.jpg)
Log mining
val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter(t => t.contains("ERROR"))
           .filter(t => t.contains("2014"))

err.cache()
err.count()

val m = err.filter(t => t.contains("MYSQL"))
           .count()
// re-binding a is accepted in spark-shell; a compiled program needs a new name
val a = err.filter(t => t.contains("APACHE"))
           .count()
Driver
Worker: Task
Worker: Task
Worker: Task
![Page 38: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/38.jpg)
Log mining
Driver
Worker: Block1, RDD a
Worker: Block2, RDD a
Worker: Block3, RDD a
![Page 39: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/39.jpg)
Log mining
Driver
Worker: RDD err, Block1
Worker: RDD err, Block2
Worker: RDD err, Block3
![Page 40: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/40.jpg)
Log mining
Driver
Worker: RDD err, Block1
Worker: RDD err, Block2
Worker: RDD err, Block3
![Page 41: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/41.jpg)
Log mining
Driver
Worker: RDD err, Cache1
Worker: RDD err, Cache2
Worker: RDD err, Cache3
![Page 42: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/42.jpg)
Log mining
Driver
Worker: RDD m, Cache1
Worker: RDD m, Cache2
Worker: RDD m, Cache3
![Page 43: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/43.jpg)
Log mining
Driver
Worker: RDD a, Cache1
Worker: RDD a, Cache2
Worker: RDD a, Cache3
![Page 44: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/44.jpg)
RDD Cache
1st iteration (no cache) takes the same time
With cache, it takes 7 sec
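The effect of `cache()` in the log-mining walkthrough can be sketched without a cluster. This is a simplified model under stated assumptions: `CachedDataset`, `scan_log`, and the sample `LOG` lines are hypothetical, and a real RDD cache is partitioned across workers rather than held in one list.

```python
class CachedDataset:
    """Toy model of a cacheable dataset: recomputed on every use unless cached."""

    def __init__(self, compute):
        self._compute = compute   # callable that rebuilds the records from scratch
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; nothing is materialized until the first action.
        self._use_cache = True
        return self

    def records(self):
        if not self._use_cache:
            return list(self._compute())
        if self._cached is None:
            self._cached = list(self._compute())
        return self._cached

    def count_containing(self, word):
        return sum(1 for line in self.records() if word in line)


LOG = [
    "2014 ERROR MYSQL timeout",
    "2014 INFO APACHE started",
    "2014 ERROR APACHE crash",
    "2013 ERROR MYSQL down",
]
scans = []  # one entry per full scan of the log


def scan_log():
    scans.append(1)
    return [line for line in LOG if "ERROR" in line and "2014" in line]


err = CachedDataset(scan_log).cache()
total = len(err.records())               # first action: scans the log, fills the cache
mysql = err.count_containing("MYSQL")    # served from the cache
apache = err.count_containing("APACHE")  # served from the cache
```

The log is scanned once even though three results are computed from it; without the `cache()` call each of the three lines would trigger its own scan.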
![Page 45: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/45.jpg)
RDD Cache
• Data locality
• Cache
Self-join on 5 billion records
A big shuffle takes 20 min
After cache, it takes only 265 ms
![Page 46: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/46.jpg)
Scala Word Count
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
![Page 47: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/47.jpg)
Step by Step
• file.flatMap(line => line.split(" ")) => (aaa,bb,cc)
• .map(word => (word, 1)) => ((aaa,1),(bb,1)..)
• .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)
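The three steps can be traced on a tiny input in plain Python. This is a single-machine sketch of what each operator produces, not Spark code: `flat_map` and `reduce_by_key` are hypothetical helpers written for the illustration, and the two input lines are made up.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine the values of equal keys with func, like reduceByKey."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


lines = ["aaa bb aaa", "cc bb aaa"]

# flatMap: one list of words from all lines
words = flat_map(lambda line: line.split(" "), lines)

# map: each word becomes a (word, 1) pair
pairs = [(word, 1) for word in words]

# reduceByKey: the 1s of equal words are added together
counts = reduce_by_key(lambda a, b: a + b, pairs)

if __name__ == "__main__":
    print(words)
    print(pairs)
    print(counts)
```

In Spark the same three calls run in parallel over partitions, and `reduceByKey` involves a shuffle so that equal keys meet on the same worker.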
![Page 48: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/48.jpg)
Java Wordcount
JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
// mapToPair is the Spark 1.0+ name; earlier releases used map here
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
![Page 49: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/49.jpg)
Java vs Scala
• Scala: file.flatMap(line => line.split(" "))
• Java version:
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});
![Page 50: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/50.jpg)
Python
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
![Page 51: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/51.jpg)
Highly Recommended
• Scala: latest API features, stable
• Python
  • Very familiar language
  • Native libs: NumPy, SciPy
![Page 52: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/52.jpg)
How to use it?
• 1. Go to https://spark.apache.org/
• 2. Download and unzip it
• 3. Run ./sbin/start-all.sh (standalone cluster) or ./bin/spark-shell (interactive shell)
![Page 53: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/53.jpg)
DEMO
![Page 54: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/54.jpg)
EcoSystem/Future
![Page 55: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/55.jpg)
![Page 56: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/56.jpg)
Hadoop EcoSystem
![Page 57: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/57.jpg)
Hadoop EcoSystem
![Page 58: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/58.jpg)
Spark ECOSystem
HDFS
Spark Core : MapReduce
SparkSQL: Hive | GraphX: Pregel | MLlib: Mahout | Streaming: Storm
Resource Management System(Yarn, Mesos)
![Page 59: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/59.jpg)
Unified Platform
![Page 60: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/60.jpg)
Detail
SparkSQL
Spark
MLlib
Hive HDFS Cassandra RDBMS
Streaming BI ETL
![Page 61: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/61.jpg)
Complexity
![Page 62: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/62.jpg)
Performance
![Page 63: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/63.jpg)
Write once, run for every use case
![Page 64: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/64.jpg)
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
![Page 65: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/65.jpg)
Spark bridges people together
![Page 66: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/66.jpg)
Data Analyst
Data Engineer Data Scientist
![Page 67: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/67.jpg)
Bridge people together
• Scala: Engineer
• Java: Engineer
• Python: Data Scientist, Engineer
• R: Data Scientist, Data Analyst
• SQL: Data Analyst
![Page 68: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/68.jpg)
Yahoo EC team
Data Platform
Filtered Data (HDFS)
Data Mart (Oracle)
ML Model (Spark)
BI Report (MSTR)
Traffic Data
Transaction Data
Shark
![Page 69: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/69.jpg)
Data Analyst
![Page 70: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/70.jpg)
Data Analyst
• Select tweet from tweets_data where similarity(tweet, "FIFA") > 0.01
• http://youtu.be/lO7LhVZrNwA?list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr
350 TB data
Machine Learning
https://www.youtube.com/watch?v=lO7LhVZrNwA&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr#t=2900
![Page 71: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/71.jpg)
Data Scientist
http://goo.gl/q5CAx8 http://research.janelia.org/zebrafish/
![Page 72: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/72.jpg)
SQL (Data Analyst)
Cloud Computing
(Data Engineer)
Machine Learning (Data Scientist)
Spark
![Page 73: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/73.jpg)
Databricks Cloud DEMO
![Page 74: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/74.jpg)
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
![Page 76: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/76.jpg)
BI (SparkSQL)
Streaming (SparkStreaming)
Machine Learning (MLlib)
Spark
![Page 77: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/77.jpg)
Background Knowledge
• Real-time tweet data is stored in a SQL database
• Spark MLlib uses Wikipedia data to train a TF-IDF model
• SparkSQL selects tweets and filters them with the TF-IDF model
• A live BI report is generated
![Page 78: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/78.jpg)
Code
val wiki = sql("select text from wiki")
val model = new TFIDF()
model.train(wiki)
registerFunction("similarity", model.similarity _)
select tweet from tweet where similarity(tweet, "$search") > 0.01
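The demo's `TFIDF` class is not shown in the talk, so the following is only a guess at the kind of computation behind `similarity`: a minimal TF-IDF model with cosine similarity in plain Python. `TinyTfIdf`, `tokenize`, the smoothed IDF formula, and the sample texts are all assumptions made for the illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real tokenizer would do more."""
    return text.lower().split()


class TinyTfIdf:
    """Hypothetical stand-in for the demo's TFIDF model."""

    def train(self, documents):
        n = len(documents)
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(tokenize(doc)))  # count each word once per document
        # Smoothed IDF: rare words weigh more, unseen words get the largest weight.
        self.idf = {w: math.log((1 + n) / (1 + df)) + 1 for w, df in doc_freq.items()}
        self.unseen_idf = math.log(1 + n) + 1
        return self

    def vector(self, text):
        tf = Counter(tokenize(text))
        return {w: c * self.idf.get(w, self.unseen_idf) for w, c in tf.items()}

    def similarity(self, text_a, text_b):
        """Cosine similarity between the TF-IDF vectors of two texts."""
        va, vb = self.vector(text_a), self.vector(text_b)
        dot = sum(weight * vb[w] for w, weight in va.items() if w in vb)
        norm_a = math.sqrt(sum(x * x for x in va.values()))
        norm_b = math.sqrt(sum(x * x for x in vb.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up training corpus and tweets.
model = TinyTfIdf().train([
    "fifa organizes the football world cup",
    "spark is a cluster computing engine",
    "pizza is a popular food",
])
tweets = ["FIFA world cup final tonight", "I love pizza"]
matches = [t for t in tweets if model.similarity(t, "FIFA") > 0.01]
```

The final list comprehension plays the role of the `where similarity(tweet, "$search") > 0.01` clause: once the function is registered, the analyst's SQL stays simple while the model does the scoring.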
![Page 79: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/79.jpg)
DEMO
http://youtu.be/dJQ5lV5Tldw?t=39m30s
![Page 80: OCF.tw's talk about "Introduction to spark"](https://reader033.fdocuments.in/reader033/viewer/2022052820/547e90bdb47959b1508b4b64/html5/thumbnails/80.jpg)
Q & A