PySpark: Next generation cloud computing engine using Python
Wisely Chen, Yahoo! Taiwan Data Team
Who am I?
• Wisely Chen ( thegiive@gmail.com )
• Sr. Engineer in the Yahoo! Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Coscup 2006, 2012, 2013, OSDC 2007, 2014, Webconf 2013, PHPConf 2012, RubyConf 2012
Taiwan Data Team
Data Highway
BI Report
Serving API
Data Mart
ETL / Forecast
Machine Learning
Agenda
• What is Spark?
• What is PySpark?
• How to write PySpark applications?
• PySpark demo
• Q&A
What is Spark?
[Diagram: the Hadoop stack: HDFS (Storage), YARN (Resource Management), MapReduce (Computing Engine), with Spark as an alternative computing engine on top of HDFS and YARN.]
• The leading candidate for “successor to MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.
• From Cloudera CTO: http://0rz.tw/y3OfM
What is Spark?
Spark is 3x~25x faster than MapReduce
From Matei’s paper: http://0rz.tw/VVqgP
[Bar charts: running time in seconds, MapReduce vs. Spark]
Logistic regression: MR 76 s, Spark 3 s
KMeans: MR 106 s, Spark 33 s
PageRank: MR 171 s, Spark 23 s
Most machine learning algorithms need iterative computing
PageRank
[Diagram: PageRank on a small graph of pages a, b, c, d over three iterations. Every page starts with rank 1.0; each iteration produces a temporary rank result (e.g. 1.85, 1.0, 0.58, 0.58 after the 1st iteration and 1.31, 1.72, 0.39, 0.58 after the 2nd) that feeds the next iteration.]
HDFS is 100x slower than memory
MapReduce: Input (HDFS) -> Iter 1 -> Tmp (HDFS) -> Iter 2 -> Tmp (HDFS) -> ... -> Iter N
Spark: Input (HDFS) -> Iter 1 -> Tmp (Mem) -> Iter 2 -> Tmp (Mem) -> ... -> Iter N
PageRank algorithm on 1 billion URL records:
1st iteration (HDFS) takes 200 sec
2nd iteration (mem) takes 7.4 sec
3rd iteration (mem) takes 7.7 sec
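Below is a minimal PageRank-style sketch (not from the slides) of why in-memory caching pays off for iterative jobs: the parsed link data is read from HDFS once, cached, and then reused from memory on every iteration. Here sc is assumed to be an existing SparkContext and the input path is a placeholder.

    from operator import add

    # Parse "url neighbor1 neighbor2 ..." lines once and keep them in memory.
    links = sc.textFile("hdfs://.../links.txt") \
              .map(lambda line: line.split()) \
              .map(lambda parts: (parts[0], parts[1:])) \
              .cache()

    ranks = links.mapValues(lambda _: 1.0)   # every page starts at rank 1.0

    for i in range(10):
        # Each pass reuses the cached links RDD instead of re-reading HDFS.
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
        ranks = contribs.reduceByKey(add).mapValues(lambda r: 0.15 + 0.85 * r)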
What is PySpark?
Spark API
• Multi-language API
• JVM: Scala, Java
• PySpark: Python
PySpark
• Processing happens in Python
  • CPython
  • Python libs (NumPy, SciPy, …)
• Data storage and transfer stay in Spark
  • HDFS access / networking / fault recovery
  • scheduling / broadcast / checkpointing
Spark Architecture
[Diagram: a Client submits work to the Master (JVM); each Worker runs a Task against its local data block (Block1, Block2, Block3).]
PySpark Architecture
[Diagram: the user's Python code drives the Master (JVM); each Worker (JVM) holds a data block (Block1, Block2, Block3) and launches a Python process (Py Proc) beside it.]
PySpark Architecture
[Diagram: the user's Python code talks to the Master (JVM) through Py4J, sockets, and the local filesystem; the Workers (JVM) each hold a data block (Block1, Block2, Block3).]
PySpark Architecture
[Diagram: the driver ships the user's Python code (Py code) from the Master (JVM) to each Worker (JVM) next to its data block (Block1, Block2, Block3).]
Python functions and closures are serialized using PiCloud's CloudPickle module.
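As a rough illustration (not part of the talk) of what this serialization step does, the snippet below pickles a Python closure with the standalone cloudpickle package (PySpark bundles its own copy of the same module) and then restores it the way a worker would:

    import pickle
    import cloudpickle   # assumption: the standalone cloudpickle package is installed

    multiplier = 3
    func = lambda x: x * multiplier        # closure capturing multiplier

    payload = cloudpickle.dumps(func)      # serialize the code plus captured variables
    restored = pickle.loads(payload)       # what a worker does with the received bytes
    print(restored(10))                    # -> 30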
PySpark Architecture
[Diagram: each Worker (JVM) holds a data block (Block1, Block2, Block3) and runs a Python process (Py Proc) next to it.]
On launch, each worker starts Python subprocesses and communicates with them through pipes, sending the user's code and the data to be processed.
The result is a lot of Python processes.
How to write a PySpark application?
Python Word Count
• file = spark.textFile("hdfs://...")
• counts = file.flatMap(lambda line: line.split(" ")) \
• .map(lambda word: (word, 1)) \
• .reduceByKey(lambda a, b: a + b)
• counts.saveAsTextFile("hdfs://...")
Access data via Spark API
Process via Python
Python Word Count
• counts = file.flatMap(lambda line: line.split(" "))
Original text:
  You can find the latest Spark documentation, including the guide
List:
  ['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']
Python Word Count
• .map(lambda word: (word, 1))
List:
  ['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']
Tuple list:
  [('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]
Python Word Count
• .reduceByKey(lambda a, b: a + b)
Tuple list:
  [('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]
Reduced tuple list:
  [('You', 1), ('can', 1), ('find', 1), ('the', 2), ..., ('guide', 1)]
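Putting the pieces above together, a complete word count looks roughly like the standalone script below (a sketch: the HDFS paths are placeholders, the app name is made up, and the context is created explicitly here and named sc):

    from pyspark import SparkContext

    sc = SparkContext(appName="PythonWordCount")

    file = sc.textFile("hdfs://...")                      # read the input text
    counts = (file.flatMap(lambda line: line.split(" "))  # line -> words
                  .map(lambda word: (word, 1))            # word -> (word, 1)
                  .reduceByKey(lambda a, b: a + b))       # sum the counts per word
    counts.saveAsTextFile("hdfs://...")                   # write (word, count) pairs

    sc.stop()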
Can I use Python ML libraries on PySpark?
PySpark + scikit-learn
• sgd = lm.SGDClassifier(loss='log')
• for ii in range(ITERATIONS):
•     sgd = sc.parallelize(…) \
•         .mapPartitions(lambda x: …) \
•         .reduce(lambda x, y: merge(x, y))
Use scikit-learn in single mode (on the master)
Cluster operation
Use scikit-learn functions in cluster mode, each task dealing with partial data
Source code from: http://0rz.tw/o2CHT
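A hedged sketch of the pattern shown above: each partition fits a local scikit-learn SGDClassifier on its share of the data, and the partial models are merged on the driver by averaging their parameters. The points variable, the feature layout, and the merge() helper are illustrative assumptions, not from the original source.

    import numpy as np
    from sklearn import linear_model as lm

    def fit_partition(iterator):
        # Fit a local model on the (features, label) pairs of one partition.
        X, y = [], []
        for features, label in iterator:
            X.append(features)
            y.append(label)
        sgd = lm.SGDClassifier(loss='log')
        sgd.partial_fit(np.array(X), np.array(y), classes=np.array([0, 1]))
        yield sgd

    def merge(left, right):
        # Naive merge: average the two partial models' parameters.
        left.coef_ = (left.coef_ + right.coef_) / 2.0
        left.intercept_ = (left.intercept_ + right.intercept_) / 2.0
        return left

    # points is assumed to be a list of (feature_vector, label) pairs;
    # sc is an existing SparkContext.
    model = sc.parallelize(points, 8) \
              .mapPartitions(fit_partition) \
              .reduce(merge)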
PySpark supports MLlib
• MLlib is Spark's built-in machine learning library
• Example: KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")
• Check it out on http://0rz.tw/M35Rz
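Expanded into a runnable snippet, the MLlib call above might look like this (a sketch: the input path and the space-separated numeric file format are assumptions; sc is an existing SparkContext):

    from numpy import array
    from pyspark.mllib.clustering import KMeans

    # Each input line is assumed to hold space-separated numeric features.
    data = sc.textFile("hdfs://.../kmeans_data.txt")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # Train a 2-cluster model, matching the call on the slide.
    clusters = KMeans.train(parsedData, 2, maxIterations=10,
                            runs=30, initializationMode="random")
    print(clusters.predict(parsedData.first()))   # cluster index of the first point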
DEMO 1: Recommendation using ALS (Data: MovieLens)
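For reference, a minimal version of such an ALS demo with MLlib might look like the sketch below (the ratings path, the userId::movieId::rating::timestamp line format, and the chosen rank and iteration count are assumptions, not taken from the demo itself):

    from pyspark.mllib.recommendation import ALS, Rating

    raw = sc.textFile("hdfs://.../ratings.dat")
    ratings = raw.map(lambda line: line.split("::")) \
                 .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

    model = ALS.train(ratings, rank=10, iterations=10)  # factorize the user x movie rating matrix
    print(model.predict(1, 100))   # predicted rating of movie 100 for user 1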
DEMO 2: Interactive Shell
Conclusion
Join Us
• Our team's work has been highlighted at top conferences worldwide
• Hadoop Summit San Jose 2013
• Hadoop Summit Amsterdam 2014
• MSTR World Las Vegas 2014
• SparkSummit San Francisco 2014
• Jenkins Conf Palo Alto 2013
Thank you