PySpark: Next generation cloud computing engine using Python
Wisely Chen, Yahoo! Taiwan Data Team
Who am I?
• Wisely Chen ( thegiive@gmail.com )
• Sr. Engineer in the Yahoo! Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Coscup 2006, 2012, 2013, OSDC 2007, 2014, Webconf 2013, PHPConf 2012, RubyConf 2012
Taiwan Data Team
Data Highway
BI Report
Serving API
Data Mart
ETL / Forecast
Machine Learning
Agenda
• What is Spark?
• What is PySpark?
• How to write PySpark applications?
• PySpark demo
• Q&A
What is Spark?
[Diagram: the Hadoop stack: HDFS (Storage), YARN (Resource Management), MapReduce (Computing Engine), with Spark as an alternative computing engine on top of HDFS and YARN.]
• The leading candidate for “successor to MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.
• From Cloudera CTO: http://0rz.tw/y3OfM
What is Spark?
Spark is 3x~25x faster than MapReduce
From Matei’s paper: http://0rz.tw/VVqgP
[Bar charts: running time in seconds, MapReduce vs. Spark]
Logistic regression: MR 76 s, Spark 3 s
KMeans: MR 106 s, Spark 33 s
PageRank: MR 171 s, Spark 23 s
Most machine learning algorithms need iterative computing
PageRank
[Diagram: PageRank on a small graph of pages a, b, c, d over three iterations. Every page starts with rank 1.0; each iteration produces a temporary rank result (e.g. 1.85, 1.0, 0.58, 0.58 after the 1st iteration and 1.31, 1.72, 0.39, 0.58 after the 2nd) that feeds the next iteration.]
HDFS is 100x slower than memory
MapReduce: Input (HDFS) -> Iter 1 -> Tmp (HDFS) -> Iter 2 -> Tmp (HDFS) -> ... -> Iter N
Spark: Input (HDFS) -> Iter 1 -> Tmp (Mem) -> Iter 2 -> Tmp (Mem) -> ... -> Iter N
PageRank algorithm on 1 billion URL records:
1st iteration (HDFS) takes 200 sec
2nd iteration (mem) takes 7.4 sec
3rd iteration (mem) takes 7.7 sec
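Below is a minimal PageRank-style sketch (not from the slides) of why in-memory caching pays off for iterative jobs: the parsed link data is read from HDFS once, cached, and then reused from memory on every iteration. Here sc is assumed to be an existing SparkContext and the input path is a placeholder.

    from operator import add

    # Parse "url neighbor1 neighbor2 ..." lines once and keep them in memory.
    links = sc.textFile("hdfs://.../links.txt") \
              .map(lambda line: line.split()) \
              .map(lambda parts: (parts[0], parts[1:])) \
              .cache()

    ranks = links.mapValues(lambda _: 1.0)   # every page starts at rank 1.0

    for i in range(10):
        # Each pass reuses the cached links RDD instead of re-reading HDFS.
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
        ranks = contribs.reduceByKey(add).mapValues(lambda r: 0.15 + 0.85 * r)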
What is PySpark?
Spark API
• Multi-language API
• JVM: Scala, Java
• PySpark: Python
PySpark
• Processing happens in Python
  • CPython
  • Python libs (NumPy, SciPy, …)
• Data storage and transfer stay in Spark
  • HDFS access / networking / fault recovery
  • scheduling / broadcast / checkpointing
Spark Architecture
[Diagram: a Client submits work to the Master (JVM); each Worker runs a Task against its local data block (Block1, Block2, Block3).]
PySpark Architecture
[Diagram: the user's Python code drives the Master (JVM); each Worker (JVM) holds a data block (Block1, Block2, Block3) and launches a Python process (Py Proc) beside it.]
PySpark Architecture
[Diagram: the user's Python code talks to the Master (JVM) through Py4J, sockets, and the local filesystem; the Workers (JVM) each hold a data block (Block1, Block2, Block3).]
PySpark Architecture
[Diagram: the driver ships the user's Python code (Py code) from the Master (JVM) to each Worker (JVM) next to its data block (Block1, Block2, Block3).]
Python functions and closures are serialized using PiCloud's CloudPickle module.
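As a rough illustration (not part of the talk) of what this serialization step does, the snippet below pickles a Python closure with the standalone cloudpickle package (PySpark bundles its own copy of the same module) and then restores it the way a worker would:

    import pickle
    import cloudpickle   # assumption: the standalone cloudpickle package is installed

    multiplier = 3
    func = lambda x: x * multiplier        # closure capturing multiplier

    payload = cloudpickle.dumps(func)      # serialize the code plus captured variables
    restored = pickle.loads(payload)       # what a worker does with the received bytes
    print(restored(10))                    # -> 30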
PySpark Architecture
[Diagram: each Worker (JVM) holds a data block (Block1, Block2, Block3) and runs a Python process (Py Proc) next to it.]
On launch, each worker starts Python subprocesses and communicates with them through pipes, sending the user's code and the data to be processed.
The result is a lot of Python processes.
How to write a PySpark application?
Python Word Count
• file = spark.textFile("hdfs://...")
• counts = file.flatMap(lambda line: line.split(" ")) \
• .map(lambda word: (word, 1)) \
• .reduceByKey(lambda a, b: a + b)
• counts.saveAsTextFile("hdfs://...")
Access data via Spark API
Process via Python
Python Word Count
• counts = file.flatMap(lambda line: line.split(" "))
Original text:
  You can find the latest Spark documentation, including the guide
List:
  ['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']
Python Word Count
• .map(lambda word: (word, 1))
List:
  ['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']
Tuple list:
  [('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]
Python Word Count
• .reduceByKey(lambda a, b: a + b)
Tuple list:
  [('You', 1), ('can', 1), ('find', 1), ('the', 1), ..., ('the', 1), ('guide', 1)]
Reduced tuple list:
  [('You', 1), ('can', 1), ('find', 1), ('the', 2), ..., ('guide', 1)]
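Putting the pieces above together, a complete word count looks roughly like the standalone script below (a sketch: the HDFS paths are placeholders, the app name is made up, and the context is created explicitly here and named sc):

    from pyspark import SparkContext

    sc = SparkContext(appName="PythonWordCount")

    file = sc.textFile("hdfs://...")                      # read the input text
    counts = (file.flatMap(lambda line: line.split(" "))  # line -> words
                  .map(lambda word: (word, 1))            # word -> (word, 1)
                  .reduceByKey(lambda a, b: a + b))       # sum the counts per word
    counts.saveAsTextFile("hdfs://...")                   # write (word, count) pairs

    sc.stop()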
Can I use Python ML libraries on PySpark?
PySpark + scikit-learn
• sgd = lm.SGDClassifier(loss='log')
• for ii in range(ITERATIONS):
•     sgd = sc.parallelize(…) \
•         .mapPartitions(lambda x: …) \
•         .reduce(lambda x, y: merge(x, y))
Use scikit-learn in single mode (on the master)
Cluster operation
Use scikit-learn functions in cluster mode, each task dealing with partial data
Source code from: http://0rz.tw/o2CHT
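A hedged sketch of the pattern shown above: each partition fits a local scikit-learn SGDClassifier on its share of the data, and the partial models are merged on the driver by averaging their parameters. The points variable, the feature layout, and the merge() helper are illustrative assumptions, not from the original source.

    import numpy as np
    from sklearn import linear_model as lm

    def fit_partition(iterator):
        # Fit a local model on the (features, label) pairs of one partition.
        X, y = [], []
        for features, label in iterator:
            X.append(features)
            y.append(label)
        sgd = lm.SGDClassifier(loss='log')
        sgd.partial_fit(np.array(X), np.array(y), classes=np.array([0, 1]))
        yield sgd

    def merge(left, right):
        # Naive merge: average the two partial models' parameters.
        left.coef_ = (left.coef_ + right.coef_) / 2.0
        left.intercept_ = (left.intercept_ + right.intercept_) / 2.0
        return left

    # points is assumed to be a list of (feature_vector, label) pairs;
    # sc is an existing SparkContext.
    model = sc.parallelize(points, 8) \
              .mapPartitions(fit_partition) \
              .reduce(merge)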
PySpark supports MLlib
• MLlib is Spark's built-in machine learning library
• Example: KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")
• Check it out on http://0rz.tw/M35Rz
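Expanded into a runnable snippet, the MLlib call above might look like this (a sketch: the input path and the space-separated numeric file format are assumptions; sc is an existing SparkContext):

    from numpy import array
    from pyspark.mllib.clustering import KMeans

    # Each input line is assumed to hold space-separated numeric features.
    data = sc.textFile("hdfs://.../kmeans_data.txt")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    # Train a 2-cluster model, matching the call on the slide.
    clusters = KMeans.train(parsedData, 2, maxIterations=10,
                            runs=30, initializationMode="random")
    print(clusters.predict(parsedData.first()))   # cluster index of the first point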
DEMO 1: Recommendation using ALS (Data: MovieLens)
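For reference, a minimal version of such an ALS demo with MLlib might look like the sketch below (the ratings path, the userId::movieId::rating::timestamp line format, and the chosen rank and iteration count are assumptions, not taken from the demo itself):

    from pyspark.mllib.recommendation import ALS, Rating

    raw = sc.textFile("hdfs://.../ratings.dat")
    ratings = raw.map(lambda line: line.split("::")) \
                 .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

    model = ALS.train(ratings, rank=10, iterations=10)  # factorize the user x movie rating matrix
    print(model.predict(1, 100))   # predicted rating of movie 100 for user 1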
DEMO 2: Interactive Shell
Conclusion
Join Us
• Our team's work has been highlighted at top conferences worldwide
• Hadoop Summit San Jose 2013
• Hadoop Summit Amsterdam 2014
• MSTR World Las Vegas 2014
• SparkSummit San Francisco 2014
• Jenkins Conf Palo Alto 2013
Thank you