Sparkling pandas Letting Pandas Roam - PyData Seattle 2015


Transcript of Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Page 1: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Sparkling Pandas: Scaling Pandas beyond a single machine
(or letting Pandas Roam)

With special thanks to Juliet Hougland :)


Page 3: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Who am I?

Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Engineer at Alpine Data Labs
  ○ previously Databricks, Google, Foursquare, Amazon
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau

Page 4: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

What is Pandas?

user_id   panda_type
01234     giant
12345     red
23456     giant
34567     giant
45678     red
56789     giant

● DataFrames - indexed, tabular data structures
● Easy slicing, indexing, subsetting/filtering
● Excellent support for time series data
● Data alignment and reshaping

http://pandas.pydata.org/
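
A minimal pandas sketch of those features, using the table above:

import pandas as pd

df = pd.DataFrame(
    {"user_id": ["01234", "12345", "23456", "34567", "45678", "56789"],
     "panda_type": ["giant", "red", "giant", "giant", "red", "giant"]}
).set_index("user_id")                      # indexed, tabular data structure
giants = df[df["panda_type"] == "giant"]    # easy subsetting/filtering
print(giants)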

Page 5: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

What is Spark?

Fast, general engine for in-memory data processing.

tl;dr - 100x faster than Hadoop MapReduce*

Page 6: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

The different pieces of Spark

Apache Spark
● SQL & DataFrames
● Streaming
● MLlib & Spark ML
● Graph tools: GraphX & Bagel
● Language APIs: Scala, Java, Python, & R
● Community packages

Page 7: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Some Spark terms

● Spark Context (aka sc): the window to the world of Spark
● sqlContext: the window to the world of DataFrames
● Transformation: takes an RDD (or DataFrame) and returns a new RDD or DataFrame
● Action: causes an RDD to be evaluated (often storing the result) - see the sketch below
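
A minimal PySpark sketch of those terms (assumes a local Spark install; the data is made up for illustration):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "terms-demo")    # Spark Context (aka sc)
sqlContext = SQLContext(sc)                    # the window to DataFrames

numbers = sc.parallelize(range(10))            # an RDD
evens = numbers.filter(lambda x: x % 2 == 0)   # transformation: returns a new RDD, nothing runs yet
print(evens.count())                           # action: forces evaluation (prints 5)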

Page 8: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

DataFrames between Spark & Pandas

Spark
● Fast
● Distributed
● Limited API
● Some ML
● I/O options
● Not indexed

Pandas
● Fast
● Single machine
● Full-featured API
● Integration with ML
● Different I/O options
● Indexed
● Easy to visualize

Page 9: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Panda image by Peter Beardsley

Page 10: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Simple Spark SQL Example

input = sqlContext.jsonFile(inputFile)
input.registerTempTable("tweets")
topTweets = sqlContext.sql("SELECT text, retweetCount " +
                           "FROM tweets ORDER BY retweetCount LIMIT 10")
local = topTweets.collect()

Page 11: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Convert a Spark DataFrame to Pandas

import pandas
...
ddf = sqlContext.read.json("hdfs://...")
# Some Spark transformations
transformedDdf = ddf.filter(ddf['age'] > 21)
return transformedDdf.toPandas()

Page 12: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Convert a Pandas DataFrame to Spark

import pandas
...
df = pandas.DataFrame(...)
...
ddf = sqlContext.createDataFrame(df)

Page 13: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Let’s combine the two

● Spark DataFrames already provide some of what we need
  ○ Add UDFs / UDAFs
  ○ Use bits of Pandas code (see the sketch below)
● http://spark-packages.org - an excellent place to get libraries
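
As noted above, a hedged sketch of "use bits of Pandas code" on top of PySpark - running ordinary Pandas per partition of an RDD (the rdd variable and its (user_id, panda_type) rows are assumptions for illustration):

import pandas as pd

def summarize_partition(rows):
    # build a local Pandas DataFrame from one partition, then use normal Pandas
    pdf = pd.DataFrame(list(rows), columns=["user_id", "panda_type"])
    yield pdf["panda_type"].value_counts().to_dict()

# rdd is assumed to hold (user_id, panda_type) tuples
per_partition_counts = rdd.mapPartitions(summarize_partition).collect()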

Page 14: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

So where does the PB&J go?

[Architecture diagram: the Sparkling Pandas API sits on top of Spark DataFrames (with custom UDFs and Sparkling Pandas Scala code) and PySpark RDDs, mixing in Pandas code and internal state.]

Page 15: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Extending Spark - adding index support

def collect(self):
    """Collect the elements in a DataFrame and concatenate the partitions."""
    df = self._schema_rdd.toPandas()
    df = _update_index_on_df(df, self._index_names)
    return df

Page 16: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Extending Spark - adding index support

def _update_index_on_df(df, index_names):
    if index_names:
        df = df.set_index(index_names)
        # Remove names from unnamed indexes
        index_names = _denormalize_names(index_names)
        df.index.names = index_names
    return df
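
A hedged sketch of the idea with illustrative data: after toPandas() the index columns come back as ordinary columns, so set_index restores them.

import pandas as pd

# what a collected DataFrame looks like before the index is restored
df = pd.DataFrame({"user_id": ["01234", "12345"],
                   "panda_type": ["giant", "red"]})
df = df.set_index(["user_id"])   # the step _update_index_on_df performs
print(df.loc["01234"])           # index-based lookup works again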

Page 17: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Adding a UDF in Python

from pyspark.sql.types import IntegerType

sqlContext.registerFunction("strLenPython", lambda x: len(x), IntegerType())
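
Once registered, the UDF can be called from SQL like a built-in - a hedged usage sketch against the "tweets" table from the earlier example (the text_len alias is just for illustration):

topLengths = sqlContext.sql(
    "SELECT text, strLenPython(text) AS text_len FROM tweets LIMIT 10")
topLengths.show()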

Page 18: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Extending Spark SQL w/Scala for fun & profit

// functions we want to be callable from python
object functions {
  def kurtosis(e: Column): Column =
    new Column(Kurtosis(EvilSqlTools.getExpr(e)))

  def registerUdfs(sqlCtx: SQLContext): Unit = {
    sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
  }
}

Page 19: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Extending Spark SQL w/Scala for fun & profit

def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        f = getattr(sc._jvm.com.sparklingpandas.functions, name)
        jc = f(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    return _

_functions = {
    'kurtosis': 'Calculate the kurtosis, maybe!',
}
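
A hedged sketch of how a dict like _functions is typically turned into module-level wrappers (this mirrors the pattern in PySpark's own functions.py, not necessarily the exact Sparkling Pandas source):

# generate a Python wrapper for every Scala-backed function listed in _functions
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

# after which the Scala code is callable from Python, e.g.:
# df.select(kurtosis(df["retweetCount"]))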

Page 20: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Simple graphing with Sparkling Pandas

import matplotlib.pyplot as plt

plot = speaker_pronouns["pronoun"].plot()
plot.get_figure().savefig("/tmp/fig")

Not yet merged in

Page 21: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Why is SparklingPandas fast*?
● Keep stuff in the JVM as much as possible
● Lazy operations
● Distributed

*For really flexible versions of the word "fast"

Coffee image by eltpics; panda images by Stéfan and cactusroot

Page 22: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Supported operations:

DataFrames: to_spark_sql, applymap, groupby, collect, stats, query, axes, ftype, dtype

Context: simple, read_csv, from_data_frame, parquetFile, read_json, stop

GroupBy: groups, indices, first, median, mean, sum, aggregate
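
A hedged end-to-end sketch stitching a few of these operations together, assuming Sparkling Pandas' PSparkContext entry point (the import path, constructor, and file path are assumptions; exact details may differ between versions):

from sparklingpandas.pcontext import PSparkContext   # assumed import path

psc = PSparkContext.simple()                  # Context: "simple" constructor (assumed signature)
ddf = psc.read_csv("hdfs://.../pandas.csv")   # Context: read_csv (illustrative path)
grouped = ddf.groupby("panda_type")           # DataFrames: groupby -> GroupBy
print(grouped.sum())                          # GroupBy: sum
local = ddf.collect()                         # DataFrames: collect -> local Pandas DataFrame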

Page 23: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Always onwards and upwards

[Chart: "work done" over "time" - where we are now vs. a hypothetical, wonderful future.]

Page 25: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Using Sparkling Pandas

You can get Sparkling Pandas from:
● Website: http://www.sparklingpandas.com
● Code: https://github.com/sparklingpandas/sparklingpandas
● Mailing list: https://groups.google.com/d/forum/sparklingpandas

Page 26: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

Getting Sparkling Pandas friends

The examples from this talk will get merged into master.
● Pandas: http://pandas.pydata.org/ (or pip)
● Spark: http://spark.apache.org/

Page 27: Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

many pandas by David Goehring

Any questions?