Pandas/Data Analysis at Baypiggies

Python Pandas Lessons Learned in Performance and Design

description

Presented at BayPiggies by Chang She and Andy Hayden.

pandas is used by many people to make their lives easier when analyzing data. This talk is centered around how the overarching goal of user productivity has driven the balance of API development and performance optimization. We will cover some pandas basics. We'll talk about pandas performance. And we'll discuss data structures and algorithms. Along the way, we'll cover best practices and tools useful for developing open source projects.

Chang She is the CTO/co-founder of DataPad. A pythonista and recovering financial quant, Chang was a core contributor to pandas prior to co-founding DataPad. Chang is passionate about creating better data tools to make knowledge workers more productive.

Andy is a core contributor to pandas and holds the dubious accolade of having answered the most pandas-related questions on Stack Overflow. Andy is an analyst and software engineer from the UK, turned Data Scientist in CA, and is enthusiastic about making data tools easy.

IPython notebooks available here:
https://www.wakari.io/sharing/bundle/hayd/baypiggies
https://www.wakari.io/sharing/bundle/hayd/vbench
https://www.wakari.io/sharing/bundle/hayd/pandorable

Transcript of Pandas/Data Analysis at Baypiggies

Page 1: Pandas/Data Analysis at Baypiggies

Python Pandas Lessons Learned in Performance and Design

Page 2: Pandas/Data Analysis at Baypiggies

Who we are

Chang She - CTO/Cofounder @ DataPad, core pandas contributor, recovering financial quant. Follow me on twitter: @changhiskhan

Andy Hayden - core pandas contributor, analyst and software engineer from the UK turned Data Scientist in CA, avid data tool maker

Page 3: Pandas/Data Analysis at Baypiggies

What are we talking about

- Why pandas?

- What’s cool about pandas?

- How do we improve and track performance?

- A few data structures and algorithms

- Bad idioms and how to fix them

Page 4: Pandas/Data Analysis at Baypiggies
Page 5: Pandas/Data Analysis at Baypiggies

What is it?

- Python library for analyzing real world data

- Created by Wes McKinney, now led by Jeff Reback

- Supported on all platforms

- Supports Python 3.4 as of latest version

- Big and active community

Page 6: Pandas/Data Analysis at Baypiggies

Pandas Highlights

- Labelled data and automatic alignment

- Easy data integration

- Flexible slicing and dicing of data

- Analytics made to fit your brain, not vice versa (I’m looking at you SQL)

USER PRODUCTIVITY
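
As an illustration of the "labelled data and automatic alignment" highlight, here is a minimal sketch. The data and variable names are made up for this example and are not taken from the talk's notebooks.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on labels, not on position. A label that is
# missing from either operand yields NaN in the result.
total = a + b
print(total)
```

The label "x" exists only in `a`, so its entry in `total` is missing, while "y" and "z" are added label by label.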

Page 7: Pandas/Data Analysis at Baypiggies

Productivity via better workflow

- Single tool to minimize cognitive dissonance

- Iterative and not linear workflow

- Performant enough for interactive work

Page 8: Pandas/Data Analysis at Baypiggies

Pandas basics

(notebook)

Page 9: Pandas/Data Analysis at Baypiggies

Priorities

- Build the right abstractions

- Get the API right

- Then optimize for performance

Page 10: Pandas/Data Analysis at Baypiggies

Open source APIs

- Sometimes you can’t be all things to all people

- You can only add to an API, rarely change it, and never get rid of parts of it

- Documentation Documentation Documentation

Page 11: Pandas/Data Analysis at Baypiggies

An example

- DataFrame started life as essentially a dict of Series

- There was also DataMatrix

- Unified under DataFrame via combining homogeneous blocks. Performant and single API
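
The "dict of Series" view is still visible in the public API today. The sketch below uses made-up column names and data; it shows only the user-facing behaviour, not the internal block layout mentioned on the slide.

```python
import pandas as pd

# A plain dict mapping column names to Series (illustrative data).
columns = {
    "price": pd.Series([9.5, 11.0], index=["a", "b"]),
    "qty": pd.Series([3, 4, 5], index=["a", "b", "c"]),
}

# The constructor aligns the Series on the union of their labels.
df = pd.DataFrame(columns)
print(df)

# Each column keeps its own dtype and comes back out as a Series.
print(df.dtypes)
print(type(df["qty"]))
```

Columns of the same dtype are the natural candidates for being stored together, which is roughly the idea behind "combining homogeneous blocks".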

Page 12: Pandas/Data Analysis at Baypiggies

Optimization

- Push slow code paths into cython or directly into C

- Try to be smart about minimizing cache misses and not creating unnecessary copies

- Careful with NAs
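
One reason to be careful with NAs is that introducing a missing value can silently change a column's dtype. The following is a small sketch with made-up data, assuming the default NumPy-backed integer dtype rather than a nullable extension dtype.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
print(ints.dtype)  # an integer dtype

# Reindexing adds label 3, which has no value. NaN is a float, so the
# whole Series is upcast to float64 and the data is copied.
widened = ints.reindex([0, 1, 2, 3])
print(widened.dtype)

# Reductions skip missing values by default; skipna=False propagates them.
print(widened.sum())
print(widened.sum(skipna=False))
```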

Page 13: Pandas/Data Analysis at Baypiggies

Tracking Performance (vbench)

Page 14: Pandas/Data Analysis at Baypiggies

what to track?

Use vbench to track everything we care about (read: whatever users have complained is slow).

There are unofficial vbench repos for numpy and scikit.

(look)

Page 15: Pandas/Data Analysis at Baypiggies

why

Once users are using your API, they’ll notice performance changes (“it feels slower”).

Then they run timeit and have a legitimate grievance. We want to automate this process and catch regressions before users get upset.
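
The manual version of that check can be sketched with the standard library's `timeit`. This is not vbench's API. The helper `best_time`, the baseline value, and the tolerance are all hypothetical names and numbers chosen for illustration.

```python
import timeit

import numpy as np
import pandas as pd


def best_time(func, number=20, repeat=5):
    """Return the best per-call time in seconds for a zero-argument callable."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


# Illustrative workload: a grouped sum over 10,000 rows.
df = pd.DataFrame({"key": np.arange(10_000) % 10, "val": np.arange(10_000.0)})
elapsed = best_time(lambda: df.groupby("key")["val"].sum())

# Hypothetical baseline. In practice this would come from timing an
# earlier commit on the same machine.
BASELINE_SECONDS = 1.0
TOLERANCE = 1.5

is_regression = elapsed > BASELINE_SECONDS * TOLERANCE
print(f"{elapsed:.6f}s per call, regression: {is_regression}")
```

Running a check like this for every benchmark on every commit is the part worth automating.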

Page 16: Pandas/Data Analysis at Baypiggies

how

(notebook)

Page 17: Pandas/Data Analysis at Baypiggies

Pandorable pandas

(notebook)
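
The notebook's contents are not reproduced in this transcript. As one commonly cited example of a bad idiom and its fix (an assumption, not necessarily one the speakers covered), compare a row-by-row loop with a vectorized column expression on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"price": [9.5, 11.0, 4.25], "qty": [3, 4, 5]})

# Slower idiom: a Python-level loop that builds a Series for every row.
totals_loop = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Usually preferable: one vectorized expression over whole columns.
totals_vec = df["price"] * df["qty"]

print(totals_loop)
print(totals_vec.tolist())
```

Both give the same numbers; the vectorized form avoids per-row overhead and reads closer to the intent.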

Page 18: Pandas/Data Analysis at Baypiggies

The End