pandas: a Foundational Python Library for Data Analysis and Statistics

Wes McKinney (@wesmckinn), PyHPC 2011, 18 November 2011

description

PyHPC2011 at SC11 in Seattle

Transcript of pandas: a Foundational Python Library for Data Analysis and Statistics

Page 1: pandas: a Foundational Python Library for Data Analysis and Statistics

pandas: a Foundational Python Library for Data Analysis and Statistics

Wes McKinney

PyHPC 2011, 18 November 2011

Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 1 / 25

Page 2: pandas: a Foundational Python Library for Data Analysis and Statistics

An alternate title

High Performance Structured Data Manipulation in Python


Page 3: pandas: a Foundational Python Library for Data Analysis and Statistics

My background

Former quant hacker at AQR Capital, now entrepreneur

Background: math, statistics, computer science, quant finance. Shaken, not stirred

Active in scientific Python community

My blog: http://blog.wesmckinney.com

Twitter: @wesmckinn

Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly


Page 4: pandas: a Foundational Python Library for Data Analysis and Statistics

Structured data

   cname      year  agefrom  ageto  ls    lsc   pop  ccode
0  Australia  1950  15       19     64.3  15.4  558  AUS
1  Australia  1950  20       24     48.4  26.4  645  AUS
2  Australia  1950  25       29     47.9  26.2  681  AUS
3  Australia  1950  30       34     44    23.8  614  AUS
4  Australia  1950  35       39     42.1  21.9  625  AUS
5  Australia  1950  40       44     38.9  20.1  555  AUS
6  Australia  1950  45       49     34    16.9  491  AUS
7  Australia  1950  50       54     29.6  14.6  439  AUS
8  Australia  1950  55       59     28    12.9  408  AUS
9  Australia  1950  60       64     26.3  12.1  356  AUS


Page 5: pandas: a Foundational Python Library for Data Analysis and Statistics

Structured data

A familiar data model

Heterogeneous columns or hyperslabs

Each column/hyperslab is homogeneously typed

Relational databases (SQL, etc.) are just a special case

Need good performance in row- and column-oriented operations

Support for axis metadata

Data alignment is critical

Seamless integration with Python data structures and NumPy
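A minimal sketch of this data model in present-day pandas, using the first three rows of the table on the previous slide. The variable names (`df`, `kinds`, `pop_total`, `first_row`) are illustrative and not from the talk.

```python
import pandas as pd

# First three rows of the table shown on the previous slide.
df = pd.DataFrame({
    "cname": ["Australia", "Australia", "Australia"],
    "year": [1950, 1950, 1950],
    "agefrom": [15, 20, 25],
    "ageto": [19, 24, 29],
    "ls": [64.3, 48.4, 47.9],
    "lsc": [15.4, 26.4, 26.2],
    "pop": [558, 645, 681],
    "ccode": ["AUS", "AUS", "AUS"],
})

# Each column is homogeneously typed; the table as a whole is heterogeneous.
kinds = {name: dtype.kind for name, dtype in df.dtypes.items()}

# Column-oriented access returns one typed column.
pop_total = df["pop"].sum()

# Row-oriented access returns one record spanning all columns.
first_row = df.iloc[0]
```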


Page 6: pandas: a Foundational Python Library for Data Analysis and Statistics

Structured data challenges

Table modification: column insertion/deletion

Axis indexing and data alignment

Aggregation and transformation by group (“group by”)

Missing data handling

Pivoting and reshaping

Merging and joining

Time series-specific manipulations

Fast IO: flat files, databases, HDF5, ...
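Two of these challenges, table modification and missing data handling, can be illustrated with a short sketch in present-day pandas. The tiny frame and its column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in column "a".
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10, 20, 30]})

df["c"] = df["a"] * 2      # column insertion
del df["b"]                # column deletion

n_missing = int(df["a"].isna().sum())   # count missing entries
filled = df.fillna(0.0)                 # replace missing entries with 0.0
```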


Page 7: pandas: a Foundational Python Library for Data Analysis and Statistics

Not all fun and games

We care nearly equally about

Performance

Ease-of-use (syntax / API fits your mental model)

Expressiveness

Clean, consistent API design is hard and underappreciated


Page 8: pandas: a Foundational Python Library for Data Analysis and Statistics

The big picture

Build a foundation for data analysis and statistical computing

Craft the most expressive / flexible in-memory data manipulation tool in any language

Preferably also one of the fastest

Vastly simplify the data preparation, munging, and integration process

Comfortable abstractions: master data-fu without needing to be a computer scientist

Later: extend API with distributed computing backend for larger-than-memory datasets


Page 9: pandas: a Foundational Python Library for Data Analysis and Statistics

pandas: a brief history

Started building in April 2008, back at AQR

Open-sourced (BSD license) mid-2009

29075 lines of Python/Cython code as of yesterday, and growing fast

Heavily tested, being used by many companies (inc. lots of financial firms) in production


Page 10: pandas: a Foundational Python Library for Data Analysis and Statistics

Cython: getting good performance

My tool of choice for writing performant code

High level access to NumPy C API internals

Buffer syntax/protocol abstracts away striding details of non-contiguous arrays, very low overhead vs. working with raw C pointers

Reduce/remove interpreter overhead associated with working with Python data structures

Interface directly with C/C++ code when necessary


Page 11: pandas: a Foundational Python Library for Data Analysis and Statistics

Axis indexing

Key pandas feature

The axis index is a data structure itself, which can be customized to support things like:

1-1 O(1) indexing with hashable Python objects

Datetime indexing for time series data

Hierarchical (multi-level) indexing

Use Python dict to support O(1) lookups and O(n) realignment ops. Can specialize to get better performance and memory usage
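The dict-based idea can be sketched with a toy class. `DictIndex` and `reindex` are hypothetical names for this illustration and do not reproduce the pandas internals; the sketch only shows why lookups are O(1) and realignment is O(n).

```python
import numpy as np

class DictIndex:
    """Toy axis index (hypothetical): maps unique hashable labels to positions."""

    def __init__(self, labels):
        self.labels = list(labels)
        self._pos = {label: i for i, label in enumerate(self.labels)}
        if len(self._pos) != len(self.labels):
            raise ValueError("labels must be unique")

    def get_loc(self, label):
        # One dict lookup: O(1) on average.
        return self._pos[label]

    def indexer_for(self, target):
        # One dict lookup per target label: O(n) overall; -1 marks "absent".
        return np.array([self._pos.get(label, -1) for label in target],
                        dtype=np.intp)


def reindex(values, index, target):
    """Realign float values from `index` onto `target`, filling gaps with NaN."""
    indexer = index.indexer_for(target)
    out = np.full(len(indexer), np.nan)
    found = indexer >= 0
    out[found] = values[indexer[found]]
    return out


index = DictIndex(["a", "b", "c"])
values = np.array([1.0, 2.0, 3.0])
aligned = reindex(values, index, ["c", "x", "a"])  # "x" is absent, so NaN
```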


Page 12: pandas: a Foundational Python Library for Data Analysis and Statistics

Axis indexing

Every axis has an index

Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data

Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data

Natural way of expressing “group by” and join-type operations

As good as, or in many cases much more integrated/flexible than, commercial or open-source alternatives to pandas/Python
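A small example of automatic alignment in present-day pandas: adding two differently-indexed Series matches values by label, and labels present in only one operand become missing. The data is made up.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "d"])

# Values are matched by label, not by position.
total = s1 + s2

# "a" and "d" each appear in only one operand, so their sums are NaN.
n_unmatched = int(total.isna().sum())
```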


Page 13: pandas: a Foundational Python Library for Data Analysis and Statistics

The trouble with Python dicts...

Python dict memory footprint can be quite large

1MM key-value pairs: something like 70 MB on a 64-bit system

Even though sizeof(PyObject*) == 8

Python dict is great, but should use a faster, threadsafe hash table for primitive C types (like 64-bit integer)

BUT: using a hash table only necessary in the general case. With monotonic indexes you don’t need one for realignment ops
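Why monotonic indexes need no hash table can be sketched as a single linear merge pass over two sorted arrays. `outer_join_sorted` is a hypothetical name, and the pure-Python loop is for illustration only; a fast version would be compiled, for example with Cython as discussed earlier.

```python
import numpy as np

def outer_join_sorted(left, right):
    """Outer-join two sorted, duplicate-free integer arrays in one merge pass.

    Returns the union of labels plus, for each input, the position of every
    union label in that input (-1 where the label is absent).
    """
    i = j = 0
    union, left_pos, right_pos = [], [], []
    while i < len(left) or j < len(right):
        # Advance whichever side holds the smaller label (both if equal).
        take_left = j == len(right) or (i < len(left) and left[i] <= right[j])
        take_right = i == len(left) or (j < len(right) and right[j] <= left[i])
        union.append(left[i] if take_left else right[j])
        left_pos.append(i if take_left else -1)
        right_pos.append(j if take_right else -1)
        i += take_left
        j += take_right
    return np.array(union), np.array(left_pos), np.array(right_pos)


left = np.array([1, 3, 5])
right = np.array([3, 4, 5, 7])
union, left_pos, right_pos = outer_join_sorted(left, right)
```

The position arrays can then be used to realign each side's values onto the union, with -1 marking slots to fill with a missing value.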


Page 14: pandas: a Foundational Python Library for Data Analysis and Statistics

Some alignment numbers

Hardware: Macbook Pro Core i7 laptop, Python 2.7.2

Outer-join 500k-length indexes chosen from 1MM elements

Dict-based with random strings: 2.2 seconds

Sorted strings: 400ms (5.5x faster)

Sorted int64: 19ms (115x faster)

Fortunately, time series data falls into this last category

Alignment ops with C primitives could be fairly easily parallelized with OpenMP in Cython


Page 15: pandas: a Foundational Python Library for Data Analysis and Statistics

DataFrame, the pandas workhorse

A 2D tabular data structure with row and column indexes

Hierarchical indexing is one way to support higher-dimensional data in a lower-dimensional structure

Simplified NumPy type system: float, int, boolean, object

Rich indexing operations, SQL-like join/merges, etc.

Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case
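A short sketch of two of these points in present-day pandas: a SQL-like left join, and a floating-point-only frame that converts to a single NumPy array. The frames are made up; the "NZL" row in particular is invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up lookup data; the "NZL" row is invented for this example.
counts = pd.DataFrame({"ccode": ["AUS", "AUS", "NZL"], "pop": [558, 645, 120]})
names = pd.DataFrame({"ccode": ["AUS", "NZL"],
                      "cname": ["Australia", "New Zealand"]})

# SQL-like left join on the shared key column.
merged = counts.merge(names, on="ccode", how="left")

# Homogeneous case: an all-float frame converts to one float64 ndarray.
floats = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["x", "y"])
as_array = floats.to_numpy()
```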


Page 16: pandas: a Foundational Python Library for Data Analysis and Statistics

DataFrame, under the hood


Page 17: pandas: a Foundational Python Library for Data Analysis and Statistics

Supporting size mutability

In order to have good row-oriented performance, need to store like-typed columns in a single ndarray

“Column” insertion: accumulate 1 × N × . . . homogeneous columns, later consolidate with other like-typed into a single block

I.e. avoid reallocate-copy or array concatenation steps as long as possible

Column deletions can be no-copy events (since ndarrays support views)
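The insert-then-consolidate idea can be sketched with a toy container. `ToyBlockStore` is a hypothetical class written for this illustration, not the actual pandas internals, and it assumes each block stores its columns as rows of a k × N array.

```python
import numpy as np

class ToyBlockStore:
    """Hypothetical sketch: like-typed columns live together in 2D blocks."""

    def __init__(self):
        # Each entry is (column_names, ndarray of shape (k, N)).
        self.blocks = []

    def insert(self, name, values):
        # Cheap insertion: wrap the new column as its own 1 x N block.
        arr = np.asarray(values)
        self.blocks.append(([name], arr.reshape(1, -1)))

    def consolidate(self):
        # Deferred copy: merge all blocks sharing a dtype into one block.
        by_dtype = {}
        for names, arr in self.blocks:
            by_dtype.setdefault(arr.dtype, []).append((names, arr))
        self.blocks = [
            ([n for names, _ in parts for n in names],
             np.vstack([arr for _, arr in parts]))
            for parts in by_dtype.values()
        ]


store = ToyBlockStore()
store.insert("ls", [64.3, 48.4, 47.9])
store.insert("pop", [558, 645, 681])
store.insert("lsc", [15.4, 26.4, 26.2])
n_before = len(store.blocks)      # three separate 1 x N blocks

store.consolidate()
n_after = len(store.blocks)       # one float block and one integer block

float_names, float_block = store.blocks[0]

# Dropping the last column of a block by slicing yields a view, not a copy.
trimmed = float_block[:-1]
is_view = np.shares_memory(trimmed, float_block)
```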


Page 18: pandas: a Foundational Python Library for Data Analysis and Statistics

Hierarchical indexing

New this year, but really should have done long ago

Natural result of multi-key groupby

An intuitive way to work with higher-dimensional data

Much less ad hoc way of expressing reshaping operations

Once you have it, things like Excel-style pivot tables just “fall out”
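A sketch of how a pivot table "falls out" in present-day pandas: a two-key groupby produces a hierarchically indexed result, and unstacking one level turns it into a table. The 1950 figures come from the earlier table; the 1955 figures are invented for the example.

```python
import pandas as pd

# 1950 values are from the earlier slide; 1955 values are invented.
df = pd.DataFrame({
    "year": [1950, 1950, 1955, 1955],
    "agefrom": [15, 20, 15, 20],
    "pop": [558, 645, 600, 700],
})

# Multi-key groupby: the result carries a two-level (hierarchical) index.
grouped = df.groupby(["year", "agefrom"])["pop"].sum()

# Moving one index level to the columns yields a pivot-table layout.
pivot = grouped.unstack("agefrom")
```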


Page 19: pandas: a Foundational Python Library for Data Analysis and Statistics

Reshaping


Page 20: pandas: a Foundational Python Library for Data Analysis and Statistics

Reshaping

In [5]: df.unstack('agefrom').stack('year')
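The slide does not show how `df` was built, so the following is a guess at a compatible layout: `agefrom` as an index level and `year` as the column axis. Under that assumption the expression swaps the two between the row and column axes. The 1955 values are invented.

```python
import pandas as pd

# Assumed layout (not shown on the slide); 1955 values are invented.
long = pd.DataFrame({
    "cname": ["Australia"] * 4,
    "year": [1950, 1950, 1955, 1955],
    "agefrom": [15, 20, 15, 20],
    "pop": [558, 645, 600, 700],
})

# Rows indexed by (cname, agefrom), one column per year.
df = long.set_index(["cname", "agefrom", "year"])["pop"].unstack("year")

# Move agefrom into the columns, then move year back into the rows.
reshaped = df.unstack("agefrom").stack("year")
```

After the call, rows are indexed by (cname, year) and there is one column per `agefrom` value.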


Page 21: pandas: a Foundational Python Library for Data Analysis and Statistics

Reshaping implementation nuances

Must deal with unbalanced group sizes / missing data

Play vectorization tricks with the NumPy C-contiguous memory layout: no Python for loops allowed

Care must be taken to handle heterogeneous and homogeneous data cases
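One such vectorization trick can be sketched as follows, assuming integer row and column codes have already been computed for every value. In a C-contiguous array the element (r, c) sits at flat offset r * n_cols + c, so a single fancy-indexed assignment scatters all values at once and leaves missing combinations as NaN. `unstack_values` is a hypothetical helper, not the pandas routine.

```python
import numpy as np

def unstack_values(row_codes, col_codes, values, n_rows, n_cols):
    """Scatter long-format values into a wide 2D array without a Python loop."""
    out = np.full(n_rows * n_cols, np.nan)
    # Flat offset of (row, col) in a C-contiguous n_rows x n_cols array.
    out[row_codes * n_cols + col_codes] = values
    return out.reshape(n_rows, n_cols)


# Unbalanced groups: row 1 has no entry for column 1.
row_codes = np.array([0, 0, 1])
col_codes = np.array([0, 1, 0])
values = np.array([558.0, 645.0, 600.0])
wide = unstack_values(row_codes, col_codes, values, n_rows=2, n_cols=2)
```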


Page 22: pandas: a Foundational Python Library for Data Analysis and Statistics

GroupBy

High level process

split data set into groups

apply function to each group (an aggregation or a transformation)

combine results intelligently into a result data structure

Can be used to emulate SQL GROUP BY operations
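A minimal split-apply-combine example in present-day pandas, showing one aggregation and one transformation on made-up data. The aggregation is roughly what `SELECT key, AVG(value) ... GROUP BY key` would return in SQL.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"],
                   "value": [1.0, 2.0, 3.0, 4.0]})

# Aggregation: one result per group.
means = df.groupby("key")["value"].mean()

# Transformation: one result per original row, aligned with the input.
group_means = df.groupby("key")["value"].transform("mean")
demeaned = df["value"] - group_means
```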


Page 23: pandas: a Foundational Python Library for Data Analysis and Statistics

GroupBy

Grouping closely related to indexing

Create correspondence between axis labels and group labels using one of:

Array of group labels (like a DataFrame column)

Python function to be applied to each axis tick

Can group by multiple keys

For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof)
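The grouping options above, sketched in present-day pandas with made-up data (the 1955 values in the last example are invented):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=["apple", "avocado", "banana", "blueberry"])

# Array of group labels, one per axis label.
by_array = s.groupby(np.array(["x", "y", "x", "y"])).sum()

# Python function applied to each axis label (here: its first letter).
by_function = s.groupby(lambda label: label[0]).sum()

# Multiple keys: the result is hierarchically indexed.
by_two = s.groupby([np.array(["x", "y", "x", "y"]),
                    np.array(["p", "p", "q", "q"])]).sum()

# Grouping by one level of a hierarchical index.
levels = pd.MultiIndex.from_tuples(
    [(1950, 15), (1950, 20), (1955, 15), (1955, 20)],
    names=["year", "agefrom"],
)
pop = pd.Series([558, 645, 600, 700], index=levels)
by_level = pop.groupby(level="year").sum()
```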


Page 24: pandas: a Foundational Python Library for Data Analysis and Statistics

GroupBy implementation challenges

Computing the group labels from arbitrary Python objects is very expensive

77ms for 1MM strings with 1K groups

107ms for 1MM strings with 10K groups

350ms for 1MM strings with 100K groups

To sort or not to sort (for iteration)?

Once you have the labels, can reorder the data set in O(n) (with a much smaller constant than computing the labels)

Roughly 35ms to reorder 1MM float64 data points given the labels

(By contrast, computing the mean of 1MM elements takes 1.4ms)

Python function call overhead is significant in cases with lots of small groups; much better (orders of magnitude speedup) to write specialized Cython routines
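The three stages discussed above can be sketched as follows. `pd.factorize` stands in for the expensive labeling step, and `group_order` is a hypothetical counting-sort helper; its pure-Python loop only illustrates the O(n) pass that would be compiled in practice. The timings on the slide are not reproduced here.

```python
import numpy as np
import pandas as pd

keys = np.array(["b", "a", "b", "c", "a"], dtype=object)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Stage 1 (expensive): hash arbitrary objects into integer group labels.
labels, uniques = pd.factorize(keys)
n_groups = len(uniques)

def group_order(labels, n_groups):
    """Counting sort: positions that make equal labels contiguous, in O(n)."""
    counts = np.bincount(labels, minlength=n_groups)
    starts = np.zeros(n_groups, dtype=np.intp)
    starts[1:] = np.cumsum(counts)[:-1]
    order = np.empty(len(labels), dtype=np.intp)
    for i, lab in enumerate(labels):
        order[starts[lab]] = i
        starts[lab] += 1
    return order

# Stage 2 (cheap given the labels): reorder so each group is contiguous.
order = group_order(labels, n_groups)
sorted_values = values[order]

# Stage 3: a specialized aggregation with no per-group Python call.
counts = np.bincount(labels, minlength=n_groups)
sums = np.bincount(labels, weights=values, minlength=n_groups)
means = sums / counts
```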


Page 25: pandas: a Foundational Python Library for Data Analysis and Statistics

Demo, time permitting
