pandas: a Foundational Python Library for Data Analysis and Statistics

Wes McKinney

PyHPC 2011, 18 November 2011

An alternate title

High Performance Structured Data Manipulation in Python


My background

Former quant hacker at AQR Capital, now entrepreneur

Background: math, statistics, computer science, quant finance. Shaken, not stirred

Active in scientific Python community

My blog: http://blog.wesmckinney.com

Twitter: @wesmckinn

Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly


Structured data

       cname  year  agefrom  ageto    ls   lsc  pop  ccode
0  Australia  1950       15     19  64.3  15.4  558    AUS
1  Australia  1950       20     24  48.4  26.4  645    AUS
2  Australia  1950       25     29  47.9  26.2  681    AUS
3  Australia  1950       30     34    44  23.8  614    AUS
4  Australia  1950       35     39  42.1  21.9  625    AUS
5  Australia  1950       40     44  38.9  20.1  555    AUS
6  Australia  1950       45     49    34  16.9  491    AUS
7  Australia  1950       50     54  29.6  14.6  439    AUS
8  Australia  1950       55     59    28  12.9  408    AUS
9  Australia  1950       60     64  26.3  12.1  356    AUS


Structured data

A familiar data model

Heterogeneous columns or hyperslabs

Each column/hyperslab is homogeneously typed

Relational databases (SQL, etc.) are just a special case

Need good performance in row- and column-oriented operations

Support for axis metadata

Data alignment is critical

Seamless integration with Python data structures and NumPy


Structured data challenges

Table modification: column insertion/deletion

Axis indexing and data alignment

Aggregation and transformation by group (“group by”)

Missing data handling

Pivoting and reshaping

Merging and joining

Time series-specific manipulations

Fast IO: flat files, databases, HDF5, ...


Not all fun and games

We care nearly equally about

Performance

Ease-of-use (syntax / API fits your mental model)

Expressiveness

Clean, consistent API design is hard and underappreciated


The big picture

Build a foundation for data analysis and statistical computing

Craft the most expressive / flexible in-memory data manipulation tool in any language

Preferably one of the fastest, too

Vastly simplify the data preparation, munging, and integration process

Comfortable abstractions: master data-fu without needing to be a computer scientist

Later: extend API with distributed computing backend for larger-than-memory datasets


pandas: a brief history

Started building in April 2008, back at AQR

Open-sourced (BSD license) mid-2009

29075 lines of Python/Cython code as of yesterday, and growing fast

Heavily tested, being used by many companies (incl. lots of financial firms) in production


Cython: getting good performance

My tool of choice for writing performant code

High level access to NumPy C API internals

Buffer syntax/protocol abstracts away striding details of non-contiguous arrays, very low overhead vs. working with raw C pointers

Reduce/remove interpreter overhead associated with working with Python data structures

Interface directly with C/C++ code when necessary


Axis indexing

Key pandas feature

The axis index is a data structure itself, which can be customized to support things like:

1-1 O(1) indexing with hashable Python objects

Datetime indexing for time series data

Hierarchical (multi-level) indexing

Use Python dict to support O(1) lookups and O(n) realignment ops. Can specialize to get better performance and memory usage
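A minimal sketch of the dict-based idea above, in plain Python. The helper names build_lookup and realign are hypothetical and chosen only for illustration; this is not pandas' internal code.

```python
# Hypothetical helpers illustrating a dict-backed axis index.

def build_lookup(labels):
    """Map each unique, hashable label to its integer position."""
    return {label: pos for pos, label in enumerate(labels)}


def realign(values, old_labels, new_labels, fill=None):
    """Reorder `values` to match `new_labels`; missing labels get `fill`."""
    lookup = build_lookup(old_labels)      # one O(n) pass to build the table
    out = []
    for label in new_labels:               # one O(n) pass, O(1) per lookup
        pos = lookup.get(label)
        out.append(values[pos] if pos is not None else fill)
    return out


old_labels = ["b", "a", "c"]
values = [2.0, 1.0, 3.0]
aligned = realign(values, old_labels, ["a", "b", "d"])
print(aligned)  # [1.0, 2.0, None]
```

The label "d" does not exist in the old index, so its slot is filled with the missing-value marker rather than raising.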


Axis indexing

Every axis has an index

Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data

Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data

Natural way of expressing “group by” and join-type operations

As good as, or in many cases much more integrated/flexible than, commercial or open-source alternatives to pandas/Python
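A short example of automatic alignment, using made-up values. Adding two Series with different indexes matches values by label, not by position.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "d"])

# Addition aligns on the union of the two indexes. Labels present in only
# one operand come back as NaN instead of being silently mismatched.
total = s1 + s2
print(total)
```

Here "b" and "c" are summed by label even though they sit at different positions in the two inputs, while "a" and "d" are marked missing.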


The trouble with Python dicts...

Python dict memory footprint can be quite large

1MM key-value pairs: something like 70 MB on a 64-bit system

Even though sizeof(PyObject*) == 8

Python dict is great, but should use a faster, threadsafe hash table for primitive C types (like 64-bit integers)

BUT: using a hash table is only necessary in the general case. With monotonic indexes you don’t need one for realignment ops
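To illustrate why monotonic indexes need no hash table, here is a sketch of an outer join over two sorted, duplicate-free label lists using a two-pointer merge. The function name outer_join_sorted is hypothetical, and this pure-Python version only shows the idea; it is not the Cython routine pandas uses.

```python
def outer_join_sorted(left, right):
    """Outer-join two strictly increasing lists without a hash table.

    Returns (joined, left_pos, right_pos). The position lists give, for each
    joined label, its location in the corresponding input, or -1 if absent.
    """
    joined, left_pos, right_pos = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:            # label present on both sides
            joined.append(left[i])
            left_pos.append(i)
            right_pos.append(j)
            i += 1
            j += 1
        elif left[i] < right[j]:           # label only on the left
            joined.append(left[i])
            left_pos.append(i)
            right_pos.append(-1)
            i += 1
        else:                              # label only on the right
            joined.append(right[j])
            left_pos.append(-1)
            right_pos.append(j)
            j += 1
    # Drain whichever input still has labels left.
    for k in range(i, len(left)):
        joined.append(left[k])
        left_pos.append(k)
        right_pos.append(-1)
    for k in range(j, len(right)):
        joined.append(right[k])
        left_pos.append(-1)
        right_pos.append(k)
    return joined, left_pos, right_pos


joined, lpos, rpos = outer_join_sorted([1, 3, 5, 7], [3, 4, 7, 9])
print(joined)  # [1, 3, 4, 5, 7, 9]
```

Each label is visited once and only compared, never hashed, which is why sorted primitive types are so much cheaper to align.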


Some alignment numbers

Hardware: MacBook Pro Core i7 laptop, Python 2.7.2

Outer-join 500k-length indexes chosen from 1MM elements

Dict-based with random strings: 2.2 seconds

Sorted strings: 400ms (5.5x faster)

Sorted int64: 19ms (115x faster)

Fortunately, time series data falls into this last category

Alignment ops with C primitives could be fairly easily parallelized with OpenMP in Cython


DataFrame, the pandas workhorse

A 2D tabular data structure with row and column indexes

Hierarchical indexing is one way to support higher-dimensional data in a lower-dimensional structure

Simplified NumPy type system: float, int, boolean, object

Rich indexing operations, SQL-like join/merges, etc.

Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case
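A small example of the two cases. The mixed frame reuses three rows from the "Structured data" table; the all-float frame uses made-up values.

```python
import numpy as np
import pandas as pd

# Heterogeneous frame: each column keeps its own dtype.
mixed = pd.DataFrame({
    "cname": ["Australia", "Australia", "Australia"],
    "agefrom": [15, 20, 25],
    "ls": [64.3, 48.4, 47.9],
})
print(mixed.dtypes)

# Homogeneous frame: all-float columns can be exposed as one 2D float64 array.
floats = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["x", "y"])
arr = floats.to_numpy()
print(arr.dtype, arr.shape)
```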


DataFrame, under the hood


Supporting size mutability

In order to have good row-oriented performance, need to store like-typed columns in a single ndarray

“Column” insertion: accumulate 1 × N × . . . homogeneous columns, later consolidate with other like-typed columns into a single block

I.e. avoid reallocate-copy or array concatenation steps as long as possible

Column deletions can be no-copy events (since ndarrays support views)
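A toy sketch of the deferred-consolidation idea, assuming NumPy. The class ToyBlockStore and its methods are hypothetical and only illustrate the strategy described above; pandas' real internals differ.

```python
import numpy as np


class ToyBlockStore:
    """Hypothetical illustration of deferred block consolidation."""

    def __init__(self):
        # Each block is (column_names, 2D ndarray of shape k x N).
        self.blocks = []

    def insert(self, name, values):
        # Cheap: the new column becomes its own 1 x N block, so existing
        # blocks are neither reallocated nor copied.
        self.blocks.append(([name], np.asarray(values).reshape(1, -1)))

    def consolidate(self):
        # Pay the concatenation cost once: merge blocks that share a dtype.
        by_dtype = {}
        for names, arr in self.blocks:
            by_dtype.setdefault(arr.dtype, []).append((names, arr))
        merged = []
        for group in by_dtype.values():
            names = [n for ns, _ in group for n in ns]
            merged.append((names, np.vstack([a for _, a in group])))
        self.blocks = merged


store = ToyBlockStore()
store.insert("a", [1.0, 2.0])
store.insert("n", [1, 2])
store.insert("b", [3.0, 4.0])
blocks_before = len(store.blocks)   # three separate 1 x N blocks
store.consolidate()
blocks_after = len(store.blocks)    # one float block, one integer block
```

After consolidation the two float columns share a single 2 x N array, which is what makes row-oriented access across like-typed columns cheap.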


Hierarchical indexing

New this year, but really should have done long ago

Natural result of multi-key groupby

An intuitive way to work with higher-dimensional data

Much less ad hoc way of expressing reshaping operations

Once you have it, things like Excel-style pivot tables just “fall out”
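For example, a multi-key groupby produces a hierarchically indexed result, and moving one level into the columns gives a pivot-table layout. The column names and numbers below are made up for illustration and are not data from the talk.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["AUS", "AUS", "AUS", "NZL", "NZL", "NZL"],
    "year": [1950, 1950, 1960, 1950, 1960, 1960],
    "population": [10, 20, 30, 1, 2, 3],
})

# Grouping by two keys yields a result indexed by a two-level MultiIndex.
totals = df.groupby(["country", "year"])["population"].sum()
print(totals.index.nlevels)  # 2

# Moving the inner level into the columns gives a pivot-table layout:
# one row per country, one column per year.
table = totals.unstack("year")
print(table)
```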


Reshaping


Reshaping

In [5]: df.unstack('agefrom').stack('year')
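A self-contained sketch of that call. The talk's actual df is not shown, so the starting layout here is an assumption: rows keyed by (ccode, agefrom) with one column per year. The values are made up; only the level names 'agefrom' and 'year' come from the slide.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["AUS", "NZL"], [1950, 1960], [15, 20]],
    names=["ccode", "year", "agefrom"],
)
s = pd.Series(range(8), index=index, dtype="float64")

# Assumed starting layout: rows (ccode, agefrom), columns = year.
df = s.unstack("year")

# unstack moves 'agefrom' from the rows to the columns;
# stack then moves 'year' from the columns to the rows.
result = df.unstack("agefrom").stack("year")
print(result)
```

The result has rows keyed by (ccode, year) and one column per agefrom value, so the two calls together swap which level lives on which axis.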


Reshaping implementation nuances

Must deal with unbalanced group sizes / missing data

Play vectorization tricks with the NumPy C-contiguous memory layout: no Python for loops allowed

Care must be taken to handle heterogeneous and homogeneous data cases


GroupBy

High level process

Split data set into groups

Apply function to each group (an aggregation or a transformation)

Combine results intelligently into a result data structure

Can be used to emulate SQL GROUP BY operations
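A small split-apply-combine example with made-up values, showing one aggregation and one transformation.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 6.0],
})
grouped = df.groupby("key")["value"]

# Aggregation: one result per group, comparable to
# SELECT key, AVG(value) FROM t GROUP BY key.
means = grouped.mean()

# Transformation: the result has the same length as the input,
# here each value minus its own group's mean.
demeaned = grouped.transform(lambda x: x - x.mean())

print(means)
print(demeaned)
```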


GroupBy

Grouping closely related to indexing

Create correspondence between axis labels and group labels using one of:

Array of group labels (like a DataFrame column)

Python function to be applied to each axis tick

Can group by multiple keys

For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof)
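The three ways of specifying groups can be sketched as follows, with made-up labels and values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["apple", "avocado", "banana", "blueberry"])

# 1. An array of group labels, one per row.
by_array = s.groupby(np.array(["x", "y", "x", "y"])).sum()

# 2. A Python function applied to each axis label (here: its first letter).
by_func = s.groupby(lambda label: label[0]).sum()

# 3. A level of a hierarchically indexed axis.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)], names=["outer", "inner"]
)
by_level = pd.Series([1, 2, 3, 4], index=idx).groupby(level="inner").sum()

print(by_array, by_func, by_level, sep="\n")
```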


GroupBy implementation challenges

Computing the group labels from arbitrary Python objects is very expensive

77ms for 1MM strings with 1K groups

107ms for 1MM strings with 10K groups

350ms for 1MM strings with 100K groups

To sort or not to sort (for iteration)?

Once you have the labels, can reorder the data set in O(n) (with a much smaller constant than computing the labels)

Roughly 35ms to reorder 1MM float64 data points given the labels

(By contrast, computing the mean of 1MM elements takes 1.4ms)

Python function call overhead is significant in cases with lots of small groups; much better (orders of magnitude speedup) to write specialized Cython routines
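The two steps above, computing integer group labels and then reordering in O(n), can be sketched in plain Python. The function names factorize and group_order are chosen for illustration, and the counting sort is one assumed way to get the O(n) reordering; the talk does not say which algorithm pandas uses.

```python
def factorize(keys):
    """Assign each distinct key an integer label in order of first appearance."""
    table = {}
    labels = []
    for key in keys:
        # Hashing arbitrary Python objects is the expensive part.
        labels.append(table.setdefault(key, len(table)))
    return labels, list(table)


def group_order(labels, ngroups):
    """Counting sort: positions that bring equal labels together, in O(n)."""
    counts = [0] * ngroups
    for lab in labels:
        counts[lab] += 1
    # starts[g] is where group g begins in the reordered data.
    starts = [0] * ngroups
    for g in range(1, ngroups):
        starts[g] = starts[g - 1] + counts[g - 1]
    order = [0] * len(labels)
    for pos, lab in enumerate(labels):
        order[starts[lab]] = pos
        starts[lab] += 1
    return order


keys = ["b", "a", "b", "c", "a"]
values = [10, 20, 30, 40, 50]
labels, uniques = factorize(keys)
order = group_order(labels, len(uniques))
reordered = [values[i] for i in order]
print(reordered)  # [10, 30, 20, 50, 40]
```

Once the data is reordered, each group is a contiguous slice, so per-group work can run over slices instead of calling a Python function on scattered elements.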


Demo, time permitting

