Structured Data Challenges in Finance and Statistics

Wes McKinney

Rice Statistics, 21 November 2011

Me

S.B., MIT Math ’07

3 years in the quant finance business

Now: starting a software company, initially to build financial data analysis and research systems

My blog: http://blog.wesmckinney.com

Twitter: @wesmckinn

Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly Media


Structured data

   cname      year  agefrom  ageto  ls    lsc   pop  ccode
0  Australia  1950  15       19     64.3  15.4  558  AUS
1  Australia  1950  20       24     48.4  26.4  645  AUS
2  Australia  1950  25       29     47.9  26.2  681  AUS
3  Australia  1950  30       34     44    23.8  614  AUS
4  Australia  1950  35       39     42.1  21.9  625  AUS
5  Australia  1950  40       44     38.9  20.1  555  AUS
6  Australia  1950  45       49     34    16.9  491  AUS
7  Australia  1950  50       54     29.6  14.6  439  AUS
8  Australia  1950  55       59     28    12.9  408  AUS
9  Australia  1950  60       64     26.3  12.1  356  AUS


Partial list of structured data necessities

Table modification: column insertion/deletion/type changes

Rich axis indexing, metadata

Easy data alignment

Aggregation and transformation by group (“group by”)

Missing data (NA) handling

Pivoting and reshaping

Merging and joining

Time series-specific manipulations

Fast Input/Output: text files, databases, HDF5, ...


Are existing tools good enough?

We care nearly equally about

Ease-of-use (syntax / API fits your mental model)

Expressiveness

Performance (speed and memory usage)

Clean, consistent interface design is hard


Auxiliary concerns

Any tool needs to integrate well with:

Statistical modeling tools

Data visualization (plotting)

Target users

Computer scientists, statisticians, software engineers?

Data scientists?


Are existing tools good enough?

The typical players

R data.frame and friends + CRAN libraries

SQL and other relational databases

Python / NumPy: structured (record) arrays

Commercial products: SAS, Stata, MS Excel...

My conclusion: we still have a ways to go

R has become demonstrably better in the last 5 years (e.g. via plyr, reshape2)


Deeper problems in many industries

Facilitating the research process is only part of the problem

Much of academia: “Production systems?”

Industry: a wasteland of misshapen wheels or expensive vendor products

Explosive growth in data-driven production systems

Hybrid-language systems are not always a good idea


The big data conundrum

Great effort being invested in the (difficult) problem of large-scale data processing, e.g. MapReduce-based

Less effort in the fundamental tooling for data manipulation / preparation / integration

Single-node performance does matter

Single-node code development time matters too


pandas: my effort in this arena

Pick your favorite: panel data structures or Python structured data analysis

Started building it in April 2008, back at AQR Capital

Open-sourced (BSD license) mid-2009

Heavily tested, being used by many companies (inc. lots of financial firms) as the cornerstone of their systems

Goal: optimal balance of ease-of-use, flexibility, and performance

Heavy development the last 6 months


Why did I start from scratch?

Accusations of NIH Syndrome abound

In 2008 I simultaneously needed

Agile, high performance data structures

A high productivity programming language for implementing all of the non-computational business logic

A production application platform that would seamlessly integrate with an interactive data analysis / research platform

In short, I was rebuilding major financial systems and I found my options inadequate

Thrilling innovation opportunity!


Why did I use Python?

High productivity general purpose language

Well thought-out object-oriented model

Excellent software-development tools

Easy for MATLAB/R users to learn

Flexible built-in data structures (dicts, sets, lists, tuples)

The right open-source scientific computing tools

Powerful array processing (NumPy)

Abundant tools for performance computing


But, Python is not perfect

For statistical computing, a chicken-and-egg problem

Python’s plotting libraries are not designed for statistical graphics

Built-in data structures are not especially optimized for my large data use cases

Occasional semantic / syntactic niggles



DataFrame, the pandas workhorse

A 2D tabular data structure with row and column indexes

Fast for row- and column-oriented operations

Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case


DataFrame

   cname      year  agefrom  ageto  ls    lsc   pop  ccode
0  Australia  1950  15       19     64.3  15.4  558  AUS
1  Australia  1950  20       24     48.4  26.4  645  AUS
2  Australia  1950  25       29     47.9  26.2  681  AUS
3  Australia  1950  30       34     44    23.8  614  AUS
4  Australia  1950  35       39     42.1  21.9  625  AUS
5  Australia  1950  40       44     38.9  20.1  555  AUS
6  Australia  1950  45       49     34    16.9  491  AUS
7  Australia  1950  50       54     29.6  14.6  439  AUS
8  Australia  1950  55       59     28    12.9  408  AUS
9  Australia  1950  60       64     26.3  12.1  356  AUS
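As a rough illustration of the table-modification necessities listed earlier, the sketch below rebuilds the first three rows of this table and then inserts, deletes, and retypes a column. It uses the current pandas API, which may differ in detail from the version shown in this talk; the derived column name `ls_minus_lsc` is made up for the example.

```python
import pandas as pd

# First three rows of the table above; column names are taken from the slide.
df = pd.DataFrame(
    {
        "cname": ["Australia", "Australia", "Australia"],
        "year": [1950, 1950, 1950],
        "agefrom": [15, 20, 25],
        "ageto": [19, 24, 29],
        "ls": [64.3, 48.4, 47.9],
        "lsc": [15.4, 26.4, 26.2],
        "pop": [558, 645, 681],
        "ccode": ["AUS", "AUS", "AUS"],
    }
)

# Column insertion: a derived column (the name is hypothetical).
df["ls_minus_lsc"] = df["ls"] - df["lsc"]

# Column deletion.
del df["ccode"]

# Column type change: integer population counts become floats.
df["pop"] = df["pop"].astype("float64")

print(df.dtypes)
```

Note that the columns keep their own dtypes (strings, integers, floats) side by side in one object.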


Axis indexing and metadata

Basic concept: labeled axes in use throughout the library

Need to support

Fast lookups (constant time)

Data realignment / selection by labels (linear)

Munging together irregularly indexed data

Key innovation: index is a data structure itself. Different implementations can support more sophisticated indexing

Axis labels can be any immutable Python object
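A minimal sketch of these ideas with the current pandas API (which may not match the 2011 API exactly): an `Index` object maps labels to positions, `reindex` selects or realigns by label, and labels need not be strings. The labels `d, a, b, c, e` and the missing label `z` are illustrative only.

```python
import datetime

import pandas as pd

# The index is its own object: it maps labels to integer positions.
index = pd.Index(["d", "a", "b", "c", "e"])
position = index.get_loc("b")  # hash-based label lookup

s = pd.Series([0, 1, 2, 3, 4], index=index)

# Selection / realignment by labels; a label that is absent ("z") yields NaN.
selected = s.reindex(["a", "b", "c", "z"])

# Labels can be other hashable Python objects, for example dates.
day = datetime.date(2011, 11, 21)
by_date = pd.Series([1.5], index=pd.Index([day], dtype=object))

print(position)
print(selected)
print(by_date.loc[day])
```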


Irregularly indexed data

[Figure: diagram of a DataFrame, labeled with its columns and its index.]


Axis indexing

[Figure: an axis index mapping the labels d, a, b, c, e to the integer positions 0, 1, 2, 3, 4.]


Why does this matter?

Real world data is highly irregular, especially time series

Operations between DataFrame objects automatically align on the indexes

Nearly impossible to have errors due to misaligned data

Can vastly facilitate munging unstructured data into structured form

Grants immense freedom in writing research code

Time series are just a special case of a general indexed data structure
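The automatic alignment described above can be sketched as follows, using the current pandas API and made-up values. Two frames with only partly overlapping row and column labels are added; the result covers the union of the labels, and cells present in only one operand come out as NaN rather than being silently mismatched.

```python
import pandas as pd

# Two frames with partly overlapping row labels and column labels (toy data).
df1 = pd.DataFrame({"p": [1.0, 2.0]}, index=["x", "y"])
df2 = pd.DataFrame({"p": [10.0, 20.0], "q": [5.0, 6.0]}, index=["y", "z"])

# Addition aligns on both axes before computing.
total = df1 + df2

print(total)
```

Only the cell at row `y`, column `p` exists in both inputs, so it is the only non-missing value in the result.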


Axis indexing

year                1965  1970  1975  1980  1985
cname      agefrom
Australia  15       85.5  92.6  91.1  91.7  87.7
           20       57.7  70.6  74.1  73.4  72
           25       52.8  57.7  71.8  74    73.5
           30       53.8  57.7  60.4  72.8  74.1
           35       46.6  52.6  59.5  63.2  72.7
           40       47.5  52.6  56.3  61.4  62.9
           45       41.3  48.7  55.9  61.1  61.1
           50       41.7  48.7  53.8  60.2  60
           55       36.3  42.1  54.1  61    59.2
           60       37.5  42.1  48.9  61.6  59.1
           65       30.2  30.9  48.8  60.4  59.7
           70       30.2  30.9  41.7  60.4  57.6
           75       30.2  30.9  41.7  62.5  57.5

"I have a brain!"

Me too!


Hierarchical indexing

Basic idea: represent high dimensional data in a lower-dimensional structure that is easier to reason about

Axis index with k levels of indexing

Slice chunks of data in constant time!

Provides a very natural way of implementing reshaping operations

Advantage over a truly N-dimensional object: space-efficient dense representation if groups are unbalanced

Extremely useful for econometric models on panel data
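A small sketch of a two-level row index, written against the current pandas API (`MultiIndex`), which may differ from the API at the time of this talk. The values are the 1965 and 1970 rows for age groups 15 and 20 from the tables in these slides.

```python
import pandas as pd

# Two index levels (cname, year) on the rows; age groups on the columns.
index = pd.MultiIndex.from_tuples(
    [("Australia", 1965), ("Australia", 1970), ("Austria", 1965), ("Austria", 1970)],
    names=["cname", "year"],
)
df = pd.DataFrame(
    {15: [85.5, 92.6, 8.1, 13.3], 20: [57.7, 70.6, 50.9, 61.4]},
    index=index,
)
df.columns.name = "agefrom"

# Select the whole chunk for one outer label.
austria = df.loc["Austria"]

# Select a single cell by the full key.
value = df.loc[("Australia", 1970), 20]

# Cross-section on the inner level.
y1965 = df.xs(1965, level="year")

print(austria)
print(value)
print(y1965)
```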


Hierarchical indexing

agefrom          15    20    25    30    35
cname      year
Australia  1965  85.5  57.7  52.8  53.8  46.6
           1970  92.6  70.6  57.7  57.7  52.6
           1975  91.1  74.1  71.8  60.4  59.5
           1980  91.7  73.4  74    72.8  63.2
           1985  87.7  72    73.5  74.1  72.7
           1990  80    66.4  72    73.5  74.1
           1995  66.3  59.9  66.4  72    73.5
           2000  73.4  38.2  59.9  66.4  72
           2005  82.3  44.8  38.2  59.9  66.4
           2010  78.4  41.5  44.8  38.2  59.9
Austria    1965  8.1   50.9  50.9  24.2  24.2
           1970  13.3  61.4  50.9  50.9  39.6
           1975  23.5  64.3  61.4  56.6  49.6
           1980  23.8  73.9  64.3  61.4  61.4
           1985  22.4  69.2  71.1  62.9  59.9
           1990  27.4  72    68.9  68.3  61.5
           1995  38.9  64.2  66.9  66.6  66.7
           2000  41.4  73.2  64.2  63.1  64.5
           2005  42.9  93.9  75    62.9  59.2
           2010  45.4  93.5  92.4  73.2  62.7


Joining and merging

Join and merge-type operations are very easy to implement with indexing in place

A multi-key join is the same code as aligning hierarchically-indexed DataFrames

Will illustrate this with examples
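One possible illustration, using the current pandas API: a two-key merge on columns gives the same rows as setting those keys as a hierarchical index and joining on it. The `ls` values come from the tables in these slides; the `flag` column and its values are hypothetical.

```python
import pandas as pd

keys = ["cname", "year"]

a = pd.DataFrame(
    {"cname": ["Australia", "Australia"], "year": [1965, 1970], "ls": [85.5, 92.6]}
)
# "flag" is a made-up column used only to have something to join in.
b = pd.DataFrame(
    {"cname": ["Australia", "Australia"], "year": [1970, 1975], "flag": [1.0, 2.0]}
)

# Multi-key merge on columns.
merged = pd.merge(a, b, on=keys, how="outer")

# The same operation expressed as alignment of hierarchically indexed frames.
aligned = a.set_index(keys).join(b.set_index(keys), how="outer")

# An inner join keeps only the keys present on both sides.
inner = a.set_index(keys).join(b.set_index(keys), how="inner")

print(merged)
print(aligned)
print(inner)
```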


Supporting size mutability

In order to have good row-oriented performance, need to store like-typed columns in a single ndarray

“Column” insertion: accumulate 1 × N × . . . homogeneous columns, later consolidate with other like-typed into a single block

I.e. avoid reallocate-copy or array concatenation steps as long as possible

Column deletions can be no-copy events (since ndarrays support views)
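The idea can be mimicked with plain NumPy. This is a toy illustration of the strategy described above, not the actual pandas internals: like-typed columns live in one 2D block, newly inserted columns wait in a pending list, consolidation happens once, and deleting a leading column is just a view.

```python
import numpy as np

# A float block holding 2 columns of 4 rows each (one column per block row).
block = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])

# A newly inserted column is kept as its own 1 x N array for now,
# so no reallocation of the big block is needed at insertion time.
pending = [np.array([[9.0, 10.0, 11.0, 12.0]])]

# Consolidate lazily: one concatenation for all pending like-typed columns.
consolidated = np.vstack([block] + pending)

# Deleting the first column can be a slice, which is a view and not a copy.
remaining = consolidated[1:]

print(consolidated.shape)
print(np.shares_memory(remaining, consolidated))
```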


DataFrame, under the hood

[Figure: the columns "you see", each tagged with a dtype (O, F, B, I), versus how they are "actually" stored, grouped into like-typed blocks.]


Reshaping

The fundamental operations

stack: pivot level from columns to rows

unstack: pivot level from rows to columns

Completely natural and intuitive with hierarchical indexing

No munging of column names necessary

year                1965  1970  1975  1980  1985  1990  1995  2000  2005  2010
cname      agefrom
Australia  15       85.5  92.6  91.1  91.7  87.7  80    66.3  73.4  82.3  78.4
           20       57.7  70.6  74.1  73.4  72    66.4  59.9  38.2  44.8  41.5
           25       52.8  57.7  71.8  74    73.5  72    66.4  59.9  38.2  44.8
           30       53.8  57.7  60.4  72.8  74.1  73.5  72    66.4  59.9  38.2
           35       46.6  52.6  59.5  63.2  72.7  74.1  73.5  72    66.4  59.9
Austria    15       8.1   13.3  23.5  23.8  22.4  27.4  38.9  41.4  42.9  45.4
           20       50.9  61.4  64.3  73.9  69.2  72    64.2  73.2  93.9  93.5
           25       50.9  50.9  61.4  64.3  71.1  68.9  66.9  64.2  75    92.4
           30       24.2  50.9  56.6  61.4  62.9  68.3  66.6  63.1  62.9  73.2
           35       24.2  39.6  49.6  61.4  59.9  61.5  66.7  64.5  59.2  62.7


Reshaping

In [5]: df.unstack('agefrom').stack('year')

agefrom          15    20    25    30    35
cname      year
Australia  1965  85.5  57.7  52.8  53.8  46.6
           1970  92.6  70.6  57.7  57.7  52.6
           1975  91.1  74.1  71.8  60.4  59.5
           1980  91.7  73.4  74    72.8  63.2
           1985  87.7  72    73.5  74.1  72.7
           1990  80    66.4  72    73.5  74.1
           1995  66.3  59.9  66.4  72    73.5
           2000  73.4  38.2  59.9  66.4  72
           2005  82.3  44.8  38.2  59.9  66.4
           2010  78.4  41.5  44.8  38.2  59.9
Austria    1965  8.1   50.9  50.9  24.2  24.2
           1970  13.3  61.4  50.9  50.9  39.6
           1975  23.5  64.3  61.4  56.6  49.6
           1980  23.8  73.9  64.3  61.4  61.4
           1985  22.4  69.2  71.1  62.9  59.9
           1990  27.4  72    68.9  68.3  61.5
           1995  38.9  64.2  66.9  66.6  66.7
           2000  41.4  73.2  64.2  63.1  64.5
           2005  42.9  93.9  75    62.9  59.2
           2010  45.4  93.5  92.4  73.2  62.7


Reshaping implementation nuances

Must carefully deal with unbalanced group sizes / missing data

I play vectorization tricks with the NumPy memory layout: no for loops!

Care must be taken to handle heterogeneous and homogeneous data cases
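To see why unbalanced groups need care, consider this small sketch (current pandas API, three values taken from the tables in these slides). Austria has no row for age group 20, so unstacking has to introduce a missing value, and going back to the long form has to drop it again.

```python
import pandas as pd

# Unbalanced groups: Australia has two age groups, Austria only one.
index = pd.MultiIndex.from_tuples(
    [("Australia", 15), ("Australia", 20), ("Austria", 15)],
    names=["cname", "agefrom"],
)
s = pd.Series([85.5, 57.7, 8.1], index=index)

# Rows to columns: the missing (Austria, 20) cell becomes NaN.
wide = s.unstack("agefrom")

# Columns back to rows; dropna() is explicit because the default handling
# of missing cells in stack() has changed between pandas versions.
long_again = wide.stack().dropna()

print(wide)
print(long_again)
```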


GroupBy

High level process

split data set into groups

apply function to each group (an aggregation or a transformation)

combine results intelligently into a result data structure

Can be used to emulate SQL GROUP BY operations
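The three steps can be spelled out by hand and compared with the built-in aggregation. This is only a sketch with the current pandas API; the four `ls` values are taken from the tables in these slides.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cname": ["Australia", "Australia", "Austria", "Austria"],
        "ls": [85.5, 57.7, 8.1, 50.9],
    }
)

# Split: one sub-frame per group label.
pieces = {name: group for name, group in df.groupby("cname")}

# Apply: an aggregation on each piece.
results = {name: group["ls"].mean() for name, group in pieces.items()}

# Combine: assemble the per-group results into one labeled structure.
manual = pd.Series(results, name="ls")

# The same thing in one call, comparable to
# SELECT cname, AVG(ls) FROM df GROUP BY cname.
builtin = df.groupby("cname")["ls"].mean()

print(manual)
print(builtin)
```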


GroupBy

Grouping closely related to indexing

Create correspondence between axis labels and group labels using one of:

Array of group labels (like a DataFrame column)

Python function to be applied to each axis tick

Can group by multiple keys

For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof)
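A brief sketch of the three kinds of grouping key mentioned here, using the current pandas API and toy data (the fruit labels and the `x`/`y` array are invented for the example).

```python
import numpy as np
import pandas as pd

flat = pd.Series(
    [1.0, 2.0, 3.0, 4.0], index=["apple", "avocado", "banana", "blueberry"]
)

# Key as a function applied to each axis label: group by first letter.
by_func = flat.groupby(lambda label: label[0]).sum()

# Key as an array of group labels, one per row.
labels = np.array(["x", "y", "x", "y"])
by_array = flat.groupby(labels).sum()

# Key as a level of a hierarchically indexed axis.
nested = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.MultiIndex.from_tuples(
        [("Australia", 15), ("Australia", 20), ("Austria", 15), ("Austria", 20)],
        names=["cname", "agefrom"],
    ),
)
by_level = nested.groupby(level="cname").sum()

print(by_func)
print(by_array)
print(by_level)
```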


Anatomy of GroupBy

grouped = obj.groupby([key1, key2, key3])

This returns a GroupBy object

Each of the keys could be any of:

A Python function

A vector

A column name


Anatomy of GroupBy

aggregate, transform, and the more general apply supported

group_means = grouped.agg(np.mean)

group_means2 = grouped.mean()

demeaned = grouped.transform(lambda x: x - x.mean())
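For a runnable version of the calls above, here is a minimal sketch on toy data with the current pandas API. An aggregation returns one value per group, while a transformation returns a result with the same shape as the input.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cname": ["Australia", "Australia", "Austria", "Austria"],
        "ls": [85.5, 57.7, 8.1, 50.9],
    }
)
grouped = df.groupby("cname")["ls"]

# Aggregation: one row per group.
group_means = grouped.mean()

# Transformation: same length as the input, each value minus its group mean.
demeaned = grouped.transform(lambda x: x - x.mean())

print(group_means)
print(demeaned)
```

After demeaning, the mean within each group is zero up to floating-point error.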


Anatomy of GroupBy

The GroupBy object is also iterable

group_means = {}
for group_name, group in grouped:
    # aggregate the current group, not the whole GroupBy object
    group_means[group_name] = group.mean()


GroupBy and hierarchical indexing

Hierarchical indexing came about as the natural result of a multi-key aggregation:

>>> group_means = df.groupby(['country', 'agefrom']).mean()
>>> group_means[['ls', 'lsc', 'pop']].unstack('country')

         ls                  lsc                 pop
country  Australia  Austria  Australia  Austria  Australia  Austria
agefrom
15       70.03      31.1     26.17      14.67    6163       3310
20       58.02      59.98    34.51      45.83    1113       531
25       57.02      45.5     33.02      32.73    5021       2791
30       59.16      46.56    35.29      33.87    1082       527
35       59.58      43.29    34.53      30.3     1053       528.8
40       58.8       40.98    33.92      27.88    1005       522.5
45       56.79      39.19    31.71      26       927.2      503.5
50       54.71      37.14    30.37      23.94    836.6      475.2
55       51.93      34.82    26.98      22.11    735.8      443.2
60       49.92      32.35    25.85      20.2     632.6      408.8
65       47.02      29.59    22.98      18.44    522.5      361.9
70       46.27      28.52    22.53      17.72    410.5      295.8
75       46.28      28.06    23.41      18.42    624.5      437.6


What makes GroupBy hard?

factor-izing the group labels is very expensive

Function call overhead on small groups

To sort or not to sort?

Cheaper than computing the group labels!

Munging together results in exceptional cases is tricky
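"Factorizing" means turning arbitrary group keys into small integer codes. The sketch below uses `pd.factorize` from the current pandas API together with `np.bincount` to show why the codes are worth computing: once they exist, a grouped sum is a single vectorized pass. It is an illustration of the idea, not a description of how pandas implements GroupBy; the keys and values are invented.

```python
import numpy as np
import pandas as pd

keys = np.array(["b", "a", "b", "c", "a"], dtype=object)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Factorize: integer codes in order of first appearance, plus the unique keys.
codes, uniques = pd.factorize(keys)

# With integer codes, a grouped sum needs no Python-level loop.
sums = np.bincount(codes, weights=values, minlength=len(uniques))

# Sorting the unique keys is optional and only changes the code assignment.
codes_sorted, uniques_sorted = pd.factorize(keys, sort=True)

print(codes, uniques)
print(sums)
print(codes_sorted, uniques_sorted)
```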


Time series operations

Fixed frequency indexing (DateRange)

Domain-specific time offsets (like business day, business month end)

Frequency conversion

Forward-filling / back-filling / interpolation

Leading/lagging

In the works (later this year), better/faster upsampling/downsampling
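A short sketch of several of these operations with the current pandas API. Note the assumption: the `DateRange` class named on this slide belongs to the pandas of 2011, and the sketch uses today's `pd.date_range` and `pd.offsets` instead. The values are made up.

```python
import numpy as np
import pandas as pd

# Fixed-frequency index: six business days starting Monday 21 November 2011.
idx = pd.date_range("2011-11-21", periods=6, freq="B")
ts = pd.Series(np.arange(6.0), index=idx)

# Leading / lagging: shift the values by one period.
lagged = ts.shift(1)

# Frequency conversion to calendar days leaves the weekend empty...
daily = ts.asfreq("D")

# ...and forward-filling carries Friday's value across it.
filled = daily.ffill()

# Domain-specific offsets: next business day and business month end.
next_bday = idx[0] + pd.offsets.BDay(1)
month_end = idx[0] + pd.offsets.BMonthEnd()

print(daily)
print(filled)
print(next_bday, month_end)
```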


Shoot-out with R’s time series objects

“Inner join” addition between two irregular 500K-length time series sampled from 1 million-length time series. Timestamps are POSIXct (R) / int64 (Python)

package  timing    factor
pandas   21.5 ms   1.0
xts      41.3 ms   1.92
fts      370 ms    17.2
its      1117 ms   51.95
zoo      3479 ms   161.8

Where there is smoke, there is fire?

Iron: MacBook Pro Core i7 with R 2.13.1, pandas 0.5.1git / Python 2.7.2


Erm, inner join?

Intersecting time stamps loses information
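The point can be seen on two tiny irregular series (toy data, current pandas API). An inner-join addition keeps only the shared timestamp, while the default union alignment keeps every timestamp and marks the gaps, and `add` with `fill_value` lets the caller decide how to treat them.

```python
import pandas as pd

a = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2011-11-21", "2011-11-22", "2011-11-23"]),
)
b = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(["2011-11-22", "2011-11-24"]),
)

# Inner join: only timestamps present in both series survive.
a_in, b_in = a.align(b, join="inner")
inner_sum = a_in + b_in

# Default arithmetic aligns on the union; unmatched timestamps become NaN.
outer_sum = a + b

# Or treat a missing observation as zero instead of discarding it.
filled_sum = a.add(b, fill_value=0.0)

print(inner_sum)
print(outer_sum)
print(filled_sum)
```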


My (mild) performance obsession


Good performance stems from many factors

Well-designed algorithms (from a complexity standpoint)

Minimizing copying of data

Minimizing interpreter / function call overhead

Taking advantage of memory layout

I value functionality over performance, but I do spend a lot of time profiling code and grokking performance characteristics
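Two of these factors, avoiding copies and being aware of memory layout, can be illustrated with plain NumPy. This is a generic sketch on a toy array, not code from pandas.

```python
import numpy as np

# A 3 x 4 array in C (row-major) order.
data = np.arange(12.0).reshape(3, 4)

# Slicing returns views: no data is copied.
row = data[1]      # contiguous in memory
col = data[:, 1]   # strided: elements are 4 floats apart

print(np.shares_memory(row, data), np.shares_memory(col, data))
print(row.flags["C_CONTIGUOUS"], col.flags["C_CONTIGUOUS"])

# One vectorized call replaces a Python-level loop over rows.
col_sums = data.sum(axis=0)
print(col_sums)
```

Which axis is contiguous matters when choosing whether to store data by row or by column.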


Demo, time permitting

