Structured Data Challenges in Finance and Statistics

42
Structured Data Challenges in Finance and Statistics Wes McKinney Rice Statistics, 21 November 2011 Wes McKinney () Structured data challenges Rice Statistics 1 / 43

description

Presented

Transcript of Structured Data Challenges in Finance and Statistics

Page 1: Structured Data Challenges in Finance and Statistics

Structured Data Challenges in Finance and Statistics

Wes McKinney

Rice Statistics, 21 November 2011

Wes McKinney () Structured data challenges Rice Statistics 1 / 43

Page 2: Structured Data Challenges in Finance and Statistics

Me

S.B., MIT Math ’07

3 years in the quant finance business

Now: starting a software company, initially to build financial dataanalysis and research systems

My blog: http://blog.wesmckinney.com

Twitter: @wesmckinn

Book! “Python for Data Analysis”, to hit the shelves later next yearfrom O’Reilly Media

Wes McKinney () Structured data challenges Rice Statistics 2 / 43

Page 3: Structured Data Challenges in Finance and Statistics

Structured data

cname year agefrom ageto ls lsc pop ccode

0 Australia 1950 15 19 64.3 15.4 558 AUS

1 Australia 1950 20 24 48.4 26.4 645 AUS

2 Australia 1950 25 29 47.9 26.2 681 AUS

3 Australia 1950 30 34 44 23.8 614 AUS

4 Australia 1950 35 39 42.1 21.9 625 AUS

5 Australia 1950 40 44 38.9 20.1 555 AUS

6 Australia 1950 45 49 34 16.9 491 AUS

7 Australia 1950 50 54 29.6 14.6 439 AUS

8 Australia 1950 55 59 28 12.9 408 AUS

9 Australia 1950 60 64 26.3 12.1 356 AUS

Wes McKinney () Structured data challenges Rice Statistics 3 / 43

Page 4: Structured Data Challenges in Finance and Statistics

Partial list of structured data necessities

Table modification: column insertion/deletion/type changes

Rich axis indexing, metadata

Easy data alignment

Aggregation and transformation by group (“group by”)

Missing data (NA) handling

Pivoting and reshaping

Merging and joining

Time series-specific manipulations

Fast Input/Output: text files, databases, HDF5, ...

Wes McKinney () Structured data challenges Rice Statistics 4 / 43

Page 5: Structured Data Challenges in Finance and Statistics

Are existing tools good enough?

We care nearly equally about

Ease-of-use (syntax / API fits your mental model)ExpressivenessPerformance (speed and memory usage)

Clean, consistent interface design is hard

Wes McKinney () Structured data challenges Rice Statistics 5 / 43

Page 6: Structured Data Challenges in Finance and Statistics

Auxiliary concerns

Any tool needs to integrate well with:

Statistical modeling toolsData visualization (plotting)

Target users

Computer scientists, statisticians, software engineers?Data scientists?

Wes McKinney () Structured data challenges Rice Statistics 6 / 43

Page 7: Structured Data Challenges in Finance and Statistics

Are existing tools good enough?

The typical players

R data.frame and friends + CRAN librariesSQL and other relational databasesPython / NumPy: structured (record) arraysCommercial products: SAS, Stata, MS Excel...

My conclusion: we still have a ways to go

R has become demonstrably better in the last 5 years (e.g. via plyr,reshape2)

Wes McKinney () Structured data challenges Rice Statistics 7 / 43

Page 8: Structured Data Challenges in Finance and Statistics

Deeper problems in many industries

Facilitating the research process only part of problem

Much of academia: “Production systems?”

Industry: a wasteland of misshapen wheels or expensive vendorproducts

Explosive growth in data-driven production systems

Hybrid-language systems are not always a good idea

Wes McKinney () Structured data challenges Rice Statistics 8 / 43

Page 9: Structured Data Challenges in Finance and Statistics

The big data conundrum

Great effort being invested in the (difficult) problem of large-scaledata processing, e.g. MapReduce-based

Less effort in the fundamental tooling for data manipulation /preparation / integration

Single-node performance does matter

Single-node code development time matters too

Wes McKinney () Structured data challenges Rice Statistics 9 / 43

Page 10: Structured Data Challenges in Finance and Statistics

pandas: my effort in this arena

Pick your favorite: panel data structures or Python structured dataanalysis

Starting building April 2008 back at AQR Capital

Open-sourced (BSD license) mid-2009

Heavily tested, being used by many companies (inc. lots of financialfirms) as the cornerstone of their systems

Goal: optimal balance of ease-of-use, flexibility, and performance

Heavy development the last 6 months

Wes McKinney () Structured data challenges Rice Statistics 10 / 43

Page 11: Structured Data Challenges in Finance and Statistics

Why did I start from scratch?

Accusations of NIH Syndrome abound

In 2008 I simultaneously needed

Agile, high performance data structuresA high productivity programming language for implementing all of thenon-computational business logicA production application platform that would seamlessly integrate withan interactive data analysis / research platform

In short, I was rebuilding major financial systems and I found myoptions inadequate

Thrilling innovation opportunity!

Wes McKinney () Structured data challenges Rice Statistics 11 / 43

Page 12: Structured Data Challenges in Finance and Statistics

Why did I use Python?

High productivity general purpose language

Well thought-out object-oriented model

Excellent software-development tools

Easy for MATLAB/R users to learn

Flexible built-in data structures (dicts, sets, lists, tuples)

The right open-source scientific computing tools

Powerful array processing (NumPy)Abundant tools for performance computing

Wes McKinney () Structured data challenges Rice Statistics 12 / 43

Page 13: Structured Data Challenges in Finance and Statistics

But, Python is not perfect

For statistical computing, a chicken-and-egg problem

Python’s plotting libraries are not designed for statistical graphics

Built-in data structures are not especially optimized for my large datause cases

Occasional semantic / syntactic niggles

Wes McKinney () Structured data challenges Rice Statistics 13 / 43

Page 14: Structured Data Challenges in Finance and Statistics

Partial list of structured data necessities

Table modification: column insertion/deletion/type changes

Rich axis indexing, metadata

Easy data alignment

Aggregation and transformation by group (“group by”)

Missing data (NA) handling

Pivoting and reshaping

Merging and joining

Time series-specific manipulations

Fast Input/Output: text files, databases, HDF5, ...

Wes McKinney () Structured data challenges Rice Statistics 14 / 43

Page 15: Structured Data Challenges in Finance and Statistics

DataFrame, the pandas workhorse

A 2D tabular data structure with row and column indexes

Fast for row- and column-oriented operations

Support heterogeneous columns WITHOUT sacrificing performance inthe homogeneous (e.g. floating point only) case

Wes McKinney () Structured data challenges Rice Statistics 15 / 43

Page 16: Structured Data Challenges in Finance and Statistics

DataFrame

cname year agefrom ageto ls lsc pop ccode

0 Australia 1950 15 19 64.3 15.4 558 AUS

1 Australia 1950 20 24 48.4 26.4 645 AUS

2 Australia 1950 25 29 47.9 26.2 681 AUS

3 Australia 1950 30 34 44 23.8 614 AUS

4 Australia 1950 35 39 42.1 21.9 625 AUS

5 Australia 1950 40 44 38.9 20.1 555 AUS

6 Australia 1950 45 49 34 16.9 491 AUS

7 Australia 1950 50 54 29.6 14.6 439 AUS

8 Australia 1950 55 59 28 12.9 408 AUS

9 Australia 1950 60 64 26.3 12.1 356 AUS

Wes McKinney () Structured data challenges Rice Statistics 16 / 43

Page 17: Structured Data Challenges in Finance and Statistics

Axis indexing and metadata

Basic concept: labeled axes in use throughout the library

Need to support

Fast lookups (constant time)Data realignment / selection by labels (linear)Munging together irregularly indexed data

Key innovation: index is a data structure itself. Differentimplementations can support more sophisticated indexing

Axis labels can be any immutable Python object

Wes McKinney () Structured data challenges Rice Statistics 17 / 43

Page 18: Structured Data Challenges in Finance and Statistics

Irregularly indexed data

Columns

INDEX

DataFrame

Wes McKinney () Structured data challenges Rice Statistics 18 / 43

Page 19: Structured Data Challenges in Finance and Statistics

Axis indexing

d

a

b

c

e

0

1

2

3

4

Axis Index

Wes McKinney () Structured data challenges Rice Statistics 19 / 43

Page 20: Structured Data Challenges in Finance and Statistics

Why does this matter?

Real world data is highly irregular, especially time series

Operations between DataFrame objects automatically align on theindexes

Nearly impossible to have errors due to misaligned data

Can vastly facilitate munging unstructured data into structured form

Grants immense freedom in writing research code

Time series are just a special case of a general indexed data structure

Wes McKinney () Structured data challenges Rice Statistics 20 / 43

Page 21: Structured Data Challenges in Finance and Statistics

Axis indexing

year 1965 1970 1975 1980 1985cname agefrom Australia 15 85.5 92.6 91.1 91.7 87.7 20 57.7 70.6 74.1 73.4 72 25 52.8 57.7 71.8 74 73.5 30 53.8 57.7 60.4 72.8 74.1 35 46.6 52.6 59.5 63.2 72.7 40 47.5 52.6 56.3 61.4 62.9 45 41.3 48.7 55.9 61.1 61.1 50 41.7 48.7 53.8 60.2 60 55 36.3 42.1 54.1 61 59.2 60 37.5 42.1 48.9 61.6 59.1 65 30.2 30.9 48.8 60.4 59.7 70 30.2 30.9 41.7 60.4 57.6 75 30.2 30.9 41.7 62.5 57.5

"I have a brain!"

Me too!

Wes McKinney () Structured data challenges Rice Statistics 21 / 43

Page 22: Structured Data Challenges in Finance and Statistics

Hierarchical indexing

Basic idea: represent high dimensional data in a lower-dimensionalstructure that is easier to reason about

Axis index with k levels of indexing

Slice chunks of data in constant time!

Provides a very natural way of implementing reshaping operations

Advantage over a truly N-dimensional object: space-efficient denserepresentation if groups are unbalanced

Extremely useful for econometric models on panel data

Wes McKinney () Structured data challenges Rice Statistics 22 / 43

Page 23: Structured Data Challenges in Finance and Statistics

Hierarchical indexing

agefrom 15 20 25 30 35 cname year Australia 1965 85.5 57.7 52.8 53.8 46.6 1970 92.6 70.6 57.7 57.7 52.6 1975 91.1 74.1 71.8 60.4 59.5 1980 91.7 73.4 74 72.8 63.2 1985 87.7 72 73.5 74.1 72.7 1990 80 66.4 72 73.5 74.1 1995 66.3 59.9 66.4 72 73.5 2000 73.4 38.2 59.9 66.4 72 2005 82.3 44.8 38.2 59.9 66.4 2010 78.4 41.5 44.8 38.2 59.9Austria 1965 8.1 50.9 50.9 24.2 24.2 1970 13.3 61.4 50.9 50.9 39.6 1975 23.5 64.3 61.4 56.6 49.6 1980 23.8 73.9 64.3 61.4 61.4 1985 22.4 69.2 71.1 62.9 59.9 1990 27.4 72 68.9 68.3 61.5 1995 38.9 64.2 66.9 66.6 66.7 2000 41.4 73.2 64.2 63.1 64.5 2005 42.9 93.9 75 62.9 59.2 2010 45.4 93.5 92.4 73.2 62.7

Wes McKinney () Structured data challenges Rice Statistics 23 / 43

Page 24: Structured Data Challenges in Finance and Statistics

Joining and merging

Join and merge-type operations are very easy to implement withindexing in place

Multi-key join same code as aligning hierarchically-indexedDataFrames

Will illustrate this with examples

Wes McKinney () Structured data challenges Rice Statistics 24 / 43

Page 25: Structured Data Challenges in Finance and Statistics

Supporting size mutability

In order to have good row-oriented performance, need to storelike-typed columns in a single ndarray

“Column” insertion: accumulate 1 × N × . . . homogeneous columns,later consolidate with other like-typed into a single block

I.e. avoid reallocate-copy or array concatenation steps as long aspossible

Column deletions can be no-copy events (since ndarrays supportviews)

Wes McKinney () Structured data challenges Rice Statistics 25 / 43

Page 26: Structured Data Challenges in Finance and Statistics

DataFrame, under the hood

O F BI I F B F F O

6 2 4 5 31 1 652 3 4

Actually You see

Wes McKinney () Structured data challenges Rice Statistics 26 / 43

Page 27: Structured Data Challenges in Finance and Statistics

Reshaping

The fundamental operations

stack: pivot level from columns to rowsunstack: pivot level from rows to columns

Completely natural and intuitive with hierarchical indexing

No munging of column names necessary

year 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010cname agefrom Australia 15 85.5 92.6 91.1 91.7 87.7 80 66.3 73.4 82.3 78.4 20 57.7 70.6 74.1 73.4 72 66.4 59.9 38.2 44.8 41.5 25 52.8 57.7 71.8 74 73.5 72 66.4 59.9 38.2 44.8 30 53.8 57.7 60.4 72.8 74.1 73.5 72 66.4 59.9 38.2 35 46.6 52.6 59.5 63.2 72.7 74.1 73.5 72 66.4 59.9Austria 15 8.1 13.3 23.5 23.8 22.4 27.4 38.9 41.4 42.9 45.4 20 50.9 61.4 64.3 73.9 69.2 72 64.2 73.2 93.9 93.5 25 50.9 50.9 61.4 64.3 71.1 68.9 66.9 64.2 75 92.4 30 24.2 50.9 56.6 61.4 62.9 68.3 66.6 63.1 62.9 73.2 35 24.2 39.6 49.6 61.4 59.9 61.5 66.7 64.5 59.2 62.7

Wes McKinney () Structured data challenges Rice Statistics 27 / 43

Page 28: Structured Data Challenges in Finance and Statistics

Reshaping

In [5]: df.unstack(’agefrom’).stack(’year’)

agefrom 15 20 25 30 35 cname year Australia 1965 85.5 57.7 52.8 53.8 46.6 1970 92.6 70.6 57.7 57.7 52.6 1975 91.1 74.1 71.8 60.4 59.5 1980 91.7 73.4 74 72.8 63.2 1985 87.7 72 73.5 74.1 72.7 1990 80 66.4 72 73.5 74.1 1995 66.3 59.9 66.4 72 73.5 2000 73.4 38.2 59.9 66.4 72 2005 82.3 44.8 38.2 59.9 66.4 2010 78.4 41.5 44.8 38.2 59.9Austria 1965 8.1 50.9 50.9 24.2 24.2 1970 13.3 61.4 50.9 50.9 39.6 1975 23.5 64.3 61.4 56.6 49.6 1980 23.8 73.9 64.3 61.4 61.4 1985 22.4 69.2 71.1 62.9 59.9 1990 27.4 72 68.9 68.3 61.5 1995 38.9 64.2 66.9 66.6 66.7 2000 41.4 73.2 64.2 63.1 64.5 2005 42.9 93.9 75 62.9 59.2 2010 45.4 93.5 92.4 73.2 62.7

Wes McKinney () Structured data challenges Rice Statistics 28 / 43

Page 29: Structured Data Challenges in Finance and Statistics

Reshaping implementation nuances

Must carefully deal with unbalanced group sizes / missing data

I play vectorization tricks with the NumPy memory layout: no forloops!

Care must be taken to handle heterogeneous and homogeneous datacases

Wes McKinney () Structured data challenges Rice Statistics 29 / 43

Page 30: Structured Data Challenges in Finance and Statistics

GroupBy

High level process

split data set into groupsapply function to each group (an aggregation or a transformation)combine results intelligently into a result data structure

Can be used to emulate SQL GROUP BY operations

Wes McKinney () Structured data challenges Rice Statistics 30 / 43

Page 31: Structured Data Challenges in Finance and Statistics

GroupBy

Grouping closely related to indexing

Create correspondence between axis labels and group labels using oneof:

Array of group labels (like a DataFrame column)Python function to be applied to each axis tick

Can group by multiple keys

For a hierarchically indexed axis, can select a level and group by that(or some transformation thereof)

Wes McKinney () Structured data challenges Rice Statistics 31 / 43

Page 32: Structured Data Challenges in Finance and Statistics

Anatomy of GroupBy

grouped = obj.groupby([key1, key2, key3])

This returns a GroupBy object

Each of the keys could be any of:

A Python functionA vectorA column name

Wes McKinney () Structured data challenges Rice Statistics 32 / 43

Page 33: Structured Data Challenges in Finance and Statistics

Anatomy of GroupBy

aggregate, transform, and the more general apply supported

group_means = grouped.agg(np.mean)

group_means2 = grouped.mean()

demeaned = grouped.transform(lambda x: x - x.mean())

Wes McKinney () Structured data challenges Rice Statistics 33 / 43

Page 34: Structured Data Challenges in Finance and Statistics

Anatomy of GroupBy

The GroupBy object is also iterable

group_means = {}

for group_name, group in grouped:

group_means[group_name] = grouped.mean()

Wes McKinney () Structured data challenges Rice Statistics 34 / 43

Page 35: Structured Data Challenges in Finance and Statistics

GroupBy and hierarchical indexing

Hierarchical indexing came about as the natural result of a multi-keyaggregation:

>>> group_means = df.groupby(['country', 'agefrom']).mean()>>> group_means[['ls', 'lsc', 'pop']].unstack('country')

ls lsc pop country Australia Austria Australia Austria Australia Austriaagefrom 15 70.03 31.1 26.17 14.67 6163 3310 20 58.02 59.98 34.51 45.83 1113 531 25 57.02 45.5 33.02 32.73 5021 2791 30 59.16 46.56 35.29 33.87 1082 527 35 59.58 43.29 34.53 30.3 1053 528.8 40 58.8 40.98 33.92 27.88 1005 522.5 45 56.79 39.19 31.71 26 927.2 503.5 50 54.71 37.14 30.37 23.94 836.6 475.2 55 51.93 34.82 26.98 22.11 735.8 443.2 60 49.92 32.35 25.85 20.2 632.6 408.8 65 47.02 29.59 22.98 18.44 522.5 361.9 70 46.27 28.52 22.53 17.72 410.5 295.8 75 46.28 28.06 23.41 18.42 624.5 437.6

Wes McKinney () Structured data challenges Rice Statistics 35 / 43

Page 36: Structured Data Challenges in Finance and Statistics

What makes GroupBy hard?

factor-izing the group labels is very expensive

Function call overhead on small groups

To sort or not to sort?

Cheaper than computing the group labels!

Munging together results in exceptional cases is tricky

Wes McKinney () Structured data challenges Rice Statistics 36 / 43

Page 37: Structured Data Challenges in Finance and Statistics

Time series operations

Fixed frequency indexing (DateRange)

Domain-specific time offsets (like business day, business month end)

Frequency conversion

Forward-filling / back-filling / interpolation

Leading/lagging

In the works (later this year), better/faster upsampling/downsampling

Wes McKinney () Structured data challenges Rice Statistics 37 / 43

Page 38: Structured Data Challenges in Finance and Statistics

Shoot-out with R’s time series objects

“Inner join” addition between two irregular 500K-length time seriessampled from 1 million-length time series. Timestamps are POSIXct (R) /int64 (Python)

package timing factor

pandas 21.5 ms 1.0xts 41.3 ms 1.92fts 370 ms 17.2its 1117 ms 51.95

zoo 3479 ms 161.8

Where there is smoke, there is fire?Iron: Macbook Pro Core i7 with R 2.13.1, pandas 0.5.1git / Python 2.7.2

Wes McKinney () Structured data challenges Rice Statistics 38 / 43

Page 39: Structured Data Challenges in Finance and Statistics

Erm, inner join?

Intersecting time stamps loses information

Wes McKinney () Structured data challenges Rice Statistics 39 / 43

Page 40: Structured Data Challenges in Finance and Statistics

My (mild) performance obsession

Wes McKinney () Structured data challenges Rice Statistics 40 / 43

Page 41: Structured Data Challenges in Finance and Statistics

My (mild) performance obsession

Good performance stems from many factors

Well-designed algorithms (from a complexity standpoint)Minimizing copying of dataMinimizing interpreter / function call overheadTaking advantage of memory layout

I value functionality over performance, but I do spend a lot of timeprofiling code and grokking performance characteristics

Wes McKinney () Structured data challenges Rice Statistics 41 / 43

Page 42: Structured Data Challenges in Finance and Statistics

Demo, time permitting

Wes McKinney () Structured data challenges Rice Statistics 42 / 43