Python for Financial Data Analysis with pandas


Transcript of Python for Financial Data Analysis with pandas

Page 1: Python for Financial Data Analysis with pandas

Financial data analysis in Python with pandas

Wes McKinney (@wesmckinn)

10/17/2011

@wesmckinn () Data analysis with pandas 10/17/2011 1 / 22

Page 2: Python for Financial Data Analysis with pandas

My background

3 years as a quant hacker at AQR, now consultant / entrepreneur

Math and statistics background with the zest of computer science

Active in scientific Python community

My blog: http://blog.wesmckinney.com

Twitter: @wesmckinn


Page 3: Python for Financial Data Analysis with pandas

Bare essentials for financial research

Fast time series functionality

Easy data alignment

Date/time handling

Moving window statistics

Resampling / frequency conversion

Fast data access (SQL databases, flat files, etc.)

Data visualization (plotting)

Statistical models

Linear regression

Time series models: ARMA, VAR, ...
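The time series essentials listed above (date handling, moving window statistics, frequency conversion) might look roughly like this in pandas. This is a minimal sketch: the prices are made up, and the calls use the current pandas API, which postdates this talk.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices for ten business days (made-up numbers).
dates = pd.date_range("2011-10-03", periods=10, freq="B")
prices = pd.Series(np.arange(100.0, 110.0), index=dates)

# Moving window statistic: 3-day rolling mean (first two values are NaN).
rolling_mean = prices.rolling(window=3).mean()

# Frequency conversion: last observed price in each calendar week.
weekly_last = prices.resample("W").last()

# Date handling: select a date range by label.
first_week = prices.loc["2011-10-03":"2011-10-07"]

print(rolling_mean)
print(weekly_last)
```

Because the index carries the dates, none of these steps needs manual bookkeeping of row positions.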


Page 4: Python for Financial Data Analysis with pandas

Would be nice to have

Portfolio and risk analytics, backtesting

Easy enough to write yourself, though most people do a bad job of it

Portfolio optimization

Most financial firms use a 3rd party library anyway

Derivative pricing

Can use QuantLib in most languages


Page 5: Python for Financial Data Analysis with pandas

What are financial firms using?

HFT: a C++ and hardware arms race, a different topic

Research

Mainstream: R, MATLAB, Python, ...

Econometrics: Stata, eViews, RATS, etc.

Non-programmatic environments: ClariFI, Palantir, ...

Production

Popular: Java, C#, C++

Less popular, but growing: Python

Fringe: Functional languages (OCaml, Haskell, F#)


Page 6: Python for Financial Data Analysis with pandas

What are financial firms using?

Many hybrid-language environments (e.g. Java/R, C++/R, C++/MATLAB, Python/C++)

Which is the main implementation language?

If the main language is Java/C++, the result is lower productivity and a higher cost of prototyping new functionality

Trends

Banks and hedge funds are realizing that Java-based production systems can be replaced with 20% as much Python code (or less)

MATLAB is increasingly being ditched in favor of Python

R and Python use for research is generally growing


Page 7: Python for Financial Data Analysis with pandas

Python language

Simple, expressive syntax

Designed for readability, like “runnable pseudocode”

Easy-to-use, powerful built-in types and data structures:

Lists and tuples (fixed-size, immutable lists)

Dicts (hash maps / associative arrays) and sets

Everything’s an object, including functions

“There should be one, and preferably only one way to do it”

“Batteries included”: great general purpose standard library
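A few of the points above in code form. This is only an illustrative sketch: the tickers, numbers and the `spread` helper are made up for the example.

```python
# Tuples are immutable, fixed-size sequences; lists are mutable.
point = (1.5, 2.5)
prices = [101.2, 99.8, 100.5]
prices.append(102.1)

# Dicts map keys to values; sets hold unique items.
positions = {"AAPL": 100, "MSFT": -50}
tickers = set(positions) | {"GOOG"}


# Functions are objects: they can be stored in a dict and passed around.
def spread(values):
    """Hypothetical helper: range between largest and smallest value."""
    return max(values) - min(values)


stats = {"max": max, "min": min, "spread": spread}
summary = {name: func(prices) for name, func in stats.items()}
print(summary)
```

Storing `max`, `min` and `spread` side by side in a dict works because built-in and user-defined functions are ordinary objects.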


Page 8: Python for Financial Data Analysis with pandas

A simple example: quicksort

Pseudocode from Wikipedia:

function qsort(array)
    if length(array) < 2
        return array
    var list less, greater
    select and remove a pivot value pivot from array
    for each x in array
        if x < pivot then append x to less
        else append x to greater
    return concat(qsort(less), pivot, qsort(greater))


Page 9: Python for Financial Data Analysis with pandas

A simple example: quicksort

First try Python implementation:

def qsort(array):
    if len(array) < 2:
        return array
    less, greater = [], []
    pivot, rest = array[0], array[1:]
    for x in rest:
        if x < pivot:
            less.append(x)
        else:
            greater.append(x)
    return qsort(less) + [pivot] + qsort(greater)


Page 10: Python for Financial Data Analysis with pandas

A simple example: quicksort

Use list comprehensions:

def qsort(array):
    if len(array) < 2:
        return array
    pivot, rest = array[0], array[1:]
    less = [x for x in rest if x < pivot]
    greater = [x for x in rest if x >= pivot]
    return qsort(less) + [pivot] + qsort(greater)


Page 11: Python for Financial Data Analysis with pandas

A simple example: quicksort

Heck, fit it onto one line!

qs = lambda r: (r if len(r) < 2
                else (qs([x for x in r[1:] if x < r[0]])
                      + [r[0]]
                      + qs([x for x in r[1:] if x >= r[0]])))

Though that’s starting to look like Lisp code...


Page 12: Python for Financial Data Analysis with pandas

A simple example: quicksort

A quicksort using NumPy arrays

import numpy as np

def qsort(array):
    if len(array) < 2:
        return array
    pivot, rest = array[0], array[1:]
    # Boolean masks select the elements on each side of the pivot
    less = rest[rest < pivot]
    greater = rest[rest >= pivot]
    return np.r_[qsort(less), [pivot], qsort(greater)]

Of course no need for this when you can just do:

sorted_array = np.sort(array)


Page 13: Python for Financial Data Analysis with pandas

Python: drunk with power

This comic has way too much airtime but:


Page 14: Python for Financial Data Analysis with pandas

Staples of Python for science: MINS

(M) matplotlib: plotting and data visualization

(I) IPython: rich interactive computing and development environment

(N) NumPy: multi-dimensional arrays, linear algebra, FFTs, random number generation, etc.

(S) SciPy: optimization, probability distributions, signal processing, ODEs, sparse matrices, ...


Page 15: Python for Financial Data Analysis with pandas

Why did Python become popular in science?

NumPy traces its roots to 1995

Extremely easy to integrate C/C++/Fortran code

Access fast low level algorithms in a high level, interpreted language

The language itself

“It fits in your head”

“It [Python] doesn’t get in my way” - Robert Kern

Python is good at all the things other scientific programming languages are not good at (e.g. networking, string processing, OOP)

Liberal BSD-style licenses: can use Python for commercial applications


Page 16: Python for Financial Data Analysis with pandas

Some exciting stuff in the last few years

Cython

“Augmented” Python language with type declarations, for generating compiled extensions

C-like speedups with Python-like development time

IPython: enhanced interactive Python interpreter

The best research and software development env for Python

An integrated parallel / distributed computing backend

GUI console with inline plotting and a rich HTML notebook (more on this later)

PyCUDA / PyOpenCL: GPU computing in Python

Transformed Python overnight into one of the best languages for doing GPU computing


Page 17: Python for Financial Data Analysis with pandas

Where has Python historically been weak?

Rich data structures for data analysis and statistics

NumPy arrays, while powerful, feel distinctly “lower level” if you’re used to R’s data.frame

pandas has filled this gap over the last 2 years

Statistics libraries

Nowhere near the depth of R’s CRAN repository

statsmodels provides tested implementations of a lot of standard regression and time series models

Turns out that most financial data analysis requires only fairly elementary statistical models
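As an illustration of how elementary such a model can be, a one-factor regression of asset returns on market returns can be sketched with plain NumPy least squares. statsmodels would be the more usual choice; NumPy is swapped in here only to keep the sketch self-contained, and the return series are made up.

```python
import numpy as np

# Made-up return series: asset = 0.001 + 1.5 * market, with no noise,
# so the fitted coefficients should recover these values.
market = np.array([0.01, -0.02, 0.015, 0.005, -0.01])
asset = 0.001 + 1.5 * market

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(market), market])
coef, residuals, rank, singular_values = np.linalg.lstsq(X, asset, rcond=None)
alpha, beta = coef

print(alpha, beta)
```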


Page 18: Python for Financial Data Analysis with pandas

pandas library

Began building at AQR in 2008, open-sourced late 2009

Why

R / MATLAB, while good for research / data analysis, are not suitable implementation languages for large-scale production systems

(I personally don’t care for them for data analysis)

Existing data structures for time series in R / MATLAB were too limited / not flexible enough for my needs

Core idea: indexed data structures capable of storing heterogeneous data

Etymology: panel data structures


Page 19: Python for Financial Data Analysis with pandas

pandas in a nutshell

A clean axis indexing design to support fast data alignment, lookups, hierarchical indexing, and more

High-performance data structures

Series/TimeSeries: 1D labeled vector

DataFrame: 2D spreadsheet-like structure

Panel: 3D labeled array, collection of DataFrames

SQL-like functionality: GroupBy, joining/merging, etc.

Missing data handling

Time series functionality
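A rough sketch of the DataFrame, GroupBy, join and missing-data features listed above. The tables and values are hypothetical, the calls follow the current pandas API rather than the 2011 one, and only Series and DataFrame are used.

```python
import numpy as np
import pandas as pd

# Hypothetical trades table (made-up data); columns hold different dtypes.
trades = pd.DataFrame({
    "ticker": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "shares": [100, 50, 200, 75],
    "price": [400.0, 27.0, np.nan, 26.5],  # one missing price
})

# Missing data handling: count the gaps, then forward-fill within each ticker.
n_missing = trades["price"].isna().sum()
trades["price"] = trades.groupby("ticker")["price"].ffill()

# GroupBy: total shares traded per ticker.
shares_by_ticker = trades.groupby("ticker")["shares"].sum()

# SQL-like left join against a second, hypothetical table.
sectors = pd.DataFrame({
    "ticker": ["AAPL", "MSFT"],
    "sector": ["Hardware", "Software"],
})
merged = trades.merge(sectors, on="ticker", how="left")

print(shares_by_ticker)
print(merged)
```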


Page 20: Python for Financial Data Analysis with pandas

pandas design philosophy

“Think outside the matrix”: stop thinking about shape and start thinking about indexes

Indexing and data alignment are essential

Fault-tolerance: save you from common blunders caused by coding errors (specifically misaligned data)

Lift the best features of other data analysis environments (R, MATLAB, Stata, etc.) and make them better, faster

Performance and usability equally important


Page 21: Python for Financial Data Analysis with pandas

The pandas killer feature: indexing

Each axis has an index

Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data

Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data

Natural way of expressing “group by” and join-type operations

Better integrated and more flexible indexing than anything available in R or MATLAB
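Automatic alignment and hierarchical indexing can be sketched as follows. The labels and values are invented for illustration, and the calls follow the current pandas API.

```python
import pandas as pd

# Two hypothetical series with overlapping but differently ordered labels.
a = pd.Series([1.0, 2.0, 3.0], index=["AAPL", "GOOG", "MSFT"])
b = pd.Series([10.0, 20.0, 30.0], index=["MSFT", "AAPL", "IBM"])

# Addition aligns on labels, not positions; unmatched labels become NaN.
total = a + b

# Hierarchical index: (ticker, year) pairs on a single axis.
idx = pd.MultiIndex.from_product(
    [["AAPL", "MSFT"], [2010, 2011]], names=["ticker", "year"]
)
returns = pd.Series([0.1, 0.2, 0.3, 0.4], index=idx)

# Select one ticker, or express a "group by" over an index level.
aapl = returns.loc["AAPL"]
mean_by_year = returns.groupby(level="year").mean()

print(total)
print(mean_by_year)
```

Here `a + b` pairs AAPL with AAPL and MSFT with MSFT regardless of position, and the labels present in only one series surface as NaN rather than being silently combined.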


Page 22: Python for Financial Data Analysis with pandas

Tutorial time

To the IPython console!
