Scipy 2011 Time Series Analysis in Python

29
Time Series Analysis in Python with statsmodels Wes McKinney 1 Josef Perktold 2 Skipper Seabold 3 1 Department of Statistical Science Duke University 2 Department of Economics University of North Carolina at Chapel Hill 3 Department of Economics American University 10 th Python in Science Conference, 13 July 2011 McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 1 / 29

Transcript of Scipy 2011 Time Series Analysis in Python

Page 1: Scipy 2011 Time Series Analysis in Python

Time Series Analysis in Python with statsmodels

Wes McKinney1 Josef Perktold2 Skipper Seabold3

1Department of Statistical ScienceDuke University

2Department of EconomicsUniversity of North Carolina at Chapel Hill

3Department of EconomicsAmerican University

10th Python in Science Conference, 13 July 2011

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 1 / 29

Page 2: Scipy 2011 Time Series Analysis in Python

What is statsmodels?

A library for statistical modeling, implementing standard statisticalmodels in Python using NumPy and SciPy

Includes:

Linear (regression) models of many formsDescriptive statisticsStatistical testsTime series analysis...and much more

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 2 / 29

Page 3: Scipy 2011 Time Series Analysis in Python

What is Time Series Analysis?

Statistical modeling of time-ordered data observations

Inferring structure, forecasting and simulation, and testingdistributional assumptions about the data

Modeling dynamic relationships among multiple time series

Broad applications e.g. in economics, finance, neuroscience, signalprocessing...

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 3 / 29

Page 4: Scipy 2011 Time Series Analysis in Python

Talk Overview

Brief update on statsmodels development

Aside: user interface and data structures

Descriptive statistics and tests

Auto-regressive moving average models (ARMA)

Vector autoregression (VAR) models

Filtering tools (Hodrick-Prescott and others)

Near future: Bayesian dynamic linear models (DLMs), ARCH /GARCH volatility models and beyond

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 4 / 29

Page 5: Scipy 2011 Time Series Analysis in Python

Statsmodels development update

We’re now on GitHub! Join us:

http://github.com/statsmodels/statsmodels

Check out the slick Sphinx docs:

http://statsmodels.sourceforge.net

Development focus has been largely computational, i.e. writingcorrect, tested implementations of all the common classes ofstatistical models

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 5 / 29

Page 6: Scipy 2011 Time Series Analysis in Python

Statsmodels development update

Major work to be done on providing a nice integrated user interface

We must work together to close the gap between R and Python!

Some important areas:

Formula framework, for specifying model design matricesNeed integrated rich statistical data structures (pandas)Data visualization of results should always be a few keystrokes awayWrite a “Statsmodels for R users” guide

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 6 / 29

Page 7: Scipy 2011 Time Series Analysis in Python

Aside: statistical data structures and user interface

While I have a captive audience...

Controversial fact: pandas is the only Python library currentlyproviding data structures matching (and in many places exceeding)the richness of R’s data structures (for statistics)

Let’s have a BoF session so I can justify this statement

Feedback I hear is that end users find the fragmented, incohesive setof Python tools for data analysis and statistics to be confusing,frustrating, and certainly not compelling them to use Python...

(Not to mention the packaging headaches)

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 7 / 29

Page 8: Scipy 2011 Time Series Analysis in Python

Aside: statistical data structures and user interface

We need to “commit” ASAP (not 12 months from now) to a highlevel data structure(s) as the “primary data structure(s) for statisticaldata analysis” and communicate that clearly to end users

Or we might as well all start programming in R...

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 8 / 29

Page 9: Scipy 2011 Time Series Analysis in Python

Example data: EEG trace data

0500

10001500

20002500

30003500

4000600

500

400

300

200

100

0

100

200

300

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 9 / 29

Page 10: Scipy 2011 Time Series Analysis in Python

Example data: Macroeconomic data

3.0

3.5

4.0

4.5

5.0

5.5

cpi

4.55.05.56.06.57.07.5

m1

19601964

19681972

19761980

19841988

19921996

20002004

2008

8.0

8.5

9.0

9.5realgdp

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 10 / 29

Page 11: Scipy 2011 Time Series Analysis in Python

Example data: Stock data

20012002

20032004

20052006

20072008

20090

100

200

300

400

500

600

700

800

AAPLGOOGMSFTYHOO

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 11 / 29

Page 12: Scipy 2011 Time Series Analysis in Python

Descriptive statistics

Autocorrelation, partial autocorrelation plots

Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q)models

acf = tsa.acf(eeg , 50)

pacf = tsa.pacf(eeg , 50)

0 10 20 30 40 501.0

0.5

0.0

0.5

1.0Autocorrelation

0 10 20 30 40 501.0

0.5

0.0

0.5

1.0Partial Autocorrelation

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 12 / 29

Page 13: Scipy 2011 Time Series Analysis in Python

Statistical tests

Ljung-Box test for zero autocorrelation

Unit root test for cointegration (Augmented Dickey-Fuller test)

Granger-causality

Whiteness (iid-ness) and normality

See our conference paper (when the proceedings get published!)

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 13 / 29

Page 14: Scipy 2011 Time Series Analysis in Python

Autoregressive moving average (ARMA) models

One of most common univariate time series models:

yt = µ+ a1yt−1 + ...+ akyt−p + εt + b1εt−1 + ...+ bqεt−q

where E (εt , εs) = 0, for t 6= s and εt ∼ N (0, σ2)

Exact log-likelihood can be evaluated via the Kalman filter, but the“conditional” likelihood is easier and commonly used

statsmodels has tools for simulating ARMA processes with knowncoefficients ai , bi and also estimation given specified lag orders

import scikits.statsmodels.tsa.arima_process as ap

ar_coef = [1, .75, -.25]; ma_coef = [1, -.5]

nobs = 100

y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)

y += 4 # add in constant

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 14 / 29

Page 15: Scipy 2011 Time Series Analysis in Python

ARMA Estimation

Several likelihood-based estimators implemented (see docs)

model = tsa.ARMA(y)

result = model.fit(order=(2, 1), trend=’c’,

method=’css-mle’, disp=-1)

result.params

# array([ 3.97, -0.97, -0.05, -0.13])

Standard model diagnostics, standard errors, information criteria(AIC, BIC, ...), etc available in the returned ARMAResults object

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 15 / 29

Page 16: Scipy 2011 Time Series Analysis in Python

Vector Autoregression (VAR) models

Widely used model for modeling multiple (K -variate) time series,especially in macroeconomics:

Yt = A1Yt−1 + . . .+ ApYt−p + εt , εt ∼ N (0,Σ)

Matrices Ai are K × K .

Yt must be a stationary process (sometimes achieved bydifferencing). Related class of models (VECM) for modelingnonstationary (including cointegrated) processes

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 16 / 29

Page 17: Scipy 2011 Time Series Analysis in Python

Vector Autoregression (VAR) models

>>> model = VAR(data); model.select_order(8)

VAR Order Selection

=====================================================

aic bic fpe hqic

-----------------------------------------------------

0 -27.83 -27.78 8.214e-13 -27.81

1 -28.77 -28.57 3.189e-13 -28.69

2 -29.00 -28.64* 2.556e-13 -28.85

3 -29.10 -28.60 2.304e-13 -28.90*

4 -29.09 -28.43 2.330e-13 -28.82

5 -29.13 -28.33 2.228e-13 -28.81

6 -29.14* -28.18 2.213e-13* -28.75

7 -29.07 -27.96 2.387e-13 -28.62

=====================================================

* Minimum

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 17 / 29

Page 18: Scipy 2011 Time Series Analysis in Python

Vector Autoregression (VAR) models

>>> result = model.fit(2)

>>> result.summary() # print summary for each variable

<snip>

Results for equation m1

====================================================

coefficient std. error t-stat prob

----------------------------------------------------

const 0.004968 0.001850 2.685 0.008

L1.m1 0.363636 0.071307 5.100 0.000

L1.realgdp -0.077460 0.092975 -0.833 0.406

L1.cpi -0.052387 0.128161 -0.409 0.683

L2.m1 0.250589 0.072050 3.478 0.001

L2.realgdp -0.085874 0.092032 -0.933 0.352

L2.cpi 0.169803 0.128376 1.323 0.188

====================================================

<snip>

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 18 / 29

Page 19: Scipy 2011 Time Series Analysis in Python

Vector Autoregression (VAR) models

>>> result = model.fit(2)

>>> result.summary() # print summary for each variable

<snip>

Correlation matrix of residuals

m1 realgdp cpi

m1 1.000000 -0.055690 -0.297494

realgdp -0.055690 1.000000 0.115597

cpi -0.297494 0.115597 1.000000

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 19 / 29

Page 20: Scipy 2011 Time Series Analysis in Python

VAR: Impulse Response analysis

Analyze systematic impact of unit “shock” to a single variable

irf = result.irf(10)

irf.plot()

0 2 4 6 8 100.2

0.0

0.2

0.4

0.6

0.8

1.0m1→m1

0 2 4 6 8 100.4

0.3

0.2

0.1

0.0

0.1

0.2realgdp→m1

0 2 4 6 8 100.40.30.20.10.00.10.20.30.4

cpi→m1

0 2 4 6 8 100.15

0.10

0.05

0.00

0.05

0.10

0.15

0.20m1→realgdp

0 2 4 6 8 100.2

0.0

0.2

0.4

0.6

0.8

1.0realgdp→realgdp

0 2 4 6 8 100.4

0.3

0.2

0.1

0.0

0.1

0.2cpi→realgdp

0 2 4 6 8 100.10

0.05

0.00

0.05

0.10

0.15

0.20m1→cpi

0 2 4 6 8 100.15

0.10

0.05

0.00

0.05

0.10

0.15realgdp→cpi

0 2 4 6 8 100.0

0.2

0.4

0.6

0.8

1.0cpi→cpi

Impulse responses

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 20 / 29

Page 21: Scipy 2011 Time Series Analysis in Python

VAR: Forecast Error Variance Decomposition

Analyze contribution of each variable to forecasting error

fevd = result.fevd(20)

fevd.plot()

0 5 10 15 200.0

0.2

0.4

0.6

0.8

1.0m1

0 5 10 15 200.0

0.2

0.4

0.6

0.8

1.0

1.2realgdp

0 5 10 15 200.0

0.2

0.4

0.6

0.8

1.0

1.2cpi

Forecast error variance decomposition (FEVD) m1realgdpcpi

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 21 / 29

Page 22: Scipy 2011 Time Series Analysis in Python

VAR: Statistical tests

In [137]: result.test_causality(’m1’, [’cpi’, ’realgdp’])

Granger causality f-test

=========================================================

Test statistic Critical Value p-value df

---------------------------------------------------------

1.248787 2.387325 0.289 (4, 579)

=========================================================

H_0: [’cpi’, ’realgdp’] do not Granger-cause m1

Conclusion: fail to reject H_0 at 5.00% significance level

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 22 / 29

Page 23: Scipy 2011 Time Series Analysis in Python

Filtering

Hodrick-Prescott (HP) filter separates a time series yt into a trend τtand a cyclical component ζt , so that yt = τt + ζt .

19621966

19701974

19781982

19861990

19941998

20022006

4

2

0

2

4

6

8

10

12

14

InflationCyclical componentTrend component

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 23 / 29

Page 24: Scipy 2011 Time Series Analysis in Python

Filtering

In addition to the HP filter, 2 other filters popular in finance andeconomics, Baxter-King and Christiano-Fitzgerald, are available

We refer you to our paper and the documentation for details on these:

1966

1971

1976

1981

1986

1991

1996

2001

2006

4

2

0

2

4

Inflation and Unemployment: BK Filtered

INFLUNEMP

1963

1968

1973

1978

1983

1988

1993

1998

2003

2008

4

2

0

2

4

Inflation and Unemployment: CF Filtered

INFLUNEMP

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 24 / 29

Page 25: Scipy 2011 Time Series Analysis in Python

Preview: Bayesian dynamic linear models (DLM)

A state space model by another name:

yt = F ′tθt + νt , νt ∼ N (0,Vt)

θt = Gθt−1 + ωt , ωt ∼ N (0,Wt)

Estimation of basic model by Kalman filter recursions. Provideselegant way to do time-varying linear regressions for forecasting

Extensions: multivariate DLMs, stochastic volatility (SV) models,MCMC-based posterior sampling, mixtures of DLMs

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 25 / 29

Page 26: Scipy 2011 Time Series Analysis in Python

Preview: DLM Example (Constant+Trend model)

model = Polynomial(2)

dlm = DLM(close_px[’AAPL’], model.F, G=model.G, # model

m0=m0, C0=C0, n0=n0, s0=s0, # priors

state_discount=.95) # discount factor

Nov 2008

Jan 2009

Mar 2009

May 2009

Jul 2009

Sep 2009

Nov 2009

50

100

150

200

Constant + Trend DLM

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 26 / 29

Page 27: Scipy 2011 Time Series Analysis in Python

Preview: Stochastic volatility models

0 200 400 600 800 10000.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6JPY-USD Exchange Rate Volatility Process

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 27 / 29

Page 28: Scipy 2011 Time Series Analysis in Python

Future: sandbox and beyond

ARCH / GARCH models for volatility

Structural VAR and error correction models (ECM) for cointegratedprocesses

Models with non-normally distributed errors

Better data description, visualization, and interactive research tools

More sophisticated Bayesian time series models

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 28 / 29

Page 29: Scipy 2011 Time Series Analysis in Python

Conclusions

We’ve implemented many foundational models for time seriesanalysis, but the field is very broad

User interface can and should be much improved

Repo: http://github.com/statsmodels/statsmodels

Docs: http://statsmodels.sourceforge.net

Contact: [email protected]

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 29 / 29