
STA3303
Statistics for Climate Research
Faculty of Sciences

Study Book

Written by

Dr Peter Dunn
Department of Mathematics & Computing
Faculty of Sciences
The University of Southern Queensland

Published by

University of Southern Queensland
Toowoomba Queensland 4350
Australia

http://www.usq.edu.au

© The University of Southern Queensland, 2007.

Copyrighted materials reproduced herein are used under the provisions of the Copyright Act 1968 as amended, or as a result of application to the copyright owner.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without prior permission.

Produced using LaTeX in the USQ style by the Department of Mathematics and Computing.

© USQ, February 21, 2007

Table of Contents

I Time Series Analysis

1 Introduction
2 Autoregressive (AR) models
3 Moving Average (MA) models
4 arma Models
5 Finding a Model
6 Diagnostic Tests
7 Non-Stationary Models
8 Markov chains
9 Other Models

II Multivariate Statistics

10 Introduction
11 Principal Components Analysis
12 Factor Analysis
13 Cluster Analysis

A Installing other packages in R
B Review of statistical rules
C Some time series tricks in R
D Time series functions in R
E Multivariate analysis functions in R

Strand I

Time Series Analysis

Module 1

Introduction

Module contents

1.1 Introduction
1.2 Time-series
  1.2.1 Definitions
  1.2.2 Purpose
  1.2.3 Notation
1.3 Signal and noise
1.4 Simple methods
1.5 Software
  1.5.1 The r package
  1.5.2 Getting help in r
1.6 Exercises
  1.6.1 Answers to selected Exercises

Module objectives

Upon completion of this module students should be able to:

• recognize and define a time series;

• understand what defines a stationary time series;

• know the particular kinds of time series being discussed in this course;

• recognise the reasons for finding statistical models for time series;

• understand the notation used to designate a time series;

• understand that a time series consists of a signal plus noise;

• understand that the signal of a time series can be modelled and that the noise is random;

• list some simple time series modelling methods;

• know how to use the software package r to do basic manipulations with time series data, including loading data, plotting the data, and defining the data as time series data.

1.1 Introduction

This Module introduces time series and associated terminology. Some simple methods are discussed for analysing time series, and the software used in the course is also introduced.

1.2 Time-series

1.2.1 Definitions

A time series is a sequence of observations ordered by time. Examples include the noon temperature measured daily at the Oakey airport, the annual sales of passenger cars in Australia, monthly average values of the southern oscillation index (SOI), the number of people receiving unemployment benefits in Queensland each month, and the number of bits of information sent through a computer line per second. In each case, the observations are taken at regular time intervals. This is not necessary, but greatly simplifies the mathematics; we will only be concerned with time series where observations are taken at regular intervals (that is, equally spaced: each month, each day or each year, for example). In this course, the emphasis is on climatological applications; however, time series are used in many branches of science and engineering, and are particularly common in business (sales forecasts, share markets and so on).


A time series is interesting because the series is a function of past values of itself, and so the series is somewhat predictable. The task of the scientist is to find out more about that relationship between observations. Unlike most statistics, the observations in a time series are not independent (that is, they are dependent). Time series are usually plotted using a time-plot, as in the next example.

Example 1.1: The monthly Southern Oscillation Index (the SOI) is available for approximately the last 130 years. A plot of the monthly average SOI (Fig. 1.1) has time on the horizontal axis, and the SOI on the vertical axis. Generally, the observations are joined with a line to indicate that the points are given in a particular order. (Note the horizontal line at zero was added by me, and is not part of the default plot.)

Example 1.2: The seasonal SOI can also be examined. This series certainly does not consist of independent observations. The seasonal SOI can be plotted against the SOI for the previous season, the season before that, and so on (Fig. 1.2).

There is a reasonably strong relationship between the seasonal SOI and the previous season. The relationship between the SOI and the season before that is still obvious; it is less obvious (but still present) with three seasons previous. There is basically no relationship between the seasonal SOI and the SOI four seasons previous.

A stationary time series is a time series whose statistics do not change over time. Such statistics are typically the mean and the variance (and the covariance, discussed in Sect. 2.5.3). Initially, only stationary time series are considered in this course. In Module 7, methods are discussed for modelling non-stationary time series and for identifying non-stationary time series. At present, identify a non-stationary time series simply using a time series plot of the data, as shown in the next Example.

Example 1.3:

Consider the annual rainfall near Wendover, Utah, USA. (These data are considered in more detail in Example 7.1.) A plot of the data (Fig. 1.3, bottom panel) suggests a non-stationary mean (the mean goes up and down a little). To check this, a smoothing filter was applied

Figure 1.1: A time-plot of the monthly average SOI. Top: the SOI from 1876 to 2001; Bottom: the SOI since 1980 showing more detail. (In this example, the SOI has been plotted using las=1; this just makes the labels on the vertical axis easier to read in my opinion, but is not necessary.)

Figure 1.2: The seasonal SOI plotted against previous values of the SOI. (Panels show the SOI at time t against the SOI one, two, three and four seasons previous.)

that computed the mean of each set of six observations at a time. This smooth gave the thick, dark line in the bottom panel of Fig. 1.3, and suggests that the mean is perhaps non-stationary, as this line is not (approximately) constant. However, it is not too bad. The middle panel of Fig. 1.3 shows a series that is definitely non-stationary. This series—the average monthly sea-level at Darwin—is not stationary as the mean obviously fluctuates. However, the SOI from 1876 to 2001, plotted in the top panel of Fig. 1.3 (and seen in Example 1.1), is approximately stationary.

All the time series considered in this part of the course will be equally spaced (or regular). These are time series recorded at regular intervals—every day, year, month, second, etc. Until Module 8, the time series are all considered for continuous data. In addition, only stationary time series will be considered initially (until Module 7).

1.2.2 Purpose

There are two main purposes of gathering time series:

Figure 1.3: Stationary and non-stationary time series. Bottom: the annual rainfall near Wendover, Utah, USA (in mm), plotted with a thin line; the smoothed data (thick line) indicates that the mean is perhaps non-stationary. Middle: the monthly average sea level (in metres) in Darwin, Australia; the data are definitely not stationary, as the mean fluctuates. Top: the average monthly SOI from 1876 to 2001; this series looks approximately stationary.


1. First, it helps us understand the process underlying the observations.

2. Secondly, data is gathered to predict, or forecast, what may happen next. It is of great interest in climatology, for example, to predict the value of seasonal climatic indicators. In business, it is important to be able to predict future sales of products.

Forecasting is the process of estimating future values of numerical parameters on the basis of the past. To do this, a model is created. This model is an artificial equation that captures the important features of the data.

Example 1.4: Consider the average monthly sea level (in metres) in Darwin, Australia (Fig. 1.3, middle panel).

Any useful model for this time series would need to capture the important features of this time series. What are the important features? One obvious feature is that the series has a cyclic pattern: the average sea level rises and falls on a regular basis. Is there also an indication that the average sea level has been rising since about 1994? Any good model should capture these important features of the data. As noted in the previous Example, the series is not stationary.

Methods for modelling and forecasting time series are well established and rigorous and are sometimes quite accurate, but keep in mind the following:

• Any forecast is only as good as the information it is based on. It is not possible for a good method of forecasting to make up for lack of information, or inaccurate information, about the process being forecasted.

• Some processes may be impossible to forecast with any useful accuracy (for example, future outcomes of a coin tossing experiment).

• Some processes are usefully forecast by means of complex, expensive methods—for example, daily regional weather forecasting.

1.2.3 Notation

Consider a sequence of numbers {Xn} = {X1, X2, . . . , XN}, ordered by time, so that Xa comes before Xb if a is less than b; that is, {Xn} is a time series. This notation indicates that the time series measures the variable X (which may be monthly rainfall, water temperatures or snowfall depths, for example). The subscript indicates particular observations in the series. Hence, X1 is the first observation, the first recorded in the data. (Note that Y, W or some other letter may be used in place of X.)

The notation Xt (or Xn, or similar) is used to indicate the value of the time series X at a particular point in time t. For different values of t, values of the time series at different points in time are indicated. That is, Xt+1 refers to the next term in the series following Xt.

The entire series is usually written {Xn} for n ≥ 1, indicating that the variable X is a time sequence of numbers. Sometimes, the upper and lower limits are specified explicitly, as in {Xn} for n = 1, . . . , 1000. Quite often, the notation is abbreviated, so that {Xn} on its own denotes the whole series.

1.3 Signal and noise

The observed and recorded time series, say {Xn}, consists of two components:

1. The signal. This is the component of the data that contains information, say {Sn}. This is the component of the time series that can be forecast.

2. The noise. This is the randomness that is observed, which may be due to numerous other variables affecting the signal, measurement imperfections, etc. Because the noise is random, it cannot be forecast.

The task of the scientist is to extract the signal (or information) from the time series in the presence of noise. There is no way of knowing exactly what the signal is; instead, statistical methods are used to separate the random noise from the forecastable signal. There are many methods for doing this; in this course, one of those methods will be studied in detail: the Box–Jenkins method. Some other simple models are discussed in Sect. 1.4; more complex methods are discussed in Module 9.

Example 1.5: Consider the monthly Pacific Decadal Oscillation, or PDO (obtained from monthly Sea-Surface Temperature (SST) anomalies in the North Pacific Ocean). The data from January 1980 to December 2000 (Fig. 1.4, top panel) is non-stationary. The data consist of a signal and noise. One way to extract the signal is to use a smoother.


Figure 1.4: The monthly Pacific Decadal Oscillation (PDO) from Jan 1980 to Dec 2000 in the top plot. Middle: a lowess smooth is shown superimposed over the PDO. Bottom: the noise is shown (observations minus signal).


A lowess smoother can be applied to the data. (Many statisticians would probably not identify a smoother as a statistical model; in fact, I am one of them. But the use of a smoother here demonstrates a point.) The details are not important—it is simply one type of smoother. For one set of parameters, the smooth is shown in Fig. 1.4 (middle panel). The smoother captures the important features of the time series, and ignores the random noise. The noise is shown in Fig. 1.4 (bottom panel), and if the smooth is good, should be random. (In this example, the noise does not appear random, and so the model is probably not very good.)

One difficulty with using smoothers is that they have limited use for forecasting into the future, as the fitted smoother applies only to the given data. Consequently, other methods are considered here.
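As a concrete illustration, here is a minimal sketch of this signal-plus-noise decomposition using lowess. The PDO data file is not named in the text, so the series x below is made up (a slow cycle plus random noise); only the pattern of the commands matters.

> # A sketch only: 'x' is a made-up signal-plus-noise series,
> # standing in for the PDO data, which is not named in the text.
> t <- 1:252
> x <- sin(2 * pi * t/120) + rnorm(252, sd = 0.5)
> smooth <- lowess(t, x, f = 1/4)  # the smooth estimates the signal
> plot(t, x, type = "l")
> lines(smooth, lwd = 3)           # signal superimposed on the data
> noise <- x - smooth$y            # noise = observations minus signal
> plot(t, noise, type = "l")       # should look random if the smooth is good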

1.4 Simple methods

Many methods exist for modelling time series. These notes concentrate only on the Box–Jenkins method, though some other methods will be discussed very briefly at the end of the course.

It is very important to use the appropriate forecasting technique for each particular application, however. The Box–Jenkins technique is of general applicability, and has been used in many applications. In addition, studying the Box–Jenkins method will enable the student to learn other techniques as appropriate: the language, basic techniques and skills are applicable to other methods also.

In this section, a variety of simple methods for forecasting are first discussed. Importantly, in some situations they are also the best method available. If this is the case, it may not be obvious—it might require some careful statistical analysis to show that a simple model is the best model.

Constant estimation The simplest possible approach is to use a constant forecast for all future values of the time series. This is appropriate when successive values of the time series are completely uncorrelated but do come from the same distribution.

Slope estimation If the time series appears to have a linear trend, it may be appropriate to estimate this trend by fitting a straight line by linear regression. Future values can then be forecast by extrapolating this line.


Random walk model In some cases, the best estimate of a future value is the most recent observation. This model is called a random walk model. For example, the best forecast of the future price of a share is usually quite close to the present price.

Smoothing Smoothing is the name given to a collection of techniques which estimate future values of a time series by an average of past values. This approach makes sense when there are random variations added onto a relatively stable trend in the process under study. An example has been seen in Example 1.5.

Regression Another method of forecasting is to relate the parameter under study to some known parameter, or parameters, by means of a functional relationship which is statistically estimated using regression. (A short sketch of some of these methods in r follows.)
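To make these methods concrete, the sketch below applies three of them to a small made-up annual series (the data values are invented purely for illustration).

> # A sketch of three simple methods, on a made-up annual series.
> y <- ts(c(2.1, 2.6, 2.4, 3.0, 3.3, 3.2, 3.8, 4.1), start = 1990)
> # Constant estimation: forecast every future value by the mean.
> mean(y)
> # Slope estimation: fit a straight line and extrapolate it.
> tt <- as.numeric(time(y))
> trend <- lm(as.numeric(y) ~ tt)
> predict(trend, newdata = data.frame(tt = 1998))  # one year ahead
> # Random walk model: the forecast is the most recent observation.
> y[length(y)]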

1.5 Software

Most standard statistical software packages—such as SPSS, SAS, r and S-Plus—can analyse time series data. In addition, many mathematical packages (such as Matlab) can be used, but sometimes require add-ons which usually cost money.

1.5.1 The R package

This course uses the free software package r. r is a free, open source software project which is "not unlike" S-Plus, an expensive commercial software package. r is available for many operating systems from http://cran.r-project.org/, or http://mirror.aarnet.edu.au/pub/CRAN/ for residents of Australia and New Zealand. More information about r, including documentation, is found at http://www.r-project.org/. r is command line driven like Matlab, but has a statistical rather than mathematical focus.

r is object orientated. This means to get the most benefit from r, objects should be correctly defined. For example, time series data should be declared as time series data. When r knows that a particular data set is a time series, it has default mechanisms for working with the data. For example, plotting data in r generally produces a dot-plot; if the data is declared as time series data, the data are joined by lines, which is the standard way of plotting time series data. The following example explains some of these details.


In r, you can set the working directory using (for example) setwd("c:/My Documents/USQ/STA3303/data"). Check the current working directory using getwd(). It is usually sensible to set this working directory to the location of your data files as soon as you start r. This will be assumed throughout these study notes.

Example 1.6: In Example 1.1, the monthly average SOI was plotted. Assuming the current folder (or directory) is set as described above, the following code reproduces this plot.

> soidata <- read.table("soiphases.dat", header = TRUE)

The data is loaded using read.table. The option header=TRUE means that the first row of the data contained header information (that is, names for the variables).

An alternative method for loading the data directly from the internet is:

> soidata <- read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/climatology/soiphases.dat",

+ header = TRUE)

Now, take a quick look at the variables:

> summary(soidata)

      year          month             soi
 Min.   :1876   Min.   : 1.000   Min.   :-38.8000
 1st Qu.:1907   1st Qu.: 3.000   1st Qu.: -6.6000
 Median :1939   Median : 6.000   Median :  0.3000
 Mean   :1939   Mean   : 6.493   Mean   : -0.1514
 3rd Qu.:1970   3rd Qu.: 9.000   3rd Qu.:  6.7500
 Max.   :2002   Max.   :12.000   Max.   : 33.1000
    soiphase
 Min.   :0.000
 1st Qu.:2.000
 Median :3.000
 Mean   :3.148
 3rd Qu.:5.000
 Max.   :5.000

> soidata[1:5, ]


  year month  soi soiphase
1 1876     1 10.8        2
2 1876     2 10.6        2
3 1876     3 -0.7        3
4 1876     4  7.9        4
5 1876     5  6.9        2

> names(soidata)

[1] "year" "month" "soi" "soiphase"

This shows the dataset (or object) soidata consists of four different variables. The one of interest now is soi, and this variable is referred to (and accessed) as soidata$soi. To use this variable, first declare it as a time series object:

> SOI <- ts(soidata$soi, start = c(1876, 1),

+ end = c(2002, 2), frequency = 12)

The first argument is the name of the variable. The input start indicates the time when the data starts. For the SOI data, the data starts at January 1876, which is input to r as c(1876, 1) (the one means January, the first month). The command c means 'concatenate', or join together. The data set ends at February 2002; if an end is not defined, r should be able to deduce it anyway from the rest of the given information. But make sure you check your time series to ensure r has interpreted the input correctly. The argument frequency indicates that the data have a cycle of twelve (that is, each twelve points make one larger grouping—here twelve months make one year).

Now plot the data:

> plot(SOI, las = 1)

> abline(h = 0)

The plot (Fig. 1.5, top panel) is formatted correctly for time series data. (The command abline(h=0) adds a horizontal line at y = 0.) In contrast, if the data is not declared as time series data, the default plot appears as the bottom panel in Fig. 1.5.

When the data are declared as a time series, the observations are plotted and joined by lines, and the horizontal axis is labelled Time by default (the axis label is easily changed using the command title(ylab="New y-axis label")).

Other methods also have a standard default if the data have been declared as a time series object.

Figure 1.5: A plot of the monthly average SOI from 1876 to 2001. Top: the data has been declared as a time series; Bottom: the data has not been declared as a time series.


In the above example, the data was available in a file. If it is not available, a data file can be created, or the data can be entered into r. The following commands show the general approach to entering data in r. The command c is very useful: it is used to create a list of numbers, and stands for 'concatenate'.

> data.values <- c(12, 14, 1, 8, 9, 10, 7)

> data <- ts(data.values, start = 1980)

The first line puts the observations into a list called data.values. The second line designates the data as a time series starting in 1980 (and so r assumes the values are annual measurements).

You can also use scan(); see ?scan. Data stops being read when a blank line is entered if you use scan.
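For completeness, here is the same data entered with scan; the text= argument is used so the sketch is self-contained, but typing the values at the prompt (finishing with a blank line) works the same way.

> # Entering the same data with scan(); scan(text = ...) reads the
> # values from a string rather than from the keyboard.
> data.values <- scan(text = "12 14 1 8 9 10 7")
> data <- ts(data.values, start = 1980)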

Other commands will be introduced as appropriate throughout the course. A full list of the time series functions available in r is given in Appendix D.

1.5.2 Getting help in R

Two commands of particular interest are the commands help and help.search. The help command gives help on a particular topic. For example, try typing help("names") or help("plot") at the r command prompt. (The quotes are necessary.) A short-cut is also available: typing ?names is equivalent to typing help("names"). Using the short-cut is generally more convenient.

The command help.search searches the help database for particular words. For example, try typing help.search("eigen") to find how to evaluate eigenvalues in r. (The quotes are necessary.) This function requires a reasonably specific search phrase. The command help.start starts the r help in a web browser (if everything is configured correctly).

Further help and information is available at http://stat.ethz.ch/R/manual/doc/html/, including a Web-based manual, An Introduction to r. After starting r, look under the Help menu for available documentation.


1.6 Exercises

Ex. 1.7: Start r and load in the data file qbo.dat. This data file is a time series of the monthly quasi-biennial oscillation (QBO) from January 1948 to December 2001.

(a) Examine the variables in the data set using names.

(b) Declare the QBO as a time series, setting the start, end and frequency parameters correctly.

(c) Plot the data.

(d) Is the data stationary? Explain.

(e) Determine the mean and variance of the series.

(f) List important features in the data (if any) that should be modelled.

Ex. 1.8: Start r and load in the data file easterslp.dat. This data file is a time series of sea-level air pressure anomalies at Easter Island from Jan 1951 to Dec 1995.

(a) Examine the variables in the data set using names.

(b) Declare the air pressures as a time series, setting the start, end and frequency parameters correctly.

(c) Plot the data.

(d) List important features in the data (if any) that should be modelled.

Ex. 1.9: Obtain the maximum temperature from your town or residence for as far back as possible up to, say, thirty days. This may be obtained from a newspaper or website.

(a) Load the data into r.

(b) Declare the series a time series, and plot the data.

(c) List important features in the data (if any) that should be modelled.

(d) Compute the mean and variance of the series.

Ex. 1.10: The data in Table 1.1 shows the mean annual levels at Lake Victoria Nyanza from 1902 to 1921, relative to a fixed reference point (units are not given). The data are from Shaw [41], as quoted in Hand [19].

(a) Enter the data into r as a time series.


Year  Level    Year  Level
1902   -10     1912   -11
1903    13     1913    -3
1904    18     1914    -2
1905    15     1915     4
1906    29     1916    15
1907    21     1917    35
1908    10     1918    27
1909     8     1919     8
1910     1     1920     3
1911    -7     1921    -5

Table 1.1: The mean annual level of Lake Victoria Nyanza from 1902 to 1921 relative to some fixed level (units are unknown).

(b) Plot the data. Make sure you give appropriate labels.

(c) List important features in the data (if any) that should be modelled.

Ex. 1.11: Many people believe that sunspots affect the climate on the earth. The mean number of sunspots from 1770 to 1869 for each year are given in the data file sunspots.dat and are shown in Table 1.2. (The data are from Izenman [23] and Box & Jenkins [9, p 530], as quoted in Hand [19].)

(a) Enter the data into r as a time series by loading the data file sunspots.dat.

(b) Plot the data. Make sure you give appropriate labels.

(c) List important features in the data (if any) that should be modelled.

1.6.1 Answers to selected Exercises

1.7 (a) Here is one solution:

> qbo <- read.table("qbo.dat", header = TRUE)

> names(qbo)

[1] "Year" "Month" "QBO"

(b) One option is:

> qbo <- ts(qbo$QBO, start = c(qbo$Year[1],

+ 1), frequency = 12)


Year Sunspots   Year Sunspots   Year Sunspots
1770      101   1804       48   1838      103
1771       82   1805       42   1839       86
1772       66   1806       28   1840       63
1773       35   1807       10   1841       37
1774       31   1808        8   1842       24
1775        7   1809        2   1843       11
1776       20   1810        0   1844       15
1777       92   1811        1   1845       40
1778      154   1812        5   1846       62
1779      125   1813       12   1847       98
1780       85   1814       14   1848      124
1781       68   1815       35   1849       96
1782       38   1816       46   1850       66
1783       23   1817       41   1851       64
1784       10   1818       30   1852       54
1785       24   1819       24   1853       39
1786       83   1820       16   1854       21
1787      132   1821        7   1855        7
1788      131   1822        4   1856        4
1789      118   1823        2   1857       23
1790       90   1824        8   1858       55
1791       67   1825       17   1859       94
1792       60   1826       36   1860       96
1793       47   1827       50   1861       77
1794       41   1828       62   1862       59
1795       21   1829       67   1863       44
1796       16   1830       71   1864       47
1797        6   1831       48   1865       30
1798        4   1832       28   1866       16
1799        7   1833        8   1867        7
1800       14   1834       13   1868       37
1801       34   1835       57   1869       74
1802       45   1836      122
1803       43   1837      138

Table 1.2: The annual sunspot numbers from 1770 to 1869.


Figure 1.6: The QBO from January 1948 to December 2001.

Here the square brackets [ . . . ] have been used; they are used by r to indicate elements of an array or matrix. (Matlab, for example, uses round brackets: ( . . . ).) Note that start must have numeric inputs, so qbo$Month[1] will not work, as it returns Jan, which is a text string.

It is worth printing out qbo to ensure that r has interpreted your statements correctly. Type qbo at the prompt, and in particular check that the series ends in December 2001.

(c) The following code plots the graph:

> plot(qbo, las = 1, xlab = "Time", ylab = "Quasi-bienniel oscillation",

+ main = "QBO from 1948 to 2001")

The final plot is shown in Fig. 1.6.

1.10 Here is one way of doing the problem. (Note: The data can be entered using scan or by typing the data into a data file and loading the usual way. Here, we assume the data is available as the object llevel.)

> llevel <- ts(llevel, start = c(1902))

> plot(llevel, las = 1, xlab = "Time", ylab = "Level of Lake Victoria Nyanza",

+ main = "The (relative) Level of Lake Nyanza from 1902 to 1921")


Figure 1.7: The mean annual level of Lake Victoria Nyanza from 1902 to 1921. The figures are relative to some fixed level and units are unknown.

The final plot is shown in Fig. 1.7. There is too little data to be sure of any patterns or features to be modelled, but the series suggests there may be some regular up-and-down pattern.

Module 2

Autoregressive (AR) models

Module contents

2.1 Introduction
2.2 Definition
2.3 Forecasting ar models
  2.3.1 Notation
  2.3.2 Forecasting
2.4 The backshift operator
  2.4.1 Definition
2.5 Statistics
  2.5.1 The mean
  2.5.2 The variance
  2.5.3 Covariance and correlation
  2.5.4 Autocovariance and autocorrelation
2.6 More on stationarity
2.7 Summary
2.8 Exercises
  2.8.1 Answers to selected Exercises


Module objectives

Upon completion of this module students should be able to:

• understand what is meant by an autoregressive (ar) model;

• use the ar(p) notation to define ar models;

• use ar models to develop forecasting formulae;

• understand the operation of the backshift operator;

• write ar models using the backshift operator;

• compute the mean of a time series written in ar form;

• understand that the variance is not easily computed from the ar form of a model;

• understand the concepts of autocorrelation and autocovariance;

• understand the term 'lag' used in the context of autocorrelation;

• compute the autocorrelation function (acf) for an ar model;

• know that the acf will always be one at a lag of zero;

• understand that the acf for a lower-order ar model will decay slowly toward zero.

2.1 Introduction

In this Module, one particular type of time series model—an autoregressive model—is discussed. Subsequent Modules examine other types of models.

2.2 Definition

As stated previously, the observations in a time series are somehow related to past values of the series, and the task of the scientist is to find out more about that relationship.

Recall that a time series consists of two components: the signal, or information; and the noise, or random error. If values in a time series are related to past values of the series, one possible model for the signal St is for the series Xt to be expressed as a function of previous values of X. This is exactly the idea behind an autoregressive model, denoted an ar model. An ar model is one particular type of model in the Box–Jenkins methodology.


Example 2.1: Consider the model Wn+1 = 3.12 + 0.63Wn + en+1 for n ≥ 0. This model is an ar(1) model, since Wn+1 is a function of only one past value of the series {Wn}. In this model, the information or signal is Sn+1 = 3.12 + 0.63Wn and the noise is en+1.

Example 2.2: An example of an ar(3) model is Tn = 0.9Tn−1 − 0.4Tn−2 + 0.1Tn−3 + en for n ≥ 1. In this model, the information or signal is Sn = 0.9Tn−1 − 0.4Tn−2 + 0.1Tn−3 and the noise is en.

A more formal definition of an autoregressive process follows.

Definition 2.1 An autoregressive model of order p, or an ar(p) model, satisfies the equation

Xn = m′ + en + Σ (k = 1 to p) φk Xn−k
   = m′ + en + φ1 Xn−1 + φ2 Xn−2 + · · · + φp Xn−p    (2.1)

for n ≥ 0, where {en} (for n ≥ 0) is a series of independent, identically distributed (iid) random variables, and m′ is some constant.

The letter p denotes the order of the autoregressive model, defining how many previous values the current value is related to. The model is called autoregressive because the series is regressed onto past values of itself.

The error term {en} in Equation (2.1) refers to the noise in the time series. Above, the errors were said to be iid. Commonly, they are also assumed to have a normal distribution with mean zero and variance σ_e².

For the model in Equation (2.1) to be of use in practice, the scientist must be able to estimate the value of p (that is, how many terms are needed in the ar model), and then estimate the values of φk and m′. Each of these issues will be addressed in later sections.

Notice the subscripts are defined so that the first value of the series to appear on the left of the equation is always one. Now consider the ar(3) model in Example 2.2: when n = 1 (for the first observation in the time series), the equation reads

T1 = 0.9T0 − 0.4T−1 + 0.1T−2 + e1.

But the series {T} only exists for positive indices. This means that the model does not apply for the first three terms in the series, because the data T0, T−1 and T−2 are unavailable.


Example 2.3: Using r, it is easy to simulate an ar model. For Example 2.1, the following r code simulates the series:

> noise <- rnorm(100, 0, 1)

> W <- array(dim = length(noise))

> W[1] <- 0

> for (i in 2:length(noise)) {

+ W[i] <- 3.12 + 0.63 * W[i - 1] + noise[i]

+ }

> plot(W, type = "l", las = 1)

Note type="l" means to use lines, not points (meaning it is an "ell", not a numeral "one").

More directly, a time series can be simulated using arima.sim as follows:

> sim.ar1 <- arima.sim(model = list(ar = c(0.63)),

+ n = 100)
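One detail worth noting: arima.sim simulates a series with mean zero, so sim.ar1 is centred on zero rather than on the mean of the model above. A shift puts it on the same scale (the mean 3.12/(1 − 0.63) = 8.43 is derived in Example 2.10):

> # arima.sim gives a zero-mean series; add the model mean to
> # match the loop-based simulation above.
> sim.ar1.shifted <- sim.ar1 + 3.12/(1 - 0.63)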

Figure 2.1: One realization of the ar(1) model Wn+1 = 3.12 + 0.63Wn + en+1.

The final plot is shown in Fig. 2.1. The data created in r are called a realization of the model. Every realization will be different, since each will be based on a different set of random {e}. The first few values are not typical, as the model cannot be used for the first observation (when n = 0 in the ar(1) model in Example 2.1, W0 does not exist); it takes a few terms before the effect of this is out of the system.

Example 2.4: Chu & Katz [13] studied the seasonal SOI time series {Xt} from January 1935 to August 1983 (that is, the average SOI for (northern hemisphere) Summer, Spring, etc.), and concluded the data was well modelled using the ar(3) model

Xt = 0.6885Xt−1 + 0.2460Xt−2 − 0.3497Xt−3 + et.

An ar(3) model was alluded to in Example 1.2 (Fig. 1.2).

2.3 Forecasting AR models

One purpose of having models for time series data is to make forecasts. In this section, ar models will be discussed. First, some notation is established.

2.3.1 Notation

Consider a time series {Xn}. Suppose the values of {Xn} are known from n = 1 to n = 100. Then the forecast of {Xn} at n = 101 is written as X̂101|100. The 'hat' indicates the quantity is a forecast, not an observed value of the series. The subscript implies the value of {Xn} is known up to n = 100, and the forecast is for the value at n = 101. This is called a one-step ahead forecast, since the forecast is one step ahead of the available data.

In general, the notation X̂n+k|n indicates the value of the time series {Xn} is to be forecast for time n + k assuming that the series is known up to time n. This forecast is a k-step ahead forecast. Note a k-step ahead forecast can be written in many ways: X̂n+k|n, X̂n|n−k and X̂n−2|n−k−2 are all k-step ahead forecasts.

Example 2.5: Consider the forecast Ŷt+3|t+1. This is a forecast of the time series {Yt} at time t + 3 if the time series is known to time t + 1. This is a two-step ahead forecast, since the forecast at t + 3 is two steps ahead of the available information, known up to time t + 1.


2.3.2 Forecasting

Forecasting using an ar model is quite simple. Consider the following ar(2) model:

Fn = 23 + 0.4Fn−1 − 0.2Fn−2 + en, (2.2)

where en has a normal distribution with a mean of zero and variance of σ_e² = 5; that is, en ∼ N(0, 5). Suppose a one-step ahead forecast is required if the information about the time series {Fn} is known up to time n; that is, F̂n+1|n is required.

The value of Fn+1, if we knew exactly what it was, is found from Equation (2.2) as

Fn+1 = 23 + 0.4Fn − 0.2Fn−1 + en+1    (2.3)

by adjusting the subscripts. Then, conditioning on what we actually 'know' and adding 'hats' to all the terms, the forecast will be

F̂n+1|n = 23 + 0.4F̂n|n − 0.2F̂n−1|n + ên+1|n.

Now, since information is known up to time n, the value of F̂n|n is known exactly: it's the value of F at time n, Fn. Likewise, F̂n−1|n = Fn−1. But what about the value of ên+1|n? It is not known at time n as it is a future random noise component. So what do we do with the ên+1|n term?

If we know nothing about the value of ên+1|n, a sensible approach would be to use the mean value of {en}, which is zero. Hence,

F̂n+1|n = 23 + 0.4Fn − 0.2Fn−1    (2.4)

is the forecast.

The difference between Fn+1 and F̂n+1|n determined from Equations (2.3) and (2.4) is

Fn+1 − F̂n+1|n = (23 + en+1 + 0.4Fn − 0.2Fn−1) − (23 + 0.4Fn − 0.2Fn−1) = en+1.

Hence, the error in making the forecast is en+1, and so the terms {en} are actually the one-step ahead forecasting errors.

The same approach can be used for k-step ahead forecasts also, as shown in the next example.


Example 2.6: Consider the ar(2) model in Equation (2.2). To determine the two-step ahead forecast, first find

Fn+2 = 23 + 0.4Fn+1 − 0.2Fn + en+2.

Hence

F̂n+2|n = 23 + 0.4F̂n+1|n − 0.2F̂n|n + ên+2|n.

Now, information is known up to time n, so F̂n|n = Fn. As before, ên+2|n is not known, so it is replaced by the mean value, which is zero. But what about F̂n+1|n? It is unknown, since information is only known up to time n, so information at time n + 1 is unknown. So what is the best estimate of F̂n+1|n?

Note that F̂n+1|n is simply a one-step ahead forecast itself, available from Equation (2.4). So the two-step ahead forecast here is

F̂n+2|n = 23 + 0.4F̂n+1|n − 0.2Fn,

where Equation (2.4) can be substituted for F̂n+1|n, but it is not necessary.
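The recursion in these examples is easily coded. Below is a minimal sketch for the ar(2) model in Equation (2.2); the function name and the example values are made up. Unknown future values are replaced by their own forecasts, and future noise terms by their mean of zero.

> # k-step ahead forecasts for the ar(2) model (2.2), computed
> # recursively; the known series is extended by its forecasts.
> forecast.ar2 <- function(Fvals, k) {
+     Fext <- as.numeric(Fvals)
+     n <- length(Fext)
+     for (i in 1:k) {
+         Fext <- c(Fext, 23 + 0.4 * Fext[n + i - 1] - 0.2 * Fext[n + i - 2])
+     }
+     Fext[(n + 1):(n + k)]
+ }
> forecast.ar2(c(28, 30, 27.5), k = 2)  # one- and two-step forecasts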

2.4 The backshift operator

This section introduces the backshift operator, a tool that enables complicated time series models to be written in a simple form, and also allows the models to be manipulated. A full appreciation of the value of the backshift operator will not become apparent until later, when the models considered become very complicated and cannot be written down in any other (practical) way (see, for example, Example 7.22).

2.4.1 Definition

The backshift operator, B, is defined on a time series as follows:

Definition 2.2 Consider a time series {Xt}. The backshift operator, B, is defined so that BXt = Xt−1.


Note the backshift operator can be used more than once, so that

B²Xt = B(BXt) = BXt−1 = Xt−2.

In general, applying B r times gives BʳXt = Xt−r.

The backshift operator allows ar models to be written in a different form, which will later prove very useful.

Note the backshift operator only operates on time series (otherwise it makes no sense to "shift backward" in time). This implies that Bk = k if k is a constant.

Example 2.7: Consider the ar(2) model

Yt+1 = 0.23Yt − 0.15Yt−1 + et+1.

Using the backshift operator notation, this model is written

Yt+1 − 0.23BYt+1 + 0.15B²Yt+1 = et+1, that is,

(1 − 0.23B + 0.15B²)Yt+1 = et+1.

Example 2.8: The ar(3) model

Xt = et − 0.4Xt−1 + 0.6Xt−2 − 0.1Xt−3

is written using the backshift operator as

φ(B)Xt = et

where φ(B) = (1 + 0.4B − 0.6B² + 0.1B³). The notation φ(B) is often used to denote an autoregressive polynomial in B.

2.5 Statistics

In this Section, the important statistics of an ar model are studied.


2.5.1 The mean

In Equation (2.1), the general form of an ar(p) model is given. Taking expected values of each term in this series gives

E[Xn] = E[m′] + E[en] + E[φ1Xn−1] + E[φ2Xn−2] + · · · + E[φpXn−p]
      = m′ + 0 + φ1E[Xn−1] + φ2E[Xn−2] + · · · + φpE[Xn−p],

since E[en] = 0 (the average error is zero). Now, assuming the time series {Xk} is stationary, the mean of this series will be approximately constant at any time (that is, for any subscript). Let this constant mean be µ. (It only makes sense to talk about the 'mean of a series' if the series is stationary.) Then,

µ = m′ + φ1µ + φ2µ + · · · + φpµ,

and so, on solving for µ,

µ = m′ / (1 − φ1 − φ2 − · · · − φp).

This enables the mean of the sequence to be computed from the ar model.

Example 2.9: In Equation (2.2), let the mean of the series be µ = E[F]. Taking expected values of each term,

µ = 23 + 0.4µ− 0.2µ + 0.

The mean of the series is µ = E[F ] = 23/0.8 = 28.75.

Example 2.10: Consider the ar(1) model of Example 2.3:

Wn+1 = 3.12 + 0.63Wn + en+1,

for n ≥ 0. Taking expectations, E[W] = 3.12/(1 − 0.63) ≈ 8.43. The plot of the simulated data in Fig. 2.1 confirms this.
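This is easy to verify by simulation, reusing the code of Example 2.3 with a longer realization; the first 50 observations are discarded since, as noted earlier, the start of a realization is not typical.

> # Verify the mean by simulation; the result should be near 8.43.
> noise <- rnorm(2000, 0, 1)
> W <- array(dim = length(noise))
> W[1] <- 0
> for (i in 2:length(noise)) {
+     W[i] <- 3.12 + 0.63 * W[i - 1] + noise[i]
+ }
> mean(W[-(1:50)])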

2.5.2 The variance

(It may be useful to refer to Appendix B while reading this section.)

Consider the ar(1) model

Yt = 12 + 0.5Yt−1 + et,

where {en} ∼ N(0, 4). First, write the model as

Yt − 0.5Yt−1 = 12 + et,

and then taking the variance of both sides gives

var[Yt] + (−0.5)²var[Yt−1] + 2(−0.5)Covar[Yt, Yt−1] = var[et],

since the errors {en} are assumed to be independent of the time series {Yn}. Since the series is assumed stationary, the variance is constant at all time steps; hence define σ_Y² = var[Yn]. Then,

1.25σ_Y² − Covar[Yt, Yt−1] = 4,

since var[en] = 4 in this example. This equation cannot be simplified and solved for σ_Y² unless there is some understanding of the covariance which characterizes the time series.

2.5.3 Covariance and correlation

The covariance is a measure of how two variables change together. For two random variables X (with mean µX and variance σ_X²) and Y (with mean µY and variance σ_Y²), the covariance is defined as

Covar[X, Y] = E[(X − µX)(Y − µY)].

Then, the correlation is

Corr[X, Y] = Covar[X, Y] / (σX σY).

A correlation of +1 indicates perfect positive correlation; a correlation of −1 indicates perfect negative correlation. A correlation of zero indicates no correlation at all between X and Y.

2.5.4 Autocovariance and autocorrelation

In the case of a time series, the autocovariance is defined between two points in the time series {Xn} (with mean µ), say Xi and Xj, as

κij = E[(Xi − µ)(Xj − µ)].

Since the time series is stationary, the autocovariance is the same if the time series is shifted in time. For example, consider Example 1.2, which includes


a plot of the SOI. If we were to split the SOI series into (say) five equal periods of time, and produce a plot like Fig. 1.2 (top panel) for each time period, the correlation would be similar for each time period.

This all means the important information about Xi and Xj is the time between the two observations (that is, |i − j|). Arbitrarily, Xi can then be set to X0, and hence the autocovariance can be written as

γk = Covar[X0, Xk]

for integer k. As with correlation, the autocorrelation is then defined as

ρk = γk / γ0

for integer k, where γ0 = Covar[X0, X0] is simply the variance of the time series.

The series {ρk} is known as the autocorrelation function, or acf, at lag k. For any given ar model, it is possible to determine the acf, which will be unique to that ar model. For this reason, the acf is one of the most important pieces of information to know about a time series. Later, the acf is used to determine which ar model is appropriate for our data.

The term lag indicates the time difference in the acf. Thus, "the acf at lag 2" means the term in the acf for k = 2, which is the correlation of any term in the series with the term two time steps before (or after, as the series is assumed stationary).

Note that since the autocorrelation is a series, the backshift operator can be used with the autocorrelation. It can be shown that the autocovariance for an ar(p) model is

γ(B) = σ_e² / (φ(B) φ(B⁻¹)).    (2.5)

Example 2.11: In Example 1.2, the seasonal SOI was plotted against the seasonal SOI for one, two, three and four seasons ago. In r, the correlation coefficients were computed as

> soi <- read.table("soiseason.dat", header = TRUE)

> attach(soi)

> len <- length(soi$SOI)

> lags <- 5

> SOI0 <- soi$SOI[lags:len]

> SOI1 <- soi$SOI[(lags - 1):(len - 1)]

> SOI2 <- soi$SOI[(lags - 2):(len - 2)]

> SOI3 <- soi$SOI[(lags - 3):(len - 3)]

> SOI4 <- soi$SOI[(lags - 4):(len - 4)]

> cor(cbind(SOI0, SOI1, SOI2, SOI3, SOI4))


            SOI0      SOI1      SOI2      SOI3        SOI4
SOI0 1.000000000 0.6319201 0.4098892 0.2001955 0.007600544
SOI1 0.631920149 1.0000000 0.6327576 0.4111551 0.201856306
SOI2 0.409889218 0.6327576 1.0000000 0.6336245 0.411991828
SOI3 0.200195528 0.4111551 0.6336245 1.0000000 0.634015609
SOI4 0.007600544 0.2018563 0.4119918 0.6340156 1.000000000

The correlations between the SOI and lagged values of the SOI can be written as the series of autocorrelations:

{ρ} = {1, 0.632, 0.41, 0.2, 0.0076}.
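As an aside, the built-in function acf computes these sample autocorrelations directly (this assumes soi has been loaded as above), avoiding the hand-built lagged copies:

> # The same autocorrelations, computed directly by acf().
> acf(soi$SOI, lag.max = 4, plot = FALSE)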

Example 2.12:

The ar(2) model

Ut+1 = 0.3Ut − 0.2Ut−1 + et+1 (2.6)

is written using the backshift operator as

φ(B)Ut+1 = et+1

where φ(B) = 1 − 0.3B + 0.2B². Suppose for the sake of example that σ_e² = 10. Then, since φ(B⁻¹) = 1 − 0.3B⁻¹ + 0.2B⁻², the autocovariance is

γ(B) = 10 / [(1 − 0.3B + 0.2B²)(1 − 0.3B⁻¹ + 0.2B⁻²)]
     = 10 / (0.2B⁻² − 0.36B⁻¹ + 1.13 − 0.36B + 0.2B²).

By some detailed mathematics (Sect. 3.6.3), this equals

γ(B) = · · · + 11.11 + 2.78B − 1.39B² − 0.97B³ − 0.0139B⁴ + 0.190B⁵ + · · · ,

only quoting the terms for the non-negative lags (recall that the autocovariance is symmetric). The terms in the autocovariance are therefore (quoting terms from the non-negative lags again):

{γ} = {γ0, γ1, γ2, . . .} = {11.11, 2.78, −1.39, −0.97, −0.0139, 0.190, 0.0598, . . .}.


The corresponding terms in the autocorrelation are found by dividing by γ0 = var[U] = 11.11, to give

{ρk} = {1, 0.25, −0.125, −0.0875, −0.00125, 0.017125, 0.0053875, . . .}.

The first term at lag zero always has an acf value of one (that is, each term is perfectly correlated with itself). It is usual to plot the acf (Fig. 2.2).

Figure 2.2: The acf for the ar(2) model in Equation (2.6).

The plot is typical of an ar(2) model: the terms in the acf decay slowly towards zero. Indeed, any low order ar model (such as ar(1), ar(2), ar(3), or similar) shows similar behaviour: a slow decay of the terms towards zero.
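The theoretical acf of an ar model can also be computed in r with the standard function ARMAacf; for the ar(2) model (2.6) it reproduces the values derived above (the lag-0 value is always one):

> # Theoretical acf of the ar(2) model (2.6); compare with {rho_k}.
> ARMAacf(ar = c(0.3, -0.2), lag.max = 6)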

2.6 More on stationarity

In an ar(1) model, it can be shown that the model is stationary only if |φ1| < 1, otherwise the model is non-stationary (Exercise 2.24).


For an ar(2) process to be stationary, the following conditions must be satisfied:

φ1 + φ2 < 1
φ2 − φ1 < 1
−1 < φ2 < 1

These inequalities define a triangular region in the (φ1, φ2) plane (Exercise 2.26).
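Equivalently (a standard result, stated here without proof), an ar model is stationary exactly when all roots of the autoregressive polynomial φ(B) lie outside the unit circle. This gives a quick numerical check in r:

> # Stationarity check via the roots of phi(B) for model (2.6):
> # phi(B) = 1 - 0.3B + 0.2B^2, coefficients in increasing order.
> phi <- c(0.3, -0.2)
> roots <- polyroot(c(1, -phi))
> all(Mod(roots) > 1)  # TRUE indicates a stationary model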

2.7 Summary

In this Module, autoregressive models, or ar models, were studied. Forecasting and the statistics of the models have been considered. In addition, the use of the backshift operator was studied.

2.8 Exercises

Ex. 2.13: Classify the following ar models (that is, state if they are ar(1), ar(4), etc.)

(a) Xn+1 = en+1 + 78.03− 0.56Xn − 0.23Xn−1 + 0.19Xn−2.

(b) Yn = 12.8− 0.22Yn−1 + en.

(c) Dt − 0.17Dt−1 + 0.18Dt−2 = et.

Ex. 2.14: Classify the following ar models (that is, state if they are ar(1), ar(4), etc.)

(a) Xn = en + 0.223Xn−1.

(b) At = 26.7 + 0.2At−1 − 0.2At−2 + et.

(c) Qt + 0.21Qt−1 + 0.034Qt−2 − 0.13Qt−3 = et.

Ex. 2.15: Determine the mean of each series in Exercise 2.13.

Ex. 2.16: Determine the mean of each series in Exercise 2.14.

Ex. 2.17: Write each of the models in Exercise 2.13 using the backshift operator.

Ex. 2.18: Write each of the models in Exercise 2.14 using the backshift operator.


Ex. 2.19: The time series {An} has a mean of 47.4. The following ar(2) model was fitted to the series:

An = m′ + 0.25An−1 + 0.17An−2 + en.

(a) Find the value of m′.

(b) Write the model using the backshift operator.

Ex. 2.20: The time series {Yn} has a mean of 12.26. The following ar(3) model was fitted to the series:

Yn = en + m′ − 0.31Yn−1 + 0.12Yn−2 − 0.10Yn−3.

(a) Find the value of m′.

(b) Write down formulae for forecasting the series one, two and three steps ahead.

Ex. 2.21: Yao [52] fits numerous ar models to model the total June rainfall (in mm) at Shanghai, {Yt}, from 1932 to 1950. One of the fitted models is

Yt = 309.70− 0.44Yt−1 − 0.29Yt−2 + et.

(a) Classify the ar model fitted to the series.

(b) Determine the mean of the series {Yt}.

(c) Write down formulae for forecasting the June rainfall in Shanghai one and two years ahead.

(d) Write the model using the backshift operator.

Ex. 2.22: In Guiot & Tessier [18], ar(3) models are fitted to the widths of tree rings. This is of interest as there is evidence that pollution may be affecting tree growth. Each observation in the series {Ct} is the average of 30 tree-ring widths from 1900 to 1941 of a species of conifer. Write down the general form of the model used to forecast tree-ring width.

Ex. 2.23: Woodward and Gray [51] use a number of models, including ar models, to study change in global temperature. One such ar model, given in the paper (their Table 2) for modelling the Intergovernmental Panel on Climate Change (IPCC) data series from 1968 to 1990, has the factor

(1 + 0.22B + 0.59B²)

when the model is written using backshift operators. Write out the model without using the backshift operator.


Ex. 2.24: Write a short piece of R-code to simulate the ar model Xt = φXt−1 + et where e ∼ N(0, 4) (see Example 2.3). Plot a simulated series of length 200 for each of the following eight values of φ: φ = −1.5, −1, −0.6, −0.2, 0, 0.5, 1, 1.5. Comment on your findings: what effect does the value of φ have on the stationarity of the series?

Ex. 2.25: Write a short piece of R-code to simulate the ar model Yn = 0.2Yn−1 + en where e ∼ N(0, σ_e²) (see Example 2.3). Plot a simulated series of length 200 for each of the following four values of σ_e²: σ_e² = 0.5, 1, 2, 4. Comment on your findings: what effect does changing the value of σ_e² have?

Ex. 2.26: The notes indicate that for an ar(2) process to be stationary, the following conditions must be satisfied:

φ1 + φ2 < 1
φ2 − φ1 < 1
−1 < φ2 < 1

These inequalities define a triangular region in the (φ1, φ2) plane. Draw this triangular region, and then write some R-code to simulate some ar(2) series with parameters in this region, and some with parameters outside this region. You should observe non-stationary time series when the parameters are outside this triangular region.

Ex. 2.27: Consider the time series {G}, for which the last three observations are: G67 = 40.3, G68 = 39.6, G69 = 50.1. A statistician has developed the ar(2) model

Gn = en − 0.3Gn−1 − 0.1Gn−2 + 63

for modelling the data.

(a) Determine the mean of the series {G}.

(b) Develop a forecasting formula for forecasting {G} one-, two- and three-steps ahead.

(c) Using the data above, compute numerical forecasts for G70|69, G71|69 and G72|69.

Ex. 2.28: Use r to generate a time series from the ar(1) model

Ft+1 = 12 + 0.3Ft + et+1 (2.7)

of length 300 (see Example 2.3 for a guideline).

(a) Compute the mean of {F} from Equation (2.7).


(b) Compute the mean of your R-generated time series, ignoring the first 50 observations. (It usually takes a little while for the simulations to stabilize; see Fig. 2.1.) Compare to your previous answer, and comment.

(c) Develop a forecasting formula for forecasting {F} one-, two- and three-steps ahead.

(d) Using your generated data set, compute numerical forecasts for the next three observations.

2.8.1 Answers to selected Exercises

2.13 The models are: ar(3), ar(1) and ar(2).

2.15 (a) Let µ = E[X] and take expectations of each term. This gives: µ = 0 + 78.03 − 0.56µ − 0.23µ + 0.19µ. Solving for µ shows that µ = E[X] ≈ 48.77.

(b) In a similar manner, E[Y ] = 10.49.

(c) E[D] = 0.

2.17 (a) (1 + 0.56B + 0.23B2 − 0.19B3)Xn+1 = 78.03 + en+1;

(b) (1 + 0.22B)Yn = 12.8 + en;

(c) (1− 0.17B + 0.18B2)Dt = et.

2.19 (a) Taking expectations shows that 0.58E[A] = m′. Since E[A] = 47.4, it follows that m′ = 27.492.

(b) (1− 0.25B − 0.17B2)An = en.

2.20 (a) Taking expectations, 1.29E[Y] = m′. Since E[Y] = 12.26, it follows that m′ = 15.8154.

(b) The one-step ahead forecast is Yn+1|n = 15.8154 − 0.31Yn + 0.12Yn−1 − 0.10Yn−2. The two-step ahead forecast is Yn+2|n = 15.8154 − 0.31Yn+1|n + 0.12Yn − 0.10Yn−1. The three-step ahead forecast is Yn+3|n = 15.8154 − 0.31Yn+2|n + 0.12Yn+1|n − 0.10Yn.

2.23 If Gt is the global temperature, one model is Gt = −0.22Gt−1 − 0.59Gt−2 + et.


Module 3
Moving Average (MA) models

Module contents
3.1 Introduction . . . . . . 42
3.2 Definition . . . . . . 42
3.3 The backshift operator . . . . . . 43
3.4 Forecasting ma models . . . . . . 44
    3.4.1 Forecasting . . . . . . 44
    3.4.2 Confidence intervals . . . . . . 45
    3.4.3 Forecasting difficulties with ma models . . . . . . 47
3.5 Statistics . . . . . . 47
    3.5.1 The mean . . . . . . 47
    3.5.2 The variance . . . . . . 48
    3.5.3 Autocovariance and autocorrelation . . . . . . 49
3.6 Why have different types of models? . . . . . . 50
    3.6.1 Two reasons . . . . . . 50
    3.6.2 Conversion of models . . . . . . 51
    3.6.3 The acf for ar models . . . . . . 53
3.7 Summary . . . . . . 55
3.8 Exercises . . . . . . 55
    3.8.1 Answers to selected Exercises . . . . . . 57



Module objectives

Upon completion of this module students should be able to:

� understand what is meant by a moving average (ma) model;

� use the ma(q) notation to define ma models;

� use ma models to develop forecasting formulae;

� use ma models to develop confidence intervals for forecasts;

� write ma models using the backshift operator;

� compute the mean and variance of a time series written in ma form;

� understand the need for both ar and ma models;

� convert ar models to ma models using appropriate methods;

� compute the autocorrelation function (acf) for an ma model;

� understand that the acf for an ma(q) model will have q non-zero terms (apart from the term at lag zero, which is always one).

3.1 Introduction

This Module introduces a second type of time series model: moving average models. Together with autoregressive models, they form the two basic models in the Box–Jenkins methodology.

3.2 Definition

Another type of (Box–Jenkins) time series model is a Moving Average model, or ma model. ar models imply the time series signal can be expressed as a linear function of previous values of the time series. The error (or noise) term in the equation, en, is the one-step ahead forecasting error.

In contrast, ma models imply the signal can be expressed as a function of previous forecasting errors. This is sensible: it suggests ma models make forecasts based on the errors made in the past, and so one can learn from those errors to improve later forecasts. (Colloquially, it means the model learns from its own mistakes!)

© USQ, February 21, 2007

Page 47: Study Book

3.3. The backshift operator 43

Example 3.1: Consider the model Xt = et − 0.3et−1 + 0.2et−2 − 0.15et−3. The signal or information is St = −0.3et−1 + 0.2et−2 − 0.15et−3. This is an ma(3) model, since the signal is based on three previous error terms. The term et is the error.

Example 3.2: An example of an ma(2) model is Wn+1 = 12 + 0.9en − 0.4en−1 + en+1.

A more formal definition of a moving average model follows.

Definition 3.1 A moving average model of order q, or an ma(q) model, is of the form

Xn = m + en + θ1en−1 + θ2en−2 + · · · + θqen−q (3.1)
   = m + en + ∑_{k=1}^{q} θk en−k (3.2)

for n ≥ 1, where θ1, . . . , θq are real numbers and m is a real number.

For the model in Equation (3.2) to be of use in practice, the scientist must be able to estimate the value of q (that is, how many terms are needed in the ma model), and then estimate the values of θk and m. Each of these issues will be addressed in later sections.

3.3 The backshift operator

The backshift operator can be used to write ma models in the same way as ar models. Consider the model in Example 3.1. Using backshift operators, this is written

Xt = (1 − 0.3B + 0.2B2 − 0.15B3)et
   = θ(B)et.


3.4 Forecasting MA models

3.4.1 Forecasting

The principles of forecasting were developed in Sect. 2.3.1 (it may be worth reading this section again) in the context of ar models. The same principles apply for ma models. Consider the following ma(2) model:

Rn = 12 + en − 0.3en−1 − 0.12en−2, (3.3)

where en has a normal distribution with a mean of zero and variance of σ2e = 3; that is, en ∼ N(0, 3). Suppose a one-step ahead forecast is required if the information about the time series {Rn} is known up to time n; that is, Rn+1|n is required.

Proceeding as before, first adjust the subscripts:

Rn+1 = 12 + en+1 − 0.3en − 0.12en−1;

then write

Rn+1|n = 12 + en+1|n − 0.3en|n − 0.12en−1|n.

Now, en|n and en−1|n are both known at time n to be en and en−1, but en+1 is not known at time n. So what do we use for the value of en+1? Again, if we have no other information, use the mean value of {en}, which is zero. So the forecast is

Rn+1|n = 12 − 0.3en − 0.12en−1. (3.4)

The same procedure is used for k-step ahead forecasts.

Example 3.3: A two-step ahead forecast for the ma(2) model in Equation (3.3) is found by first adjusting the subscripts:

Rn+2 = 12 + en+2 − 0.3en+1 − 0.12en,

and then writing

Rn+2|n = 12 + en+2|n − 0.3en+1|n − 0.12en|n.

Of the terms on the right, only en|n is known; the rest must be replaced by the mean value of zero. So the two-step ahead forecast is

Rn+2|n = 12− 0.12en.

The forecast for three steps ahead is

Rn+3|n = 12,

which is also the forecast for all further steps ahead.
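These hand calculations can be checked numerically in r. The following is a minimal sketch (the object names r.sim and fit are illustrative, not from the text): simulate a long series from Equation (3.3), fit an ma(2) model using arima, and request three forecasts with predict. The three-step ahead forecast should be close to the series mean of 12.

> set.seed(1)   # arbitrary seed, for reproducibility
> r.sim <- arima.sim(model = list(ma = c(-0.3, -0.12)), n = 1000, sd = sqrt(3)) + 12
> fit <- arima(r.sim, order = c(0, 0, 2))   # fit an ma(2) model with a mean term
> predict(fit, n.ahead = 3)$pred            # one-, two- and three-step ahead forecasts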


3.4.2 Confidence intervals

Consider the ma(2) model in Equation (3.3):

Rn = 12 + en − 0.3en−1 − 0.12en−2. (3.5)

For this equation, one- and two-step ahead forecasts were developed. The one-step ahead forecast is

Rn+1|n = 12− 0.3en − 0.12en−1.

The forecasting error is the difference between the value forecast, and the value actually observed. It is found as follows:

Rn+1 − Rn+1|n.

Now, even though Rn+1 is not known exactly, it can be expressed as

Rn+1 = 12 + en+1 − 0.3en − 0.12en−1,

from Equation (3.5). (The reason Rn+1 is not known exactly is that Rn+1 depends on the unknown random value of en+1, the error we are about to make when we make our forecast.) This means that the forecasting error is

Rn+1 − Rn+1|n = [12 + en+1 − 0.3en − 0.12en−1] − [12 − 0.3en − 0.12en−1]
             = en+1.

This tells us that the series {en} is actually just the series of one-step ahead forecasting errors. A confidence interval for the forecast of Rn+1 can also be formed. The actual error about to be made, en+1, is of course unknown. But this information can be used to develop confidence intervals for the forecast.

The variance of {en} can generally be estimated by computing all the previous forecasting errors (r computes these) and then computing the variance.

Suppose for the sake of example the variance of the errors is 5.8. Then the variance of the forecast is

var[Rn+1 − Rn+1|n] = var[en+1] = 5.8.

Then a 95% confidence interval for the one-step ahead forecast is

Rn+1|n ± z∗√(var[Rn+1 − Rn+1|n]) = Rn+1|n ± z∗√5.8


for the appropriate value of z∗. Generally, this is taken as 2 for a 95% confidence interval. (1.96 is more precise; t-values with an appropriate number of degrees of freedom are more precise still. In practice, however, the value of 2 is often used.) So the confidence interval for the forecast is approximately

Rn+1|n ± 2 × √5.8,

or Rn+1|n ± 4.82.

The same principles apply for other forecasts.

Example 3.4: In Example 3.3, the following two-step ahead forecast was obtained for Equation (3.5):

Rn+2|n = 12− 0.12en.

The actual value of Rn+2 is

Rn+2 = 12 + en+2 − 0.3en+1 − 0.12en,

so the forecasting error is

Rn+2 − Rn+2|n = [12 + en+2 − 0.3en+1 − 0.12en] − [12 − 0.12en]
             = en+2 − 0.3en+1.

The variance of the forecasting error is

var[en+2 − 0.3en+1] = var[en+2] + (−0.3)2var[en+1]
                   = 5.8 + (0.09 × 5.8) = 6.322.

The confidence interval becomes

Rn+2|n ± 2 × √6.322 = Rn+2|n ± 5.03.

The same principle is used for three-, four- and further steps ahead, when the confidence interval is

Rn+k|n ± 2√6.40552 = Rn+k|n ± 5.06

for k > 2. Notice that the confidence interval gets wider as we predict further ahead. This should be expected.
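Continuing the sketch from Sect. 3.4.1 (and assuming the fit object created there), r also reports the standard error of each forecast, from which approximate 95% confidence intervals follow directly:

> fc <- predict(fit, n.ahead = 3)   # forecasts and their standard errors
> fc$pred - 2 * fc$se               # approximate lower 95% limits
> fc$pred + 2 * fc$se               # approximate upper 95% limits

The standard errors (and hence the intervals) should widen as the forecast horizon increases, mirroring the hand calculations above.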


3.4.3 Forecasting difficulties with MA models

Consider the ma model Tn = en − 0.3en−1. The one-step ahead forecasting formula is

Tn+1|n = −0.3en.

Suppose we seek a forecast; the last three observations are: T8 = 4.6; T9 = −3.0; T10 = 0.1. Let's use the forecasting formula to produce a forecast for T11|10: we would use

T11|10 = −0.3e10.

So we need to know the one-step ahead forecasting error at n = 10; that is, e10. What is this forecasting error? We know the actual observed value at 10: it is T10 = 0.1. But to know the one-step ahead error in forecasting T10, we need to know T10|9. What is this value?

By the forecasting formula, it is computed using

T10|9 = −0.3e9.

And so we need the one-step ahead forecasting error for n = 9, which requires knowledge of T9|8. From the forecasting formula, we find this using

T9|8 = −0.3e8.

And so the cycle continues, right back to the start of the series.

In practice, we need to compute all the one-step ahead forecasting errors. r can compute these errors and produce predictions without having to worry about these difficulties in a real (data-driven) situation; see Sect. 5.4.
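For a fitted model, r stores the estimated one-step ahead forecasting errors as the residuals of the fit, so this recursion back to the start of the series is handled internally. A minimal sketch (again assuming the fit object from the earlier sketch):

> e <- residuals(fit)   # estimated one-step ahead forecasting errors {e_t}
> tail(e, 2)            # the most recent errors, as needed by a forecasting formula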

3.5 Statistics

In this Section, the important statistics of a model are found.

3.5.1 The mean

In Equation (3.2), the general form of an ma(q) model is given. Taking expected values of each term in this series gives

E[Xn] = E[m] + E[en] + E[θ1en−1] + E[θ2en−2] + · · · + E[θqen−q]
      = m,

since the average error is zero. Hence, for an ma model, the constant term m is actually the mean of the series {Xn}.


Example 3.5: In Equation (3.3), let the mean of the series be µ = E[R]. Then taking expected values of each term gives

µ = 12,

so that the mean of the series is µ = E[R] = 12. This should not be unexpected given the forecasts in Example 3.3.
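As with the variance check in Example 3.6 below, this can be confirmed numerically in r; a sketch (note arima.sim simulates a zero-mean series, so the mean m = 12 is added afterwards):

> set.seed(2)   # arbitrary seed
> mean(arima.sim(model = list(ma = c(-0.3, -0.12)), n = 10000, sd = sqrt(3)) + 12)

The result should be close to the theoretical mean of 12.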

3.5.2 The variance

The variance of a time series written in ma form is found by taking the variance of each term. Consider again Equation (3.3); taking the variance of each term gives

var[Rn] = var[12] + var[en] + (−0.3)2var[en−1] + (−0.12)2var[en−2],

where {en} ∼ N(0, 3), since the errors {en} are independent of the time series {Rn} and independent of each other. The constant 12 has zero variance, so this gives

var[Rn] = {1 + (−0.3)2 + (−0.12)2}var[en],

and so var[R] = 1.1044 × 3 = 3.3132. This approach can be applied to other ma models also.

Example 3.6: The above results can be checked numerically in r as follows (set.seed() sets the random number seed so these results are reproducible):

> set.seed(100)
> ma.sim <- arima.sim(model = list(ma = c(-0.3, -0.12)), n = 10000, sd = sqrt(3))
> var(ma.sim)

[1] 3.321068

> ma.sim <- arima.sim(model = list(ma = c(-0.3, -0.12)), n = 10000, sd = sqrt(3))
> var(ma.sim)

[1] 3.309557


3.5.3 Autocovariance and autocorrelation

The autocovariance for a time series is written, as shown earlier, as

γk = Covar[X0, Xk]

for integer k. The autocorrelation is then defined as

ρk = γk/γ0

for integer k, where γ0 = Covar[X0, X0] is simply the variance of the time series. The series {ρk} is the autocorrelation function, or acf. For any ma model, the acf can be computed, and it will be unique to that ma model. For this reason, the acf is one of the most important pieces of information that we can know about a time series. Later, the acf will be used to determine which ma model might be appropriate for our data.

Note that since the autocovariance is a series, it can be written using the backshift operator. It can be shown that the autocovariance for an ma(q) model is

γ(B) = θ(B)θ(B−1)σ2e.

Example 3.7: The ma(2) model Vn+1 = en+1 − 0.39en − 0.22en−1 can be written

Vn+1 = θ(B)en+1

where θ(B) = 1 − 0.39B − 0.22B2. Suppose for the sake of example that σ2e = 2. Then, since θ(B−1) = 1 − 0.39B−1 − 0.22B−2, the autocovariance is

γ(B) = 2(1 − 0.39B−1 − 0.22B−2)(1 − 0.39B − 0.22B2)
     = 2(−0.22B−2 − 0.3042B−1 + 1.2005 − 0.3042B − 0.22B2)
     = −0.44B−2 − 0.6084B−1 + 2.4010 − 0.6084B − 0.44B2.

The terms in the autocovariance are therefore (quoting only the terms for the non-negative lags, as the autocovariance is symmetric):

{γ} = {2.4010, −0.6084, −0.4400},

and so the corresponding terms in the autocorrelation are ρk = γk/γ0, where γ0 = 2.4010. Hence

{ρ} = {1, −0.253, −0.183}.

The first element of the autocorrelation is always one. It is usual to plot the acf (Fig. 3.1).


[Figure 3.1: The acf for the ma(2) model in Example 3.7; lag (0 to 7) on the horizontal axis, acf on the vertical axis.]

The plot is typical of an ma(2) model: there are two terms in the acf that are non-zero (apart from the term at a lag of zero, which is always one).

In general, the acf of an ma(q) model has q non-zero terms, excluding the term at lag zero which is always one.
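The stats function ARMAacf computes the theoretical acf of a model directly, and can be used to check the hand calculations in Example 3.7; a minimal sketch:

> ARMAacf(ma = c(-0.39, -0.22), lag.max = 4)   # theoretical acf for the ma(2) model

The values at lags 1 and 2 should match −0.253 and −0.183 as computed above, with zeros at higher lags.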

3.6 Why have different types of models?

Why are both ar and ma models necessary? ar models are far more popular in the literature than ma models, so why not just have ar models? There are two important reasons why both ma and ar models are necessary.

3.6.1 Two reasons

The first reason is that only ma models can be used to create confidence intervals on forecasts (Sect. 3.4.2). If an ar model is developed, it must be written as an ma model to produce confidence intervals for the forecasts.


Secondly, it is necessary to again recall one of the principles of statistical modelling: to find the simplest possible model that captures the important features of the data. In some applications, the only suitable ar model has a large number of parameters. In these situations, there will probably be an ma model that is almost identical in terms of forecasting ability, but has fewer parameters to estimate. In this case, the ma model would be preferred. In other applications, a simpler ar model will be preferred over a more complicated ma model.

3.6.2 Conversion of models

This discussion implies that it is possible to convert ar models into ma models, and ma models into ar models. This is indeed true, and the vehicle through which this is done is the backshift operator.

Consider an ar model, written using backshift notation as φ(B)Xn = en. If it is possible and sensible to divide by φ(B), express the model as

Xn = (1/φ(B)) en.

Denoting 1/φ(B) by θ(B) gives

Xn = θ(B)en,

which looks like an ma model. This is exactly the way models are converted from ar to ma.

Consider writing the ar(1) model Xn = 0.6Xn−1 + en as an ma model. There are three ways of proceeding. The first can only be used for ar(1) models, as it relies on a mathematical result relevant only to that case. The second approach is more difficult, but can be used for any ar model. The third approach uses r, and so is the easiest, but is of no use in the examination.

Using the first approach, write the model using the backshift operator as φ(B)Xn = en, where φ(B) = 1 − 0.6B. Then divide by φ(B) to obtain Xn = θ(B)en, where θ(B) = 1/φ(B). So,

θ(B) = 1/(1 − 0.6B). (3.6)

The mathematical result for the sum of a geometric series (1 + r + r2 + r3 + · · · = 1/(1 − r) if |r| < 1) is then used to obtain

θ(B) = 1/(1 − 0.6B) = 1 + 0.6B + (0.6)2B2 + (0.6)3B3 + · · · .


So, the corresponding ma model is the infinite ma model (written ma(∞))

Xn = en + 0.6en−1 + (0.6)2en−2 + (0.6)3en−3 + · · · .

This shows an ar(1) model has an equivalent ma(∞) form. Since both are equivalent, the simpler ar(1) form would be preferred, but the ma form is necessary for computing confidence intervals of forecasts.

In the second approach, start with Equation (3.6), and equate it to an unknown infinite sequence of θ's:

1/(1 − 0.6B) = 1 + θ1B + θ2B2 + · · · .

Then multiply both sides by 1 − 0.6B to get

1 = (1 − 0.6B)(1 + θ1B + θ2B2 + · · · )
  = 1 + B(θ1 − 0.6) + B2(θ2 − 0.6θ1) + · · · ,

and then equate the powers of B on both sides of the equation. For example, looking at the constants, there is a one on both sides. Looking at powers of B, there is zero on the left and θ1 − 0.6 on the right after multiplying out. Equating, we find that θ1 = 0.6 (as before). Then equating powers of B2, the left hand side has zero, and the right hand side has θ2 − 0.6θ1. Substituting θ1 = 0.6 and solving gives θ2 = (0.6)2 (as before). A general pattern emerges, giving the same result as before.

Remember the second method is used to convert any ar model into an ma model (and also any ma model into an ar model).

The third approach uses r. This is useful, but you will need to know the other methods for the examination. Naturally, the answers are the same as using the other two methods.

> imp <- as.ts(c(1, rep(0, 19)))

> phi <- 0.6

Note the leading one of φ(B) is not needed in the list of ar components, as it is always one! Confusingly, the sign of the φ supplied to filter is its sign in the model equation (here +0.6), which is opposite to its sign in φ(B).

> theta <- filter(imp, phi, method = "recursive")

> theta


Time Series:
Start = 1
End = 20
Frequency = 1
 [1] 1.000000e+00 6.000000e-01 3.600000e-01
 [4] 2.160000e-01 1.296000e-01 7.776000e-02
 [7] 4.665600e-02 2.799360e-02 1.679616e-02
[10] 1.007770e-02 6.046618e-03 3.627971e-03
[13] 2.176782e-03 1.306069e-03 7.836416e-04
[16] 4.701850e-04 2.821110e-04 1.692666e-04
[19] 1.015600e-04 6.093597e-05

3.6.3 The ACF for AR models

Briefly, we digress to again consider the acf for ar models, seen previously in Sect. 2.5.4, Equation 2.5, and Example 2.12 (p 34) in particular. In this example, the following is stated:

. . . the autocovariance is

γ(B) = 10 / [(1 − 0.3B + 0.2B2)(1 − 0.3B−1 + 0.2B−2)]
     = 10 / (0.2B−2 − 0.36B−1 + 1.13 − 0.36B + 0.2B2).

By some detailed mathematics (covered in Sect. 3.6), this equals

γ(B) = · · · + 11.11 + 2.78B − 1.39B2 − 0.97B3 − 0.0139B4 + 0.190B5 + · · · , (3.7)

only quoting the terms for the non-negative lags.

Since this is Sect. 3.6, we had better deliver!

The way to convert to Equation (3.7) is to proceed as in this section. First, write

γ(B) = · · · + γ−2B−2 + γ−1B−1 + γ0 + γ1B + γ2B2 + · · ·

(recalling that the autocovariance is a series in both directions, but is symmetric, so γ−k = γk). Then, rearrange the original equation to get

10 = γ(B)(0.2B−2 − 0.36B−1 + 1.13 − 0.36B + 0.2B2)
   = (· · · + γ1B−1 + γ0 + γ1B + γ2B2 + · · · ) × (0.2B−2 − 0.36B−1 + 1.13 − 0.36B + 0.2B2).


Then, expand and equate powers of B as before in this section. In this situation, it is just a lot trickier.

On the left, the constant term is 10; on the right, a constant can be foundfrom:

γ0(1.13) + γ1(−0.36)︸ ︷︷ ︸γ1B−1(−0.36B)

+ γ2(0.2)︸ ︷︷ ︸γ2B−2(0.2B)

+ γ1(−0.36)︸ ︷︷ ︸γ1B1(−0.36B−1)

+ γ2(0.2)︸ ︷︷ ︸γ2B2(0.2B−1)

So we have

10 = γ0(1.13) + γ1(−0.36) + γ2(0.2) + γ1(−0.36) + γ2(0.2)

Proceed for other powers of B also, and develop a set of equations to besolved for γ1, γ2, and so on.

Far easier is to use r, after first converting to an ma model whose parameters we call theta:

> imp <- as.ts(c(1, rep(0, 99)))

> theta <- filter(imp, c(0.3, -0.2), "recursive")

That gives the ma form of the ar model. Note that the leading coefficient of φ(B) is 1 and is assumed; it should not be included in the list passed to filter.

> theta[1:4]

[1] 1.000 0.300 -0.110 -0.093

> gamma <- convolve(theta, theta) * 10

> gamma[1:4]

[1] 11.1111111 2.7777778 -1.3888889 -0.9722222

> rho <- gamma/gamma[1]

> rho[1:4]

[1] 1.0000 0.2500 -0.1250 -0.0875
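The same check is available without the explicit conversion: ARMAacf also accepts ar coefficients, so a one-line verification is possible. A sketch:

> ARMAacf(ar = c(0.3, -0.2), lag.max = 3)   # theoretical acf for the ar(2) model

This should reproduce the rho values above: 1, 0.25, −0.125 and −0.0875.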


3.7 Summary

In this Module, moving average models were studied, including forecasting, establishing confidence intervals on forecasts, and writing the models using the backshift operator. In addition, three methods were shown that can be used to convert ar models to ma models.

3.8 Exercises

Ex. 3.8: Classify the following ma models (that is, state if they are ma(3), ma(2), etc.)

(a) At+1 = et+1 + 8.39− 0.06et + 0.35et−1.

(b) Xn = −0.12en−1 + en.

(c) Yt − 0.29et−1 + 0.19et−2 + 0.62et−3 − 0.26et−4 − et = 12.40.

Ex. 3.9: Classify the following ma models (that is, state if they are ma(3), ma(2), etc.)

(a) Bt = 0.1et−1 + et.

(b) Yn = 0.036en−2 − 0.36en−1 + en.

(c) Wt + 0.39et−1 + 0.25et−2 − 0.21et−3 − et = 8.00.

Ex. 3.10: Determine the mean of each series in Exercise 3.8.

Ex. 3.11: Determine the mean of each series in Exercise 3.9.

Ex. 3.12: Write each of the models in Exercise 3.8 using the backshift operator.

Ex. 3.13: Write each of the models in Exercise 3.9 using the backshift operator.

Ex. 3.14: Convert the ar model

Xt+1 = et+1 + 0.4Xt

into the equivalent ma model using each of the three methods outlined in Sect. 3.6, and confirm that they give the same answer.

Ex. 3.15: Convert the ma(2) model

Yn = en + 0.3en−1 − 0.1en−2

into the equivalent ar model using one of the three methods outlined in Sect. 3.6.


Ex. 3.16: Convert the AR model

Yn = 0.25Yn−1 − 0.13Yn−2 + en

into the equivalent ma model using one of the three methods outlined in Sect. 3.6.

Ex. 3.17: Compute forecasting formulae for each of the ma models in Exercise 3.8 for one-, two- and three-steps ahead, and compute confidence intervals for each forecast in terms of the error variance σ2e.

Ex. 3.18: Compute forecasting formulae for each of the ma models in Exercise 3.9 for one-, two- and three-steps ahead, and compute confidence intervals for each forecast. In each case, assume σ2e = 2.

Ex. 3.19: Write a short piece of r-code to simulate the ma model Xt = et + θet−1 where e ∼ N(0, 1) (see Example 2.3 for a guideline). Plot a simulated series of length 200 for each of the following eight values of θ: θ = −1.5, −1, −0.6, −0.2, 0, 0.5, 1, 1.5. Comment on your findings: What effect does the value of θ have on the stationarity of the series?

Ex. 3.20: Consider the ma(1) model

Xn = 0.4en−1 + en

where e ∼ N(0, 3).

(a) Write the model using backshift operators.

(b) Find the autocovariance series {γ}.

(c) Compute the autocorrelation function (acf), {ρ}.

Ex. 3.21: Consider the ma(1) model

Sn+1 = 0.2en + en+1,

where e ∼ N(0, 2).

(a) Write the model using backshift operators.

(b) Find the autocovariance series {γ}.

(c) Compute the autocorrelation function (acf), {ρ}.

Ex. 3.22: Consider the time series model

Zt = 0.2et−1 − 0.1et−2 + et,

where e ∼ N(0, 5).


(a) Write the model using backshift operators.

(b) Find the autocovariance series {γ}.

(c) Compute the autocorrelation function (acf), {ρ}.

Ex. 3.23: Consider the ar(1) model

Wn = 0.3Wn−1 + en,

where e ∼ N(0, 2.5).

(a) Write the model using backshift operators.

(b) Find the autocovariance series {γ} using R.

(c) Compute the autocorrelation function (acf), {ρ}.

Ex. 3.24: Consider the AR model

Yt = 0.45Yt−1 − 0.2Yt−2 + et,

where e ∼ N(0, 5).

(a) Write the model using backshift operators.

(b) Find the autocovariance series {γ} using R.

(c) Compute the autocorrelation function (acf), {ρ}.

3.8.1 Answers to selected Exercises

3.8 The models are: ma(2); ma(1) and ma(4).

3.10 The means are: E[A] = 8.39; E[X] = 0; and E[Y ] = 12.40.

3.12 (a) At+1 = (1− 0.06B + 0.35B2)et+1 + 8.39;

(b) Xn = (1− 0.12B)en;

(c) Yt = (1 + 0.29B − 0.19B2 − 0.62B3 + 0.26B4)et + 12.40.

3.14 First, convert to backshift operator notation: (1 − 0.4B)Xt+1 = et+1. The infinite ma model is given by

Xt+1 = (1/(1 − 0.4B)) et+1.

Then the equivalent ma model, using any method, is

Xt+1 = (1 + 0.4B + (0.4)2B2 + (0.4)3B3 + · · · )et+1,

or

Xt+1 = et+1 + 0.4et + 0.16et−1 + 0.064et−2 + · · · .


3.17 For (a) only:

(a) One-step ahead: At+1|t = 8.39 − 0.06et + 0.35et−1; var[At+1|t − At+1] = var[et] = σ2e; the CI is At+1|t ± 2√(σ2e).

(b) Two-steps ahead: At+2|t = 8.39 + 0.35et; var[At+2|t − At+2] = (1 + (0.06)2)var[et] = 1.0036σ2e; the CI is At+2|t ± 2√(1.0036σ2e).

(c) Three-steps ahead: At+3|t = 8.39; var[At+3|t − At+3] = (1 + (0.06)2 + (0.35)2)var[et] = 1.1261σ2e; the CI is At+3|t ± 2√(1.1261σ2e).

3.20 (a) Xn = (1 + 0.4B)en, or Xn = θ(B)en where θ(B) = 1 + 0.4B.

(b) The autocovariance using the backshift operator is γ(B) = θ(B)θ(B−1)σ2e, so γ(B) = 3(1 + 0.4B)(1 + 0.4B−1) = 1.2B−1 + 3.48 + 1.2B; the series is {1.2, 3.48, 1.2}.

(c) Dividing the autocovariance by γ0 = 3.48 gives the acf series as {0.345, 1, 0.345}.

3.23 (a) (1 − 0.3B)Wn = en.

(b) > imp <- as.ts(c(1, rep(0, 99)))
    > theta <- filter(imp, c(0.3), "recursive")
    > gamma <- convolve(theta, theta) * 2.5
    > gamma[1:6]

    [1] 2.747252747 0.824175824 0.247252747
    [4] 0.074175824 0.022252747 0.006675824

(c) > rho <- gamma/gamma[1]
    > rho[1:6]

    [1] 1.00000 0.30000 0.09000 0.02700 0.00810
    [6] 0.00243

(Note the ar coefficient is supplied to filter with the sign it has in the model equation, +0.3, as in Sect. 3.6.2.)


Module 4
arma Models

Module contents
4.1 Introduction . . . . . . 60
4.2 Definition . . . . . . 60
4.3 The backshift operator for arma models . . . . . . 62
4.4 Statistics . . . . . . 62
    4.4.1 The mean . . . . . . 62
    4.4.2 The autocovariance and autocorrelation . . . . . . 63
4.5 Conversion of arma models to ar and ma models . . . . . . 64
4.6 Forecasting arma models . . . . . . 65
    4.6.1 Forecasting . . . . . . 65
    4.6.2 Confidence intervals . . . . . . 66
    4.6.3 Forecasting difficulties with arma models . . . . . . 67
4.7 Summary . . . . . . 67
4.8 Exercises . . . . . . 68
    4.8.1 Answers to selected Exercises . . . . . . 70

Module objectives

Upon completion of this module students should be able to:


� understand what is meant by an autoregressive moving average (arma) model;

� use the arma(p, q) notation to define arma models;

� use arma models to develop forecasting formulae;

� develop confidence intervals for forecasts from arma models;

� write arma models using the backshift operator;

� compute the mean of a time series written in arma form;

� understand the need for ar, ma and arma models;

� convert arma models to ma and ar models using appropriate methods;

� compute the autocorrelation function (acf) for an arma model.

4.1 Introduction

This Module examines models with both autoregressive and moving average components.

4.2 Definition

The principle of parsimony (that the best model is the simplest model that captures the important features of the data) has been mentioned before, where it was noted that a complex ar model can often be replaced by a simpler ma model.

Sometimes, however, neither a simple ar model nor a simple ma model exists. In these cases, a combination of ar and ma models will almost always produce a simple model. These models are called AutoRegressive Moving Average models, or arma models. Once again, p is used for the number of autoregressive components, and q for the number of moving average components. Consider first some examples.

Example 4.1: An example of an arma(2, 1) model is

Wn+1 = 0.56 + 0.8Wn − 0.4Wn−1 + en+1 + 0.5en,

where 0.8Wn − 0.4Wn−1 are the two ar components and 0.5en is the one ma component.


The signal or information is of the form Sn+1 = 0.56 + 0.8Wn − 0.4Wn−1 + 0.5en, and has both ar and ma components.

Example 4.2: An example of an arma(1, 3) model is

Xt = 120.78 + 0.88Xt−1 + et − 0.41et−1 − 0.15et−2 + 0.08et−3,

where 0.88Xt−1 is the ar(1) component and −0.41et−1 − 0.15et−2 + 0.08et−3 is the ma(3) component.

A more formal definition follows.

Definition 4.1 The form of an arma(p, q) model is the equation

Xn − ∑_{k=1}^{p} φkXn−k = m′ + en + ∑_{j=1}^{q} θjen−j , n ≥ 0, (4.1)

where {Xn} is the time series, m′ is some constant, and the φk and θj are defined as for ar and ma models respectively.

Example 4.3: Chu & Katz [13] studied the monthly SOI time series from January 1935 to August 1983, and concluded the data could be modelled by an arma(1, 1) model.

Example 4.4: Davis & Rappoport [15] use an arma(2, 2) model for the Palmer Drought Index, {Yt}. The final fitted model is

Yt = 1.344Yt−1 − 0.431Yt−2 + et − 0.419et−1 + 0.034et−2.

Katz & Skaggs [26] claim the equivalent ar(2) model is almost as good as the model given by Davis & Rappoport, yet has half the number of parameters. For this reason, they prefer the ar(2) model.


4.3 The backshift operator for ARMA models

arma models have both ar and ma components; the model is easily written using the backshift operator by following the guidelines for ar and ma models.

Example 4.5: Consider the arma(1, 2) model

Zt = 0.83 + et − 0.66et−1 + 0.72et−2 − 0.29Zt−1. (4.2)

First, re-write as

Zt + 0.29Zt−1 = 0.83 + et − 0.66et−1 + 0.72et−2;

then use the backshift operator to get

φ(B)Zt = m′ + θ(B)et

(1 + 0.29B)Zt = 0.83 + (1− 0.66B + 0.72B2)et.

4.4 Statistics

4.4.1 The mean

In Equation (4.1), the general form of an arma(p, q) model is given. Taking expected values of each term in this series gives

E[Xn] = E[m′] + E[en] + E[θ1en−1] + · · · + E[θqen−q]
        + E[φ1Xn−1] + E[φ2Xn−2] + · · · + E[φpXn−p]
      = m′ + φ1E[Xn−1] + φ2E[Xn−2] + · · · + φpE[Xn−p],

since the average error is zero. Now, since the assumption is that the time series {X} is stationary, the mean of this series is approximately constant, so the expected value of the series will be the same at any time step. Let this constant mean be µ. Then

µ = m′ + φ1µ + φ2µ + · · · + φpµ,

and so, on solving for µ,

µ = m′ / (1 − φ1 − φ2 − · · · − φp).

This enables the mean of the series to be computed from the arma model.


Example 4.6: The mean of {Zt}, say µ, in the arma(1, 2) model in Equation (4.2) is found by taking expectations of each term:

E[Zt] = 0.83 + E[et] − 0.66E[et−1] + 0.72E[et−2] − 0.29E[Zt−1]
    µ = 0.83 − 0.29µ,

so that E[Z] = µ = 0.643.

See also Example 4.8.
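As a numerical check, a long series can be simulated from this model in r and its sample mean compared with µ; a sketch (the seed and series length are arbitrary, and arima.sim simulates a zero-mean series, so µ is added afterwards):

> set.seed(3)
> z.sim <- arima.sim(model = list(ar = -0.29, ma = c(-0.66, 0.72)), n = 20000) + 0.643
> mean(z.sim)   # should be close to 0.643

Note the ar coefficient is −0.29, since the term −0.29Zt−1 appears on the right of Equation (4.2).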

4.4.2 The autocovariance and autocorrelation

For an arma(p, q) model, the autocovariance can be expressed as

γ(B) = σ2e θ(B)θ(B−1) / (φ(B)φ(B−1)).

Example 4.7: In Example 4.5, the following were found:

φ(B) = 1 + 0.29B
θ(B) = 1 − 0.66B + 0.72B2.

Suppose for the sake of example that σ2e = 4. Then, the autocovariance is

γ(B) = 4(1 − 0.66B + 0.72B2)(1 − 0.66B−1 + 0.72B−2) / [(1 + 0.29B)(1 + 0.29B−1)].

This can be converted into the series

γ(B) = · · · + 5.4430B−2 − 8.8380B−1 + 11.9381 − 8.8380B + 5.4430B2 − 1.5785B3 + 0.4578B4 + · · ·

so that the autocorrelation is

{ρ} = {1.0000, −0.7403, 0.4559, −0.1322, 0.0383, . . . }.
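Again, ARMAacf provides a quick check of these autocorrelation terms; a minimal sketch (as before, the ar coefficient is −0.29):

> ARMAacf(ar = -0.29, ma = c(-0.66, 0.72), lag.max = 4)

The result should match {ρ} above: 1, −0.7403, 0.4559, −0.1322 and 0.0383.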


4.5 Conversion of arma models to ar and ma models

Using similar approaches to those used in Sect. 3.6, arma models can be converted to pure ar or pure ma models.

Example 4.8: Consider the arma(1, 2) model

Xt = 0.3Xt−1 + et + 0.4et−1 − 0.1et−2 + 10. (4.3)

For the moment, ignore the constant term m′ = 10 and write

(1− 0.3B)Xt = (1 + 0.4B − 0.1B2)et

φ(B)Xt = θ(B)et.

To write the model as a pure ma model,

Xt = (θ(B)/φ(B)) et = θ′(B)et (4.4)

where θ′(B) = θ′0 + θ′1B + θ′2B2 + · · · . To convert to this pure ma form, the values of θ′0, θ′1, and so on must be found. Rearrange Equation (4.4) to obtain

θ(B) = θ′(B)φ(B)
1 + 0.4B − 0.1B2 = (θ′0 + θ′1B + θ′2B2 + θ′3B3 + · · · )(1 − 0.3B)
                 = θ′0 + B(−0.3θ′0 + θ′1) + B2(−0.3θ′1 + θ′2) + B3(−0.3θ′2 + θ′3) + · · ·

Now, equate powers of B so that both sides of the equation are equal. Equating constant terms: 1 = θ′0, as expected. Equating terms in B:

0.4 = −0.3θ′0 + θ′1,

so that θ′1 = 0.4 + 0.3θ′0 = 0.7.

Equating terms in B2:

−0.1 = −0.3θ′1 + θ′2,

so that θ′2 = −0.1 + 0.3θ′1 = 0.11.

Equating terms in B3:

0 = −0.3θ′2 + θ′3,


so that θ′3 = 0.3θ′2 = 0.11(0.3).

Continuing, a pattern emerges showing that θ′k = (0.3)k−2(0.11) when k ≥ 2. Hence,

θ′(B) = 1 + 0.7B + 0.11B2 + 0.11(0.3)B3 + · · ·+ 0.11(0.3)k−2Bk.

This means the arma(1, 2) model has an equivalent ma(∞) representation of

Xt = et + 0.7et−1 + 0.11et−2 + · · · + 0.11(0.3)k−2et−k + · · · .

However, there will probably be a constant term in the model yet to be found, so that

Xt = m′ + et + 0.7et−1 + 0.11et−2 + · · · + 0.11(0.3)k−2et−k + · · · . (4.5)

Taking expectations of Equation (4.3) shows that the mean of the series is E[X] = 10/0.7 ≈ 14.2857. Taking expectations of Equation (4.5) shows that m′ = E[X] ≈ 14.2857. So the arma(1, 2) model has the equivalent ma model

Xt = 14.2857 + et + 0.7et−1 + 0.11et−2 + · · ·+ 0.11(0.3)k−2et−k.

Note that I have yet to determine how to do these conversions in r.
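One possibility worth exploring is the ARMAtoMA function in r's stats package, which computes the pure ma (ψ) weights of an arma model; a minimal sketch for the model above:

> ARMAtoMA(ar = 0.3, ma = c(0.4, -0.1), lag.max = 5)

This should return 0.7, 0.11, 0.033, 0.0099 and 0.00297, matching the θ′ coefficients found above (the leading coefficient of 1 is implicit).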

4.6 Forecasting ARMA models

4.6.1 Forecasting

Forecasting arma models uses the same principles as forecasting ma and ar models. This procedure is called the hat principle, summarized below.

The forecasting equation for an arma model is obtained from the model equation by "placing hats" on all the terms of the equation, and adjusting subscripts accordingly. The "hat" designates the best linear estimate of the quantity underneath the hat. This equation is then adjusted by noting:

1. An ek|j for which k is in the future (i.e. k > j) just equals zero (the mean of {ek}), while one for which k is in the present or past (k ≤ j) just equals ek. In other words, hats change future ek's to zeros, and they fall off present and past ek's.


2. An Xk|j for which k is in the present or past (i.e. k ≤ j) just equals Xk, while one for which k is in the future can be expressed in terms of another forecasting equation, which ultimately will allow it to be expressed in terms of known quantities. In other words, hats fall off present and past Xk's, and they stay on future ones.

Example 4.9: Consider the arma(2, 1) model

Wn = 0.72 + 0.44Wn−1 + 0.17Wn−2 + en − 0.26en−1. (4.6)

A one-step ahead forecast is

Wn+1|n = 0.72 + 0.44Wn|n + 0.17Wn−1|n + en+1|n − 0.26en|n.

Since en+1|n is in the future, it is replaced by the mean of the {ek}, which is zero. In contrast, en|n = en. Likewise, Wn|n = Wn and Wn−1|n = Wn−1, so the forecasting formula is

Wn+1|n = 0.72 + 0.44Wn + 0.17Wn−1 − 0.26en. (4.7)

Using the same principles, the two-step ahead forecasting formula is

Wn+2|n = 0.72 + 0.44Wn+1|n + 0.17Wn.

Again, Wn+1|n can be replaced by Equation (4.7) (though this is not necessary) to get

Wn+2|n = 0.72 + 0.44{0.72 + 0.44Wn + 0.17Wn−1 − 0.26en} + 0.17Wn,

which can be simplified if you wish.

4.6.2 Confidence intervals

As with ar models, arma models must first be converted to pure ma models before confidence intervals can be computed for forecasts. After conversion to a pure ma form, the same principles as used in Sect. 3.4.2 apply.

Example 4.10: Consider the arma(1, 2) model from Example 4.8:

Xt = 0.3Xt−1 + et + 0.4et−1 − 0.1et−2 + 10.

This model has the equivalent ma(∞) form

Xt = 14.2857 + et + 0.7et−1 + 0.11et−2 + · · ·+ 0.11(0.3)k−2et−k.


The one-step ahead forecast of the model in ma form is

Xt+1|t = 14.2857 + 0.7et + 0.11et−1 + · · ·+ 0.11(0.3)k−2et−k+1,

whereas the exact (but unknown) value will be

Xt+1 = 14.2857 + et+1 + 0.7et + 0.11et−1 + · · ·+ 0.11(0.3)k−2et−k+1.

The difference between them is et+1, and so the variance of the forecasting error is just the error variance, say σ2e.

The two-step ahead forecast is

Xt+2|t = 14.2857 + 0.11et + · · ·+ 0.11(0.3)k−2et−k+2,

whereas the exact (but unknown) value is

Xt+2 = 14.2857 + et+2 + 0.7et+1 + 0.11et + · · ·+ 0.11(0.3)k−2et−k+2.

The difference between them is

Xt+2 − Xt+2|t = et+2 + 0.7et+1,

so the variance of the forecast error is (1 + (0.7)2)σ2e = 1.49σ2e. Confidence intervals can be constructed from the values of the error variance.

Continuing in the same manner, the variance of a three-step ahead forecast is 1.5021σ2e, and of a four-step ahead forecast, 1.503189σ2e.
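These multipliers are just the cumulative sums of the squared pure ma weights, which suggests a quick check in r using ARMAtoMA (see the sketch in Sect. 4.5):

> psi <- ARMAtoMA(ar = 0.3, ma = c(0.4, -0.1), lag.max = 10)
> cumsum(c(1, psi^2))[1:4]   # forecast error variance multipliers, one to four steps ahead

The result should be 1, 1.49, 1.5021 and 1.503189, matching the values above.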

4.6.3 Forecasting difficulties with ARMA models

In Sect. 3.4.3, some difficulties forecasting with ma models were presented. In short, the one-step ahead forecasting errors need to be determined right back to the beginning of the series. Because aspects of ma models are present in arma models, this same difficulty is also present. Of course, r can compute these errors and produce predictions without having to worry about these difficulties in a real (data-driven) situation; see Sect. 5.4.

4.7 Summary

In this Module, a combination of autoregressive and moving average models, called arma models, was discussed. Forecasting methods were also examined for these models.


4.8 Exercises

Ex. 4.11: Classify the following models as ar, ma or arma, and state the orders of the models (for example, an answer may be arma(1, 3)):

(a) At = 12.6− 0.44At−1 + 0.37et−1 + et;

(b) Xn − 0.24Xn−1 + 0.38Xn−2 − 14.8 = en;

(c) Yt+1 = et+1 − 0.19et − 0.44Yt;

(d) Rn = 0.46en−1 + en;

(e) Pn+1 = 8.69 + en+1 − 0.35Pn − 0.26en − 0.18en−1 + 0.11en−2.

Ex. 4.12: Classify the following models as ar, ma or arma, and state the orders of the models (for example, an answer may be arma(1, 3)):

(a) An − 0.1An−1 = 7.40 + 0.22en−1 + en;

(b) Bn − 0.5Bn−1 = en;

(c) Xt − et = 0.61Xt−1 − 0.67et−1;

(d) Zt+1 = 0.26et + 0.10et−1 + 0.17Zt − 0.16Zt−1 + et+1;

(e) Xt+1 − 0.2et + 0.2et−1 = et+1 + 7;

(f) Yn = −2.2 + en + 0.23Yn−1 − 0.19en−1 − 0.18en−2 + 0.17en−3.

Ex. 4.13: Find the mean of each series in Exercise 4.11.

Ex. 4.14: Find the mean of each series in Exercise 4.12.

Ex. 4.15: Write each model in Exercise 4.11 using the backshift operator.

Ex. 4.16: Write each model in Exercise 4.12 using the backshift operator.

Ex. 4.17: Consider the arma(1, 1) model

Xn = 0.2Xn−1 + en − 0.1en−1

where var[en] = 9.3.

(a) Write the model using the backshift operator.

(b) Find a one- and two-step ahead forecast for the model.

(c) Convert the model into a pure ma model.

(d) Find 95% confidence intervals for the forecasts in (b).


Ex. 4.18: Consider the arma(1, 1) model

Yn = 0.3Yn−1 + en + 0.2en−1

where var[en] = 7.0.

(a) Write the model using the backshift operator.

(b) Find a one-, two- and three-step ahead forecast for the model.

(c) Convert the model into a pure ma model.

(d) Find 95% confidence intervals for the forecasts in (b).

Ex. 4.19: Consider the arma(1, 1) model

Wt+1 + 0.2Wt = 2 + et+1 + 0.2et

where var[et] = 7.0.

(a) Write the model using the backshift operator.

(b) Find a one-, two- and three-step ahead forecast for the model.

(c) Convert the model into a pure ma model.

(d) Find 95% confidence intervals for the forecasts in (b).

Ex. 4.20: Give two reasons why it is sometimes necessary to convert arand arma models into pure ma models.

Ex. 4.21: Claps & Morrone [14] give the following model for modelling runoff Dt under certain conditions:

Dt − exp{−1/K3}Dt−1 = (1 − c3 exp{−1/K3})It − exp{−1/K3}(1 − c3)It−1,

where c3 is a recharge coefficient (constant in any given problem), It is the effective rainfall input, and K3 is a storage coefficient (constant in any given problem). The authors state that if the effective rainfall input It is white noise, then the model is equivalent to an arma(1, 1) model. Use (1 − c3 exp{−1/K3})It = et to show that this is the case.

Ex. 4.22: Sales, Pereira & Vieira [40] discuss numerous arma-type models in connection with the Brazilian Electrical Sector. A significant proportion of electricity is sourced from hydroelectricity in Brazil. In their paper, the authors use arma-type models to model the natural monthly average flow rate (in cubic metres per second) of the reservoir of Furnas on the Grande River in Brazil. Initially, the logarithm of the data was taken to create a time series {Ft}, and then an arma(1, 1) model was fitted. The information in Table 4.1 comes from their Table 2.


Table 4.1: Parameter estimates and standard errors for the arma(1, 1) model fitted by Sales, Pereira & Vieira [40].

Parameter   Estimate   Standard Error
φ1           0.8421     0.0237
θ1          −0.2398     0.0426
σ2e          0.4343

(a) Write down the fitted model.

(b) Convert the model to a pure ma model.

(c) Develop one-, two- and three-step ahead forecasts for the log of the flow rate.

(d) Determine 95% confidence intervals for each of these forecasts.

Ex. 4.23: Consider the arma(2, 2) model for the Palmer Drought Index seen in Example 4.4. Write this model using the backshift operator. Then create forecasting formulae for forecasting one-, two-, three- and four-steps ahead.

4.8.1 Answers to selected Exercises

4.11 The models are arma(1, 1); ar(2) (or arma(2, 0)); arma(1, 1); ma(1) (or arma(0, 1)); arma(1, 3).

4.13 The means are: E[A] = 8.75; E[X] ≈ 13.0; E[Y] = 0; E[R] = 0; E[P] ≈ 6.44.

4.15 (a) (1 + 0.44B)At = 12.6 + (1 + 0.37B)et;

(b) (1− 0.24B + 0.38B2)Xn = 14.8 + en;

(c) (1 + 0.44B)Yt+1 = (1− 0.19B)et+1;

(d) Rn = (1 + 0.46B)en;

(e) (1 + 0.35B)Pn+1 = 8.69 + (1− 0.26B − 0.18B2 + 0.11B3)en+1.

4.17 (a) (1 − 0.2B)Xn = (1 − 0.1B)en;

(b) The one-step ahead forecast is Xn+1|n = 0.2Xn − 0.1en. The two-step ahead forecast is Xn+2|n = 0.2Xn+1|n.

(c) We have θ′(B) = (1 − 0.1B)/(1 − 0.2B). Solving shows that θ′(B) = 1 + 0.1B + 0.1(0.2)B2 + · · · + 0.1(0.2)k−1Bk + · · · . The pure ma model is therefore

Xt = et + 0.1et−1 + 0.1(0.2)et−2 + · · · + 0.1(0.2)k−1et−k + · · · .


(d) The variance of the forecasting error for the one-step ahead forecast is σ2e = 9.3. For the two-step ahead forecast, the variance of the forecast error is σ2e + (0.1)2σ2e = 9.393. The 95% confidence intervals therefore are Xt+1|t ± 2√9.3 for the one-step ahead forecast, and Xt+2|t ± 2√9.393 for the two-step ahead forecast.

4.20 Firstly, models must be in ma form to compute confidence intervals for forecasts; secondly, sometimes the ma model will be the simplest model in a given situation.

4.21 Hint: First write φ = exp(−1/K3); the left-hand side then looks like the ar(1) part. Then, use the given relationship between It and et to find It−1, and hence show that θ = φ(1 − c3)/(1 − c3φ) for the ma(1) part.


Module 5
Finding a Model

Module contents
5.1 Introduction . . . . . . 74
5.2 Identifying a Model . . . . . . 75
    5.2.1 The Autocorrelation Function . . . . . . 75
    5.2.2 Sample acf . . . . . . 75
    5.2.3 Sample pacf . . . . . . 80
    5.2.4 Tips for using the sample acf and pacf . . . . . . 83
    5.2.5 Model selection using aic . . . . . . 83
    5.2.6 Selecting arma models . . . . . . 84
5.3 Parameter estimation . . . . . . 85
    5.3.1 Preliminary estimation for ar models: The Yule–Walker equations . . . . . . 85
    5.3.2 Parameter estimation in R . . . . . . 86
5.4 Forecasting using R . . . . . . 88
5.5 Summary . . . . . . 89
5.6 Exercises . . . . . . 91
    5.6.1 Answers to selected Exercises . . . . . . 95


Module objectives

Upon completion of this module students should be able to:

� understand the information contained in the sample autocorrelation function (acf);

� understand the information contained in the sample partial acf (pacf);

� use the sample acf and sample pacf to select ar and ma models for time series data;

� use r to plot the sample acf and pacf for time series data;

� write down the fitted ar or ma model given the r output;

� use the Akaike Information Criterion (aic) to select the order of ar models for time series data using r;

� understand that selecting arma models is more difficult than selecting ar and ma models;

� compute initial parameter estimates of an ar model using the Yule–Walker equations;

� use r to compute predictions from an ar or ma model;

� use r to compute parameter estimates for ar and ma models of a given order.

5.1 Introduction

In this Module, methods are discussed for finding the best model for a particular time series. This consists of two stages: first, determining which type of model is appropriate for the given data (for example, ar(1) or ma(2)); then secondly, estimating the parameters in the chosen model.

The choice of ar, ma and arma models is discussed, as well as the number of parameters necessary for the chosen type of model.

The two most important tools in making these decisions are the sample autocorrelation function (acf) and sample partial autocorrelation function (pacf).


5.2 Identifying a Model

5.2.1 The Autocorrelation Function

The autocorrelation function, or acf, was studied in earlier Modules. The approach then was to take a given model and deduce the acf that is characteristic of that particular model.

In practice, the scientist doesn't start with a known model, but instead starts with data for which a model is sought. Using software, the acf is estimated from the data (giving a sample acf), and the characteristics of the sample acf are used to select the best model.

5.2.2 Sample ACF

The autocorrelation function is estimated from the data using the formulae

γk = (1/N) ∑_{i=1}^{N−k} (Xi − µ)(Xi+k − µ), k ≥ 0
ρk = γk/γ0, k ≥ 0,

where N is the number of terms in the time series and µ is the sample mean of the time series. Of course, the actual computations are performed by computer, using a package such as r. Since the quantities γk (and hence ρk) are estimated, there will be some sampling error. Formulae exist for estimation of the sampling error but will not be given here. However, r uses these formulae to produce approximate 95% confidence intervals for ρk.

Consider the ma(2) model as used in Example 3.7 (p 49): Vn+1 = en+1 − 0.39en − 0.22en−1, where σ2e = 2. In that example, the theoretical acf was computed as

{ρ} = {1, −0.253, −0.183}.

The series {Vn} is simulated in r as follows:

> ma.terms <- c(-0.39, -0.22)

> sim.ma2 <- arima.sim(model = list(ma = ma.terms),

+ n = 1000, sd = sqrt(2))

Note the variance of the errors is given as 2.

The sample acf of this data is found as follows (Fig. 5.1):


[Figure 5.1: The sample acf of the simulated ma(2) series sim.ma2; lag on the horizontal axis, acf on the vertical axis.]

> acf(sim.ma2[10:1000])

Note the first few terms have been ignored; this allows the simulation to recover from the initial (arbitrary) choice of errors needed to begin the simulation.
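The estimation formulae at the start of this section can also be applied directly, and the result compared with r's own computation; a sketch using the simulated series:

> x <- sim.ma2[10:1000]
> N <- length(x)
> mu <- mean(x)
> gamma1 <- sum((x[1:(N - 1)] - mu) * (x[2:N] - mu))/N   # estimate of gamma_1
> gamma0 <- sum((x - mu)^2)/N                            # estimate of gamma_0
> gamma1/gamma0   # should match acf(x, plot = FALSE)$acf[2]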

First, note the dotted horizontal lines on the plot. These indicate the approximate 95% confidence intervals for ρk. In other words, if the autocorrelation value lies within the dotted lines, the value can be considered as zero; the reason it is not exactly zero is due to sampling error only.

We would expect that the sample acf would demonstrate the features of the acf for the model. Compare Figures 3.1 (p 50) and 5.1; the sample acf and acf do look similar: they both show two components in the plot that are larger than the rest when we ignore the term at a lag of zero, which will always be one. (Recall that only two acf values are outside the dotted confidence bands, so the rest can be considered as zero, and that the first term will always be one so is of no importance.) Notice there are two non-zero components in the acf for a two-parameter ma model (that is, an ma(2)).

In fact, this is typical. Here is one of the most important rules for identifying time series models:


If the sample acf has k non-zero components from 1 to k, then an ma(k) model is appropriate.

In r, the sample acf is produced by typing acf( time.series ) at the r prompt, where time.series is the name of the time series.

Example 5.1: Consider the ar(2) model

Xn = 0.4Xn−1 − 0.3Xn−2 + en.

The theoretical acf can be computed and plotted in r (by first converting to an ma model):

> imp <- as.ts(c(1, rep(0, 99)))

> ar.terms <- c(0.4, -0.3)

> theta <- filter(imp, ar.terms, "recursive")

> errorvar <- 1

> gamma <- convolve(theta, theta) * errorvar

> rho <- gamma/gamma[1]

Note we used σ2e = 1; it doesn't matter what value we use, since we eventually compute ρ anyway.

> plot(c(1, 10), c(1, -0.2), type = "n", las = 1,

+ main = "Actual ACF ", xlab = "Lag", ylab = "ACF")

> lines(rho, type = "h", lwd = 2)

> abline(h = 0)

This theoretical acf is shown in the top panel of Fig. 5.2.

Suppose we generated some random numbers from this time series and computed the sample acf; we would expect the sample acf to look similar to Fig. 5.2. Proceed:

> ar2.sim <- arima.sim(model = list(ar = ar.terms),

+ n = 1000)

> acf(ar2.sim[10:1000], lwd = 2, las = 1, lag.max = 10,

+ main = "Sample ACF")

This sample acf is shown in the bottom panel of Fig. 5.2. They are very similar, as expected.

Example 5.2:

Parzen [36] studied a time series of yearly snowfall in Buffalo from 1910 to 1972 (recorded to the nearest tenth of an inch):


[Figure 5.2: Top: the theoretical acf for the ar(2) model in Example 5.1. Bottom: the sample acf for data simulated from the ar(2) model in Example 5.1.]


[Figure 5.3: Yearly Buffalo snowfall from 1910 to 1972. Top: the plot of the data. Bottom: the sample acf.]

> bs <- read.table("buffalosnow.dat", header = TRUE)

> sf <- ts(bs$Snow, start = c(1910, 1), frequency = 1)

The data are plotted in the top panel of Fig. 5.3. The time series is short, but the series appears to be approximately stationary. The sample acf for the data has been computed in r (Fig. 5.3, bottom panel).

The acf has two non-zero terms (ignoring the term at lag zero, which is always one), suggesting an ma(2) model is appropriate for modelling the data. Note the confidence bands are approximate only. Here is some of the code used to produce the plots:

> plot(sf, las = 1)

> acf(sf, lwd = 2)


5.2.3 Sample PACF

In the previous section, the acf was introduced to indicate the order of the ma model appropriate for a dataset. How do we choose the appropriate order of an ar model? To identify ar models, a partial acf is used, which is explained below.

Consider three random variables X, Y and Z. Suppose X and Y are correlated, and Y and Z are correlated. Does this mean X and Z will be correlated? Generally yes, because both are correlated with Y. If Y changes, both X and Z will change, and so there will be a non-zero correlation between X and Z. Partial correlation measures the correlation between X and Z after removing the effect of the variable Y on both X and Z.

Likewise, the partial autocorrelation measures the correlation between Xi and Xi+k after removing the effect of the joint correlations with Xi+1, Xi+2, . . . , Xi+(k−1). The number of non-zero terms in the partial acf, or pacf, suggests the order of the ar model. Here is the second of the most important rules for identifying time series models:

If the sample pacf has k non-zero components from 1 to k, then an ar(k) model is appropriate.

In r, the sample pacf is produced by typing pacf( time.series ) at the r prompt, where time.series is the name of the time series. Note there is no term at a lag of zero for the sample pacf, as it makes no sense given the explanation above about removing the effect of intermediate observations.

Example 5.3: Consider the ar(2) model from Example 5.1. As this is an ar(2) model, the sample pacf from the simulated data is expected to have two significant terms. The sample pacf (Fig. 5.4) has two significant terms, as expected.

As explained above, there is no term at a lag of zero for the sample pacf.

Example 5.4: In Example 5.2 (p 77), the annual Buffalo snowfall data was examined using the acf, and an ma(2) model was found to be suitable.


[Figure 5.4: The sample pacf of data simulated from an ar(2) model; lag on the horizontal axis, partial acf on the vertical axis.]

[Figure 5.5: The sample pacf of yearly Buffalo snowfall from 1910 to 1972.]


[Figure 5.6: Simulated ar(2) data, with predictions from the correct ar(2) model and from an incorrect arima(9, 2, 9) model. The simple model is better for prediction; note the more complex model predicts the series will increase linearly over time!]

The sample pacf for the data has been computed in r (Fig. 5.5); there is no term at a lag of zero for the sample pacf.

The pacf has only one non-zero term, suggesting an ar(1) model is appropriate for modelling the data. Recall the acf suggested an ma(2) model. Which model do we choose? Since the one-parameter ar model is simpler than the two-parameter ma model, the ar(1) model would be chosen as the best model. We almost certainly do not need both an ma(2) and an ar(1) term in the model. (Later, we will learn about other criteria that help make this decision also.)

Now that an ar(1) model is chosen, it remains to estimate the parameters of the model. This will be discussed in Sect. 5.3.

Example 5.5: Consider some simulated ar(2) data. An ar(2) model and a more complicated model (an arima(9, 2, 9); we learn about arima models in Module 7.4) are fitted to the data. Predictions can be made using both models; these predictions are compared in Fig. 5.6.

The simple model is far better for making predictions!


Table 5.1: Typical features of a sample acf and sample pacf for ar and ma models. The 'slow decay' may not always be observed.

              acf               pacf
ar(k) model   slow decay        k non-zero terms
ma(k) model   k non-zero terms  slow decay

5.2.4 Tips for using the sample ACF and PACF

When using the sample acf and pacf it is important to realize they are obtained from sample information. This means they have sampling error. To allow for this, the dotted lines produced by r represent confidence intervals (95% by default). This implies a small number of terms (about 1 in 20) will lie outside the dotted lines even if they are truly zero. In addition, these confidence intervals are approximate only. Since 5% (or 1 in 20) of the components are expected to fall outside these approximate limits anyway, it is important not to place too much emphasis on terms in the sample acf and pacf that are marginal. For example, if the sample acf has two significant terms, but one is just over the confidence bands, perhaps an ma(1) model will be just as good as an ma(2). Tools for assisting in making this decision will be considered in Module 6.

An ar(k) model is implied by a sample pacf with non-zero terms from 1 to k, and typically (but not always) the terms in the sample acf will decay slowly toward zero. Similarly, an ma(k) model is implied by a sample acf with k non-zero terms from 1 to k, and typically (but not always) the terms in the sample pacf will decay slowly toward zero. Table 5.1 summarizes these very important facts for selecting time series models.

5.2.5 Model selection using AIC

Another method of selecting the order of the ar model is to use the Akaike Information Criterion (aic). The aic is used in many areas of statistics, and details will not be considered here. The aic, in general terms, measures the size of the errors by evaluating the log-likelihood, but also penalizes overfitting by including a penalty term (usually twice the number of parameters used). While including extra (but possibly unnecessary) parameters in the model will reduce the size of the errors, the penalty function ensures these unnecessary terms are less attractive when using the aic. There are numerous variations of the aic which use different forms for the penalty function, and these often produce different models than the aic itself. In each case, the model with the minimum value of the criterion is selected.
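In symbols, the criterion has the standard form (the book does not display it explicitly):

aic = -2 log L + 2p,

where log L is the maximised log-likelihood of the fitted model and p is the number of estimated parameters; the model with the smallest aic is preferred.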

In r, the function ar uses the aic to select the order of the 'best' ar model; unfortunately, ma and arma models are not considered.

The advantage of this method is that it is automatic, and any two people using the same data and software will select the same model. The disadvantage is that the computer is very strict in its decision making and does not allow for a human's expert knowledge or interpretation of the information.

Example 5.6: Using the snowfall data from Example 5.4 (p 80), the function ar can be used to select the order of the ar model.

> sf.armodel <- ar(sf)

> sf.armodel

Call:
ar(x = sf)

Coefficients:
     1       2
0.2379  0.2229

Order selected 2  sigma^2 estimated as  500.7

(We will consider writing down the actual model in Sect. 5.3.2).

Thus the ar function recommends an ar(2) model (from the output line Order selected 2). There are therefore three models to consider: an ma(2) from the sample acf, an ar(1) from the sample pacf, and now an ar(2) from r using the aic. Which do we choose?

This predicament happens often in time series analysis: there are often many good models from which to choose. In Module 6, some methods will be discussed for evaluating various models. If one of the models appears better than the others using these methods, that model should be chosen. But what if they all appear to be equally good? In that case, the simplest model would be chosen: the ar(1) model in this case.

5.2.6 Selecting ARMA models

Selecting arma models is not easy from the acf and the pacf. To select arma models, it is first necessary to study some diagnostics of ar and ma models in the next Module. The issue of selecting arma models will be reconsidered in Sect. 6.3.

5.3 Parameter estimation

Previous sections have given the basis for selecting an ar or ma model for a given data set, and for determining the order of the model. This section now discusses how to estimate the unknown parameters in the model using r. The actual mathematics is not discussed, and indeed it is not easy.

5.3.1 Preliminary estimation for AR models: The Yule–Walker equations

Consider the ar model in Equation (2.1). If the number of terms p in the series is finite, it is possible to write down a system of equations for calculating the autoregressive coefficients {φ_k}_{k=1}^p from the autocorrelation coefficients {ρ_k}_{k≥0}.

Multiplying Equation (2.1) by X_{n-k} and taking expectations, obtain

γ_k = φ_1 γ_{k-1} + · · · + φ_p γ_{k-p}

for k ≥ 1. Dividing through by γ_0,

ρ_k = φ_1 ρ_{k-1} + · · · + φ_p ρ_{k-p} ,        (5.1)

for k ≥ 1. The set of equations (5.1) with k = 1, . . . , p are written as a matrix equation, and can be solved for the coefficients φ_k. These are known as the Yule–Walker equations. In matrix form, we have

\[
\begin{pmatrix}
1 & \rho_1 & \rho_2 & \cdots & \rho_{p-1} \\
\rho_1 & 1 & \rho_1 & \cdots & \rho_{p-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho_{p-1} & \rho_{p-2} & \rho_{p-3} & \cdots & 1
\end{pmatrix}
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix}
=
\begin{pmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_p \end{pmatrix} .
\tag{5.2}
\]

This matrix equation can be solved for the coefficients {φ_k}_{k=1}^p via the formula

\[
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix}
=
\begin{pmatrix}
1 & \rho_1 & \rho_2 & \cdots & \rho_{p-1} \\
\rho_1 & 1 & \rho_1 & \cdots & \rho_{p-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho_{p-1} & \rho_{p-2} & \rho_{p-3} & \cdots & 1
\end{pmatrix}^{-1}
\begin{pmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_p \end{pmatrix} .
\tag{5.3}
\]


Example 5.7: Suppose we have a set of time series data. A plot of the sample acf reveals that the first few non-zero terms (and hence ρ_k values) are 0.36, −0.14, 0.01 and −0.03. We could use the Yule–Walker equations to determine approximate values for φ_k:

\[
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \phi_4 \end{pmatrix}
=
\begin{pmatrix}
1 & 0.36 & -0.14 & 0.01 \\
0.36 & 1 & 0.36 & -0.14 \\
-0.14 & 0.36 & 1 & 0.36 \\
0.01 & -0.14 & 0.36 & 1
\end{pmatrix}^{-1}
\begin{pmatrix} 0.36 \\ -0.14 \\ 0.01 \\ -0.03 \end{pmatrix} ,
\]

which gives φ = (0.6032, −0.5247, 0.3708, −0.2430). Using more terms would give estimates of φ_k for k > 4, but this is sufficient to demonstrate the use of the Yule–Walker equations.
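The same calculation is easily reproduced in r; this is a minimal sketch using the values quoted above (solve performs the matrix inversion and multiplication in one step):

> R <- matrix(c( 1.00,  0.36, -0.14,  0.01,
+                0.36,  1.00,  0.36, -0.14,
+               -0.14,  0.36,  1.00,  0.36,
+                0.01, -0.14,  0.36,  1.00), nrow = 4, byrow = TRUE)
> rho <- c(0.36, -0.14, 0.01, -0.03)
> solve(R, rho)    # gives 0.6032 -0.5247 0.3708 -0.2430 (to four decimals)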

The Yule–Walker equations are used to find an initial estimate of the parameters. Note also that they are based on finding parameters for an ar model only.

5.3.2 Parameter estimation in R

The function used by r to estimate parameters in arma models is the function arima. To demonstrate how to use this function, consider again the yearly Buffalo snowfall from Example 5.2 (p 77), Example 5.4 (p 80) and Example 5.6 (p 84). In these examples, the following models were considered: ar(1) (from the pacf); ma(2) (from the acf); and an ar(2) (from the aic).

Example 5.8: To fit the ar(1) model, use

> snow.ar1 <- arima(sf, order = c(1, 0, 0))

> snow.ar1

Call:
arima(x = sf, order = c(1, 0, 0))

Coefficients:
         ar1  intercept
      0.3302    80.8809
s.e.  0.1236     4.1722

sigma^2 estimated as 496.8:  log likelihood = -285.01,  aic = 576.01


Importantly, r always fits a model to the mean-corrected time series. That is, the mean of the series is subtracted from the observations before computing the acf and pacf. Hence, if yearly Buffalo snowfall is {B_t}, the output indicates the fitted model is

B_t − 80.88 = 0.3302 (B_{t-1} − 80.88) + e_t.

Rearranging produces the model

B_t = 54.17 + 0.3302 B_{t-1} + e_t.

The parameter estimates are also given in the output. Either form is acceptable as the final model.
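As a quick check of the rearrangement, the constant term is simply the intercept scaled by one minus the ar coefficient:

> 80.8809 * (1 - 0.3302)    # gives 54.17, the constant in the rearranged model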

Example 5.9: Similarly, the ar(2) model is found thus:

> snow.ar2 <- arima(sf, order = c(2, 0, 0))

> snow.ar2

Call:
arima(x = sf, order = c(2, 0, 0))

Coefficients:
         ar1     ar2  intercept
      0.2542  0.2373    81.5422
s.e.  0.1262  0.1262     5.2973

sigma^2 estimated as 469.6:  log likelihood = -283.3,  aic = 574.59

This indicates the ar(2) model is

B_t − 81.54 = 0.2542 (B_{t-1} − 81.54) + 0.2373 (B_{t-2} − 81.54) + e_t.

Rearranging produces

B_t = 41.46 + 0.2542 B_{t-1} + 0.2373 B_{t-2} + e_t.

Comparing the aic for both ar models shows that the ar(2) model is only slightly better on this criterion than the ar(1) model.

The output from using the function ar can also be used to write down the fitted model, but it doesn't estimate the intercept; see Example 5.6. The estimates are also slightly different as a different algorithm is used for estimating the parameters.


Example 5.10: To fit the ma(1) model, use

> snow.ma1 <- arima(sf, order = c(0, 0, 1))

> snow.ma1

Call:
arima(x = sf, order = c(0, 0, 1))

Coefficients:
         ma1  intercept
      0.2104    80.5421
s.e.  0.0982     3.4616

sigma^2 estimated as 517.6:  log likelihood = -286.27,  aic = 578.53

This indicates the ma(1) model is

B_t − 80.54 = e_t + 0.2104 e_{t-1},

or

B_t = 80.54 + e_t + 0.2104 e_{t-1}.

In general, the model is fitted with arima using the order option. The first component in order is the order of the ar component, and the third is the order of the ma component. What is the second term?

The second term is only necessary if the series is non-stationary. Module 7 discusses this issue, where the meaning of the second term in the order parameter will be explained.

5.4 Forecasting using R

Once a model has been found, r can be used to make forecasts. The function to use is predict. The following example shows how to use this function.

Example 5.11: To demonstrate how to use this function, consider again the yearly Buffalo snowfall recently seen in Examples 5.8 to 5.10. The data contain the annual snowfall in Buffalo up to 1972.

Consider just Example 5.8, where an ar(1) model was fitted. To make a forecast, the following commands are used (note that the object snow.ar1 was created earlier by fitting an ar(1) model to the data):


> snow.pred <- predict(snow.ar1, n.ahead = 10)

> snow.pred

$pred
Time Series:
Start = 1973
End = 1982
Frequency = 1
 [1] 90.49534 84.05536 81.92903 81.22696 80.99516
 [6] 80.91862 80.89335 80.88500 80.88225 80.88134

$se
Time Series:
Start = 1973
End = 1982
Frequency = 1
 [1] 22.28815 23.47162 23.59705 23.61068 23.61217
 [6] 23.61233 23.61235 23.61235 23.61235 23.61235

r has made predictions for the next ten years based on the ar(1) model, and has included the standard errors of the forecasts as well. (This makes it easy to compute confidence intervals.) Notice the forecasts from about six years ahead and further are almost the same. This implies that the model has little skill at forecasting that far ahead (which is not surprising). Forecasts a long way into the future tend to be the mean, which is reasonable.
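Approximate 95% confidence intervals follow from the standard errors in the usual way; a minimal sketch using the output above:

> cbind(lower = snow.pred$pred - 1.96 * snow.pred$se,
+       fit = snow.pred$pred,
+       upper = snow.pred$pred + 1.96 * snow.pred$se)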

The data and the forecasts can be plotted together (Fig. 5.7) as follows:

> snow.and.preds <- ts.union(sf, snow.pred$pred)

> plot(snow.and.preds, plot.type = "single",

+ lty = c(1, 2), lwd = 2, las = 1)

Forecasts and plots can be constructed from the other types of models (that is, ma or arma models) in a similar way. The forecasts from each of these models are shown in Table 5.2.

5.5 Summary

This Module considered the identification of ar and ma models for a given set of stationary time series data, primarily using the acf and the pacf. The Akaike Information Criterion (aic) was also considered.


Figure 5.7: Forecasting the Buffalo snowfall data ten years ahead. There is little skill in the forecast after a few years. The forecasts are shown using a dashed line.

Table 5.2: Comparison of the predictions for forecasting ten steps ahead using the ar(1), ar(2) and ma(2) models for the Buffalo snowfall data.

      ar(1)   ar(2)   ma(2)
 1    90.50   92.44   86.18
 2    84.06   91.06   85.90
 3    81.93   86.55   80.90
 4    81.23   85.07   80.90
 5    81.00   83.63   80.90
 6    80.92   82.91   80.90
 7    80.89   82.38   80.90
 8    80.89   82.08   80.90
 9    80.88   81.88   80.90
10    80.88   81.76   80.90


Note that most time series (including climatological time series) are not stationary, but the methods developed so far apply only to stationary data. In Module 7, non-stationary time series will be examined.

5.6 Exercises

Ex. 5.12: Consider a time series {L}. The fitted model is an arma(1, 0) model.

(a) The model is a special case of an arma model. What is another way of expressing the model?

(b) Write this model using the backshift operator.

(c) Sketch the possible sample acf and pacf that lead to the selection of this model.

Ex. 5.13: Consider a time series {Y }. The fitted model is an arma(0, 2) model.

(a) The model is a special case of an arma model. What is another way of expressing the model?

(b) Write this model using the backshift operator.

(c) Sketch the possible sample acf and pacf that lead to the selection of this model.

Ex. 5.14: The mean annual streamflow in Cache River at Forman, Illinois, from 1925 to 1988 is given in the file cacheriver.dat. (The data are not reported by calendar year, but by 'water year'. A water year starts in October of the calendar year one year less than the water year and ends in September of the calendar year the same as the water year. For example, water year 1980 covers the period October 1, 1979 through September 30, 1980. However, this does not affect the model or your analysis.) There are two variables of interest: Mean reports the mean annual flow, and Max reports the maximum flow each water year, each measured in cubic feet per second. (The data have been obtained from USGS [4].)

(a) Use r to find a suitable model for the mean annual stream flow using the acf and pacf.

(b) Use r to find a suitable model for the maximum annual stream flow using the function ar and the sample acf and sample pacf.

(c) Using your chosen model, produce forecasts up to three steps ahead.


Table 5.3: Thirty consecutive days of precipitation in inches at Minneapolis, St Paul. The data should be read across the rows.

0.77  1.74  0.81  1.20  1.95  1.20  0.47  1.43
3.37  2.20  3.00  3.09  1.51  2.10  0.52  1.62
1.31  0.32  0.59  0.81  2.81  1.87  1.18  1.35
4.75  2.48  0.96  1.89  0.90  2.05

Ex. 5.15: Simulate the ar(2) model

R_{n+1} = 0.2 R_n − 0.4 R_{n-1} + e_{n+1}

where {e} ∼ N(0, 4). Compute the sample acf and sample pacf from this simulated data. Do they show the features you expect?

Ex. 5.16: Simulate the ma(2) model

X_t = −0.3 e_{t-1} − 0.2 e_{t-2} + e_t

where {e} ∼ N(0, 8). Compute the sample acf and sample pacf from this simulated data. Do they show the features you expect?

Ex. 5.17: The data in Table 5.3 are thirty consecutive values of March precipitation in inches for Minneapolis, St Paul obtained from Hand et al. [19]. The years are not given. (The data are available in the data file minn.txt.)

(a) Load the data into r and find a suitable model (ma or ar) for the data.

(b) Produce forecasts up to three steps ahead with your chosen model.

Ex. 5.18: The data in the file lake.dat give the mean annual levels at Lake Victoria Nyanza from 1902 to 1921, relative to a fixed reference point (units are not given). The data are from Shaw [41] as quoted in Hand et al. [19]. Explain why an ar, ma or arma model cannot be fitted to this data set.

Ex. 5.19: The Easter Island sea level air pressure anomalies from 1951 to 1995 are given in the data file easterslp.dat, which were obtained from the IRI/LDEO Climate Data Library (http://ingrid.ldgo.columbia.edu/). Find a suitable ar or ma model for the series using the sample acf and pacf. Use this model to forecast up to three months ahead.


Ex. 5.20: The Western Pacific Index (WPI) measures the mode of low-frequency variability over the North Pacific. The time series in the data file wpi.txt is from the Climate Prediction Center [3] and the Climate Diagnostic Centre [2], and gives the monthly WPI from January 1950 to December 2001.

(a) Confirm that the data are approximately stationary by plotting the data.

(b) Find an appropriate model for the data using the acf and pacf.

(c) Find an appropriate model using the ar function.

(d) Which model is your preferred model? Explain your answer.

(e) Find parameter estimates for your preferred model.

Ex. 5.21: The seasonal average SOI from (southern hemisphere) summer 1876 to (southern hemisphere) summer 2001 is given in the file soiseason.dat.

(a) Confirm that the data are approximately stationary by plotting the data.

(b) Find an appropriate model for the data using the acf and pacf.

(c) Find an appropriate model using the ar function.

(d) Which model is your preferred model? Explain your answer.

(e) Find parameter estimates for your preferred model.

Ex. 5.22: The monthly average solar flux from January 1948 to December 2002 is given in the file solarflux.txt.

(a) Confirm that the data are approximately stationary by plotting the data.

(b) Find an appropriate model for the data using the acf and pacf.

(c) Find an appropriate model using the ar function.

(d) Which model is your preferred model? Explain your answer.

(e) Find parameter estimates for your preferred model.

Ex. 5.23: The acf in Fig. 5.8 was produced for a time series {P}. In this question, the Yule–Walker equations are used to form initial estimates for the values of φ.

(a) Use the first three terms in the acf to set up the Yule–Walker equations, and solve for the ar parameters. (Any terms within the confidence limits can be assumed to be zero.)

(b) Repeat, but use four terms of the acf. Compare your answers to those in part (a).


Figure 5.8: The acf for the time series {P}.

Ex. 5.24: The acf in Fig. 5.9 was produced for a time series {Q}. In this question, the Yule–Walker equations are used to form initial estimates for the values of φ.

(a) Use the first three terms in the acf to set up the Yule–Walker equations, and solve for the ar parameters. (Any terms within the confidence limits can be assumed to be zero.)

(b) Repeat, but use four terms of the acf. Compare your answers to those in part (a).

(c) Repeat, but use five terms of the acf. Compare your answers to those in parts (a) and (b).

Ex. 5.25: The acf in Fig. 5.10 was produced for a time series {R}. In this question, the Yule–Walker equations are used to form initial estimates for the values of φ.

(a) Use the first three terms in the acf to set up the Yule–Walker equations, and solve for the ar parameters. (Any terms within the confidence limits can be assumed to be zero.)

(b) Repeat, but use four terms of the acf. Compare your answers to those in part (a).


Figure 5.9: The acf for the time series {Q}.

(c) Repeat, but use five terms of the acf. Compare your answers to those in parts (a) and (b).

5.6.1 Answers to selected Exercises

5.12 (a) ar(1) model.

(b) (1 − φB) L_n = e_n for some value of φ.

(c) The possible sample acf and pacf are shown in Fig. 5.11. The actual details of the plots are not important; what is important is that there is only one significant term in the sample pacf and the sample acf takes a long time to decay (and the term at lag zero in the acf is one, as always).

5.14 (a) The time series is plotted in Fig. 5.12. The data appear to be approximately stationary. The sample acf and pacf are shown in Fig. 5.13. The sample acf has no significant terms, suggesting no particular ma model will be useful. The sample pacf has only one term marginally significant, at a lag of 14.


Figure 5.10: The acf for the time series {R}.

This suggests that there is no obvious ar model. What is the conclusion? The conclusion is that there is no suitable ar or ma model for modelling the data. In fact, it suggests that the observations are actually random, and therefore unpredictable. Using the function ar suggests the same. Notice that a lot of work is sometimes needed to come to the conclusion that no model is useful. This does not mean the exercise has been a waste of time: after all, it is now known that there is no useful ar or ma model, which is in itself useful information. If {S_t} is the mean annual streamflow, then the model is S_t = m + e_t for the appropriate value of m (which will be the mean in this case). Since the mean value of the mean streamflow is 299.3, the model is S_t = 299.3 + e_t. The forecasts up to three steps ahead are all 299.3.

(b) Using the function ar, a suitable ar model is an ar(5) model:

> cr <- read.table("cacheriver.dat", header = TRUE)

> ar(cr$Max)

Call:
ar(x = cr$Max)

Coefficients:
      1        2        3        4        5
-0.2096  -0.1298  -0.3270  -0.2821  -0.2111

Order selected 5  sigma^2 estimated as  4662001

Figure 5.11: A possible sample acf and pacf for an ar(1) model. The acf is shown in the top plot; the pacf in the bottom plot.

In contrast, using the acf and pacf would suggest that the data are random. This is an example of a situation where the human is probably correct, and the computer doesn't actually know best.

(c) The chosen model is S_t = 4133 + e_t, where 4133 is the mean. The forecasts are all 4133.

5.19 The time series is plotted in Fig. 5.14. The data appear to be approximately stationary. The sample acf and pacf are shown in Fig. 5.15.

The sample acf has seven significant terms, suggesting an ma(7) model.


Figure 5.12: A plot of the mean annual streamflow in cubic feet per second at Cache River, Illinois, from 1925 to 1988.


Figure 5.13: The sample acf and pacf of the mean annual streamflow in cubic feet per second at Cache River, Illinois, from 1925 to 1988. Top: the sample acf; Bottom: the sample pacf.


Figure 5.14: A plot of the Easter Island sea level air pressure anomaly from 1951 to 1995.


Figure 5.15: The sample acf and pacf of the Easter Island sea level air pressure anomaly. Top: the sample acf; Bottom: the sample pacf.


It is likely that a more compact ar model can be found. The sample pacf suggests an ar(3) model may be appropriate (the terms at lags 5 and 6 are so marginal they can probably be ignored). The second term is not significant, but the third term in the pacf is significant, so an ar(3) model is needed if the significant term at lag 3 is to be captured. Given the choice of either ma(7) or ar(3), the more compact ar model is to be preferred. The code used to generate the above plots is shown below:

> ei <- read.table("easterslp.dat", header = TRUE)

> eislp <- ts(ei$slpa, start = c(1951, 1), frequency = 12)

> plot(eislp, main = "", las = 1)

> acf(eislp, main = "")

> pacf(eislp, main = "")

To estimate the parameters, use

> eislp.model <- arima(eislp, order = c(3, 0,

+ 0))

> eislp.model$coef

        ar1         ar2         ar3   intercept
 0.25139496  0.02663228  0.16891009 -0.15173751

The fitted ar model is therefore

E_t = 0.251 E_{t-1} + 0.0266 E_{t-2} + 0.1689 E_{t-3} + e_t

if {E_t} is the Easter Island sea level air pressure anomaly.

The one-step ahead forecast is

E_{t+1|t} = 0.251 E_t + 0.0266 E_{t-1} + 0.1689 E_{t-2}.

The last few values in the series are:

> length(eislp)

[1] 540

> eislp[535:540]

[1] 0.1 3.0 2.9 -1.2 3.2 -1.0

So the one-step ahead forecast is

E_{t+1|t} = 0.251 × (−1.0) + 0.0266 × 3.2 + 0.1689 × (−1.2) ≈ −0.369,

and likewise for further steps ahead.


5.23 From the acf, ρ_1 ≈ 0.3, ρ_2 ≈ −0.2 and ρ_3 ≈ 0.2 (and the rest are essentially zero). So the matrix equation is

\[
\begin{pmatrix}
1 & 0.3 & -0.2 \\
0.3 & 1 & 0.3 \\
-0.2 & 0.3 & 1
\end{pmatrix}
\begin{pmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{pmatrix}
=
\begin{pmatrix} 0.3 \\ -0.2 \\ 0.2 \end{pmatrix}
\]

with solution (0.5416667, −0.5, 0.4583333).

In r:

> Mat <- matrix(data = c(1, 0.3, -0.2, 0.3,

+ 1, 0.3, -0.2, 0.3, 1), byrow = FALSE,

+ nrow = 3, ncol = 3)

> rhs <- matrix(nrow = 3, data = c(0.3, -0.2,

+ 0.2))

> sol1a <- solve(Mat, rhs)

> Mat <- matrix(data = c(1, 0.3, -0.2, 0.2,

+ 0.3, 1, 0.3, -0.2, -0.2, 0.3, 1, 0.3,

+ 0.2, -0.2, 0.3, 1), byrow = FALSE, nrow = 4,

+ ncol = 4)

> rhs <- matrix(nrow = 4, data = c(0.3, -0.2,

+ 0.2, 0))

> sol1b <- solve(Mat, rhs)

> sol1b

           [,1]
[1,]  0.7870968
[2,] -0.7677419
[3,]  0.7483871
[4,] -0.5354839

The solutions are very different. In practice, all the available information is used (and hence very large matrices result).


Module 6

Diagnostic Tests

Module contents

6.1 Introduction 106
6.2 Residual acf and pacf 107
6.3 Identification of arma models 110
6.4 The Box–Pierce test (Q-statistic) 116
6.5 The cumulative periodogram 117
6.6 Significance of parameters 118
6.7 Normality of residuals 119
6.8 Alternative models 120
6.9 Evaluating the performance of a model 121
6.10 Summary 121
6.11 Exercises 123
    6.11.1 Answers to selected Exercises 124

Module objectives

Upon completion of this module students should be able to:

• use r to create residual acf and pacf plots;

• understand the information contained in residual acf and pacf plots;

• use the residual acf and pacf to identify arma models;

• write down the arma model from the r output;

• use r to make forecasts from a fitted arma model;

• use r to evaluate the Box–Pierce statistic and Ljung–Box statistic and understand what they imply about the fitted model;

• use r to create a cumulative periodogram and understand what it implies about the fitted model;

• use r to create a Q–Q plot and understand what it implies about the fitted model;

• use r to test the significance of fitted parameters in a fitted model;

• fit competing models to a time series, and use the appropriate tests to compare the possible models;

• select a good model for given stationary time series data.

6.1 Introduction

Once a model is fitted, it is important to know if the model is a 'good' model, or if it can be improved. But first, what is a 'good' model? A good model should be able to capture the important features of the data: in other words, capture the signal. After removing the signal from the time series, only random noise should remain. So to test if a model is a good model or not, the noise is usually tested to ensure it is indeed random (and hence unpredictable). If the residuals are somehow predictable, the model should be refined so the residuals are unpredictable and random.

In addition, a good model is as simple as possible. To ensure the model is as simple as possible, each term in the model should be tested to make sure it is significant; otherwise, the insignificant parameters should be removed from the model.

The process of evaluating a model is called diagnostic testing. A number of diagnostic tests are considered in this Module.


6.2 Residual ACF and PACF

Since the residuals should be white noise (that is, independent and containing no predictable elements), the acf and pacf of the residuals should contain no hint of being forecastable. In other words, the terms of the residual acf and residual pacf should all lie between the (approximate) 95% confidence limits. If not, there are elements in the residuals that are forecastable, and these forecastable aspects should be included in the signal of the model.

Example 6.1: In Sect. 5.3 (p 85), numerous models were fitted to the yearly Buffalo snowfall data first introduced in Example 5.2 (p 77). Two of those models were ar models. Here, consider the ar(1) model. The model was fitted in Example 5.8 (p 86).

There are two ways to do diagnostic tests in r. The first way is to use the tsdiag function; this function plots the standardized residuals in order and plots the acf of the residuals. (It also produces another plot studied in Sect. 6.4.) Here is how the function can be used:

> par(mfrow = c(1, 1))

> bs <- read.table("buffalosnow.dat", header = TRUE)

> sf <- ts(bs$Snow, start = c(1910, 1), frequency = 1)

> ar1 <- arima(sf, order = c(1, 0, 0))

> tsdiag(ar1)

The result is shown in Fig. 6.1. The middle panel in Fig. 6.1 indicates the residual acf is fine and no model could be fitted to the residuals.

The second method involves using the output object from the arima command, as shown below.

> ar1 <- arima(sf, order = c(1, 0, 0))

> names(ar1)

[1] "coef" "sigma2" "var.coef" "mask"[5] "loglik" "aic" "arma" "residuals"[9] "call" "series" "code" "n.cond"[13] "model"

The residuals are given by ar1$resid, or more directly as resid(ar1):

> summary(resid(ar1))

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-65.6600 -14.6800   1.4540  -0.2791  16.8700  47.3700


Figure 6.1: Diagnostic plots after fitting an ar(1) model to the yearly Buffalo snowfall data. This is the output of using the tsdiag command in r. Panels (top to bottom): standardized residuals; ACF of residuals; p values for the Ljung–Box statistic.


Figure 6.2: Diagnostic plots after fitting an ar(1) model to the yearly Buffalo snowfall data. Top: the residual acf; Bottom: the residual pacf.


These residuals can be used to perform diagnostic tests. For example, the residual acf and residual pacf are shown in Fig. 6.2.

The residual acf and pacf indicate the residuals (or the noise) are not forecastable. This suggests the ar(1) model fitted in Sect. 5.3 is adequate, considering this single criterion.
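The commands used to produce Fig. 6.2 are not shown above; they are along these lines:

> acf(resid(ar1))
> pacf(resid(ar1))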

6.3 Identification of ARMA models

Using the residual acf and pacf is often how arma models are fitted. A researcher may look at the sample acf and sample pacf and conclude an ar(2) model is appropriate. After fitting such a model, an examination of the residual acf and residual pacf indicates an ma(1) model now seems appropriate. The best model for the data might then be an arma(2, 1) model. The researcher would hope the residuals from this arma(2, 1) would be white noise. As was alluded to in Sect. 6.2, using the residual acf and pacf allows arma models to be identified.

Example 6.2: In Example 4.3, Chu &amp; Katz [13] were said to fit an arma(1, 1) model to the monthly SOI time series from January 1935 to August 1983. In this example we see how that model may have been chosen. Keep in mind that selecting arma models is very much an art and requires experience to do well.

As with any time series, the data must be stationary (Fig. 1.3, bottom panel, p 8), which they appear to be. The next step is to look at the acf and pacf (Fig. 6.3). The acf suggests a very large order ma model; the pacf suggests possibly an ar(2) model or an ar(4) model. To begin, select an ar(2) model as it is simpler and the terms at lags 3 and 4 are only just significant; if an ar(4) model is necessary, it will become apparent in the diagnostic analysis. The code so far:

> ms <- read.table("soiphases.dat", header = TRUE)

> acf(ms$soi)

> pacf(ms$soi)

> ms.ar2 <- arima(ms$soi, order = c(2, 0, 0))

The residuals can now be examined to see if the fitted ar(2) model is adequate, using the residual acf and pacf from the ar(2) model (Fig. 6.4).

The residual acf suggests the model is reasonable, but the residual pacf suggests at least one ma term at lag 2 may be necessary.


Figure 6.3: The acf and pacf of monthly SOI. Top: the acf; Bottom: the pacf.


Figure 6.4: The acf and pacf of residuals for the ar(2) model fitted to the monthly SOI. Top: the acf; Bottom: the pacf.


(There are significant terms at lags 5, 6, 14 and 15 also; it is more common that observations will be strongly related to more recent observations than to those some time ago. Initially, then, deal with the problem at lag 2; if the problems at the other lags persist, they can be dealt with later.) This is surprising, as we fitted an ar(2) model which we would expect to account for significant terms at lag 2. This suggests trying to add an ma(2) component to the ar(2) component above, making an arma(2, 2) model. Fit this and look again at the residual plots:

> acf(ms.ar2$residuals)

> pacf(ms.ar2$residuals)

> ms.arma22 <- arima(ms$soi, order = c(2, 0,

+ 2))

> acf(ms.arma22$residuals)

> pacf(ms.arma22$residuals)

Again the residual acf looks fine; the residual pacf looks better, but still not ideal (Fig. 6.5). However, the significant term at lag 2 has gone, as well as those at lags 5 and 6; this is more important than the significant terms at lags 14 and higher (as lags 14 or more time steps away are less likely to be of importance). So perhaps the arma(2, 2) model will suffice. Here's the model:

> ms.arma22

Call:
arima(x = ms$soi, order = c(2, 0, 2))

Coefficients:
         ar1      ar2      ma1      ma2  intercept
      0.9192  -0.0473  -0.4273  -0.0131    -0.0903
s.e.  0.3801   0.3250   0.3792   0.1451     0.8158

sigma^2 estimated as 53.19:  log likelihood = -5156.87,  aic = 10325.74

Note the second ar term and the second ma term are both unnecessary (the estimates divided by their standard errors are much less than one in absolute value). This suggests the second ar term and the second ma term should be excluded from the model. In other words, try fitting an arma(1, 1) model.

> ms.arma11 <- arima(ms$soi, order = c(1, 0,

+ 1))

> acf(ms.arma11$residuals)

> pacf(ms.arma11$residuals)


Figure 6.5: The acf and pacf of residuals for the arma(2, 2) model fitted to the monthly SOI. Top: the acf; Bottom: the pacf.


Figure 6.6: The acf and pacf of residuals for the arma(1, 1) model fitted to the monthly SOI. Top: the acf; Bottom: the pacf.

The residual acf and pacf from this model (Fig. 6.6) look very similar to those in Fig. 6.5, suggesting the arma(1, 1) model is as good as the arma(2, 2) model, and also simpler.

Here’s the arma(1, 1) model:

> ms.arma11

Call:
arima(x = ms$soi, order = c(1, 0, 1))

Coefficients:
         ar1      ma1  intercept
      0.8514  -0.3698    -0.1183
s.e.  0.0196   0.0355     0.7927


sigma^2 estimated as 53.25: log likelihood = -5157.63, aic = 10323.26

The aic implies this is a better model than the arma(2, 2) model, and so the arma(1, 1) is appropriate for the data.

6.4 The Box–Pierce test (Q-statistic)

Another test to apply to the residuals is to calculate the Box–Pierce statistic, or the Q-statistic, also known as a Portmanteau test. The purpose of the test is to check if the residuals are independent. The null hypothesis is that the residuals are independent, and the alternative is that they are not independent.

This test computes the sum of the squares of the first m (e.g. m = 15) sample acf coefficients, multiplied by the length of the time series (say N), and calls this Q:

\[ Q = N \sum_{k=1}^{m} \rho_k^2 . \]

If the residuals are taken from a white noise process, the Q statistic will have approximately a chi-squared (χ²) distribution with m − K degrees of freedom, where m is the number of autocorrelation coefficients used in computing the statistic (15 above), and K is the number of autoregressive and moving average parameters estimated for the model. Some authors use m rather than m − K degrees of freedom (as does r). Chatfield [11, p 62] and others note the test is really only useful when the time series has more than 100 observations. An alternative test, better for shorter series, is

\[ Q = N(N+2) \sum_{k=1}^{m} \frac{\rho_k^2}{N-k} , \]

called the Ljung–Box test. Both tests, however, may lack statistical power.

In r, the function Box.test is used for both tests.
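To make the definition concrete, the Box–Pierce statistic can also be computed directly from the sample acf of the residuals; a small sketch (using m = 15, matching the example below):

> r <- acf(resid(ar1), lag.max = 15, plot = FALSE)$acf[-1]    # drop lag 0
> N <- length(resid(ar1))
> N * sum(r^2)    # the Box-Pierce Q statistic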

Example 6.3: In Example 6.1, the yearly Buffalo snowfall data were considered. In that Example, the residual acf and pacf showed the residuals were not forecastable using an ar(1) model. To test if the residuals appear to be independent, use the Box.test function in r. The input variables are the residuals from the fitted model, and the number of terms in the acf to be used to compute the statistic. The default value is one, which is far too few. Typically, a value such as 15 is used (often more if the series is longer or is seasonal, and fewer if the time series is short).


> Box.test(resid(ar1), lag = 15)

Box-Pierce test

data:  resid(ar1)
X-squared = 7.3209, df = 15, p-value = 0.9481

> Box.test(resid(ar1), lag = 15, type = "Ljung-Box")

Box-Ljung test

data:  resid(ar1)
X-squared = 8.1009, df = 15, p-value = 0.9197

The P-value indicates there is no evidence that the residuals are dependent. The conclusion from the Ljung–Box test is similar. This further confirms that the ar(1) model is adequate. If the P-value were below about 0.05, there would be some cause for concern: it would imply that the terms in the acf are too large for white noise.

Note the r function tsdiag produces a plot of the P-values of the Ljung–Box statistic for various values of the lag; see the third (bottom) panel in Fig. 6.1. The dotted line in the plot corresponds to a P-value of 0.05.

6.5 The cumulative periodogram

Another test applied to the residuals is to calculate the cumulative (or integrated) periodogram and apply a Kolmogorov–Smirnov test to check the assumption that the residuals form a white noise process. The r function cpgram performs this test.

The cumulative periodogram from a white noise process will lie close to the central diagonal line. Thus, if the residuals do form a white noise process, as they should do approximately if the model is correct, the cumulative periodogram of the residuals will lie within the indicated bounds with probability 95%.

Example 6.4: In Example 6.1, the yearly Buffalo snowfall data were considered and an ar(1) model fitted. The cumulative periodogram is found as follows:


Figure 6.7: The cumulative periodogram after fitting an ar(1) model to the yearly Buffalo snowfall data.

> cpgram(ar1$resid, main = "")

The result (Fig. 6.7) indicates that the model is adequate, as the periodogram remains between the confidence bands.

6.6 Significance of parameters

The next important test to perform is to check the statistical significance of the parameters. Standard errors of the parameter estimates are computed and shown by r when the model is fitted using arima.

Roughly speaking, a parameter of a model is accepted as significant if the estimated value of the parameter is at least twice the standard error of the estimate in absolute value. This is made a little more precise by using a statistical test (the t-test); however, in practice this amounts to almost the same thing. If a parameter shows up as not significant, it should be removed from the model.


Example 6.5: In Example 6.1, the yearly Buffalo snowfall data were considered. An ar(1) model was fitted to the data. There were two estimated parameters: the constant term in the model, m′, and the ar term. The ar term can be tested for significance. (Recall that the intercept is of no interest to the structure of the model.) The parameter estimates and the standard errors are shown in Example 5.8 (p 86). Dividing the estimate by the standard error produces an approximate t-score. The parameter estimate for the ar term has a t-score greater than two in absolute value, indicating that it is necessary in the model.

The actual t-scores can be computed using the output from the fitting of the model, as shown below.

> coef(ar1)

       ar1  intercept
 0.3301765 80.8808921

> ar1$coef

       ar1  intercept
 0.3301765 80.8808921

> ar1$var.coef

                 ar1   intercept
ar1       0.01528329  0.03975151
intercept 0.03975151 17.40728000

> coef(ar1)/sqrt(diag(ar1$var.coef))

      ar1 intercept
 2.670778 19.385655

The conclusion is that the ar parameter in the model is necessary, and so the ar(1) model seems appropriate.

6.7 Normality of residuals

Throughout, the residuals have been assumed to be normally distributed. To test this, use a Q–Q plot of the residuals. If the residuals do have a normal distribution, the points in the plot will lie close to the diagonal line.


Figure 6.8: The Q–Q plot of the residuals after fitting an ar(1) model to the yearly Buffalo snowfall data.

Example 6.6: Continuing Example 6.1 (the yearly Buffalo snowfall), consider again the fitted ar(1) model. The Q–Q plot of the residuals (Fig. 6.8) indicates the residuals are approximately normally distributed.

> qqnorm(resid(ar1))

> qqline(resid(ar1))

(Note: qqnorm plots the points; qqline draws the diagonal line.)

6.8 Alternative models

The last type of test is to check if an alternative model might be better. This is open-ended, because there is an endless variety of alternative models from which to choose. But, as seen before, there are sometimes a small number of suggested models from which the researcher has to choose. If one model proves to be better using the diagnostic tests, that model should be used. If all perform similarly, choose the simplest model.


But what if there is more than one model that performs similarly, and each is as simple as the other? If you can't decide between them, then it probably doesn't matter!

6.9 Evaluating the performance of a model

Finally, consider an evaluation tool that is slightly different from those previously discussed. The idea is that the model is fitted to the first portion of the data (perhaps half the data), called the training set, and then forecasts are made on the basis of the model fitted to this portion. One-step ahead forecasts are then made for each of the remaining data points (called the testing set) to see how adequately the model can forecast, which, after all, is one of the main reasons for developing time series models.

This approach generally requires a time series with a large number of observations to work well, since splitting the data into two parts halves the amount of information available for model selection. Obviously, smaller portions can be withheld from the model selection stage if necessary, as shown in the next example. The approach discussed here is called cross-validation. The 'best' model is the model whose predictions in the testing set are closest to the actual observed values; this can be summarised by noting the mean and variance of the differences. More sophisticated cross-validation techniques are possible, but not discussed here.

Example 6.7: Because the Buffalo snowfall data is a short series, we withhold only the last ten observations and retain those for model evaluation. The one-step ahead forecasts for the remaining ten observations for each model are shown in Table 6.1.
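The code used for these calculations is not shown in the text; one way to produce the one-step ahead forecasts for the ar(1) model is sketched below. (Refitting the model as each new observation arrives is an assumption about the method used; keeping the parameters fixed from the training set is another reasonable choice.)

> preds <- numeric(10)
> for (i in 1:10) {
+     fit <- arima(window(sf, end = 1962 + i - 1), order = c(1, 0, 0))
+     preds[i] <- predict(fit, n.ahead = 1)$pred
+ }
> errors <- window(sf, start = 1963) - preds
> c(mean = mean(errors), var = var(errors))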

These one-step ahead predictions are plotted in Fig. 6.9. Table 6.1 suggests little difference between the models; the ar(2) model has smaller errors on average (compare the means), but the ar(1) model is more consistent (compare the variances).

6.10 Summary

Before accepting a time series model, it must be tested. The main tests are based on analysing the 'residuals', the one-step ahead forecast errors of the model. Table 6.2 summarises the diagnostic tests discussed.


Table 6.1: The one-step ahead forecasts for the ar(1), ar(2) and ma(2) models after withholding the last ten observations and using the remainder as a training set.

               Prediction from Model:
      Actual   ar(1)   ar(2)   ma(2)
 1     89.80   87.08   91.67   86.91
 2     71.50   83.21   88.52   84.59
 3     70.90   77.10   80.89   77.14
 4     98.30   76.89   75.88   73.97
 5     55.50   86.05   82.53   85.08
 6     66.10   71.75   79.17   79.26
 7     78.40   75.29   70.43   66.64
 8    120.50   79.40   76.31   79.19
 9     97.00   93.46   90.05   95.84
10    110.00   85.61   95.39   93.69

Errors:  Mean:   4.215   2.717   3.570
         Var:    414.9   444.7   427.0

Figure 6.9: The cross-validation one-step ahead predictions for the ar(1), ar(2) and ma(2) models applied to the Buffalo snowfall data.


Table 6.2: A summary of the diagnostic tests to use on given time series models.

Assumption to Test               Test to Use                  R Command to Use
Residuals unforecastable         Residual acf &amp; pacf          use acf and pacf on residuals
Residuals independent            Box–Pierce test              Box.test
Residuals white noise            cumulative periodogram       cpgram
Simple model                     significance of parameters   output from arima
Residuals normally distributed   Q–Q plot of residuals        qqnorm

6.11 Exercises

Ex. 6.8: In Exercise 4.22, an arma(1, 1) model was discussed that was fitted by Sales, Pereira &amp; Vieira [40] to the natural monthly average flow rate (in cubic metres per second) of the reservoir of Furnas on the Grande River in Brazil. Table 4.1 (p 70) gave the parameter estimates and their standard errors. Determine if each parameter is significant at the 95% level.

Ex. 6.9: In Exercise 5.14 (p 91), data concerning the mean annual streamflow from 1925 to 1988 in Cache River at Forman, Illinois, were given in the file cacheriver.dat. There are two variables of interest: Mean reports the mean annual flow, and Max reports the maximum flow each water year, each measured in cubic feet per second. Perform the diagnostic checks to see if the model found for the variable Mean in that exercise is adequate.

Ex. 6.10: In Exercise 5.19 (p 92), the Easter Island sea level air pressure anomalies from 1951 to 1995, given in the data file easterslp.dat, were analysed. An ar(3) model was considered a suitable model. Perform the appropriate diagnostic checks on this model, and determine if the model is adequate.

Ex. 6.11: In Exercise 4.4, Davis &amp; Rappoport [15] were reported to use an arma(2, 2) model for modelling the Palmer Drought Index, {Y_t}.


Katz &amp; Skaggs [26] claim the equivalent ar(2) model is almost as good as the model given by Davis &amp; Rappoport, yet has half the number of parameters. For this reason, they prefer the ar(2) model.

Load the data into r and decide on the best model. Give reasons for your solution, and include diagnostic analyses.

Ex. 6.12: In Exercise 5.20, a model was fitted to the Western Pacific Index (WPI). The time series in the data file wpi.txt gives the monthly WPI from January 1950 to December 2001. Perform some diagnostic analyses and select the 'best' model for the data, justifying your choice and illustrating your answer with appropriate diagrams.

Ex. 6.13: In Exercise 5.21, the seasonal average SOI from (southern hemisphere) summer 1876 to (southern hemisphere) summer 2001 was studied. The data are given in the file soiseason.dat. Fit an appropriate model to the data, justifying your choice and illustrating your answer with appropriate diagrams.

Ex. 6.14: In Exercise 5.22, the monthly average solar flux from January 1948 to December 2002 was studied. The data are given in the file solarflux.txt. Fit an appropriate model to the data, justifying your choice and illustrating your answer with appropriate diagrams.

Ex. 6.15: The data file rionegro.dat contains the average monthly heights of the Rio Negro river at Manaus from 1903–1992 in metres (relative to an arbitrary reference point). Find a suitable model for the time series, including a diagnostic analysis of possible models.

6.11.1 Answers to selected Exercises

6.9 The model chosen for the variable Mean was simply that the data were random. Hence the residual acf and residual pacf are just the sample acf and sample pacf as shown in Fig. 5.13. The cumulative periodogram shows no problems with this model; see Fig. 6.10. The Box–Pierce test likewise indicates no problems. The Q–Q plot is not ideal though (and looks better if an ar(3) model is fitted). Here is some of the code:

> Box.test(rflow)

Box-Pierce test

data:  rflow
X-squared = 0.865, df = 1, p-value = 0.3523

6.15 First, load and prepare the data:


Figure 6.10: The cumulative periodogram of the annual streamflow at Cache River; the plot suggests that the data are random. However, the Q–Q plot suggests that the data are perhaps not normally distributed.

> RN <- read.table("rionegro.dat", header = TRUE)

> ht <- ts(RN$Height, start = c(RN$Year[1],

+ RN$Month[1]), frequency = 12)

A plot of the data shows the series is reasonably stationary (Fig. 6.11).

See the acf and pacf (Fig. 6.12); the acf suggests a very large order ma model, while the pacf suggests an ar(3) model. Decide to start with the ar(3) model!

> rn.ar3 <- arima(ht, order = c(3, 0, 0))

The residual acf and pacf are pretty good, if not perfect (Fig. 6.13); there are a couple of components outside the approximate confidence limits, but probably nothing of importance.

Let’s examine more diagnostics (Fig. 6.14); the cumulative periodogramnlooks fine, but the normal probability plot looks bad. However, a his-togram shows the residuals have a decent distribution that loos slightlynormal, so things aren’t so bad (try hist(resid(rn.ar3))).

So, for some final diagnostics:

> Box.test(resid(rn.ar3))

Box-Pierce test

data:  resid(rn.ar3)
X-squared = 0.1847, df = 1, p-value = 0.6674


> plot(ht)

Figure 6.11: A plot of the Rio Negro river data.

> par(mfrow = c(1, 2))

> acf(ht)

> pacf(ht)

Figure 6.12: The acf and pacf of the Rio Negro river data.


> par(mfrow = c(1, 2))

> acf(resid(rn.ar3))

> pacf(resid(rn.ar3))

Figure 6.13: The residual acf and pacf of the Rio Negro river data after fitting the ar(3) model.

> par(mfrow = c(1, 2))

> cpgram(resid(rn.ar3))

> qqnorm(resid(rn.ar3))

> qqline(resid(rn.ar3))

Figure 6.14: Further diagnostic plots of the Rio Negro river data after fitting the ar(3) model: the cumulative periodogram and the normal Q–Q plot of the residuals.


> coef(rn.ar3)/sqrt(diag(rn.ar3$var.coef))

        ar1         ar2         ar3   intercept
38.72317372 -11.39726406  6.14219789 -0.01352855

The Box test shows no problems, and all the ar parameters seem necessary. This model seems fine (if not perfect).

> rn.ar3

Call:
arima(x = ht, order = c(3, 0, 0))

Coefficients:
         ar1      ar2     ar3  intercept
      1.1587  -0.4985  0.1837    -0.0020
s.e.  0.0299   0.0437  0.0299     0.1462

sigma^2 estimated as 0.567:  log likelihood = -1226.89,  aic = 2463.77


Module 7

Non-Stationary Models

Module contents

7.1 Introduction 130
7.2 Non-stationarity in the mean 131
7.3 Non-stationarity in the variance 134
7.4 arima models 134
    7.4.1 Notation 134
    7.4.2 Estimation 136
    7.4.3 Backshift operator 138
7.5 Seasonal models 138
    7.5.1 Identifying the season length 141
    7.5.2 Notation 145
    7.5.3 The backshift operator 147
    7.5.4 Estimation 149
7.6 Forecasting 150
7.7 Diagnostics 151
7.8 A summary of model fitting 154
7.9 A complete example 156
7.10 Summary 161
7.11 Exercises 161
    7.11.1 Answers to selected Exercises 164


Module objectives

Upon completion of this module students should be able to:

• identify time series that are not stationary in the mean;

• use differences to remove non-stationarity in the mean;

• identify time series that are not stationary in the variance;

• use logarithms to remove non-stationarity in the variance;

• understand what is meant by an arima model;

• use the arima(p, d, q) notation to define arima models;

• develop forecasting formulae for arima models;

• develop confidence intervals for forecasts for arima models;

• write arima models using the backshift operator;

• identify seasonal time series;

• identify the length of a season in a seasonal time series;

• use the arima(p, d, q) (P,D,Q)s notation to define seasonal arima models;

• develop forecasting formulae for seasonal arima models;

• develop confidence intervals for forecasts for seasonal arima models;

• write seasonal arima models using the backshift operator;

• use r to estimate the parameters in seasonal arima models;

• use r to fit an appropriate Box–Jenkins model to time series data.

7.1 Introduction

Up to now, all the time series considered have been assumed stationary. This assumption was crucial to the definitions of the autocorrelation and partial autocorrelation. In practice, however, many time series are not stationary. In this Module, methods for identifying non-stationary series are considered, and then models for these series are examined.

In this Module, three types of non-stationarity are discussed:


1. series that have a non-stationary mean;

2. series that have a non-stationary variance; and

3. series with a periodic or seasonal component.

Many series may exhibit more than one of these types of non-stationarity.

7.2 Non-stationarity in the mean

One common type of non-stationarity is a non-stationary mean. Typically, the mean of the series tends to increase or fluctuate. This is easiest to identify by looking at a plot of the data. Sometimes, the sample acf may indicate a non-stationary mean if the terms take a long time to decay to zero.

If a dataset exhibits a non-stationary mean, the solution is to take differences. That is, if a time series {X} is non-stationary in the mean, compute the differences Yn = Xn − Xn−1. Generally, this makes any time series with a non-stationary mean into a time series {Y } with a stationary mean. Occasionally, the differenced time series {Y } will also be non-stationary in the mean, and another set of differences will be needed. It is rare to ever need more than two sets of differences. When differences of this kind are taken (soon another type of difference is considered), this is referred to as taking first differences.

Note that each time a set of differences is calculated, the new series has one less observation than the original. In r, differences are created using diff.
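As a minimal sketch with a made-up series (the numbers are purely illustrative), diff behaves as follows:

> x <- ts(c(5, 8, 6, 9, 12, 15))
> diff(x)                   # first differences: 3, -2, 3, 3, 3
> diff(x, differences = 2)  # a second set of differences: -5, 5, 0, 0

Each set of differences shortens the series by one observation.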

Example 7.1:

Consider the annual rainfall near Wendover, Utah, USA. The data appear to have a non-stationary mean (Fig. 7.1), as the mean goes up and down, though it is not too severe. To check this, a smoothing filter was applied, computing the mean of each set of six observations at a time. This smooth (Fig. 7.1, top panel) suggests the mean is probably non-stationary, as this line is not (approximately) constant. The following code fragment shows how the differenced series was found in r.

> rfdata <- read.table("./rainfall/wendover.dat",

+ header = TRUE)

> rf <- rfdata[(rfdata$Year > 1907) & (rfdata$Year <

+ 1999), ]



Figure 7.1: The annual rainfall near Wendover, Utah, USA in mm. Top: the original data is plotted with a thin line, and a smooth in a thick line, indicating that the mean is non-stationary. Bottom: the differenced data is plotted with a thin line, and a smooth in a thick line. Since the smooth is relatively flat, the differenced data has a stationary mean.

> ann.rain <- tapply(rfdata$Rain, list(rfdata$Year),

+ sum)

> ann.rain <- ts(as.vector(ann.rain), start = rfdata$Year[1],

+ end = rfdata$Year[length(rfdata$Year)])

> plot(ann.rain, type = "l", las = 1, ylab = "Annual rainfall (in mm)",

+ xlab = "Year")

> ar.l <- lowess(ann.rain, f = 0.1)

> lines(ar.l, lwd = 2)
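The differencing step itself, used to produce the bottom panel, is not shown above; presumably it is simply the following (the axis labels are taken from the figure):

> ann.rain.d <- diff(ann.rain)
> plot(ann.rain.d, type = "l", las = 1,
+      ylab = "Differences of Annual rainfall (in mm)", xlab = "Year")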

If differences are applied, the series appears more stationary in the mean (Fig. 7.1, bottom panel).



Figure 7.2: The Atlantic Multidecadal Oscillation from 1948 to 1994. Top: the plot shows the data is not stationary. Middle: the first differences are also not stationary. Bottom: taking two sets of differences has produced a stationary series.

Example 7.2: Enfield et al. [16] used the Kaplan SST to compute a ten-year running mean of detrended Atlantic SST anomalies north of the equator. This data series is called the Atlantic Multidecadal Oscillation (AMO). The data, obtained from the NOAA Climatic Diagnostic Center [2], are stored as amo.dat. A plot of the data shows the series is non-stationary in the mean (Fig. 7.2, top panel). The first differences are also non-stationary (Fig. 7.2, middle panel). Taking one more set of differences produces approximately stationary data (Fig. 7.2, bottom panel).

Here is the code used.

> amo <- read.table("amo.dat", header = TRUE)

> amo <- ts(amo$AMO, start = c(amo$Year[1]),


+ frequency = 1)

> par(mfrow = c(3, 1))

> plot(amo, main = "AMO", las = 1)

> damo <- diff(amo)

> plot(damo, main = "One difference of AMO",

+ las = 1)

> ddamo <- diff(damo)

> plot(ddamo, main = "Two differences of AMO",

+ las = 1)

7.3 Non-stationarity in the variance

A less common type of non-stationarity with climate data is non-stationarity in the variance. A non-stationary variance is a common difficulty, however, in many business applications. Generally, a series that is non-stationary in the variance has a variance that gets larger over time (that is, as time progresses, the observations become more variable). In these cases, taking logarithms of the time series will usually help. Another possible difficulty is that the time series contains negative values (for example, the SOI series). In these cases, add a sufficiently large constant to the data (which won't affect the variance), and then take logarithms. If the time series is non-stationary in the mean and the variance, logs should be taken before differences (to avoid taking logs of negative values).
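A hedged sketch of this recipe in r, for a hypothetical series x that contains negative values:

> offset <- abs(min(x)) + 1  # shift so every value is positive
> logx <- log(x + offset)    # logs to stabilise the variance
> dlogx <- diff(logx)        # then difference, if the mean is also non-stationary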

7.4 ARIMA models

Once a non-stationary time series has been made stationary, it can be analysed like any other (stationary) time series. These models, which include some differencing, are called Autoregressive Integrated Moving Average models, or arima models.

7.4.1 Notation

ar, ma or arma models in which differences have been taken are collectively called autoregressive integrated moving average models, or arima models. Consider an arima model in which the original time series has been differenced d times (d is mostly 1, sometimes 2, and almost never greater than 2). If this now-stationary time series can be well modelled by an arma(p, q) model, then the final model is said to be an arima(p, d, q) model, where d is the number of sets of differences needed to make the series stationary.



Figure 7.3: The differences of the annual rainfall near Wendover, Utah, USA in mm. Top: the sample acf. Bottom: the sample pacf.


Example 7.3: In Example 7.1, the annual rainfall near Wendover, Utah, say {Xn}, was considered. The time series was non-stationary, and differences were taken. The differenced time series, say {Yn}, is now stationary. The sample acf and pacf of the stationary series {Yn} are shown in Fig. 7.3.

The sample acf suggests an ma(1) model is appropriate (again recalling that the term at lag 0 is always one), while the sample pacf suggests an ar(2) model is appropriate. The AIC recommends an ar(1) model. If the ar(2) model is chosen, the model would be an arima(2, 1, 0). If the ma(1) model is chosen, the model would be an arima(0, 1, 1). If the ar(1) model is chosen, the model would be an arima(1, 1, 0), since there is one set of differences.



Here is some of the code used:

> rf <- read.table("wendover.dat", header = TRUE)

> rf <- rf[(rf$Year > 1907) & (rf$Year < 1999),

+ ]

> ann.rain <- tapply(rf$Rain, list(rf$Year),

+ sum)

> ann.rain <- ts(as.vector(ann.rain), start = rf$Year[1],

+ end = rf$Year[length(rf$Year)])

> plot(ann.rain, type = "n", las = 1, ylab = "Annual rainfall (in mm)",

+ xlab = "Year")

> lines(ann.rain)

> ann.rain.d <- diff(ann.rain)

> plot(ann.rain.d, type = "l", las = 1, ylab = "Differences of Annual rainfall (in mm)",

+ xlab = "Year")

> acf(ann.rain.d, main = "")

> pacf(ann.rain.d, main = "")

Example 7.4: An example of an arima(2, 1, 1) model is

Wt = 0.3Wt−1 − 0.1Wt−2 + et − 0.24et−1,

where Wt = Yt − Yt−1 is the stationary, differenced time series. The model for the original series, {Yt}, is therefore

(Yt − Yt−1) = 0.3(Yt−1 − Yt−2)− 0.1(Yt−2 − Yt−3) + et − 0.24et−1

⇒ Yt = 1.3Yt−1 − 0.4Yt−2 + 0.1Yt−3 + et − 0.24et−1.
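To build intuition, one could simulate from this model using r's arima.sim; a sketch (the seed and length are arbitrary):

> set.seed(42)
> w <- arima.sim(model = list(order = c(2, 1, 1),
+                ar = c(0.3, -0.1), ma = -0.24), n = 200)
> plot(w)  # the simulated series wanders, as expected when d = 1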

7.4.2 Estimation

The r function arima can be used to fit arima models, with only a simple change to what was seen for stationary models.

Example 7.5: In Example 7.3, three models are considered. To fit the arima(0, 1, 1) model, use the code

> ann.rain.ma1 <- arima(ann.rain, order = c(0,

+ 1, 1))

> ann.rain.ma1


Call:
arima(x = ann.rain, order = c(0, 1, 1))

Coefficients:
          ma1
      -0.7036
s.e.   0.1208

sigma^2 estimated as 5548:  log likelihood = -516,  aic = 1035.99

We have now seen what the second element of order is for: it indicates the order of the differencing necessary to make the series stationary. The fitted model for the first differences of the annual rainfall series is therefore Wt = −0.7036et−1 + et, where Wt = Yt − Yt−1 and {Y } is the original time series of annual rainfall (since first differences were taken). This can be written as

Yt − Yt−1 = −0.7036et−1 + et

and further unravelled to

Yt = Yt−1 − 0.7036et−1 + et.

To fit the arima(1, 1, 0) model, proceed as follows:

> ann.rain.ar1 <- arima(ann.rain, order = c(1,

+ 1, 0))

> ann.rain.ar1

Call:
arima(x = ann.rain, order = c(1, 1, 0))

Coefficients:
          ar1
      -0.4494
s.e.   0.0933

sigma^2 estimated as 6296:  log likelihood = -521.46,  aic = 1046.92

So the model for the first difference of annual rainfall is

Wt = −0.4494Wt−1 + et,

where Wt = Yt − Yt−1 and {Y } is the original rainfall series. This can also be expressed as

Yt = 0.5506Yt−1 + 0.4494Yt−2 + et

in terms of the original rainfall series.


7.4.3 Backshift operator

When differences are taken of a time series {Xt}, this is written using the backshift operator as Yt = (1−B)Xt.

Example 7.6: In Example 7.4 an arma(2, 1) model was fitted to a stationary series {Wt} (hence making an arima(2, 1, 1) model). The model can be written using the backshift operator as

(1− 0.3B + 0.1B2)Wt = (1− 0.24B)et.

Since {Wt} is a differenced time series, Wt = (1−B)Yt, so the model for {Yt} written using the backshift operator is

(1− 0.3B + 0.1B2)(1−B)Yt = (1− 0.24B)et.

This expression can be expanded to give

(1− 1.3B + 0.4B2 − 0.1B3)Yt = (1− 0.24B)et,

producing the same model as before.
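The expansion can be checked numerically in r: multiplying backshift polynomials is a convolution of their coefficient vectors (listed in increasing powers of B).

> convolve(c(1, -0.3, 0.1), rev(c(1, -1)), type = "open")
[1]  1.0 -1.3  0.4 -0.1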

Example 7.7: In Example 7.2, the AMO from 1948 to 1994 was examined. Two sets of differences were required to make the data stationary. Looking at the sample acf and sample pacf of the twice-differenced data shows that no model is necessary. The fitted model is therefore an arima(0, 2, 0) model. Using the backshift operator, the model is (1−B)2At = et, where {A} is the AMO series.

7.5 Seasonal models

The most common type of non-stationarity is when the time series exhibits a 'seasonal' pattern. 'Seasonal' does not necessarily have anything to do with the seasons of Winter, Spring, and so on. It means that there is some kind of regular pattern in the data. This type of non-stationarity is very common in climatological and meteorological applications, where there is often an annual pattern evident in the data. Seasonal data is time series data that shows regular fluctuation, usually aligned with some natural time period (not just the actual seasons of Winter, Spring, etc). The length of a season is the time period over which the pattern repeats. For example, monthly data might show an annual pattern with a season of length 12, as the data may have a pattern that repeats each year (that is, each twelve months). These patterns usually appear in the sample acf and pacf.


Example 7.8:

The average monthly sea level at Darwin, Australia (in millimetres), obtained from the Joint Archive for Sea Level [1], is plotted in the top panel of Fig. 7.4. The sample acf and sample pacf are also shown.

The code used to produce this figure is given below:

> sealevel <- read.table("darwinsl.txt", header = TRUE)

> sl <- ts(sealevel$Sealevel/1000, start = c(1988,

+ 1), end = c(1999, 12), frequency = 12)

> plot(sl, ylab = "Sea level (in m)", las = 1)

> acf(sl, lag.max = 40, main = "")

> pacf(sl, lag.max = 40, main = "")

The data show a seasonal pattern—the sea level has a regular rise and fall according to the months of the year (as expected). The length of the season is therefore twelve, since the pattern repeats every twelve observations. This seasonality also appears in the sample acf.

Seasonal time series have a non-stationary mean, but the non-stationarity is of a regular kind (that is, every year or every month a cycle repeats). These types of time series can be represented using a model that explicitly allows for the seasonality.

In seasonal models, ar and ma components can be introduced at the value of the season. For example, the model

Xt = et − 0.23Xt−12

might be used to model monthly data (where the season length is twelve, as the data might be expected to repeat each year). This model explicitly models the seasonal pattern by incorporating an autoregressive term at a lag of twelve.

This model is a seasonal ar model. More generally, a seasonal arma model may have the usual non-seasonal ar and ma components (or the "ordinary" ar and ma components) but also seasonal ar and ma components. The model in the previous paragraph is a seasonal ar(1) model, since the one ar term is one season before. Similarly, for a time series with a season of length twelve, an example of a seasonal ar(2) model is

Yn+1 = en+1 + 0.17Yn−11 − 0.55Yn−23,



Figure 7.4: The monthly average sea level at Darwin, Australia in metres. Top: the data are plotted. Centre: the sample acf. Bottom: the sample pacf.


since the first ar term is one 'season' (12 time steps) behind, and the second ar term is two 'seasons' (2 × 12 = 24 time steps) behind.

Sometimes it is also necessary to take seasonal differences. If a time series {Xt} shows a very strong seasonal component with a season of length s, then a seasonal difference of the form

Yt = Xt −Xt−s

is used to create a more stationary time series. Again, the r function diff is used, with an optional parameter given to indicate the season length.

Example 7.9: The Darwin average monthly sea level data (Example 7.8, p 139) has a strong seasonal pattern. Taking seasonal differences seems appropriate:

> dsl <- diff(sl, 12)

> plot(dsl, las = 1)

The plot of the seasonally differenced data (Fig. 7.5, top panel) suggests the series is still possibly non-stationary in the mean, so taking ordinary (non-seasonal) differences also seems appropriate:

> ddsl <- diff(dsl)

> plot(ddsl, las = 1)

The plot of the twice-differenced data (Fig. 7.5, bottom panel) is now approximately stationary.

Example 7.10: Karner & Rannik [25] use the seasonal ma model

xt − xt−12 = et −Θ1et−12

to model cloud amount, where seasonal differences have initially been taken.
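A model of this form could be fitted in r as follows; a sketch, where cloud is a hypothetical monthly cloud-amount series (the seasonal argument of arima is described in Section 7.5.4):

> fit <- arima(cloud, order = c(0, 0, 0),
+              seasonal = list(order = c(0, 1, 1), period = 12))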

7.5.1 Identifying the season length

Sometimes it is easy to identify the length of a season, as it is aligned with a yearly or seasonal cycle. If this is the case, the season length should be made to align with the natural season. But for many climatological variables this is not true. In these cases, identifying the season length can be difficult.



Figure 7.5: The differences in monthly average sea level at Darwin, Australia in metres (see also Fig. 7.4). Top: the seasonal differences are plotted, while in the bottom plot, both seasonal and non-seasonal differences have been taken.



Figure 7.6: The Quasi-Biennial Oscillation (QBO) from 1955 to 2001. Top: the QBO is plotted, showing cyclic behaviour. Middle: the spectrum is shown. Bottom: the spectrum is shown again, but has been smoothed.

To help identify the season length, a periodogram, or spectrum, is used. The spectrum examines many frequencies in the data and computes the strength of each possible frequency. Hence any frequency that is very strong is an indication of the period of the season.

Example 7.11: The quasi-biennial oscillation, or QBO, is calculated at the Climate Diagnostic Centre from the zonal average of the 30 mb zonal wind at the equator. The monthly data have a distinct seasonal pattern (Fig. 7.6, top panel), but it is not aligned with years or seasons, or anything else useful.

Using r, the spectrum is found using the function spectrum as follows:

> qbo <- ts(qbo, start = c(1955, 1), end = c(2001,


+ 12), frequency = 12)

> qbo.spec <- spectrum(qbo)

This spectrum (Fig. 7.6, centre panel) is very noisy. It is best to smooth the plot, as shown below. (You do not need to understand what this smoother does or how it works; the point is that it smooths the plot.)

> k5 <- kernel("daniell", 5)

> qbo.spec <- spectrum(qbo, kernel = k5)

The result is a much smoother spectrum (Fig. 7.6, bottom panel). The season length is identified as the frequency where the spectrum is at its greatest. This can also be done in r:

> max.spec <- max(qbo.spec$spec)

> max.freq <- qbo.spec$freq[max.spec == qbo.spec$spec]

> max.freq

[1] 0.4375

> 1/max.freq

[1] 2.285714

The peak frequency corresponds to a period of about 2.3 "seasons", or 2.3 years in this case.

Random numbers are expected to have a spectrum that is fairly constant for all frequencies; see the following example.

Example 7.12: In this example, we look at the spectrum of four sets of random numbers from a Normal distribution.

> set.seed(102030)

> par(mfrow = c(2, 2))

> k5 <- kernel("daniell", 5)

> for (i in (1:4)) {

+ random.numbers <- rnorm(1000)

+ spectrum(random.numbers, kernel = k5)

+ }

In the output (Fig. 7.7), no frequencies stand out as being much stronger than others.



Figure 7.7: Four replications of a spectrum from 1000 Normal random numbers. There is no evidence of one frequency dominating.

7.5.2 Notation

These models are very difficult to write down. There are a number of parameters that must be included:

1. The order of non-seasonal (or ordinary) differencing, d;

2. The order of the non-seasonal ar model, p;

3. The order of the non-seasonal ma model, q;

4. The length of a season, s;

5. The order of seasonal differencing, D;

6. The order of the seasonal ar model, P ;

7. The order of the seasonal ma model, Q.

Note that any one model should only have a few parameters, and so some of p, q, P and Q are expected to be zero. In addition, d + D is most often one, sometimes two, and rarely greater than two. These parameters are summarized by writing a model down as follows: a model with all of the above parameters would be written as an arima(p, d, q) (P,D,Q)s model.


Example 7.13: Consider a time series {Rt}. The series is non-stationary, and ordinary differences and seasonal differences (period 7) are taken to make the series stationary in the mean. An ordinary ar(2) model and seasonal ma(1) model is then fitted. The final model is an arima(2, 1, 0) (0, 1, 1)7 model.

Example 7.14: Consider a time series {Zt}. The series is non-stationary, and two sets of seasonal differences (period 12) are taken to make the series stationary in the mean. An ordinary arma(1, 1) model is then fitted. The final model is an arima(1, 0, 1) (0, 2, 0)12 model.

Example 7.15: Consider a time series {Pn}. The series is non-stationary, and one set of seasonal differences (period 4) is taken to make the series stationary in the mean. An ordinary ma(1) model and seasonal ar(2) model is then fitted. The final model is an arima(0, 0, 1) (2, 1, 0)4 model.

Example 7.16: Consider a time series {An}. The series is non-stationary, and one set of seasonal differences (period 12) is taken to make the series stationary in the mean. The data then appears to be white noise (that is, the acf and pacf suggest no model to be fitted). The final model is an arima(0, 0, 0) (0, 1, 0)12 model.

When writing seasonal components of the model, it is usual to write seasonal ar terms with a capital phi: Φ. Likewise, seasonal ma models are written using a capital theta: Θ. This is in line with using capital P and Q for the orders of the seasonal components.

Example 7.17: In Example 7.8 (p 139), the average monthly sea level at Darwin was analysed. In Example 7.9 (p 141), seasonal differences were taken to make the data stationary.

The seasonally differenced data (Fig. 7.5, top panel) was non-stationary. The seasonally differenced and non-seasonally differenced data (Fig. 7.5, bottom panel) looks approximately stationary. The sample acf and pacf of this series are shown in Fig. 7.8.



Figure 7.8: The sample acf and pacf for the twice-differenced monthly average sea level at Darwin, Australia in metres. Top: the sample acf; Bottom: the sample pacf of the twice-differenced data are shown.

For the non-seasonal components of the model, the sample acf suggests no model is necessary (the one component above the dotted confidence interval can probably be ignored—it is just over the approximate lines and is at a lag of two). The sample pacf suggests no model is needed either—though there is again a marginal component at a lag of two. (It may be necessary to include these terms later, as will become evident in the diagnostic analysis, but it is unlikely.)

For the seasonal model, the sample pacf decays very slowly (there is one at seasonal lag 1, lag 2 and lag 3), suggesting a large number of seasonal ar terms would be necessary. In contrast, the sample acf suggests one seasonal ma term is needed. In summary, two differences have been taken (so d = 1 and D = 1). No non-seasonal model seems necessary (so p = q = 0), but a seasonal ma(1) term is suggested (so P = 0 and Q = 1). So the model is arima(0, 1, 0) (0, 1, 1)12, and there is only one parameter to estimate (the seasonal ma(1) parameter).

7.5.3 The backshift operator

Earlier, it was shown the backshift operator equivalent of taking non-seasonal differences was (1 − B). Similarly, if the series {Xt} is seasonally differenced with a season of length s, then the backshift operator equivalent is (1−Bs)Xt.


The general form of an arima(p, d, q) (P,D,Q)s model is written using the backshift operator as

(1−B)d(1−Bs)Dφ(B)Φ(B)Xt = θ(B)Θ(B)et,

where φ(B) is the non-seasonal ar component written using the backshift operator, Φ(B) is the seasonal ar component, θ(B) is the non-seasonal ma component, and Θ(B) is the seasonal ma component (each written using the backshift operator). The terms in the seasonal components decay in steps of the season length (that is, if the season has a length of seven, Φ(B) = 1 + 0.31B7 − 0.19B14 is a typical term).

Example 7.18: In Example 7.17, one model suggested for the average monthly sea level at Darwin was arima(0, 1, 0) (0, 1, 1)12. Using the backshift operator, this model is

(1−B)(1−B12)Xt = Θ(B)et,

where Θ(B) = 1 + Θ1B12. Using r, the unknown parameter is calculated to be −0.9996, so the model is

(1−B)(1−B12)Xt = (1− 0.9996B12)et.

Example 7.19: Maier & Dandy [31] use arima models to model the daily salinity at Murray Bridge, South Australia, from 1 Jan 1987 to 31 Dec 1991. They examined numerous models, including some models not in the Box–Jenkins methodology. The best Box–Jenkins models were those based on one set of non-seasonal differences, and one or two sets of seasonal differences, with a season of length s = 365. One of their final models was the arima(1, 1, 1) (1, 2, 0)365 model

(1−B)(1−B365)2(1 + 0.267B)(1− 0.513B365)Xt

= (1− 0.455B)et,

where the daily salinity is {Xt}.


7.5.4 Estimation

Estimation of seasonal arima models is quite tricky, as there are many parameters that could be specified: the ar and ma components, both seasonally and non-seasonally. This is part of the help for the arima function:

arima(x, order = c(0, 0, 0),
      seasonal = list(order = c(0, 0, 0), period = NA))

The input order has been used previously; to also specify seasonal components, the input seasonal must be used.

Example 7.20: In Example 7.17 (p 146), an arima(0, 1, 0) (0, 1, 1)12 was suggested for the average monthly sea level at Darwin. This model is fitted in r as follows:

> dsl.small <- arima(sealevel$Sealevel, order = c(0,

+ 1, 0), seasonal = list(order = c(0, 1,

+ 1), period = 12))

> dsl.small

Call:
arima(x = sealevel$Sealevel, order = c(0, 1, 0),
    seasonal = list(order = c(0, 1, 1), period = 12))

Coefficients:
         sma1
      -0.9996
s.e.   0.2305

sigma^2 estimated as 1013:  log likelihood = -654.01,  aic = 1312.02

So the fitted model is written

(1−B)(1−B12)Dt = (1− 0.99957B12)et,

where (1−B) is the ordinary difference and (1−B12) the seasonal difference. Expanding gives

Dt − Dt−1 − Dt−12 + Dt−13 = et − 0.99957et−12.

This means the estimated seasonal ma parameter is −0.99957.


7.6 Forecasting

The principles of forecasting used earlier apply to arima and seasonal arima models without significant differences. However, it is necessary to write the model without using the backshift operator first, which can be quite tedious.

Example 7.21: Consider the arima(1, 0, 0) (0, 1, 1)4 model Wn = 0.20Wn−1 + en − 0.16en−4, where Wn = Zn − Zn−4 is the seasonally differenced series. Using the backshift operator, the model is

(1−B4)(1− 0.20B)Zt = (1− 0.16B4)et,

where the (1−B4) factor is the seasonal difference.

Expanding the backshift terms gives

(1− 0.2B −B4 + 0.20B5)Zt = (1− 0.16B4)et,

so the model is written as

Zn = 0.2Zn−1 + Zn−4 − 0.20Zn−5 − 0.16en−4 + en.

A one-step ahead forecast is given by

Zn+1|n = 0.2Zn + Zn−3 − 0.20Zn−4 − 0.16en−3.

A two-step ahead forecast is given by

Zn+2|n = 0.2Zn+1|n + Zn−2 − 0.20Zn−3 − 0.16en−2.
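In r, such forecasts need not be unravelled by hand: the predict function does this internally. A sketch, assuming z holds the quarterly series behind this example:

> z.fit <- arima(z, order = c(1, 0, 0),
+                seasonal = list(order = c(0, 1, 1), period = 4))
> predict(z.fit, n.ahead = 2)  # $pred holds forecasts, $se their standard errors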

Example 7.22: In Example 7.19, the arima(1, 1, 1) (1, 2, 0)365 model

(1−B)(1−B365)2(1 + 0.267B)(1− 0.513B365)Xt

= (1− 0.455B)et,

was given for the daily salinity at Murray Bridge, South Australia, say {Xt}. After expanding the terms on the left-hand side, there will be terms involving B, B2, B365, B366, B367, B730, B731, B732, B1095, B1096 and B1097. This makes it very difficult to write down. Indeed, without using the backshift operator as above, it would be very tedious to write down the model at all, even though only three parameters have been estimated. Note this is an unusual case of model fitting in that three sets of differences were taken.



Figure 7.9: Some residual plots for the arima(0, 1, 0) (0, 1, 1)12 fitted to the monthly sea level at Darwin. Top left: the residual acf; top right: the residual pacf; bottom left: the cumulative periodogram; bottom right: the Q–Q plot.

7.7 Diagnostics

The usual diagnostics apply equally for non-stationary models; see Module 6.

Example 7.23: In Example 7.17, the model arima(0, 1, 0) (0, 1, 1)12 was suggested for the monthly sea level at Darwin. The residual acf, residual pacf and the cumulative periodogram can be produced in r (Fig. 7.9).

> sma1 <- arima(sl, order = c(0, 1, 0), seasonal = list(order = c(0,

+ 1, 1), period = 12))

> acf(resid(sma1))

> pacf(resid(sma1))

> cpgram(resid(sma1))

Both the residual acf and pacf look OK, but both have a significant term at lag 2; the periodogram looks a little suspect, but isn't too bad. The Box–Pierce Q statistic can be computed, and the standard error of the estimated parameter found also:


> Box.test(resid(sma1))

Box-Pierce test

data:  resid(sma1)
X-squared = 2.9538, df = 1, p-value = 0.08567

> coef(sma1)/sqrt(diag(sma1$var.coef))

     sma1 
-4.335704 

The Box–Pierce test is OK, but is marginal. The estimated parameter is significant.

The Q–Q plot is OK if not perfect.

In summary, the arima(0, 1, 0) (0, 1, 1)12 looks OK, but there are some points of minor concern. Is there perhaps a better model? Perhaps a model with a term at lag 2, such as arima(2, 1, 0) (0, 1, 1)12? (The reason for proposing this model is that the residual acf and pacf suggest difficulties at lag 2.)

We fit this model and compare the residual analyses; see Fig. 7.10.

> oth.mod <- arima(sl, order = c(2, 1, 0), seasonal = list(order = c(0,

+ 1, 1), period = 12))

> acf(resid(oth.mod))

> pacf(resid(oth.mod))

> cpgram(resid(oth.mod))

> qqnorm(resid(oth.mod))

> qqline(resid(oth.mod))

> Box.test(resid(oth.mod))

Box-Pierce test

data:  resid(oth.mod)
X-squared = 0.0116, df = 1, p-value = 0.9144

> coef(oth.mod)/sqrt(diag(oth.mod$var.coef))

      ar1       ar2      sma1 
-1.309895  2.202391 -4.992391 

The residual acf and pacf appear better, as does the periodogram. The Box–Pierce statistic is now certainly not significant, but one of the parameters is unnecessary. (This was expected; we only really wanted



Figure 7.10: Some residual plots for the arima(2, 1, 0) (0, 1, 1)12 fitted to the monthly sea level at Darwin. Top left: the residual acf; top right: the residual pacf; bottom left: the cumulative periodogram; bottom right: the Q–Q plot.


the second lag, but were forced to take the first, insignificant one.) The Q–Q plot looks marginally improved also.

Fitting an arima(0, 1, 2) (0, 1, 1)12 produces similar results. Which is the better model? It is not entirely clear; either is probably OK.

7.8 A summary of model fitting

To summarise, these are the steps that need to be taken to fit a good model:

• Plot the data. Check that the data is stationary. If the data is not stationary, deal with it appropriately (by taking logarithms or differences (seasonal and/or non-seasonal), or perhaps both). Remember that it is rare to require many levels of differencing.

• Examine the sample acf, sample pacf and/or the AIC to determine possible models for the data. Models may include ma, ar, arma or arima models, with non-seasonal and/or seasonal aspects. (Remember that it is rare to have models with a large number of parameters to be estimated.) You may have to use a periodogram to identify the season length.

• Use r's arima function to fit the models and determine the parameter estimates.

• Perform the following diagnostic checks for each of the possible models:

  – examine the residual acf and pacf;

  – examine the cumulative periodogram of the residuals;

  – compute the Box–Pierce statistic;

  – examine the Q–Q plot; and

  – check the significance of the parameter estimates.

• Choose the best model from the available information, and write down the model (probably using backshift operators). Remember that the simplest adequate model is the best model; more parameters do not necessarily make a better model.

These steps are summarized in the flowchart in Fig. 7.11.
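A hedged sketch of this workflow in r, for a hypothetical monthly series x (the orders shown are placeholders, to be chosen from the acf, pacf and diagnostics):

> plot(x)                            # is the series stationary?
> dx <- diff(diff(x), lag = 12)      # differences, if needed
> acf(dx)
> pacf(dx)                           # identify candidate models
> fit <- arima(x, order = c(0, 1, 1),
+              seasonal = list(order = c(0, 1, 1), period = 12))
> acf(resid(fit))
> pacf(resid(fit))                   # diagnostic checks
> cpgram(resid(fit))
> Box.test(resid(fit))
> coef(fit)/sqrt(diag(fit$var.coef)) # parameter significance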


[Flowchart: plot the series; if it is not stationary, use differences and/or logs; identify possible models (ar, ma, arma or arima) using the acf, pacf and/or AIC; estimate model parameters; perform diagnostic checks; if the model is not adequate, return to identification; otherwise write down the final model.]

Figure 7.11: A flowchart for fitting arima (Box–Jenkins) type models.


7.9 A complete example

The data file mlco2.dat contains monthly measurements of carbon dioxide above Mauna Loa, Hawaii, from Jan 1959 to Dec 1990 in parts per million (ppm). (Missing values have been filled in by linear interpolation.) The data were collected by the Scripps Institute of Oceanography, La Jolla, California. The original source is the climatology database maintained by the Oak Ridge National Laboratory, and the data here have been obtained from Hyndman [5].

We will find a suitable model for the data, and use diagnostic tests to determine if the model is adequate.

A plot (Fig. 7.12, top panel) shows the series is clearly non-stationary in the mean, and is also seasonal. Taking non-seasonal differences produces a series approximately stationary in the mean, but still strikingly seasonal (Fig. 7.12, centre panel). Taking seasonal differences (length 12) produces a series that appears stationary (Fig. 7.12, bottom panel).

Using the stationary (twice-differenced) series, the sample acf and pacf are shown in Fig. 7.13.

To find a model, first consider the non-seasonal components. The acf suggests an ma(1) or perhaps ma(3) model. The pacf suggests an ar(1) or perhaps ar(3) model. At this stage, choosing either the ma(1) or ar(1) model seems appropriate, as the terms at lag 3 are marginally over the approximate confidence limits. Which to choose? Since the lag 3 term seems more marginal in the acf, perhaps the ma(1) model is the best choice (this may not turn out to be the case).

Consider now the seasonal components. The acf has a strong term at a seasonal lag of 1 only, suggesting a seasonal ma(1) model. In contrast, the pacf shows significant terms at seasonal lags of 1, 2 and 3 (and there may be more if we looked at higher seasonal lags). This suggests a seasonal model of at least ar(2). For the seasonal component, the best model is the ma(1).

Combining this information suggests the model arima(0, 1, 1) (0, 1, 1)12. This model is fitted as follows:

> co.model <- arima(co, order = c(0, 1, 1),

+     seasonal = list(order = c(0, 1, 1), period = 12))

Is this model an adequate model? The residual acf and pacf (Fig. 7.14) suggest the model is adequate.

The cumulative periodogram and Q–Q plots (Fig. 7.15) indicate the model is adequate. The two estimated parameters are also significant:



Figure 7.12: The monthly measurements of carbon dioxide above Mauna Loa, Hawaii, from Jan 1959 to Dec 1990 in parts per million (ppm). Top: the data is clearly non-stationary in the mean and is seasonal; Middle: the first differences have been taken; Bottom: the seasonal differences have also been taken, and now the series appears stationary.



Figure 7.13: The monthly measurements of carbon dioxide above Mauna Loa, Hawaii, from Jan 1959 to Dec 1990 in parts per million (ppm). Top: the sample acf of the twice-differenced series; Bottom: the sample pacf of the twice-differenced series.



Figure 7.14: The monthly measurements of carbon dioxide above Mauna Loa, Hawaii, from Jan 1959 to Dec 1990 in parts per million (ppm). Top: the residual acf; Bottom: the residual pacf.



Figure 7.15: The monthly measurements of carbon dioxide above Mauna Loa, Hawaii, from Jan 1959 to Dec 1990 in parts per million (ppm). The cumulative periodogram indicates that the model is adequate.

> co.model$coef/sqrt(diag(co.model$var.coef))

      ma1      sma1 
 -6.62813 -27.24159 

The model suggested is

> co.model

Call:
arima(x = co, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1),
    period = 12))

Coefficients:
          ma1     sma1
      -0.3634  -0.8581
s.e.   0.0548   0.0315

sigma^2 estimated as 0.0803:  log likelihood = -66.66,  aic = 139.33

Using backshift operators, the fitted model is

(1−B)(1−B12)Ct = (1− 0.3634B)(1− 0.8581B12)et.
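As a follow-up sketch, the fitted model could be used to forecast, say, two years ahead (the plotting choices are illustrative):

> co.fc <- predict(co.model, n.ahead = 24)
> ts.plot(co, co.fc$pred, lty = c(1, 2))  # data (solid) and forecasts (dashed)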


7.10 Summary

In this Module, three types of non-stationarity have been considered: non-stationarity in the mean, non-stationarity in the variance, and seasonal models. For each, identification and estimation have been considered, as well as the notation. The diagnostic testing involved is the same as for stationary models.

7.11 Exercises

Ex. 7.24: Consider an arima(1, 0, 0) (0, 1, 1)7 model fitted to a time series {Pn}. Write this model using the backshift operator notation (make up some reasonable parameter estimates).

Ex. 7.25: Consider an arima(1, 1, 0) (1, 1, 0)12 model fitted to a time series {Yn}. Write this model using the backshift operator notation (make up some reasonable parameter estimates).

Ex. 7.26: Consider some non-stationary data {W}. After taking non-seasonal differences, the series seems stationary. Let this differenced data be {Y }. A non-seasonal ar(1) model and seasonal ma(2) model is fitted to the stationary data (the season is of length 12).

(a) Write down the model fitted to the series {W} using the backshift notation;

(b) Write down the model fitted to the series {W} using arima notation.

(c) Write the model out in terms of Wt, et and previous terms (that is, don't use the backshift operator).

Ex. 7.27: Consider some non-stationary data {Z}. After taking seasonal differences, the series seems stationary. Let this differenced data be {Y }. A non-seasonal ma(2) model and seasonal arma(1, 1) model is fitted to the stationary data (the season is of length 24).

(a) Write down the model fitted to the series {Z} using the backshift notation;

(b) Write down the model fitted to the series {Z} using arima notation.

(c) Write the model out in terms of Zt, et and previous terms (that is, don't use the backshift operator).


Ex. 7.28: For each of the following cases, write down the final model using the backshift operator and using arima notation.

(a) The time series {P} is non-stationary; after taking ordinary differences, an arma(1, 0) model was fitted to the data.

(b) The time series {T} is seasonal with period 12. After seasonal differences were taken, a seasonal ma(2) model was fitted to the data.

Ex. 7.29: For each of the following cases, write down the final model using the backshift operator and using arima notation.

(a) The time series {Y } is non-stationary; after taking seasonal differences (season of length 12), an arma(1, 1) model was fitted to the data.

(b) The daily time series {S} is seasonal with period 365. After ordinary and seasonal differences were taken, an arma(1, 1) model was fitted to the data.

Ex. 7.30: For the following models written using backshift operators, expand the model and write down the model in standard form. In addition, write down the model using arima notation.

(a) (1−B)(1− 0.3B12)Xt = (1 + 0.2B)et.

(b) (1−B7)Hn = (1− 0.5B − 0.2B2)en.

(c) (1− 0.3B)Wn+1 = (1− 0.4B)en+1.

Ex. 7.31: For the following models written using backshift operators, expand the model and write down the model in standard form. In addition, write down the model using arima notation.

(a) (1−B)2(1 + 0.3B)Yt = et.

(b) (1−B12)(1−B)(1 + 0.3B)Mn+1 = en+1.

(c) Wn+1 = (1− 0.4B)(1 + 0.3B7)en+1.

Ex. 7.32: Consider some non-stationary monthly data {G}. After taking seasonal differences, the series seems stationary. Let this differenced data be {H}. A non-seasonal ma(2) model, a seasonal ma(1) and a seasonal ar(1) model are fitted to {H}. Write down the model fitted to the series {G} using

(a) the backshift operator;

(b) arima notation.


(c) Make up some (reasonable) numbers for the parameters in this model. Then write the model out in terms of Gt, et and previous terms.

Ex. 7.33: Trenberth & Stepaniak [43] defined an index of El Nino evolution they called the Trans-Nino Index (TNI). This monthly time series is given in the data file tni.txt, and contains values of the TNI from January 1958 to December 1999. (The data have been obtained from the Climate Diagnostic Center [2].)

(a) Plot the series and see that it is a little non-stationary.

(b) Use differences to make the series stationary.

(c) Find a suitable ar model for the series.

(d) Find a suitable ma model for the series.

(e) Which model would you prefer: the ar or ma model? Explain your answer using diagnostic analyses.

(f) For your preferred model, estimate the parameters.

Ex. 7.34: The sunspot numbers from 1770 to 1869 were given in Table 1.2 (p 20). The data are given in the data file sunspots.dat.

(a) Plot the data and decide if a seasonal component appears to exist.

(b) Use spectrum (and a smoother) to find any seasonal components.

(c) Suggest a possible model for the data (make sure to do a diagnostic analysis).

Ex. 7.35: The quasi-biennial oscillation (QBO) was considered in Exercise 1.7.

(a) Plot the data and decide if a seasonal component appears to exist.

(b) Use spectrum (and a smoother) to find any seasonal components.

(c) Suggest a possible model for the data (make sure to do a diagnostic analysis).

Ex. 7.36: The average monthly air temperatures in degrees Fahrenheit at Nottingham Castle have been recorded for 20 years and are given in the data file nottstmp.txt. (The data are from Anderson [6, p 166], as quoted in Hand et al. [19, p 279].) Find a suitable time series model for the data.

Note that the season is expected to be of length 12. See if you can discover this from the unsmoothed spectrum, and also from the smoothed spectrum.


Ex. 7.37: Karner & Rannik [25] fit an arima(0, 0, 0) (0, 1, 1)12 to the International Satellite Cloud Climatology Project (ISCCP) cloud detection time series {Cn}. They fit different models for different latitudes. At −90° latitude, the unknown model parameter is about 0.7 (taken from their Figure 5).

(a) Write this model using the backshift operator.

(b) Write the model in terms of Cn and en.

(c) Develop a forecasting model for forecasting one-, two-, twelve- and thirteen-steps ahead.

Ex. 7.38: The streamflow in Little Mahoning Creek, McCormick, Pennsylvania, from 1940 to 1988 is given in the data file mcreek.txt. The file contains the monthly mean values of streamflow in cubic feet per second. Find a suitable time series model for the data. (Make sure to do a diagnostic analysis.)

Ex. 7.39: The data file wateruse.dat contains the annual water usage in Baltimore city in litres per capita per day from 1885 to 1963. The data are from Hipel & McLeod [21] and Hyndman [5]. Plot the data and confirm that the data is non-stationary.

(a) Use appropriate methods to make the series stationary.

(b) Find a suitable model for the series and estimate the parameters of the model. Make sure to do a diagnostic analysis.

(c) Write the model using the backshift operator.

Ex. 7.40: The file firring.txt contains the tree ring indices for the Douglas fir at the Navajo National Monument in Arizona, USA from 1107 to 1968. Find a suitable model for the data.

Ex. 7.41: The data file venicesealevel.dat contains the maximum sea levels recorded at Venice from 1887–1981. Find a suitable model for the time series, including a diagnostic analysis of possible models.

7.11.1 Answers to selected Exercises

7.24 The model is of the form

(1−B7)(1− φB)Pn = (1−ΘB7)en

for some values φ and Θ.


7.26 (a) (1−B)(1− φB)Wt = (1 + Θ1B12 + Θ2B24)et for some numbers φ, Θ1 and Θ2.

(b) arima(1, 1, 0) (0, 0, 2)12.

(c) Expanding the model written using backshift operators gives

(1− (1 + φ)B + φB2)Wt = (1 + Θ1B12 + Θ2B24)et.

This is equivalent to

Wt = (1 + φ)Wt−1 − φWt−2 + et + Θ1et−12 + Θ2et−24.

7.28 (1−B)(1−φB)Pt = et, which is arima(1, 1, 0) (0, 0, 0); (1−B12)Tt = (1 + Θ1B12 + Θ2B24)et, which is arima(0, 0, 0) (0, 1, 2)12.

7.30 (a) Xt = Xt−1 + 0.3Xt−12 − 0.3Xt−13 + et + 0.2et−1, which is an arima(0, 1, 1) (1, 0, 0)12 model.

(b) Hn = Hn−7 + en − 0.5en−1 − 0.2en−2, which is an arima(0, 0, 2) (0, 1, 0)7 model.

(c) Wn+1 = 0.3Wn + en+1 − 0.4en, which is an arima(1, 0, 1) (0, 0, 0) model; that is, it is not seasonal.

7.39 The series is plotted in the top plot in Fig. 7.16. The data are clearly non-stationary in the mean. Taking differences produces an approximately stationary series; see the bottom plot in Fig. 7.16.

Using the stationary differenced series, the sample acf and pacf are shown in Fig. 7.17. These plots suggest that no model can be fitted to the differenced series. That is, the first differences are random.

The model for the water usage {Wt} is therefore

(1−B)Wt = et

or Wt = Wt−1 + et. There are no parameters to estimate.

Here is the code used:

> wu <- read.table("wateruse.dat", header = TRUE)

> wu <- ts(wu$Use, start = 1885)

> plot(wu, las = 1)

> dwu <- diff(wu)

> plot(dwu, main = "First difference of water use",

+ las = 1)

> acf(dwu, main = "")

> pacf(dwu, main = "")

The diagnostics have been left for you.



Figure 7.16: The annual water usage in Baltimore city in litres per capita per day from 1885 to 1968. Top: the data is clearly non-stationary in the mean; Bottom: the first differences are approximately stationary.



Figure 7.17: The annual water usage in Baltimore city in litres per capita per day from 1885 to 1968. Top: the sample acf of the differenced series; Bottom: the sample pacf of the differenced series.


> par(mfrow = c(1, 2))

> plot(vs)

> plot(diff(vs))


Figure 7.18: A plot of the Venice sea level data. Left: original data; right: after taking first differences.

7.41 First, load and prepare the data:

> VSL <- read.table("venicesealevel.dat", header = TRUE)

> vs <- ts(VSL$MaxSealevel, start = c(1887))

A plot of the data shows the series is non-stationary (Fig. 7.18, left panel) and the data increasing (what is the implication there?). Taking differences produces a more stationary series (Fig. 7.18, right panel).

See the acf and pacf (Fig. 7.19); the acf suggests an ma(1) model (or possibly ma(3), but start with the simpler choice), while the pacf suggests an ar(2) model. Decide to start with the ar(1) model:

> vs.ar1 <- arima(vs, order = c(1, 1, 0))

The residual acf and pacf aren’t great (Fig. 7.20); there are quitea few components outside the approximate confidence limits, but thecomponents at lag 1 are fine in both plots. Maybe the ma(1) wouldbe better? That does appear to be true (Fig. 7.21).

> vs.ma1 <- arima(vs, order = c(0, 1, 1))

This model appears fine, if not perfect, so let's examine more diagnostics (Fig. 7.22); these look OK too.

So, for some final diagnostics:


> par(mfrow = c(1, 2))

> acf(diff(vs))

> pacf(diff(vs))


Figure 7.19: The acf and pacf of the Venice sea level data

> par(mfrow = c(1, 2))

> acf(resid(vs.ar1))

> pacf(resid(vs.ar1))


Figure 7.20: The residual acf and pacf of the Venice sea level data after fitting the ar(1) model


> par(mfrow = c(1, 2))

> acf(resid(vs.ma1))

> pacf(resid(vs.ma1))


Figure 7.21: The residual acf and pacf of the Venice sea level data after fitting the ma(1) model

> par(mfrow = c(1, 2))

> cpgram(resid(vs.ma1))

> qqnorm(resid(vs.ma1))

> qqline(resid(vs.ma1))


Figure 7.22: Further diagnostic plots of the Venice sea level data after fitting the ma(1) model


> Box.test(resid(vs.ma1))

Box-Pierce test

data:  resid(vs.ma1)
X-squared = 0.8388, df = 1, p-value = 0.3597

> coef(vs.ma1)/sqrt(diag(vs.ma1$var.coef))

      ma1 
-16.48790 

All looks well; decide the ma(1) model is suitable:

> vs.ma1

Call:
arima(x = vs, order = c(0, 1, 1))

Coefficients:
          ma1
      -0.8677
s.e.   0.0526

sigma^2 estimated as 319.5:  log likelihood = -405.11,  aic = 814.23


Module 8

Markov chains

Module contents

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.3 The transition matrix . . . . . . . . . . . . . . . . . . . . 177
8.4 Forecast the future with powers of the transition matrix . 181
8.5 Classification of finite Markov chains . . . . . . . . . . . 184
8.6 Limiting state (steady state) probabilities . . . . . . . . . 187

8.6.1 Share of the market model . . . . . . . . . . . . . . . . . 190
8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

8.7.1 Answers to selected Exercises . . . . . . . . . . . . . . . 200

Module objectives: Upon completion of this Module students should be able to:

• state and understand the Markov property;

• identify processes requiring a Markov chain description;

• determine the transition and higher order matrices of a Markov chain;

• calculate state probabilities;


• determine and interpret the steady state distribution;

• calculate and interpret mean recurrence intervals;

• apply Markov chain techniques to basic decision making problems;

• determine future states or conditions using Markov analysis.

8.1 Introduction

Up to now, only continuous time series have been considered; that is, the quantity being measured over time is continuous. In this Module¹, a simple method is considered for time series that take on discrete values. A simple example is the state of the weather: whether it is fine or raining, for example.

8.2 Terminology

A stochastic process is a collection of random variables {X(t)} where the parameter t denotes time (possibly space) and ranges over some interval of interest; e.g. t ≥ 0. X(t) denotes the random variable X at time t. The values assumed by X(t) may be called states and the set of all possible states is called the state space. The state space (and hence X(t)) may be discrete or continuous: the state space of a queue is discrete; the state space of inter-event times is continuous. The time parameter t (sometimes called the indexing parameter) may also be discrete or continuous. In this Module, we study the case where the state space is discrete, and the time parameter t is also discrete (and equally spaced). Two examples of a discrete time stochastic process follow.

� Let Y (n) be the volume of water in a reservoir at the start of month n. The parameter n is used in place of t to emphasise the fact that this parameter is discrete, taking on values 0, 1, 2, . . . . Although Y (n) is naturally continuous, since it is a measure of volume, it may be sufficient in some applications to measure Y (n) on a crude scale containing relatively few values, in which case Y (n) would be treated as discrete.

� Let T (n) be the time between the nth and (n + 1)th pulses registered on a Geiger counter. The indexing parameter n ∈ {0, 1, 2, . . .} is discrete and the state space continuous. A realisation of this process would be a discrete set of real numbers with values in the range (0,∞).

(Most of the material in this Module has been drawn from previous work by Dr Ashley Plank and Professor Tony Roberts.)


In this section we consider stochastic models with both discrete state space and discrete parameter space. Some examples include: annual surveys of biological populations; monthly assessment of the water levels in a reservoir; weekly inventories of stock; daily inspections of a vending machine; microsecond sampling of a buffer state. These models are used occasionally in climate modelling.

Example 8.1: Tomorrow’s weather Consider the state of the weather on a day to day basis. Days may be classified as either fine/sunny or overcast/cloudy. Suppose a fine day follows a fine sunny day 40% of the time and an overcast cloudy day follows an overcast day 20% of the time. For example, the data this conclusion comes from may be the following sequence of observations for consecutive days: C, S, C, C, S, S, S, C, S, C, S (though illustrative, this sample is far too small for real applications). See that, as stated above, 2/5 of the sunny days are followed by a sunny day and 1/5 of the cloudy days are followed by cloudy days. Define

   Xt = { 1, if day t is fine/sunny,
        { 2, if day t is overcast/cloudy.

In other words, let state 1 correspond to sunny days, and state 2 to cloudy. We model this process as a Markov chain as defined below by assuming that for any two consecutive days t and t + 1 in the future:

Pr {Xt+1 = 1 | Xt = 1} = 0.4 , Pr {Xt+1 = 2 | Xt = 2} = 0.2 .

It follows that

Pr {Xt+1 = 2 | Xt = 1} = 1 − 0.4 = 0.6 ,

Pr {Xt+1 = 1 | Xt = 2} = 1 − 0.2 = 0.8 .

This information is recorded on a state transition diagram such as the following. (Always draw a state transition diagram.)

[State transition diagram: state 1 = sunny with self-loop probability 0.4, state 2 = cloudy with self-loop probability 0.2, an arrow from 1 to 2 labelled 0.6 and an arrow from 2 to 1 labelled 0.8.]


These four probabilities are conveniently represented as the matrix

   P = [ P11  P12 ] = [ 0.4  0.6 ]
       [ P21  P22 ]   [ 0.8  0.2 ]        (8.1)

Note that the rows sum to one.
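As a quick check in r, a minimal sketch (assuming only base r): enter the matrix (8.1) and verify the row sums.

> P <- matrix(c(0.4, 0.6,
+               0.8, 0.2), nrow = 2, byrow = TRUE)   # the matrix (8.1)
> rowSums(P)   # each row should sum to one
[1] 1 1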

Markov chains are a special type of discrete-time stochastic process. For convenience, as above, we write times as an integral number of some basic units such as days, weeks, months, years or microseconds.

Definition 8.1 (Markov chain) Suppose a discrete-time stochastic process can be in one of a finite number of states, generally labelled 1, 2, 3, . . . , s; then the stochastic process is called a Markov chain if

   Pr {Xt+1 = it+1 | Xt = it, Xt−1 = it−1, . . . , X1 = i1, X0 = i0}
      = Pr {Xt+1 = it+1 | Xt = it} .

This expression says that the probability distribution of the state at time t + 1 depends only on the state at time t (namely it) and does not depend on the states the chain passed through on the way to it at time t. Usually we make a further assumption: that for all states i and j and all t, Pr {Xt+1 = j | Xt = i} is independent of t. This assumption applies whenever the system under study behaves consistently over time. Any stochastic process with this behaviour is called stationary. Based on this assumption we write

Pr {Xt+1 = j | Xt = i} = Pij , (8.2)

so that Pij is the probability that, given the system is in state i at time t, the system will be in state j at time t + 1. The Pij are referred to as the transition probabilities.

Note that it is crucial that you clearly define the states and the discrete times.

Example 8.2: Preisendorfer and Mobley [37] and Wilks [48] use a three-state Markov chain to model the transitions between below-normal, normal and above-normal months for temperature and precipitation.


8.3 The transition matrix

For a system with s states the transition probabilities are conveniently represented as an s × s matrix P . Such a matrix P is called the transition matrix and each Pij is called a one-step transition probability. For example, P12 represents the probability that the process makes a transition from state 1 to state 2 in one period, whereas P22 is the probability that the system stays in state 2. Each row represents the one-step transition probability distribution over all states. If we observe the system in state i at the beginning of any period, then the ith row of the transition matrix P represents the probability distribution over the states at the beginning of the next period.

The same transition matrix completely describes the probabilistic behaviour of the system for all future one-step transitions. The probabilistic behaviour of such a system over time is called a stationary Markov chain: stationary because the matrix P is the same for transitions between all times.

Example 8.3: Tomorrow’s weather continued Consider the weather example with transition matrix (8.1) and suppose today, t = 0, is sunny, state X0 = 1. Then from the given data the probability of it being sunny tomorrow, state X1 = 1, is 0.4 and the probability of it being cloudy, X1 = 2, is thus 0.6. So our forecast for tomorrow’s weather is the probabilistic mix p(1) = [ 0.4  0.6 ], called a probability vector and denoting the probability of being sunny or cloudy respectively. As claimed above, this is just the first row of the transition matrix P .

What can we say about the weather in two days’ time? We seek a vector of probabilities, say p(2), giving the probabilities of the day after tomorrow being sunny or cloudy respectively. Given today is sunny, then the day after tomorrow can be sunny, X2 = 1, via two possible routes: it can be cloudy tomorrow then sunny the day after, with probability (as the Markov assumption is that these transitions are independent)

   Pr {X2 = 1 | X1 = 2} × Pr {X1 = 2 | X0 = 1} = P21 × P12 = 0.8 × 0.6 ;

or it can be sunny tomorrow then sunny the day after, with probability (as the transitions are assumed independent)

   Pr {X2 = 1 | X1 = 1} × Pr {X1 = 1 | X0 = 1} = P11 × P11 = 0.4 × 0.4 .

Since these are two mutually exclusive routes, we add their probabilities to determine

   Pr {X2 = 1 | X0 = 1} = 0.4 × 0.4 + 0.6 × 0.8 = 0.64 .


Similarly, the probability that it is cloudy the day after tomorrow is the sum of two possible routes:

   Pr {X2 = 2 | X0 = 1} = 0.4 × 0.6 + 0.6 × 0.2 = 0.36 .

Combining these into one probability vector, our probabilistic forecast for the day after tomorrow is p(2) = [ 0.64  0.36 ]. The important general feature of this example is that post-multiplication by the transition matrix determines how the vector of probabilities evolves, p(2) = p(1)P , as you see realised in the above two displayed expressions.

This formula applies to the initial forecast of tomorrow’s weather too. Since we know today is sunny, the current state is p(0) = [ 1  0 ], denoting that we are certain the weather is in state 1. Then observe in the above that p(1) = p(0)P .

Using independence of transitions from step to step, and the mutual exclusiveness of different possible paths, we establish the general key result:

Theorem 8.2 If a Markov chain has transition matrix P and is in states with probability vector p(t) at time t, then 1 time step later its probability vector is p(t + 1) = p(t)P .

Proof: Consider the following schematic, general but partial, state transition diagram:

[Diagram: each of the states 1, 2, . . . , i, . . . , s, occupied at time t with probabilities p1(t), p2(t), . . . , pi(t), . . . , ps(t), has an arrow into state j at time t + 1, labelled with the transition probabilities P1j , P2j , . . . , Pij , . . . , Psj respectively.]


The system arrives in some state j at time t + 1 by s mutually exclusive possibilities, depending upon the state of the system at time t:

   pj(t + 1) = Σ_{i=1}^{s} Pr {make state i to j transition}
             = Σ_{i=1}^{s} Pr {in state i} × Pr {Xt+1 = j | Xt = i}
             = Σ_{i=1}^{s} pi(t)Pij     (by their definition)
             = jth element of p(t)P .

Hence putting these elements together: p(t + 1) = p(t)P . ♠

Note that the future behaviour of the system (for example, the states of the weather) only depends on the current state and not on how it entered this state. Given the transition matrix P , knowledge of the current state occupied by the process is sufficient to completely describe the future probabilistic behaviour of the process. This lack of memory of earlier history may be viewed as an extreme limitation. However, this is not so. As the next example shows, we can build into the current state such a memory. The trick is widely applicable and creates a powerful modelling mechanism.

Example 8.4: Remembering yesterday’s weather. Assume that tomorrow’s weather depends on the weather conditions during the last two days as follows:

� if the last two days have been sunny, then 95% of the time tomorrow will be sunny;

� if yesterday was cloudy and today is sunny, then 70% of the time tomorrow will be sunny;

� if yesterday was sunny and today is cloudy, then 60% of the time tomorrow will be cloudy;

� if the last two days have been cloudy, then 80% of the time tomorrow will be cloudy.

Using this information, model the weather as a Markov chain, draw the state transition diagram and write down its transition matrix. If tomorrow’s weather depends on the weather conditions during the last three days, how many states would be needed to model the weather as a Markov chain?


Solution: Since each day is classified as either sunny (S) or cloudy (C), we have 4 states: SS, SC, CS, and CC. In these labels the first letter denotes what the weather was yesterday and the second letter denotes today’s weather. For example, the second rule above says that if today we are in state CS (yesterday was cloudy and today is sunny), then with probability 70% tomorrow will be in state SS, because tomorrow will be sunny, the second S, and tomorrow’s yesterday, namely today, was sunny, the first S. The state transition diagram is:

[State transition diagram over the states SS, SC, CS and CC, with arrows labelled by the transition probabilities 0.95, 0.05, 0.70, 0.30, 0.40, 0.60, 0.20 and 0.80 appearing in the transition matrix below.]

The transition matrix is thus

          SS    SC    CS    CC
    SS [ 0.95  0.05  0.00  0.00 ]
P = SC [ 0.00  0.00  0.40  0.60 ]
    CS [ 0.70  0.30  0.00  0.00 ]
    CC [ 0.00  0.00  0.20  0.80 ]

If tomorrow’s weather depends on the weather conditions during the last 3 days, then 2^3 = 8 states are needed: SSS, SSC, SCS, SCC, CSS, CSC, CCS, and CCC.

See that you may write down the states of a Markov chain in any order that you please. But once you have decided on an ordering, you must stick to that ordering throughout the analysis. In the above example, the labels for both the rows and the columns of the transition matrix must be, and are, in the same order, namely SS, SC, CS, and CC. In applying Markov chains, there need not be a natural order for the states, and so you will have to decide and fix upon one.


8.4 Forecast the future with powers of the transition matrix

Using independence of transitions from step to step, and the mutual exclusiveness of different possible paths:

Theorem 8.3 If the process is in states with probability vector p(t) at time t then n steps later its probability vector is p(t + n) = p(t)P^n .

Example 8.5: In the weather Example 8.3 we saw that p(1) = p(0)P and p(2) = p(1)P so that

   p(2) = p(1)P = p(0)P P = p(0)P^2 .

Thus the forecast 2 days later is P^2 times the current probability vector.

Proof: It is certainly true for the n = 1 case: p(t + 1) = p(t)P by Theorem 8.2. For the case n = 2:

   p(t + 2) = p(t + 1)P     by Theorem 8.2
            = p(t)P P       by Theorem 8.2 again
            = p(t)P^2 .

For the case n = 3:

   p(t + 3) = p(t + 2)P     by Theorem 8.2
            = p(t)P^2 P     by the n = 2 case
            = p(t)P^3 .

For the case n = 4:

   p(t + 4) = p(t + 3)P     by Theorem 8.2
            = p(t)P^3 P     by the n = 3 case
            = p(t)P^4 .

And so on (formally by induction) for the general case. ♠

Given a Markov chain with transition probability matrix P , if the chain is in state i at time t, we might be interested to know the probability that n periods later the chain will be in state j. Since we are dealing with a stationary Markov chain, this probability will be independent of t.


Corollary 8.4 The (i, j)th element of P^n gives the probability of starting from state i and being in state j precisely n steps later.

Proof: Being in state i at time t corresponds to p(t) being zero except for the ith element, which is one; then the right-hand side of p(t + n) = p(t)P^n shows p(t + n) must be just the ith row of P^n. Thus the corollary follows. ♠

Example 8.6: Assume that the population movement of people between city and country is modelled as a Markov chain with transition matrix

   P = [ P11  P12 ]
       [ P21  P22 ]

where:

� P11 = 0.9 is the probability that a person currently living in the city will remain in the city after one transition (year);

� P12 = 0.1 is the probability that a person currently living in the city will move to the country after one transition (year);

� P21 = 0.2 is the probability that a person currently living in the country will move to the city after one transition (year); and

� P22 = 0.8 is the probability that a person currently living in the country will remain in the country after one transition (year).

If a person is currently living in the city, what is the probability that this person will be living in the country 2 years from now?

If 75% of the population is currently living in the city and 25% in the country, what is the population distribution after 1, 2, 3 and 10 years from now?

Solution: To answer the first question we determine element (1, 2) of the matrix P^2:

   P^2 = [ 0.9  0.1 ] [ 0.9  0.1 ]   =   [ 0.83  0.17 ]
         [ 0.2  0.8 ] [ 0.2  0.8 ]       [ 0.34  0.66 ] .

Hence [P^2]12 = 0.17. This means that the probability that a city person will live in the country after 2 transitions (years) is 17%.

To find the population distribution after 1, 2, 3 and 10 years, given that the initial distribution is p(0) = [ 0.75  0.25 ], we perform the following calculations.


After 1 year the distribution is:

   p(1) = [ 0.75  0.25 ] [ 0.9  0.1 ]   =   [ 0.725  0.275 ] .
                         [ 0.2  0.8 ]

Use this result to find the population distribution after 2 years:

   p(2) = [ 0.725  0.275 ] [ 0.9  0.1 ]   =   [ 0.7075  0.2925 ] .
                           [ 0.2  0.8 ]

And after 3 years:

   p(3) = [ 0.7075  0.2925 ] [ 0.9  0.1 ]   =   [ 0.6952  0.3048 ] .
                             [ 0.2  0.8 ]

We continue with this process to obtain the population distribution after 9 years and 10 years:

   p(9) = [ 0.6700  0.3300 ] ,   p(10) = [ 0.6690  0.3310 ] .

Notice that after many transitions the population distribution tends to settle down to a steady state distribution.

The above calculations can also be performed as follows:

   p(n) = p(0)P^n .

Hence to calculate p(10) we multiply the initial population distribution by

   P^10 = [ 0.6761  0.3239 ]
          [ 0.6478  0.3522 ] ,

giving

   p(10) = [ 0.75  0.25 ] [ 0.6761  0.3239 ]   =   [ 0.6690  0.3310 ] ,
                          [ 0.6478  0.3522 ]

which is the same result as before.
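In r this is a short calculation; a minimal sketch, assuming only base r and building P^10 by repeated multiplication:

> P <- matrix(c(0.9, 0.1,
+               0.2, 0.8), nrow = 2, byrow = TRUE)
> P10 <- diag(2)                   # start from the identity matrix
> for (k in 1:10) P10 <- P10 %*% P # P10 is now P^10
> c(0.75, 0.25) %*% P10            # p(10): 0.6690 0.3310, as above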

For large n notice that P^n also approaches a steady state with identical rows. For example,

   lim_{n→∞} P^n = [ 0.6667  0.3333 ]
                   [ 0.6667  0.3333 ] .

The probabilities in each row represent the population distribution in the steady state. This distribution is independent of the initial conditions. For example, if a fraction x (0 ≤ x ≤ 1) of the population initially lived in the city and a fraction (1 − x) in the country, in the steady state situation we will find 66.67 percent living in the city and 33.33 percent living in the country, regardless of the value of x. This is verified by computing

   p(∞) = [ x  1 − x ] [ 0.6667  0.3333 ]   =   [ 0.6667  0.3333 ] .
                       [ 0.6667  0.3333 ]


8.5 Classification of finite Markov chains

The long term behaviour of Markov chains depends on the general structure of the transition matrix. For some transition matrices the chain will settle down to a steady state condition which is independent of the initial state. In this subsection we identify the characteristics of a Markov chain that will ensure a steady state exists. In order to do this we must classify a Markov chain according to the structure of its transition diagram and matrix. The critical property we need for a steady state is that the Markov chain is “ergodic”—you may meet this term in completely different contexts, such as in fluid turbulence, but the meaning is essentially the same: here it means that the probabilities get “mixed up” enough to ensure there are no long time correlations in behaviour, and hence a steady state will appear. (There are biological situations with intriguing non-ergodic effects.)

Further, an ergodic system is one in which time averages, such as might be obtained from an experiment, are identical to ensemble averages (averages over many realisations), which is what we often want to discuss and report in applications.

Consider the following transition matrix

          1    2    3    4    5
    1 [ 0.3  0.7  0    0    0   ]
    2 [ 0.9  0.1  0    0    0   ]
P = 3 [ 0    0    0.2  0.8  0   ]
    4 [ 0    0    0.5  0.3  0.2 ]
    5 [ 0    0    0    0.4  0.6 ]

This matrix is depicted by the following state transition diagram. Each node represents a state and the labels on the arrows represent the transition probability Pij . (Always draw such a state transition diagram for your Markov chains.)

[State transition diagram: states 1–5 with self-loops labelled 0.3, 0.1, 0.2, 0.3 and 0.6, and arrows 1→2 (0.7), 2→1 (0.9), 3→4 (0.8), 4→3 (0.5), 4→5 (0.2) and 5→4 (0.4).]

The following properties refer to this particular Markov chain as a first example.

� Given two states i and j, a path from i to j is a sequence of transitions that begins in i and ends in j such that each transition in the sequence


has a positive probability of occurrence: thus [P^n]ij > 0 for an n-step path.

For example, see that there are paths from 1 to 2, from 1 to 1, from 3 to 5, but not from 3 to 1.

� A state j is accessible from i if there is a path leading from i to j after one or more transitions.

For example, state 5 is accessible from state 3, but state 5 is not accessible from states 1 or 2.

� Two states i and j communicate with each other if j is accessible from i and i is accessible from j. If state i communicates with j and with k, then j also communicates with k. Therefore, all states that communicate with i also communicate with each other. (For those who did Discrete Mathematics for Computing: communication is an equivalence relation.)

For example, states 1 and 2 communicate with each other. Similarly states 3 and 5 communicate, but states 1 and 5 do not.

� A set of states S in a Markov chain is a closed set if no state outside of S is accessible from any state in S.

For example, S1 = {1, 2} and S2 = {3, 4, 5} are both closed sets.

� A state i is an absorbing state if Pii = 1. Once we enter such an absorbing state, we never leave that state, because with probability 1 we can only make the transition from i to i; there is no “spare probability” to go elsewhere.

There are no absorbing states in the above example. However, in many models of biological populations, the population going extinct is an absorbing state because with zero females the species cannot breed and so remains extinct forever.

� A state i is a transient state if a state j exists that is accessible from i, but the state i is not accessible from j. If a state is not transient it is called a recurrent state. After a large number of transitions the probability of being in a transient state is zero.

There are no transient states in the above example. States 1 and 2 in the Markov chain with the following state transition diagram are transient; states 3 and 4 are recurrent:


[State transition diagram with states 1, 2, 3 and 4; states 1 and 2 are transient, and states 3 and 4 are recurrent, exchanging with each other.]

� A recurrent state i is cyclic (periodic) with period d > 1 if the system can never return to state i except after a multiple of d steps. (Thus d is the greatest common divisor, over all possibilities, of the number of transitions, n, for the process to move from state i back to state i: d = gcd{n | [P^n]ii > 0}.) A state that is not cyclic is called aperiodic.

In the earlier example, all states are aperiodic because from each state the system can revisit that state after any integer number of steps. However, for the example immediately above, the two recurrent states 3 and 4 are cyclic with period d = 2 as, for example, state 3 can only be returned to after a multiple of d = 2 steps; similarly for state 4.

� If all states in a chain are recurrent, aperiodic and communicate with each other, the chain is ergodic.

The above examples are not ergodic because not all states communicate with each other. See in the earlier example that no unique steady state exists, because if the system starts in states 1 or 2 it must stay in those states forever, whereas if it starts in states 3–5 then it stays in those states forever: the long time behaviour is quite different depending upon which case occurs, and thus there is no unique steady state.

Example 8.7: Determine which of the chains with the following transition matrices is ergodic.

   P1 = [ 0    0    0.5  0.5 ] ,   P2 = [ 0.2  0.4  0.4 ]
        [ 0    0    0.4  0.6 ]          [ 0.1  0.2  0.7 ]
        [ 0.1  0.9  0    0   ]          [ 0.3  0.3  0.4 ] .
        [ 0.4  0.6  0    0   ]

Solution: Draw a state transition diagram for each; then the following observations easily follow. The states in P1 communicate with each other. However, if the process is in state 1 or 2 it will always move to either state 3 or 4 in the next transition. Similarly, if the process is in


state 3 or 4 it will move back to state 1 or 2. All states in such a chain are cyclic with period d = 2. This chain is not ergodic.

All states in P2 communicate with each other. The states are recurrent and aperiodic. Therefore P2 is an ergodic chain.

8.6 Limiting state (steady state) probabilities

Theorem 8.5 Let P be the transition matrix of an s-state ergodic Markov chain; then a vector π = [ π1  π2  . . .  πs ] exists such that

   lim_{n→∞} P^n = [ π1  π2  · · ·  πs ]
                   [ π1  π2  · · ·  πs ]
                   [  :   :         :  ]
                   [ π1  π2  · · ·  πs ] .        (8.3)

The common row vector π represents the limiting state probability distribution or the steady state probability distribution that the process approaches regardless of the initial state. When the above limit occurs, then following any initial condition p(0) the probability vector after a large number n of transitions is

   p(n) = p(0)P^n → π .

To show this last step, consider just the first element, p1(n), of the probability vector p(n). It is computed as p(0) times the first column of P^n, but this column tends to [ π1 · · · π1 ]^T, hence

   p1(n) → p(0) [ π1 · · · π1 ]^T = π1 p(0) [ 1 · · · 1 ]^T = π1 ,

as the sum of the elements in p(0) has to be 1. Similarly for all the other elements in p(n).

How do we find these limiting state probabilities π? For a given chain with transition matrix P we have observed that, as the number of transitions n increases,

   p(n) → π .


But we know p(n + 1) = p(n)P and so, taking the limit as n → ∞:

   π = πP .        (8.4)

The limiting steady state probabilities are therefore the solution of this system of linear equations such that the elements of π sum to 1:

   Σ_{j=1}^{s} πj = 1 .        (8.5)

Unfortunately, with the above condition we have s + 1 linear equations in s unknowns. To solve for the unknowns we may replace any one of the s linear equations obtained from (8.4) with equation (8.5).

Example 8.8: To illustrate how to solve for the steady state probabilities, consider the transition matrix

   P = [ 0.7  0.2  0.1  0   ]
       [ 0.3  0.4  0.2  0.1 ]
       [ 0    0.3  0.4  0.3 ]
       [ 0    0    0.3  0.7 ] .

Solving π = πP we have

   [ π1  π2  π3  π4 ] = [ π1  π2  π3  π4 ] [ 0.7  0.2  0.1  0   ]
                                           [ 0.3  0.4  0.2  0.1 ]
                                           [ 0    0.3  0.4  0.3 ]
                                           [ 0    0    0.3  0.7 ] ,

or

   π1 = 0.7π1 + 0.3π2 + 0π3 + 0π4 ,
   π2 = 0.2π1 + 0.4π2 + 0.3π3 + 0π4 ,
   π3 = 0.1π1 + 0.2π2 + 0.4π3 + 0.3π4 ,
   π4 = 0π1 + 0.1π2 + 0.3π3 + 0.7π4 ,

together with

   π1 + π2 + π3 + π4 = 1 .        (8.6)

Discarding any one of the first four equations and solving the remaining equations, we find the steady state probabilities:

   π = [ 3/15  3/15  4/15  5/15 ] .

The steady state probabilities can be found by first noting that π = πP can be written as

   π(I − P ) = 0 ,


where I is an identity matrix of appropriate size (and remembering that the order of multiplication is important in matrix multiplication). This equation is of the form xA = b. To turn it into the more familiar form Ax = b, transpose both sides:

   (I − P )^T π^T = 0

(since (AB)^T = B^T A^T ). Now, this system has four equations, only three of which are necessary. One row (say the last row) of (I − P )^T can be replaced with [ 1 1 1 1 ], and the corresponding entry of the right-hand side with 1 (that is, Equation (8.6)). This can all be done in r—albeit with some effort.

> data <- c(0.7, 0.2, 0.1, 0, 0.3, 0.4,

+ 0.2, 0.1, 0, 0.3, 0.4, 0.3, 0, 0,

+ 0.3, 0.7)

> P <- matrix(data, nrow = 4, ncol = 4,

+ byrow = T)

> eye <- diag(4)

> tIP <- t(eye - P)

> tIP[4, ] <- c(1, 1, 1, 1)

> rhs <- matrix(c(0, 0, 0, 1), nrow = 4,

+ ncol = 1)

> steady.state <- solve(tIP, rhs)

> steady.state

          [,1]
[1,] 0.2000000
[2,] 0.2000000
[3,] 0.2666667
[4,] 0.3333333

Of course, in r it can be easier just to raise the transition matrix to a large power:

> P2 <- P %*% P

> P4 <- P %*% P %*% P %*% P

> P16 <- P4 %*% P4 %*% P4 %*% P4

> P64 <- P16 %*% P16 %*% P16 %*% P16

> P256 <- P64 %*% P64 %*% P64 %*% P64

> P256


     [,1] [,2]      [,3]      [,4]
[1,]  0.2  0.2 0.2666667 0.3333333
[2,]  0.2  0.2 0.2666667 0.3333333
[3,]  0.2  0.2 0.2666667 0.3333333
[4,]  0.2  0.2 0.2666667 0.3333333

The answers are the same.

8.6.1 Share of the market model

One application area of Markov chains is in brand switching or share of the market models. Suppose NoFrill Airlines (nfa) is competing for the market share of domestic passengers with the other two major carriers, KangaRoo Airways (kra) and emu Airlines. The major airlines have commissioned a survey to determine the likely impact of the newcomer on their market share. The results of a random survey have revealed the following information:

� 40% of passengers currently fly with kra;

� 50% of passengers currently fly with emu;

� 10% of passengers currently fly with nfa.

The survey results also showed that:

� 80% of the passengers who currently fly with kra will fly with kra next time, 15% will switch to emu and the remaining 5% will switch to nfa;

� 90% of the passengers who currently fly with emu will fly with emu next time, 6% will switch to kra and the remaining 4% will switch to nfa;

� 90% of the passengers who currently fly with nfa will fly with nfa next time, 4% will switch to kra and the remaining 6% will switch to emu Airlines.

The preference pattern of passengers is here modelled as a Markov chain with the following transition matrix:

          kra   emu   nfa
    kra [ 0.80  0.15  0.05 ]
P = emu [ 0.06  0.90  0.04 ]
    nfa [ 0.04  0.06  0.90 ] .


We also have the initial market share

   p(0) = [ 0.4  0.5  0.1 ] .

To determine the long term market share for each airline we find the steady state probabilities of the transition matrix. Solving π = πP and replacing any one of the equations with π1 + π2 + π3 = 1, we get

   π = [ 0.2077  0.4918  0.3005 ] .

Therefore, in the long term, the market share of kra would drop from an initial 40% to 20.77%, the market share of emu Airlines would remain roughly steady (49.18%), and nfa would increase their market share from 10% to 30.05%.

Note that the future market share for each airline only depends on the transition matrix and not on the initial market share. The management of kra could launch an advertising campaign to regain some of the 20% of their customers who are switching to the other two airlines.
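The steady state can be checked in r with the same (I − P)^T trick as in Example 8.8; a minimal sketch:

> P <- matrix(c(0.80, 0.15, 0.05,
+               0.06, 0.90, 0.04,
+               0.04, 0.06, 0.90), nrow = 3, byrow = TRUE)
> tIP <- t(diag(3) - P)
> tIP[3, ] <- 1                 # replace one equation by pi1 + pi2 + pi3 = 1
> solve(tIP, c(0, 0, 1))        # approximately 0.2077 0.4918 0.3005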

8.7 Exercises

Ex. 8.9: The SOI is a well known climatological indicator for eastern Australia. Stone and Auliciems [42] developed SOI phases in which the average monthly SOI is allocated to one of five phases corresponding to the SOI falling rapidly (phase 1), staying consistently negative (phase 2), staying consistently near zero (phase 3), staying consistently positive (phase 4), and rising rapidly (phase 5).

The transition matrix, based on data collected from July 1877 to February 2002, is

   P = [ 0.668  0.000  0.081  0.154  0.101 ]
       [ 0.000  0.683  0.125  0.062  0.130 ]
       [ 0.354  0.000  0.063  0.370  0.212 ]
       [ 0.000  0.387  0.204  0.102  0.303 ]
       [ 0.036  0.026  0.132  0.276  0.529 ] .

(a) Draw a transition diagram for the SOI phases.

(b) Determine if the Markov chain is ergodic.

(c) Determine the steady state probabilities.

Ex. 8.10: Draw the state transition diagram for the Markov chain given by

   P = [ 1/3  2/3 ]
       [ 1/4  3/4 ] .


Ex. 8.11: Draw the state transition diagram and hence determine if the following Markov chain is ergodic. Also determine the recurrent, transient and absorbing states of the chain.

   P = [ 0    0    1    0    0    0   ]
       [ 0    0    0    0    0    1   ]
       [ 0    0    0    0    1    0   ]
       [ 1/4  1/4  0    1/2  0    0   ]
       [ 1    0    0    0    0    0   ]
       [ 0    1/3  0    0    0    2/3 ]

Ex. 8.12: The daily rainfall in Melbourne has been recorded from 1981 to 1990. The data is contained in the file melbrain.dat, and is from Hyndman [5] (and originally from the Australian Bureau of Meteorology).

A large number of days recorded no rainfall at all. The following matrix shows the transition matrix for the two states ‘Rain’ and ‘No rain’:

   P = [ 0.721  0.279 ]
       [ 0.440  0.560 ] .

(a) Draw a transition diagram from the matrix P .

(b) Use r to determine the steady state probabilities of days with rain in Melbourne.

(c) Determine the probability of having a wet day two days after a fine day.

Ex. 8.13: The daily rainfall in Melbourne has been recorded from 1981 to 1990, and was used in the previous exercise. In that exercise, two states (‘Rain’ (R) or ‘No rain’ (N)) were used. Then, the state yesterday was used to deduce probabilities of the two states today. In this exercise, four states are used, taking into account the weather for the previous two days.

There are four states RR, RN, NR, NN; the left-most state occurs earlier. (That is, RN means a rain-day followed by a day with no rain.) The following matrix shows the transition matrix for the four states:

   P = [ 0.564  0.436  0      0     ]
       [ 0      0      0.315  0.685 ]
       [ 0.555  0.445  0      0     ]
       [ 0      0      0.265  0.735 ] .

(a) Draw a transition diagram from the matrix P .


(b) Explain why eight entries in the transition matrix must be exactly zero.

(c) Use r to determine the steady state probabilities of the four states for the data.

(d) Determine the probability that two wet days will be followed by two dry days.

Ex. 8.14: A computer laboratory has become notorious for poor service because of computer breakdowns. Data collected on its status every 15 minutes for about 12 hours (50 observations) is given below (1 indicates “system up” and 0 indicates “system down”):

   1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1
   1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 1 0 1

Assuming this process can be modelled as a Markov chain, estimate from the data the probabilities of the system being “up” or “down” each 15 minutes given it was “up” or “down” in the previous period, draw the state transition diagram and write down the transition matrix.

Ex. 8.15: Suppose that if it has rained for the past three days, then it will rain today with probability 0.8; if it did not rain for any of the past three days, then it will rain today with probability 0.2; and in any other case the weather today will, with probability 0.6, be the same as the weather yesterday. Determine the transition matrix for this Markov chain.

Ex. 8.16: Let {Xn | n = 0, 1, 2, . . .} be a Markov chain with state space {1, 2, 3} and transition matrix

   P = [ 1/2  1/4  1/4 ]
       [ 2/3  0    1/3 ]
       [ 3/5  2/5  0   ] .

Determine the following probabilities:

(a) being in state 3 two steps after being in state 2;

(b) Pr {X4 = 1 | X2 = 1} ;

(c) p(2) given that p(0) = [ 1  0  0 ] ;

(d) Pr {X2 = 3} given that Pr {X0 = 1} = Pr {X0 = 2} = Pr {X0 = 3} ;

(e) Pr {X2 = 3 | X1 = 2 & X0 = 1} ;

(f) Pr {X2 = 3 & X1 = 2 | X0 = 1} .


Ex. 8.17: Determine the limiting state probabilities for Markov chains with the following transition matrices.

   P1 = [ 0.5  0.5 ]     P2 = [ 0    1  0   ]     P3 = [ 0.2  0.4  0.4 ]
        [ 0.7  0.3 ]          [ 0    0  1   ]          [ 0.5  0.2  0.3 ]
                              [ 0.4  0  0.6 ]          [ 0.3  0.4  0.3 ]

Ex. 8.18: Two white and two black balls are distributed in two urns in such a way that each contains two balls. We say that the system is in state i, i = 0, 1, 2, if the first urn contains i white balls. At each step, we randomly select one ball from each urn and place the ball drawn from the first urn into the second, and conversely with the ball from the second urn. Let Xn denote the state of the system after the nth step. Assuming that the process can be modelled as a Markov chain, draw the state transition diagram and determine the transition matrix.

Ex. 8.19: A company has two machines. During any day each machine that is working at the beginning of the day has a 1/3 chance of breaking down. If a machine breaks down during the day, it is sent to the repair facility and will be working 3 days after it breaks down (i.e. if a machine breaks down during day 3, it will be working at the beginning of day 6). Letting the state of the system be the number of machines working at the beginning of the day, draw a state transition diagram and formulate a transition probability matrix for this situation.

Ex. 8.20: The State Water Authority plans to build a reservoir for flood mitigation and irrigation purposes on the Macintyre river. The proposed maximum capacity of the reservoir is 4 million cubic metres. The weekly flow of the river can be approximated by the following discrete probability distribution:

   weekly inflow (10^6 m^3)    2     3     4     5
   probability                0.3   0.4   0.2   0.1

Irrigation demand is 2 million cubic metres per week. Environmental demand is 1 million cubic metres per week. The minimum storage requirement is 1 million cubic metres. Any demand shortage is at the expense of irrigation. Excess inflow would be released over the spillway. Assume that the irrigation water may be supplied after the inflow arrives.

Before proceeding with the construction, the Water Authority wishes to have some idea of the behaviour of the reservoir.

(a) Model the system as a Markov chain and determine the steady state probabilities. State any assumptions you make.


(b) Explain the steady state probabilities in the context of this question.

Ex. 8.21: Past records indicate that the survival function for light bulbs of traffic lights has the following pattern:

   Age of bulbs in months        0    1    2    3
   Number surviving to age n   100   85   60    0

(a) If each light bulb is replaced after failure, draw a state transition diagram and find the transition matrix associated with this process. Assume that a replacement during the month is equivalent to a replacement at the end of the month.

(b) Determine the steady state probabilities.

(c) If an intersection has 20 bulbs, how many bulbs fail on average per month?

(d) If an individual replacement has a cost of $15, what is the long-run average cost per month?

Ex. 8.22: A machine in continuous service requires frequent adjustment to ensure quality output. If it gets out of adjustment, on average $600 of defective parts are made before it can be corrected. Adjustment costs $200 in labour and downtime. Data collected on the operation of the machine is summarised below:

   Time since adjustment (hours)   Probability of defective production
   1                               0.00
   2                               0.20
   3                               0.50
   4                               1.00

In answering the following questions make suitable assumptions where appropriate.

(a) If the machine is adjusted only when defective production occurs, find the transition matrix associated with this process.

(b) Determine the steady state probabilities. What is the long run mean hourly cost of this policy?

(c) Suppose a policy of readjustment when needed, or after three hours of running time (whichever comes first), is introduced. What is the long run mean cost of this policy?

Ex. 8.23: The following exercises involve computer work in R.


(a) The weather can be classified as Sunny (S) or Cloudy (C). Consider the previous two days classified in this manner; then there are four states: SS, SC, CS and CC. The transition matrix, entered in R, is

pp <- matrix( nrow=4, ncol=4, byrow=TRUE,
              data=c(.9, .1,  0,  0,
                      0,  0, .4, .6,
                     .7, .3,  0,  0,
                      0,  0, .2, .8) )

(You can read ?matrix for assistance.) Verify this could be a valid transition matrix by computing the row sums with rowSums(pp) and checking that they are all one. (See ?rowSums.)

Suppose today is the second of two sunny days in a row, state SS, that is p0 = [ 1  0  0  0 ]. Enter this state into r by typing pie <- c(1,0,0,0), then compute the probabilities of being in various states tomorrow as pie <- pie %*% pp. (See ?"%*%" for help here (quotes necessary). Note that the operator * does an element-by-element multiplication; the operator %*% is used for matrix multiplication in r.) Why is Pr {cloudy tomorrow} = 0.1?

Evaluate pie <- pie %*% pp again to compute the probabilities for two days time. Why is Pr {cloudy in 2 days} = 0.15?

What is the probability of being sunny in 3 days time?

(b) Keep applying pie <- pie %*% pp iteratively and see that the predicted probabilities recognisably converge in about 10–20 days to π = [ 0.58  0.08  0.08  0.25 ]. These are the long-term probabilities of the various states.

Compute P^10, P^20 and P^30 and see that the rows of powers of the transition matrix also converge to the same probabilities.

(c) So far we have only addressed patterns of probabilities. Sometimes we run simulations to see how the Markov chain may actually evolve. That is, we need to generate a sequence of states according to the probabilities of transitions. For this weather example, if we start in the zeroth state SS we need to generate for the first state either SS with probability 0.9 or SC with probability 0.1. Suppose it was SC; then for the second state we need to generate either CS with probability 0.4 or CC with probability 0.6. How is this done?

Sampling from general probability distributions is done using the cumulative probability distribution (cdf) and a uniform random number. For example, if we are in state SS, i = 1, then the cdf for the choice of next state is [ .9  1  1  1 ], obtained in r by cumsum( pp[1,] ). Thus in general the next state j is found from the current state i by, for example,


j <- sample( c(1,2,3,4), prob=pp[i,], size=1 )

Run a simulation by wrapping this in a loop such as

i <- 1                     # Initial state
for (t in (1:99)) {
  i <- sample( c(1,2,3,4), prob=pp[i,], size=1 )
  cat(i)                   # prints the value of i
}
cat("\n")                  # Ends the line

Now save the history of states by executing

i <- array( dim=200 )      # Set up an array
i[1] <- 1                  # Initial state
for (t in (1:199)) {
  i[t+1] <- sample( c(1,2,3,4), prob=pp[i[t],], size=1 )
}

Use hist(i, freq=FALSE, breaks=c(0,1,2,3,4) ) to draw a histogram and verify the long-term histogram is reasonably close to that predicted by theory. (The proportions are found by table(i)/sum(table(i)); compare to pie.)

Ex. 8.24: Packets of information sent via modems down a noisy telephone line often fail. For example, suppose in 31 attempts we find that packets are sent successfully or fail with the following pattern, where 1 denotes success and 0 failure:

1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,

0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1 .

There seem to be runs of success and runs of failure, so we guess that these may not be independent of each other (failures in communication networks are indeed generally correlated). Thus we try to model the sequence as a Markov chain.

Suppose the probability of success of the next packet only depends upon chance and the success or failure of the current attempt. Argue from the data that the transition matrix should be

   P ≈ [ 1/2  1/2 ]
       [ 1/3  2/3 ] .

Given this model, what is the long-term probability of success for each packet?

Suppose the probability of success of packet transmission depends upon chance and the success or failure of the previous two attempts.


Write down and interpret the four states of this Markov chain model. Use the data to estimate the transition probabilities, then form them into a 4 × 4 transition matrix P . Compute (in r) a high power of P to see that the long-term distribution of states is approximately π = [ .4  .2  .2  .2 ], and hence deduce this model would predict a slightly higher overall success rate.

Ex. 8.25: Let a Markov chain with the state space S = {0, 1, 2} be such that:

� from state 0 the particle jumps to states 1 or 2 with equal probability 1/2;

� from state 2 the particle must next jump to state 1;

� state 1 is absorbing (that is, once the particle enters state 1, it cannot leave).

Draw the transition diagram and write down the transition matrix.

Ex. 8.26: For a Markov chain with the transition matrix

   P = [ 0    0.1  0.9 ]
       [ 0.8  0    0.2 ]
       [ 0.7  0.3  0   ] ,

draw the transition diagram and find the probability that the particle will be in state 1 after three jumps, given it started in state 1.

Ex. 8.27: (sampling problem) Let X be a Markov chain. Show that the sequence Yn = X2n, n ≥ 0, is a Markov chain (such chains are called imbedded in X).

Ex. 8.28: (lumping states together) Let X be a Markov chain. Show that Yn = |Xn|, n ≥ 0, is not necessarily a Markov chain.

Ex. 8.29: Classify the states of the following Markov chains and determine whether they are absorbing, transient or recurrent:

   P1 = [ 0    1/2  1/2 ]
        [ 1/2  0    1/2 ]
        [ 1/2  1/2  0   ] ;

   P2 = [ 0    0    0  1 ]
        [ 0    0    0  1 ]
        [ 1/2  1/2  0  0 ]
        [ 0    0    1  0 ] ;


   P3 = [ 1/2  0    1/2  0    0   ]
        [ 1/4  1/2  1/4  0    0   ]
        [ 1/2  0    1/2  0    0   ]
        [ 0    0    0    1/2  1/2 ]
        [ 0    0    0    1/2  1/2 ] ;

   P4 = [ 1/4  3/4  0    0    0 ]
        [ 1/2  1/2  0    0    0 ]
        [ 0    0    1    0    0 ]
        [ 0    0    1/3  2/3  0 ]
        [ 1    0    0    0    0 ] .

Ex. 8.30: Classify the states of the Markov chains with the following transition probability matrices:

   P1 = [ 0    1/2  1/2 ]
        [ 1/2  0    1/2 ]
        [ 1/2  1/2  0   ] ;

   P2 = [ 0  0  1/2  1/2 ]
        [ 1  0  0    0   ]
        [ 0  1  0    0   ]
        [ 0  1  0    0   ] ;

   P3 = [ 1/2  1/2  0    0    0   ]
        [ 1/2  1/2  0    0    0   ]
        [ 0    0    1/2  1/2  0   ]
        [ 0    0    1/2  1/2  0   ]
        [ 1/4  1/4  0    0    1/2 ] .

Ex. 8.31: Consider the Markov chain consisting of four states and having transition probability matrix

   P = [ 0  0  1/2  1/2 ]
       [ 1  0  0    0   ]
       [ 0  1  0    0   ]
       [ 0  1  0    0   ] .

Which states are recurrent?

Ex. 8.32: Let a Markov chain be defined by the matrix

   P = [ 1/2  1/2  0    0    0   ]
       [ 1/2  1/2  0    0    0   ]
       [ 0    0    1/2  1/2  0   ]
       [ 0    0    1/2  1/2  0   ]
       [ 1/4  1/4  0    0    1/2 ] .

What can you say about its decomposability into disjoint Markov chains and the transient and recurrent nature of its states?


Ex. 8.33: (A communications system) Consider a communications system which transmits the digits 0 and 1. Each digit transmitted must pass through several stages, at each of which there is a probability p that the digit entered will be unchanged when it leaves. Letting Xn denote the digit entering the nth stage, define its transition probability matrix. Show by induction that

   P^n = [ 1/2 + (1/2)(2p − 1)^n    1/2 − (1/2)(2p − 1)^n ]
         [ 1/2 − (1/2)(2p − 1)^n    1/2 + (1/2)(2p − 1)^n ] .

Ex. 8.34: Suppose that coin 1 has probability 0.7 of coming up heads, and coin 2 has probability 0.6 of coming up heads. If the coin flipped today comes up heads, then we select coin 1 to flip tomorrow, and if it comes up tails, then we select coin 2 to flip tomorrow. If the coin initially flipped is equally likely to be coin 1 or coin 2, then what is the probability that the coin flipped on the third day after the initial flip is coin 1?

Ex. 8.35: For a series of dependent trials, the probability of success on any trial is (k + 1)/(k + 2) where k is equal to the number of successes on the previous two trials. Compute

   lim_{n→∞} Pr {success on the nth trial} .

Ex. 8.36: An organisation has N employees, where N is a large number. Each employee has one of three possible job classifications and changes classifications (independently) according to a Markov chain with transition probabilities

   P = [ 0.7  0.2  0.1 ]
       [ 0.2  0.6  0.2 ]
       [ 0.1  0.4  0.5 ] .

What percentage of employees are in each classification in the long run?

8.7.1 Answers to selected Exercises

8.9 The chain is ergodic, and the steady state probabilities are (to three decimal places) [0.165, 0.247, 0.126, 0.183, 0.278].

8.11 States 1, 2, 3, 5 and 6 are recurrent. State 4 is transient. S1 = {1, 3, 5} and S2 = {2, 6} are two closed sets. Since states 4 and 1 do not communicate, the chain is not ergodic.


8.14

          0      1
   P = 0 [ 6/14   8/14  ]
       1 [ 8/35  27/35  ]

8.15 The process may be modelled as an 8 state Markov chain with states {[111], [112], [121], [122], [211], [212], [221], [222]}, where 1 indicates no rain, 2 indicates rain, and a triple [abc] indicates the weather was a the day before yesterday, b yesterday and c today.

            [111] [112] [121] [122] [211] [212] [221] [222]
     [111] [ 0.8   0.2   0     0     0     0     0     0   ]
     [112] [ 0     0     0.4   0.6   0     0     0     0   ]
     [121] [ 0     0     0     0     0.6   0.4   0     0   ]
 P = [122] [ 0     0     0     0     0     0     0.4   0.6 ]
     [211] [ 0.6   0.4   0     0     0     0     0     0   ]
     [212] [ 0     0     0.4   0.6   0     0     0     0   ]
     [221] [ 0     0     0     0     0.6   0.4   0     0   ]
     [222] [ 0     0     0     0     0     0     0.2   0.8 ]

8.16

   P^2 = [ 17/30  9/40  5/24  ]
         [ 16/30  9/30  1/6   ]
         [ 17/30  3/20  17/60 ] .

(a) [P^2]23 = 1/6

(b) Pr {X4 = 1 | X2 = 1} = 17/30

(c) p(2) = p(0)P^2 = [ 17/30  9/40  5/24 ]

(d) p(0) = [ 1/3  1/3  1/3 ], therefore Pr {X2 = 3} = [p(0)P^2]3 = 79/360

(e) Pr {X2 = 3 | X1 = 2 & X0 = 1} = Pr {X2 = 3 | X1 = 2} = 1/3

(f) Pr {X2 = 3 & X1 = 2 | X0 = 1} = Pr {X2 = 3 | X1 = 2} × Pr {X1 = 2 | X0 = 1} = 1/3 × 1/4 = 1/12

8.17 (a) π = [ 7/12  5/12 ]

     (b) π = [ 2/9  2/9  5/9 ]

     (c) π = [ 1/3  1/3  1/3 ]

8.18

   P = [ 0    1    0   ]
       [ 1/4  1/2  1/4 ]
       [ 0    1    0   ]

8.19 The process may be modelled as a 6 state Markov chain with the following states ∈ {[200], [101], [110], [020], [011], [002]}. The three numbers in the label for each state describe the number of machines currently working, in repair for 1 day, and in repair for 2


days. For example, the state [020] means no machines are currently working and both machines were broken down yesterday and would be available again the day after tomorrow. If we are currently at state [020] then after one transition (day) the process will move to state [002]. Following this process we find the transition matrix as

            [200] [101] [110] [020] [011] [002]
     [200] [ 4/9   0     4/9   1/9   0     0 ]
     [101] [ 2/3   0     1/3   0     0     0 ]
 P = [110] [ 0     2/3   0     0     1/3   0 ]
     [020] [ 0     0     0     0     0     1 ]
     [011] [ 0     1     0     0     0     0 ]
     [002] [ 1     0     0     0     0     0 ]

8.20 (a) The states are the volume of water in the reservoir, which although continuous is assumed to take discrete values ∈ {1, 2, 3, 4}. Hence the transition matrix is

   P = [ 0.7  0.2  0.1  0.0 ]
       [ 0.3  0.4  0.2  0.1 ]
       [ 0.0  0.3  0.4  0.3 ]
       [ 0.0  0.0  0.3  0.7 ]

The steady state probabilities may be computed by solving π = πP , where the elements of π sum to one, to give

   π = [ 0.2  0.2  0.2667  0.3333 ] .

(b) The steady state probabilities represent the long term average probability of finding the reservoir in each state. For example, in the long run we expect the reservoir to start, or end, with a volume of 1 million m^3 20% of the time, and a volume of 4 million m^3 33.3% of the time.

8.21 (a) The states are the age of lights in months ∈ {0, 1, 2}; then the transition matrix associated with this process is

   P = [ 0.15  0.85  0.0  ]
       [ 0.29  0.0   0.71 ]
       [ 1.0   0.0   0.0  ] .

(b) π = [ 0.407  0.346  0.246 ]

(c) Average number of failures per month = 0.4076 × 20 = 8.15 units

(d) Long term average cost per month = $15 × 8.15 = $122.28


8.22 (a) Let Xn = elapsed time in hours (at time n) since adjustment, ∈ {0, 1, 2, 3}. Assume that adjustments occur on the hour only and that the time taken to service the machine is negligible. (Alternative sets of assumptions are possible.) The transition probabilities can be found by converting the given table as follows. If the machine is adjusted 100 times, the number of these adjustments which we expect to “survive” are given by

   Time since adjustment (hours)   Number surviving
   0                               100
   1                               100
   2                                80
   3                                50
   4                                 0

We then have

   P01 = Pr {Xn+1 = 1 | Xn = 0} = 100/100 = 1 ,
   P12 = 80/100 = 0.8 ,
   P23 = 50/80 = 0.625 .

Since none survive to “age” 4, a state 4 is not needed. Hence the required transition matrix is

   P = [ 0      1  0    0     ]
       [ 0.2    0  0.8  0     ]
       [ 0.375  0  0    0.625 ]
       [ 1      0  0    0     ] .

(b)
   π = [ 10/33  10/33  8/33  5/33 ]

In the long run “breakdowns” occur in a proportion π0 of the hours, and each breakdown costs $800. Therefore the mean cost per hour = 10/33 × 800 = $242.42 .

(c) Now Xn ∈ {0, 1, 2} and

   P = [ 0    1  0   ]
       [ 0.2  0  0.8 ]
       [ 1    0  0   ] ,

with steady state distribution

   π = [ 5/14  5/14  4/14 ] .


In the long run, the proportion of hours in which a breakdown occurs

   = 5/14 × 0 + 5/14 × 0.2 + 4/14 × 0.375 = 5/28 ,

and each breakdown costs $800. The proportion of time that readjustment occurs without a breakdown

   = Pr {reaching age 3 and no breakdown occurs}
   = 4/14 × 0.625 = 5/28 ,

and each readjustment alone costs $200. Hence, the long term cost per hour of this policy is

   = 5/28 × 800 + 5/28 × 200 = $178.57

8.34 Model this as a Markov chain with two states: C1 means that coin 1 is to be tossed; C2 means that coin 2 is to be tossed. From the question, the state transition diagram has a self-loop at C1 labelled Pr(H) = 0.7, an arrow from C1 to C2 labelled Pr(T) = 0.3, an arrow from C2 to C1 labelled Pr(H) = 0.6, and a self-loop at C2 labelled Pr(T) = 0.4. From this diagram the transition matrix is read off to be

   P = [ 0.7  0.3 ]
       [ 0.6  0.4 ] ,

whence, starting from the state π0 = [ 0.5  0.5 ], the prediction for the states after three tosses is

   π3 = π0 P^3 = [ 0.6665  0.3335 ] .

Thus the probability of tossing coin 1 on the third day is 0.6665 .


Module 9

Other Models

Module contents
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . 206
9.2 Using other models . . . . . . . . . . . . . . . . . . . . 206
9.3 Seasonally adjusted models . . . . . . . . . . . . . . . . 206
9.4 Regime-dependent autoregressive models . . . . . . . . . 207
9.5 Neural networks . . . . . . . . . . . . . . . . . . . . . . 207
9.6 Trend-regression models . . . . . . . . . . . . . . . . . 208
9.7 Multivariate time series models . . . . . . . . . . . . . 208
9.8 Forecasting by means of a model . . . . . . . . . . . . . 211
9.9 Finding similar past patterns . . . . . . . . . . . . . . 211
9.10 Singular spectrum analysis . . . . . . . . . . . . . . . 212

Module objectives

Upon completion of this module students should be able to:

� understand there are numerous other types of models for modellingtime series;

� name some other time series models used in climatology;

� explain one of the methods in more detail.


9.1 Introduction

In this part of the course, one particular type of time series methodology has been discussed: the Box–Jenkins models, or arma type models. There are a large number of other possible models for time series, however. In this Module, some of these models are briefly discussed. You are required to know the details of only one of these models in particular, but should at least know the names and ideas behind the others.

You don’t need to understand all the details in this Module; but see Assignment 3.

9.2 Using other models

The time series models previously discussed—arima models and Markov chain models—are reasonably simple. There are, however, many far more complicated models that have not been studied here. In an attempt to compare numerous types of forecasting methods, Spyros Makridakis and Michele Hibon conducted the M3-Competition (following the M- and M2-Competitions), which compared 24 time series methods (including the Box–Jenkins approach adopted here, plus many more complicated models) on 3003 different time series. This was one of the conclusions from the competition:

Statistically sophisticated or complex methods do not necessarily produce more accurate forecasts than simpler ones.

In particular, the method that unofficially won the competition was the theta-method (see Assimakopoulos and Nikolopoulos [7]), which was shown later (see Hyndman and Billah [22]) to be simply exponential smoothing with a drift (or trend) component. Exponential smoothing was listed in Section 1.4 as a simple method.
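Exponential smoothing with a trend component is available in r through HoltWinters; a minimal sketch in that spirit, assuming x is a ts object:

> fit <- HoltWinters(x, gamma = FALSE)   # gamma = FALSE drops the seasonal term
> predict(fit, n.ahead = 12)             # forecasts 12 steps ahead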

The lesson is clear: methods that appear clever, complicated or technical do not necessarily forecast better; simple methods are often the best. However, all methods have situations in which they perform well, and there are other methods worthy of consideration. Some of those are considered here.

9.3 Seasonally adjusted models

Activity 9.A: Read Chu & Katz [13] in the selected readings.


Table 9.1: The parameter estimates for an ar(3) model with seasonally varying parameters for modelling the seasonal SOI. Note the seasons refer to northern hemisphere seasons.

                          SOI predictand
   Parameter    Spring    Summer    Fall      Winter
   estimates    (t = 1)   (t = 2)   (t = 3)   (t = 4)
   φ1(t)         0.5268    0.7832    0.8554    0.7736
   φ2(t)         0.1158    0.2568    0.1700    0.1971
   φ3(t)        −0.2011   −0.3816   −0.0674   −0.1808

Chu & Katz [13] discuss fitting arma type models to the seasonal and monthly SOI using an arma model whose coefficients change according to the season. They fit a seasonally varying ar(3) model to the seasonal SOI, {Xt}, of the form

   Xt = φ1(t)Xt−1 + φ2(t)Xt−2 + φ3(t)Xt−3 + et ,

with the parameters as shown in Table 9.1.
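As an illustration, such a model can be simulated in r from the Table 9.1 estimates; a minimal sketch only, assuming a unit noise variance and an assumed alignment of the seasons to the cycle Spring, Summer, Fall, Winter:

> phi <- matrix(c( 0.5268, 0.1158, -0.2011,   # Spring
+                  0.7832, 0.2568, -0.3816,   # Summer
+                  0.8554, 0.1700, -0.0674,   # Fall
+                  0.7736, 0.1971, -0.1808),  # Winter
+               nrow = 4, byrow = TRUE)
> x <- rep(0, 400); e <- rnorm(400)           # assumed unit-variance noise
> for (t in 4:400) {
+   s <- ((t - 1) %% 4) + 1                   # season of time t (1 = Spring)
+   x[t] <- sum(phi[s, ] * x[(t - 1):(t - 3)]) + e[t]
+ }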

9.4 Regime-dependent autoregressive models

Activity 9.B: Read Zwiers & von Storch [53] in the selected readings.

Zwiers & von Storch [53] fit a regime-dependent ar model (ram) to the SOI, described by a stochastic differential equation. (These models are also called Threshold Autoregressive Models by other authors, such as Tong [44].) In essence, the SOI is modelled using one of two indicators of the SOI (either the South Pacific Convergence Zone hypothesis, or the Indian Monsoon hypothesis, as explained in the article), and a seasonal indicator.

9.5 Neural networks

Neural networks consist of processing elements (called nodes) joined by weighted connections. Each processing element takes as input the weighted sum of the outputs of the nodes connected to it. The input to the processing element is transformed (linearly or non-linearly), which is then the output


(and can be passed to other processing elements). Neural networks are said to be loosely based on the operation of the human brain (!).

Maier & Dandy [31] fit neural networks to daily salinity data at Murray Bridge, South Australia, as well as numerous Box–Jenkins models. They conclude the Box–Jenkins models produce better one-day ahead forecasts, while the neural networks produce better long term forecasts.

Guiot & Tessier [18] use neural networks and ar(3) models to detect the effects of pollution on the widths of tree rings, and hence on tree growth, from 1900 to 1983.

9.6 Trend-regression models

Visser & Molenaar [47] discuss a trend-regression model for modelling a time series {Yt} in the presence of k other variables {Xi,t} for i = 1, . . . , k. These models are of the form

   Yt = µt + δ1,tX1,t + · · · + δk,tXk,t + et

where the stochastic trend µt is described using an arima(p, d, q) process, and et is the noise term. These models are written as TR(p, d, q, k) models, where p, d and q are the usual parameters for an arima(p, d, q) model, and k is the number of explanatory variables. The authors state the trend-regression models ‘include most trend and regression models used in the literature’.

One particular model they fit is for modelling annual mean surface air temperatures in the northern hemisphere from 1851 to 1990, {Tt}. They fit a TR(0, 2, 0, 2) model using the Southern Oscillation Index (SOI) and the index of volcanic dust (VDI) on the northern hemisphere as covariates. The fitted model is

   Tt = µt − 0.050 SOIt − 0.086 VDIt + et

where the trend µt is described using an arima(0, 2, 0) model (parameters not given).

9.7 Multivariate time series models

In this course, only univariate time series have been discussed. It is possible, however, for two time series to be related to each other. In this case, there is a multivariate time series.


Figure 9.1: Two time series that might be expected to vary together. Top: the SOI; Bottom: the sea level air pressure anomaly at Easter Island.

Example 9.1: The SOI and the sea level air pressure anomaly at Easter Island might be expected to vary together, since the SOI is related to pressure anomalies at Darwin and Tahiti. The two are plotted together in Figure 9.1.

In a similar way to the autocorrelation, the cross correlation can be defined as:

$$\gamma_{XY} = E[(X_t - \mu_X)(Y_{t+k} - \mu_Y)],$$

where µX is the mean of the time series {Xt} and µY is the mean of the time series {Yt}, and k is again the lag. The cross correlation can be computed for various k. For this example, the plot of the cross correlation is shown in Figure 9.2.
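In r, the cross correlation can be computed and plotted with the ccf function; a minimal sketch, assuming soi and slpa are ts objects holding the two series:

> # A sketch: 'soi' and 'slpa' are assumed ts objects holding the
> # SOI and the sea level pressure anomaly series.
> ccf(soi, slpa, main = "SOI & slpa")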

Figure 9.2: The cross correlation between the SOI and the sea level air pressure anomaly at Easter Island.

The cross correlation indicates there is a significant correlation between the two series near a lag of zero. That is, when the SOI goes up, there is a strong chance the sea level air pressure anomaly at Easter Island will also go up at the same time.

9.8 Forecasting by means of a model

Forecasting by means of a model is common in meteorology and astronomy. The weather is routinely forecast by special groups in all developed countries. They use data from satellites and terrestrial weather stations as input to a fluid dynamical model of the earth’s atmosphere. The model is simply projected forward by a type of numerical integration to produce the forecasts.

9.9 Finding similar past patterns

Suppose we have a time series {Xt}t≥0 and we wish to be able to forecast future values. We wish to identify an estimator of the next value of the time series, say Xt+1|t. One way of doing this is to search through the history of the time series and find a time when the past m values of the time series have approximately occurred before.

For example, suppose we wish to forecast tomorrow’s maximum daily temperature at Hervey Bay and wish to use the past five days’ maximum temperatures to make this forecast. The strategy is to search through the available history of the maximum temperatures at Hervey Bay and find a time in the past when five maximum temperatures have been very similar to the maximum temperatures over the last five days. Whatever the next day’s maximum temperature was in the past will be the prediction for tomorrow.

How do we determine which five past values are “like” the pattern we are currently observing?

Call the current m values vector x. For any past series of m values, say vector y, one measure of the distance between these two vectors is defined by

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{m} (x_k - y_k)^2},$$

where xk is the kth element of the vector x. (This is the most common way to define distance between vectors.)


The choice of m needs to be made carefully after consideration of the time series in question. The idea here is simply this: we are trying to find all the times in the past when things were similar to now.
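As a minimal sketch of this strategy (with simulated data; match.forecast is a hypothetical helper written for illustration, not a built-in function):

> # A sketch only: find the past window of m values closest (in the
> # distance above) to the current m values, and forecast the value
> # that followed that window. ('x' is simulated data.)
> match.forecast <- function(x, m = 5) {
+     n <- length(x)
+     current <- x[(n - m + 1):n]
+     best.d <- Inf
+     best.t <- NA
+     for (t in m:(n - m - 1)) {
+         d <- sqrt(sum((x[(t - m + 1):t] - current)^2))
+         if (d < best.d) {
+             best.d <- d
+             best.t <- t
+         }
+     }
+     x[best.t + 1]
+ }
> x <- as.vector(arima.sim(list(ar = 0.8), n = 500))
> match.forecast(x, m = 5)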

9.10 Singular spectrum analysis

Singular spectrum analysis is a method which attempts to identify naturally occurring patterns in a time series. Keeping the most important patterns and removing the others (which are regarded as noise) potentially leaves a series which represents the underlying dynamics and is also easier to forecast.
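A rough sketch of the central idea (an assumption-laden illustration, not a full SSA implementation): embed the series into a trajectory matrix of lagged copies, take its singular value decomposition, and keep only the leading patterns.

> # A sketch only: embed the series into a trajectory matrix, take
> # the SVD, and approximate the matrix from the k leading patterns;
> # full SSA then averages the anti-diagonals of Xk to recover a
> # filtered series. (Simulated data; L is a tuning choice.)
> x <- as.vector(arima.sim(list(ar = 0.9), n = 200))
> L <- 20
> X <- embed(x, L)
> s <- svd(X)
> k <- 2
> Xk <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])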


Strand II

Multivariate Statistics


Module 10
Introduction

Module contents

10.1 Introduction
10.2 Multivariate data
10.3 Preview of methods
10.4 Review of mathematical concepts
10.5 Software
10.6 Displaying multivariate data
10.7 Some hypothesis tests
10.8 Further comments
10.9 Exercises
     10.9.1 Answers to selected Exercises

Module objectives

Upon completion of this module students should be able to:

• recognize multivariate data;

• give some examples of multivariate data;

• list some types of multivariate statistical methods;

• appropriately display multivariate data.


10.1 Introduction

In this Module, some basic multivariate statistical techniques are introduced. The emphasis is on the application rather than the details and the theory; there is insufficient time to delve too far into the theory.

This Module is based on the textbook Multivariate Statistical Methods by Bryan F. J. Manly. This book includes numerous examples using real data sets, although most examples do not have a climatological flavour. Some examples with such a flavour are given in these notes.

There are numerous books available about multivariate statistics, and many are available from the USQ library. You may find other books useful to refer to during your study of the multivariate analysis component of this course.

As a general comment, you will be expected to read the textbook to understand this Module. The Study Book will supplement these notes where necessary, provide extra examples, and make notes about using the r software for performing the analyses.

10.2 Multivariate data

Activity 10.A: Read Manly, section 1.1.

Multivariate analysis is popular in many areas of science, engineering and business; the examples give some flavour of typical problems. Climatology is filled with examples of multivariate data. There are numerous climatological variables measured on a routine basis which can collectively be considered multivariate data.

One of the most common sources of multivariate data is the Sea Surface Temperatures (SST). SSTs are measurements of the temperature of the oceans, measured at locations all around the world.

In addition, multivariate data can be created from any univariate series since climatological variables are often time-dependent. The original data, with say n observations, can be designated as X1. The series can then be shifted back t time steps to create a new variable X2. Both variables can be adjusted to have a length of n − t, when the variables could now be identified as X′1 and X′2. The two variables (X′1, X′2) can be considered multivariate data.
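A minimal sketch of this construction in r (with a simulated series standing in for a climatological one):

> # A sketch: pair a series with a copy of itself shifted back t
> # time steps, giving two variables of length n - t.
> x <- as.vector(arima.sim(list(ar = 0.6), n = 100))
> t.shift <- 3
> n <- length(x)
> x1 <- x[(t.shift + 1):n]
> x2 <- x[1:(n - t.shift)]
> lagged <- cbind(x1, x2)
> plot(x1, x2)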


10.3 Preview of methods

Activity 10.B: Read Manly, section 1.2.

This section introduces some different types of multivariate methods. Not all the methods will be discussed in this course, but it is useful to know the types of methods available.

10.4 Review of mathematical concepts

Activity 10.C: Briefly read Manly, Chapter 2.

This Chapter contains material that should be revision for the most part. You may find it useful to refer back to Chapter 2 throughout this course. Pay particular attention to sections 2.5 to 2.7 as many multivariate techniques use these concepts.

10.5 Software

The software package r will be used for this Part, as with the time series component. See Sect. 1.5.1 for more details. Most statistical programs will have multivariate analysis capabilities.

For this part of the course, the r multivariate analysis library is needed; this should be part of the package that you install by default. To enable this package to be available to r, type library(mva) at the r prompt when r is started. For an idea of what functions are available in this library, type library(help=mva) at the r prompt.

10.6 Displaying multivariate data

With multivariate data, any plots will be of a multi-dimensional nature, and will therefore be difficult to display on a two-dimensional page. Plotting data is, of course, always useful for understanding the data and detecting possible problems in the data (outliers, errors, missing values, and so on). Some creative solutions have been developed for plotting multivariate data.


Activity 10.D: Read Manly, Chapter 3. We will not discuss Andrews’ method.

Many of the plots discussed are available in the package S-Plus, a commercial package not unlike r. In the free software, r, however, some of these plots are not available (in particular, Chernoff faces). The general consensus is that it would be a lot of work for a graphic that isn’t that useful. One particular problem with Chernoff faces is that the faces (and interpretations) can change dramatically depending on what variables are allocated to which dimensions of the face.

However, the star plot is available using the function stars. The “Draftsman’s display” is available just by plotting a multivariate dataset; see the following example.

The profile plots are useful, but only when there are not too many variables or too many groups, otherwise the plots become too cluttered to be of any use.

Example 10.1: Hand et al. [19, dataset 26] give a number of measurements of air pollution from 41 cities in the USA. The data consist of seven variables (plus the names of the cities), generally means from 1969 to 1971:

• SO2: the SO2 content in micrograms per cubic metre;

• temp: the average annual temperature in degrees F;

• manufac: the number of manufacturing enterprises employing 20 or more workers;

• population: the population in thousands, in the 1970 census;

• wind.speed: the average annual wind speed in miles per hour;

• annual.precip: the average annual precipitation in inches;

• days.precip: the average number of days with precipitation each year.

The following code shows how to plot this multi-dimensional data in r. First, a Draftsman’s display (this is an unusual term; it is often called a pairwise scatterplot):

> library(mva)

> us <- read.table("usair.dat", header = TRUE)

> plot(us[, 1:4])

> pairs(us[, 1:4])


Figure 10.1: A multivariate plot of the US pollution dataset.

The plot is shown in Fig. 10.1. Star plots can also be produced:

> stars(us[1:11, ], main = "Pollution measures in 41 US cities",

+ flip.labels = FALSE, key.loc = c(7.8,

+ 2))

> stars(us[1:11, ], main = "Pollution measures in 41 US cities",

+ flip.labels = FALSE, draw.segments = TRUE,

+ key.loc = c(7.8, 2))

The input key.loc changes the location of the ‘key’ that shows which variable is displayed where on the star; it was determined through trial and error. Alter the value of the input flip.labels to true (that is, set flip.labels=TRUE) to see what effect this has.

The star plot discussed in the text is in Fig. 10.2. Only the stars for the first eleven cities are shown so that the detail can be seen here. A variation of the star plot is given in Fig. 10.3, and is particularly instructive when seen in colour.

From the star plots, can you find any cities that look very similar? That look very different?


Figure 10.2: A star plot of the US pollution dataset.

Figure 10.3: A variation of the star plot of the US pollution dataset.


10.7 Some hypothesis tests

Activity 10.E: Read Manly, Chapter 4. We will not dwell on the details, but it is important that you understand the issues involved (especially section 4.4).

Currently, Hotelling’s T² test is not implemented in r.
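Although not built in, the one-sample statistic can be computed directly from its definition; a sketch only, in which the data matrix X and the null mean mu0 are hypothetical:

> # A sketch: one-sample Hotelling T^2 from its definition
> # T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0), with the usual
> # F transformation. (X and mu0 are hypothetical.)
> X <- matrix(rnorm(60, mean = 5), ncol = 3)
> mu0 <- c(5, 5, 5)
> n <- nrow(X)
> p <- ncol(X)
> xbar <- colMeans(X)
> S <- cov(X)
> T2 <- drop(n * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0))
> F.stat <- (n - p)/(p * (n - 1)) * T2
> 1 - pf(F.stat, p, n - p)    # the p-value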

10.8 Further comments

One difficulty with multivariate data has already been discussed: it may be hard to display the data in a useful way. Because of this, it is often difficult to find any outliers in multivariate data. Note that an observation may not appear as an outlier with regard to any particular variable, but it may have a strange combination of variables.

Multivariate data also can present computational difficulties. The mathematics involved in using multivariate techniques is usually matrix based, and so often very large matrices will be in use. This can create memory problems, particularly when matrix inversion is necessary. Many computational tricks and advanced methods are employed in standard software for performing the computations. Techniques such as singular value decomposition (SVD) are common. Indeed, different answers are often obtained in different software packages because different algorithms are used.
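As a small sketch of the SVD connection (simulated data): the squared singular values of a centred data matrix, divided by n − 1, equal the eigenvalues of its covariance matrix, which is one reason SVD-based routines can avoid forming large covariance matrices explicitly.

> # A sketch: the SVD of centred data gives the same eigenvalues as
> # an eigen-analysis of the covariance matrix.
> X <- scale(matrix(rnorm(40), ncol = 4), center = TRUE, scale = FALSE)
> svd(X)$d^2/(nrow(X) - 1)
> eigen(cov(X))$values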

The main multivariate techniques can be broadly divided into the following categories:

• Data reduction techniques. These techniques reduce the dimension of the data at the expense of losing a small amount of information. A balance is made between reducing the dimension of the data and retaining as much information as possible. Techniques such as principal components analysis (PCA; see Module 11) and factor analysis (FA; see Module 12) are in this category.

• Classification techniques. These techniques attempt to classify data into a number of groups. Techniques such as cluster analysis (see Module 13) and discriminant analysis fall into this category.

Figure 10.4: A star plot of the Toowoomba weather data.

Consider the data in Example 10.1. We may wish to reduce the number of variables from eight to two or three. If we could reduce the number of variables to just one, this might be called a ‘pollution index’. This would be an example of data reduction. Data reduction works with the variables.

However, we may wish to classify the 41 cities into a number of groups depending on their characteristics. We may be able to identify three groups: high pollution, moderate pollution and low pollution categories. This is a classification problem. Classification works with the individuals.

10.9 Exercises

Ex. 10.2: The data set twdecade.dat contains (among other things) the average rainfall, maximum temperature and minimum temperature at Toowoomba for the decades 1890s to the 1990s.

Produce a multivariate plot of the three variables by decade. Which decades appear similar?

Ex. 10.3: The data set twdecade.dat contains the average rainfall, maximum temperature and minimum temperature at Toowoomba for each month. It should be possible to see the seasonal pattern in temperatures and rainfall. Produce a multivariate plot that shows the features by month.


Ex. 10.4: The data set emdecade.dat contains the average rainfall, maximum temperature and minimum temperature at Emerald for the decades 1890s to the 1990s.

Produce a multivariate plot of the three variables by decade. Which decades appear similar? How similar are the patterns to those observed for Toowoomba?

Ex. 10.5: The data set emdecade.dat contains the average rainfall, maximum temperature and minimum temperature at Emerald for each month. It should be possible to see the seasonal pattern in temperatures and rainfall. Produce a multivariate plot that shows the features by month. How similar are the patterns to those observed for Toowoomba?

Ex. 10.6: The data in the file countries.dat contains numerous variables from a number of countries, and the countries have been classified by region. Create a plot to see which countries appear similar.

Ex. 10.7: This question concerns a data set that is not climatological, but you may find it interesting. The data file chocolates.dat, available from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/popular/chocolates.html, contains measurements of the price, weight and nutritional information for 17 chocolates commonly available in Queensland stores. The data was gathered in April 2002 in Brisbane. Create a plot to see which chocolates appear similar. Are there any surprises?

Ex. 10.8: The data file soitni.txt contains the SOI and TNI from 1958 to 1999. The TNI is related to sea surface temperatures (SSTs), and SOI is also known to be related to SSTs. It may be expected, therefore, that there may be a relationship between the two indices. Create a plot to examine if such a relationship exists.

10.9.1 Answers to selected Exercises

10.2 A star plot can be found as follows:

> td <- read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/climatology/twdecade.dat",

+ header = TRUE)

> head(td)

        rain   maxt   mint   radn   pan    vpd
1890 1087.22 22.426 11.430 17.798 4.521 14.526
1900  850.78 22.426 11.430 17.798 4.521 14.526
1910  856.65 22.426 11.430 17.798 4.521 14.526
1920  921.28 22.427 11.431 17.798 4.521 14.527
1930  931.85 22.426 11.430 17.798 4.521 14.526
1940  969.08 22.427 11.431 17.798 4.521 14.527

> stars(td[, 1:3], draw.segments = TRUE,

+ key.loc = c(7, 2), main = "Toowoomba weather by Decade")

The plot (Fig. 10.4) shows a trend of increasing rainfall from the 1900s to the 1950s, a big drop in the 1960s, then a big jump in the 1970s. The 1990s were very dry again. The 1990s were also a very warm decade (relatively speaking), and the 1960s very cold (relatively speaking).


Module 11
Principal Components Analysis

Module contents

11.1 Introduction
11.2 The procedure
     11.2.1 When should the correlation matrix be used?
     11.2.2 Selecting the number of pcs
     11.2.3 Interpretation of pcs
11.3 pca and other statistical techniques
11.4 Using r
11.5 Spatial pca
     11.5.1 A small example
     11.5.2 A larger example
11.6 Rotation of pcs
11.7 Exercises
     11.7.1 Answers to selected Exercises

Module objectives

Upon completion of this module students should be able to:


• understand the principles underlying principal components analysis;

• give a geometric interpretation of the principal components method;

• compute principal components from given data using r;

• select an appropriate number of principal components using suitable techniques;

• make sensible interpretations of the principal components where possible;

• compute the principal components scores for each subject;

• conduct a spatial pca;

• understand that rotation of principal components is a contentious issue.

11.1 Introduction

Principal components analysis (pca) is one of the basic multivariate techniques, and is also one of the simplest. Wilks [49, p 373] says of pca that it is “possibly the most widely used multivariate statistical technique in the atmospheric sciences. . . ” (to which statistical climatology belongs). pca is an example of a data reduction technique, one that reduces the dimension of the data. This is possible if the variables are correlated. pca attempts to find a new coordinate system for the data.

In climatology and related sciences, numerous variables are correlated, so pca is a commonly used technique. pca is also called empirical orthogonal functions (EOFs) or sometimes empirical eigenvector analysis (EEA).

Activity 11.A: Read Manly, Section 6.1. Read Wilks, the introduction to Section 9.3.

For a geometric interpretation of principal components in the two-dimensional case, see Fig. 11.1. Fig. 11.1 (top left panel) shows the original data. The data have a strong trend in the SW–NE direction. In Fig. 11.1 (top right panel), the two principal components are shown. The first principal component is in the SW–NE direction as expected. Fig. 11.1 (bottom left panel) shows one particular point being mapped to the new coordinates. In Fig. 11.1 (bottom right panel), a scree plot (see the next section) shows that most (almost 96%) of the original variation in the data can be explained by the first principal component only. That is, using just the first principal component reduces the dimension of the problem from two to one, with only a small loss of information.

Figure 11.1: A geometric interpretation for principal components in the two-dimensional case. Top left: some points are shown. They tend to be strongly oriented in one direction. Top right: the corresponding principal components are shown as bold lines. The main principal component is in the SW–NE direction. Bottom left: a particular point is mapped to the new coordinates. Bottom right: the scree plot shows that the first pc accounts for most of the variation in the data.

Note the pcs are simply linear combinations of the variables, and that they are orthogonal. Also note that my computer struggles to perform the required computations (it seems to manage despite complaining).

11.2 The procedure

Activity 11.B: Read Manly, Sections 6.2 and 6.3. Read Wilks, Section 9.3.1.

A pca is conducted on a set of n observations of p (probably correlated) variables.

It is important to realize that pca (and most other multivariate methods also) is based on finding the eigenvalues and eigenvectors. Also note that the eigenvalues and eigenvectors are found from either the correlation matrix or the covariance matrix. The next section discusses which should be used.

The four steps outlined at the bottom of p 80 of Manly show the general procedure. Software is used to do the computations.

Example 11.1: Consider the following data matrix X with two variables X1 and X2, with three observations (so n = 3) for each variable:

$$X = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 4 & 2 \end{bmatrix}.$$

The data are plotted in Fig. 11.2 (a). The mean for each variable is $\bar{X}_1 = 2$ and $\bar{X}_2 = 1$, so the centred matrix is

$$X_c = \begin{bmatrix} -1 & -1 \\ -1 & 0 \\ 2 & 1 \end{bmatrix}.$$

The centred data are plotted in Fig. 11.2 (b). It is usual to find the pcs from the correlation matrix. First, find the covariance matrix, found by computing $(X - \bar{X})^T (X - \bar{X})/n$ as follows¹:

$$P = (X - \bar{X})^T (X - \bar{X})/n = \frac{1}{3}\begin{bmatrix} 6 & 3 \\ 3 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2/3 \end{bmatrix}.$$

This matrix is always symmetric. From the diagonals of this matrix, var[X1] = 2 and var[X2] = 2/3. Using these two numbers, the diagonal matrix D can be formed:

$$D = \begin{bmatrix} 2 & 0 \\ 0 & 2/3 \end{bmatrix},$$

when, by convention, $D^{-1/2}$ refers to the matrix with the diagonals raised to the power −1/2:

$$D^{-1/2} = \begin{bmatrix} 1/\sqrt{2} & 0 \\ 0 & \sqrt{3}/\sqrt{2} \end{bmatrix}.$$

The correlation matrix, say R, can then be found as follows:

$$R = D^{-1/2} P D^{-1/2} = \begin{bmatrix} 1 & \sqrt{3}/2 \\ \sqrt{3}/2 & 1 \end{bmatrix}.$$

This matrix will always have ones on the diagonals.

The data can be scaled after being centred by dividing by the standard deviation (obtained from matrix $D^{-1/2}$); in this case, the centred and scaled data are

$$X_{cs} = \begin{bmatrix} -1/\sqrt{2} & -\sqrt{3}/\sqrt{2} \\ -1/\sqrt{2} & 0 \\ \sqrt{2} & \sqrt{3}/\sqrt{2} \end{bmatrix}.$$

The centred and scaled data are plotted in Fig. 11.2 (c). In effect, it is this data for which principal components are sought (since $R = X_{cs}^T X_{cs}/n$). Now,

$$R = \frac{1}{3}\begin{bmatrix} 3 & 3\sqrt{3}/2 \\ 3\sqrt{3}/2 & 3 \end{bmatrix} = \begin{bmatrix} 1 & \sqrt{3}/2 \\ \sqrt{3}/2 & 1 \end{bmatrix},$$

the correlation matrix.

The eigenvectors e and eigenvalues λ of matrix R are now required², which are the solutions of

$$(R - \lambda I)e = 0. \qquad (11.1)$$

¹ Notice that we have divided by n rather than n − 1. This is simply to follow what r does; more commonly, the divisor is n − 1 when sample variances (and covariances) are computed. I do not know why r divides by n instead of n − 1.

² This is a quick review of work already studied in MAT2100. Eigenvalues and eigenvectors are covered in most introductory algebra texts.


This system of equations is only consistent if

$$|R - \lambda I| = 0$$

(where |W| means the determinant of matrix W). This becomes

$$\begin{vmatrix} 1-\lambda & \sqrt{3}/2 \\ \sqrt{3}/2 & 1-\lambda \end{vmatrix} = 0,$$

or

$$(1-\lambda)^2 - 3/4 = 0,$$

with solutions $\lambda_1 = 1+\sqrt{3}/2 \approx 1.866$ and $\lambda_2 = 1-\sqrt{3}/2 \approx 0.134$. Substituting these eigenvalues into Equation (11.1) to find the eigenvectors gives

$$e_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}; \qquad e_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}.$$

These eigenvectors become the principal components, or pcs. There are two pcs as there were originally two variables. Note that the two eigenvectors (or the two pcs) are orthogonal: e1 · e2 = 0.

Generally, a matrix of eigenvectors is defined:

$$C = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix}.$$

(Note that these vectors are only defined up to a constant. These vectors have been defined to have a length of one, and the signs determined to be equivalent to those given in the current version of r I have³.)

The (directions of the) eigenvectors are shown plotted with the centred and scaled data in Fig. 11.2 (d).

There were originally two variables; there will be two pcs. The pcs are defined in the direction of the two eigenvectors. The proportion of the variance explained by each is found from the eigenvalues, and can be reported in a table like that shown below.

pc e’value % variance cumulative %

pc 1 1.866 93.3% 93.3%pc 2 0.134 6.7% 100%

2 100%

³ The signs may change from one version of r to another, or even differ between copies of r on different operating systems. This is true for almost any computer package generating eigenvectors. A change in sign simply means the eigenvectors point in the opposite direction and makes no effective difference to the analysis.


Figure 11.2: The data from Example 11.1. Top left: the original data; Top right: the data have been centred; Bottom left: the data have been centred and then scaled; Bottom right: the directions of the principal components have been added.


A scree plot can be drawn from this if you wish. In any case, one pc would be taken (otherwise, no simplification has been made for all this work!).

It is possible to then determine what ‘score’ each of the original points now has on the new variables (or principal components). These new scores, say Y, can be found from the ‘original’ variables, X, using

$$Y = XC.$$

In our case in this example, the matrix X will refer to the centred, scaled variables since the pcs were computed using these. Hence,

$$Y = \begin{bmatrix} -1/\sqrt{2} & -\sqrt{3}/\sqrt{2} \\ -1/\sqrt{2} & 0 \\ \sqrt{2} & \sqrt{3}/\sqrt{2} \end{bmatrix} \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix} = \begin{bmatrix} (-1-\sqrt{3})/2 & (-1+\sqrt{3})/2 \\ -1/2 & -1/2 \\ 1+\sqrt{3}/2 & 1-\sqrt{3}/2 \end{bmatrix}.$$

Thus, the point (1, 0) is now mapped to $((-1-\sqrt{3})/2,\,(-1+\sqrt{3})/2)$, the point (1, 1) is now mapped to $(-1/2, -1/2)$, and the point (4, 2) is now mapped to $(1+\sqrt{3}/2,\,1-\sqrt{3}/2)$ in the new system. In Fig. 11.2 (d), the point (1, 1) can be seen to be mapped to a negative value for the first pc, and the same (possibly negative) value for the second pc⁴. Thus, (−1/2, −1/2) seems a sensible value to which the second point could be mapped.

Since we only take one pc, the new variable takes the values

$$\left[\, (-1-\sqrt{3})/2, \;\; -1/2, \;\; 1+\sqrt{3}/2 \,\right]$$

which accounts for about 93% of the variation in the original data.

11.2.1 When should the correlation matrix be used?

Activity 11.C: Read Wilks, Section 9.3.4.

When the variables measure similar information, or have similar units of measurement, the covariance matrix is generally used. If the variables are on very different scales, the correlation matrix is usually the basis for pca.

⁴ We say ‘possibly’ since it depends on which direction the eigenvectors are pointing.


For example, Example 10.1 involves variables that are measured on different scales: SO2 was measured in micrograms per cubic metre, whereas manufac is simply the number of manufacturing enterprises with more than 20 employees. These are very different and measured in different units of measurement. For this reason, the pca should be based on the correlation matrix.

In effect, the correlation matrix transforms all of the variables to a similar scale so that the actual units of measurement are not important. Commonly, but not always, the correlation matrix is used. It is important to realize that, in general, different results are obtained using the correlation and covariance matrices.

11.2.2 Selecting the number of PCs

Activity 11.D: Read Wilks, Sections 9.3.2 and 9.3.3.

One of the difficult decisions to make in pca is how many principal components (pcs) are necessary to keep. The analysis will always produce as many pcs as there are variables, so keeping all the pcs means that no information is lost, but it also completely reproduces the data. This defeats the purpose of performing a data reduction technique such as pca—it simply complicates matters!

There are many criteria for making this decision, but no formal procedure (involving tests, etc.). There are only guidelines; some are given below. Using any of the methods without thought is dangerous and prone to error. Always examine the information and make a sensible decision that you can justify. Sometimes, there is not one clear decision. Remember the purpose of pca is to reduce the dimension of the data, so a small number of pcs is preferred.

Scree plots

One way to help make the decision is to use a scree plot. The scree plot is used to help decide between the important pcs (with large eigenvalues) and the less important pcs (with small eigenvalues). Some authors claim this method generally includes too many pcs. When using a screeplot, some pcs should be clearly more important than others. (This is not always the case, however.)


Total variance rule

Another proposed method is to take as many pcs as necessary until a certain percentage (often 90%) of the variance has been explained.
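A small sketch of this rule, for an object p returned by prcomp (see Sect. 11.4):

> # A sketch: the smallest number of pcs explaining at least 90%
> evals <- p$sdev^2
> min(which(cumsum(evals)/sum(evals) >= 0.9))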

Use above average PCs

This method recommends only keeping those pcs whose eigenvalues are greater than the average. (Note that if the correlation matrix has been used to compute the pcs, this means that pcs are retained if their eigenvalues are greater than one.) For a small number of variables (say 20), this method is reported to include too few pcs.
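Again for an object p returned by prcomp, a sketch of this rule:

> # A sketch: retain pcs whose eigenvalues exceed the average
> evals <- p$sdev^2
> sum(evals > mean(evals))    # how many pcs to retain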

Example 11.2: Kidson [29] analysed monthly means of surface pressures, temperature and rainfall using principal components analysis. In each case considered, 10 out of a possible 120 components accounted for more than 80% of the observed variance.

Example 11.3: Katz & Glantz [27] use a principal components analysis on rainfall data to show that no single rainfall index (or principal component) can adequately explain rainfall variation.

11.2.3 Interpretation of PCs

It is often useful to find an interpretation for the pcs, recalling that the pcs are simply linear combinations of the variables. It is not uncommon for the first pc to be a measure of ‘size’. Finding interpretations is often quite an art, and sometimes any interpretation is difficult to find.

Example 11.4: Mantua et al. [33] define the Pacific Decadal Oscillation (PDO) as the leading pc of monthly SST anomalies in the North Pacific Ocean.


11.3 PCA and other statistical techniques

pca is often used as a data reduction technique, as has been described in these notes. But there are other uses as well. For example, pca can be used on various types of data, often as a preliminary step before further analysis.

pca is sometimes used as a preliminary step before a regression analysis. In particular, if there are a large number of covariates, or there are a number of large correlations between covariates, a pca is often performed, a number of pcs selected, and these pcs used as covariates in a regression analysis.
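A hedged sketch of this idea with simulated data (the response y and covariate matrix X are hypothetical): compute the pcs, keep a few, and use them in lm.

> # A sketch of principal components regression: replace correlated
> # covariates with their first few pcs before fitting lm.
> set.seed(2)
> X <- matrix(rnorm(300), ncol = 3)
> X[, 2] <- X[, 1] + 0.1 * X[, 2]    # make two covariates correlated
> y <- drop(X %*% c(1, 1, 0.5)) + rnorm(100)
> pcs <- predict(prcomp(X, center = TRUE, scale = TRUE))[, 1:2]
> summary(lm(y ~ pcs))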

Example 11.5: Wolff, Morrisey & Kelly [50] use principal components analysis followed by a regression to identify source areas of the fine particles and sulphates which are the primary components of summer haze in the Blue Ridge Mountains of Virginia, USA.

Example 11.6: Fritts [17] describes two techniques for examining the relationship between ring-width of conifers in western North America and climatic variables. The first technique is a multiple regression on the principal components of climate.

pca is sometimes used with cluster analysis (see Module 13) to classify climatological variables.

Example 11.7: Stone & Auliciems [42] use a combination of cluster analysis and pca to define phases of the Southern Oscillation Index (SOI).

Example 11.8: One use of principal components analysis is to extract principal components from a multivariate time series. Michaelsen [35] used this method (which he called frequency domain principal components analysis) on the movement of sea surface temperatures (SST) anomalies in the North Pacific, and found a low frequency SST field.


r function   Computation method   Matrix used
princomp     Eigen-analysis       correlation or covariance
prcomp       SVD*                 centre and/or scale

Table 11.1: Two methods for computing principal components in r. The stars indicate the preferred option. SVD stands for ‘singular-value decomposition’. princomp uses the less-preferred eigenvalue-based analysis (for compatibility with programs such as S-Plus). The functions use different methods of specifying the matrix on which to base the computations: using center=TRUE and scale=TRUE in prcomp is equivalent to using cor=TRUE in princomp. (The default for prcomp is center=TRUE, scale=FALSE; the default for princomp is cor=FALSE (that is, use the covariance matrix).)

11.4 Using R

r can be used to find principal components; confusingly, two different methods exist; Table 11.1 compares the methods. In general, the function prcomp will be used here.

The next example continues on from Example 11.1 and uses a very small data matrix to show how the calculations done by hand can be compared to those performed in r.

Example 11.9:

Refer to Example 11.1, where the data are plotted in Fig. 11.2 (a). How can this analysis be done in r?

Of course, tasks such as multiplying matrices and computing the eigenvalues can be done in r (using the commands %*% and eigen respectively). First, define the data matrix (and then centre it also):

> testd <- matrix(byrow = TRUE, nrow = 3,

+ data = c(1, 0, 1, 1, 4, 2))

> means <- colMeans(testd)

> means <- c(1, 1, 1) %o% means

> ctestd <- testd - means

Some of the matrices we used can be defined also:

> XtX <- t(ctestd) %*% ctestd

> P <- XtX/length(testd[, 1])


> st.devs <- sqrt(diag(P))

> cstestd <- testd

> cstestd[, 1] <- ctestd[, 1]/st.devs[1]

> cstestd[, 2] <- ctestd[, 2]/st.devs[2]

> cormat <- cor(ctestd)

> # D.power is D^(-1/2); then R = D^(-1/2) P D^(-1/2)
> D.power <- diag(1/st.devs)
> cormat2 <- D.power %*% P %*% D.power

> es <- eigen(cormat)

> es

$values
[1] 1.8660254 0.1339746

$vectors
          [,1]       [,2]
[1,] 0.7071068  0.7071068
[2,] 0.7071068 -0.7071068

These results agree with those in Example 11.1. But of course, r can compute principal components without us having to resort to matrix multiplication and finding eigenvalues.

> p <- prcomp(testd, center = TRUE, scale = TRUE)

> names(p)

[1] "sdev" "rotation" "center" "scale"[5] "x"

Specifying center=TRUE and scale=TRUE instructs r to use the correlation matrix to find the pcs. The standard deviations of the pcs are

> p$sdev

[1] 1.3660254 0.3660254

> p$sdev^2

[1] 1.8660254 0.1339746

Likewise, the centres (means) of each variable are found using p$center (but aren’t shown here). The eigenvectors are in the columns of:

> p$rotation


           PC1        PC2
[1,] 0.7071068  0.7071068
[2,] 0.7071068 -0.7071068

A screeplot can be produced using

> screeplot(p)

or just

> plot(p)

but is not shown here. The proportion of the variance explained by each pc is found using summary:

> summary(p)

Importance of components:
                         PC1   PC2
Standard deviation     1.366 0.366
Proportion of Variance 0.933 0.067
Cumulative Proportion  0.933 1.000

The eigenvalues are given by

> p$sdev^2

[1] 1.8660254 0.1339746

The new scores, called the principal components or pcs (and called Y earlier), can be found using

> predict(p)

            PC1        PC2
[1,] -1.1153551  0.2988585
[2,] -0.4082483 -0.4082483
[3,]  1.5236034  0.1093898

This example was to show you how to perform a pca by hand, and how to find those bits-and-pieces in the r output. Notice that once the correlation matrix has been found, the analysis proceeds without knowledge of anything else. Hence, given only a correlation matrix, pca can be performed. (Note r requires a data matrix for use in prcomp; to use only a correlation matrix, you must use eigen and so on.)

Commonly, a small number of the pcs are chosen for further analysis; these can be extracted as follows (where the first two pcs here are extracted as an example):

> p.pcs <- predict(p)[, 1:2]

The next example is more practical.

Example 11.10: Consider the sparrow example used by Manly in Example 6.1. (While not climatological, it will demonstrate how to do equivalent analyses in r.) We use the correlation matrix since the variables are dissimilar.

First, load the data

> sp <- read.table("sparrows.txt", header = TRUE)

It is then interesting to examine the correlations between the variables:

> cor(sp)

           Length    Extent      Head   Humerus   Sternum
Length  1.0000000 0.7349642 0.6618119 0.6269482 0.6051247
Extent  0.7349642 1.0000000 0.6737411 0.7621451 0.5290138
Head    0.6618119 0.6737411 1.0000000 0.7184943 0.5262701
Humerus 0.6269482 0.7621451 0.7184943 1.0000000 0.5787743
Sternum 0.6051247 0.5290138 0.5262701 0.5787743 1.0000000

There are many high correlations, so it may be possible to reduce the number of variables and retain most of the information. That is, a pca may be useful. The following code analyses the data:

> sp <- read.table("sparrows.txt", header = TRUE)

> sp.prcomp <- prcomp(sp, center = TRUE,

+ scale = TRUE)

> names(sp.prcomp)


[1] "sdev" "rotation" "center" "scale"[5] "x"

The command prcomp returns numerous variables, as can be seen. The table at the bottom of Manly, p 81 is found as follows:

> sp.prcomp$rotation

              PC1         PC2        PC3
Length  0.4548793 -0.06760175  0.7340681
Extent  0.4662631  0.30512343  0.2671031
Head    0.4494628  0.29277283 -0.3470235
Humerus 0.4635108  0.22746613 -0.4772988
Sternum 0.3985280 -0.87457014 -0.2038638

                PC4        PC5
Length   0.23424318  0.4413490
Extent  -0.47737764 -0.6247119
Head     0.73389847 -0.2307272
Humerus -0.41989524  0.5738386
Sternum -0.04818454 -0.1800565

Can these pcs be interpreted? The first pc is almost equally loaded for each variable; it therefore measures the general size of the bird. The second pc is highly loaded with the sternum length, not very loaded with length, and equally loaded for the rest. It is not easy to interpret, but perhaps is a measure of sternum length. The third pc has a high loading for length; perhaps it is a length pc. The fourth is a measure of head size; the fifth the contrast between extent and humerus (since these two variables are loaded with different signs). As can be seen, some creativity may be necessary to develop meaningful interpretations!

The table above is equivalent to Table 6.3 in Manly, but information is transposed (try t(sp.prcomp$rotation)). The numbers are also slightly different, but certainly similar. The eigenvalues (variances of the pcs) in Manly’s Table 6.3 are found as follows:

> sp.prcomp$sdev^2

[1] 3.5762941 0.5355019 0.3788619 0.3273533
[5] 0.1819888

A screeplot is produced using screeplot:

> screeplot(sp.prcomp)

> screeplot(sp.prcomp, type = "lines")


Figure 11.3: Two different ways of presenting the screeplot for the sparrow data. In (a), the default screeplot; in (b), the more standard screeplot produced with the option type="lines".

The final plot is shown in Fig. 11.3. The first pc obviously is much larger than the rest, and easily accounts for most of the variation in the data.

Using the screeplot, you may decide to keep only one pc. Using the total variance rule, you may decide that three or four pcs are necessary:

> # sp.vars holds the proportion of variance explained by each pc
> sp.vars <- sp.prcomp$sdev^2/sum(sp.prcomp$sdev^2)
> summary(sp.vars)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.03640 0.06547 0.07577 0.20000 0.10710 0.71530

Using the above average pc rule would select only one pc:

> mean(sp.vars)

[1] 0.2

> sp.vars > mean(sp.vars)

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
  TRUE  FALSE  FALSE  FALSE  FALSE

The values of the pcs for each bird are found using (for the first 10 birds only)


> predict(sp.prcomp)[1:10]

 [1]  0.07836554 -2.16233078 -1.13609553
 [4] -2.29462019 -0.28519596  1.93013405
 [7] -1.03954232  0.44378025  2.70477182
[10]  0.19259851

Note that the first bird has a score of 0.07837 on the first pc, whereas the score is 0.064 in Manly. The scores on the second pc are very similar: 0.6166 (above) compared to 0.602 (Manly).

The first three pcs are extracted for further analysis using

> sp.pcs <- predict(sp.prcomp)[, 1:3]

11.5 Spatial PCA

One important use of pca in climatology is spatial pca, or field pca.

Activity 11.E: Read Wilks, Section 9.3.5.

As noted by Wilks, this is a very common use of pca. The idea is this: Data, such as rainfall, may be available for a large number of locations (usually called ‘stations’), usually over a long time period. pca can be used to find patterns over those locations.

11.5.1 A small example

Example 11.11: As a preliminary example, consider some rainfall data from selected rainfall stations in Australia, as shown in Table 11.2. Each column consists of 15 observations of the rainfall at each station. Thus, there are the equivalent of 10 variables with 15 repeated observations each. A pca can be performed to reduce the information contained in 10 stations to a smaller number. Notice that the 15 observations for each station constitute a time series.

> p <- prcomp(rain, center = TRUE, scale = TRUE)

> plot(p, main = "Small rainfall example")


Table 11.2: Monthly rainfall figures for ten stations in Australia. There are 15 observations for each station, given in order of time (the actual recording months are unknown; the source did not state).

                            Station number
      1       2       3       4       5       6       7       8      9      10
 1  111.70   30.80   78.70   58.60   30.60   63.60   53.40  15.90  27.60   72.60
 2   25.50    2.80   19.20    4.00    8.10    7.80   10.30   1.00   4.10   27.30
 3   82.90   47.50   98.90   65.20   73.50  117.00   95.60  37.50  93.40  139.90
 4  174.30   81.50  106.80   80.90   73.90  123.50  155.80  51.20  81.50  177.10
 5   77.70   22.00   48.90   56.20   67.10  113.00  256.40  38.30  65.60  253.30
 6  117.10   35.90  118.10   86.90   81.90   98.60   84.00  42.40  67.30  154.30
 7  111.20   52.70   69.10   56.80   27.20   51.60   76.00  16.30  50.40  191.50
 8  147.40  109.70  150.70  101.20  102.80  112.40   32.60  42.60  52.50   47.30
 9   66.50   29.00   41.70   22.60   50.60   73.10   92.80  26.40  36.00   80.10
10  107.70   37.70   77.00   52.80   27.60   34.80   16.20   7.60   5.50   12.20
11   26.70    6.10   16.20   11.90   14.20   34.80   32.60  18.00  28.70  118.30
12   92.40   25.70   45.50   58.00   22.20   32.30   35.70   8.80  13.80   37.80
13  157.00   63.00   79.20   70.10   45.70   66.80   76.00  14.40  16.30   71.50
14   20.80    4.10   12.50    7.90    7.40   11.70    9.30  14.80   6.60   19.40
15  137.20   38.10   82.40   59.70   27.60   58.00   45.30   5.00  34.30  108.40

The scree plot is shown in Fig. 11.4; it is not clear how many pcs should be retained. We shall select three for the purpose of this example; three is not unreasonable as they account for over 90% of the variation in the data (see line 11 of the output).

There are a few important points to note:

(a) In practice, there are often hundreds of stations with available data, and over a hundred years’ worth of rainfall data for most stations. This creates huge data files that, in practice, take large amounts of computing power to analyse.

(b) If latitudes and longitudes of the stations are known, contour maps can be drawn of the principal components over a map of Australia (see the next example).

(c) Each pc is a vector of length 15 and is also a time series. These can be plotted as time series (see Fig. 11.5, and the sketch after this example) and even analysed as a time series using the techniques previously studied. This analysis can detect time trends in the pcs.

In this small example, the time trends of the ten stations have been reduced to time trends of three new variables that capture the important information carried by all ten.
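A sketch of point (c), using the prcomp object p from above:

> # A sketch: plot the first three pcs as time series
> pcs <- predict(p)[, 1:3]
> matplot(pcs, type = "l", xlab = "Time", ylab = "PCs")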


Figure 11.4: The scree plot for the pca of the small rainfall example.

Figure 11.5: The pcs plotted over time for the small rainfall example.


Figure 11.6: The scree plot for the full rainfall example.

11.5.2 A larger example

Example 11.12: Using the larger data file from which the data in the previous example came, a more thorough pca can be performed. This analysis was over 1188 time points for 52 stations. The data matrix has 1188 × 52 = 61 776 entries; this needs a lot of storage in the computer, and a lot of memory for performing operations such as matrix multiplication and matrix inversions. The scree plot is shown in Fig. 11.6. Plotting the first pc over a map of Australia gives Fig. 11.7 (a). The second pc has been plotted over a map of Australia in Fig. 11.7 (b).

This time, the first three pcs account for about 57% of the total variation. Notice that even with 52 stations, the contours are jagged; they could, of course, be smoothed.

It requires special methods to handle data files of this size. The code used to generate these pictures is given below. Be aware that you probably cannot run this code as it requires installing r libraries that you probably do not have by default (but can perhaps be installed; see Appendix A). The huge data files necessary are in a format called netCDF, and a special library is required to read these files.


Figure 11.7: The first two pcs plotted over a map of Australia.


> library(oz)

> library(ncdf)

> set.datadir()

> d <- open.ncdf("./pca/oz-rain.nc")

> rawrain <- get.var.ncdf(d, "RAIN")

> missing <- attr(rawrain, "missing_value")

> rawrain[rawrain == missing] <- NA

> set.docdir()

> longs <- get.var.ncdf(d, "LONGITUDE79_90")

> nx <- length(longs)

> lats <- get.var.ncdf(d, "LATITUDE19_33")

> ny <- length(lats)

> times <- get.var.ncdf(d, "TIME")

> ntime <- length(times)

> # Reshape the 3-D (longitude x latitude x time) array into a
> # (time x grid-cell) matrix, keeping only cells with complete records
> rain <- matrix(0, ntime, nx * ny)
> for (ix in (1:nx)) {
+     for (iy in (1:ny)) {
+         idx <- (iy - 1) * nx + ix
+         t <- rawrain[ix, iy, 1:ntime]
+         if (length(na.omit(t)) == ntime) {
+             rain[, idx] <- t
+         }
+     }
+ }
> # Drop the all-zero columns (cells with incomplete records)
> pc.rain <- rain[, colSums(rain) > 0]

> p1 <- prcomp(pc.rain, center = TRUE, scale = TRUE)

> plot(p1$rotation, type = "b", main = "Full rainfall example",

+ ylab = "Eigenvalues")

> par(mfrow = c(2, 1))

> oz(add = TRUE, lwd = 2)

> oz(add = TRUE, lwd = 2)

The gaps in the plots are because there is so little data in those remote parts of Australia, and rainfall is scarce there anyway. Note the pcs are deduced from the correlations, so the contours are for small and sometimes negative numbers, not rainfall amounts.

11.6 Rotation of PCs

One controversial topic is the rotation of principal components, which we briefly discuss here.


One constraint on the pcs is they must be orthogonal, which some authors argue limits how well they can be interpreted. If the physical interpretation of the pcs is more important than data reduction, some authors argue that the orthogonality constraint should be relaxed to allow better interpretation (see, for example, Richman [38]). This is called rotation of the pcs. Many methods exist for rotation of the pcs.

However, there are many arguments against rotation of pcs (see, for example, Basilevsky [8]). Accordingly, r does not explicitly allow for pcs to be rotated, but it can be accomplished using functions designed to be used in factor analysis (where rotations are probably the norm rather than the exception). We will not discuss this topic any further, except to note two issues:

1. Rotation is discussed further in Chapter 12 on factor analysis, where it is more appropriate;

2. The purpose of rotation of the pcs appears to generally be to ‘cluster’ the pcs together. This can be accomplished using a cluster analysis (see Chapter 13).

11.7 Exercises

Ex. 11.13: Consider the following data:

$$X = \begin{bmatrix} 3 & 3 \\ 3 & 4 \\ 1 & 3 \\ 1 & 6 \end{bmatrix}.$$

(a) Perform a pca ‘by hand’ using the correlation matrix (follow Example 11.1 or Example 11.9). (Don’t use prcomp or similar functions; you may use r to do the matrix multiplication and so on for you.)

(b) Perform a pca ‘by hand’, but using the covariance matrix.

(c) Compare and comment on the two strategies.

Ex. 11.14: Consider the following data:

$$X = \begin{bmatrix} 1 & 2 \\ 0 & 3 \\ 3 & 5 \\ 4 & 6 \end{bmatrix}.$$


(a) Perform a pca ‘by hand’ using the correlation matrix (follow Example 11.1 or Example 11.9). (Don’t use prcomp or similar functions; you may use r to do the matrix multiplication and so on for you.)

(b) Perform a pca ‘by hand’, but using the covariance matrix.

(c) Compare and comment on the two strategies.

Ex. 11.15: Consider the correlation matrix

$$R = \begin{bmatrix} 1 & 0.6 \\ 0.6 & 1 \end{bmatrix}.$$

Perform a pca using the correlation matrix. Define the new variables, and explain how many new pcs are necessary.

Ex. 11.16: Consider the correlation matrix

$$R = \begin{bmatrix} 1 & r \\ r & 1 \end{bmatrix}.$$

(a) Perform a pca using the correlation matrix and show it always produces new axes at 45◦ to the original axes.

(b) Explain what happens in the pca for r = 0, r = 0.25, r = 0.5 and r = 1.

Ex. 11.17: The data file toowoomba.dat contains (among other things) the daily rainfall, maximum and minimum temperatures at Toowoomba from 1 January 1889 to 21 July 2002 (a total of 41474 observations on three variables). Perform a pca. How many pcs are necessary to summarize the data?

Ex. 11.18: Consider again the air quality data from 41 cities in the USA, as seen in Example 10.1. For each city, seven variables have been measured. The first is the concentration of SO2 in micrograms per cubic metre; the other six are potential identifiers of pollution problems. The original source treats the concentration of SO2 as a response variable, and the other six as covariates.

(a) Examine the correlation matrix; what variables are highly correlated?

(b) Produce a star plot of the data, and comment.

(c) Is it possible to reduce these six covariates to a smaller number, without losing much information? Use a pca to perform a data reduction.


(d) Should a correlation or covariance matrix be used for the pca? Explain your answer.

(e) Examine the loadings; is there any sensible interpretation?

(f)

Ex. 11.19: Consider the example in Sect. 11.5.2. If you can load the appropriate libraries, try the same steps in that example but for the data in oz-slp.nc.

Ex. 11.20: The data file emerald.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure deficit (in hPa) at Emerald from 1 January 1889 to 15 September 2002 (a total of 41530 observations on six variables). Perform a pca. How many pcs are necessary to summarize the data?

Ex. 11.21: The data file gatton.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure deficit (in hPa) at Gatton from 1 January 1889 to 15 September 2002 (a total of 41530 observations on six variables).

(a) Perform a pca using the covariance matrix.

(b) Perform a pca using the correlation matrix. Compare to the previous pca. Which would you choose: a pca based on the covariance or the correlation matrix? Explain.

(c) How many pcs are necessary to summarize the data? Explain.

(d) If possible, interpret the pcs.

(e) Take the first pc; perform a quick time series analysis on this pc. (Don’t attempt necessarily to find an ‘optimal’ model; doing so will be time consuming because of the amount of data, and may be difficult also. Just plot an ACF, PACF and suggest a model based on those.)

Ex. 11.22: The data file strainfall.dat contains the average monthly and annual rainfall (in tenths of mm) for 363 Australian rainfall stations.

(a) Perform a pca on the monthly averages (and not the annual average) using the correlation matrix. How many pcs seem necessary?

(b) Perform a pca on the monthly averages (and not the annual average) using the covariance matrix. How many pcs seem necessary?

(c) Which pca would you prefer? Why?

(d) Select the first two pcs. Confirm that they are uncorrelated.


Ex. 11.23: The data file jondaryn.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure deficit (in hPa) at Jondaryn from 1 January 1889 to 15 September 2002 (a total of 41474 observations on six variables). Perform a pca. How many pcs are necessary to summarize the data?

Ex. 11.24: The data file wind_ca.dat contains numerous weather and wind measurements from Canberra during 1989.

(a) Explain why it is best to use the correlation matrix for this data.

(b) Perform a pca using the correlation matrix.

(c) How many pcs are necessary to summarize the data? Explain.

(d) If possible, interpret the pcs.

(e) Perform a time series analysis on the first pc.

Ex. 11.25: The data file wind_wp.dat contains numerous weather and wind measurements from Wilson's Promontory, Victoria (the most southerly point of mainland Australia) during 1989.

(a) Explain why it is best to use the correlation matrix for this data.

(b) Perform a pca using the correlation matrix.

(c) How many pcs are necessary to summarize the data? Explain.

(d) If possible, interpret the pcs.

(e) Explain why a time series analysis on, say, the first pc cannot be done here. (Hint: Read the help about the data.)

Ex. 11.26: The data file qldweather.dat contains six weather-related variables for 20 Queensland cities.

(a) Perform a pca using the correlation matrix. How many pcs seem necessary?

(b) Perform a pca using the covariance matrix. How many pcs seem necessary?

(c) Which pca would you prefer? Why?

(d) Select the first three pcs. Confirm that they are uncorrelated.

Ex. 11.27: This question concerns a data set that is not climatological, but you may find interesting. The data file chocolates.dat, available from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/popular/chocolates.html, contains measurements of the price, weight and nutritional information for 17 chocolates commonly available in Queensland stores. The data was gathered in April 2002 in Brisbane.


(a) Would it be best to use the correlation or covariance matrix for the pca? Explain.

(b) Perform this pca using the nutritional information.

(c) How many pcs are useful?

(d) If possible, give an interpretation for the pcs.

11.7.1 Answers to selected Exercises

11.13 First, use the correlation matrix.

> testd <- matrix(byrow = TRUE, nrow = 4,
+     data = c(3, 3, 3, 4, 1, 3, 1, 6))
> means <- colMeans(testd)
> ctestd <- sweep(testd, 2, means)         # centre each column on its mean
> XtX <- t(ctestd) %*% ctestd
> P <- XtX/nrow(testd)                     # covariance matrix (divisor n)
> st.devs <- sqrt(diag(P))
> cstestd <- ctestd                        # centred and scaled data
> cstestd[, 1] <- ctestd[, 1]/st.devs[1]
> cstestd[, 2] <- ctestd[, 2]/st.devs[2]
> cormat <- cor(ctestd)
> D.power <- diag(1/st.devs)
> cormat2 <- t(D.power) %*% P %*% D.power  # correlation matrix from P
> es <- eigen(cormat)
> es

$values
[1] 1.5 0.5

$vectors
           [,1]      [,2]
[1,]  0.7071068 0.7071068
[2,] -0.7071068 0.7071068

Using the covariance matrix:

> es <- eigen(cov(testd))

> es

$values
[1] 2.4120227 0.9213107

$vectors
           [,1]      [,2]
[1,]  0.5257311 0.8506508
[2,] -0.8506508 0.5257311

As expected, the eigenvalues and pcs are different.

11.18 Here is some r code:

> us <- read.table("usair.dat", header = TRUE,

+ row.names = 1)

> us.pca <- prcomp(us[, 2:7], center = TRUE,

+ scale = TRUE)

> plot(us.pca, main = "Screeplot for US air data")

How many pcs should be selected? The screeplot is shown in Fig. 11.8, from which three or four might be selected. The variances of the components are

> summary(us.pca)

Importance of components:
                          PC1   PC2   PC3   PC4    PC5     PC6
Standard deviation      1.482 1.225 1.181 0.872 0.3385 0.18560
Proportion of Variance  0.366 0.250 0.232 0.127 0.0191 0.00574
Cumulative Proportion   0.366 0.616 0.848 0.975 0.9943 1.00000

Perhaps three pcs are appropriate. The first three account for almost 85% of the total variance. It would also be possible to choose four pcs, but with six original variables, this isn't a large reduction. The loadings are:

                       PC1        PC2         PC3         PC4         PC5         PC6
temp           -0.32964613  0.1275974 -0.67168611 -0.30645728  0.55805638 -0.13618780
manufac         0.61154243  0.1680577 -0.27288633  0.13684076 -0.10204211 -0.70297051
population      0.57782195  0.2224533 -0.35037413  0.07248126  0.07806551  0.69464131
wind.speed      0.35383877 -0.1307915  0.29725334 -0.86942583  0.11326688 -0.02452501
annual.precip  -0.04080701 -0.6228578 -0.50456294 -0.17114826 -0.56818342  0.06062222
days.precip     0.23791593 -0.7077653  0.09308852  0.31130693  0.58000387 -0.02196062


Figure 11.8: The scree plot for the US air data.


Is there a sensible interpretation for these pcs? The first pc contrasts temperature against the other variables (apart from annual precipitation, whose loading is near zero): the loading for temperature has the opposite sign to the loadings for the other variables. It is hard to see any intelligent purpose in such a pc. Likewise, interpretations for the next two pcs are difficult to determine.
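A biplot can also help when searching for interpretations, since it displays the loadings and the observations together. This is a minimal sketch only (not part of the original answer), re-using the us.pca object created above; biplot is a standard r function:

> biplot(us.pca, main = "Biplot for US air data")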


Module 12

Factor Analysis

Module contents
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 256
12.2 The Procedure . . . . . . . . . . . . . . . . . . . . . . . . 257
     12.2.1 Path model . . . . . . . . . . . . . . . . . . . . . 258
     12.2.2 Steps in a fa . . . . . . . . . . . . . . . . . . . . 260
12.3 Factor rotation . . . . . . . . . . . . . . . . . . . . . . . 262
     12.3.1 Methods of factor rotation . . . . . . . . . . . . . . 262
12.4 Interpretation of factors . . . . . . . . . . . . . . . . . . 263
12.5 The differences between pca and fa . . . . . . . . . . . . . 266
12.6 Principal components factor analysis . . . . . . . . . . . . 267
12.7 How many factors to choose? . . . . . . . . . . . . . . . . . 268
12.8 Using r . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
12.9 Concluding comments . . . . . . . . . . . . . . . . . . . . . 274
12.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 274
     12.10.1 Answers to selected Exercises . . . . . . . . . . . . 277

Module objectives

Upon completion of this module students should be able to:


• understand the principles underlying factor analysis;

• give a geometric interpretation of the factors used, where possible;

• perform a factor analysis from given data using r;

• select an appropriate number of factors using suitable techniques.

12.1 Introduction

Factor analysis is a data reduction technique very similar to pca. Indeed, many students find it hard to see the differences between the two methods; see Sect. 12.5 for a discussion of this issue.

Activity 12.A: Read Manly, Sect. 7.1.

Factor analysis refers to a variety of statistical techniques whose common objective is to represent a set of variables in terms of a smaller number of hypothetical variables or factors. pca is therefore an example of a factor analysis. Usually, however, factor analysis refers to so-called common factor analysis, which is considered here.

In general, the first step is an examination of the interrelationships between the variables. Usually correlation coefficients are used as a measure of the association between variables. Inspection of the correlation matrix may reveal relationships within some subsets of variables, and that these correlations are higher than those between subsets. Factor analysis explains these observed correlations by postulating the existence of a small number of hypothetical variables or factors which are causing the observed correlations.

It can be argued that, ignoring sampling errors, a causal system of factors will lead to a unique correlation system of observed variables. However, the reverse is not true. Only under very limiting conditions can one unequivocally determine the underlying causal structure from the correlational structure. In practice, only a correlational structure is presented. The construction of a causal system of factors from this structure relies as much on judgement, knowledge of the system under investigation and interpretation of the analysis as on mathematics.

At one extreme, the researcher may not have any idea as to how many underlying factors exist. Then fa is an exploratory technique aiming at ascertaining the minimum number of hypothetical factors that can account for the observed covariation. The majority of applications of this type are in the social sciences.


fa may also be used as a means of testing specific hypotheses. A researcher with a considerable depth of knowledge of an area may hypothesize two different underlying dimensions or factors, and that certain variables belong to one dimension while others belong to the second. If fa is used to test this expectation, then it is used as a means of confirming a certain hypothesis, not as a means of exploring underlying dimensions. Thus, it is referred to as confirmatory factor analysis.

The idea of having underlying, but unobservable, factors may sound odd. But consider an example: annual taxable income, number of cars owned, value of home, and occupation may all be measures of observable socioeconomic status indicators. Likewise, heart rate, muscle strength, blood pressure and hours of exercise per week may all be measurements of fitness. The observable measurements are all aspects of the underlying factor called 'fitness'. In both cases, the true, underlying variable of interest ('socioeconomic status' and 'fitness') is hard to measure directly, but can be measured using the observed variables given.

12.2 The Procedure

Activity 12.B: Read Manly, Sect 7.2 and 7.3.

Factor analysis (fa), like pca, is a data reduction technique. fa and pca are very similar, and indeed some computer programs and texts barely distinguish between them. However, there are certainly differences. As with pca, the analysis starts with n observations on p variables. These p variables are assumed to have a common set of m factors underlying them; the role of fa is to identify these factors.

Mathematically, the p variables are

    X1 = a11F1 + a12F2 + · · · + a1mFm + e1
    X2 = a21F1 + a22F2 + · · · + a2mFm + e2
    ...
    Xp = ap1F1 + ap2F2 + · · · + apmFm + ep        (12.1)

where Fj are the underlying factors common to all the variables Xi, aij are called factor loadings, and the ei are the parts of each variable unique to that variable. In matrix notation,

    x = Λf + e,        (12.2)


where the factor loadings are in the matrix Λ. In general, the Xi are standardized to have mean zero and variance one. Likewise, the factors Fj are assumed to have mean zero and variance one, and are independent of ei. The factor loadings aij are assumed constant. Under these assumptions,

    var[Xi] = 1 = ai1² + ai2² + · · · + aim² + var[ei].

Hence, the observed variance in Xi is due to two components:

1. The effect of the common factors Fj, through the constants aij. Hence, the quantity ai1² + ai2² + · · · + aim² is called the communality for Xi.

2. The effect of the component specific to Xi, through var[ei]. Hence var[ei] is called the specificity or uniqueness of Xi. This can also be seen as the error variance.

12.2.1 Path model

The relationship between (observed) variables and factors is often displayed using a path model. For example, consider the (unlikely) situation where there are three observed variables, X1, X2 and X3, and two factors F1 and F2. Suppose further that the factor loadings aij in Eq. (12.1) are known. Then a path model can be constructed which is consistent with the original data:

[Path diagram: arrows run from each of the factors F1 and F2 to each of the observed variables X1, X2 and X3, labelled with the loadings a11, a21, a31 (from F1) and a12, a22, a32 (from F2); each Xi also receives an arrow from its own error term ei.]

Using properties of expectations and covariances, the original variances of the Xi (which are 1, recall) can be recovered.


Example 12.1: Consider a (hypothetical) example where three variables are observed on a number of fit men: X1 is the number of hours of exercise performed each week; X2 is the time taken to run 10km; and X3 is the time taken to sprint 100m. The correlation matrix is

    [ 1     0.64  0.51 ]
    [ 0.64  1     0.27 ]
    [ 0.51  0.27  1    ].

One possible allocation of the factors is shown below.

[Path diagram: F1 → X1 labelled 0.6, F1 → X2 labelled 0.9, F1 → X3 labelled 0.1; F2 → X1 labelled 0.5, F2 → X2 labelled 0.2, F2 → X3 labelled 0.9; the error arrows into X1, X2 and X3 are labelled 0.38, 0.15 and 0.18 respectively.]

Note that, for example,

    Covar[X1, X2] = Covar[0.6F1 + 0.5F2, 0.9F1 + 0.2F2]
                  = Covar[0.6F1 + 0.5F2, 0.9F1] + Covar[0.6F1 + 0.5F2, 0.2F2]
                  = Covar[0.6F1, 0.9F1] + Covar[0.5F2, 0.9F1] + Covar[0.6F1, 0.2F2] + Covar[0.5F2, 0.2F2]
                  = 0.54 Covar[F1, F1] + 0 + 0 + 0.1 Covar[F2, F2]
                  = 0.64,

as in the original correlation matrix. The specific variances (uniquenesses) are var[e1] = 0.38, var[e2] = 0.15 and var[e3] = 0.18. At this stage, we are assuming F1 and F2 are orthogonal, so Covar[F1, F2] = 0. (Recall var[Fi] = Covar[Fi, Fi] = 1 and var[Xi] = 1.) In addition,

    var[X1] = var[0.6F1] + var[0.5F2] + var[e1]
            = 0.36 + 0.25 + 0.38 ≈ 1

as required. Thus this path model represents one possible allocation of the factors; there are, however, others possible. Often, the relationships between the factors and the observable variables are given in a table:


          F1    F2
    X1   0.6   0.5
    X2   0.9   0.2
    X3   0.1   0.9

Is there a sensible interpretation of the factors? F1 is strongly related to the time to run 10km, and also to the hours of exercise per week; perhaps this factor could be interpreted as measuring stamina. The second factor is highly related to the time to sprint 100m, and the hours of exercise per week; perhaps this factor could be interpreted as measuring strength.

Written using the matrix notation of Eq. (12.2), x = Λf + e:

    [ X1 ]   [ 0.6  0.5 ]          [ e1 ]
    [ X2 ] = [ 0.9  0.2 ] [ F1 ] + [ e2 ]
    [ X3 ]   [ 0.1  0.9 ] [ F2 ]   [ e3 ],

where

        [ 0.6  0.5 ]
    Λ = [ 0.9  0.2 ]
        [ 0.1  0.9 ].
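The path model can also be checked numerically: under the model (with orthogonal factors), the correlation matrix should be reproduced by ΛΛᵀ plus the diagonal matrix of specific variances. The following is a minimal r sketch of this check; the object names are our own:

> Lambda <- matrix(c(0.6, 0.5,
+     0.9, 0.2,
+     0.1, 0.9), nrow = 3, byrow = TRUE)
> Psi <- diag(c(0.38, 0.15, 0.18))   # specific variances var[e1], var[e2], var[e3]
> Lambda %*% t(Lambda) + Psi         # reproduces the correlation matrix (approximately)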

12.2.2 Steps in a FA

fa has three steps:

1. Find some provisional factor loadings. Commonly, this is done using a pca. Since the number of underlying factors is often unknown, m pcs are chosen to become the m underlying factors. Since these factors are actually pcs, they are uncorrelated. However, the choice of factors F1, F2, . . . , Fm is not unique. Any linear combination of these is also a valid choice for the factors. That is,

       F′1 = d11F1 + d12F2 + · · · + d1mFm
       F′2 = d21F1 + d22F2 + · · · + d2mFm
       ...
       F′m = dm1F1 + dm2F2 + · · · + dmmFm

   are also valid factors. The original factor loadings Λ are effectively replaced by ΛT for some rotation matrix T.


2. The second step involves selecting a linear combination of the factors to help interpretation; that is, computing the dij above. This step is called rotation. There are two types of rotation:

   (a) Orthogonal: With this type of rotation, the factors remain orthogonal. A common example is the varimax rotation. This method maximizes Σij (dij − d.j)², where d.j is the mean over i of the dij.

       A transformation y = Ax is orthogonal if the transformation matrix A is orthogonal; a square matrix A is orthogonal if and only if its column vectors (say, a1, a2, . . . , an) form an orthonormal set; that is,

           aiᵀaj = 0 if i ≠ j, and aiᵀaj = 1 if i = j.

       For example, the matrix

           P = [ 0.9397  −0.3420 ]
               [ 0.3420   0.9397 ]

       is orthogonal. First, write a1 = [0.9397, 0.3420]ᵀ and a2 = [−0.3420, 0.9397]ᵀ. Then a1ᵀa1 = 0.9397² + 0.3420² = 1 and a2ᵀa2 = (−0.3420)² + 0.9397² = 1; also, a1ᵀa2 = (0.9397 × −0.3420) + (0.3420 × 0.9397) = 0. Thus, a transformation based on matrix P is an orthogonal transformation. (In fact, it represents a rotation of −20°; the orthogonality can also be verified numerically, as in the sketch after this list.)

   (b) Oblique: The factors do not have to remain orthogonal with this type of rotation. The promax rotation is an example. This procedure tends to increase large loadings in magnitude relative to small loadings.

3. The third step is to compute the factor scores; that is, how much of each variable is explained by each factor. This leads to interpretations of the factors. To make interpretation easier, a good rotation should produce factor loadings so that some are close to one, and the others close to zero.
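The orthogonality condition in step 2(a) is easy to check numerically. Here is a minimal r sketch for the matrix P above (the object name is ours):

> P <- matrix(c(0.9397, 0.3420, -0.3420, 0.9397), nrow = 2)
> round(t(P) %*% P, 4)   # the 2 x 2 identity matrix, so P is orthogonal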

Some points to note:

• pca is often the first step in a factor analysis;

• factor analysis, like pca, is based on eigenvalues;

• many types of rotation may be performed. The software package S-Plus (which is very similar to r) implements twelve different criteria (Venables & Ripley [46, p 409]). The varimax method is probably the most popular.


12.3 Factor rotation

In general, with two or more common factors, the initial factor solution may be converted to another equally valid solution with the same number of factors by an orthogonal rotation. Such a rotation preserves the correlations and communalities amongst variables, but of course changes the loadings or correlations between the original variables and the factors. Recalling that the initial factor solution may result in loadings which do not allow easy interpretation of the factors, rotation can be used to 'simplify' the loadings in the sense of enabling easier interpretation. The rotational process of factor analysis allows the researcher a degree of flexibility by presenting a multiplicity of views of the same data set. A parsimonious or simple structure can be obtained by following these guidelines:

1. Any column of the factor loadings matrix should have mostly small values, as close to zero as possible.

2. Any row of the matrix should have only a few entries far from zero.

3. Any two columns of the matrix should exhibit a different pattern of high and low loadings.

12.3.1 Methods of factor rotation

Orthogonal rotation, discussed above, preserves the orientation between the initial factors so that they are still perpendicular after rotation. In fact the initial factor axes can be rotated independently, giving factors which are not necessarily perpendicular to each other but still explain the reduced correlation matrix. This rotation technique is called oblique.

Orthogonal rotation methods enjoy some distinctive properties:

1. Factors remain uncorrelated.

2. The communality estimates are not affected, but the proportion of variability accounted for by a given factor will change as a result of the rotation.

3. Although the total amount of variance explained by the common factors won't change with orthogonal rotation, the percentage accounted for by an individual factor will, in general, be different.


The standard orthogonal rotation techniques are the varimax (which is in r), quartimax and equimax methods. They each aim to simplify the factor structure but in different ways. Varimax is the most popular and is usually used with pca extraction. It aims to create small, medium and large loadings within a particular factor. Quartimax aims, for each variable, to obtain one and only one major loading across the factors. Equimax attempts to simplify both the rows and the columns of the structure matrix.

Unfortunately, the use of orthogonal rotation techniques may not result in uncovering an easily interpretable set of factors. Also there is often no reason to believe that the hypothetical factors should be uncorrelated. Thus, it is possible to arrive at much more interpretable factors if oblique rotation is allowed.

The most popular oblique factor rotation methods are promax (which is in r), oblimax, quartimin, covarimin, biquartimin, and oblimin. Similar to orthogonal rotation methods, oblique methods are designed to satisfy various definitions of simple structure, and no algorithm is clearly superior to another. Oblique methods present complexities that don't exist for orthogonal methods. They include:

1. The factors are no longer uncorrelated, and hence the pattern and structure matrices will not in general be identical.

2. Communalities and variances accounted for are not invariant under oblique rotation.

For more information on some popular rotation techniques, see Kim and Mueller [30].

Example 12.2: Buell & Bundgaard [10] use factor analysis to represent wind soundings over Battery MacKenzie.

12.4 Interpretation of factors

It is often useful to find an interpretation for the resultant factors; rotation is usually performed to help with this. As with pca, finding interpretations is often quite an art, and sometimes any interpretation is difficult to find.

Sometimes using a different kind of rotation may help.


Example 12.3: Kalnicky [24] used factor analysis to classify the atmospheric circulation over the midlatitudes of the northern hemisphere from 1899–1969.

Example 12.4: Hannes [20] used rotated factors to explore the relationship between water temperatures measured at Blunt's Reef Light Ship and the air pressure at Eureka, California. The factor loadings indicated that the water temperatures measured at Trinidad Head and Blunt's Reef were quite different.

Example 12.5: Rogers [39] used factor analysis to find areal patterns of anomalous sea surface temperature (SST) over the eastern North Pacific, based on monthly SSTs, surface pressure and 1000–500 mb layer thickness over North America during 1960–1973.

Example 12.6: Consider Example 12.1. An orthogonal rotation can be used to rotate the matrix of factor loadings. For example (and this is probably not a practical example of a rotation, but serves to demonstrate the point), an orthogonal rotation could be achieved using the matrix

    T = [ √3/2  −1/2 ]
        [ 1/2   √3/2 ].        (12.3)

(Check that this transformation matrix is orthogonal!) Then, the factor loadings become

         [ 0.6  0.5 ]                        [ 0.77   0.13 ]
    ΛT = [ 0.9  0.2 ] × [ √3/2  −1/2 ]   ≈   [ 0.88  −0.28 ]
         [ 0.1  0.9 ]   [ 1/2   √3/2 ]       [ 0.54   0.73 ].

This allocation of factor loadings produces the following path diagram:


Figure 12.1: The effect on the cartesian plane of applying the orthogonal transform in matrix T in Eq. (12.3).

[Path diagram: F1 → X1 labelled 0.77, F1 → X2 labelled 0.88, F1 → X3 labelled 0.54; F2 → X1 labelled 0.13, F2 → X2 labelled −0.28, F2 → X3 labelled 0.73; the error arrows are labelled 0.38, 0.15 and 0.18 as before.]

Note that still

    Covar[X1, X2] = Covar[0.77F1 + 0.13F2, 0.88F1 − 0.28F2]
                  = (0.77 × 0.88) + (0.13 × −0.28)
                  ≈ 0.64.

It is not clear that this (arbitrary) rotation helps aid interpretation; it has been used merely to demonstrate the concepts. The transformation represents a rotation of −30° (Fig. 12.1).
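The rotated loadings above can be reproduced in r. This is a minimal sketch; Tmat is our own name, used instead of T because r uses T as shorthand for TRUE:

> Lambda <- matrix(c(0.6, 0.5, 0.9, 0.2, 0.1, 0.9),
+     nrow = 3, byrow = TRUE)
> Tmat <- matrix(c(sqrt(3)/2, 1/2, -1/2, sqrt(3)/2), nrow = 2)
> round(Lambda %*% Tmat, 2)   # the rotated loadings, as above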

Example 12.7: A non-orthogonal rotation for Example 12.1 can be obtained using the rotation matrix

    S = [  1.07   −0.288 ]
        [ −0.116   1.04  ].

Then, the factor loadings become

         [ 0.58     0.35  ]
    ΛS ≈ [ 0.94    −0.052 ]
         [ 0.0023   0.90  ].


Figure 12.2: The effect on the cartesian plane of applying the oblique transform in matrix S in Example 12.7.

With oblique rotations, matters become more complicated because now the factors are correlated. In a path diagram, this is indicated as shown below, where r is the correlation between the two factors.

[Path diagram: F1 → X1 labelled 0.58, F1 → X2 labelled 0.94, F1 → X3 labelled 0.0023; F2 → X1 labelled 0.35, F2 → X2 labelled −0.052, F2 → X3 labelled 0.90; error arrows labelled 0.38, 0.15 and 0.18; a double-headed arrow labelled r joins F1 and F2.]
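The oblique loadings can be checked in the same way; a minimal sketch, re-using Lambda from the sketch after Example 12.6:

> S <- matrix(c(1.07, -0.116, -0.288, 1.04), nrow = 2)
> round(Lambda %*% S, 2)   # approximately the obliquely rotated loadings above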

12.5 The differences between PCA and FA

pca and factor analysis are similar methods, which is often a source of confusion for students. This section lists some of the differences (see also Mardia, Kent & Bibby [34, §9.8]).

1. As seen above, a pca is often a first step in a factor analysis.

2. There is an essential difference between the two analyses. In pca, the hypothetical new variables (the principal components) are defined as linear combinations of the observed variables. In factor analysis, it is the other way around: the observed variables are conceptualized as being linear composites of some unobserved variables or factors.


3. In pca, the major objective is to select a number of components that explain as much of the total variance as possible. The values of the principal components for an individual are usually relatively simple to compute and interpret. In contrast, the factors obtained in factor analysis are selected mainly to explain the interrelationships between the original variables.

4. In pca, computations are started with the covariance matrix or the correlation matrix. In factor analysis, computations often begin with a reduced correlation matrix, a matrix in which the 1's on the main diagonal are replaced by communalities. These are further explained below.

5. In pca, the principal components are just a transformation of the original data, with no assumptions made about the form of the covariance matrix of the data. In factor analysis, a definite form is assumed.

12.6 Principal components factor analysis

In the previous section, differences between fa and pca were pointed out. However, pca can actually be used to assist in performing a fa. This is called principal components factor analysis, and uses a pca to perform the first step of the fa (note that this is not the only option), from which the next two steps can be done. This idea is presented in this section.

Begin with p original variables Xi for i = 1, . . . , p. Performing a pca will produce p pcs, Zi for i = 1, . . . , p. The pcs are defined as

    Z1 = b11X1 + b12X2 + · · · + b1pXp
    ...
    Zp = bp1X1 + bp2X2 + · · · + bppXp

where the bij are given by the eigenvectors of the correlation matrix. In matrix form, write Z = BX. Since B is a matrix of eigenvectors, B⁻¹ = Bᵀ, so also X = BᵀZ, or

    X1 = b11Z1 + b21Z2 + · · · + bp1Zp
    ...
    Xp = b1pZ1 + b2pZ2 + · · · + bppZp

Now in a factor analysis, we only keep m of the p factors; hence

    X1 = b11Z1 + b21Z2 + · · · + bm1Zm + e1
    ...
    Xp = b1pZ1 + b2pZ2 + · · · + bmpZm + ep


where the ei are unexplained components after omitting the last p − m pcs. In this equation, the bij are like factor loadings. But true factors have a variance of one; here, var[Zi] = λi since Zi is a pc. This means the Zi are not 'true' factors. Of course, the Zi can be rescaled to have a variance of one:

    X1 = (√λ1 b11)Z1/√λ1 + (√λ2 b21)Z2/√λ2 + · · · + (√λm bm1)Zm/√λm + e1
    ...
    Xp = (√λ1 b1p)Z1/√λ1 + (√λ2 b2p)Z2/√λ2 + · · · + (√λm bmp)Zm/√λm + ep

whence we can also write

    X1 = a11F1 + a12F2 + · · · + a1mFm + e1
    ...
    Xp = ap1F1 + ap2F2 + · · · + apmFm + ep,

where Fi = Zi/√λi and aij = bji√λi (note the subscripts carefully!). In matrix form,

    X = Λf + e.

A rotation can be performed by writing

    X = ΛT f + e

for an appropriate rotation matrix T.
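The first step of this procedure is straightforward to sketch in r. The following is a minimal sketch only, assuming a numeric data matrix dat (a hypothetical name) is already loaded; it computes the loadings aij = bji√λi directly from the eigendecomposition of the correlation matrix:

> R <- cor(dat)           # correlation matrix of the data
> es <- eigen(R)
> m <- 2                  # number of factors to keep
> A <- es$vectors[, 1:m] %*% diag(sqrt(es$values[1:m]), nrow = m)
> A                       # provisional factor loadings
> rowSums(A^2)            # communality of each variable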

12.7 How many factors to choose?

In pca, there were some guidelines for selecting the number of pcs. Similar guidelines also exist for factor analysis. r will not let you have too many factors; for example, if you try to extract three factors from four variables, you will be told this is too many.

As usual, there are two competing criteria: to have the simplest model possible, and to explain as much of the variation as possible.

There is no easy answer to how many factors should be chosen; this is one of the major criticisms of fa. Try to find a number of factors that explains as much variation as possible (using the communalities and uniquenesses), but is not too complicated, and preferably leads to a useful interpretation. The best method is probably to perform a pca, note the 'best' number of pcs, and then use this many factors in the fa.


Note also that choosing the number of factors is a separate issue to the rotation. The rotation will not alter the communalities or uniquenesses. The first step is therefore to decide on the number of factors using communalities and uniquenesses, and then try various rotations to find the best interpretation.

12.8 Using R

r can be used to perform factor analysis using the function factanal.

The help file for this r function states

The fit is done by optimizing the log likelihood assuming multivariate normality over the uniquenesses.

Actually doing this is beyond the scope of this course; we will just use r, trusting that the code gives sensible answers.

Example 12.8: Consider the European employment data used by Manly in Example 7.1. (While not climatological, it will demonstrate how to do equivalent analyses in r.) The following code analyses the data. First, Manly's Table 7.1 can be found directly (using cor), or via factanal. The factor analysis without rotation, shown in the middle of Manly p 101, can be obtained as follows:

> ee <- read.table("europe.txt", header = TRUE)

> cmat <- cor(ee)

> ee.fa4 <- factanal(ee, factors = 4, rotation = "none")

> print(ee.fa4$loadings, cutoff = 0)

Loadings:
     Factor1 Factor2 Factor3 Factor4
AGR  -0.961   0.178  -0.178   0.094
MIN   0.143   0.625  -0.410  -0.078
MAN   0.744   0.416  -0.102  -0.508
PS    0.582   0.576  -0.017   0.569
CON   0.449   0.034   0.376  -0.375
SER   0.601  -0.327   0.600   0.089
FIN   0.103  -0.121   0.631   0.228
SPS   0.697  -0.672  -0.138   0.196
TC    0.615  -0.121  -0.233   0.146


               Factor1 Factor2 Factor3 Factor4
SS loadings      3.274   1.516   1.184   0.858
Proportion Var   0.364   0.168   0.132   0.095
Cumulative Var   0.364   0.532   0.664   0.759

Notice that the values are not identical to those shown in Manly; there are numerous different algorithms for factor analysis, so this is of no concern. The help for the function factanal in r states

There are so many variations on factor analysis that it is hard to compare output from different programs. Further, the optimization in maximum likelihood factor analysis is hard, and many other examples we compared had less good fits than produced by this function.

The values are, however, similar. The signs are different, but this is of no consequence.

The results using the varimax rotation are obtained as follows:

> ee.fa4r <- factanal(ee, factors = 4, rotation = "varimax")

> print(ee.fa4r$loadings, cutoff = 0)

Loadings:
     Factor1 Factor2 Factor3 Factor4
AGR  -0.695  -0.633  -0.278  -0.185
MIN  -0.142   0.194  -0.546   0.479
MAN   0.199   0.882  -0.293   0.302
PS    0.205   0.086   0.084   0.969
CON   0.081   0.644   0.250  -0.033
SER   0.427   0.368   0.720   0.023
FIN  -0.022   0.041   0.686   0.055
SPS   0.972   0.051   0.197  -0.091
TC    0.614   0.160  -0.061   0.249

               Factor1 Factor2 Factor3 Factor4
SS loadings      2.097   1.803   1.563   1.368
Proportion Var   0.233   0.200   0.174   0.152
Cumulative Var   0.233   0.433   0.607   0.759

Again, the factors are not identical, but are similar.

The communalities are not produced by r; instead, the uniqueness is computed (these are called specificity in Manly). Simply, the variance of each (standardized) variable consists of two parts: the uniqueness plus the communality. The communalities represent the proportion of each variable that is shared with the other variables


through the common factors. The uniqueness is the proportion of the variance unique to each variable and not shared with the other variables. The communalities are computed in r as follows:

> 1 - ee.fa4$uniqueness

      AGR       MIN       MAN        PS       CON
0.9950000 0.5852025 0.9950000 0.9950000 0.4853668
      SER       FIN       SPS        TC
0.8366518 0.4758879 0.9950000 0.4682174

Again, while they are somewhat similar to those shown in Manly, they are not identical.

We now show how to extract the 'scores' from the factor analysis. In this example, the 'scores' represent how each country scores on each factor. First, we need to adjust the call to factanal by adding scores = "regression":

> ee.fa4scores <- factanal(ee, factors = 4,

+ scores = "regression")

> names(ee.fa4scores)

[1] "converged" "loadings" "uniquenesses"[4] "correlation" "criteria" "factors"[7] "dof" "method" "scores"[10] "STATISTIC" "PVAL" "n.obs"[13] "call"

> ee.fa4scores$scores

                    Factor1     Factor2
Belgium         0.735864909  0.39347899
Denmark         1.660414922 -0.66961020
France          0.196749209  0.34219028
W.Germany       0.367203694  1.17655067
Ireland         0.109519146 -1.15479327
Italy          -0.243011308  0.72606452
Luxemborg      -0.387309499  1.19266940
Netherlands     0.998482010 -0.43751930
UK              1.296604056 -0.10590831
Austria        -0.554916391  0.47914938
Finland         0.674141842 -0.53711442
Greece         -1.463580232 -0.79224633
Norway          0.910020084 -0.36175539
Portugal       -0.582448069 -0.01995971
Spain          -1.446012474  0.95896642
Sweden          1.826033446 -0.42669861
Switzerland    -0.904728288  2.13805967
Turkey         -1.041975726 -2.66833845
Bulgaria       -0.045481468  0.55490800
Czechoslovakia  0.000259302  0.62003588
E.Germany       0.668431037  1.24247646
Hungary         0.027698217 -0.79180745
Poland         -0.354297646 -0.48659964
Romania        -1.022644195  0.39156263
USSR            0.760473031 -0.50031050
Yugoslavia     -2.185489608 -1.26345072

                   Factor3     Factor4
Belgium         1.05675332 -0.29779175
Denmark         0.42843017 -1.16880321
France          0.76331126 -0.16036086
W.Germany      -0.55782348 -0.15329451
Ireland         0.78688797  1.07501857
Italy           0.52722797 -1.17426449
Luxemborg       0.95859219 -0.38523076
Netherlands     1.45523464 -0.04966847
UK              0.16428566  1.06316594
Austria         0.88978886  1.33940370
Finland         0.38459638  0.93802719
Greece          0.60623551 -0.51373635
Norway          1.10665730 -0.54306128
Portugal        0.05541367 -0.72558905
Spain           0.69160242 -0.40625354
Sweden         -0.11120997 -0.63396595
Switzerland     0.32945045 -0.32444541
Turkey         -1.07926842 -1.66123747
Bulgaria       -1.64948488 -0.73494006
Czechoslovakia -1.36338513  0.86309299
E.Germany      -1.65825030  0.96913085
Hungary        -0.72829821  2.83417294
Poland         -0.93180860  0.18018067
Romania        -1.52721874 -0.52834041
USSR           -1.32194122 -0.83876243
Yugoslavia      0.72422119  1.03755315

These scores may be used in further analysis. For example, a factor analysis (or pca) is often used to reduce the number of covariates used in a regression analysis. Suppose, in this example, the given variables were to be used in a regression analysis where the response variable is gross domestic product (GDP). (There is no such variable in the data, but this will demonstrate the idea.) In r, to perform the regression of GDP against the four factors identified above, use

> ee.lm <- lm(GDP ~ ee.fa4scores$scores)

To learn more about this regression fit, use

> summary(ee.lm)

> names(ee.lm)

Example 12.9: Consider Example 11.10, where a pca was performed on Manly's sparrow data. Here, a fa is conducted for comparison.

> sp <- read.table("sparrows.txt", header = TRUE)

> sp.fa.vm <- factanal(sp, factors = 2,

+ rotation = "varimax")

> loadings(sp.fa.vm)

Loadings:
        Factor1 Factor2
Length   0.370   0.926
Extent   0.659   0.530
Head     0.638   0.459
Humerus  0.901   0.317
Sternum  0.475   0.463

               Factor1 Factor2
SS loadings      2.017   1.665
Proportion Var   0.403   0.333
Cumulative Var   0.403   0.736

> 1 - sp.fa.vm$uniqueness

   Length    Extent      Head   Humerus   Sternum
0.9950000 0.7151875 0.6186305 0.9123528 0.4403432

> sp.fa.pm <- factanal(sp, factors = 2,

+ rotation = "promax")

> loadings(sp.fa.pm)

Loadings:
        Factor1 Factor2
Length  -0.184   1.143
Extent   0.588   0.293
Head     0.614   0.200
Humerus  1.138  -0.234
Sternum  0.358   0.338

               Factor1 Factor2
SS loadings      2.180   1.601
Proportion Var   0.436   0.320
Cumulative Var   0.436   0.756

> 1 - sp.fa.pm$uniqueness

   Length    Extent      Head   Humerus   Sternum
0.9950000 0.7151875 0.6186305 0.9123528 0.4403432

Notice that the communalities are unchanged by the choice of rotation, as discussed in Sect. 12.7.

12.9 Concluding comments

Activity 12.C: Read Manly, Sect. 7.6.

Factor analysis is perceived as valuable by many, and with scepticism by many others. We present the technique here as a tool, without judgement. Of note, however, is that Wilks [49] does not consider fa; he only mentions in passing that pca and fa are distinct methods.

12.10 Exercises

Ex. 12.10: The data file toowoomba.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure deficit (in hPa) at Toowoomba from 1 January 1889 to 21 July 2002 (a total of 41474 observations on six variables). Perform a fa to find two underlying factors, and compare the factors using no rotation, promax rotation and varimax rotation.

Ex. 12.11: In a certain factor analysis, the factor loadings were computed as shown in the following table.


          F1    F2
    X1   0.3   0.5
    X2   0.8   0.1
    X3   0.1   0.8
    X4   0.6   0.7

(a) Draw the path model for this problem.

(b) Determine the uniqueness for each variable.

Ex. 12.12: Consider again the air quality data from 41 cities in the USA, as seen in Example 10.1. For each city, seven variables have been measured (see p 218). The first is the concentration of SO2 in micrograms per cubic metre; the other six are potential identifiers of pollution problems. The original source treats the concentration of SO2 as a response variable, and the other six as covariates.

(a) Is it possible to reduce these six covariates to a smaller number, without losing much information? How many factors are adequate?

(b) Use an appropriate fa to perform a data reduction. If possible, find a useful interpretation of the resultant factors.

(c) Perform a regression analysis using SO2 as the response, and the factors as regressors. Compare to a regression of SO2 on all the original variables, and comment. (To regress variables A and B against Y in r, use m1 <- lm(Y ~ A + B); then names(m1) and summary(m1) may prove useful.)

Ex. 12.13: The data file gatton.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure deficit (in hPa) at Gatton from 1 January 1889 to 15 September 2002 (a total of 41474 observations on six variables). Perform a fa to find two underlying factors, and compare the factors using no rotation, promax rotation and varimax rotation.

Ex. 12.14: The data file strainfall.dat contains the average monthly and annual rainfall (in tenths of mm) for 363 Australian rainfall stations.

(a) Perform a fa. How many factors seem necessary?

(b) How many factors are useful?

(c) If possible, find a rotation that provides a useful interpretation for the factors.


Ex. 12.15: The data file jondaryn.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure deficit (in hPa) at Jondaryn from 1 January 1889 to 21 July 2002 (a total of 41474 observations on six variables). Perform a fa to find two underlying factors, and compare the factors using no rotation, promax rotation and varimax rotation.

Ex. 12.16: The data file emerald.dat contains the daily rainfall, maximum and minimum temperatures, radiation, pan evaporation and maximum vapour pressure deficit (in hPa) at Emerald from 1 January 1889 to 21 July 2002 (a total of 41474 observations on six variables). Perform a fa to find two underlying factors, and compare the factors using no rotation, promax rotation and varimax rotation.

Ex. 12.17: The data file wind_ca.dat contains numerous weather and wind measurements from Canberra during 1989.

(a) Perform a fa on the data.

(b) How many factors are necessary to summarize the data? Explain.

(c) If possible, interpret the factors. What rotation makes for easiest interpretation?

Ex. 12.18: The data file wind_wp.dat contains numerous weather and wind measurements from Wilson's Promontory, Victoria (the most southerly point of mainland Australia) during 1989.

(a) Perform a fa on the data.

(b) How many factors are necessary to summarize the data? Explain.

(c) If possible, interpret the factors. What rotation makes for easiest interpretation?

Ex. 12.19: The data file qldweather.dat contains six weather-related variables for 20 Queensland cities.

(a) Perform a fa. How many factors seem necessary?

(b) How many factors are useful?

(c) If possible, find a rotation that provides a useful interpretation for the factors.

Ex. 12.20: This question concerns a data set that is not climatological, but you may find interesting. The data file chocolates.dat, available from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/popular/chocolates.html, contains measurements of the price, weight and nutritional information for 17 chocolates commonly available in Queensland stores. The data was gathered in April 2002 in Brisbane.


(a) Perform a fa using the nutritional information.

(b) How many factors are useful?

(c) If possible, find a rotation that provides a useful interpretation for the factors.

12.10.1 Answers to selected Exercises

12.10 Here is a brief analysis.

> tw <- read.table("toowoomba.dat", header = TRUE)

> tw.2.n <- factanal(tw[4:9], factors = 2,

+ rotation = "none")

> tw.2.v <- factanal(tw[4:9], factors = 2,

+ rotation = "varimax")

> tw.2.p <- factanal(tw[4:9], factors = 2,

+ rotation = "promax")


Module 13

Cluster Analysis

Module contents
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 280
13.2 Types of cluster analysis . . . . . . . . . . . . . . . . . . 280
     13.2.1 Hierarchical methods . . . . . . . . . . . . . . . . 280
13.3 Problems with cluster analysis . . . . . . . . . . . . . . . 281
13.4 Measures of distance . . . . . . . . . . . . . . . . . . . . 281
13.5 Using PCA and cluster analysis . . . . . . . . . . . . . . . 281
13.6 Using r . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
13.7 Some final comments . . . . . . . . . . . . . . . . . . . . . 287
13.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 288
     13.8.1 Answers to selected Exercises . . . . . . . . . . . . 290

Module objectives

Upon completion of this module students should be able to:

• understand the principles underlying cluster analysis;

• compute clusters using r;

• select an appropriate number of clusters for a given data set;

• plot a dendrogram using r.


13.1 Introduction

Cluster analysis, unlike PCA and factor analysis, is a classification technique.

Activity 13.A: Read Manly, section 9.1.

Example 13.1: Kavvas and Delleur [28] use a cluster analysis for modelling sequences of daily rainfall in Indiana.

Example 13.2: Fritts [17] describes two techniques for examining the relationship between ring-width of conifers in western North America and climatic variables. The second technique is a cluster analysis, which he uses to identify similarities and differences in the response function and then to classify the tree sites.

13.2 Types of cluster analysis

Activity 13.B: Read Manly, section 9.2.

The simple idea of cluster analysis is explained in Manly, section 9.1. The actual mechanics, however, can be performed in numerous ways. Manly discusses two of these: hierarchical clustering (using hclust in r), the first method mentioned by Manly; and k-means clustering (using kmeans), the second method mentioned by Manly. The hierarchical methods are discussed in more detail in both Manly and these notes.

13.2.1 Hierarchical methods

Activity 13.C: Read Manly, section 9.3.

The hierarchical methods discussed in this section are well explained by the text. The third method, using group averages, can be performed in r using the option method = "average" in the call to hclust. A similar approach to the first method is found using the option method = "single". r also provides other hierarchical clustering methods; see ?hclust.


13.3 Problems with cluster analysis

Activity 13.D: Read Manly, section 9.4.

13.4 Measures of distance

The hierarchical clustering methods are all based on measures of distance between observations. There are different measures of distance that can be used besides the standard Euclidean distance.

Activity 13.E: Read Manly, sections 9.5, 5.1, 5.2 and 5.3.

13.5 Using PCA and cluster analysis

As mentioned in Sect. 11.3, PCA is often a preliminary step before conducting a cluster analysis.

Activity 13.F: Read Manly, section 9.6.

Example 13.3: Stone & Auliciems [42] use a combination of cluster analysis and PCA to define phases of the Southern Oscillation Index (SOI).

13.6 Using R

Cluster analysis can be performed in r, as has briefly been mentioned previously. The primary functions to use for hierarchical methods are hclust (which performs the clustering) and dist (which computes the distance matrix on which the clustering is based). The default distance measure is the standard Euclidean distance.

After hclust is used, the resultant object can be plotted; the default plot is the dendrogram (Manly, Figure 9.1).

For k-means clustering (called partitioning in Manly), the function kmeans can be used.


Example 13.4: Consider the example concerning European countries used by Manly in Example 9.1. (While not climatological, it will demonstrate how to do equivalent analyses in r.) The following code analyses the data. First the data is loaded and labelled ec:

> ec <- read.table("europe.txt", header = TRUE)

Then, attach the data:

> attach(ec)

The example in Manly uses standardized data (see the top of page 135). Here is one way to do this in r:

> ec.std <- scale(ec)

Now that the data is prepared, the clustering can commence. The clustering method used in the example is the nearest neighbour method; the most similar of the methods available in r is called method = "single". The distance measure used is the default Euclidean distance.

> es.hc <- hclust(dist(ec.std), method = "single")

> plot(es.hc, hang = -1)

The final plot, shown in Fig. 13.1, looks very similar to that shown in Manly Figure 9.3.

You can try other methods if you want to experiment. To then determine which countries are in which cluster, the function cutree is used; here is an example of extracting four clusters:

> cutree(es.hc, k = 4)

       Belgium        Denmark         France
             1              1              1
     W.Germany        Ireland          Italy
             1              1              1
     Luxemborg    Netherlands             UK
             1              1              1
       Austria        Finland         Greece
             1              1              1
        Norway       Portugal          Spain
             1              1              2
        Sweden    Switzerland         Turkey
             1              1              3
      Bulgaria Czechoslovakia      E.Germany
             1              1              1
       Hungary         Poland        Romania
             1              1              1
          USSR     Yugoslavia
             1              4


Figure 13.1: The dendrogram after fitting a hierarchical clustering model (using the single agglomeration method) to the European countries data.



> sort(cutree(es.hc, k = 4))

       Belgium        Denmark         France
             1              1              1
     W.Germany        Ireland          Italy
             1              1              1
     Luxemborg    Netherlands             UK
             1              1              1
       Austria        Finland         Greece
             1              1              1
        Norway       Portugal         Sweden
             1              1              1
   Switzerland       Bulgaria Czechoslovakia
             1              1              1
     E.Germany        Hungary         Poland
             1              1              1
       Romania           USSR          Spain
             1              1              2
        Turkey     Yugoslavia
             3              4

Later (Example 13.6), we will see that using Ward's method is common in the climatological literature. This produces four different clusters (Fig. 13.2).

> es.hc.w <- hclust(dist(ec.std), method = "ward")

> plot(es.hc.w, hang = -1)

> sort(cutree(es.hc.w, k = 4))

       Belgium        Denmark         France
             1              1              1
       Ireland    Netherlands             UK
             1              1              1
       Austria        Finland         Norway
             1              1              1
        Sweden      W.Germany          Italy
             1              2              2
     Luxemborg         Greece       Portugal
             2              2              2
         Spain    Switzerland         Turkey
             2              2              3
    Yugoslavia       Bulgaria Czechoslovakia
             3              4              4
     E.Germany        Hungary         Poland
             4              4              4
       Romania           USSR
             4              4


Figure 13.2: The dendrogram after fitting a hierarchical clustering model (using Ward's method) to the European countries data.


Which clustering seems to produce the more sensible clusters? Why?

Example 13.5: On page 137, Manly discusses using the partitioning, or k-means, method on the European countries data. This can also be done in r; firstly, grouping into two groups:

> ec.km2 <- kmeans(ec, centers = 2)

> row.names(ec)[ec.km2$cluster == 1]


[1] "Belgium" "Denmark"[3] "France" "W.Germany"[5] "Ireland" "Italy"[7] "Luxemborg" "Netherlands"[9] "UK" "Austria"[11] "Finland" "Norway"[13] "Portugal" "Spain"[15] "Sweden" "Switzerland"[17] "Bulgaria" "Czechoslovakia"[19] "E.Germany" "Hungary"[21] "USSR"

> row.names(ec)[ec.km2$cluster == 2]

[1] "Greece" "Turkey" "Poland"[4] "Romania" "Yugoslavia"

These are different groups than those given in Manly (since a different algorithm is used). Six groups can also be specified:

> ec.km6 <- kmeans(ec, centers = 6)

> row.names(ec)[ec.km6$cluster == 1]

[1] "Greece" "Yugoslavia"

> row.names(ec)[ec.km6$cluster == 2]

[1] "W.Germany" "Switzerland"[3] "Czechoslovakia" "E.Germany"

> row.names(ec)[ec.km6$cluster == 3]

[1] "Belgium" "Denmark" "Netherlands"[4] "UK" "Norway" "Sweden"

> row.names(ec)[ec.km6$cluster == 4]

[1] "Ireland" "Portugal" "Spain" "Bulgaria"[5] "Hungary" "Poland" "Romania" "USSR"

> row.names(ec)[ec.km6$cluster == 5]

[1] "Turkey"

> row.names(ec)[ec.km6$cluster == 6]


[1] "France" "Italy" "Luxemborg"[4] "Austria" "Finland"

Example 13.6: Unal, Kindap and Karaca [45] use cluster analysis to analyse Turkey's climate. The abstract states:

    Climate zones of Turkey are redefined by using . . . cluster analysis. Data from 113 climate stations for temperatures (mean, maximum and minimum) and total precipitation from 1951 to 1998 are used after standardizing with zero mean and unit variance, to confirm that all variables are weighted equally in the cluster analysis. Hierarchical cluster analysis is chosen to perform the regionalization. Five different techniques were applied initially to decide the most suitable method for the region. Stability of the clusters is also tested. It is decided that Ward's method is the most likely to yield acceptable results in this particular case, as is often the case in climatological research. Seven different climate zones are found, as in conventional climate zones, but with considerable differences at the boundaries.

In the above quote, it is noted that Ward's method is commonly used in climatology. This is specified in r as follows:

hclust(dist(data), method = "ward")

The clusters produced using different methods can be quite different (the default method is the complete agglomeration method).

13.7 Some final comments

A cluster analysis is generally used to classify data into clusters. It is rarely obvious how many clusters is ideal. There are, however, hypothesis tests available to help make this decision (see Wilks [49, p 424] for some references). In terms of hierarchical cluster analysis as described here, the dendrogram can help in the decision. One can select a value of the 'Height' or 'Distance' appropriately. An appropriate distance may be that value under which the clusters change rapidly; alternatively, interpretations may aid the clustering. After clustering, there are sometimes useful labels that can be applied to the clusters (in a data set containing climatic variables for various countries, for example, countries may be clustered by climatic regions, and labels such as 'Desert', 'Mediterranean', etc., may be applied).

In addition, it is often recommended that data be first standardized to similar scales before a cluster analysis is performed, especially if the data are in different units. (That is, subtract the mean from the variable, and divide by the standard deviation; use the r function scale.) This prevents variables with large variances dominating the distance measure.
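Putting these recommendations together, a typical hierarchical clustering run might look like the following minimal sketch, where dat is a hypothetical numeric data frame:

> dat.hc <- hclust(dist(scale(dat)), method = "ward")
> plot(dat.hc, hang = -1)   # inspect the dendrogram for a suitable cutting height
> cutree(dat.hc, h = 5)     # cut at height 5; alternatively use k = 4 for four clusters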

13.8 Exercises

Ex. 13.7: Try to reproduce Manly's Figure 9.4 by standardizing and then using hclust.

Ex. 13.8: The data file tempppt.dat contains the average July temperature (in °F) and the average July precipitation for 28 stations in the USA. Each station has also been classified as belonging to southeastern, central or northeastern USA.

(a) Plot the temperature and precipitation data on a set of axes, identifying on the plot the three regions the stations are from. Do the three regions appear to form clusters?

(b) Perform a cluster analysis using the temperature and precipitation data. Use various clustering methods and compare.

(c) How well are the stations clustered according to the three predefined classifications?

(d) Using a dendrogram, which two regions are most similar?

Ex. 13.9: The data file strainfall.dat contains the average monthly and annual rainfall (in tenths of mm) for 363 Australian rainfall stations.

(a) Perform a cluster analysis using the monthly averages. Use various clustering methods and compare.

(b) Using a dendrogram, how many classifications seem useful?

Ex. 13.10: Consider the data file strainfall.dat again.

(a) Perform a PCA on the data. Show that two PCs are reasonable.

(b) Plot the first PC against the second PC. What does this indicate?


(c) Perform a cluster analysis on the first two PCs.

(d) Using a dendrogram, how many classifications seem useful?

Ex. 13.11: This question concerns a data set that is not climatological, but you may find interesting. The data file chocolates.dat, available from http://www.sci.usq.edu.au/staff/dunn/Datasets/applications/popular/chocolates.html, contains measurements of the price, weight and nutritional information for 17 chocolates commonly available in Queensland stores. The data was gathered in April 2002 in Brisbane.

(a) Perform a cluster analysis using the nutritional information, using various clustering methods, and compare.

(b) Using a dendrogram, how many classifications seem useful? What broad names could be given to these classifications?

Ex. 13.12: The data file ustemps.dat contains the normal average January minimum temperature in degrees Fahrenheit, with the latitude and longitude of 56 U.S. cities. (See the help file for full details.) Perform a cluster analysis. How many clusters seem appropriate? Explain.

Ex. 13.13: In Exercise 11.18, the US pollution data was examined, and aPCA performed.

(a) Perform a cluster analysis of the first two PCs. Produce a dendrogram. Does it appear the cities can be clustered into a small number of groups, based on the first two PCs?

(b) Repeat the above exercise, but use the first three PCs. Compare the two cluster analyses.

Ex. 13.14: The data in the file qldweather.dat contains six weather-related variables for 20 Queensland cities (covering temperatures, rainfall, number of raindays, humidity) plus elevation.

(a) Perform a PCA to summarise the seven variables into a smaller number. How many PCs seem appropriate?

(b) Using the first three PCs, perform a cluster analysis (use Ward's method).

(c) Plot a dendrogram. Can you identify and find useful names for some clusters? A map of Queensland may be useful (Fig. 13.3).

(d) Compare your results to a cluster analysis on all numerical variables.

(e) Based on the cluster analysis of all variables, use cutree to divide into four clusters.

(f) Plot a star plot, and see if the clusters can be identified.


[Map of Queensland marking the towns Atherton, Birdsville, Brisbane, Cairns, Childers, Cunnamulla, Gladstone, Gympie, Innisfail, Mackay, Maryborough, Mt.Isa, Mt.Tamborine, Nambour, Rockhampton, Roma, Stanthorpe, Theodore, Toowomba, Townsville, Warwick and Weipa.]

Figure 13.3: A map of Queensland may be useful to label the clusters in Exercise 13.14.

Ex. 13.15: The data file countries.dat contains numerous variables from a number of countries, and the countries have been classified by region.

(a) Perform a cluster analysis on the original data. Given the re-gions of the countries, is there a sensible clustering that emerges?Explain.

(b) Perform a PCA on the data. How many PCs seem necessary?Let this number of PCs be p.

(c) Cluster the first p PCs. Given the regions of the countries, isthere a sensible clustering that emerges? Explain.

(d) How do these clusters compare to the clusters identified using allthe data?

13.8.1 Answers to selected Exercises

13.7 The following r code will work:

> mn <- read.table("mandible.txt", header = TRUE)      # read in the data
> mn.hc <- hclust(dist(scale(mn)), method = "single")  # single-linkage clustering on the scaled data
> plot(mn.hc, hang = -1)                               # draw the dendrogram with labels at the baseline
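If cluster memberships are needed as well (for example, to compare against known groupings), the fitted tree can be cut into a chosen number of groups with cutree; the choice of three groups here is only an illustration:

> mn.groups <- cutree(mn.hc, k = 3)   # assign each observation to one of three clusters
> table(mn.groups)                    # size of each cluster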


Appendix A

Installing other packages in R

Installing extra r packages can be a tricky business in Windows (I have never had any trouble in Linux, however). To install the packages oz and ncdf as used in Section 11.5, there are a couple of options.

• First try using the menu in r: click Packages | Install packages from CRAN. I have never got this to work for me, but some people have.

• If the above doesn’t work, there is an alternative. Follow these steps:

1. Check your version of r by typing version at the r prompt.

2. If the CD has packages for your version of r, then use the r menu to select Packages | Install package from local zip file and install the packages from the zip files on the CD.

3. If this doesn’t work, or your version of r differs from that on the CD, use your browser to go to http://mirror.aarnet.edu.au/pub/CRAN/, then click on r Binaries, then Windows, and then contrib. (Alternatively, go straight to http://mirror.aarnet.edu.au/pub/CRAN/bin/windows/contrib/.)

4. Select the directory/folder that corresponds to your version of r.

5. Then download the zip file you need and put it somewhere that you’ll remember.


6. Then, from the r menu, select Install package from local zip file, and point to where you saved the file.

Then you should have the package installed ready for use. At the r prompt, you can then type library(oz), for example, and the library is loaded.
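A further option, assuming the computer is connected to the internet, is to install packages directly from the r prompt rather than through the menus (r will usually ask you to choose a CRAN mirror first):

> install.packages(c("oz", "ncdf"))   # download and install both packages
> library(oz)                         # load a package once it is installed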


Appendix B

Review of statistical rules

B.1 Basic definitions

Experiment: Any situation where the outcome is uncertain is called an experiment.

Experiments range from the simple tossing of a coin to the complex simulation of a queuing system.

Sample space: For any experiment, the sample space S of the experiment consists of all possible outcomes for the experiment.

For example, the sample space of a coin toss is simply S = {tail, head}, usually abbreviated to {T, H}, whereas for a queuing system the sample space is the huge set of all possible realisations over time of people arriving and being served in the queue.

Event: An event E consists of any collection of points (set of outcomes) in the sample space.

For example, in the coin toss there are two possible outcomes: either T or H. These engender three possible nontrivial events: {T}, {H} and {T, H} (this last event always happens). When two coins are tossed there are four possible outcomes: TT, TH, HT or HH (using what I trust is an obvious notation). There are then fifteen possible nontrivial events, such as: two heads, E1 = {HH}; the first coin is a head, E2 = {HT, HH}; at least one of the coins is a tail, E3 = {TT, TH, HT}; etc.

This definition of an event as a set of outcomes is very important, as it allows us to discuss events at a level appropriate to the circumstances. For example, a driver is in the event “drunk” if his/her blood-alcohol content is above 0.05%. This groups all the possible outcomes of the level of alcohol (a real percentage) into two possible sets, that is, events: “drunk” or “not drunk.”

Mutually exclusive: A collection of events E1, E2, E3, . . . are mutually exclusive if, for i ≠ j, Ei and Ej have no outcomes in common.

For example, when tossing two coins, the above events E1 and E3 are mutually exclusive because HH (the only outcome in E1) is not in E3. But E2 and E3 are not mutually exclusive because HT is in both; neither are E1 and E2 mutually exclusive, since HH is in both.

The probabilities of events must satisfy the following rules of probability.

• For any event E, Pr {E} ≥ 0.

• Something always happens: Pr {S} = 1.

• If E1, E2, . . . , En are mutually exclusive events, then

  Pr {E1 ∪ E2 ∪ E3 ∪ · · · ∪ En} = ∑_{j=1}^{n} Pr {Ej} .

For example, we used these last two properties to determine the steady-state probabilities in a queue. Let event Ej denote that the queue is in state j (that is, with j people in the queue). These are clearly mutually exclusive events, as the queue cannot be in two states at once. Further, the sample space is the union of all possible states: S = E0 ∪ E1 ∪ E2 ∪ · · · and hence

  1 = Pr {S} = Pr {E0 ∪ E1 ∪ E2 ∪ · · · }
             = Pr {E0} + Pr {E1} + Pr {E2} + · · ·
             = π0 + π1 + π2 + · · · .

• Pr {Ē} = 1 − Pr {E}, where Ē is the complement of E, that is, the set of outcomes that are not in E.

We used this before too. For example, let the event E be that no-one is waiting in the queue, that is, the system is in state 0 or 1; then

  Pr {Ē} = 1 − Pr {E} = 1 − Pr {E0 ∪ E1} = 1 − Pr {E0} − Pr {E1}

gives the probability that there is someone waiting in the queue.


• If two events are not mutually exclusive, then

  Pr {E1 ∪ E2} = Pr {E1} + Pr {E2} − Pr {E1 ∩ E2} .

This is known as the general addition rule of probability. Note that if E1 and E2 are mutually exclusive then Pr {E1 ∩ E2} = 0, and so Pr {E1 ∪ E2} = Pr {E1} + Pr {E2}, as given above.

• For two events E1 and E2, the conditional probability that event E2 will occur, given that E1 has already occurred, is

  Pr {E2 | E1} = Pr {E1 ∩ E2} / Pr {E1} .

This gives rise to the general multiplication rule:

  Pr {E2 ∩ E1} = Pr {E1} Pr {E2 | E1} = Pr {E2} Pr {E1 | E2} .

Events E1 and E2 are termed independent if and only if Pr {E2 | E1} = Pr {E2}, or equivalently Pr {E1 | E2} = Pr {E1}, or equivalently Pr {E2 ∩ E1} = Pr {E1} Pr {E2}.

Example B.1: The probability that a person convicted of dangerous driving will be fined is 0.87, and the probability that he/she will lose his/her licence is 0.52. The probability that such a person will be fined and lose their licence is 0.41.

What is the probability that a person convicted of dangerous driving will either be fined, lose their licence, or both?

Solution: The events of being fined (F) and losing one's licence (L) are not mutually exclusive, therefore apply the general addition rule:

  Pr {F ∪ L} = Pr {F} + Pr {L} − Pr {F ∩ L}
             = 0.87 + 0.52 − 0.41
             = 0.98 .

Example B.2: A researcher knows that 60% of the goats in a certain district are male and that 30% of female goats have a certain disease. Find the probability that a goat picked at random from the district is a female and has the disease.


Solution: Apply the general multiplication rule, with F denoting a female goat and D the disease:

  Pr {F ∩ D} = Pr {F} Pr {D | F}
             = 0.4 × 0.3
             = 0.12 .

B.2 Mean and variance for sums of random variables

If X1 and X2 are random variables and c is a constant, then the following relationships must hold.

• E(cX1) = c E(X1)

• E(X1 + c) = E(X1) + c

• E(X1 + X2) = E(X1) + E(X2)

• Var(cX1) = c² Var(X1)

• Var(X1 + c) = Var(X1)

If X1 and X2 are independent random variables,

Var(X1 + X2) = Var(X1) + Var(X2) .

If X1 and X2 are not independent random variables,

Var(X1 + X2) = Var(X1) + Var(X2) + 2 Covar[X1, X2] ,

where Covar[X1, X2] is the covariance between X1 and X2.

Example B.3: A random variable X has a mean of 10 and variance of 5. Determine the mean and variance of 3X − 1.


Solution: Given that E(X) = 10 and Var(X) = 5:

• E(3X − 1) = E(3X) − 1 = 3 E(X) − 1 = 3(10) − 1 = 29

• Var(3X − 1) = Var(3X) = 9 Var(X) = 9(5) = 45
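These rules are easily checked by simulation. A minimal sketch in r (assuming, purely for illustration, that X is normally distributed with mean 10 and variance 5; the rules hold for any distribution):

> x <- rnorm(100000, mean = 10, sd = sqrt(5))   # simulate X
> mean(3 * x - 1)                               # should be close to 29
> var(3 * x - 1)                                # should be close to 45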

Example B.4: The alternative formula for the variance is derived as follows:

  Var(X) = E[(X − µX)²]
         = E[X² − 2µX X + µX²]
         = E[X²] + E[−2µX X] + µX²   by addition rules
         = E[X²] − 2µX E[X] + µX²    by multiplication rule
         = E[X²] − E[X]²             as µX = E(X) .

Definition B.1 (general expectation, variance and standard deviation)

• The expected value of any given function g(X) of a discrete random variable is

  E(g(X)) = ∑_x g(x) p(x) ,

where p(x) is its probability distribution.

• For a discrete random variable X, the variance of X is the expected value of g(X) where g(x) = (x − µX)² (recall µX = E(X)); that is,

  Var(X) = σX² = E[(X − µX)²] = ∑_x (x − µX)² p(x) .

The standard deviation of X is σX = √Var(X) .

For any distribution, the variance of X may also be computed from

  Var(X) = E(X²) − E(X)² ,

as was shown in Example B.4 above.

Example B.5: A random variable X has the following probability distribution:

  x      0     1     2     3     4
  p(x)   0.05  0.15  0.35  0.25  0.20

Determine the expected value of X and the variance of X.


Solution:

• E(X) = µX = ∑_x x p(x) = 0 × 0.05 + 1 × 0.15 + 2 × 0.35 + 3 × 0.25 + 4 × 0.20 = 2.40

• E(X²) = ∑_x x² p(x) = 0² × 0.05 + 1² × 0.15 + 2² × 0.35 + 3² × 0.25 + 4² × 0.20 = 7.0, therefore Var(X) = E(X²) − E(X)² = 7.0 − 2.40² = 1.24


Appendix C

Some time series tricks in R

C.1 Helpful R commands

To convert an AR model to an MA model:

# Create an impulse time series (1, 0, 0, 0, ...)
imp <- as.ts( c(1, rep(0, 19)) )

# Apply the AR recursion to obtain the MA weights.
# Note that ar0 = 1 is assumed, and should not be included.
theta <- filter(imp, c(ar1, ar2, ...), "recursive")

To find the ACF of an AR model:

# Create an impulse time series (1, 0, 0, 0, ...)
imp <- as.ts( c(1, rep(0, 99)) )

# Now convert to an MA model, as above.
# Note that ar0 = 1 is assumed, and should not be included.
theta <- filter(imp, c(ar1, ar2, ...), "recursive")

# Now get gamma (the autocovariances):
convolve( theta, theta )
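As a concrete sketch of the first trick (the AR(1) coefficient of 0.6 here is arbitrary, chosen only for illustration), the recursive filter produces the MA weights, and the implied autocorrelations can be cross-checked against r's built-in ARMAacf:

imp <- as.ts( c(1, rep(0, 19)) )         # impulse series (1, 0, 0, ...)
theta <- filter(imp, 0.6, "recursive")   # MA (psi) weights: 1, 0.6, 0.36, 0.216, ...
ARMAacf(ar = 0.6, lag.max = 5)           # theoretical ACF of this AR(1): 0.6^k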


Appendix D

Time series functions in R

The following is a list of the functions available in r for time series analysis.

Table D.1: The time series library in r.

Function          Description
acf               Autocovariance and Autocorrelation Function Estimation
ar                Fit Autoregressive Models to Time Series
ar.burg           Fit Autoregressive Models to Time Series
ar.mle            Fit Autoregressive Models to Time Series
ar.ols            Fit Autoregressive Models to Time Series by OLS
ar.yw             Fit Autoregressive Models to Time Series
arima             ARIMA Modelling of Time Series
austres           Quarterly Time Series: Number of Australian Residents
bandwidth.kernel  Smoothing Kernel Objects
beaver1           Body Temperature Series of Two Beavers
beaver2           Body Temperature Series of Two Beavers
beavers           Body Temperature Series of Two Beavers
BJsales           Sales Data with Leading Indicator
Box.test          Box–Pierce and Ljung–Box Tests
ccf               Autocovariance and Autocorrelation Function Estimation
cpgram            Plot Cumulative Periodogram
df.kernel         Smoothing Kernel Objects
diffinv           Discrete Integrals: Inverse of Differencing
embed             Embedding a Time Series
EuStockMarkets    Daily Closing Prices of Major European Stock Indices, 1991–1998
fdeaths           Monthly Deaths from Lung Diseases in the UK
filter            Linear Filtering on a Time Series
is.tskernel       Smoothing Kernel Objects
kernapply         Apply Smoothing Kernel
kernel            Smoothing Kernel Objects
lag               Lag a Time Series
lag.plot          Time Series Lag Plots
LakeHuron         Level of Lake Huron 1875–1972
ldeaths           Monthly Deaths from Lung Diseases in the UK
lh                Luteinizing Hormone in Blood Samples
lynx              Annual Canadian Lynx trappings 1821–1934
mdeaths           Monthly Deaths from Lung Diseases in the UK
na.contiguous     NA Handling Routines for Time Series
nottem            Average Monthly Temperatures at Nottingham, 1920–1939
pacf              Autocovariance and Autocorrelation Function Estimation
plot.acf          Plotting Autocovariance and Autocorrelation Functions
plot.spec         Plotting Spectral Densities
plot.stl          Methods for STL Objects
plot.tskernel     Smoothing Kernel Objects
PP.test           Phillips–Perron Unit Root Test
predict.ar        Fit Autoregressive Models to Time Series
predict.arima0    ARIMA Modelling of Time Series - Preliminary Version
print.ar          Fit Autoregressive Models to Time Series
print.arima0      ARIMA Modelling of Time Series - Preliminary Version
print.stl         Methods for STL Objects
print.tskernel    Smoothing Kernel Objects
spec              Spectral Density Estimation
spec.ar           Estimate Spectral Density of a Time Series from AR Fit
spec.pgram        Estimate Spectral Density of a Time Series from Smoothed Periodogram
spec.taper        Taper a Time Series
spectrum          Spectral Density Estimation
stl               Seasonal Decomposition of Time Series by Loess
summary.stl       Methods for STL Objects
sunspot           Yearly Sunspot Data, 1700–1988; Monthly Sunspot Data, 1749–1997
toeplitz          Form Symmetric Toeplitz Matrix
treering          Yearly Treering Data, −6000–1979
ts.intersect      Bind Two or More Time Series
ts.plot           Plot Multiple Time Series
ts.union          Bind Two or More Time Series
UKDriverDeaths    Deaths of Car Drivers in Great Britain, 1969–84
UKLungDeaths      Monthly Deaths from Lung Diseases in the UK
USAccDeaths       Accidental Deaths in the US 1973–1978
tskernel          Smoothing Kernel Objects
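To see a few of these functions working together, here is a brief sketch using the LakeHuron data set listed above (the AR(2) model order is chosen purely for illustration):

> data(LakeHuron)                                 # annual lake levels, 1875-1972
> acf(LakeHuron)                                  # sample autocorrelation function
> lh.fit <- arima(LakeHuron, order = c(2, 0, 0))  # fit an AR(2) model
> tsdiag(lh.fit)                                  # diagnostic plots for the fitted model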


Appendix E

Multivariate analysis functions in R

Table E.1: The multivariate statistics library in r.

Function                Description
ability.cov             Ability and Intelligence Tests
as.dendrogram           General Tree Structures
as.dist                 Distance Matrix Computation
as.hclust               Convert Objects to Class hclust
as.matrix.dist          Distance Matrix Computation
biplot                  Biplot of Multivariate Data
biplot.princomp         Biplot for Principal Components
cancor                  Canonical Correlations
cmdscale                Classical (Metric) Multidimensional Scaling
cut.dendrogram          General Tree Structures
cutree                  Cut a tree into groups of data
dist                    Distance Matrix Computation
factanal                Factor Analysis
factanal.fit.mle        Factor Analysis
format.dist             Distance Matrix Computation
Harman23.cor            Harman Example 2.3
Harman74.cor            Harman Example 7.4
hclust                  Hierarchical Clustering
identify.hclust         Identify Clusters in a Dendrogram
kmeans                  K-Means Clustering
loadings                Print Loadings in Factor Analysis
names.dist              Distance Matrix Computation
plclust                 Hierarchical Clustering
plot.dendrogram         General Tree Structures
plot.hclust             Hierarchical Clustering
plot.prcomp             Principal Components Analysis
plot.princomp           Principal Components Analysis
plotNode                General Tree Structures
plotNodeLimit           General Tree Structures
prcomp                  Principal Components Analysis
predict.princomp        Principal Components Analysis
princomp                Principal Components Analysis
print.dist              Distance Matrix Computation
print.factanal          Print Loadings in Factor Analysis
print.hclust            Hierarchical Clustering
print.loadings          Print Loadings in Factor Analysis
print.prcomp            Principal Components Analysis
print.princomp          Principal Components Analysis
print.summary.prcomp    Principal Components Analysis
print.summary.princomp  Summary method for Principal Components Analysis
promax                  Rotation Methods for Factor Analysis
rect.hclust             Draw Rectangles Around Hierarchical Clusters
screeplot               Screeplot of PCA Results
summary.prcomp          Principal Components Analysis
summary.princomp        Summary method for Principal Components Analysis
varimax                 Rotation Methods for Factor Analysis
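A typical multivariate workflow chains several of these functions together. A minimal sketch, assuming a numeric data frame dat (hypothetical here), along the lines of Exercise 13.14 (note that some versions of r name Ward's method "ward", while more recent versions use "ward.D"):

> dat.pc <- princomp(dat, cor = TRUE)                   # PCA on the correlation matrix
> screeplot(dat.pc)                                     # how many PCs to retain?
> dat.hc <- hclust(dist(scale(dat)), method = "ward")   # hierarchical clustering
> plot(dat.hc)                                          # dendrogram
> rect.hclust(dat.hc, k = 4)                            # outline a four-cluster solution
> dat.groups <- cutree(dat.hc, k = 4)                   # extract the cluster memberships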

