Advanced Topics in Analysis of Economic and Financial Data Using R and SAS (1)

ZONGWU CAI (a,b)

E-mail address: [email protected]

(a) Department of Mathematics & Statistics and Department of Economics, University of North Carolina, Charlotte, NC 28223, U.S.A.
(b) Wang Yanan Institute for Studies in Economics, Xiamen University, China

December 27, 2007

© 2007, ALL RIGHTS RESERVED by ZONGWU CAI

(1) This manuscript may be printed and reproduced for individual or instructional use, but may not be printed for commercial purposes.


Preface

This is an advanced-level course in econometrics and financial econometrics, with some basic theory and heavy applications. Our focus is on the SKILLS of analyzing real data using advanced econometric techniques and statistical software such as SAS and R. This is in line with our WISE spirit: STRONG THEORETICAL FOUNDATION and SKILL EXCELLENCE. In other words, this course covers some advanced topics in the analysis of economic and financial data, particularly nonlinear time series models and models related to economic and financial applications. The topics covered range from classical approaches to modern modeling techniques, up to the research frontiers. The difference between this course and others is that you will learn, step by step, how to build a model based on data (so-called "letting the data speak for themselves") through real data examples using statistical software, and how to explore real data using what you have learned. Therefore, no single book serves as a textbook for this course; materials from several books and articles will be provided instead. Necessary handouts, including computer codes such as SAS and R codes, will be provided (you might be asked to print out the materials yourself). Five or six projects, involving heavy computer work, are assigned throughout the term. Group discussion is allowed for the projects, particularly for writing the computer codes, but the final report for each project must be written in your own words. Copying from each other will be regarded as cheating. If you use the R language, which is similar to S-PLUS, you can download it from the public web site http://www.r-project.org/ and install it on your own computer, or you can use the PCs in our labs.

You are STRONGLY encouraged to use (but are not limited to) the package R, since it is a very convenient programming language for statistical analysis and Monte Carlo simulations, as well as for various applications in quantitative economics and finance. Of course, you are welcome to use any other package, such as SAS, MATLAB, GAUSS, STATA, SPSS, or EViews, but I may not be able to help you if you do so. Some materials are based on the lecture notes of Professor Robert H. Shumway, Department of Statistics, University of California at Davis, and of my colleague Professor Stanislav Radchenko, Department of Economics, University of North Carolina at Charlotte, and on the book by Tsay (2005). Some datasets were provided by Professor Robert H. Shumway, Department of Statistics, University of California at Davis, and by Professor Philip Hans Franses, Erasmus University Rotterdam, the Netherlands. I am very grateful to them for providing their lecture notes and datasets. Finally, many thanks to our Master's student Ms. Huiqun Ma for writing all the SAS codes.


How to Install R?

The main package used is R, which is free from the R Project for Statistical Computing. You also need Ox with G@RCH to fit various GARCH models. Students may use other packages or programs if they prefer.

Install R

(1) Go to the web site http://www.r-project.org/;
(2) click CRAN;
(3) choose a site for downloading, say http://cran.cnr.Berkeley.edu;
(4) click "Windows (95 and later)";
(5) click "base";
(6) click R-2.6.1-win32.exe (version of November 26, 2007) to save the file first, and then run it to install. (Note that the setup program is 29 megabytes and is updated every three months.)

The above steps install the basic R on your computer. If you need to install other packages, do the following:

(7) After R is installed, there is an icon on the screen; click the icon to get into R;
(8) go to the top menu, find "Packages", and click it;
(9) go down to "Install package(s)..." and click it;
(10) a new window appears; choose a location to download packages from, say USA(CA1), move the mouse there and click OK;
(11) a new window lists all packages; you can select any one of them and click OK, or you can select all of them and then click OK.
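Steps (7)-(11) can also be done from the R console itself rather than through the menus. The snippet below is a minimal sketch; the mirror URL and the package name (tseries) are illustrative examples, not requirements of this course:

```r
# Choose a CRAN mirror non-interactively, then install and load a package.
options(repos = c(CRAN = "https://cran.r-project.org"))
install.packages("tseries")   # example: a package for time series analysis
library(tseries)              # load the installed package into this session
```

This is often more convenient than the menu route when you need to reinstall packages after upgrading R.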

Install Ox

OxMetrics(tm) is a family of software packages providing an integrated solution for the econometric analysis of time series, forecasting, financial econometric modelling, and statistical analysis of cross-section and panel data. OxMetrics consists of a front-end program, also called OxMetrics, and individual application modules such as PcGive, STAMP, etc. OxMetrics Enterprise(tm) is a single product that includes all the important components: the OxMetrics desktop, G@RCH, Ox Professional, PcGive, and STAMP. To install Ox Console, please download the file http://www.math.uncc.edu/~zcai/installation Ox [email protected] and follow the steps.


Data Analysis and Graphics Using R: An Introduction (109 pages)

I encourage you to download the file r-notes.pdf (109 pages) from http://www.math.uncc.edu/~zcai/r-notes.pdf and learn it by yourself. Please see me if you have any questions.

Contents

1 Review of Multiple Regression Models  1
  1.1 Least Squares Estimation  1
  1.2 Model Diagnostics  3
    1.2.1 Box-Cox Transformation  3
    1.2.2 Reading Materials  4
  1.3 Computer Codes  4
  1.4 References  11

2 Classical and Modern Model Selection Methods  12
  2.1 Introduction  12
  2.2 Subset Approaches  13
  2.3 Sequential Methods  16
  2.4 Likelihood-Based Criteria  17
  2.5 Cross-Validation and Generalized Cross-Validation  19
  2.6 Penalized Methods  20
  2.7 Implementation in R  22
    2.7.1 Classical Models  22
    2.7.2 LASSO-Type Methods  23
    2.7.3 Example  24
  2.8 Computer Codes  32
  2.9 References  36

3 Regression Models With Correlated Errors  38
  3.1 Methodology  38
  3.2 Nonparametric Models with Correlated Errors  47
  3.3 Predictive Regression Models  48
  3.4 Computer Codes  48
  3.5 References  51

4 Estimation of Covariance Matrix  53
  4.1 Methodology  53
  4.2 Details (see the paper by Zeileis)  57
  4.3 Computer Codes  57
  4.4 References  58

5 Seasonal Time Series Models  59
  5.1 Characteristics of Seasonality  59
  5.2 Modeling  62
  5.3 Nonlinear Seasonal Time Series Models  71
  5.4 Computer Codes  72
  5.5 References  78

6 Robust and Quantile Regressions  80
  6.1 Robust Regression  80
  6.2 Quantile Regression  81
  6.3 Computer Codes  83
  6.4 References  84

7 How to Analyze Boston House Price Data?  86
  7.1 Description of Data  86
  7.2 Analysis Methods  87
    7.2.1 Linear Models  87
    7.2.2 Nonparametric Models  88
  7.3 Computer Codes  92
  7.4 References  94

8 Value at Risk  96
  8.1 Methodology  96
  8.2 R Commands (R Menu)  101
    8.2.1 Generalized Pareto Distribution  101
    8.2.2 Background of the Generalized Pareto Distribution  102
  8.3 Reading Materials I and II (see Handouts)  103
  8.4 New Developments (Nonparametric Approaches)  103
    8.4.1 Introduction  103
    8.4.2 Framework  107
    8.4.3 Nonparametric Estimating Procedures  108
    8.4.4 Real Examples  113
  8.5 References  114

9 Long Memory Models and Structural Changes  119
  9.1 Long Memory Models  119
    9.1.1 Methodology  119
    9.1.2 Spectral Density  121
    9.1.3 Applications  124
  9.2 Related Problems and New Developments  125
    9.2.1 Long Memory versus Structural Breaks  126
    9.2.2 Testing for Breaks (Instability)  127
    9.2.3 Long Memory versus Trends  133
  9.3 Computer Codes  133
  9.4 References  137

List of Tables

2.1 AICC values for ten models for the recruits series  30
9.1 Critical Values of the QLR statistic with 15% Trimming  129


List of Figures

1.1 Scatterplot with regression line and lowess smoothed curve.  9
1.2 Scatterplot with regression line and both lowess and loess smoothed curves as well as the Theil-Sen estimated line.  9
1.3 Residual plots.  10
1.4 (a) Leverage plot. (b) Influential plot. (c) Plots without observation #34.  10
2.1 Monthly SOI (left) and simulated recruitment (right) from a model (n = 453 months, 1950-1987).  25
2.2 The SOI series (black solid line) compared with a 12-point moving average (red thicker solid line). The left panel: original data; the right panel: filtered series.  26
2.3 Multiple lagged scatterplots showing the relationship between the present SOI (x_t) and the lagged values (x_{t+h}) at lags 1 <= h <= 16.  27
2.4 Autocorrelation functions of SOI and recruitment and cross-correlation function between SOI and recruitment.  28
2.5 Multiple lagged scatterplots showing the relationship between the SOI at time t + h, say x_{t+h} (x-axis), versus recruits at time t, say y_t (y-axis), 0 <= h <= 15.  28
2.6 Multiple lagged scatterplots showing the relationship between the SOI at time t, say x_t (x-axis), versus recruits at time t + h, say y_{t+h} (y-axis), 0 <= h <= 15.  29
2.7 Partial autocorrelation functions for the SOI (left panel) and the recruits (right panel) series.  30
2.8 ACF of residuals of AR(1) for SOI (left panel) and the plot of AIC and AICC values (right panel).  31
3.1 Quarterly earnings for Johnson & Johnson (4th quarter, 1970 to 1st quarter, 1980, left panel) with log transformed earnings (right panel).  41
3.2 Autocorrelation functions (ACF) and partial autocorrelation functions (PACF) for the detrended log J&J earnings series (top two panels) and the fitted ARIMA(0,0,0)x(1,0,0)_4 residuals.  41
3.3 Time plots of U.S. weekly interest rates (in percentages) from January 5, 1962 to September 10, 1999. The solid line (black) is the Treasury 1-year constant maturity rate and the dashed line (red) the Treasury 3-year constant maturity rate.  42
3.4 Scatterplots of U.S. weekly interest rates from January 5, 1962 to September 10, 1999: the left panel is 3-year rate versus 1-year rate, and the right panel is changes in 3-year rate versus changes in 1-year rate.  43
3.5 Residual series of linear regression Model I for two U.S. weekly interest rates: the left panel is the time plot and the right panel is the ACF.  43
3.6 Time plots of the change series of U.S. weekly interest rates from January 12, 1962 to September 10, 1999: changes in the Treasury 1-year constant maturity rate are denoted by the black solid line, and changes in the Treasury 3-year constant maturity rate are indicated by the red dashed line.  44
3.7 Residual series of the linear regression models: Model II (top) and Model III (bottom) for two change series of U.S. weekly interest rates: time plot (left) and ACF (right).  45
5.1 US Retail Sales Data from 1967-2000.  60
5.2 Four-weekly advertising expenditures on radio and television in The Netherlands, 1978.01-1994.13.  61
5.3 Number of live births 1948(1)-1979(1) and residuals from models with a first difference, a first difference and a seasonal difference of order 12, and a fitted ARIMA(0,1,1)x(0,1,1)_{12} model.  64
5.4 Autocorrelation functions and partial autocorrelation functions for the birth series (top two panels), the first difference (second two panels), an ARIMA(0,1,0)x(0,1,1)_{12} model (third two panels), and an ARIMA(0,1,1)x(0,1,1)_{12} model (last two panels).  65
5.5 Autocorrelation functions (ACF) and partial autocorrelation functions (PACF) for the log J&J earnings series (top two panels), the first difference (second two panels), ARIMA(0,1,0)x(1,0,0)_4 model (third two panels), and ARIMA(0,1,1)x(1,0,0)_4 model (last two panels).  68
5.6 ACF and PACF for the ARIMA(0,1,1)x(0,1,1)_4 model (top two panels) and the residual plots of ARIMA(0,1,1)x(1,0,0)_4 (left bottom panel) and ARIMA(0,1,1)x(0,1,1)_4 model (right bottom panel).  69
5.7 Monthly simple return of CRSP Decile 1 index from January 1960 to December 2003: time series plot of the simple return (left top panel), time series plot of the simple return after adjusting for the January effect (right top panel), the ACF of the simple return (left bottom panel), and the ACF of the adjusted simple return.  70
6.1 Different loss functions: quadratic, Huber's rho_c(.), rho_{c,0}(.), rho_{0.05}(.), LAD, and rho_{0.95}(.), where c = 1.345.  82
7.1 The results from model (7.1).  89
7.2 (a) Residual plot for model (7.1). (b) Plot of g_1(x_6) versus x_6. (c) Residual plot for model (7.2). (d) Density estimate of Y.  90
7.3 Boston Housing Price Data: displayed in (a)-(d) are the scatter plots of the house price versus the covariates X_13, X_6, X_1 and X_1, respectively.  91
7.4 Boston Housing Price Data: the plots of the estimated coefficient functions for three quantiles tau = 0.05 (solid line) for model (7.4), tau = 0.50 (dashed line), and tau = 0.95 (dotted line), and the mean regression (dot-dashed line): a_{0,tau}(u) and a_0(u) versus u in (a), a_{1,tau}(u) and a_1(u) versus u in (b), and a_{2,tau}(u) and a_2(u) versus u in (c). The thick dashed lines indicate the 95% point-wise confidence interval for the median estimate with the bias ignored.  92
8.1 (a) 5% CVaR estimate for DJI index. (b) 5% CES estimate for DJI index.  114
8.2 (a) 5% CVaR estimates for IBM stock returns. (b) 5% CES estimates for IBM stock returns. (c) 5% CVaR estimates for three different values of lagged negative IBM returns (0.275, 0.025, 0.325). (d) 5% CVaR estimates for three different values of lagged negative DJI returns (0.225, 0.025, 0.425). (e) 5% CES estimates for three different values of lagged negative IBM returns (0.275, 0.025, 0.325). (f) 5% CES estimates for three different values of lagged negative DJI returns (0.225, 0.025, 0.425).  115
9.1 Sample autocorrelation function of the absolute series of daily simple returns for the CRSP value-weighted (left top panel) and equal-weighted (right top panel) indexes. Sample partial autocorrelation function of the absolute series of daily simple returns for the CRSP value-weighted (left middle panel) and equal-weighted (right middle panel) indexes. The log smoothed spectral density estimation of the absolute series of daily simple returns for the CRSP value-weighted (left bottom panel) and equal-weighted (right bottom panel) indexes.  126
9.2 Break testing results for the Nile River data: (a) Plot of F-statistics. (b) The scatterplot with the breakpoint. (c) Plot of the empirical fluctuation process with linear boundaries. (d) Plot of the empirical fluctuation process with alternative boundaries.  130
9.3 Break testing results for the oil price data: (a) Plot of F-statistics. (b) Scatterplot with the breakpoint. (c) Plot of the empirical fluctuation process with linear boundaries. (d) Plot of the empirical fluctuation process with alternative boundaries.  131
9.4 Break testing results for the consumer price index data: (a) Plot of F-statistics. (b) Scatterplot with the breakpoint. (c) Plot of the empirical fluctuation process with linear boundaries. (d) Plot of the empirical fluctuation process with alternative boundaries.  132
Chapter 1

Review of Multiple Regression Models

1.1 Least Squares Estimation

We begin our discussion of univariate and multivariate regression (time series) models by considering the idea of a simple regression model, which we have met before in other contexts such as statistics or econometrics courses. All of the multivariate methods follow, in some sense, from the ideas involved in simple univariate linear regression. In this case, we assume that there is some collection of fixed known functions of time, say z_{t1}, z_{t2}, ..., z_{tq}, that are influencing our output y_t, which we know to be random. We express this relation between the inputs (predictors, independent variables, covariates, or exogenous variables) and the output (dependent or response variable) as

    y_t = beta_1 z_{t1} + beta_2 z_{t2} + ... + beta_q z_{tq} + e_t    (1.1)

at the time points t = 1, 2, ..., n, where beta_1, ..., beta_q are unknown fixed regression coefficients and e_t is a random error or noise, assumed to be white noise; this means that the errors have zero mean, equal variances sigma^2, and are uncorrelated (independent). We traditionally assume also that the white noise series, e_t, is Gaussian or normally distributed. Finally, of course, the basic assumption is that E(y_t | z_{t1}, ..., z_{tq}) is a linear function, namely beta_1 z_{t1} + beta_2 z_{t2} + ... + beta_q z_{tq}.

Question: If at least one of those (four) assumptions is violated, what should we do?

The linear regression model described by (1.1) can be conveniently written in slightly more general matrix notation by defining the column vectors z_t = (z_{t1}, ..., z_{tq})^T and beta = (beta_1, ..., beta_q)^T, so that we write (1.1) in the alternate form

    y_t = beta^T z_t + e_t.    (1.2)


To find estimators for beta and sigma^2, it is natural to determine the coefficient vector beta minimizing the sum of squared errors (SSE), sum_{t=1}^n e_t^2, with respect to beta. Of course, if the distribution of {e_t} is known, one should instead maximize the likelihood. This yields the least squares estimator (LSE) or maximum likelihood estimator (MLE) beta-hat, and the maximum likelihood estimator for sigma^2, which is proportional to the unbiased estimator

    s^2 = (1 / (n - q)) sum_{t=1}^n (y_t - beta-hat^T z_t)^2.    (1.3)

Note that the LSE is exactly the same as the MLE when the distribution of {e_t} is normal (WHY?).

An alternate way of writing the model (1.2) is

    y = Z beta + e,    (1.4)

where Z^T = (z_1, z_2, ..., z_n) is a q x n matrix composed of the values of the input variables at the observed time points, y^T = (y_1, y_2, ..., y_n) is the vector of observed outputs, and the errors are stacked in the vector e = (e_1, e_2, ..., e_n)^T. The ordinary least squares estimators are the solutions to the normal equations

    Z^T Z beta = Z^T y.

You need not be concerned with how the above equation is solved in practice, as all computer packages have efficient software for inverting the q x q matrix Z^T Z to obtain

    beta-hat = (Z^T Z)^{-1} Z^T y.    (1.5)

An important quantity that all software produces is a measure of uncertainty for the estimated regression coefficients, say

    Cov(beta-hat) = sigma^2 (Z^T Z)^{-1} = sigma^2 C = sigma^2 (c_{ij}).    (1.6)

Then Cov(beta-hat_i, beta-hat_j) = sigma^2 c_{ij}, and a 100(1 - alpha)% confidence interval for beta_i is

    beta-hat_i +/- t_{df}(alpha/2) s sqrt(c_{ii}),    (1.7)

where t_{df}(alpha/2) denotes the upper 100(1 - alpha/2)% point of a t distribution with df degrees of freedom. What is the df for our case?

Question: If at least one of those (four) assumptions is violated, are equations (1.6) and (1.7) still true? If not, how can we fix or modify them?
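As a quick numerical check of the algebra above, the closed-form solution (1.5) can be computed directly in R and compared with the output of lm(). This is only an illustrative sketch on simulated data; the variable names and true coefficient values are made up for the example:

```r
set.seed(1)
n <- 100
Z <- cbind(1, rnorm(n), rnorm(n))   # n x q design matrix (first column: intercept)
beta <- c(2, -1, 0.5)               # true coefficients, chosen arbitrarily
y <- Z %*% beta + rnorm(n)          # model (1.4): y = Z beta + e

# Solve the normal equations Z'Z beta = Z'y directly, as in (1.5)
beta.hat <- solve(t(Z) %*% Z, t(Z) %*% y)

# Compare with R's built-in least squares fit (-1 suppresses the extra intercept)
fit <- lm(y ~ -1 + Z)
print(cbind(beta.hat, coef(fit)))   # the two columns should agree
```

In practice lm() does not form (Z^T Z)^{-1} explicitly; it uses a QR decomposition, which is numerically more stable, but the fitted coefficients are the same.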


It seems that it is VERY IMPORTANT to make sure that all assumptions are satisfied. The question is how to carry out this task. Well, that is what is called model checking, or model diagnostics, discussed in the next section.

1.2 Model Diagnostics

1.2.1 Box-Cox Transformation

If the distribution of the error is not normal, a naive way to deal with this problem is to apply a transformation to the dependent variable. A simple and easy choice is the Box-Cox transformation in a regression model. When the errors are heterogeneous and often non-Gaussian, a Box-Cox power transformation of the dependent variable is a useful way to alleviate heteroscedasticity when the distribution of the dependent variable is not known. For situations in which the dependent variable Y is known to be positive, the following transformation can be used:

    Y^(lambda) = (Y^lambda - 1) / lambda,  if lambda != 0,
    Y^(lambda) = log(Y),                   if lambda = 0.

Given the vector of data observations {Y_i}_{i=1}^n, one way to select the power lambda is to use the value that maximizes the logarithm of the likelihood function

    f(lambda) = -(n/2) log S_n(Y^(lambda)) + (lambda - 1) sum_{i=1}^n log(Y_i),

where S_n(Y^(lambda)) is the sample variance of the transformed data {Y_i^(lambda)}. Generally, we can take lambda over a grid, say lambda in {-2 + j delta : 0 <= j <= 4/delta} for some small step delta, and choose the value that maximizes f(lambda).

Note that the Box-Cox transformation approach can also be applied to handle the case when any covariate enters nonlinearly. In that case, you need to apply a transformation to each individual covariate; when you apply the Box-Cox transformation to covariates, you need to minimize the SSE, instead of maximizing the likelihood, by running a regression. There is no built-in function for this in any statistical package, so when implementing this version of the Box-Cox transformation you have to write your own code. Another way to choose lambda is to use the Q-Q plot. The command in R for the Q-Q plot is qqnorm() for the normal Q-Q plot, or qqplot() for the Q-Q plot of one sample versus another.
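The grid search for lambda just described can be sketched in a few lines of R. This is an illustration on simulated data, not the course's own code; the function name and the grid step delta = 0.05 are arbitrary choices:

```r
# Profile log-likelihood for the Box-Cox parameter, up to a constant,
# following the formula above:
#   f(lambda) = -(n/2) * log(sample variance of transformed Y)
#               + (lambda - 1) * sum(log(Y))
boxcox.loglik <- function(y, lambda) {
  n <- length(y)
  yt <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  -(n / 2) * log(var(yt)) + (lambda - 1) * sum(log(y))
}

set.seed(2)
y <- exp(rnorm(200))             # positive data; log(y) is normal, so the
                                 # selected lambda should be near 0
grid <- seq(-2, 2, by = 0.05)    # the grid -2 + j*delta with delta = 0.05
ll <- sapply(grid, boxcox.loglik, y = y)
grid[which.max(ll)]              # selected power lambda
```

Note that for the dependent-variable case the MASS package also provides a boxcox() function that plots this profile log-likelihood for a fitted lm() object.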


1.2.2 Reading Materials

See the materials in the file simple-reg.htm. The file r-reference.htm also contains references on how to use R.

1.3 Computer Codes

To fit a multiple regression in R, one can use lm() or glm(); see the following for details:

lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE,
   x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL,
   offset, ...)

glm(formula, family = gaussian, data, weights, subset, na.action,
    start = NULL, etastart, mustart, offset, control = glm.control(...),
    model = TRUE, method = "glm.fit", x = FALSE, y = TRUE,
    contrasts = NULL, ...)

To fit a regression model without an intercept, you need to use

fit1 = lm(y ~ -1 + x1 + ... + x9)

where fit1 is the fitted-model object containing all the outputs you need. If you want to do model diagnostic checking, use

plot(fit1)

For multivariate data, it is usually a good idea to view the data as a whole using the pairwise scatter plots generated by the pairs() function:

pairs(data)

The smoothing parameter in lowess can be adjusted using the f= argument. The default is f = 2/3, meaning that 2/3 of the data points are used in calculating the fitted value at each point. If we set f = 1, we get the ordinary regression line; specifying f = 1/3 should yield a choppier curve.

lowess(x, y = NULL, f = 2/3, iter = 3, delta = 0.01 * diff(range(xy$x[o])))


There is a second loess function in R, loess, which has more options and generates more output.

loess(formula, data, weights, subset, na.action, model = FALSE,
      span = 0.75, enp.target, degree = 2, parametric = FALSE,
      drop.square = FALSE, normalize = TRUE,
      family = c("gaussian", "symmetric"),
      method = c("loess", "model.frame"), control = loess.control(...), ...)
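The effect of the span (f) parameter just described can be seen by overlaying fits with different values. The following sketch uses simulated data and arbitrary parameter values, purely for illustration:

```r
set.seed(3)
x <- sort(runif(150, 0, 10))
y <- sin(x) + rnorm(150, sd = 0.3)    # noisy nonlinear relationship

plot(x, y, main = "lowess vs. loess")
lines(lowess(x, y, f = 2/3), col = "red",  lwd = 2)   # default span: smoother
lines(lowess(x, y, f = 1/3), col = "blue", lwd = 2)   # smaller span: choppier
fit <- loess(y ~ x, span = 0.75)      # loess uses a formula interface
lines(x, fitted(fit), col = "darkgreen", lwd = 2, lty = 2)
legend("topright",
       c("lowess f = 2/3", "lowess f = 1/3", "loess span = 0.75"),
       col = c("red", "blue", "darkgreen"), lty = c(1, 1, 2), lwd = 2)
```

The smaller the span, the more local the fit, and the more the curve follows the noise rather than the trend.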

Example 1.1: We examine fitting a simple linear regression model to data using R. The data are from Harder and Thompson (1989). As part of a study to investigate reproductive strategies in plants, these biologists recorded the time spent at sources of pollen and the proportions of pollen removed by bumblebee queens and honeybee workers pollinating a species of Erythronium lily. The data set consists of three variables: (1) removed: proportion of pollen removed; (2) duration: duration of visit in seconds; and (3) code: 1 = bumblebee queens, 2 = honeybee workers. The response variable of interest is removed; the predictor is duration. Because removed is a proportion, it offers special challenges for analysis; we will ignore these issues for the moment. We will also look only at the queen bumblebees; I will leave consideration of the full collection of bees as a homework exercise. The following R code for Example 1.1 can be found in the file ex1-1.r.

# 10-12-2006
graphics.off()
beepollen