Regression II: Analysis of diagnostics. Standard diagnostics, bootstrap, cross-validation.

Regression II: Analysis of diagnostics

• Standard diagnostics

• Bootstrap

• Cross-validation

Standard diagnostics

Before starting to model

1) Visualisation of data: plotting predictors vs observations. These plots may give a clue about the form of the relationship and reveal outliers

2) Smoothers

After modelling and fitting

1) Fitted values vs residuals. It may help to identify outliers and to check the correctness of the model

2) Normal QQ plot of residuals. It may help to check distributional assumptions

3) Cook’s distance. It may reveal outliers (influential observations) and help to check the correctness of the model

4) Model assumptions: t tests given by summary of the lm fit

Checking model and designing tests

1) Cross-validation. If you have a choice of models, then cross-validation may help to choose the “best” model

2) Bootstrap. The validity of the model can be checked if the distribution of the statistic of interest is available. If it is not, the distribution can be generated using the bootstrap

Visualisation prior to modelling

Different types of datasets may require different visualisation tools. For simple visualisation either plot(data) or pairs(data,panel=panel.smooth) could be used. Visualisation prior to modelling may help to propose a model (the form of the functional relationship between input and output, the probability distribution of the observations, etc.)

For example, the dataset women contains weights and heights measured for 15 cases. The plot and pairs commands produce scatter plots of one variable against the other.
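The two commands mentioned above can be tried on the women dataset roughly as follows. This is a minimal sketch: the null PDF device and the correlation at the end are additions for illustration (so that the code also runs non-interactively) and are not part of the original slides.

```r
# Open a null graphics device so the sketch runs without a display;
# in an interactive session this line and dev.off() can be dropped.
pdf(NULL)

data(women)      # built-in dataset: height (in) and weight (lbs), 15 rows
str(women)

# Simple scatter plot of the two variables
plot(women)

# Scatter plot matrix with a smooth curve drawn in each panel
pairs(women, panel = panel.smooth)

# A quick numeric companion to the plots: the sample correlation
r <- cor(women$height, women$weight)
print(r)

invisible(dev.off())
```

A correlation close to 1 only says that the relationship is strongly monotone; whether it is actually linear is what the later diagnostics are meant to check.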

After modelling: linear models

After modelling, the results should be analysed. For example

attach(women)

lm1 = lm(weight~height)

It means that we want a linear model (we believe that the dependence of weight on height is linear)

$\text{weight} = \beta_0 + \beta_1 \cdot \text{height}$

Results could be viewed using

lm1

summary(lm1)

The last command also reports the significance of the individual coefficients. Significance levels produced by summary should be considered carefully. If there are many coefficients, then the chance that at least one “significant” effect is observed purely by chance is very high.
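The warning about many coefficients can be made concrete with a small calculation. The sketch below extracts the coefficient table from summary and then computes the probability of at least one false "significant" result among k tests. The value k = 20 is hypothetical, and the formula assumes independent tests, which is only an approximation for coefficients of the same model. The Bonferroni adjustment is shown as one common remedy, not as the method used in the lecture.

```r
data(women)
fit <- lm(weight ~ height, data = women)

# Coefficient table: estimate, standard error, t value and p-value
ctab <- coef(summary(fit))
print(ctab)

# Chance of at least one false positive among k independent tests
# at level alpha when none of the effects is real
alpha <- 0.05
k <- 20                        # hypothetical number of coefficients
p_any <- 1 - (1 - alpha)^k
print(p_any)

# One common remedy: adjust the p-values for multiple testing
p_adj <- p.adjust(ctab[, "Pr(>|t|)"], method = "bonferroni")
print(p_adj)
```

With 20 tests at the 5% level the chance of at least one spurious finding is roughly 64%, which is why isolated small p-values in a large model deserve caution.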

After modelling: linear models

It is a good idea to plot the data, the fitted model, and the differences between fitted and observed values on the same graph. For linear models with one predictor it can be done using:

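plot(height,weight)  # predictor on the x axis, response on the y axis

abline(lm1)  # fitted line from lm(weight~height)

segments(height,fitted(lm1),height,weight)  # vertical segments from fitted to observed values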

This plot already shows some systematic differences. It is an indication that the model may need to be revised.

Checking validity of the model: standard tools

Plotting fitted values vs residuals, the QQ plot and Cook’s distance can give some insight into the model and how to improve it. Some of these plots can be produced using

plot(lm1)
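The quantities behind these default plots can also be computed directly, which is useful when a plot needs to be customised or a value inspected. The following is a minimal sketch using standard functions from the stats package; the variable names are chosen here for illustration.

```r
pdf(NULL)                       # null device so the sketch runs anywhere
data(women)
fit <- lm(weight ~ height, data = women)

res <- residuals(fit)           # observed minus fitted values
fv  <- fitted(fit)

# 1) Residuals vs fitted values: a curved pattern suggests a missing term
plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# 2) Normal QQ plot of standardised residuals
qqnorm(rstandard(fit))
qqline(rstandard(fit))

# 3) Cook's distance for each observation
cd <- cooks.distance(fit)
plot(cd, type = "h", ylab = "Cook's distance")
print(which.max(cd))            # index of the most influential case

invisible(dev.off())
```

Residuals that are positive at both ends of the fitted range and negative in the middle are the kind of systematic pattern that points towards adding a quadratic term.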

Prediction and confidence bands

lm1 = lm(height~weight)  # refit with height as the response

pp = predict(lm1,interval='p')

pc = predict(lm1,interval='c')

plot(weight,height,ylim=range(height,pp))

n1=order(weight)

matlines(weight[n1],pp[n1,],lty=c(1,2,2),col='red')

matlines(weight[n1],pc[n1,],lty=c(1,3,3),col='red')

These commands produce two sets of bands: narrow and wide. The narrow band is the confidence band for the fitted mean, and the wide band is the prediction band for new observations.
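The difference between the two bands can also be checked numerically. The sketch below compares the interval widths at a few weights; the three weight values are hypothetical points chosen inside the observed range, and the variable names are illustrative.

```r
data(women)
fit <- lm(height ~ weight, data = women)

# Hypothetical new weights inside the observed range (115 to 164 lbs)
new <- data.frame(weight = c(120, 140, 160))

pc <- predict(fit, newdata = new, interval = "confidence")
pp <- predict(fit, newdata = new, interval = "prediction")

# Width of each interval: upper limit minus lower limit
width_c <- pc[, "upr"] - pc[, "lwr"]
width_p <- pp[, "upr"] - pp[, "lwr"]
print(cbind(new, width_c, width_p))
```

The prediction interval is always wider because it adds the variance of a single new observation to the uncertainty of the estimated mean.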

Bootstrap confidence lines

Similarly, bootstrap lines can be calculated using

boot_lm(women,flm0,1000)

Functions boot_lm and flm0 are available from the course’s website
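Since boot_lm and flm0 are not reproduced here, the idea can be illustrated with a generic case-resampling bootstrap. This is a sketch under that assumption and not the course's implementation: the number of replicates, the seed, and all variable names are chosen for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
pdf(NULL)                        # null device so the sketch runs anywhere
data(women)

n <- nrow(women)
B <- 1000                        # number of bootstrap replicates
boot_coef <- matrix(NA_real_, nrow = B, ncol = 2,
                    dimnames = list(NULL, c("intercept", "slope")))

plot(women$weight, women$height, xlab = "weight", ylab = "height")

for (b in seq_len(B)) {
  idx <- sample.int(n, n, replace = TRUE)        # resample cases with replacement
  fit_b <- lm(height ~ weight, data = women[idx, ])
  boot_coef[b, ] <- coef(fit_b)
  if (b <= 100) abline(fit_b, col = "grey")      # draw the first 100 bootstrap lines
}

fit <- lm(height ~ weight, data = women)
abline(fit, col = "red")                         # fit to the original data

# Percentile intervals for intercept and slope
ci <- apply(boot_coef, 2, quantile, probs = c(0.025, 0.975))
print(ci)

invisible(dev.off())
```

The spread of the grey lines plays the same role as the confidence band: a wide fan indicates that the fitted line is sensitive to which cases happen to be in the sample.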

Most of the above indicators suggest that a quadratic model (quadratic in the predictor, not in the parameters) may be better. One obvious way of “improving” the model is to assume that the dependence of height on weight is quadratic. This can still be done within the linear model, because the model remains linear in the parameters. We can fit a model that is polynomial in the predictor:

$\text{height} = \beta_0 + \beta_1 \cdot \text{weight} + \beta_2 \cdot \text{weight}^2 + \dots$

We will use quadratic model:

lm2 = lm(height~weight+I(weight^2))

Again, the summary of lm2 should be examined.

The default diagnostic plot, plot(lm2), now looks better.
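Besides looking at the plots, the linear and quadratic fits can be compared formally because the models are nested. The sketch below uses an F test via anova together with the residual sum of squares and AIC; these particular comparisons are an illustration added here, not something the slides prescribe.

```r
data(women)
fit1 <- lm(height ~ weight, data = women)
fit2 <- lm(height ~ weight + I(weight^2), data = women)

# F test for the added quadratic term (fit1 is nested in fit2)
a <- anova(fit1, fit2)
print(a)

# Residual sums of squares of the two fits
rss <- c(linear = deviance(fit1), quadratic = deviance(fit2))
print(rss)

# AIC penalises the extra parameter; smaller is better
aic <- AIC(fit1, fit2)
print(aic)
```

A small p-value in the second row of the anova table indicates that the quadratic term explains variation that the straight line leaves in the residuals.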

lm2 = lm(height~weight+I(weight^2))

pp = predict(lm2,interval='p')

pc = predict(lm2,interval='c')

plot(weight,height,ylim=range(height,pp))

n1=order(weight)

matlines(weight[n1],pp[n1,],lty=c(1,2,2),col='red')

matlines(weight[n1],pc[n1,],lty=c(1,3,3),col='red')

The confidence bands produced by these commands look narrower than those of the linear model.

The spread of the bootstrap confidence lines is also much smaller.

Which model is better?

One of the ways of selecting a model is cross-validation. There is no command in R for cross-validation of lm models. However, there is a command for glm, cv.glm from the boot package (glm stands for generalised linear model, which is the subject of the next lecture; for now we only need to know that lm and glm with family='gaussian' give the same fit). Let us use the default leave-one-out cross-validation

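library(boot)  # cv.glm comes from the boot package

lm1g = glm(height~weight,data=women,family='gaussian')

cv1.err = cv.glm(women,lm1g)  # default K = n gives leave-one-out cross-validation

cv1.err$delta

Results: 0.2572698, 0.2538942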

women1 = data.frame(h=height,w1=weight,w2=weight^2)

lm2g = glm(h~w1+w2,data=women1,family='gaussian')

cv2.err = cv.glm(women1,lm2g)

cv2.err$delta

Results: 0.007272508, 0.007148601

The second (quadratic) model has the smaller prediction error, so cross-validation favours it.
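Leave-one-out cross-validation for an lm fit can also be written by hand, which makes explicit what is being estimated. The function loocv below is a hypothetical helper written for this illustration and is not part of any package. For least-squares fits there is also a well-known shortcut based on the leverages, which should agree with the explicit loop and, if the setup matches, with the first component of delta reported by cv.glm.

```r
data(women)

# Hypothetical helper: mean squared leave-one-out prediction error
loocv <- function(formula, data) {
  n <- nrow(data)
  response <- all.vars(formula)[1]          # name of the response variable
  err <- numeric(n)
  for (i in seq_len(n)) {
    fit_i  <- lm(formula, data = data[-i, ])               # fit without case i
    pred_i <- predict(fit_i, newdata = data[i, , drop = FALSE])
    err[i] <- (data[[response]][i] - pred_i)^2             # squared prediction error
  }
  mean(err)
}

cv_lin  <- loocv(height ~ weight, women)
cv_quad <- loocv(height ~ weight + I(weight^2), women)
print(c(linear = cv_lin, quadratic = cv_quad))

# Shortcut for least squares: residuals rescaled by (1 - leverage)
fit <- lm(height ~ weight, data = women)
cv_short <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)
print(cv_short)
```

The shortcut avoids refitting the model n times, which matters little for 15 cases but becomes useful for larger datasets.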

Exercise 3

Take the data set city and analyse it using a linear model. Write a report.
