Linear Models Alan Lee Sample presentation for STATS 760.

Linear Models

Alan Lee

Sample presentation for STATS 760

Contents

• The problem

• Typical data

• Exploratory Analysis

• The Model

• Estimation and testing

• Diagnostics

• Software

• A Worked Example

The Problem

• To model the relationship between a continuous variable Y and several explanatory variables x1, …, xk.

• Given values of x1, …, xk, predict the value of Y.

Typical Data

• Data on 5000 motor vehicle insurance policies having at least one claim

• Variables are

– Y: log(amount of claim)

– x1: sex of policy holder

– x2: age of policy holder

– x3: age of car

– x4: car type (score from 1 to 20; 1 = Toyota Corolla, 20 = Porsche)

Exploratory Analysis

• Plot Y against other variables

• Scatterplot matrix

• Smooth as necessary
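A minimal R sketch of these exploratory steps. The data frame `claims` and its columns are simulated stand-ins (hypothetical names and values), not the insurance data used in the presentation.

```r
# Simulated stand-in data (hypothetical; not the real policies)
set.seed(1)
n <- 200
car_age    <- runif(n, 0, 20)
holder_age <- runif(n, 18, 80)
log_claim  <- 6 + 0.1 * car_age - 0.008 * car_age^2 +
              0.004 * holder_age + rnorm(n)
claims <- data.frame(log_claim, car_age, holder_age)

# Scatterplot matrix of all variables, with a smoother drawn in each panel
pairs(claims, panel = panel.smooth)

# Y against a single covariate, with a loess smooth superimposed
scatter.smooth(claims$car_age, claims$log_claim,
               xlab = "Car age", ylab = "log(claim)")
```

A curved smooth in a plot like this is one hint that a polynomial or transformed term may be needed in the model.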

[Figure: log claims vs car age]

The Model

• The relationship is modelled using the conditional distribution of Y given the covariates x1, …, xk.

• Assume the conditional distribution of Y is $N(\mu, \sigma^2)$, where $\mu$ depends on the covariates.

The Model (2)

• If all covariates are “continuous”, then

$$\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

• In addition, all Y’s are assumed independent.

Estimation and Testing

• Estimate the $\beta$’s

• Estimate the error variance $\sigma^2$

• Test whether the $\beta$’s are zero

• Check goodness-of-fit

Least Squares

Estimate the $\beta$’s by the values that minimize the sum of squares (least squares estimates, LSEs):

$$SS(b) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - \dots - b_k x_{ik})^2$$

The minimizing values are the solution of the normal equations. The minimum value is the residual sum of squares (RSS).

$\sigma^2$ is estimated by RSS/(n-k-1)
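As an illustration, the normal equations can be solved directly and compared with `lm`. This is a sketch on simulated data (the variables `x1`, `x2`, `y` and the true coefficients are made up for the example); `lm` itself uses a QR decomposition rather than this direct solve.

```r
# Simulated data with k = 2 covariates (hypothetical values)
set.seed(2)
n <- 100; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.7)

X <- cbind(1, x1, x2)                       # design matrix with intercept column
b <- solve(crossprod(X), crossprod(X, y))   # solves the normal equations (X'X) b = X'y

rss        <- sum((y - X %*% b)^2)          # minimum value of SS(b)
sigma2_hat <- rss / (n - k - 1)             # estimate of the error variance

fit <- lm(y ~ x1 + x2)
all.equal(as.vector(b), unname(coef(fit)))  # same estimates as lm
all.equal(sigma2_hat, summary(fit)$sigma^2) # same variance estimate as lm
```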

Goodness of Fit

• Goodness of fit is measured by $R^2$:

$$R^2 = 1 - \frac{\text{RSS}}{\text{Total SS}}, \qquad \text{Total SS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$0 \le R^2 \le 1$ (why?)

$R^2 = 1$ iff perfect fit (data all on a plane)
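The definition can be checked numerically against the value reported by `summary`. The data below are simulated purely for illustration.

```r
# Simulated data (hypothetical values)
set.seed(3)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

rss      <- sum(residuals(fit)^2)      # residual sum of squares
total_ss <- sum((y - mean(y))^2)       # total sum of squares
r2       <- 1 - rss / total_ss

all.equal(r2, summary(fit)$r.squared)  # agrees with lm's Multiple R-squared
```

One way to see the bounds: the intercept-only model is a special case of the full model and has RSS equal to the total SS, so the least squares RSS lies between 0 and the total SS.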

Prediction

• Y is predicted by

$$\hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k$$

where the hat indicates the LSE

• Standard errors: two kinds, one for the mean value of Y for a set of x’s, the other for an individual y for a particular set of x’s
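In R the two kinds of standard error correspond to the `interval` argument of `predict`. The sketch below uses simulated data and a made-up new point `new_x`.

```r
# Simulated data (hypothetical values)
set.seed(4)
n <- 60
x1 <- runif(n, 0, 10); x2 <- runif(n, 0, 5)
y  <- 1 + 0.8 * x1 - 0.3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

new_x <- data.frame(x1 = 5, x2 = 2)   # covariate values at which to predict

# Interval for the mean value of Y at these x's
conf_int <- predict(fit, newdata = new_x, interval = "confidence")
# Interval for an individual y at these x's (also allows for the error variance)
pred_int <- predict(fit, newdata = new_x, interval = "prediction")

conf_int
pred_int
```

Both calls return the same point prediction; the prediction interval is wider because an individual observation varies about its mean.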

Interpretation of Coefficients

• The LSE for variable xj is the amount we expect y to increase if xj is increased by a unit amount, assuming all the other x’s are held fixed

• The test of $\beta_j = 0$ asks whether variable xj makes no contribution to the fit, given that all other variables are in the model
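A small sketch of both points on simulated data (variable names and values are made up): moving x1 by one unit with x2 held fixed changes the prediction by exactly the estimated coefficient of x1, and the coefficient table carries the test for each coefficient.

```r
# Simulated data (hypothetical values)
set.seed(5)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

base    <- data.frame(x1 = 3, x2 = 1)
shifted <- data.frame(x1 = 4, x2 = 1)   # x1 increased by one unit, x2 held fixed
change  <- predict(fit, shifted) - predict(fit, base)
all.equal(unname(change), unname(coef(fit)["x1"]))

# Each row's t value and Pr(>|t|) test that coefficient = 0,
# given that the other variables are in the model
summary(fit)$coefficients
```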

Checking Assumptions (1)

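• Tools are residuals, fitted values and hat matrix diagonals

• Fitted values

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k$$

• Residuals

$$e = y - \hat{y}$$

• Hat matrix diagonals $h_{ii}$

$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \dots + h_{ii} y_i + \dots + h_{in} y_n$$

(Measure the effect of an observation on its fitted value)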
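These three tools can be computed from the design matrix and compared with R's extractor functions. A sketch on simulated data; forming the full hat matrix explicitly is done here only for illustration and would be wasteful for large n.

```r
# Simulated data (hypothetical values)
set.seed(6)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)                    # n x (k+1) design matrix
H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix: y_hat = H y

y_hat <- as.vector(H %*% y)               # fitted values
e     <- y - y_hat                        # residuals
h     <- unname(diag(H))                  # hat matrix diagonals h_ii

all.equal(y_hat, unname(fitted(fit)))
all.equal(e, unname(residuals(fit)))
all.equal(h, unname(hatvalues(fit)))
sum(h)                                    # trace of H: number of coefficients, k + 1
```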

Checking Assumptions (2)

Assumptions are

– Mean linear in the x’s (plot residuals v fitted values, partial residual plot, CERES plots)

– Constant variance (plot squared residuals v fitted values)

– Independence (time series plot, residuals v preceding)

– Normality/outliers (normal plot)
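A sketch of the basic plots listed above, drawn by hand in base R on simulated data (hypothetical values). Partial residual and CERES plots are not shown here.

```r
# Simulated data (hypothetical values)
set.seed(7)
n <- 150
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 - 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

r <- residuals(fit)
f <- fitted(fit)

op <- par(mfrow = c(2, 2))
plot(f, r, xlab = "Fitted values", ylab = "Residuals")   # linearity of the mean
abline(h = 0, lty = 2)
plot(f, r^2, xlab = "Fitted values",
     ylab = "Squared residuals")                         # constant variance
plot(r[-n], r[-1], xlab = "Preceding residual",
     ylab = "Residual")                                  # independence
qqnorm(r); qqline(r)                                     # normality / outliers
par(op)
```

The independence plot is only meaningful when the observations have a natural order, such as time.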

Remedial Action

• Transform variables

• Delete outliers

• Weighted least squares
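As one illustration of remedial action, weighted least squares is available through the `weights` argument of `lm`. The sketch assumes, purely for the example, that the error standard deviation is proportional to x, so the weights are taken as 1/x^2; in practice the weights have to be chosen from knowledge of the data.

```r
# Simulated data whose error sd grows with x (hypothetical values)
set.seed(8)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)

ols <- lm(y ~ x)              # ordinary least squares, ignores unequal variance
w   <- 1 / x^2                # assumed weights, proportional to 1/variance
wls <- lm(y ~ x, weights = w) # weighted least squares

# WLS solves the weighted normal equations (X'WX) b = X'Wy
X <- cbind(1, x)
b <- solve(crossprod(X, w * X), crossprod(X, w * y))
all.equal(as.vector(b), unname(coef(wls)))

rbind(ols = coef(ols), wls = coef(wls))
```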

Software

• SAS: PROC REG, PROC GLM

• S-Plus, R: lm

• Usage: lm(model formula, dataframe, weights, …)

Model Formula

• Assume k=3

• If x1, x2, x3 are all continuous, fit a plane:

Y ~ x1 + x2 + x3

• If x1 is categorical (e.g. gender) and x2, x3 are continuous, fit a different plane/curve in x2, x3 for each level of x1:

Y ~ x1 + x2 + x3 (planes parallel)

Y ~ x1 + x2 + x3 + x1:x2 + x1:x3 (planes different)
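The two formulas for a categorical x1 can be compared on simulated data. Everything below (the factor levels "F" and "M", the coefficients) is made up for illustration.

```r
# Simulated data: x1 categorical, x2 and x3 continuous (hypothetical values)
set.seed(9)
n  <- 120
x1 <- factor(sample(c("F", "M"), n, replace = TRUE))
x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * (x1 == "M") + x2 - x3 + 0.8 * (x1 == "M") * x2 + rnorm(n)

parallel  <- lm(y ~ x1 + x2 + x3)                  # same slopes, different intercepts
different <- lm(y ~ x1 + x2 + x3 + x1:x2 + x1:x3)  # slopes also differ by level

names(coef(parallel))
names(coef(different))

# F test of whether the interaction terms (different slopes) are needed
anova(parallel, different)
```

The parallel model adds one intercept shift per extra level of x1; the interaction terms add one slope shift per extra level for each of x2 and x3.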

Insurance Example (1)

> cars.lm<-lm(logad~poly(CARAGE,2)+PRIMAGEN+gender)
> summary(cars.lm)

Call:
lm(formula = logad ~ poly(CARAGE, 2) + PRIMAGEN + gender)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9713 -0.4610  0.2376  0.8092  3.9767

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       5.986329   0.077533  77.210  < 2e-16 ***
poly(CARAGE, 2)1 -7.308946   1.229095  -5.947 2.92e-09 ***
poly(CARAGE, 2)2 -8.038865   1.232416  -6.523 7.58e-11 ***
PRIMAGEN          0.004014   0.001339   2.999  0.00272 **
gender            0.015633   0.041474   0.377  0.70624
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.226 on 4995 degrees of freedom
Multiple R-Squared: 0.01611, Adjusted R-squared: 0.01532
F-statistic: 20.45 on 4 and 4995 DF, p-value: < 2.2e-16
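The summary figures in this output are tied together by standard formulas, which can be checked by hand. The sketch below uses only numbers copied from the output; small discrepancies are expected because the printed values are rounded.

```r
# Values copied from the printed summary
r2       <- 0.01611   # Multiple R-squared
k        <- 4         # number of coefficients excluding the intercept
df_resid <- 4995      # residual degrees of freedom, n - k - 1
n        <- df_resid + k + 1   # 5000 policies

f_stat <- (r2 / k) / ((1 - r2) / df_resid)    # overall F statistic
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid   # adjusted R-squared
t_int  <- 5.986329 / 0.077533                 # t value = estimate / std. error

round(c(F = f_stat, adj_R2 = adj_r2, t_intercept = t_int), 4)
```

Note that although several coefficients are highly significant, the fitted model accounts for under 2% of the variation in log claims.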

Insurance Example (2)

> plot(cars.lm)
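The plots from the insurance fit are not reproduced in this transcript. As a stand-in, the sketch below calls `plot` on a fit to simulated data (hypothetical values). In current versions of R this draws four diagnostic plots by default: residuals v fitted values, a normal Q-Q plot, a scale-location plot and residuals v leverage.

```r
# Simulated data (hypothetical values; not the insurance data)
set.seed(11)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

op <- par(mfrow = c(2, 2))   # show the four default plots on one page
plot(fit)
par(op)
```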