
Chapter 9 Regression Wisdom

*Subsets
*Extrapolation
*Outliers, Leverage, and Influence Points
*Lurking Variables


Subsets

• The data should be homogeneous (of the same or a similar kind or nature)

• If the data are made up of two or more groups that have been thrown together, it is usually best to fit a separate linear model to each group

• Residual plots can help find subsets in the data
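A minimal sketch of the subsets idea, using made-up numbers (not the cereal data from the slides). Two hypothetical groups, "A" and "B", are fitted once together and once separately. The helper `fit_line` and the group labels are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data as (x, y, group). Not taken from the slides.
data = [
    (1, 2.0, "A"), (2, 4.1, "A"), (3, 5.9, "A"), (4, 8.0, "A"),
    (1, 9.0, "B"), (2, 8.4, "B"), (3, 8.1, "B"), (4, 7.5, "B"),
]

# One line through everything hides the two different patterns.
pooled = fit_line([x for x, _, _ in data], [y for _, y, _ in data])

# A separate line for each group.
by_group = {}
for label in sorted({g for _, _, g in data}):
    xs = [x for x, _, g in data if g == label]
    ys = [y for _, y, g in data if g == label]
    by_group[label] = fit_line(xs, ys)

print("pooled slope:", round(pooled[0], 2))
for label, (slope, intercept) in by_group.items():
    print(label, "slope:", round(slope, 2), "intercept:", round(intercept, 2))
```

In this toy data the pooled slope sits between a rising group and a falling group, so it describes neither of them well.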


Cereal – without subgroups


Cereal – with subgroups


Extrapolation

• Although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x far from the ones used to find the linear model

• Such extrapolation may pretend to see into the future, but the predictions should not be trusted

• Example: data on the number of women in elected positions in Massachusetts were collected from 1945 – 2000. We should NOT use the model to predict how many women will hold office in 2015
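One way to make the warning concrete is to have a prediction helper report when x falls outside the range used to fit the line. This is only a sketch: the year range comes from the example above, but the slope and intercept are placeholders (the slides do not give the fitted model), and flagging every x outside the observed range is a deliberately cautious rule.

```python
def predict(slope, intercept, x, x_min, x_max):
    """Return (prediction, is_extrapolation) for a fitted line.

    x_min and x_max are the smallest and largest x-values used to fit the line.
    """
    is_extrapolation = x < x_min or x > x_max
    return intercept + slope * x, is_extrapolation

# Year range from the example; SLOPE and INTERCEPT are placeholders, not fitted values.
X_MIN, X_MAX = 1945, 2000
SLOPE, INTERCEPT = 0.5, -970.0

for year in (1990, 2015):
    value, flagged = predict(SLOPE, INTERCEPT, year, X_MIN, X_MAX)
    note = "outside the data, do not trust" if flagged else "within the data"
    print(year, round(value, 1), note)
```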


Homework

• a: 1900 – 1940 there is a linear pattern; 1940 – 1970 the data curve upward; 1970 – 2000 there is a strong linear pattern

• b: relatively strong from 1970 – 2000

• c: no, not for the whole graph. If we look only at 1970 – 2000, then yes, there would be a high correlation

• d: no. It's not straight enough


Homework

• a: plot the data from 1955 – 1995. The scatterplot has a slight curve. Check the residual plot! The residual plot has a pattern, so this is not a good place to use a linear model. If you did find an equation, the predicted value would be 25.3 years.

• b: not too much. The data are not straight enough to use a linear regression.

• c: 50 years is too far from the data to make a prediction


Homework

• a: knowing only the R² value is not enough to justify using a linear regression. We need to check a residual plot and the 3 conditions (straight enough, quantitative variables, and outliers)

• b: no, a linear model might not even fit
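A small sketch of why R² alone is not enough. The data here are made up and deliberately curved (y = x²). A straight line still gets a high R², but the residuals are positive at both ends and negative in the middle, which is the kind of pattern a residual plot would reveal. The helper `fit_line` is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Made-up, deliberately curved data.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# R^2 = 1 - (residual variation) / (total variation)
y_bar = sum(ys) / len(ys)
ss_total = sum((y - y_bar) ** 2 for y in ys)
ss_resid = sum(e ** 2 for e in residuals)
r_squared = 1 - ss_resid / ss_total

print("R^2:", round(r_squared, 3))
print("residuals:", [round(e, 1) for e in residuals])
```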


Homework

• a: for every degree the temp rises the cost will go down $2.13

• b: The cost when the temp is 0°F

• c: Too high, the residual is negative showing the model overestimates the cost.

• d: cost = $111.70

• e: actual = $106.70

• f: No, the residual plot has a curve to it. The data are probably not linear

• g: no, there would be no change. The relationship does not depend on the units


Outlier

• Any data point that stands away from the others

• In regression, outliers can be extraordinary in two ways
– having a large residual
– having high leverage


Remember

• Linear models do not fit values with large residuals well

• Large residuals always need a second look


Leverage

• Data points whose x-values are far from the mean of x are said to exert leverage on a linear model.

• High leverage points pull the line close to them
– large effect on the line
– can completely determine the slope and the y-intercept
– with a high enough leverage their residuals can be deceptively small
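The slides describe leverage only in words. For a line with one predictor, the standard formula is h = 1/n + (x − x̄)² / Σ(x − x̄)², which grows as x moves away from the mean of x. The sketch below applies it to made-up x-values; the function name `leverages` is an assumption for illustration.

```python
def leverages(xs):
    """Leverage of each point in a simple (one-predictor) linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

# Made-up x-values; the last one is far from the mean of x.
xs = [1, 2, 3, 4, 5, 20]
hs = leverages(xs)

for x, h in zip(xs, hs):
    print(x, round(h, 3))
```

Only the x-values enter the calculation, which matches the definition above: leverage says how much a point could pull the line, not whether it actually does.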


Leverage Points

• A linear regression line goes through the point (x̄, ȳ)
– Think of this point as the fulcrum of a lever
– The farther away a point is from the fulcrum, the more leverage it has

• A high leverage point has the potential to change the regression line


High Leverage Points

• How to decide if the point will change the regression model
– Find the regression model with and without the leverage point

• The point is influential if there is a big change in the model


Influence

• Depends on both leverage and residual
– high leverage point whose y value is on the line: the point is NOT influential
– moderate leverage point with a very high residual: the point is influential

• YOU HAVE TO CHECK THE MODELS!!
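Checking the models can be as simple as fitting the line twice. The sketch below uses made-up data that follow roughly y = 2x and compares the slope with and without one extra point, for the two cases listed above. The helper names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def slopes_with_and_without(xs, ys, point):
    """Return (slope with the extra point, slope without it)."""
    px, py = point
    with_point = fit_line(xs + [px], ys + [py])[0]
    without_point = fit_line(xs, ys)[0]
    return with_point, without_point

# Made-up data that follow roughly y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]

# Far from the mean of x, but sitting on the existing line.
on_line = slopes_with_and_without(xs, ys, (20, 40))
# Closer to the other x-values, but far below the line.
off_line = slopes_with_and_without(xs, ys, (6, 2))

print("high leverage, on the line:", [round(s, 2) for s in on_line])
print("moderate leverage, large residual:", [round(s, 2) for s in off_line])
```

In this toy data the first point barely moves the slope, while the second changes it a great deal.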


Unusual Points

• Unusual points can sometimes tell us more about a model or data than any other point

• A model dominated by a single point is unlikely to help us understand the rest of the data

• Comparing that one point against the rest of the data is the best way to understand the point


Warning!

• Do NOT throw away points!
– Take out unusual points to look at the model without them
– Throwing them away can give us a false sense of how accurate the model is

• Look for the unusual points in the scatterplot
– they can hide in the residual plots


Checking In

• Each of these scatterplots shows an unusual point. For each, tell whether the point is a high leverage point, would have a large residual, or is influential.


Causation

• No matter how strong the association…

• No matter how large the R² value…

• No matter how straight the line is…

you can NOT conclude from the regression alone that one variable CAUSES another


Lurking Variable

• Only for observational data
– as opposed to data from a designed experiment

• We cannot be sure that a lurking variable is not the cause of a strong or weak association
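A rough sketch of how a lurking variable can create a strong association. All numbers are made up: a hypothetical third variable `z` drives both `x` and `y`, which have no direct link to each other. The code compares the raw correlation of `x` and `y` with the correlation of what is left after the part explained by `z` is removed from each. This is one possible check under these assumptions, not a method taken from the slides.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def residuals(xs, ys):
    """Residuals from the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical lurking variable z and small made-up wiggles.
z = [1, 2, 3, 4, 5, 6, 7, 8]
wiggle_x = [0.3, -0.3] * 4
wiggle_y = [0.5, 0.5, -0.5, -0.5] * 2

# x and y each depend on z, but not on each other.
x = [zi + w for zi, w in zip(z, wiggle_x)]
y = [50 + 3 * zi + w for zi, w in zip(z, wiggle_y)]

r_raw = correlation(x, y)
r_after_z = correlation(residuals(z, x), residuals(z, y))

print("r for x and y:", round(r_raw, 3))
print("r after removing z:", round(r_after_z, 3))
```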


Life Expectancy

• The relationship between life expectancy (years) and availability of doctors (measured as √(doctors/person)) for the countries of the world


Life Expectancy

• The relationship between life expectancy (years) and the availability of TVs (measured as √(TVs/person)) for the countries of the world


Warning!!

• Summarized Data: can give a false sense of how good an association is

Weight (lb) against height (in) for a sample of men. R² = 41.5%

Mean weight (lb) against height (in). R² = 80.1%

Means vary less than individual values
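The effect of summarizing can be sketched with made-up numbers (these are not the men's heights and weights from the slide). Each hypothetical height has three weights spread widely around a common trend. Fitting the individual values gives a low R², while fitting the group means gives a much higher one, because averaging removes the scatter within each height.

```python
def r_squared(xs, ys):
    """R^2 for the least-squares line of ys on xs (the squared correlation)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Made-up weights (lb) at three heights (in).
weights_by_height = {
    66: [120, 150, 180],
    68: [130, 160, 190],
    70: [140, 170, 200],
}

# Individual values: one (height, weight) pair per person.
heights = [h for h, ws in weights_by_height.items() for _ in ws]
weights = [w for ws in weights_by_height.values() for w in ws]
r2_individuals = r_squared(heights, weights)

# Summarized values: one mean weight per height.
mean_heights = list(weights_by_height)
mean_weights = [sum(ws) / len(ws) for ws in weights_by_height.values()]
r2_means = r_squared(mean_heights, mean_weights)

print("R^2, individuals:", round(r2_individuals, 3))
print("R^2, means:", round(r2_means, 3))
```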


Homework # 10

a. slope = -.1: for every 1 mph increase in speed, your mpg decreases by .1

b. y-int = 32: the y-int would be your mpg at 0 mph.

c. the residuals are negative, so the model is overestimating mpg

d. 27 mpg

e. predicted = 27.5 mpg; adding the residual of 1 gives an actual value of 28.5 mpg

f. strong but not linear

g. no. the residual plot shows the data is not linear
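The arithmetic in parts d and e can be sketched from the slope and y-intercept given in parts a and b. The original questions are not shown, so the speed of 50 mph is an assumption, chosen because it reproduces the 27 mpg answer. The relationship actual = predicted + residual follows from the definition of a residual (actual minus predicted).

```python
SLOPE = -0.1      # mpg per mph, from part a
INTERCEPT = 32.0  # mpg at 0 mph, from part b

def predicted_mpg(mph):
    """Predicted fuel efficiency from the linear model."""
    return INTERCEPT + SLOPE * mph

def actual_from_residual(predicted, residual):
    """residual = actual - predicted, so actual = predicted + residual."""
    return predicted + residual

# 50 mph is an assumed speed that matches the answer to part d.
part_d = predicted_mpg(50)
# Part e: predicted value and residual as given in the answer.
part_e = actual_from_residual(27.5, 1)

print("part d:", round(part_d, 1), "mpg")
print("part e:", round(part_e, 1), "mpg")
```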


Homework # 11 a

1. high leverage, low residual

2. no, not influential to the slope

3. correlation would decrease

4. the slope would stay about the same because the point is on the line


Homework # 11 b

1. high leverage, small residual (remember the point is pulling the line towards it)

2. yes, influential

3. correlation would weaken and become less negative

4. the slope would increase toward 0


Homework # 11 c

1. some leverage, high residual

2. slightly influential

3. correlation would increase because scatter would decrease

4. slope would increase


Homework # 11 d

1. low/no leverage, high residual

2. not influential

3. correlation would become stronger

4. slope would increase


Homework # 15

a) stronger, the point has high leverage and is influential so it's pulling the line toward it. slope and correlation would both increase

b) you could take the humans out. Now your data is for non-human mammals.

c) moderately strong

d) for every additional year an animal is expected to live, it spends about 15.5 more days in its mother before being born

e) 270.4 days


Homework # 16

a) hippos: removing hippos would make the association stronger because that point is farther from the pattern

b) increase

c) no, there must be a good reason to take out points

d) yes, the slope changed from 15.5 to 11.6. That is a big difference


Homework # 19

No! There is a high leverage point

with point:

without point:

There is a large change in R² and the slope


Homework #20

a) only 7% of the variation in time is accounted for by the regression on year

b) we can’t say with such a bad regression

c) probably not, the point doesn’t have much leverage

d) 15.9% is better, it appears that swimmers are taking 14 minutes off their time each year


Homework # 22

2 subgroups:

1965 – 1985; linear and positive

1994 – 1998; linear and flat (horizontal)


Homework # 23

a) the graph is clearly nonlinear; however, from about 1972 on there appears to be a positive linear relationship

b) In 2010 CPI = $218.60


Homework # 24

a) not including Costa Rica, the data have a strong negative linear association

b) Costa Rica has 25 babies/woman. It has to be a mistake, because it is impossible

c) r = -.814 and R² = 66.4% without Costa Rica

d) w/Costa Rica w/out C.R.

e) the model with C.R. is not appropriate; the residual plot has some pattern. Without C.R. the residual plot has an even amount of scatter with no pattern

f) slope: the life expectancy goes down 4.36 years for every baby a woman has. The y-intercept says a woman with no children should live to be 86.8 years old.

g) there could be a lurking variable also affecting life expectancy