Chapter 14: Inference for Regression. A brief review of chapter 4... (Regression Analysis: Exploring...

Post on 21-Jan-2016

Chapter 14: Inference for Regression

A brief review of chapter 4... (Regression Analysis: Exploring Association Between Variables)

Bi-variate data - relationships between 2 numeric, quantitative variables measured on same individual

Each individual appears as a point (x, y) on the scatter plot

Explanatory variable; response variable

Scatterplot; label & scale; look for overall patterns (DOFS)

Measuring Linear Association: Correlation or “r”

Correlation (r) measures direction and strength of a linear relationship between two quantitative variables

Correlation (r) is always between -1 and 1; makes no sense to have r = -13 or r = 27

Correlation (r) is not resistant (look at formula; based on mean)

Correlation is for scatter plots (not LSRL)

r is in standard units, so r doesn’t change if units are changed

If we change from yards to feet, r is not affected
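A minimal sketch of the unit-change claim, using the body weight and backpack weight data from the example later in this chapter. The function name `correlation` and the pounds-to-kilograms conversion are illustrative choices, not part of the original slides.

```python
import math

# Data from the backpack example later in this chapter (pounds).
body_weight = [120, 187, 109, 103, 131, 158, 116]
backpack_weight = [26, 30, 26, 24, 29, 31, 28]

def correlation(xs, ys):
    """Pearson correlation r computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_pounds = correlation(body_weight, backpack_weight)

# Change units: convert body weight from pounds to kilograms.
body_weight_kg = [w * 0.45359237 for w in body_weight]
r_kilograms = correlation(body_weight_kg, backpack_weight)

# The two values agree (up to floating-point rounding).
print(round(r_pounds, 3), round(r_kilograms, 3))
```

Because r is built from standardized deviations, multiplying every x by a positive constant cancels out of the ratio.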

Measuring Linear Association: Correlation or “r”

r ≈ 0: no strong linear relationship

r close to 1: strong positive linear relationship

r close to -1: strong negative linear relationship

One better ...Least Squares Regression Line (LSRL)

Least Squares Regression (predicts values)

LSRL Model: ŷ = a + bx

ŷ is predicted value of response variable

a is y-intercept of LSRL

b is slope of LSRL; slope is predicted (expected) rate of change

x is explanatory variable
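As a rough illustration of the model ŷ = a + bx, the sketch below computes a and b with the usual least-squares formulas (b = Sxy / Sxx, a = ȳ - b·x̄) on the backpack data used later in this chapter. The helper name `least_squares_line` and the 140 lb prediction point are assumptions made for the example.

```python
# Data from the backpack example later in this chapter (pounds).
body_weight = [120, 187, 109, 103, 131, 158, 116]      # explanatory variable x
backpack_weight = [26, 30, 26, 24, 29, 31, 28]         # response variable y

def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx             # slope: predicted change in y per unit change in x
    a = y_bar - b * x_bar     # y-intercept: the line passes through (x-bar, y-bar)
    return a, b

a, b = least_squares_line(body_weight, backpack_weight)

# Predict for a 140 lb student; 140 lies inside the observed range (103 to 187),
# so this is not extrapolation.
predicted = a + b * 140

print(f"y-hat = {a:.3f} + {b:.4f} x")
print(f"predicted backpack weight at 140 lb: {predicted:.2f}")
```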

Least Squares Regression (predicts values)

May be asked to interpret slope of LSRL & y-intercept, in context

Caution: Interpret slope of LSRL as the predicted or average change or expected change in the response variable given a unit change in the explanatory variable

NOT change in y for a unit change in x; LSRL is a model; models are not perfect

Extrapolation... What is it again??

Extrapolation! Don’t do it… ever.

Example: Growth data from children from age 1 month to age 12 years … LSRL

What is the predicted height of a 40-year old?

Outliers & Influential Points

All influential points are outliers, but not all outliers are influential points. Influential points/observations: If removed would significantly change LSRL (slope and/or y-intercept)

Coefficient of Determination; r²

r² tells us how well our LSRL describes our data; how well does this linear model fit the data

r² is always between 0 and 1; 0 ≤ r² ≤ 1

r², “fraction of the variation of the values of y that are explained by LSRL”

VERSUS r, correlation, -1 ≤ r ≤ 1; describes direction and strength of the linear relationship in a scatter plot
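One way to see the "fraction of variation explained" reading is to compute r² as 1 - SSE/SST and compare it with the square of the correlation. This is a sketch on the backpack data from later in the chapter; the variable names are illustrative.

```python
import math

# Data from the backpack example later in this chapter (pounds).
body_weight = [120, 187, 109, 103, 131, 158, 116]
backpack_weight = [26, 30, 26, 24, 29, 31, 28]

n = len(body_weight)
x_bar = sum(body_weight) / n
y_bar = sum(backpack_weight) / n

sxx = sum((x - x_bar) ** 2 for x in body_weight)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(body_weight, backpack_weight))

# Least-squares slope and intercept.
b = sxy / sxx
a = y_bar - b * x_bar

# SSE: variation left over after using the line; SST: total variation in y.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(body_weight, backpack_weight))
sst = sum((y - y_bar) ** 2 for y in backpack_weight)

r_squared = 1 - sse / sst            # fraction of variation explained by the LSRL
r = sxy / math.sqrt(sxx * sst)       # correlation

# The two values agree (up to floating-point rounding).
print(round(r_squared, 3), round(r ** 2, 3))
```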

Chapter 14: Inference for Regression

We are now going to take all of that previous knowledge about bi-variate data and apply it to inference (forming judgments about population parameters on the basis of random sampling; a statistic)

Remember, ŷ = a + bx is just an estimate, a predictor, a statistic (like p̂ or x̄), based on a sample

Statistics vary from sample to sample

SRS BMW Cars (age & price)

What about another SRS of n = 7? Would data/points possibly be different?

So then would LSRL be different?

What about another SRS?

Data varies from sample to sample

Do we know the true population parameter? Do we have info on ALL BMWs?

SRS BMW Cars (age & price)

So this LSRL is just based on THESE 7 pieces of data

We don’t know the true, unknown population regression line, y = β0 + β1x

But we can estimate the true, unknown regression line using a confidence interval... OR ...we can test a claim using a hypothesis test

Let’s talk about conditions...

We need to be aware/check conditions before we perform inference (confidence intervals, hypothesis testing) with any situation (means, proportions, linear regression, one-sample, two-sample, Chi-Square, etc.)

If conditions are not met, our inference may be very inaccurate; worthless information

Conditions for Linear Regression Inference

1. Linearity: trend is linear (Use Residuals Plot to Check)

2. Normality: errors follow a Normal distribution with a mean of zero; N(0, σ) (Use QQ Plot/Normal Probability Plot to Check)

3. Constant standard deviation: the standard deviation σ must be the same for all values of the predictor variable (Use Residuals Plot to Check)

4. Independence: Errors must be independent of one another (review raw data and collection process)

Residuals ... we look at these to determine if conditions 1 & 3 are met

Least Squares Regression Line is not perfect, but it’s the best model we have

All points on the scatter plot don’t fit perfectly on the LSRL; very common

Vertical distances from each point to the LSRL are called “residuals,” or left-overs

Residual = observed y value – predicted y value

LSRL is the line that creates the least “left-overs”: it minimizes the sum of the squared residuals
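A short sketch of residuals in practice, using the backpack data from later in the chapter. It computes each residual as observed minus predicted, then compares the sum of squared residuals for the LSRL against one arbitrarily chosen alternative line. The helper names and the alternative slope (b + 0.01) are assumptions for illustration.

```python
# Data from the backpack example later in this chapter (pounds).
body_weight = [120, 187, 109, 103, 131, 158, 116]
backpack_weight = [26, 30, 26, 24, 29, 31, 28]

def fit_line(xs, ys):
    """Least-squares intercept a and slope b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of (observed y - predicted y)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = fit_line(body_weight, backpack_weight)

# Residual = observed y - predicted y, one per data point.
residuals = [y - (a + b * x) for x, y in zip(body_weight, backpack_weight)]

sse_lsrl = sum_squared_residuals(a, b, body_weight, backpack_weight)
# Any other line, here one with a slightly steeper slope, leaves more left over.
sse_other = sum_squared_residuals(a, b + 0.01, body_weight, backpack_weight)

print([round(e, 2) for e in residuals])
print(sse_lsrl < sse_other)
```

The residuals from the LSRL always sum to zero (up to rounding), which is why a residuals plot is centered on the horizontal line at 0.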

Graphical Tool: Residuals Plot

We plot the residuals (left overs, points on scatterplot that are above or below LSRL) to determine if a line is the best model to describe our scatterplot of bivariate data

Perhaps a line isn’t the best model. Maybe a quadratic curve, a log curve, or a square root function is a better model for the data

Residuals Plot (truck example)

On left is scatter plot & LSRL; on right is residuals plot

Graphical Tool: Residuals Plot

To check the linearity and the constant standard deviation conditions, look at the residuals plot: it should show no obvious pattern (random, unstructured)

In the below case, both conditions are met

Residuals Plot

If there is an obvious pattern, conditions 1 & 3 are not met

Condition #2: Normality...

Errors must follow a Normal distribution

Can examine a Normal Probability Plot (NPP) (or a QQ Plot) of the residuals (left-overs)

If NPP is fairly linear, then condition #2 is satisfied

NPP that shows that errors do not follow a Normal distribution

Condition #4... Independence

Errors must be independent of one another

Examine the collection method of the data if possible

In most cases, we must assume independence unless and until we discover otherwise

Equation...for LSRL (sample statistic)

Sample statistic: ŷ = a + bx where

x is the value of the explanatory variable

b is estimated slope (sample statistic)

a is estimated y-intercept (sample statistic)

ŷ is the estimated value of the response variable (sample statistic)

Equation... for true, unknown population parameter line

Population parameter: y = β0 + β1x

x is the value of the explanatory variable

β1 is the true, actual (but unknown) population slope

β0 is the true, actual (but unknown) population y-intercept

y is the true, actual (but unknown) mean value of the response variable in the population for a given x

Hypothesis testing...

The majority of the time, we are most interested in performing a hypothesis test on the slope (not the y-intercept)

Ho: Slope = 0

(OR β1 = 0 OR there is no linear association between two variables OR correlation = 0)

Ha: Slope ≠ 0 (> or <)

(OR β1 ≠ 0 OR there is a linear association between the two variables OR correlation ≠ 0) ... or > or <

Hypothesis testing...

Same 4 steps:

State null and alternative hypotheses

Check conditions

Do calculations

Interpret results in context

Random sample of 9th grade students...

... going on their annual backpacking trip each fall

Is there a linear relationship between body weight and backpack weight?

www.whfreeman.com/tps5e

Body Weight (lbs) vs. Backpack Weight (lbs)

Body Weight 120 187 109 103 131 158 116

Backpack Weight 26 30 26 24 29 31 28

Ho: No linear relationship between body weight & backpack weight (or β1 = 0)

Ha: There is a linear relationship between body weight & backpack weight (or β1 ≠ 0)

Conditions: Assume all conditions have been checked & met.

Calculations: Enter data into Minitab (one column for body weight & another for backpack weight); then go to regression, simple regression. Be careful about which variable is the response & which is the predictor (they are easy to enter backwards). Choose linear, 95% confidence.

Interpretation: Decision, α level, p-value, context.
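The slides run this test in Minitab. As a rough cross-check without Minitab, the sketch below computes the test statistic t = b / SE_b with df = n - 2 for the seven pairs listed above, then approximates the two-sided p-value by numerically integrating the t density with Simpson's rule. The numerical integration is a stand-in chosen to keep the example dependency-free; it is not how Minitab computes the p-value, and the function names are illustrative.

```python
import math

body_weight = [120, 187, 109, 103, 131, 158, 116]   # predictor x
backpack_weight = [26, 30, 26, 24, 29, 31, 28]      # response y

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate P(|T| >= |t_stat|) using Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                   # P(0 < T < |t_stat|)
    return 2 * (0.5 - area)

n = len(body_weight)
x_bar = sum(body_weight) / n
y_bar = sum(backpack_weight) / n
sxx = sum((x - x_bar) ** 2 for x in body_weight)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(body_weight, backpack_weight))

b = sxy / sxx                              # sample slope
a = y_bar - b * x_bar                      # sample y-intercept

# Standard error of the slope: s / sqrt(Sxx), where s^2 = SSE / (n - 2).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(body_weight, backpack_weight))
df = n - 2
s = math.sqrt(sse / df)
se_b = s / math.sqrt(sxx)

t_stat = b / se_b                          # tests Ho: beta1 = 0
p_value = two_sided_p_value(t_stat, df)

print(f"b = {b:.4f}, SE_b = {se_b:.4f}, t = {t_stat:.2f}, df = {df}")
print(f"two-sided p-value is about {p_value:.3f}")
```

The decision then follows the usual rule: reject Ho when the p-value is below the chosen α level, and state the conclusion in context.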

Body Weight (lbs) vs. Backpack Weight (lbs)

Body Weight 120 187 109 103 131 158 116

Backpack Weight 26 30 26 24 29 31 28

Construct a confidence interval at the 95% level.

Conditions: Assume all conditions have been checked & met.

Calculations: Enter data into Minitab (one column for body weight & another for backpack weight); then go to regression, simple regression. Be careful about which variable is the response & which is the predictor (they are easy to enter backwards). Choose linear, 95% confidence.

Interpretation: We are 95% confident that the true, unknown population parameter, the true slope, β1, is between ...
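For readers without Minitab, a confidence interval for the slope has the form b ± t* · SE_b with df = n - 2. The sketch below applies that formula to the seven pairs listed on this slide. The critical value t* = 2.571 (95% confidence, df = 5) is taken from a standard t table and hard-coded as an assumption; results may differ slightly from software output because of rounding.

```python
import math

body_weight = [120, 187, 109, 103, 131, 158, 116]   # predictor x
backpack_weight = [26, 30, 26, 24, 29, 31, 28]      # response y

# Assumption: t* for 95% confidence with df = n - 2 = 5, read from a t table.
T_STAR = 2.571

n = len(body_weight)
x_bar = sum(body_weight) / n
y_bar = sum(backpack_weight) / n
sxx = sum((x - x_bar) ** 2 for x in body_weight)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(body_weight, backpack_weight))

b = sxy / sxx                      # sample slope
a = y_bar - b * x_bar              # sample y-intercept

# Standard error of the slope: s / sqrt(Sxx), where s^2 = SSE / (n - 2).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(body_weight, backpack_weight))
s = math.sqrt(sse / (n - 2))
se_b = s / math.sqrt(sxx)

margin = T_STAR * se_b
lower, upper = b - margin, b + margin

print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

If the interval does not contain 0, it agrees with rejecting Ho: β1 = 0 in a two-sided test at the 5% level.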

Body Weight (lbs) vs. Backpack Weight (lbs)

Body Weight 120 187 109 103 131 158 116

Backpack Weight 26 30 26 24 29 31 28

Do customers who stay longer at buffets give larger/smaller tips?

Time (minutes)    Tip ($)
23                5.00
39                2.75
44                7.75
55                5.00
61                7.00
65                8.88
67                9.01
70                5.00
74                7.29
85                7.50
90                6.00
99                6.50

Do customers who stay longer at buffets give larger/smaller tips?

A statistics student investigated this question as part of her project. She obtained an SRS of receipts that included this information.

Do these data provide convincing evidence that customers who stay longer tip differently than customers who stay for shorter periods of time?

Ho: β1 = 0 (no linear relationship between time spent and tip amount)

Ha: β1 ≠ 0 (there is a linear relationship between time spent and tip amount)

Do customers who stay longer at buffets give larger/smaller tips?

Ho: β1 = 0 (no linear relationship between time spent and tip amount)

Ha: β1 ≠ 0 (there is a linear relationship between time spent and tip amount)

Conditions: Assume all conditions have been checked and met.

Calculations: Enter data into Minitab and run calculations.

Interpretation: Decision, α level, p-value, context.
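As a hedged alternative to the Minitab run, the sketch below computes the slope test statistic for the twelve receipts and compares it with a table critical value rather than a p-value. The value 2.228 (two-sided, α = 0.05, df = 10) comes from a standard t table and is hard-coded as an assumption; the choice of α = 0.05 is also an assumption, since the slides do not fix an α level.

```python
import math

# Buffet data from the table above: time at the buffet (minutes) and tip (dollars).
time_minutes = [23, 39, 44, 55, 61, 65, 67, 70, 74, 85, 90, 99]
tip_dollars = [5.00, 2.75, 7.75, 5.00, 7.00, 8.88, 9.01, 5.00, 7.29, 7.50, 6.00, 6.50]

# Assumption: two-sided critical value for alpha = 0.05 with df = 12 - 2 = 10.
T_CRITICAL = 2.228

n = len(time_minutes)
x_bar = sum(time_minutes) / n
y_bar = sum(tip_dollars) / n
sxx = sum((x - x_bar) ** 2 for x in time_minutes)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(time_minutes, tip_dollars))

b = sxy / sxx                      # sample slope (dollars per extra minute)
a = y_bar - b * x_bar              # sample y-intercept

# Standard error of the slope: s / sqrt(Sxx), where s^2 = SSE / (n - 2).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(time_minutes, tip_dollars))
s = math.sqrt(sse / (n - 2))
se_b = s / math.sqrt(sxx)

t_stat = b / se_b                  # tests Ho: beta1 = 0
reject_null = abs(t_stat) > T_CRITICAL

print(f"b = {b:.4f}, SE_b = {se_b:.4f}, t = {t_stat:.2f}, df = {n - 2}")
print("reject Ho" if reject_null else "fail to reject Ho")
```

Comparing |t| with the critical value leads to the same decision as comparing the p-value with α; the p-value from software is still needed for the written interpretation.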

Homework...

Homework

Section Quiz

Our next test ...