Chapter 4

03/27/22 1 Chapter 4 Regression

description

Chapter 4. Regression. Like correlation, regression addresses linear relationships between quantitative variables X & Y. Objective of correlation: quantify direction and strength of linear association.

Transcript of Chapter 4

Page 1: Chapter 4

04/19/23 1

Chapter 4

Regression

Page 2: Chapter 4


Regression

• Like correlation, regression addresses linear relationships between quantitative variables X & Y

• Objective of correlation: quantify the direction and strength of linear association

• Objective of regression: derive the best-fitting line that describes the association

• We are especially interested in the slope of the line

Page 3: Chapter 4

Country        Per Capita GDP, X ($1K)   Life Expectancy, Y (years)
Austria        21.4                      77.48
Belgium        23.2                      77.53
Finland        20.0                      77.32
France         22.7                      78.63
Germany        20.8                      77.17
Ireland        18.6                      76.39
Italy          21.5                      78.51
Netherlands    22.0                      78.15
Switzerland    23.8                      78.99
UK             21.2                      77.37

Same illustrative data as Ch 3

Enter data into calculator

Page 4: Chapter 4

Algebraic equation for a line

• y = a + b∙X, where

• b ≡ slope ≡ change in Y per unit X

• a ≡ intercept ≡ value of Y when X = 0

Page 5: Chapter 4


Statistical Equation for a Line

ŷ = a + b∙X

where:

ŷ ≡ predicted average of Y at a given level of X

a ≡ intercept

b ≡ slope

a and b are called regression coefficients

Page 6: Chapter 4


How do we find the equation for the best fitting line through the scatter cloud?

[Figure: scatterplot of life expectancy (yrs) against per capita GDP]

Ans: We use the “least squares method”

Page 7: Chapter 4


Slope

$b = r \cdot \frac{s_y}{s_x}$

Intercept

$a = \bar{y} - b\bar{x}$

These formulas derive the coefficients for the least squares regression line
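For reference, the "least squares" criterion behind the slope and intercept formulas can be written out as follows. This is a brief sketch rather than part of the original slides; the symbols S(a, b), x_i, y_i, and n (the number of data points) are notation introduced here.

```latex
% Sum of squared vertical distances from the data points to the line
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2

% Setting \partial S / \partial a = 0 and \partial S / \partial b = 0
% and solving gives the coefficients that minimize S:
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = r \, \frac{s_y}{s_x},
\qquad
a = \bar{y} - b \bar{x}
```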

Page 8: Chapter 4


Illustrative Example (GDP & Life Expectancy)

Statistics for illustrative data (calculated with TI-30XIIS):

$\bar{x} = 21.52$, $s_x = 1.532$, $\bar{y} = 77.754$, $s_y = 0.795$, $r = 0.809$

Calculation of regression coefficients by hand:

$b = r \cdot \frac{s_y}{s_x} = 0.809 \cdot \frac{0.795}{1.532} = 0.420$

$a = \bar{y} - b\bar{x} = 77.754 - (0.420)(21.52) = 68.716$
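The hand calculation can be cross-checked in a few lines of Python. This is a minimal sketch using only the standard library; the function names `pearson_r` and `least_squares` are made up for this illustration, and the data are the ten countries from the table earlier in the chapter.

```python
from statistics import mean, stdev

# Illustrative data from the chapter: per capita GDP (X) and life expectancy (Y)
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

def pearson_r(x, y):
    """Correlation coefficient r, using sample standard deviations."""
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((len(x) - 1) * stdev(x) * stdev(y))

def least_squares(x, y):
    """Return (a, b): intercept and slope of the least squares line."""
    b = pearson_r(x, y) * stdev(y) / stdev(x)  # slope: b = r * s_y / s_x
    a = mean(y) - b * mean(x)                  # intercept: a = y-bar - b * x-bar
    return a, b

a, b = least_squares(gdp, life)
print(f"r = {pearson_r(gdp, life):.3f}, b = {b:.3f}, a = {a:.3f}")
```

The slope comes out at about 0.420 and the intercept at about 68.715. The small difference from the hand-calculated 68.716 arises because the hand calculation rounds b to 0.420 before computing a.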

Page 9: Chapter 4


“Least Squares” Regression Coefficients via TI-30XIIS

BEWARE! The TI-30XIIS mislabels the slope & intercept. The slope is mislabeled as a and the intercept is mislabeled as b.

It should be the other way around!

STAT > 2-VAR > DATA > STATVAR

Page 10: Chapter 4


Interpretation of Slope (GDP & Life Expectancy)

ŷ = 68.7 + 0.42∙X

b = increase in Y per 1-unit increase in X = 0.42 years

Each ↑$1K in GDP is associated with a 0.42-year increase in life expectancy

Page 11: Chapter 4


Interpretation of Intercept

• Mathematically = the predicted value of Y when X = 0

• In the real world = has no interpretation unless a value of X = 0 is plausible

Page 12: Chapter 4


Regression Line for Prediction

• Example: What is the predicted life expectancy of a country with a GDP of 20?

• ŷ at X = 20: ŷ = 68.7 + (0.42)X = 68.7 + (0.42)(20) = 77.1 (77.12 if the intercept is carried as 68.716)

• The regression line will always go through (x-bar, y-bar), which in this case is (21.5, 77.8)

• To draw the regression line, connect any two points on the line

Case Study (Life Expectancy)

[Figure: life expectancy (yrs) plotted against per capita GDP]
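Prediction from the fitted line is a one-line function. The sketch below is illustrative only; the name `predict` and its default arguments are assumptions made here, with the coefficients taken from the chapter's example.

```python
def predict(x, a=68.7, b=0.42):
    """Predicted life expectancy (years) at per capita GDP x (in $1K).

    Defaults are the rounded coefficients from the chapter's example:
    y-hat = 68.7 + 0.42 * X.
    """
    return a + b * x

# Prediction at GDP = 20 with the rounded coefficients
y_hat_rounded = predict(20)

# Same prediction with the intercept carried to three decimals
y_hat_precise = predict(20, a=68.716)

# The line passes (up to rounding) through the point of means (x-bar, y-bar)
y_hat_at_mean = predict(21.52)

print(round(y_hat_rounded, 2), round(y_hat_precise, 2), round(y_hat_at_mean, 2))
```

With the rounded intercept the prediction at X = 20 is about 77.1 years; carrying the intercept as 68.716 gives 77.12. The prediction at x-bar = 21.52 lands within a few hundredths of y-bar = 77.754, the gap being due to rounding of a and b.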

Page 13: Chapter 4


Coefficient of Determination r²

Interpretation: proportion of the variability in Y mathematically explained by X

Our example: r = 0.809

r² ≈ 0.66 (from the unrounded r ≈ 0.8094; squaring the rounded 0.809 gives 0.65)

Interpretation: 66% of the variability in Y (life expectancy) is mathematically explained* by X (GDP)

* mathematically explained ≠ causally explained
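One way to see what "proportion of variability explained" means is to compute it directly as 1 minus the ratio of residual variation to total variation, which for a least squares line equals r². The sketch below assumes the chapter's ten-country data; the function name `r_squared` is made up for this illustration.

```python
from statistics import mean

gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

def r_squared(x, y):
    """Proportion of the variability in y explained by the least squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope (algebraically the same as r * s_y / s_x)
    a = y_bar - b * x_bar      # intercept
    ss_total = sum((yi - y_bar) ** 2 for yi in y)                     # variability in Y
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # left unexplained
    return 1 - ss_resid / ss_total

print(f"r^2 = {r_squared(gdp, life):.3f}")
```

For these data the result is about 0.655, i.e. roughly two thirds of the variability in life expectancy.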

Page 14: Chapter 4


Cautions about linear regression

1. Applies to linear relationships only

2. Strongly influenced by outliers, especially when the outlier is in the X direction

3. Do not extrapolate!

4. Association ≠ causation (Beware of lurking variables.)

Page 15: Chapter 4


Outliers / Influential Points

• Outliers in the X direction have strong influence (tip the line)

• Example (right)

– Child 18 = outlier in X direction

– Changes the slope substantially

[Figure: regression lines fitted without the outlier and with the outlier]
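The data behind the Child 18 example are not reproduced in this transcript, so the sketch below illustrates the same effect with the chapter's GDP data plus one made-up point. The added point (GDP = 35.0, life expectancy = 76.0) is purely hypothetical and chosen only to sit far out in the X direction; the function name `slope` is likewise introduced here.

```python
from statistics import mean

gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

def slope(x, y):
    """Least squares slope b = Sxy / Sxx (equivalent to r * s_y / s_x)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

b_without = slope(gdp, life)

# HYPOTHETICAL outlier in the X direction (not part of the chapter's data)
b_with = slope(gdp + [35.0], life + [76.0])

print(f"slope without outlier: {b_without:.3f}")
print(f"slope with outlier:    {b_with:.3f}")
```

A single point far from the other X values is enough here to tip the fitted line from a clearly positive slope to a slightly negative one.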

Page 16: Chapter 4


[Figure: height (feet) plotted against age (years), with the age axis running from 0 to 35]

Do Not Extrapolate!

• Example (right): Sarah’s height from age 3 to 5

• Least squares regression line: ŷ = 2.32 + 0.159(X)

• Predict height at age 30

• ŷ = 2.32 + 0.159(30) = 7.09’ (ridiculous)

Do NOT extrapolate beyond the range of X
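In code, one way to enforce this caution is to have the prediction function refuse inputs outside the observed range of X. The sketch below is an illustration under stated assumptions: the name `predict_height` and its parameters are hypothetical, the coefficients come from the example line ŷ = 2.32 + 0.159(X), and the observed range is taken to be ages 3 to 5.

```python
def predict_height(age, a=2.32, b=0.159, x_min=3.0, x_max=5.0):
    """Predicted height (feet) at a given age (years).

    Raises ValueError when age lies outside the range of X used to fit
    the line, since the linear relationship is only supported there.
    """
    if not (x_min <= age <= x_max):
        raise ValueError(
            f"age {age} is outside the observed range [{x_min}, {x_max}]; "
            "do not extrapolate"
        )
    return a + b * age

print(round(predict_height(4), 3))  # interpolation inside the observed range

try:
    predict_height(30)              # extrapolation far beyond the data
except ValueError as err:
    print("refused:", err)
```

Raising an error is a deliberately strict design choice; a gentler variant could return the value together with a warning.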

Page 17: Chapter 4


Association ≠ Causation

• “Association” is not the same as “causation”

• Lurking variable ≡ an extraneous factor (Z) that is associated with both X and Y

• Lurking variables can confound an association
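A small simulation can make the idea concrete. In the sketch below, which uses simulated rather than real data, Z drives both X and Y, and X never appears in the rule that generates Y, yet X and Y still come out clearly correlated. The variable names, the seed, and the noise levels are arbitrary choices made for this illustration.

```python
import random
from statistics import mean, stdev

def pearson_r(x, y):
    """Correlation coefficient r, using sample standard deviations."""
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((len(x) - 1) * stdev(x) * stdev(y))

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated lurking variable Z
z = [rng.gauss(0, 1) for _ in range(2000)]

# X and Y each depend on Z plus independent noise; Y does not depend on X
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)
print(f"correlation between X and Y: {r_xy:.2f}")
```

Under this setup the correlation between X and Y is around 0.5 even though changing X would do nothing to Y.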

Page 18: Chapter 4

Example of Confounding by a Lurking Variable

• Explanatory variable X ≡ number of prior children

• Response variable Y ≡ the risk of Down’s syndrome

• Lurking variable Z ≡ advanced age of mother

• X is associated with Y, but does not cause Y in this example

• Z does cause Y

[Diagram: Older mother (Z) → Number of children (X); Older mother (Z) → Down’s syndrome (Y)]

Page 19: Chapter 4


Criteria used to establish causality with examples about smoking (X) and lung cancer (Y)

• Strength of association

– X & Y strongly correlated

• Consistency of findings

– Many studies have shown X & Y correlated

• Dose-response relationship

– The more you smoke, the more you increase risk

• Temporality (time relation)

– Lung cancer occurs after 10 – 20 years of smoking

• Biological plausibility

– Chemicals in cigarette smoke are mutagenic