CHAPTER 5

Summarizing Bivariate Data

What conclusions can be made when considering the effect of one treatment on another?

5.1 SCATTERPLOTS

What is a scatterplot and what can be determined from it?

TYPES OF DATA

Univariate—one list

Bivariate—two lists

Multivariate—multiple lists

SCATTERPLOT

The most important graphical representation of bivariate data

Plotted on a Cartesian coordinate system

graphs

5.1 HOMEWORK

Page 150-151 2, 4, 6, 8

5.2 CORRELATION

WHAT IS MEANT BY CORRELATION?

Strong Negative Correlation

As x increases, y decreases

Strong Positive Correlation

As x increases, y increases

No Correlation

x and y do not appear to be related

Correlation coefficient—

Indicates the strength of the relationship of bivariate data.

Pearson’s correlation coefficient is the most commonly used and is often called simply THE correlation coefficient.

To calculate Pearson’s correlation coefficient by hand

Find $\bar{x}$, $s_x$ (average of x, standard deviation of x)

Find $\bar{y}$, $s_y$ (average of y, standard deviation of y)

$z_x$ (calculate the z-score for each $x_i$)

$z_y$ (calculate the z-score for each $y_i$)

$z_x z_y$ (multiply each $z_x$ by its $z_y$)

Calculate r: $r = \frac{\sum z_x z_y}{n - 1}$

Remember $-1 \le r \le 1$

Use the chart to help

X    Y    $z_x$    $z_y$    $z_x z_y$

To calculate Pearson’s correlation coefficient by calculator

Enter the data in L1, L2

Turn on the diagnostics

Find the linear regression for the data

Correlation values:

-1 to -.8 and .8 to 1: strong

-.8 to -.5 and .5 to .8: moderate

-.5 to .5: weak

EXAMPLE 1

observation                1   2   3   4   5   6   7   8   9   10
crisis management score    20  13  27  18  19  21  0   21  21  11
family strength score      50  60  67  57  49  72  50  68  60  58

Find the correlation coefficient for crisis management vs. family strength, using both the calculator and Excel

Repeat, switching L1 and L2 on the calculator

What does this indicate?
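The by-hand z-score method can also be sketched in Python as a check on the calculator and Excel results. This is only an illustrative sketch: the function names (`mean`, `sample_sd`, `pearson_r`) are made up for this example, and it assumes the sample standard deviation (dividing by n - 1) is used for the z-scores.

```python
from math import sqrt

# Example 1 data
crisis = [20, 13, 27, 18, 19, 21, 0, 21, 21, 11]
family = [50, 60, 67, 57, 49, 72, 50, 68, 60, 58]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation (divides by n - 1)
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(xs, ys):
    # r = (sum of zx * zy) / (n - 1)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    zx = [(x - x_bar) / s_x for x in xs]
    zy = [(y - y_bar) / s_y for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

r_xy = pearson_r(crisis, family)   # crisis management as x
r_yx = pearson_r(family, crisis)   # lists switched
print(round(r_xy, 4), round(r_yx, 4))
```

Switching the two lists only changes the order of each product zx * zy, so both calls print the same value.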

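Alternate method:

$r = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}} \sqrt{\sum y^2 - \frac{(\sum y)^2}{n}}}$

Listed on formula sheet

Will only be used if they give you summary statistics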
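When only summary statistics are given, the alternate formula needs just n and five sums. A minimal sketch, assuming the sums are already known; the function name `r_from_sums` is hypothetical, and here the sums are built from the Example 1 data purely for illustration.

```python
from math import sqrt

def r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    # Numerator: sum(xy) - (sum x)(sum y) / n
    numerator = sum_xy - sum_x * sum_y / n
    # Denominator: product of the two square-root terms
    denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    return numerator / denominator

# Summary statistics built from the Example 1 data
crisis = [20, 13, 27, 18, 19, 21, 0, 21, 21, 11]
family = [50, 60, 67, 57, 49, 72, 50, 68, 60, 58]

r = r_from_sums(
    n=len(crisis),
    sum_x=sum(crisis),
    sum_y=sum(family),
    sum_xy=sum(x * y for x, y in zip(crisis, family)),
    sum_x2=sum(x * x for x in crisis),
    sum_y2=sum(y * y for y in family),
)
print(round(r, 4))
```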

Properties of r

Does not depend on the unit of measurement

Does not depend on which variable is labeled x

Is always between -1 and 1

r = 1 indicates a perfect positive linear correlation

r = 0 indicates no linear correlation

r = -1 indicates a perfect negative linear correlation

Measures the extent to which x and y have a linear relationship

r is the correlation coefficient for the sample

Correlation DOES NOT imply causation

Often two items have a high correlation not because they impact each other but because they are both strongly related to a third item

EX. Among elementary students, there is a strong positive correlation between vocabulary size and the number of cavities. WHY?

They are both related to age.

Spearman’s Rank Correlation Coefficient

Not as affected by “outliers”

Order the x’s low to high

Order the y’s low to high

Keep the original x and y together

EX

rank of x    2    1    3    4
X            3    -2   5    7
Y            6    9    4    12
rank of y    2    3    1    4

Use the calculator as before OR

$r_s = \frac{\sum (\text{rank of } x)(\text{rank of } y) - \frac{n(n+1)^2}{4}}{\frac{n(n-1)(n+1)}{12}}$

$-1 \le r_s \le 1$
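The four-point example can be worked through in Python as a check on the rank formula. This is a sketch only: the helper names `ranks` and `spearman_rs` are made up here, and the ranking helper assumes there are no tied values.

```python
def ranks(values):
    # Rank from low (1) to high (n); assumes no ties
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rs(xs, ys):
    n = len(xs)
    # Sum of (rank of x)(rank of y), keeping each original pair together
    rank_products = sum(rx * ry for rx, ry in zip(ranks(xs), ranks(ys)))
    return (rank_products - n * (n + 1) ** 2 / 4) / (n * (n - 1) * (n + 1) / 12)

x = [3, -2, 5, 7]
y = [6, 9, 4, 12]

print(ranks(x), ranks(y))   # [2, 1, 3, 4] [2, 3, 1, 4]
print(spearman_rs(x, y))    # 0.2
```

Here the rank products sum to 26, so $r_s = (26 - 25) / 5 = 0.2$.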

5.2 HOMEWORK

Page 163 5.9, 5.10, 5.12, 5.13, 5.14, 5.16, 5.18, 5.22

5.3 FITTING A LINE TO BIVARIATE DATA

How do you fit a line to linear data?

Activation:

Given the following points, find the equation

X     Y
-2    2
0     -2
2     -6

VARIABLES DEFINED

X = the independent or explanatory variable

Y = the dependent or response variable

Stat version of the linear regression (#8): y = a + bx

Algebra and calculus version (#4): y = ax + b

Both forms give the same slope and y-intercept, but statistics prefers the y = a + bx setup, where b is the slope and a is the y-intercept

REGRESSION LINE FORMED BY THE PRINCIPLE OF LEAST SQUARES

Determine the vertical distance from each point to the line that is supposed to represent the overall pattern of the data

if y = a + bx and the data points are $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, etc., then

the vertical distance from each point to the line is

$y_i - (a + bx_i)$

if this is positive, $y_i$ is above the prediction line

if this is negative, $y_i$ is below the prediction line

LEAST SQUARES REGRESSION LINE

The least squares regression line is the one that minimizes

$\sum [y_i - (a + bx_i)]^2$

The formula for the least squares line is

$\hat{y} = a + bx$

a and b can be calculated by (on the AP STAT formula sheet)

$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$ and $a = \bar{y} - b\bar{x}$

CALCULATING BY HAND

$b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$ and $a = \bar{y} - b\bar{x}$

These values can be calculated straight from the data. This formula is not on the formula sheet and is only used when the summary values are given.
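Both ways of computing a and b can be compared on a small data set. The sketch below uses the three points (-2, 2), (0, -2), (2, -6) from the start of this section; the function names are illustrative only.

```python
def least_squares(xs, ys):
    # Deviation form: b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    return a, b

def least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    # Summary-value form: only n and four sums are needed
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b

xs = [-2, 0, 2]
ys = [2, -2, -6]

a, b = least_squares(xs, ys)
print(a, b)   # -2.0 -2.0

a2, b2 = least_squares_from_sums(
    len(xs), sum(xs), sum(ys),
    sum(x * y for x, y in zip(xs, ys)),
    sum(x * x for x in xs),
)
```

Because these three points lie exactly on a line, the least squares line passes through all of them: $\hat{y} = -2 - 2x$.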

LEAST SQUARES REGRESSION LINE

USE for INTERPOLATION not EXTRAPOLATION

Interpolation: predicting for data values between the given values

Extrapolation: predicting for data values beyond the given values

If you are asked to extrapolate, always state that the values may not be accurate due to extrapolation

EXAMPLE

Age in months    Height in inches

19 22

21 23

23

24 25

27 28

29 31

31 28

34 32

38 34

43 39

50 45

72 48

84 54

58

120 62

128

Find the linear regression line for the given data, then find the values for the missing data

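MINITAB INFO

$\hat{y} = 61.54 + 3.407x$

The regression equation is
Chollevl = 61.5 + 3.41 perchgwt

Predictor    Coef      Stdev    t-ratio    p
Constant     61.537    2.268    27.13      0.000
Perchgwt     3.407     1.028    3.31       0.007

In the Coef column, the Constant row is the value of a and the Perchgwt row is the value of b (slope)

[Scatterplot: cholesterol level (vertical axis) vs. % weight change (horizontal axis)]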

Should only be used to predict cholesterol level from percent weight change, and only weight changes from -5 to 3 should be used with any certainty.
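A prediction helper can make the interpolation limit explicit. This sketch assumes the coefficients from the Minitab output (a = 61.537, b = 3.407) and the stated x-range of -5 to 3; the function name and the returned flag are illustrative choices, not part of Minitab.

```python
A = 61.537                  # Constant coefficient (value of a)
B = 3.407                   # Perchgwt coefficient (slope b)
X_MIN, X_MAX = -5.0, 3.0    # range of % weight change stated for the data

def predict_cholesterol(pct_weight_change):
    # Returns the prediction and a flag that is True when extrapolating
    y_hat = A + B * pct_weight_change
    is_extrapolation = not (X_MIN <= pct_weight_change <= X_MAX)
    return y_hat, is_extrapolation

for x in (-2.0, 0.0, 10.0):
    y_hat, flagged = predict_cholesterol(x)
    note = " (extrapolation: may not be accurate)" if flagged else ""
    print(f"x = {x}: predicted cholesterol level = {y_hat:.2f}{note}")
```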

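USING PEARSON’S CORRELATION COEFFICIENT AND ALGEBRAIC MANIPULATION:

Given $b = r\frac{s_y}{s_x}$ and $\hat{y} = \bar{y} + r\frac{s_y}{s_x}(x - \bar{x})$

1) If $x = \bar{x}$

Then $\hat{y} = \bar{y}$

2) If r = 1, then $\hat{y} = \bar{y} + \frac{s_y}{s_x}(x - \bar{x})$

if $x = \bar{x} + 1s_x$ then $\hat{y} = \bar{y} + s_y$

if $x = \bar{x} + 2s_x$ then $\hat{y} = \bar{y} + 2s_y$

3) If it is not a perfect correlation, let r = .5

Then substituting, $\hat{y} = \bar{y} + .5\frac{s_y}{s_x}(x - \bar{x})$

if $x = \bar{x} + 1s_x$ then $\hat{y} = \bar{y} + .5s_y$

This means that $\hat{y}$ will be r standard deviations from $\bar{y}$ for each standard deviation that x is from $\bar{x}$

Hence it pulls (regresses) y back into the line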
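The pull toward the mean can be checked numerically with the form $\hat{y} = \bar{y} + r\frac{s_y}{s_x}(x - \bar{x})$. The means and standard deviations below are hypothetical values chosen only for illustration; the function name `predict` is also made up.

```python
def predict(x, x_bar, y_bar, s_x, s_y, r):
    # y-hat = y_bar + r * (s_y / s_x) * (x - x_bar)
    return y_bar + r * (s_y / s_x) * (x - x_bar)

# Hypothetical summary values (not from the slides)
x_bar, s_x = 50.0, 10.0
y_bar, s_y = 100.0, 20.0

for r in (1.0, 0.5):
    for k in (0, 1, 2):
        x = x_bar + k * s_x                      # x is k standard deviations above its mean
        y_hat = predict(x, x_bar, y_bar, s_x, s_y, r)
        sds_from_mean = (y_hat - y_bar) / s_y    # works out to r * k
        print(f"r = {r}, x is {k} SD above the mean, y-hat is {sds_from_mean} SD above the mean")
```

With r = .5, an x that is 2 standard deviations above its mean predicts a y only 1 standard deviation above its mean.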

5.3 HOMEWORK

Page 174-176 26, 27, 28, 31, 32, 34

5.4 ASSESSING THE FIT OF A LINE

How do you assess how well a line fits the data?

3 CHECKS FOR FIT

1) Is a line an appropriate way to summarize the data (does the shape appear to be linear)?

2) Are there any unusual aspects of the data that need to be considered before making predictions?

3) How accurate can we expect these predictions to be?

FINDING RESIDUALS

The distance from the actual or observed value to the predicted value (HINT: this is an AP class; a residual is Actual – Predicted)

$y_i - \hat{y}_i$

Using the calculator to find residuals

L1 = x

L2 = y

L3 = predicted: in L3, VARS, Statistics (5), EQ, RegEQ; replace the X in RegEQ with L1

L4 = residuals: in L4 type L2 – L3

PLOTTING RESIDUALS

$(x, y - \hat{y})$ OR $(\hat{y}, y - \hat{y})$

There are two types of residual plots that can be made. Each gives us a picture that can be examined

Residuals for a good fit should have no particular pattern but should lie in a band not too far from zero

WHAT TO LOOK FOR

Removal of the data causing a single large residual has a minimal impact on the regression line

Removal of a single influential point has a large impact on the regression line.

An influential point is one whose x value is far from the rest of the x values.

THE COEFFICIENT OF DETERMINATION

Gives the proportion of variation in y that is attributed to the approximate linear relationship between x and y.

$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}}$

$r^2$: amount actually attributed to the linear relationship

SSTo: possible amount explained by a linear relationship

SSResid: amount not attributed to a linear relationship

SSTo AND SSRESID CALCULATIONS

SSTo

Total sum of squares: squared variation from the mean of y

$\text{SSTo} = \sum (y_i - \bar{y})^2$

SSResid

The amount of variation not attributed to a linear relationship

Referred to as the error sum of squares

$\text{SSResid} = \sum (y_i - \hat{y}_i)^2$

SSResid ≤ SSTo

Easy Computational Formulas

SSTo = $\sum y^2 - \frac{(\sum y)^2}{n}$

SSResid = $\sum y^2 - a\sum y - b\sum xy$

$r^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}}$

All items can be obtained from the regression line and the 2-variable stats function, including the coefficient of determination

STANDARD DEVIATION ABOUT THE LEAST SQUARES LINE

Denoted $s_e$, meaning the standard deviation of error

$s_e = \sqrt{\frac{\text{SSResid}}{n - 2}}$

n - 2 relates to degrees of freedom, to be discussed later

For a truly good fit, $r^2$ must be larger than .5 and $s_e$ should be low
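The sums of squares, $r^2$, and $s_e$ can be computed together to see how they relate. This sketch reuses the Example 1 data (crisis management as x, family strength as y) purely for illustration; it computes SSTo and SSResid both from their definitions and from the shortcut formulas so the two can be compared. All variable names are assumptions of this example.

```python
from math import sqrt

# Example 1 data from section 5.2
xs = [20, 13, 27, 18, 19, 21, 0, 21, 21, 11]
ys = [50, 60, 67, 57, 49, 72, 50, 68, 60, 58]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope b and intercept a
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

# Definitions
ss_to = sum((y - y_bar) ** 2 for y in ys)
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Computational shortcuts
sum_y = sum(ys)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
ss_to_short = sum_y2 - sum_y ** 2 / n
ss_resid_short = sum_y2 - a * sum_y - b * sum_xy

r_squared = 1 - ss_resid / ss_to
s_e = sqrt(ss_resid / (n - 2))

print(f"SSTo = {ss_to:.2f}, SSResid = {ss_resid:.2f}")
print(f"r^2 = {r_squared:.4f}, s_e = {s_e:.3f}")
```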

MINITAB AND CORRELATION

Page 179

EXAMPLE

Use data from 5.44

1) Use the calculator to:

a) draw a scatterplot

b) find the regression line

c) find the correlation coefficient

d) calculate the predicted values

e) calculate the residuals

f) graph the residuals

X Y

92 1.7

92 2.3

96 1.9

100 2.0

102 1.5

102 1.7

106 1.6

106 1.8

121 1.0

143 0.3
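Steps b) through e) can be cross-checked in Python against the calculator lists (L3 = predicted, L4 = residuals). This is a sketch using the X and Y values from the table above; plotting is left to the calculator, and the variable names are illustrative.

```python
from math import sqrt

x = [92, 92, 96, 100, 102, 102, 106, 106, 121, 143]
y = [1.7, 2.3, 1.9, 2.0, 1.5, 1.7, 1.6, 1.8, 1.0, 0.3]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# b) regression line y-hat = a + bx
b = s_xy / s_xx
a = y_bar - b * x_bar

# c) correlation coefficient
r = s_xy / sqrt(s_xx * s_yy)

# d) predicted values (like L3) and e) residuals = actual - predicted (like L4)
predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

print(f"y-hat = {a:.3f} + ({b:.4f})x, r = {r:.3f}")
for xi, yi, pi, ei in zip(x, y, predicted, residuals):
    print(f"{xi:>4} {yi:>4} {pi:7.3f} {ei:7.3f}")
```

The residuals from a least squares line always sum to zero (up to rounding), which is a quick check that L4 was entered correctly.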

5.4 HOMEWORK

Page 188-191 37, 38, 39, 41, 42, 43, 48, 51 c&d

5.5 NONLINEAR RELATIONSHIPS AND TRANSFORMATION

How are nonlinear relationships explained?

TRANSFORMATIONS

DO NOT mean moved from the parent function

DO mean adjusting x and/or y values so that the new points appear linear

Common transformations are sq. roots, logs, and reciprocals

[Graphs: original; algebraic transformation]

QUADRATIC AND CUBIC FUNCTIONS

Use a graphing calculator or a STAT package such as minitab or fathom

Quadratic equations can be done by hand although it is not recommended

$\text{SSResid} = \sum (y - \hat{y})^2$

$R^2 = 1 - \frac{\text{SSResid}}{\text{SSTo}} = 1 - \frac{\sum (y - \hat{y})^2}{\text{SSTo}}$

UNDOING A TRANSFORMATION

y’ = 1.14 – 1.92x where y’ = log(y)

log y = 1.14 – 1.92x

$10^{\log y} = 10^{1.14 - 1.92x}$

$y = 10^{1.14 - 1.92x}$

$y = (10^{1.14})(10^{-1.92x})$

$y = 13.8038(10^{-1.92x})$

Undoing a transformation yields a curve that fits the data, but is not a least squares line.
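The back-transformation can be checked numerically. A minimal sketch, assuming the fitted line log y = 1.14 - 1.92x from this slide and a base-10 log; the function names are illustrative.

```python
def predict_log_y(x):
    # Fitted line on the transformed scale: log10(y) = 1.14 - 1.92x
    return 1.14 - 1.92 * x

def predict_y(x):
    # Undo the log by raising 10 to the predicted value
    return 10 ** predict_log_y(x)

coefficient = 10 ** 1.14
print(round(coefficient, 4))   # 13.8038

# Both forms of the curve agree at a sample x value
x = 0.5
print(predict_y(x), coefficient * 10 ** (-1.92 * x))
```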

DETERMINING WHICH TRANSFORMATION TO USE

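[Figure: four numbered curves (1 to 4) arranged around a pair of axes labeled +y (up), -y (down), -x (left), and +x (right)]

If the curve resembles one of the numbered curves, then to achieve a linear transformation move up (+) or down (-) the power chart as indicated by the closest part of the x or y axis.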

Power    Function         Name
3        $x^3$            Cube
2        $x^2$            Square
1        $x$              No transformation
1/2      $\sqrt{x}$       Square root
1/3      $\sqrt[3]{x}$    Cube root
0        $\log x$         Log
-1       $1/x$            Reciprocal

EXAMPLE

frying time (x)    moisture (y)
5                  16.3
10                 9.7
15                 8.1
20                 4.2
25                 3.4
30                 2.9
45                 1.9
60                 1.3

#3 curve, therefore x and/or y down

frying time (x)    moisture (y)    transformation log(y)
5                  16.3            1.212187604
10                 9.7             0.986771734
15                 8.1             0.908485019
20                 4.2             0.62324929
25                 3.4             0.531478917
30                 2.9             0.462397998
45                 1.9             0.278753601
60                 1.3             0.113943352

Is the transformed data linear?

Find the linear regression on the transformation

Check the residual pattern. Try a different transformation. Plot this residual pattern. Which one looks better? Which has a better r value?
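The frying-time example can be worked through in Python to compare the r value before and after the log transformation. This is a sketch: the helper names are made up, the log is taken base 10 to match the table, and the residuals still need to be plotted (on the calculator, for instance) to judge the pattern.

```python
from math import log10, sqrt

frying_time = [5, 10, 15, 20, 25, 30, 45, 60]
moisture = [16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]

def fit(xs, ys):
    # Returns intercept a, slope b, and correlation r for y-hat = a + bx
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b, s_xy / sqrt(s_xx * s_yy)

# Transformation: move y down the power chart to log(y)
log_moisture = [log10(y) for y in moisture]

a_raw, b_raw, r_raw = fit(frying_time, moisture)
a_log, b_log, r_log = fit(frying_time, log_moisture)

print(f"original:    r = {r_raw:.3f}")
print(f"transformed: r = {r_log:.3f}, log(y)-hat = {a_log:.3f} + ({b_log:.4f})x")

# Residuals on the transformed scale, ready for a residual plot
residuals_log = [ly - (a_log + b_log * x) for x, ly in zip(frying_time, log_moisture)]
```

For this data the transformed fit has an r value closer to -1 than the original fit, which agrees with the transformed points looking more linear.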

5.5 HOMEWORK

Page 206-207 52, 53, 59

5.6 INTERPRETING THE RESULTS OF STATISTICAL ANALYSIS

Read pages 208-209

REVIEW

Page 210-213 61, 63, 64, 66, 68, 69
