Chapter 4

Chapter 4 Describing Bivariate Numerical Data Created by Kathy Fritz


Transcript of Chapter 4

Page 1: Chapter 4

Chapter 4

Describing Bivariate Numerical Data

Created by Kathy Fritz

Page 2: Chapter 4

Forensic scientists must often estimate the age of an unidentified crime victim. Prior to 2010, this was usually done by analyzing teeth and bones, and the resulting estimates were not very reliable. A study described in the paper “Estimating Human Age from T-Cell DNA Rearrangements” (Current Biology [2010]) examined the relationship between age and a measure based on a blood test. Age and the blood test measure were recorded for 195 people ranging in age from a few weeks to 80 years. A scatterplot of the data appears to the right.

Do you think there is a relationship? If so, what kind? If not, why not?

This line can be used to estimate the age of a crime victim from a blood test.

Page 3: Chapter 4

Correlation

Pearson’s Sample Correlation Coefficient

Properties of r

Page 4: Chapter 4

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

Yes

Yes

Page 5: Chapter 4

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

Yes

Yes

Page 6: Chapter 4

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

Yes

No, looks curved

Page 7: Chapter 4

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

Yes

No, looks parabolic

Page 8: Chapter 4

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

No

Page 9: Chapter 4

Linear relationships can be either positive or negative in direction.

Are these linear relationships positive or negative?

Positive

Negative

Page 10: Chapter 4

When the points in a scatterplot tend to cluster tightly around a line, the relationship is described as strong.

Try to order the scatterplots from strongest relationship to the weakest.

These four scatterplots were constructed using data from graphs in Archives of General Psychiatry (June 2010).

(Figure: four scatterplots labeled A, B, C, and D.)

Answer: A, C, B, D

Page 11: Chapter 4

Pearson’s Sample Correlation Coefficient

• Usually referred to as just the correlation coefficient

• Denoted by r

• Measures the strength and direction of a linear relationship between two numerical variables

An important definition!

The strongest values of the correlation coefficient are r = +1 and r = -1.

The weakest value of the correlation coefficient is r = 0.

Page 12: Chapter 4

Properties of r

1. The sign of r matches the direction of the linear relationship.

r is positive

r is negative

Page 13: Chapter 4

Properties of r

2. The value of r is always greater than or equal to -1 and less than or equal to +1.

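(Figure: number line for r from -1 to 1, marked at -1, -0.8, -0.5, 0, 0.5, 0.8, and 1.)

Weak correlation: r between -0.5 and 0.5

Moderate correlation: r between -0.8 and -0.5, or between 0.5 and 0.8

Strong correlation: r between -1 and -0.8, or between 0.8 and 1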

Page 14: Chapter 4

Properties of r

3. r = 1 only when all the points in the scatterplot fall on a straight line that slopes upward. Similarly, r = -1 when all the points fall on a downward sloping line.

Page 15: Chapter 4

Properties of r

4. r is a measure of the extent to which x and y are linearly related

Compute the correlation coefficient for these points, then sketch the scatterplot.

x 2 4 6 8 10 12 14

y 40 20 8 4 8 20 40

r = 0

Does this mean that there is NO relationship between these points?

No. r = 0, but the data set has a definite (curved) relationship!
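The claim that r = 0 for these seven points can be checked numerically. The following is a minimal Python sketch; the helper name `pearson_r` is our own choice and is not part of the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data from the slide: a symmetric, U-shaped pattern.
x = [2, 4, 6, 8, 10, 12, 14]
y = [40, 20, 8, 4, 8, 20, 40]

r = pearson_r(x, y)
print(r)  # 0.0: no LINEAR relationship, even though y clearly depends on x
```

Because the points are symmetric about x = 8, the positive and negative products cancel exactly, which is why a plot is needed to see the relationship.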


Page 16: Chapter 4

Properties of r

5. The value of r does not depend on the unit of measurement for either variable.

Mare Weight (in kg)   Foal Weight (in kg)

556 129.0

638 119.0

588 132.0

550 123.5

580 112.0

642 113.5

568 95.0

642 104.0

556 104.0

616 93.5

549 108.5

504 95.0

515 117.5

551 128.0

594 127.5

Calculate r for the data set of mares’ weights and the weights of their foals.

r = -0.00359

Mare Weight (in lbs)   Foal Weight (in kg)

1223.2 129.0

1403.6 119.0

1293.6 132.0

1210.0 123.5

1276.0 112.0

1412.4 113.5

1249.6 95.0

1412.4 104.0

1223.2 104.0

1355.2 93.5

1207.8 108.5

1108.8 95.0

1133.0 117.5

1212.2 128.0

1306.8 127.5

Change the mare weights to pounds by multiplying kg by 2.2, then calculate r again.

r = -0.00359
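Property 5 can be illustrated with a short Python sketch that rescales the mare weights and recomputes r. The helper `pearson_r` is our own name. The value computed from the table as transcribed here may not match the r printed on the slide to every digit; the point of the sketch is only that the two results agree with each other.

```python
from math import isclose, sqrt

def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Mare and foal weights in kg, as listed in the first table.
mare_kg = [556, 638, 588, 550, 580, 642, 568, 642, 556, 616, 549, 504, 515, 551, 594]
foal_kg = [129.0, 119.0, 132.0, 123.5, 112.0, 113.5, 95.0, 104.0,
           104.0, 93.5, 108.5, 95.0, 117.5, 128.0, 127.5]

# Convert only the mare weights to pounds (1 kg is about 2.2 lb).
mare_lb = [2.2 * kg for kg in mare_kg]

r_kg = pearson_r(mare_kg, foal_kg)
r_lb = pearson_r(mare_lb, foal_kg)

# Changing the unit of x rescales the deviations and s_x by the same factor,
# so the correlation is unchanged.
print(isclose(r_kg, r_lb, abs_tol=1e-9))  # True
```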

Page 17: Chapter 4

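Calculating the Correlation Coefficient

The correlation coefficient is calculated using the following formula:

$r = \dfrac{\sum z_x z_y}{n - 1}$

where

$z_x = \dfrac{x - \bar{x}}{s_x}$ and $z_y = \dfrac{y - \bar{y}}{s_y}$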

Page 18: Chapter 4

The web site www.collegeresults.org (The Education Trust) publishes data on U.S. colleges and universities. The following six-year graduation rates and student-related expenditures per full-time student for 2007 were reported for the seven primarily undergraduate public universities in California with enrollments between 10,000 and 20,000.

Here is the scatterplot:

Does the relationship appear linear? Explain.

Expenditures: 8810 7780 8112 8149 8477 7342 7984

Graduation rates: 66.1 52.4 48.9 48.1 42.0 38.3 31.3

Page 19: Chapter 4

College Expenditures Continued:

To compute the correlation coefficient, first find the z-scores.

x y zx zy zxzy

8810 66.1 1.52 1.74 2.64

7780 52.4 -0.66 0.51 -0.34

8112 48.9 0.04 0.19 0.01

8149 48.1 0.12 0.12 0.01

8477 42.0 0.81 -0.42 -0.34

7342 38.3 -1.59 -0.76 1.21

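7984 31.3 -0.23 -1.38 0.32

Summing the (rounded) products in the last column and dividing by n - 1:

$r = \dfrac{\sum z_x z_y}{n - 1} = \dfrac{3.51}{6} \approx 0.585$

To interpret the correlation coefficient, use the definition:

There is a positive, moderate linear relationship between six-year graduation rates and student-related expenditures.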
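The z-score calculation in the table can be reproduced with Python's standard library. This is a minimal sketch; the function name `z_scores` is our own. Because it works with unrounded z-scores, the result can differ in the third decimal place from a value obtained by summing the rounded table entries.

```python
from statistics import mean, stdev

# Expenditures (x) and six-year graduation rates (y) from the slide.
x = [8810, 7780, 8112, 8149, 8477, 7342, 7984]
y = [66.1, 52.4, 48.9, 48.1, 42.0, 38.3, 31.3]

def z_scores(values):
    """Standardize each value using the sample mean and sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx = z_scores(x)
zy = z_scores(y)

# r is the sum of the z-score products divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
print(round(r, 2))  # 0.58
```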

Page 20: Chapter 4

How the Correlation Coefficient Measures the Strength of a Linear Relationship

zx is positive, zy is positive: zxzy is positive

zx is negative, zy is negative: zxzy is positive

zx is negative, zy is positive: zxzy is negative

Will the sum of zxzy be positive or negative?

Page 21: Chapter 4

How the Correlation Coefficient Measures the Strength of a Linear Relationship

zx is positive, zy is positive: zxzy is positive

zx is negative, zy is negative: zxzy is positive

zx is negative, zy is positive: zxzy is negative

zx is negative, zy is positive: zxzy is negative

Will the sum of zxzy be positive or negative?

Page 22: Chapter 4

How the Correlation Coefficient Measures the Strength of a Linear Relationship

Will the sum of zxzy be positive or negative or zero?

Page 23: Chapter 4

Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable?

Consider the following examples:

• The relationship between the number of cavities in a child’s teeth and the size of his or her vocabulary is strong and positive.

So does this mean I should feed children more candy to increase their vocabulary?

These variables are both strongly related to the age of the child.

• Consumption of hot chocolate is negatively correlated with crime rate.

Should we all drink more hot chocolate to lower the crime rate?

Both are responses to cold weather.

Causality can only be shown by carefully controlling values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.

Association does NOT imply causation.

Page 24: Chapter 4

Linear Regression

Least Squares Regression Line

Page 25: Chapter 4

Suppose there is a relationship between two numerical variables.

Let x be the amount spent on advertising and y be the amount of sales for the product during a given period.

You might want to predict product sales (y) for a month when the amount spent on advertising is $10,000 (x).

The letter y is used to denote the variable you want to predict, called the response variable (or dependent variable).

The other variable, denoted by x, is the predictor variable (sometimes called independent or explanatory variable).

Page 26: Chapter 4

The equation of a line is: $y = a + bx$

Where:

b is the slope of the line

- it is the amount by which y increases when x increases by 1 unit

a is the intercept (also called y-intercept or vertical intercept)

- it is the height of the line above x = 0

- in some contexts, it is not reasonable to interpret the intercept

Page 27: Chapter 4

The Deterministic Model

Notice, the y-value is determined by substituting the x-value into the equation of the line.

Also notice that the points fall on the line.

We often say x determines y.

But, when we fit a line to data, do all the points fall on the line?

Page 28: Chapter 4

How do you find an appropriate line for describing a bivariate data set?

(Figure: scatterplot with the line y = 10 + 2x drawn through the points.)

To assess the fit of a line, we look at how the points deviate vertically from the line.

This point is (20,45).

The predicted value for y when x = 20 is:

$\hat{y} = 10 + 2(20) = 50$

The deviation of the point (20,45) from the line is: 45 - 50 = -5

What is the meaning of a negative deviation?

The point (15,44) has a deviation of +4.

What is the meaning of this deviation?

To assess the fit of a line, we need a way to combine the n deviations into a single measure of fit.
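The two deviations quoted on this slide can be reproduced in a few lines of Python. This is a small sketch for the candidate line y = 10 + 2x; the function name `predict` is our own.

```python
def predict(x):
    """Height of the candidate line y = 10 + 2x at a given x."""
    return 10 + 2 * x

# The two points discussed on the slide.
points = [(20, 45), (15, 44)]

# Deviation = observed y minus the y predicted by the line.
deviations = [y - predict(x) for x, y in points]
print(deviations)  # [-5, 4]
```

A negative deviation means the point lies below the line; a positive deviation means it lies above.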

Page 29: Chapter 4

Least squares regression line

The most widely used measure of the fit of a line y = a + bx to bivariate data is the sum of the squared deviations about the line:

$\sum [y - (a + bx)]^2$

The least squares regression line is the line that minimizes the sum of squared deviations.

The slope of the least squares regression line is:

$b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

and the y-intercept is:

$a = \bar{y} - b\bar{x}$

The equation of the least squares regression line is:

$\hat{y} = a + bx$

Page 30: Chapter 4

Let’s investigate the meaning of the least squares regression line. Suppose we have a data set that consists of the observations (0,0), (3,10) and (6,2).

Use a calculator to find the least squares regression line:

$\hat{y} = \frac{1}{3}x + 3$

Find the vertical deviations from the line:

-3, 6, -3

What is the sum of the deviations from the line?

Will the sum always be zero?

Hmmmmm . . . Why does this seem so familiar?

Find the sum of the squares of the deviations from the line:

Sum of the squares = 54

The line that minimizes the sum of squared deviations is the least squares regression line.
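The three-point example can be worked through in Python as a check. This is a minimal sketch using the standard slope and intercept formulas for the least squares line; the helper names `least_squares` and `sum_sq_dev` are our own, and the horizontal comparison line y = 4 is an arbitrary alternative chosen for illustration.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def sum_sq_dev(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [0, 3, 6]
ys = [0, 10, 2]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(round(a, 6), round(b, 6))                 # 3.0 0.333333
print([round(res, 6) for res in residuals])     # [-3.0, 6.0, -3.0]
print(round(sum(residuals), 6))                 # 0.0
print(round(sum_sq_dev(a, b, xs, ys), 6))       # 54.0

# Any other line does worse, e.g. the horizontal line y = 4:
print(sum_sq_dev(4, 0, xs, ys))                 # 56
```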

Page 31: Chapter 4

Pomegranate, a fruit native to Persia, has been used in the folk medicines of many cultures to treat various ailments. Researchers are now investigating whether pomegranate's antioxidant properties are useful in the treatment of cancer.

In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded for several points in time.

For the mice assigned to plain water, x = number of days after injection of cancer cells and y = average tumor volume (in mm³).

x 11 15 19 23 27

y 150 270 450 580 740

Sketch a scatterplot for this data set.

(Figure: scatterplot of average tumor volume versus number of days after injection.)

Page 32: Chapter 4

 

 

Computer software and graphing calculators can calculate the least squares regression line. For these data:

$\hat{y} = -269.75 + 37.25x$

Interpretation of slope: The average volume of the tumor increases by approximately 37.25 mm³ for each one-day increase in the number of days after injection.

Does the intercept have meaning in this context? Why or why not?
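As one way of letting software do the work, here is a minimal Python sketch that fits the least squares line to the five pomegranate data points. The function name `least_squares` is our own; it applies the usual slope and intercept formulas.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

days = [11, 15, 19, 23, 27]           # x: days after injection
volume = [150, 270, 450, 580, 740]    # y: average tumor volume (mm^3)

a, b = least_squares(days, volume)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # y-hat = -269.75 + 37.25x
```

The slope of 37.25 agrees with the interpretation given on the slide.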

Page 33: Chapter 4

Pomegranate study continued

Predict the average volume of the tumor for 20 days after injection.

$\hat{y} = -269.75 + 37.25(20) = 475.25$ mm³

Predict the average volume of the tumor for 5 days after injection.

$\hat{y} = -269.75 + 37.25(5) = -83.5$ mm³

Can volume be negative?

This is the danger of extrapolation. The least squares line should not be used to make predictions for y using x-values outside the range in the data set.

Why? It is unknown whether the pattern observed in the scatterplot continues outside the range of x-values.
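The extrapolation warning can be built into a prediction helper. The sketch below assumes the line $\hat{y} = -269.75 + 37.25x$ obtained by applying the least squares formulas to the five data points (x from 11 to 27 days); the function name `predict_volume` and the idea of returning a flag are our own illustration, not something the slides prescribe.

```python
# Least squares coefficients computed from the five (day, volume) pairs.
A, B = -269.75, 37.25
X_MIN, X_MAX = 11, 27   # range of x-values in the data set

def predict_volume(day):
    """Return (predicted volume in mm^3, True if day is inside the observed x-range)."""
    in_range = X_MIN <= day <= X_MAX
    return A + B * day, in_range

print(predict_volume(20))  # (475.25, True): interpolation, reasonable to use
print(predict_volume(5))   # (-83.5, False): extrapolation, and an impossible volume
```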

Page 34: Chapter 4

Why is the line used to summarize a linear relationship called the least squares regression line?

This terminology comes from the relationship between the least squares line and the correlation coefficient.

An alternate expression for the slope b is:

$b = r\left(\dfrac{s_y}{s_x}\right)$

The least squares regression line passes through the point of averages $(\bar{x}, \bar{y})$.

Using the point-slope form of a line and r = 1, we can substitute the alternative slope and the point of averages:

$\hat{y} - \bar{y} = \dfrac{s_y}{s_x}(x - \bar{x})$

which is

$\hat{y} = \bar{y} + \dfrac{s_y}{s_x}(x - \bar{x})$

If r = 1, what do you know about the location of the points?

Suppose that a point on the line is one standard deviation above the mean of x. The x-value of this point would be $\bar{x} + s_x$. Substitute this value for x in our equation:

$\hat{y} = \bar{y} + \dfrac{s_y}{s_x}(\bar{x} + s_x - \bar{x}) = \bar{y} + s_y$

Notice that when r = 1, the y-value will be one standard deviation above the mean of y, $\bar{y} + s_y$, for an x-value one standard deviation above the mean of x, $\bar{x} + s_x$.

Page 35: Chapter 4

Why is the line used to summarize a linear relationship called the least squares regression line?

Let’s investigate what happens when r < 1.

Suppose r = 0.5 and $x = \bar{x} + s_x$. Substitute these values in our equation:

$\hat{y} = \bar{y} + 0.5\,\dfrac{s_y}{s_x}(\bar{x} + s_x - \bar{x}) = \bar{y} + 0.5\,s_y$

Notice that when r = 0.5, the y-value will be one-half standard deviation above the mean of y, $\bar{y} + 0.5\,s_y$, for an x-value one standard deviation above the mean of x, $\bar{x} + s_x$.

Using the least squares line, the predicted y is pulled back in (or regressed) toward $\bar{y}$.

What would happen if r = 0.4? . . . 0.3? . . . 0.2?
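The question about smaller values of r can be explored numerically. The sketch below uses the slope formula b = r(s_y/s_x) and the fact that the line passes through the point of averages. All four summary statistics are hypothetical numbers chosen only to make the arithmetic easy; they do not come from any data set in these slides.

```python
# Hypothetical summary statistics (illustration only).
MEAN_X, S_X = 50.0, 10.0
MEAN_Y, S_Y = 100.0, 20.0

def predicted_y(r, x):
    """Least squares prediction using slope b = r * s_y / s_x through (mean_x, mean_y)."""
    b = r * S_Y / S_X
    return MEAN_Y + b * (x - MEAN_X)

x = MEAN_X + S_X  # an x-value one standard deviation above the mean of x

for r in (1.0, 0.5, 0.4, 0.3, 0.2):
    y_hat = predicted_y(r, x)
    # How many standard deviations above the mean of y is the prediction?
    print(r, round((y_hat - MEAN_Y) / S_Y, 3))
```

In each case the prediction sits r standard deviations above the mean of y, so the weaker the correlation, the more the prediction is pulled back toward the mean.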

Page 36: Chapter 4

If you want to predict x from y, can you use the least squares line of y on x?

The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction.

The slope of the least squares line for predicting x will be $r\left(\dfrac{s_x}{s_y}\right)$, not $r\left(\dfrac{s_y}{s_x}\right)$. Also, the intercepts of the lines are almost always different.

Page 37: Chapter 4

Assessing the Fit of a Line

Residuals

Residual Plots

Outliers and Influential Points

Coefficient of Determination

Standard Deviation about the Line

Page 38: Chapter 4

Assessing the fit of a line

Once the least squares regression line is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y.

Important questions are:

1. Is the line an appropriate way to summarize the relationship between x and y?

2. Are there any unusual aspects of the data set that you need to consider before proceeding to use the least squares regression line to make predictions?

3. If you decide that it is reasonable to use the line as a basis for prediction, how accurate can you expect predictions to be?

This section will look at graphical and numerical methods to answer these questions.

Page 39: Chapter 4

Residuals

Recall, the vertical deviations of points from the least squares regression line are called deviations.

These deviations are also called residuals: residual = $y - \hat{y}$.

Page 40: Chapter 4

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

Minitab was used to fit the least squares regression line. The regression line is:

$\hat{y} = -7.69 + 3.234x$

Calculate the predicted y and the residuals.

Distance from Debris (x)   Distance Traveled (y)   Predicted y   Residual

6.94 0.00 14.76 -14.76

5.23 6.13 9.23 -3.10

5.21 11.29 9.16 2.13

7.10 14.35 15.28 -0.93

8.16 12.03 18.70 -6.67

5.50 22.72 10.10 12.62

9.19 20.11 22.04 -1.93

9.05 26.16 21.58 4.58

9.36 30.65 22.59 8.06

(Figure: scatterplot of distance traveled versus distance to debris, with the fitted line.)

If the point is below the line the residual will be negative.

If the point is above the line the residual will be positive.
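The predicted values and residuals for the deer mouse data can be computed with a short Python sketch. It assumes the coefficients reported in the Minitab output later in this transcript (Constant -7.69, slope 3.234). Because those coefficients are rounded, the results may differ from tabulated values by about 0.01.

```python
# Coefficients taken from the Minitab output for the deer mouse data.
a, b = -7.69, 3.234

x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]         # distance from debris (m)
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]  # distance traveled (m)

predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # residual = y - y-hat

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    position = "above" if ri > 0 else "below"
    print(f"{xi:5.2f} {yi:6.2f} {pi:6.2f} {ri:7.2f}  point is {position} the line")
```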

Page 41: Chapter 4

Residual plots

• A residual plot is a scatterplot of the (x, residual) pairs.

• Residuals can also be graphed against the predicted y-values

• Isolated points or a pattern of points in the residual plot indicate potential problems.

A careful look at the residuals can reveal many potential problems.

A residual plot is a graph of the residuals.

Page 42: Chapter 4

Deer mice continued

Distance from Debris (x)   Distance Traveled (y)   Predicted y   Residual

6.94 0.00 14.76 -14.76

5.23 6.13 9.23 -3.10

5.21 11.29 9.16 2.13

7.10 14.35 15.28 -0.93

8.16 12.03 18.70 -6.67

5.50 22.72 10.10 12.62

9.19 20.11 22.04 -1.93

9.05 26.16 21.58 4.58

9.36 30.65 22.59 8.06

Plot the residuals against the distance from debris (x)

Page 43: Chapter 4

(Figure: residual plot of residuals versus distance from debris.)

Are there any isolated points?

Is there a pattern in the points?

Deer mice continued

The points in the residual plot appear scattered at random.

This indicates that a line is a reasonable way to describe the relationship between the distance from debris and the distance traveled.

Page 44: Chapter 4

(Figures: residual plot of residuals versus predicted distance traveled, and residual plot of residuals versus distance from debris.)

Residual plots can be plotted against either the x-values or the predicted y-values.

Deer mice continued

Page 45: Chapter 4

Residual plots continued

Let’s examine the accompanying data on x = height (in inches) and y = average weight (in pounds) for American females, ages 30-39 (from The World Almanac and Book of Facts).

x 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72

y 113 115 118 121 124 128 131 134 137 141 145 150 153 159 164

The scatterplot appears rather straight.

The residual plot displays a definite curved pattern.

Even though r = 0.99, it is not accurate to say that weight increases linearly with height.

Page 46: Chapter 4

Let’s examine the data set for 12 black bears from the Boreal Forest.

x = age (in years) and y = weight (in kg)

x 10.5 6.5 28.5 10.5 6.5 7.5 6.5 5.5 7.5 11.5 9.5 5.5

y 54 40 62 51 55 56 62 42 40 59 51 50

Sketch a scatterplot with the fitted regression line.

(Figure: scatterplot of weight versus age with the fitted regression line.)

Do you notice anything unusual about this data set?

This observation has an x-value that differs greatly from the others in the data set.

What would happen to the regression line if this point is removed?

If the point affects the placement of the least-squares regression line, then the point is considered an influential point.

Page 47: Chapter 4

Black bears continued

(Figure: scatterplot of weight versus age with the fitted regression line.)

Notice that this observation falls far away from the regression line in the y direction.

An observation is an outlier if it has a large residual.

x 10.5 6.5 28.5 10.5 6.5 7.5 6.5 5.5 7.5 11.5 9.5 5.5

y 54 40 62 51 55 56 62 42 40 59 51 50

Page 48: Chapter 4

Coefficient of Determination

• The coefficient of determination is the proportion of variation in y that can be attributed to an approximate linear relationship between x and y

• Denoted by r²

• The value of r² is often converted to a percentage.

Suppose that you would like to predict the price of houses in a particular city from the size of the house (in square feet). There will be variability in house price, and it is this variability that makes accurate price prediction a challenge.

If you know that differences in house size account for a large proportion of the variability in house price, then knowing the size of a house will help you predict its price.

Page 49: Chapter 4

Let’s explore the meaning of r² by revisiting the deer mouse data set.

x = the distance from the food to the nearest pile of fine woody debris

y = distance a deer mouse will travel for food

x 6.94 5.23 5.21 7.10 8.16 5.50 9.19 9.05 9.36

y 0 6.13 11.29 14.35 12.03 22.72 20.11 26.16 30.65

Suppose you didn’t know any x-values. What distance would you expect deer mice to travel? A natural choice is the mean distance traveled, $\bar{y}$.

To find the total amount of variation in the distance traveled (y), find the sum of the squares of the deviations from the mean:

$SSTo = \sum (y - \bar{y})^2$

Why do we square the deviations?

Total amount of variation in the distance traveled (y) is

SSTo = 773.95 m²

Page 50: Chapter 4

Deer mice continued

x = the distance from the food to the nearest pile of fine woody debris

y = distance a deer mouse will travel for food

x 6.94 5.23 5.21 7.10 8.16 5.50 9.19 9.05 9.36

y 0 6.13 11.29 14.35 12.03 22.72 20.11 26.16 30.65

Now let’s find how much variation there is in the distance traveled (y) about the least squares regression line.

(Figure: scatterplot of distance traveled versus distance to debris, with the fitted line.)

To find this amount of variation, find the sum of the squared residuals:

$SSResid = \sum (y - \hat{y})^2$

Why do we square the residuals?

The amount of variation in the distance traveled (y) from the least squares regression line is

SSResid = 526.27 m²

Page 51: Chapter 4

Deer mice continued

x = the distance from the food to the nearest pile of fine woody debris

y = distance a deer mouse will travel for food

Total amount of variation in the distance traveled (y) is

SSTo = 773.95 m².

The amount of variation in y values from the regression line is

SSResid = 526.27 m²

How does the variation in y change when we use the least squares regression line?

Approximately what percent of the variation in distance traveled (y) can be explained by the linear relationship?

$r^2 = 1 - \dfrac{SSResid}{SSTo} = 1 - \dfrac{526.27}{773.95} \approx 0.32$

r² = 32%

To recover r from r², take the square root. If the relationship between the two variables is negative, then you would use $r = -\sqrt{r^2}$.
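The sums of squares for the deer mouse data can be recomputed directly from the nine (x, y) pairs. The following is a minimal Python sketch using the standard least squares formulas; the variable names are our own. The results should be close to the values quoted on the slides (SSTo = 773.95, SSResid = 526.27, r² = 32%), up to rounding.

```python
x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]         # distance from debris (m)
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]  # distance traveled (m)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# Total variation in y (about the mean) and leftover variation (about the line).
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - ss_resid / ss_total
print(f"SSTo = {ss_total:.2f}, SSResid = {ss_resid:.2f}, r^2 = {r_squared:.3f}")
```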

Page 52: Chapter 4

Standard Deviation about the Least Squares Regression Line

The standard deviation about the least squares regression line is

$s_e = \sqrt{\dfrac{SSResid}{n - 2}}$

The value of s_e can be interpreted as the typical amount an observation deviates from the least squares regression line.

The coefficient of determination (r²) measures the extent of variability about the least squares regression line relative to overall variability in y. This does not necessarily imply that the deviations from the line are small in an absolute sense.
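For the deer mouse data, the standard deviation about the line follows from the residual sum of squares and the sample size quoted in this transcript (SSResid = 526.27, n = 9). A minimal Python sketch, assuming the usual formula with n - 2 in the denominator:

```python
from math import sqrt

ss_resid = 526.27   # sum of squared residuals (m^2), from the slides
n = 9               # number of deer mouse observations

# Divide by n - 2 because two quantities (slope and intercept) were estimated.
s_e = sqrt(ss_resid / (n - 2))
print(round(s_e, 2))  # 8.67
```

This agrees with the value S = 8.67071 in the Minitab output.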

Page 53: Chapter 4

Partial output from the regression analysis of deer mouse data:

Predictor Coef SE Coef T P

Constant -7.69 13.33 -0.58 0.582

Distance to debris 3.234 1.782 1.82 0.112

S = 8.67071 R-sq = 32.0% R-sq(adj) = 22.3%

Analysis of Variance

Source DF SS MS F P

Regression 1 247.68 247.68 3.29 0.112

Resid Error 7 526.27 75.18

Total 8 773.95

The coefficient of determination (r²): Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.

The standard deviation (s): This is the typical amount by which an observation deviates from the least squares regression line.

The y-intercept (a): This value has no meaning in context since it doesn't make sense to have a negative distance.

The slope (b): The distance traveled to food increases by approximately 3.234 meters for an increase of 1 meter in the distance to the nearest debris pile.

In the Analysis of Variance table, the Resid Error sum of squares is SSResid and the Total sum of squares is SSTo.

Page 54: Chapter 4

Interpreting the Values of s_e and r²

A small value of s_e indicates that residuals tend to be small. This value tells you how much accuracy you can expect when using the least squares regression line to make predictions.

A large value of r² indicates that a large proportion of the variability in y can be explained by the approximate linear relationship between x and y. This tells you that knowing the value of x is helpful for predicting y.

A useful regression line will have a reasonably small value of s_e and a reasonably large value of r².

Page 55: Chapter 4

A study (Archives of General Psychiatry [2010]: 570-577) looked at how working memory capacity was related to scores on a test of cognitive functioning and to scores on an IQ test. Two groups were studied: one group consisted of patients diagnosed with schizophrenia and the other group consisted of healthy control subjects.

For the patient group, the typical deviation of the observations from the regression line is about 10.7, which is somewhat large. Approximately 14% (a relatively small amount) of the variation in the cognitive functioning score is explained by the linear relationship.

For the control group, the typical deviation of the observations from the regression line is about 6.1, which is smaller. Approximately 79% (a much larger amount) of the variation in the cognitive functioning score is explained by the regression line.

Thus, the regression line for the control group would produce more accurate predictions than the regression line for the patient group.

Page 56: Chapter 4

Putting it All Together

Describing Linear RelationshipsMaking Predictions

Page 57: Chapter 4

Steps in a Linear Regression Analysis

1. Summarize the data graphically by constructing a scatterplot.

2. Based on the scatterplot, decide if it looks like the relationship between x and y is approximately linear. If so, proceed to the next step.

3. Find the equation of the least squares regression line.

4. Construct a residual plot and look for any patterns or unusual features that may indicate that a line is not the best way to summarize the relationship between x and y. If none are found, proceed to the next step.

5. Compute the values of s_e and r² and interpret them in context.

6. Based on what you have learned from the residual plot and the values of s_e and r², decide whether the least squares regression line is useful for making predictions. If so, proceed to the last step.

7. Use the least squares regression line to make predictions.

Page 58: Chapter 4

Revisit the crime scene DNA data

Recall the scientists were interested in predicting the age of a crime scene victim (y) using the blood test measure (x).

Step 1: The scientists first constructed a scatterplot of the data.

Step 2: Based on the scatterplot, it does appear that there is a reasonably strong negative linear relationship between age and the blood test measure.

 

Page 59: Chapter 4

Step 4: A residual plot constructed from these data showed a few observations with large residuals, but these observations were not far removed from the rest of the data in the x direction. The observations were not judged to be influential. Also there were no unusual patterns in the residual plot that would suggest a nonlinear relationship between age and the blood test measure.

Step 5: se = 8.9 and r2 = 0.835

Approximately 83.5% of the variability in age can be explained by the linear relationship. A typical difference between the predicted age and the actual age would be about 9 years.

Page 60: Chapter 4

Step 6: Based on the residual plot, the large value of r2, and the relatively small value of se, the scientists proposed using the blood test measure and the least squares regression line as a way to estimate ages of crime victims.

Step 7: To illustrate predicting age, suppose that a blood sample is taken from an unidentified crime victim and that the value of the blood test measure is determined to be -10. The predicted age of the victim would be

Page 61: Chapter 4

Modeling Nonlinear Relationships

Page 62: Chapter 4

Choosing a Nonlinear Function to Describe a Relationship

Function Equation

Quadratic: $\hat{y} = a + b_1 x + b_2 x^2$

Square root: $\hat{y} = a + b\sqrt{x}$

Reciprocal: $\hat{y} = a + b\left(\dfrac{1}{x}\right)$

(Figure: example curves showing what each function looks like.)

Page 63: Chapter 4

Choosing a Nonlinear Function to Describe a Relationship

Function Equation

Log: $\hat{y} = a + b\ln(x)$ (The common log (base 10) may also be used.)

Exponential: $\hat{y} = e^{a + bx}$

Power: $\hat{y} = a x^b$

(Figure: example curves showing what each function looks like.)

While statisticians often use these nonlinear regressions, in AP Statistics, we will linearize our data using transformations. Then we can use what we already know about the least squares regression line.

Page 64: Chapter 4

Models that Involve Transforming Only x

The square root, reciprocal, and log models all have the form

$\hat{y} = a + b\,(\text{some function of } x)$

where the function of x is square root, reciprocal, or log.

Model Transformation

Square root: $x' = \sqrt{x}$

Reciprocal: $x' = \dfrac{1}{x}$

Log: $x' = \ln(x)$

Read x' as “x prime”.

This suggests that if the pattern in the scatterplot of (x, y) pairs looks like one of these curves, an appropriate transformation of the x values should result in transformed data that shows a linear relationship.

Let’s look at an example.

Page 65: Chapter 4

Is electromagnetic radiation from phone antennae associated with declining bird populations? The accompanying data are x = electromagnetic field strength (Volts per meter) and y = sparrow density (sparrows per hectare).

Field Strength   Sparrow Density

0.11 41.71

0.20 33.60

0.29 24.74

0.40 19.50

0.50 19.42

0.61 18.74

1.01 24.23

1.10 22.04

0.70 16.29

0.80 14.69

0.90 16.29

1.20 16.97

1.30 12.83

1.41 13.17

1.50 4.64

1.80 2.11

1.90 0.00

3.01 0.00

3.10 14.69

3.41 0.00

First look at a scatterplot of the data.

The data is curved and looks similar to the graph of the log model.

Page 66: Chapter 4

Field Strength vs. Sparrow Density Continued

Ln Field Strength   Sparrow Density

-2.207 41.71

-1.609 33.60

-1.238 24.74

-0.916 19.50

-0.693 19.42

-0.494 18.74

0.010 24.23

0.095 22.04

-0.357 16.29

-0.223 14.69

-0.105 16.29

0.182 16.97

0.262 12.83

0.344 13.17

0.405 4.64

0.588 2.11

0.642 0.00

1.102 0.00

1.131 14.69

1.227 0.00

Second, we will transform the data by using $x' = \ln(x)$ . . .

. . . and graph the scatterplot of y on x’.

Notice that the transformed data is now linear. We can find the least squares regression line.

Sparrow Density = 14.8 - 10.5 ln (Field Strength)

Predictor Coef SE Coef T P

Constant 14.805 1.238 11.96 0.000

Ln (field strength) -10.546 1.389 -7.59 0.000

S = 5.50641 R-Sq = 76.2% R-Sq(adj) = 74.9%

Page 67: Chapter 4

Field Strength vs. Sparrow Density Continued

Sparrow Density = 14.8 - 10.5 ln (Field Strength)

Predictor Coef SE Coef T P

Constant 14.805 1.238 11.96 0.000

Ln (field strength) -10.546 1.389 -7.59 0.000

S = 5.50641 R-Sq = 76.2% R-Sq(adj) = 74.9%

A residual plot from the least squares regression line fit to the transformed data has no apparent patterns or unusual features. It appears that the log model is a reasonable choice for describing the relationship between sparrow density and field strength.

The value of R² for this model is 0.762 and s_e = 5.5.

Page 68: Chapter 4

Field Strength vs. Sparrow Density Continued

Sparrow Density = 14.8 - 10.5 ln (Field Strength)

Predictor Coef SE Coef T P

Constant 14.805 1.238 11.96 0.000

Ln (field strength) -10.546 1.389 -7.59 0.000

S = 5.50641 R-Sq = 76.2% R-Sq(adj) = 74.9%

This model can now be used to predict sparrow density from field strength. For example, if the field strength is 1.6 Volts per meter, what is the prediction for the sparrow density?

$\hat{y} = 14.805 - 10.546\ln(1.6) \approx 9.85$ sparrows per hectare
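The prediction asked for here can be carried out in Python using the coefficients from the Minitab output (Constant 14.805, slope -10.546 on ln(field strength)). This is a minimal sketch; the function name `predict_density` is our own.

```python
from math import log

# Coefficients from the Minitab output for the log model.
A, B = 14.805, -10.546

def predict_density(field_strength):
    """Predicted sparrow density (sparrows per hectare) from field strength (V/m)."""
    x_prime = log(field_strength)   # transform x first: x' = ln(x)
    return A + B * x_prime

print(round(predict_density(1.6), 2))  # 9.85
```

Note that the transformation is applied to x before the fitted line is used; the line is linear in ln(x), not in x itself.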

Page 69: Chapter 4

Models that Involve Transforming y

Let’s consider the remaining nonlinear models, the exponential model and the power model.

Exponential Model

$\hat{y} = e^{a + bx}$

Using properties of logarithms, it follows that . . .

$\ln(\hat{y}) = a + bx$

Power Model

$\hat{y} = a x^b$

Using properties of logarithms, it follows that . . .

$\ln(\hat{y}) = \ln(a) + b\ln(x)$

Notice that using the transformations below, the exponential and power models are linearized.

Model Transformation

Exponential: $y' = \ln(y)$

Power: $x' = \ln(x)$ and $y' = \ln(y)$

Page 70: Chapter 4

In a study of factors that affect the survival of loon chicks in Wisconsin, a relationship between the pH of lake water and blood mercury level in loon chicks was observed. The researchers thought it possible that the pH of the lake could be related to the type of fish that the loons ate.

(Figure: scatterplot of blood mercury level versus lake pH.)

The curve appears to be exponential, therefore use $y' = \ln(y)$ to transform the data.

The scatterplot of ln(blood mercury level) on lake pH appears linear.

The linear model is

Ln(blood mercury level) = 1.06 - 0.396 Lake pH

Predictor Coef SE Coef T P

Constant 1.0550 0.5535 1.91 0.065

Lake pH -0.3956 0.0826 -4.79 0.000

S = 0.6056 R-Sq = 39.6% R-Sq(adj) = 37.8%
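Because the fitted line predicts ln(blood mercury level), a prediction on the original scale requires undoing the logarithm with the exponential function. The sketch below uses the coefficients from the Minitab output (1.0550 and -0.3956). The pH values passed in are hypothetical, chosen only for illustration, and the function name is our own; the slides do not state the units of the mercury measurement, so none are given here.

```python
from math import exp, log

# Coefficients from the Minitab output for ln(blood mercury level) on lake pH.
A, B = 1.0550, -0.3956

def predict_mercury(ph):
    """Predicted blood mercury level: back-transform the linear prediction of ln(y)."""
    ln_y_hat = A + B * ph      # prediction on the transformed (log) scale
    return exp(ln_y_hat)       # y-hat = e^(a + b*x), the exponential model

# Hypothetical lake pH values (illustration only).
for ph in (5.0, 6.0, 7.0):
    print(ph, round(predict_mercury(ph), 3))
```

Each one-unit increase in pH multiplies the predicted mercury level by the same factor, e^(-0.3956), which is the hallmark of an exponential model.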

Page 71: Chapter 4

Choosing Among Different Possible Nonlinear Models

Often there is more than one reasonable model that could be used to describe a nonlinear relationship between two variables.

How do you choose a model?

1) Consider scientific theory. Does it suggest what form the relationship should take?

2) In the absence of scientific theory, choose a model that has small residuals (small s_e) and accounts for a large proportion of the variability in y (large R²).

Page 72: Chapter 4

Common Mistakes

Page 73: Chapter 4

Avoid these Common Mistakes

1. Correlation does not imply causation. A strong correlation implies only that the two variables tend to vary together in a predictable way, but there are many possible explanations for why this is occurring other than one variable causing change in the other.

Don’t fall into this trap!

The number of fire trucks at a house that is on fire and the amount of damage from the fire have a strong, positive correlation.

So, to avoid a large amount of damage if your house is on fire, don’t allow several fire trucks to come to your house?

Page 74: Chapter 4

Avoid these Common Mistakes

2. A correlation coefficient near 0 does not necessarily imply that there is no relationship between two variables. Although the variables may be unrelated, it is also possible that there is a strong but nonlinear relationship.

Be sure to look at a scatterplot!


Page 75: Chapter 4

Avoid these Common Mistakes

3. The least squares regression line for predicting y from x is NOT the same line as the least squares regression line for predicting x from y.

The ages (x, in months) and heights (y, in inches) of seven children are given.

x 16 24 42 60 75 102 120

y 24 30 35 40 48 56 60

To predict height from age:

To predict age from height:
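Both lines can be computed from the seven (age, height) pairs above. The following is a minimal Python sketch; the helper name `least_squares` is our own. It fits height on age and age on height separately, then compares the second line with what you would get by algebraically solving the first line for x.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

age = [16, 24, 42, 60, 75, 102, 120]     # x, in months
height = [24, 30, 35, 40, 48, 56, 60]    # y, in inches

a_yx, b_yx = least_squares(age, height)  # line for predicting height from age
a_xy, b_xy = least_squares(height, age)  # line for predicting age from height

print(f"height-hat = {a_yx:.2f} + {b_yx:.3f} * age")
print(f"age-hat    = {a_xy:.2f} + {b_xy:.3f} * height")

# Solving the first line for age gives slope 1 / b_yx, which is NOT b_xy.
print(f"slope from inverting the first line: {1 / b_yx:.3f}")
```

The two slopes multiply to r², so the lines coincide only when r = ±1.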

Page 76: Chapter 4

Avoid these Common Mistakes

4. Beware of extrapolation. Using the least squares regression line to make predictions outside the range of x values in the data set often leads to poor predictions.

Predict the height of a child that is 15 years (180 months) old.

It is unreasonable that a 15-year-old would be 81.6 inches or 6.8 feet tall.

Page 77: Chapter 4

Avoid these Common Mistakes

5. Be careful in interpreting the value of the intercept of the least squares regression line. In many instances interpreting the intercept as the value of y that would be predicted when x = 0 is equivalent to extrapolating way beyond the range of x values in the data set.

The ages (x, in months) and heights (y, in inches) of seven children are given.

x 16 24 42 60 75 102 120

y 24 30 35 40 48 56 60

Page 78: Chapter 4

Avoid these Common Mistakes

6. Remember that the least squares regression line may be the “best” line, but that doesn’t necessarily mean that the line will produce good predictions.

This has a relatively large s_e, thus we can’t accurately predict IQ from working memory capacity.

Page 79: Chapter 4

Avoid these Common Mistakes

7. It is not enough to look at just r² or just s_e when evaluating the regression line. Remember to consider both values. In general, you would like to have both a small value for s_e and a large value for r².

A small value of s_e indicates that deviations from the line tend to be small.

A large value of r² indicates that the linear relationship explains a large proportion of the variability in the y values.

Page 80: Chapter 4

Avoid these Common Mistakes

8. The value of the correlation coefficient, as well as the values for the intercept and slope of the least squares regression line, can be sensitive to influential observations in the data set, particularly if the sample size is small.

Be sure to always start with a plot to check for potential influential observations.