More on Two-Variable Data. Chapter Objectives Identify settings in which a transformation might be...

Post on 20-Dec-2015

More on Two-Variable Data

Chapter Objectives

• Identify settings in which a transformation might be necessary in order to achieve linearity.

• Use transformations involving powers and logarithms to linearize curved relationships.

• Explain what is meant by a two-way table, and describe its parts.

• Give an example of Simpson’s Paradox.

• Explain what gives the best evidence for causation.

• Explain the criteria for establishing causation when experimentation is not feasible.

The Goal

• Our goal is to fit a model to curved data so that we can make predictions as we did in chapter 3.

• HOWEVER, the only statistical tool we have to fit a model is the least-squares regression model.

• THEREFORE, in order to find a model for curved data, we must first “straighten it out”….

Transforming Relationships

• Data that displays a curved pattern can be modeled by a number of different functions.

• Two most common:
– Exponential ($y = AB^x$)
– Power ($y = Ax^B$)

• Chapter 4 focuses on these two models

pp. 195 – 6

• Example 4.1

• Brain weight v. body weight

• Note about variables:
– Sometimes we wish to transform x, or y, or both x and y.
– Therefore we refer to variables generically as t.

Why

• Linear transformations cannot straighten a curved relationship between two variables.

• Because of this, we must resort to functions that are not linear.

A Note about Monotonic Functions

4.1

• A. y = 2.54x: monotonic increasing

• B. y = 60/x: monotonic decreasing

• C. circumference = π(diameter): monotonic increasing

• D. SquaredError = $(\text{time} - 5)^2$: not monotonic

Figure 4.5

• What can we learn?
– The graph of a linear function (power p = 1) is a straight line.
– Powers greater than 1 (like p = 2 and p = 4) give graphs that bend upward. The sharpness of the bend increases as p increases.
– Powers less than 1 but greater than 0 (like p = 0.5) give graphs that bend downward.
– Powers less than 0 (like p = -0.5 and p = -1) give graphs that decrease as x increases. Greater negative values of p result in graphs that decrease more quickly.
– Look at the p = 0 graph. You may be surprised that this is not the graph of $y = x^0$. Why not? The 0th power $x^0$ is just the constant 1, which is not very useful. The p = 0 entry in the figure is not constant; it is the logarithm, log x. That is, the logarithm fits into the hierarchy of power transformations at p = 0.
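The ladder of power transformations can be sketched in a few lines of Python. This is only an illustration: the helper name `power_transform` and the sample x values are made up here, and base-10 logarithms are assumed for the p = 0 rung.

```python
import math


def power_transform(x, p):
    """Return x**p, except at p = 0, where the ladder uses log10(x)."""
    if x <= 0:
        raise ValueError("x must be positive for power and log transformations")
    if p == 0:
        return math.log10(x)
    return x ** p


# Illustrative x values (not from the text), transformed at each rung of the ladder.
xs = [1, 2, 4, 8]
for p in (2, 1, 0.5, 0, -0.5, -1):
    print(p, [round(power_transform(x, p), 3) for x in xs])
```

Printing the rows side by side shows the pattern in the list above: positive powers increase with x, negative powers decrease, and the p = 0 row is the logarithm rather than a constant.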

pp. 201 - 202

• Example 4.2 runs through several steps from the ladder of power transformations.

• This emphasizes that the process can be one of
– (a) making a good guess, based on observations of a graph of the data, about the type of transformation needed and
– (b) trying several types of the transformation chosen.

• This can get tedious, so the next section introduces a more analytic approach.

• The first approach is to look for an exponential growth pattern, which has the advantage that it can be linearized by taking logarithms (of the response variable) to transform the data.

4.3

• Weight = $c_1(\text{height})^3$ and strength = $c_2(\text{height})^2$; therefore, strength = $c(\text{weight})^{2/3}$, where c is a constant.
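The step from the two height relationships to the weight relationship can be written out as follows. The explicit form of the constant c is worked out here and is not given on the slide.

```latex
\begin{align*}
\text{weight} &= c_1\,(\text{height})^3
  \;\Rightarrow\; \text{height} = \left(\frac{\text{weight}}{c_1}\right)^{1/3},\\
\text{strength} &= c_2\,(\text{height})^2
  = c_2\left(\frac{\text{weight}}{c_1}\right)^{2/3}
  = c\,(\text{weight})^{2/3},
  \qquad c = \frac{c_2}{c_1^{2/3}}.
\end{align*}
```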

4.4

• A graph of the power law $y = x^{2/3}$ shows that strength does not increase linearly with body weight, as would be the case if a person 1 million times as heavy as an ant could lift 1 million times more than the ant. Rather, strength increases more slowly. For example, if weight is multiplied by 1000, strength will increase by a factor of $(1000)^{2/3} = 100$.

4.5

• Let y = average heart rate and x = body weight.

• Kleiber’s law says that total energy consumed is proportional to the three-fourths power of body weight, that is, Energy = $c_1 x^{3/4}$.

• But total energy consumed is also proportional to the product of the volume of blood pumped by the heart and the heart rate, that is, Energy = $c_2(\text{volume})\,y$.

• The volume of blood pumped by the heart is proportional to body weight, that is, Volume = $c_3 x$.

• Putting these three equations together yields $c_1 x^{3/4} = c_2(\text{volume})\,y = c_2(c_3 x)\,y$.

• Solving for y, we obtain

\[ y = \frac{c_1 x^{3/4}}{c_2 c_3 x} = c\,x^{-1/4}, \qquad c = \frac{c_1}{c_2 c_3} \]

Exponential Growth

• Linear growth: adding a fixed increment in each equal time period.

• Exponential growth: multiplying by a fixed number in each equal time period.
– Can also be looked at as growing by a fixed percentage.

p. 205

• Example 4.4

• Is this exponential growth?

• What is the projected amount for 2005?

• Actual was 203,000,000 (2005)

• Other interesting statistics:
– 2,000,000,000 cell phones worldwide
• 4.5% world without
– Average American spends 13 talking hours per month
– Average American in 18 – 24 age group spends 22 talking hours per month

Texting in the United States

Logarithm

$\log_b x = y$ if and only if $b^y = x$

The rules for logarithms are

\[ \log(AB) = \log A + \log B \]

\[ \log\left(\frac{A}{B}\right) = \log A - \log B \]

\[ \log X^p = p \log X \]

p. 209

• Example 4.6

4.6

• A. [Figure: plot of Acres (0 to 3,000,000) against Year (1977 to 1982)]

4.6

• B. 226260/63042 = 3.59

907075/226260 = 4.01

2826095/907075 = 3.12

• C. log y yields 4.7996, 5.3546, 5.9576, 6.4512

4.6

• C. [Figure: plot of log(acres) (4.5 to 6.5) against Year (1977 to 1982)]

4.6

• D. use calculator to confirm

• E. The residual plot of the transformed data shows no clear pattern, so the line is a reasonable model for these points.

4.6

• F.

\[ \log \hat{y} = -1094.51 + 0.5558x \]

\[ 10^{\log \hat{y}} = 10^{-1094.51 + 0.5558x} \]

\[ \hat{y} = 10^{-1094.51 + 0.5558x} \]

\[ \hat{y} = 10^{-1094.51} \cdot 10^{0.5558x} \]

4.6

• G. The predicted number of acres defoliated in 1982 is the exponential function evaluated at 1982, which gives 10,719,964.92 acres.
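Parts C through G can be reproduced roughly as follows. This is a sketch under stated assumptions, not the textbook's procedure: the years 1978 to 1981 are inferred from the plot axis and the 1982 prediction, the first acreage is taken as 63,042 because that value matches the log value 4.7996 in part C, and NumPy's `polyfit` stands in for the calculator regression.

```python
import numpy as np

# Assumption: the four observations are for 1978 through 1981.
years = np.array([1978, 1979, 1980, 1981], dtype=float)
# Assumption: first value is 63,042 (consistent with log y = 4.7996 in part C).
acres = np.array([63042, 226260, 907075, 2826095], dtype=float)

# Part C: transform the response with base-10 logarithms.
log_acres = np.log10(acres)

# Part D: least-squares line for log(acres) against year.
slope, intercept = np.polyfit(years, log_acres, 1)

# Part E: residuals of the transformed data around the fitted line.
residuals = log_acres - (intercept + slope * years)

# Parts F and G: undo the logarithm to predict acres for 1982.
predicted_1982 = 10 ** (intercept + slope * 1982)

print(f"log10(acres) = {intercept:.2f} + {slope:.4f} * year")
print("residuals:", np.round(residuals, 4))
print(f"predicted acres in 1982: {predicted_1982:,.0f}")
```

Because the prediction raises 10 to a fitted power, small rounding differences in the coefficients change the predicted acreage noticeably, so unrounded coefficients should be used for the final step.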

4.9

\[ 2^{4 \times 1} = 16 \]

\[ 2^{4 \times 5} = 1{,}048{,}576 \]

4.10

• A.

Year    # children killed
1951    2
1952    4
1953    8
1954    16
1955    32
1956    64
1957    128
1958    256
1959    512
1960    1024

4.10

• B. [Figure: plot of # Children Killed (0 to 1200) against Year (1950 to 1961)]

4.10

• C. If x = number of years after 1950, then y = the number of children killed x years after 1950 = $2^x$.

At x = 45, $y = 2^{45} \approx 3.52 \times 10^{13}$, or about 35,200,000,000,000.

4.10

• D. [Figure: plot of log(# children killed) (0 to 3.5) against Year (1950 to 1961)]

4.10

• E. b = 0.3010

a = -587.008

\[ \log \hat{y} = -587.008 + 0.3010x \]
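The coefficients in part E follow directly from the doubling rule, which the short Python check below illustrates. It assumes that x in part E is the calendar year (the intercept only makes sense that way), whereas part C counts years after 1950; the variable names are illustrative.

```python
import math

# Part A: the count doubles every year, starting from 2 in 1951.
years = list(range(1951, 1961))
killed = [2 ** (year - 1950) for year in years]

# Part C: extrapolating the doubling rule to 1995 (45 years after 1950).
y_1995 = 2 ** 45

# Part E: log10(2**(year - 1950)) = (year - 1950) * log10(2),
# so the slope is log10(2) and the intercept is -1950 * log10(2).
slope = math.log10(2)
intercept = -1950 * math.log10(2)

print(killed)
print(f"{y_1995:,}")
print(f"log10(y) = {intercept:.3f} + {slope:.4f} * year")
```

The 1995 figure is far larger than any possible population, which is why this exercise serves as a warning about extrapolating an exponential model.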

p. 215

• Exponential growth models become linear when we apply the logarithm transformation to the response variable y.

• Power law models become linear when we apply the logarithm transformation to both variables.
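A small Python sketch of both linearizations, using made-up parameters A = 3 and B = 1.5 and noise-free synthetic data (none of these values come from the text). Fitting a line to the transformed data recovers the original parameters.

```python
import numpy as np

# Hypothetical parameters and x values, chosen only for illustration.
A, B = 3.0, 1.5
x = np.arange(1.0, 11.0)

y_exp = A * B ** x   # exponential model y = A * B^x
y_pow = A * x ** B   # power model y = A * x^B

# Exponential: log y = log A + (log B) * x, so transform y only.
slope_e, intercept_e = np.polyfit(x, np.log10(y_exp), 1)
A_exp, B_exp = 10 ** intercept_e, 10 ** slope_e

# Power: log y = log A + B * log x, so transform both x and y.
slope_p, intercept_p = np.polyfit(np.log10(x), np.log10(y_pow), 1)
A_pow, B_pow = 10 ** intercept_p, slope_p

print(f"exponential fit: A = {A_exp:.3f}, B = {B_exp:.3f}")
print(f"power fit:       A = {A_pow:.3f}, B = {B_pow:.3f}")
```

Note the difference in how the slope is read: in the exponential case the base B is 10 raised to the slope, while in the power case the slope is the exponent B itself.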

4.17

• A.

Year    Value
1       537.50
2       577.81
3       621.15
4       667.73
5       717.81
6       771.65
7       829.52
8       891.74
9       958.62
10      1030.52

4.17

• B. [Figure: plot of Value (500 to 1100) against Year (0 to 11)]

4.17

• C. 2.73, 2.76, 2.79, 2.82, 2.86, 2.89, 2.92, 2.95, 2.98, 3.01

[Figure: plot of log(Value) (2.70 to 3.05) against Year (0 to 12)]

4.18

• Alice has $500(1.075)^{25} = 3049.17$

• Fred has $500 + 100(25) = 3000.00$
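Exercises 4.17 and 4.18 appear to contrast exponential growth with linear growth. The sketch below assumes a starting value of 500, growth of 7.5% per year for the exponential case, and an increment of 100 per year for the linear case. These figures are inferred from the table in 4.17 and the results in 4.18, and the function names are illustrative.

```python
def compound_value(principal, rate, years):
    """Exponential growth: multiply by (1 + rate) once per year."""
    return principal * (1 + rate) ** years


def linear_value(principal, increment, years):
    """Linear growth: add a fixed increment once per year."""
    return principal + increment * years


# Exercise 4.17, part A: value at the end of each of the first 10 years.
table = [round(compound_value(500, 0.075, n), 2) for n in range(1, 11)]

# Exercise 4.18: both accounts after 25 years.
alice = compound_value(500, 0.075, 25)
fred = linear_value(500, 100, 25)

print(table)
print(f"Alice: {alice:.2f}  Fred: {fred:.2f}")
```

Under these assumptions the linear account is ahead in the early years, and the compounding account overtakes it only near the end of the 25 years.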

Cautions About Correlation and Regression

Our Tools for Describing Data Sets

• Correlation
– r: Strength, form, direction

• Regression
– Generalized pattern
– Useful for predictions

• Limitations of our tools
– Correlation and regression describe only linear relationships
– The correlation “r” and the “LSRL” are NOT RESISTANT

Other Cautions

• Extrapolation
– The use of a regression line for prediction far outside the domain used.
– Examples:
• Age v. Height
• Time v. Death Rate (Swine Flu)
• Time v. Water Level of a Lake
• Time v. Children gunned down

Other Cautions

• Lurking Variables
– A variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among these variables.
– Can falsely suggest relationship between x and y
– Can hide actual relationship between x and y

Other Cautions

• Lurking Variables
– An example….

• There's this guy who's going to clean the windows of a mental asylum. A patient follows him and shouts, "I gotta secret, I gotta secret..." He ignores the patient. Again the patient follows him, and again he ignores the cries. By the time he has nearly finished the building, he's really curious about what the patient's secret is, so he decides to ask. The patient pulls a matchbox out of his pocket, opens it, and puts it on a table. Out crawls a little spider. The patient says, "Spider, go left," and the spider walks to its left a bit. Then he says, "Spider, go right," and the spider walks to its right a little bit. He says, "Spider, turn around, walk forward, then go right," and sure enough the spider turns around, walks forward, and then goes right a bit. The window cleaner is amazed. "Wow!" he says. "That's amazing!" "No, that's not my secret," says the patient. "Watch." He picks up the spider, pulls all its legs off, and puts it back on the table. "Spider, go right." The spider doesn't move. "Spider, go left." The spider doesn't move. "Spider, turn around." Again the spider doesn't move. "There!" he says. "That's my secret: if you pull all of a spider's legs off, it goes deaf."

• The answer is not available in the original data, but was discovered through some additional research on the Buick Estate Wagon. These data were collected by Consumers Union on a test track (rather than using the EPA test values for fuel efficiency) following the manufacturer's recommendations for each car's maintenance. Additional research revealed that starting with this model year, Buick recommended a higher tire inflation pressure for the Buick Estate Wagon. The recommended inflation pressure level was higher than the level for other cars in the survey. Harder tires present less rolling resistance and improve gas mileage; therefore, the Buick Estate Wagon outperformed our expectations based on our regression model, which did not account for tire inflation pressure. In our model, tire pressure is a lurking variable: a variable that helps predict gas mileage but is not included in the model.

Other Cautions

• Using averaged data
– Pay particular attention to data that has been averaged
– The correlation and LSRL of these data sets should not be applied to the individuals that the averages came from

• Example
– Examining monthly data and attempting to apply it to a day of that month.
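A short simulation shows why averaged data calls for care. Everything here is made up for illustration: 12 groups (think of months) with 30 individuals each, a modest underlying trend, large individual-level noise, and a fixed random seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical setup: 12 groups with 30 individuals in each group.
n_groups, per_group = 12, 30
group = np.repeat(np.arange(n_groups), per_group)

# Individuals follow a weak trend across groups plus large individual noise.
x = group + rng.normal(0, 1, group.size)
y = 2 * group + rng.normal(0, 10, group.size)

# Correlation computed from the individual observations.
r_individual = np.corrcoef(x, y)[0, 1]

# Correlation computed from the group averages only.
x_means = np.array([x[group == g].mean() for g in range(n_groups)])
y_means = np.array([y[group == g].mean() for g in range(n_groups)])
r_averages = np.corrcoef(x_means, y_means)[0, 1]

print(f"r for individuals: {r_individual:.2f}")
print(f"r for averages:    {r_averages:.2f}")
```

Averaging removes most of the individual-level scatter, so the correlation among averages is typically much stronger than the correlation among the individuals the averages came from.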

Beware the post-hoc fallacy

“Post hoc, ergo propter hoc.”

To avoid falling for the post-hoc fallacy (assuming that an observed correlation is due to causation), you must put any statement of relationship through sharp inspection.

Causation cannot be established “after the fact.” It can only be established through well-designed experiments. {see Ch 5}

Explaining Association

Strong associations can generally be explained by one of three relationships.

Confounding: x may cause y, but y may instead be caused by a confounding variable z

Common Response: x and y are reacting to a lurking variable z

Causation: x causes y

Causation

Causation is not easily established.

The best evidence for causation comes from experiments that change x while holding all other factors fixed.

Even when direct causation is present, it is rarely a complete explanation of an association between two variables.

Even well established causal relations may not generalize to other settings.

Common Response

“Beware the Lurking Variable”

The observed association between two variables may be due to a third variable.

Both x and y may be changing in response to changes in z.

Confounding

Two variables are confounded when their effects on a response variable cannot be distinguished from each other. Confounding prevents us from drawing conclusions about causation.

We can help reduce the chances of confounding by designing a well-controlled experiment.

Example

People with two cars tend to live longer than people who own only one car. Owning three cars is even better, and so on. What might explain the association?

p. 238

• 4.38: People who use artificial sweeteners in place of sugar tend to be heavier than people who use sugar. Does artificial sweetener use cause weight gain?
– There may be a causative effect, but in the direction opposite to the one suggested: People who are overweight are more likely to be on diets, and so choose artificial sweeteners over sugar. Also, heavier people are at a higher risk to develop diabetes; if they do, they are likely to switch to artificial sweeteners.

p. 238

4.39: Women who work in the production of computer chips have abnormally high numbers of miscarriages. The union claimed chemicals cause the miscarriages. Another explanation may be the fact these workers spend a lot of time on their feet.
– Time standing up is a confounding variable in this case.

p. 239

4.41: Children who watch many hours of TV get lower grades on average than those who watch less TV. Why does this fact not show that watching TV causes low grades?

p. 239

4.43: High school students who take the SAT, enroll in an SAT coaching course, and take the SAT again raise their mathematics score from an average of 521 to 561. Can this increase be attributed entirely to taking the course?

The effects of coaching are confounded with those of experience. A student who has taken the SAT once may improve his or her score on the second attempt because of increased familiarity with the test.