
Residuals

Outliers and influential points.

Correlation vs. causation

When we use correlation, we make certain assumptions

about the data:

1. A straight-line relationship.

2. Interval data

3. Random sampling

4. Normally distributed characteristics (approximately normal is OK)

Today we’re going to look at ways these assumptions can

be violated.

First, a tool for finding problems in correlation: Residuals.

One way to show a correlation is to fit a line through the

middle of the data. (Line of best fit)

If the line is definitely upwards and keeps close to the data,

you have a correlation.

Since a line won't perfectly describe the relationship between

two variables, especially when randomness is involved,

there’s some error left over.

These leftovers are called residuals. (as in “left behind”)
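As a minimal sketch with made-up data, a line of best fit and its residuals can be computed like this:

    import numpy as np

    # Made-up illustrative data (not from the lecture): x is the predictor, y the response.
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

    # Least-squares line of best fit.
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = slope * x + intercept

    # Residuals are the leftovers: observed minus fitted values.
    residuals = y - fitted
    print(residuals)

Plotting these residuals against x gives the residual plot discussed below.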

Looking at a graph of the residuals can magnify patterns

that were not immediately obvious in the data before.

In this case, the points dip below the line and then come

back above it.

If the relationship between two variables really is linear,

then any other patterns should be random noise.

That means if we see any obvious pattern in the residuals,

including this one, a correlation coefficient isn’t going to tell

you the whole story.

Sometimes people try to correlate interval data to

something ordinal or nominal. This is dumb.

These are residuals from trying to correlate a yes/no

response to something interval.

Using ordinal data will leave huge jumps from one level to the next. Nominal data simply won't fit on a scatterplot.

Both cases violate the assumption of interval data.

Sometimes the pattern isn't a trend in the center of the data; it can also be a trend in the spread of the data.

If the variation in y changes as x changes, the relationship between x and y is called heteroscedastic.

Hetero means “different” and Scedastic means “scattered”.

Heteroscedastic means there is a different amount of

scatter at different data points. If you encounter it, it could

lower your correlation so it’s worth mentioning. (Look for

fan shapes)

If the variation in y is the same everywhere, we call that

Homoscedastic, meaning “same-scatter”
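One rough way to check for a fan shape numerically, sketched here with made-up residuals, is to ask whether the size of the residuals grows with x:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 10, 50)
    residuals = rng.normal(0, 0.2 * x)        # spread grows with x: heteroscedastic by construction

    # If |residual| trends upward with x, the scatter is not constant (a fan shape).
    fan_check = np.corrcoef(x, np.abs(residuals))[0, 1]
    print(round(fan_check, 2))                # clearly positive here; near zero suggests homoscedasticity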

If the Toronto Stock Exchange Index were correlated to

something linearly, the residual graph would resemble this.

As the index numbers get higher, they tend to jump up and

down more. Going from 10,000 to 10,100 is no big deal.

Going from 500 to 600 is a big deal.

Residuals should look like this: a horizontal band of noise.

There should be no obvious trends or patterns.

The occasional point can fall outside the band without issue.

But how far out is too far?

What happens when you get there?

Outliers are a violation of the assumption of normality, and

correlation can be sensitive to outliers.

A value that is far from the rest of the data can throw off

the correlation, or create a false correlation.

Example: In the 1960’s, a survey was done to get various

facts about TV stars.

Intelligence Quotient (IQ) was found to be positively

correlated with shoe size. (r = 0.503, n = 30)

(This story is true; the exact data has been made up.)

Could this be a fluke of the data? Did they falsely find a

correlation? (r=.503, n=30)

t-score = r√(n−2) / √(1−r²) = 3.08

t* = 2.048 for df=28, .05 significance (2 tailed)

t* = 2.763 for df=28, .01 sig

So p < 0.01. (By computer: p=.0046)
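The t-score comes from the usual test of a correlation coefficient (see the formula above). A quick sketch in Python, assuming scipy is available, reproduces the slide's numbers:

    from math import sqrt
    from scipy import stats

    r, n = 0.503, 30
    t = r * sqrt(n - 2) / sqrt(1 - r**2)   # test statistic for H0: no correlation
    p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p-value
    print(round(t, 2), round(p, 4))        # about 3.08 and 0.0046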

That means it's possible, but highly unlikely, that we'd see a correlation of this strength in uncorrelated data by chance alone.

Standard practice is to visualize the data when possible.

There’s no obvious trend, except...

What is that?

There’s one person with very high IQ and very large shoes.

In other words, an outlier.

It's Bozo the Clown.

He had huge clown shoes, and he was a verified genius.

We can’t assume normality with that Bozo in the way.

So what now?

We could remove Bozo from the dataset, but if we remove

data points we don’t like, we could come to almost any

“conclusion” we wanted.

That's why we have the assumption of random selection.

If Bozo can’t be in the sample, then his chance of being

selected is zero (no longer equal chance in population).

We can remove him and keep randomization, but it implies

that Bozo was not in the population of interest. (Equal

chance of selection among non-clowns?)

Most respondents wear shoes that fit their feet. Bozo wore

absurdly large shoes, much larger than his feet, for

entertainment.

So dismissing the Bozo data as an outlier is reasonable; his shoes are fundamentally different.

Let’s try the analysis again without including Bozo’s data.

r = -.006, n=29

Is there still a significant correlation?

...not even close.

t* = 1.314 at .20 significance,

so p-value > .20 (actually p-value = 0.9975)
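The same test, re-run on the reduced sample as a sketch (with r rounded to -.006 the p-value comes out near .975; the exact .9975 above presumably reflects the un-rounded r):

    from math import sqrt
    from scipy import stats

    r, n = -0.006, 29                      # correlation with Bozo removed
    t = r * sqrt(n - 2) / sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(round(t, 3), round(p, 3))        # a tiny t-score and a p-value near 1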

So removing Bozo the Clown from the dataset completely

changed our results.

Bozo wasn't just an outlier; he was an influential outlier, because he alone influenced the results noticeably.

Not every outlier is influential, and

Not every influential point is an outlier.

Outliers are points that don’t fit the pattern. Correlation

assumes a linear pattern.

r = .032, p-value = .866

An outlier is anything outside the linear trend.

For a point to be influential, it just has to change the linear trend. If it's far enough from the mean in the x-direction, it doesn't have to be far from the trend to change the results.

Changing this one point from IQ 100 to IQ 110 changes the correlation from .016 to .155.
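A small sketch with made-up data (not the IQ survey) shows the same leverage effect: nudging a single point that sits far from the mean of x visibly shifts r, even though the point stays close to the trend.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.append(rng.normal(100, 5, 29), 140.0)   # one point far from the mean of x
    y = np.append(rng.normal(50, 5, 29), 50.0)

    r_before = np.corrcoef(x, y)[0, 1]
    y[-1] = 60.0                                   # nudge only the high-leverage point
    r_after = np.corrcoef(x, y)[0, 1]
    print(round(r_before, 3), round(r_after, 3))   # a modest nudge, a visible jump in r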

More formally, an outlier is anything with a large residual.

Since normality, and hence symmetry, is assumed, the 3-standard-deviation rule applies.

Anything with a residual of 3 standard deviations above or

below zero is considered an outlier.
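A sketch of that rule with made-up data, taking the standard deviation over the residuals themselves:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(0, 1, 100)
    y = 2 * x + rng.normal(0, 1, 100)
    y[0] += 8                                  # plant one gross outlier

    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    flagged = np.abs(resid) > 3 * resid.std()  # the 3-standard-deviation rule
    print(np.where(flagged)[0])                # index 0 should be the flagged point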

If residuals show heteroscedasticity, outliers are more

likely to show up, and in greater numbers.

Look at your data closely. Get right in its face.

You can use statistics and graphs to intimately know a

dataset, but numbers and pictures aren’t a substitute for

reasoning.

Just because two things happen together (or one after

another) doesn’t mean that one of them causes the other.

A correlation between two things doesn’t imply causation.

Consider this crime and sales data of a large city over five

years (one point = one month)

Homicide rates are strongly positively correlated with ice

cream sales. (r = .652, n=60)

Jumping from correlation to causation, we find that

availability of ice cream is driving people to kill each other.

But correlation works both ways. Ice cream sales are

correlated with homicide rates.

That also must mean that nothing builds an appetite for

cold, cold ice cream like...

cold, cold murder.

Causation works in one direction. Correlation works both ways. That alone should be enough to keep us from making that leap.

Often there's a common explanation for increases in both variables. In this case it's heat. Both increase in summer.
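A made-up simulation (not the city's actual data) shows how a common cause produces a correlation with no causal link between the two variables themselves:

    import numpy as np

    rng = np.random.default_rng(3)
    heat = rng.normal(20, 8, 60)                          # monthly temperature, the common cause
    ice_cream = 100 + 5 * heat + rng.normal(0, 20, 60)    # sales rise with heat
    homicides = 10 + 0.3 * heat + rng.normal(0, 2, 60)    # so do homicides

    # The two outcomes correlate even though neither causes the other.
    print(round(np.corrcoef(ice_cream, homicides)[0, 1], 2))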

Simple, right? Then how do mistakes like this get made?

Study: Mercury can cause NY loon population drop. (source: Wall Street Journal June 28, 2:21pm)

“ A 10-year study of Adirondack loons shows mercury contamination can lead to

population declines because birds with elevated mercury levels produce fewer chicks

than those with low levels”

“But how can we ever tell causation with statistics?”

Short answer: You can’t

Good answer: You can’t with statistics alone, because

dealing with numbers after the fact is observational.

But you can use it in combination with other fields

(Experimental Design) to manipulate variables.

Indoor greenhouses can manipulate soil type, moisture, and

light directly, but plants still have randomness.

Better answer: (for interest only)

Google books for a preview, or look up the term

“Contrapositive”

Next time, we expand to multiple correlations and partial

correlations. We may finish chapter 10 early.

ASSIGNMENTS DUE AT 4:30PM!!!!