
Wed, June 26, (Lecture 8-2). Nonlinearity.

Significance test for correlation

R-squared, SSE, and SST.

Correlation in SPSS.

Last time, we looked at scatterplots, which show the relationship between two variables, and at correlation.

The correlation coefficient r measures how well the pairs of

values fit on a line.

r is positive when two values increase together.

r is negative when one value goes up as the other goes down.

However, correlation only shows the linear relation between

two variables.

The variables could still be related in a non-linear way and

have little or no correlation.

In real world contexts, the most common form of non-linear

relationship is a curvilinear one. (SOURCE: GAPMINDER.ORG)

One common reason is a scaling issue, where a fixed change in

one thing doesn’t mean a fixed change in another.

Life expectancy increases with the logarithm of income, not

with income. (SOURCE: GAPMINDER.ORG)

When we rescale income into a log-scale (a scale that shows

very small and very large numbers equally well), a line appears.
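For a quick illustration, here is a minimal sketch (simulated numbers, not the Gapminder data) of how rescaling to a log scale can straighten out this kind of curve:

```python
# Minimal sketch with simulated values (not Gapminder data): a variable
# that grows with log(income) correlates more strongly with log-income
# than with raw income.
import numpy as np

rng = np.random.default_rng(1)
income = rng.uniform(500, 50_000, size=200)
life_exp = 10 * np.log10(income) + rng.normal(0, 2, size=200)

print(np.corrcoef(income, life_exp)[0, 1])            # curved: weaker r
print(np.corrcoef(np.log10(income), life_exp)[0, 1])  # log scale: r close to 1
```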

Another reason for non-linearity could be two competing

factors.

In a too-easy course, nobody learns anything new.

In a too-hard course, nobody learns anything at all.

Spearman correlation is a measure that can handle curves as long as the trend doesn’t switch between increasing and decreasing. The only time we’ll be using this is as a check in SPSS.

Everything else we do in Ch.10 and 11 is the Pearson correlation, which is restricted to linear relationships. We use the Pearson correlation because it produces stronger results and the math is simpler.
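To see the difference, here is a small sketch (made-up data) where the trend always increases but is strongly curved:

```python
# Sketch with made-up data: Spearman credits any monotone trend,
# Pearson only credits the linear part.
import numpy as np
from scipy import stats

x = np.linspace(1, 10, 50)
y = x ** 3                          # always increasing, but strongly curved

print(stats.pearsonr(x, y)[0])      # noticeably below 1
print(stats.spearmanr(x, y)[0])     # exactly 1: the trend never reverses
```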

Math: The ugly sweater around an otherwise pretty graph.

You can do hypothesis testing. We may be interested in

whether or not there is a correlation between two variables.

Since samples are random, the sample correlation between

two variables will show up as a little above or below zero by

chance.

How far from zero correlation does something have to be

before it’s significant?

This formula gives the t-score of a correlation: t = r·√(n − 2) / √(1 − r²).

The null hypothesis is: true correlation = zero.

The alternative is: correlation not zero.

The t in this formula is the same t-score as in chapters 6 and 7.

This t-score gets compared to the critical values in the t-table at n − 2 degrees of freedom.
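As a sketch (the function name here is assumed, not from the lecture), the whole test in Python:

```python
# Minimal sketch: t-score and two-sided p-value for a sample correlation r
# with sample size n, testing H0: true correlation = 0, at n - 2 df.
from math import sqrt
from scipy import stats

def correlation_t_test(r, n):
    df = n - 2
    t = r * sqrt(df) / sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df)      # two-tailed p-value
    return t, p
```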

The stronger the correlation, the farther r goes from zero.

As r gets farther from zero, the t-score gets bigger.

So a stronger correlation gives you a higher t-score.

A stronger correlation means better evidence of a correlation.

The t-score also increases with sample size. As usual, the sample size appears under a square root.

Having more data points makes it easier to detect correlations.

A larger t-score means more evidence against the null, just like before.

So a large t-score means more evidence of a correlation.

If there’s a weak correlation and a small sample, we might not

detect it. (Example: n=10, r=.25)

t* = 1.397, at 8 df, 0.20 significance.

t* = 2.306, at 8 df, 0.05 significance.

No significant evidence of a correlation: t = 0.25·√8 / √(1 − 0.25²) ≈ 0.73, below even the 0.20 critical value.

What if we get a larger sample of this correlation?

(n=46, r=0.25)

We should get some evidence of a correlation, but not much.

t* = 1.684, at 44 df, 0.10 significance.

t* = 2.021, at 44 df, 0.05 significance.

Weak evidence of a correlation: t ≈ 1.71 is significant at the 0.10 level but not at the 0.05 level.

What happens when you get a near perfect correlation?

(Example: n=10, r=.99).

Expectation: Very strong evidence of a correlation.

t* = 2.306, at 8 df, 0.05 significance.

t* = 5.041, at 8 df, 0.001 significance.

Reality: Very strong evidence of a correlation.

The bottom of the formula, √(1 − r²), gets very small, and dividing by a small number gives you something huge.

The same thing happens with a near-perfect negative

correlation, but the t-score is negative and huge.
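Running the correlation_t_test sketch from above on the three examples reproduces these conclusions:

```python
# Checking the lecture's three examples with the sketch defined earlier.
for r, n in [(0.25, 10), (0.25, 46), (0.99, 10)]:
    t, p = correlation_t_test(r, n)
    print(f"r = {r}, n = {n}: t = {t:.2f}, p = {p:.4g}")
# r = 0.25, n = 10: t ≈ 0.73,  below even the 0.20 critical value (1.397)
# r = 0.25, n = 46: t ≈ 1.71,  between the 0.10 and 0.05 critical values
# r = 0.99, n = 10: t ≈ 19.85, far beyond the 0.001 critical value (5.041)
```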

For interest: You can always put a line exactly through two

points.

With only two points, we have no idea what the true

correlation is. Points after the first two tell us about

correlation.

That’s why correlation has n-2 degrees of freedom.

More math? More ugly sweaters!

Show your pet some love by forcing it into a tea cosy.

First, we need to set down a convention. We’re looking at two

variables of the same object. We call these variables x and y.

Example: If we were talking about dragons, X could be the

length and Y could be the width.

X is the independent/explanatory variable (the one we control

or can measure more perfectly),

Y is the dependent/response variable.

When x and y are correlated, we say that some of the variation in y is explained by x.

Meaning: Across all the x, the range of y can be large.

But if we only consider a particular x (or a small x-interval), the

range of y shrinks considerably.

Y varies less for a particular X.

Y has less variance when accounting for X.

r² is the proportion by which the variance of y is reduced when we account for x.

r = 0.6 in this graph, so r² = 0.6² = 0.36.

36% of the variation in Y is explained by X.

The same proportion of variance is explained for a negative correlation of equal strength. A negative times itself is positive, so r² is always between 0 and 1.

In a perfect correlation, knowing x automatically gives you y as well. So there is no variation in y left to explain.

r = 1 or −1, so r² = 1.

All (100%) of the variation in y is explained by x.

When two values are uncorrelated, using a linear function of x

to guess at y is useless.

r = 0, so r² = 0.

None (0%) of the variation in y is explained by x.

The total squared difference from the mean of y is called the Sum of Squares Total, or SST.

SST is the total squared length of all the vertical red lines.

If we fit a line through the middle of the points in the scatter

plot (called a regression line, the subject of chapter 11),

the vertical lines, on average, get shorter.

The total squared length of these lines is the Sum of Squared Error, or SSE.

The stronger the correlation, the shorter the vertical lines get.

In other words, our errors get smaller, and the Sum of Squared Error shrinks with them.

Here, the correlation is very strong, and there are barely any errors at all.

r² can also be expressed in terms of SSE and SST: r² = (SST − SSE) / SST = 1 − SSE / SST.

SST is the total amount of variation in Y

SSE is the amount of variation in Y left unexplained by X.

When r² is zero, SSE is the same as SST.

When r² is one, SSE disappears completely.
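Here is a small check (simulated data) that 1 − SSE/SST really does match the squared Pearson correlation:

```python
# Sketch with simulated data: r^2 = 1 - SSE/SST for a fitted regression line.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)          # roughly linear relationship

slope, intercept = np.polyfit(x, y, 1)   # the regression line (chapter 11)
predicted = slope * x + intercept

SST = np.sum((y - y.mean()) ** 2)        # total variation in y
SSE = np.sum((y - predicted) ** 2)       # variation left unexplained by x

r = np.corrcoef(x, y)[0, 1]
print(r ** 2, 1 - SSE / SST)             # the two numbers agree
```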

An ugly sweater for every occasion!

Even SPSS!

To find a correlation in SPSS, go to

Analyze → Correlate → Bivariate (bivariate means two-variable).

Pick the variables you want to correlate, drag them right.

Pearson correlation coefficient MUST be selected.

Spearman coefficient is optional.

There is a correlation of r = .940 between weight and height.

It’s a significant correlation, with a p-value of less than .001

(shows up as Sig. (2-tailed) = .000)

Also, anything correlates with itself perfectly, so the correlation between length and length is r = 1.
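If you ever want to double-check SPSS’s numbers, scipy’s pearsonr gives the same r and p-value (the file and column names below are made up for illustration):

```python
# Hypothetical cross-check of the SPSS output (file and column names assumed).
import pandas as pd
from scipy import stats

data = pd.read_csv("measurements.csv")    # assumed data file
r, p = stats.pearsonr(data["weight"], data["height"])
print(r, p)                               # SPSS showed r = .940, Sig. < .001
```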

To build a scatterplot, go to

Graphs → Legacy Dialogs → Scatter/Dot

Choose Simple Scatter if it’s not already picked, and click Define.

Move the independent variable into the x-axis, and the dependent variable into the y-axis, then click OK (way at the bottom).

Our result:

There is a definite upward trend, so the strong positive

correlation of r = 0.940 makes sense.

Next time: Residuals, Outliers and Influence, and the

assumption of constant variance.