Linear Regression and Correlation. Fitted Regression Line.

Post on 23-Dec-2015

235 views 0 download

Tags:

Transcript of Linear Regression and Correlation. Fitted Regression Line.

Linear Regression and Correlation

Fitted Regression Line

54 56 58 60 62 64 66 68 7080

100

120

140

160

180

200

Length (cm)

Y=

Wei

ght(

g)

Equation of the Regression Line

XbbY 10 Least squares regression line of Y on X

2)(

))((1 xx

yyxx

i

iib

xbyb 10

Regression Calculations

Plotting the regression line

Residuals

Using the fitted line, it is possible to obtain an estimate of the y coordinate

The “errror” in the fit we term the “residual error”

ii xbby 10ˆ

ii yy ˆ

54 56 58 60 62 64 66 68 7080

100

120

140

160

180

200

Length (cm)

Y=

Wei

ght(

g)

Residual

Residual Standard Deviation

2

)ˆ( 2

|

n

yys i

XY

54 56 58 60 62 64 66 68 7080

100

120

140

160

180

200

Length (cm)

Y=

Wei

ght(

g)

Residuals from example

Other ways to evaluate residuals

Lag plots, plot residuals vs. time delay of residuals…looks for temporal structure.

Look for skew in residualsKurtosis in residuals – error not distributed

“normally”.

0

20

40

0

20

40

0

20

40

0

20

40

0 0.3-0.3

Model Residuals: constrained

Pairwise model

Independent model

0 0.3-0.3

Model Residuals: freely moving

Pairwise model

Independent model

-0.2 -0.15 -0.1 -0.05 0 0.05 0.1 0.15-0.2

-0.15

-0.1

-0.05

0

0.05

0.1

0.15

-0.4 -0.3 -0.2 -0.1 0 0.1-0.4

-0.3

-0.2

-0.1

0

0.1

Pairwise model

Independent model

-0.08 -0.06 -0.04 -0.02 0 0.02 0.04 0.06 0.08-0.08

-0.06

-0.04

-0.02

0

0.02

0.04

0.06

0.08

-0.35 -0.3 -0.25 -0.2 -0.15 -0.1 -0.05 0 0.05-0.35

-0.3

-0.25

-0.2

-0.15

-0.1

-0.05

0

0.05

Pairwise model

Independent model

Parametric Interpretation of regression: linear models

Conditional Populations and Conditional DistributionsA conditional population of Y values associated

with a fixed, or given, value of X.A conditional distribution is the distribution of

values within the conditional population above

XY |

X|Y

Population mean Y value for a given X

Population SD of Y value for a given X

The linear model

Assumptions:LinearityConstant standard deviation

XY

X

Y

10

10X|Y

X|Y

Statistical inference concerning

You can make statistical inference on model parameters themselves

1

XY |1b

0b estimates0

estimates

XYs | estimates

1

Standard error of slope

95% Confidence interval for

2

|

)(1

xx

sSE

i

XYb

1

1025.01 bSEtb where

Hypothesis testing: is the slope significantly different from zero?

1 = 0

Using the test statistic:

1

1

bs SE

bt

df=n-2

Coefficient of Determination

r2, or Coefficient of determination: how much of the variance in data is accounted for by the linear model.

2

2

)(

)ˆ(1

yy

yy

i

i

Line “captures” most of the data variance.

Correlation Coefficient

R is symmetrical under exchange of x and y.

X

Y

s

srb *1

22 )()(

))((

yyxx

yyxxr

ii

ii

What’s this?It adjusts R to compensate for the factThat adding even uncorrelated variables tothe regression improves R

Statistical inference on correlations

Like the slope, one can define a t-statistic for correlation coefficients:

21

1

1

1r

nr

SE

bt

bs

Consider the following some “Spike Triggered Averages”:

-5 0 5 10 15-500V

0500V

-5 0 5 10 15

-2mV0

2mV

-5 0 5 10 15-500V

0500V

-5 0 5 10 15-500V

0

500V

-5 0 5 10 15-5mV

05mV

-5 0 5 10 15-5mV

05mV

-5 0 5 10 15-10mV

010mV

-5 0 5 10 15

-5mV0

5mV

-5 0 5 10 15-1mV

01mV

-5 0 5 10 15-2mV

0

2mV

-5 0 5 10 15-500V

0500V

-5 0 5 10 15

-2mV0

2mV

-5 0 5 10 15-500V

0500V

-5 0 5 10 15-500V

0

500V

-5 0 5 10 15-5mV

05mV

-5 0 5 10 15-5mV

05mV

-5 0 5 10 15-10mV

010mV

-5 0 5 10 15

-5mV0

5mV

-5 0 5 10 15-1mV

-5 0 5 10 15-500V

0500V

-5 0 5 10 15

-2mV0

2mV

-5 0 5 10 15-500V

0500V

-5 0 5 10 15-500V

0

500V

-5 0 5 10 15-5mV

05mV

-5 0 5 10 15-5mV

05mV

-5 0 5 10 15-10mV

010mV

-5 0 5 10 15

-5mV0

5mV

-5 0 5 10 15-1mV

01mV

-5 0 5 10 15-2mV

0

2mV

STA example

-10 -5 0 5-6

-4

-2

0

2

4

6

R2=0.25.Is this correlation

significant?

N=446, t = 0.25*(sqrt(445/(1-0.25^2))) = 5.45

21

1

1

1r

nr

SE

bt

bs

When is Linear Regression Inadequate?

CurvilinearityOutliersInfluential points

Curvilinearity

60 61 62 63 64 65 66 67 68 69 70-25

-20

-15

-10

-5

0

OutliersCan reduce correlations and unduly influence the

regression lineYou can “throw out” some clear outliersA variety of tests to use. Example? Grubb’s test

SD

valuemeanZ

Look up critical Z value in a table Is your z value larger? Difference is significant and data

can be discarded.

Influential pointsPoints that have a lot of influence on

regressed modelNot really an outlier, as residual is small.

50 55 60 65 70 75 80 85 90 95 100 10550

100

150

200

250

300

Conditions for inference

Design conditions Random subsampling model: for each x observed, y is viewed as

randomly chosen from distribution of Y values for that X Bivariate random sampling: each observed (x,y) pair must be

independent of the others. Experimental structure must not include pairing, blocking, or an internal hierarchy.

Conditions on parameters X10X|Y

XYs | is not a function of X

Conditions concerning population distributions Same SD for all levels of X Independent Observatinos Normal distribution of Y for each fixed X Random Samples

Error Bars on Coefficients of Model

MANOVA and ANCOVA

MANOVA

Multiple Analysis of VarianceDeveloped as a theoretical construct by

S.S. Wilks in 1932

Key to assessing differences in groups across multiple metric dependent variables, based on a set of categorical (non-metric) variables acting as independent variables.

MANOVA vs ANOVA

ANOVA

Y1 = X1 + X2 + X3 +...+ Xn

(metric DV) (non-metric IV’s)

MANOVA

Y1 + Y2 + ... + Yn = X1 + X2 + X3 +...+ Xn

(metric DV’s) (non-metric IV’s)

ANOVA Refresher

SS Df MS F

Between SS(B) k-1

Within SS(W) N-k

Total SS(W)+SS(B) N-1

1

)(

kBSS

)(

)(

WMS

BMS

kN

WSS

)(

Reject the null hypothesis if test statistic is greater than critical F value with k-1Numerator and N-k denominator degrees of freedom. If you reject the null,At least one of the means in the groups are different

MANOVA Guidelines

Assumptions the same as ANOVAAdditional condition of multivariate

normalityall variables and all combinations of the

variables are normally distributed

Assumes equal covariance matrices (standard deviations between variables should be similar)

Example The first group receives technical

dietary information interactively from an on-line website. Group 2 receives the same information in from a nurse practitioner, while group 3 receives the information from a video tape made by the same nurse practitioner.

User rates based on usefulness, difficulty and importance of instruction

Note: three indexing independent variables and three metric dependent variables

Hypotheses

H0: There is no difference between treatment group (online learners) from oral learners and visual learners.

HA: There is a difference.

Order of operations

MANOVA Output 2

Individual ANOVAs not significant

MANOVA output

Overall multivariate effect is signficant

Post hoc tests to find the culprit

Post hoc tests to find the culprit!

Once more, with feeling: ANCOVA

Analysis of covarianceHybrid of regression analysis and ANOVA

style methodsSuppose you have pre-existing effect

differences between subjectsSuppose two experimental conditions, A

and B, you could test half your subjects with AB (A then B) and the other half BA using a repeated measures design

Why use?Suppose there exists a particular variable that *explains* some of

what’s going on in the dependent variable in an ANOVA style experiment.

Removing the effects of that variable can help you determine if categorical difference is “real” or simply depends on this variable.

In a repeated measures design, suppose the following situation: sequencing effects, where performing A first impacts outcomes in B. Example: A and B represent different learning methodologies.

ANCOVA can compensate for systematic biases among samples

(if sorting produces unintentional correlations in the data).

Example

Results

Second Example

How does the amount spent on groceries, and the amount one intends to spend depend on a subjects sex?

H0: no dependenceTwo analyses:

MANOVA to look at the dependenceANCOVA to determine if the root of there is

significant covariance between intended spending and actual spending

MANOVA

Results

ANCOVA

ANCOVA Results

So if you remove the amount the subjects intend to spend from the equation,No significant difference between spending. Spending difference not a resultOf “impulse buys”, it seems.

Principal Component Analysis

Say you have time series data, characterized by multiple channels or trials. Are there a set of factors underlying the data that explain it (is there a simpler exlplanation for observed behavior)?

In other words, can you infer the quantities that are supplying variance to the observed data, rather than testing *whether* known factors supply the variance.

0 500 1000 1500 2000 2500

Example: 8 channels of recorded EMG activity

PCA works by “rotating” the data (considering a time series as a spatial vector) to a “position” in the abstract space that minimizes covariance.

Don’t worry about what this means.

0 500 1000 1500 2000 2500

0 500 1000 1500 2000 2500

Note how a single component explainsalmost all of the variance in the 8 EMGsRecorded.

Next step would be to correlatethese components with some other parameter in the experiment.

0 500 1000 1500 2000 2500

Largest PCN

eura

l firi

ng r

ates

Some additional uses: Say you have a very large data set, but

believe there are some common features uniting that data set

Use a PCA type analysis to identify those common features.

Retain only the most important components to describe “reduced” data set.