Linear Modelling I


Page 1: Linear Modelling I

Linear Modelling I

Richard Mott
Wellcome Trust Centre for Human Genetics

Page 2: Linear Modelling I

Synopsis

• Linear Regression
• Correlation
• Analysis of Variance
• Principle of Least Squares

Page 3: Linear Modelling I

Correlation

Page 4: Linear Modelling I

Correlation and linear regression

• Is there a relationship?
• How do we summarise it?
• Can we predict new observations?
• What about outliers?

Page 5: Linear Modelling I

Correlation Coefficient r

• -1 ≤ r ≤ 1
• r = 0: no linear relationship
• r = 0.6: moderate positive linear relationship
• r = 1: perfect positive linear relationship
• r = -1: perfect negative linear relationship

(a short simulation below illustrates these cases)
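A minimal sketch (not from the original slides): if x and z are independent standard Normal variables, then y = r·x + sqrt(1 - r²)·z has population correlation r with x, so each reference value of r can be simulated directly.

> # simulate data with a chosen population correlation r
> set.seed(1)
> x <- rnorm(1000)
> z <- rnorm(1000)
> for (r in c(-1, 0, 0.6, 1)) {
    y <- r * x + sqrt(1 - r^2) * z   # sample cor(x, y) is close to r
    cat("target r =", r, "  sample r =", round(cor(x, y), 2), "\n")
  }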

Page 6: Linear Modelling I

Examples of Correlation (taken from Wikipedia)

Page 7: Linear Modelling I

Calculation of r

• Data: pairs (x_1, y_1), ..., (x_n, y_n)
• r = Σ_i (x_i - x̄)(y_i - ȳ) / sqrt( Σ_i (x_i - x̄)² · Σ_i (y_i - ȳ)² )
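As a check, a short sketch computing r directly from this formula and comparing it with R's built-in cor(); the toy vectors x and y are made up for illustration:

> x <- c(1.2, 2.4, 3.1, 4.8, 5.0)
> y <- c(2.0, 2.9, 3.7, 5.1, 4.9)
> sxy <- sum((x - mean(x)) * (y - mean(y)))   # numerator
> sxx <- sum((x - mean(x))^2)                 # sum of squares of x
> syy <- sum((y - mean(y))^2)                 # sum of squares of y
> sxy / sqrt(sxx * syy)                       # r from the formula
> cor(x, y)                                   # identical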

Page 8: Linear Modelling I

Correlation in R

> cor(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL, use="complete")
[1] 0.2577617

> cor.test(bioch$Biochem.Tot.Cholesterol, bioch$Biochem.HDL)

        Pearson's product-moment correlation

data:  bioch$Biochem.Tot.Cholesterol and bioch$Biochem.HDL
t = 11.1473, df = 1746, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2134566 0.3010088
sample estimates:
      cor
0.2577617

> pt(11.1473, df=1746, lower.tail=FALSE)  # t distribution on 1746 degrees of freedom
[1] 3.154319e-28

Page 9: Linear Modelling I

Linear Regression

Fit a straight line to the data:

y_i = a + b·x_i + e_i

• a: intercept
• b: slope
• e_i: error
  – Normally distributed
  – E(e_i) = 0
  – Var(e_i) = σ²

Page 10: Linear Modelling I

Example: simulated data

R code
> # simulate 30 data points
> x <- 1:30
> e <- rnorm(30, 0, 5)
> y <- 1 + 3*x + e

> # fit the linear model
> f <- lm(y ~ x)

> # plot the data and the predicted line
> plot(x, y)
> abline(reg=f)

> print(f)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   -0.08634      3.04747

Page 11: Linear Modelling I

Least Squares

• Estimate a and b by least squares
• Minimise the sum of squared residuals between y and the prediction a + bx:

  minimise over a, b:  Σ_i (y_i - a - b·x_i)²
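Setting the derivatives of this sum to zero gives closed-form solutions, b̂ = Σ_i (x_i - x̄)(y_i - ȳ) / Σ_i (x_i - x̄)² and â = ȳ - b̂·x̄. A quick sketch checking them against lm(), reusing the simulated x and y from the previous page:

> b.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
> a.hat <- mean(y) - b.hat * mean(x)
> c(a.hat, b.hat)   # least-squares estimates from the formulae
> coef(lm(y ~ x))   # the same estimates from lm()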

Page 12: Linear Modelling I

Why least squares?

• LS gives simple formulae for the estimates of a and b
• If the errors are Normally distributed then the LS estimates are "optimal":
  – in large samples the estimates converge to the true values
  – no other estimates have smaller expected errors
  – LS = maximum likelihood
• Even if the errors are not Normal, LS estimates are often useful

Page 13: Linear Modelling I

Analysis of Variance (ANOVA)

LS estimates have an important property: they partition the total sum of squares (SS) into fitted and error components:

• total SS = fitting SS + residual SS
• only the LS estimates do this

Component     Degrees of freedom   Mean Square (SS/df)         F-ratio (FMS/RMS)
Fitting SS    1                    FMS = fitting SS / 1        FMS/RMS
Residual SS   n-2                  RMS = residual SS / (n-2)
Total SS      n-1

Page 14: Linear Modelling I

ANOVA in R

Component     SS        Degrees of freedom   Mean Square   F-ratio
Fitting SS    20872.7   1                    20872.7       965
Residual SS   605.6     28                   21.6
Total SS      21478.3   29

> anova(f)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x          1 20872.7 20872.7     965 < 2.2e-16 ***
Residuals 28   605.6    21.6

> pf(965, 1, 28, lower.tail=FALSE)  # P-value of the F statistic
[1] 3.042279e-23
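The partition can be verified directly from the fitted model f of Page 10; a short sketch:

> tss <- sum((y - mean(y))^2)           # total SS
> fss <- sum((fitted(f) - mean(y))^2)   # fitting SS
> rss <- sum(resid(f)^2)                # residual SS
> c(tss, fss + rss)                     # the two numbers agree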

Page 15: Linear Modelling I

Hypothesis testing

• H0: no relationship between y and x, i.e. b = 0
• Assume the errors e_i are independent and Normally distributed N(0, σ²)
• If H0 is true then the expected values of the sums of squares in the ANOVA are:

Component     Degrees of freedom   Expectation
Fitting SS    1                    σ²
Residual SS   n-2                  (n-2)σ²

• so both mean squares estimate σ², and F ratio = (fitting MS)/(residual MS) ~ 1 under H0
• F >> 1 implies we reject H0
• Under H0, F is distributed as F(1, n-2)

Page 16: Linear Modelling I

Degrees of Freedom

• Suppose e_1, ..., e_n are iid N(0,1)
• Then Σ_i e_i² ~ χ²_n, i.e. the sum behaves like n independent variables
• What about Σ_i (e_i - ē)², where ē is the sample mean?
• The values e_i - ē are constrained to sum to 0: Σ_i (e_i - ē) = 0
• Therefore the sum is distributed as if it comprised one fewer observation, hence it has n-1 df (for example, its expectation is n-1); see the simulation sketch below
• In general, if p parameters are estimated from a data set, then the residuals have p constraints on them, so they behave like n-p independent variables
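A simulation sketch of this argument: the expected value of Σ_i (e_i - ē)² for n iid N(0,1) variables is n-1, not n.

> set.seed(2)
> n <- 10
> ss <- replicate(10000, {
    e <- rnorm(n)
    sum((e - mean(e))^2)   # SS about the estimated mean
  })
> mean(ss)                 # close to n - 1 = 9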

Page 17: Linear Modelling I

The F distribution

• If e_1, ..., e_n are independent and identically distributed (iid) random variables with distribution N(0, σ²), then:
  – e_1²/σ², ..., e_n²/σ² are each iid chi-squared random variables on 1 degree of freedom, χ²_1
  – the sum S_n = Σ_i e_i²/σ² is distributed as chi-squared on n df, χ²_n
• If T_m is a similar sum distributed as χ²_m, but independent of S_n, then (S_n/n)/(T_m/m) is distributed as an F random variable F(n, m)
• Special cases:
  – F(1, m) is the same as the square of a t distribution on m df
  – for large m, F(n, m) tends to χ²_n/n
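A simulation sketch of this construction of F(n, m) from two independent chi-squared sums:

> set.seed(3)
> n <- 3; m <- 10
> Sn <- rowSums(matrix(rnorm(20000 * n)^2, ncol = n))  # 20000 draws of chi-squared on n df
> Tm <- rowSums(matrix(rnorm(20000 * m)^2, ncol = m))  # independent chi-squared on m df
> Fsim <- (Sn / n) / (Tm / m)
> quantile(Fsim, 0.95)   # empirical 95% point
> qf(0.95, n, m)         # theoretical F(n,m) 95% point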

Page 18: Linear Modelling I

ANOVA – HDL example

> ff <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)
> ff

Call:
lm(formula = bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)

Coefficients:
                  (Intercept)  bioch$Biochem.Tot.Cholesterol
                       0.2308                         0.4456

> anova(ff)
Analysis of Variance Table

Response: bioch$Biochem.HDL
                                Df  Sum Sq Mean Sq F value    Pr(>F)
bioch$Biochem.Tot.Cholesterol    1 149.660 149.660    1044 < 2.2e-16 ***
Residuals                     1849 265.057   0.143

> pf(1044, 1, 1849, lower.tail=FALSE)  # P-value of the F statistic; residual df = 1849

Fitted model: HDL = 0.2308 + 0.4456 × Cholesterol

Page 19: Linear Modelling I

correlation and ANOVA

• r² = FSS/TSS = the fraction of variance explained by the model
• r² = F/(F + n - 2)
  – correlation and ANOVA are equivalent
  – the test of r = 0 is equivalent to the test of b = 0
  – the t statistic in R's cor.test is the square root of the ANOVA F statistic
  – r says nothing about the magnitudes of the estimates of a and b
  – r is dimensionless

(a numerical check follows below)
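A quick numerical check of r² = F/(F + n - 2), using the HDL example of Page 8 (t = 11.1473 on 1746 df, so F = t² and n - 2 = 1746):

> Fstat <- 11.1473^2       # the ANOVA F statistic is the square of the t statistic
> Fstat / (Fstat + 1746)   # 0.0664...
> 0.2577617^2              # r², the same value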

Page 20: Linear Modelling I

Effect of sample size on significance

Total Cholesterol vs HDL data. Example R session to sample subsets of the data and compute the correlation P-value at each sample size:

seqq <- seq(10, 300, 5)
corr <- matrix(0, nrow=length(seqq), ncol=2)
colnames(corr) <- c("sample size", "P-value")
n <- 1
for (i in seqq) {
  res <- rep(0, 100)
  for (j in 1:100) {
    s <- sample(idx, i)     # idx: candidate row indices, defined earlier in the session
    data <- bioch[s, ]
    co <- cor.test(data$Biochem.Tot.Cholesterol,
                   data$Biochem.HDL)   # incomplete pairs are dropped automatically
    res[j] <- co$p.value
  }
  m <- exp(mean(log(res)))  # geometric mean P-value over the 100 subsamples
  cat(i, m, "\n")
  corr[n, ] <- c(i, m)
  n <- n + 1
}

Page 21: Linear Modelling I

Calculating the right sample size n

• The R library "pwr" contains functions to compute the sample size for many problems, including correlation (pwr.r.test()) and ANOVA (pwr.anova.test()); a usage sketch follows below
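A usage sketch; the effect size r = 0.26 is an assumption made here, roughly echoing the HDL example:

> library(pwr)
> # sample size needed to detect a correlation of 0.26
> # with 80% power at the 5% significance level
> pwr.r.test(r = 0.26, power = 0.8, sig.level = 0.05)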

Page 22: Linear Modelling I

Problems with non-linearity

All plots have r = 0.8 (taken from Wikipedia)

Page 23: Linear Modelling I

Multiple Correlation

• The R cor function can be used to compute pairwise correlations between many variables at once, producing a correlation matrix.
• This is useful, for example, when comparing expression of genes across subjects.
• Gene coexpression networks are often based on the correlation matrix.
• In R:

  mat <- cor(df, use="pairwise.complete.obs")

  – computes the correlation between every pair of columns in df, removing missing values in a pairwise manner
  – the output is a square matrix of correlation coefficients
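A self-contained sketch on a made-up data frame with a missing value:

> set.seed(4)
> df <- data.frame(g1 = rnorm(20), g2 = rnorm(20), g3 = rnorm(20))
> df$g2[3] <- NA                                  # one missing value
> mat <- cor(df, use = "pairwise.complete.obs")
> mat                                             # 3 x 3 symmetric matrix of r values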

Page 24: Linear Modelling I

One-Way ANOVA

• Model y as a function of a categorical variable taking p values
  – e.g. subjects are classified into p families
  – we want to estimate the effect due to each family and test whether these effects differ
  – we want to estimate the fraction of variance explained by differences between families (an estimate of heritability)

Page 25: Linear Modelling I

One-Way ANOVA

For the model y_ij = μ_i + e_ij (observation j in group i), the LS estimators are the group averages:

μ̂_i = ȳ_i, the average over group i
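A sketch on made-up data showing that the LS estimates are just the group averages:

> set.seed(5)
> g <- factor(rep(c("A", "B", "C"), each = 10))   # p = 3 groups
> y <- rnorm(30, mean = c(1, 2, 3)[g])            # true group means 1, 2, 3
> tapply(y, g, mean)                              # group averages
> unique(round(fitted(lm(y ~ g)), 8))             # identical values, from the linear model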

Page 26: Linear Modelling I

One-Way ANOVA

• The variance is partitioned into fitting and residual sums of squares:
  – total SS: n-1 degrees of freedom
  – fitting SS (between groups): p-1 degrees of freedom
  – residual SS (within groups): n-p degrees of freedom

Page 27: Linear Modelling I

One-Way ANOVA

Component     Degrees of freedom   Mean Square (SS/df)         F-ratio (FMS/RMS)
Fitting SS    p-1                  FMS = fitting SS / (p-1)    FMS/RMS
Residual SS   n-p                  RMS = residual SS / (n-p)
Total SS      n-1

Under H0 (no differences between groups), F ~ F(p-1, n-p)

Page 28: Linear Modelling I

One-Way ANOVA in R

> fam <- lm(bioch$Biochem.HDL ~ bioch$Family)
> anova(fam)
Analysis of Variance Table

Response: bioch$Biochem.HDL
               Df  Sum Sq Mean Sq F value    Pr(>F)
bioch$Family  173  6.3870  0.0369  3.4375 < 2.2e-16 ***
Residuals    1727 18.5478  0.0107
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Component     SS        Degrees of freedom   Mean Square   F-ratio
Fitting SS    6.3870    173                  0.0369        3.4375
Residual SS   18.5478   1727                 0.0107
Total SS      24.9348   1900

Page 29: Linear Modelling I

Factors in R

• Grouping variables in R are called factors
• When a data frame is read with read.table():
  – a column is treated as numeric if all non-missing entries are numbers
  – a column is boolean if all non-missing entries are T or F (or TRUE or FALSE)
  – a column is treated as a factor otherwise
  – the levels of the factor are the set of distinct values
  – a column can be forced to be treated as a factor using as.factor(), or as a numeric vector using as.numeric()
  – BEWARE: if a numeric column contains non-numeric values (e.g. "N" used instead of "NA" for a missing value), then the whole column is interpreted as a factor, as the sketch below shows
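A sketch of this pitfall and the safe way round it (the vector x is made up for illustration):

> x <- c("1.2", "3.4", "N", "5.6")   # "N" instead of "NA" forces a factor
> fac <- as.factor(x)
> as.numeric(fac)                    # WRONG: returns the internal level codes, not the numbers
> as.numeric(as.character(fac))      # correct: 1.2 3.4 NA 5.6 (NA introduced with a warning)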

Page 30: Linear Modelling I

Linear Modelling in R

• The R function lm() fits linear models
• It has two principal arguments (and some optional ones):

  f <- lm( formula, data )

  – formula is an R formula
  – data is the name of the data frame containing the data (it can be omitted if the variables in the formula already name the data frame, e.g. bioch$Biochem.HDL)

Page 31: Linear Modelling I

formulae in R

• Biochem.HDL ~ Biochem.Tot.Cholesterol
  – linear regression of HDL on Cholesterol
  – 1 df
• Biochem.HDL ~ Family
  – one-way analysis of variance of HDL on Family
  – 173 df
• The degrees of freedom are the number of independent parameters to be estimated

Page 32: Linear Modelling I

ANOVA in R

• f <- lm(Biochem.HDL ~ Biochem.Tot.Cholesterol, data=bioch)
  [or f <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)]
• a <- anova(f)

• f <- lm(Biochem.HDL ~ Family, data=bioch)
• a <- anova(f)

Page 33: Linear Modelling I

Non-parametric Methods

• So far we have assumed the errors in the data are Normally distributed
• P-values and estimates can be inaccurate if this is not the case
• Non-parametric methods are a (partial) way round this problem
• They make fewer assumptions about the distribution of the data, requiring only that the observations are:
  – independent
  – identically distributed

Page 34: Linear Modelling I

Non-Parametric Correlation: the Spearman Rank Correlation Coefficient

• Replace the observations by their ranks
• e.g. x = (5, 1, 4, 7) -> rank(x) = (3, 1, 2, 4)
• Compute the correlation from the sum of squared differences between the ranks
• In R:
  – cor(x, y, method="spearman")
  – cor.test(x, y, method="spearman")
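Equivalently, Spearman's rho is Pearson's r computed on the ranks; a quick sketch (the vector y is made up):

> x <- c(5, 1, 4, 7)
> y <- c(2, 1, 3, 9)
> cor(rank(x), rank(y))            # Pearson r on the ranks
> cor(x, y, method = "spearman")   # the same value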

Page 35: Linear Modelling I

Spearman Correlation

> cor.test(xx, y, method="pearson")

        Pearson's product-moment correlation

data:  xx and y
t = 0.0221, df = 28, p-value = 0.9825
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3566213  0.3639062
sample estimates:
        cor
0.004185729

> cor.test(xx, y, method="spearman")

        Spearman's rank correlation rho

data:  xx and y
S = 2473.775, p-value = 0.01267
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.4496607

Page 36: Linear Modelling I

Non-Parametric One-Way ANOVA

• Kruskal-Wallis test
• Useful if the data are highly non-Normal:
  – replace the data by their ranks
  – compute the average rank within each group
  – compare the averages
• In R: kruskal.test( formula, data ), as sketched below
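A usage sketch for the HDL/family example:

> kruskal.test(Biochem.HDL ~ Family, data = bioch)
> # or equivalently, with the default (vector, group) interface:
> kruskal.test(bioch$Biochem.HDL, bioch$Family)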

Page 37: Linear Modelling I

Permutation Tests as non-parametric tests

• Example, one-way ANOVA:
  – permute the group identities between subjects
  – count the fraction of permutations in which the ANOVA P-value is smaller than the true P-value

a <- anova(lm(bioch$Biochem.HDL ~ bioch$Family))
p <- a[1,5]                                    # the observed P-value
pv <- rep(0, 1000)
for (i in 1:1000) {
  perm <- sample(bioch$Family, replace=FALSE)  # permute the family labels
  a <- anova(lm(bioch$Biochem.HDL ~ perm))
  pv[i] <- a[1,5]                              # P-value under this permutation
}
pval <- mean(pv < p)                           # the permutation P-value