Introduction to R

Introduction to R Linear Modelling An introduction using R Karim Malki 19 June 2014



Transcript of Introduction to R

Page 1: Introduction to R

Introduction to R

Linear Modelling: An introduction using R

Karim Malki

19 June 2014

Page 2: Introduction to R

AGENDA

• R for statistical analysis

• Understanding Linear Models

• Data pre-processing

• Building Linear Models in R

• Graphing

• Reporting Results

• Further Reading

Page 3: Introduction to R

R For Statistics

R is a powerful statistical program, but it is first and foremost a programming language.

Many routines have been written for R by people all over the world and made freely available on the R project website as "packages".

The base installation contains a powerful set of tools for many statistical purposes including linear modelling.


Page 4: Introduction to R

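Variance and Covariance

Variance

• Sum of the squared deviations of each data point from the mean of that variable, divided by N - 1:

\[ s_x^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} \]

• When a participant deviates from the mean on one variable, do they deviate on another variable in a similar, or opposite, way? = “Covariance”:

\[ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1} \]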
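As a quick check of the two definitions, the sketch below computes variance and covariance by hand and compares them with base R's var() and cov(). Using the built-in faithful data here is a choice made for illustration; the variable names are not from the slides.

```r
# Variance and covariance computed from their definitions,
# then compared with the base R functions.
x <- faithful$eruptions
y <- faithful$waiting
n <- length(x)

# Variance: sum of squared deviations from the mean, divided by n - 1
var_x <- sum((x - mean(x))^2) / (n - 1)

# Covariance: sum of cross-products of the deviations, divided by n - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

all.equal(var_x, var(x))      # TRUE
all.equal(cov_xy, cov(x, y))  # TRUE
```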


Page 5: Introduction to R

Correlation

x <- runif(10, 5.0, 15)
y <- sample(5:15, 10, replace = TRUE)




Page 6: Introduction to R

Correlation

Standardising covariance measures

Standardising a covariance value gives a measure of the strength of the relationship: the correlation coefficient.

E.g. covariance divided by (sd of X * sd of Y) is the ‘Pearson product moment correlation coefficient’. This gives coefficients between -1 (perfect negative relationship) and 1 (perfect positive relationship).
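A minimal sketch of that standardisation, again assuming the built-in faithful data: dividing the covariance by the product of the two standard deviations should reproduce what cor() returns.

```r
# Pearson's r as a standardised covariance
x <- faithful$eruptions
y <- faithful$waiting

r_manual <- cov(x, y) / (sd(x) * sd(y))

all.equal(r_manual, cor(x, y))  # TRUE
r_manual >= -1 && r_manual <= 1 # TRUE: always within [-1, 1]
```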




Page 7: Introduction to R



> summary(faithful)
   eruptions        waiting
 Min.   :1.600   Min.   :43.0
 1st Qu.:2.163   1st Qu.:58.0
 Median :4.000   Median :76.0
 Mean   :3.488   Mean   :70.9
 3rd Qu.:4.454   3rd Qu.:82.0
 Max.   :5.100   Max.   :96.0

> library(psych)   # describe() comes from the psych package
> describe(faithful)
          var   n  mean    sd median trimmed   mad  min  max range  skew kurtosis   se
eruptions   1 272  3.49  1.14      4    3.53  0.95  1.6  5.1   3.5 -0.41    -1.51 0.07
waiting     2 272 70.90 13.59     76   71.50 11.86 43.0 96.0  53.0 -0.41    -1.16 0.82

Page 8: Introduction to R


Correlation graphs

Use the basic defaults to create a scatter plot of your two variables:
plot(eruptions ~ waiting)

Change the axis titles:
plot(eruptions, waiting, xlab="X-axis", ylab="Y-axis")

Change the plotting symbol to a solid circle:
plot(eruptions, waiting, pch=16)

Add a line of best fit to your scatter plot:
abline(lm(waiting ~ eruptions))

The default correlation returns the Pearson correlation coefficient:
cor(eruptions, waiting)

If you specify "spearman" you will get the Spearman correlation coefficient:
cor(eruptions, waiting, method = "spearman")

If you use a dataset instead of separate variables, you get a matrix of all the pairwise correlation coefficients:
cor(dataset, method = "pearson")

Page 9: Introduction to R

Correlation

ls()
hist(faithful$eruptions, col="grey")
hist(faithful$waiting, col="grey")
attach(faithful)
plot(eruptions ~ waiting)
abline(lm(faithful$eruptions ~ faithful$waiting))

cor(eruptions, waiting)
cor(faithful, method = "pearson")

library(car)
# The argument names below follow the older car interface;
# newer versions of car may expect different names.
scatterplot(eruptions ~ waiting, reg.line=lm, smooth=TRUE, spread=TRUE, id.method='mahal', id.n = 2, boxplots='xy', span=0.5, data=faithful)


Page 10: Introduction to R


Page 11: Introduction to R


Page 12: Introduction to R


corr.mat <- cor.matrix(variables=d(eruptions,waiting),, data=faithful, test=cor.test, method='pearson', alternative="two.sided")

> print(corr.mat)

Pearson's product-moment correlation

                     eruptions   waiting
eruptions  cor       1           0.9008
           N         272         272
           CI*                   (0.8757,0.9211)
           p-value               0.0000

Page 13: Introduction to R

Regression

• Linear Regression is conceptually similar to correlation

• However, correlation does not treat the two variables differently

• In contrast, Linear Regression asks about the effect of one on the other. It distinguishes between IVs (the things that may influence) and DVs (the things being influenced)

• So, sometimes problematically, you choose which you expect to have the causal effect

• Fits a straight line that minimises squared error in the DV (vertical distances of points from the line) = “Method of Least Squares”

• And then asks how much variance this straight-line model explains relative to the unexplained variance
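The last two bullets can be sketched by hand. The example below, which assumes the built-in faithful data with waiting as the IV and eruptions as the DV, computes the least-squares slope and intercept from their closed-form expressions and then the share of variance explained, and checks both against lm(). The variable names are illustrative only.

```r
# Least squares for one predictor, computed by hand
x <- faithful$waiting    # IV
y <- faithful$eruptions  # DV

# Closed-form least-squares estimates for a straight line
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Explained versus unexplained variance
fitted_y  <- intercept + slope * x
ss_res    <- sum((y - fitted_y)^2)   # unexplained (residual) sum of squares
ss_tot    <- sum((y - mean(y))^2)    # total sum of squares
r_squared <- 1 - ss_res / ss_tot

# Compare with R's own fit
fit <- lm(eruptions ~ waiting, data = faithful)
all.equal(c(intercept, slope), unname(coef(fit)))  # TRUE
all.equal(r_squared, summary(fit)$r.squared)       # TRUE
```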

Page 14: Introduction to R

Regression

x <- c(173, 169, 176, 166, 161, 164, 160, 158, 180, 187)
y <- c(80, 68, 72, 75, 70, 65, 62, 60, 85, 92)
# plot scatterplot and the regression line
mod1 <- lm(y ~ x)
plot(x, y, xlim=c(min(x)-5, max(x)+5), ylim=c(min(y)-10, max(y)+10))
abline(mod1, lwd=2)
# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1)
# plot distances between points and the regression line
segments(x, y, x, pre, col="red")
# add labels (res values) to points
library(calibrate)
textxy(x, y, res, cx=0.7)

Page 15: Introduction to R


Page 16: Introduction to R

Method of Least Squares

Page 17: Introduction to R

Parameters

The regression model finds the best-fitting line, but it describes that line with only two parameters: the slope (gradient or coefficient) and the intercept (or constant).
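To make the two parameters concrete, the sketch below pulls them out of a fitted model with coef() and uses them to predict a value by hand. The faithful data and the waiting time of 80 minutes are assumptions chosen for illustration.

```r
# The fitted line is fully described by its intercept and slope
fit <- lm(eruptions ~ waiting, data = faithful)
b   <- coef(fit)

new_wait <- 80  # hypothetical waiting time, in minutes

# Prediction from the line equation: intercept + slope * x
by_hand <- b[["(Intercept)"]] + b[["waiting"]] * new_wait

# The same prediction using predict()
by_predict <- predict(fit, newdata = data.frame(waiting = new_wait))

all.equal(by_hand, unname(by_predict))  # TRUE
```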

Page 18: Introduction to R


Page 19: Introduction to R

Linear Modelling

# data(faithful); ls()
mod1 <- lm(eruptions ~ waiting, data=faithful)
mod1

Coefficients:
(Intercept)      waiting
   -1.87402      0.07563

summary(mod1)

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max
-1.29917 -0.37689  0.03508  0.34909  1.19329

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

Page 20: Introduction to R


Linear Modelling

# calculate residuals and predicted values
res <- signif(residuals(mod1), 5)
pre <- predict(mod1)

# Residuals should be normally distributed and this is easy to check
hist(res)

# Plot your regression
attach(faithful)
mod1 <- lm(eruptions ~ waiting)
plot(waiting, eruptions, xlim=c(min(faithful$waiting)-10, max(faithful$waiting)+5), ylim=c(min(faithful$eruptions)-3, max(faithful$eruptions)+1))
abline(mod1, lwd=2)
# plot distances between points and the regression line
segments(faithful$waiting, faithful$eruptions, faithful$waiting, pre, col='red')
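A histogram is one way to check the residuals. As a possible complement, the sketch below uses a normal Q-Q plot and the Shapiro-Wilk test, both from base R. Choosing these two checks is an assumption of this example rather than something the slides prescribe.

```r
# Two further checks on the normality of residuals
fit <- lm(eruptions ~ waiting, data = faithful)
res <- residuals(fit)

# Q-Q plot: points close to the reference line suggest normal residuals
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```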

Page 21: Introduction to R

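Return p-value

lmp <- function(modelobject) {
  # Only models fitted with lm() carry the F statistic used below
  if (!inherits(modelobject, "lm")) stop("Not an object of class 'lm'")
  f <- summary(modelobject)$fstatistic
  # Upper-tail probability of the F statistic on its two degrees of freedom
  p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  attributes(p) <- NULL
  print(modelobject)
  return(p)
}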
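One way to sanity-check this calculation is to compare it with the p-value that anova() reports for the same model. The sketch below repeats the pf() step on the faithful fit; the object names are illustrative only.

```r
# Overall model p-value from the F statistic, compared with anova()
fit <- lm(eruptions ~ waiting, data = faithful)

f <- summary(fit)$fstatistic  # F value, numerator df, denominator df
p_from_f <- unname(pf(f[1], f[2], f[3], lower.tail = FALSE))

# anova() reports the same test for a single-predictor model
p_from_anova <- anova(fit)[["Pr(>F)"]][1]

# The p-values are tiny, so compare them as a ratio
p_ratio <- p_from_f / p_from_anova
p_ratio
```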

Page 22: Introduction to R

Model Fit

If the model does not fit, it may be because of:


Unmodelled covariates

Heteroscedasticity (residuals have unequal variance)

Clustering (residuals have lower variance within subgroups)

Autocorrelation (correlation between residuals at successive time points)
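Two of these problems can be screened for with base R alone. The sketch below is a rough check, not a formal test: it plots residuals against fitted values, measures whether residual spread changes with the fitted value, and computes the lag-1 autocorrelation of the residuals. The faithful model is used as an assumed example, and the autocorrelation figure is only meaningful if the rows are in time order.

```r
# Rough screening for heteroscedasticity and autocorrelation
fit <- lm(eruptions ~ waiting, data = faithful)
res <- residuals(fit)
fv  <- fitted(fit)

# Heteroscedasticity: the vertical spread should look even across the plot
plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Crude numeric check: do absolute residuals grow or shrink with fitted values?
spread_check <- cor.test(abs(res), fv, method = "spearman", exact = FALSE)
spread_check$estimate

# Autocorrelation: correlation of each residual with the previous one
res_acf   <- acf(res, plot = FALSE)
lag1_corr <- res_acf$acf[2]  # element 1 is lag 0, element 2 is lag 1
lag1_corr
```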

Page 23: Introduction to R

Model Fit

# library() loads one package per call
library(MASS)     # hills data
library(psych)
library(lattice)  # splom() and the panel functions
library(grid)
library(hexbin)   # panel.hexbinplot() and BTC
library(solaR)
data(hills)
splom(~hills)

data <- subset(hills, select=c('dist', 'time', 'climb'))
splom(hills,
      panel=panel.hexbinplot,
      colramp=BTC,
      diag.panel = function(x, ...){
        yrng <- current.panel.limits()$ylim
        d <- density(x, na.rm=TRUE)
        d$y <- with(d, yrng[1] + 0.95 * diff(yrng) * y / max(y))
        panel.lines(d)
        diag.panel.splom(x, ...)
      },
      lower.panel = function(x, y, ...){
        panel.hexbinplot(x, y, ...)
        panel.loess(x, y, ..., col = 'red')
      },
      pscale=0, varname.cex=0.7)

Page 24: Introduction to R

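Model Fit

mod2 = lm(time ~ dist, data=hills)
summary(mod2)
attach(hills)
co2 = coef(mod2)
plot(dist, time)
abline(co2)

fl = fitted(mod2)
# The loop body is missing from the slide; drawing each residual
# as a vertical line is an assumption
for (i in 1:35) lines(c(dist[i], dist[i]), c(time[i], fl[i]))

# Can you spot outliers?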


Page 25: Introduction to R

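Model Fit

attach(hills)
lm1 <- lm(time ~ dist)  # lm1 is not defined on the slides; assumed to be the same fit as mod2
plot(dist, time, ylim=c(0,250))
abline(coef(lm1))
identify(dist, time, labels=rownames(hills))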

Page 26: Introduction to R

What to do with outliers

Data-driven methods for the removal of outliers: some limitations

Fit a better model

Robust regression is an alternative to least squares regression when data are contaminated with outliers or influential observations

Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an observation's value on an independent variable deviates from that variable's mean.

Influence: An observation is influential if removing the observation substantially changes the estimate of the regression coefficients.

Cook's distance (or Cook's D): A measure that combines the information of leverage and residual of the observation.
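These three quantities are available from base R for any lm fit. The sketch below computes leverage and Cook's distance for the hills model used in this deck. The 4/n cut-off for flagging points is a common rule of thumb and an assumption of this example, not something from the slides.

```r
# Leverage and Cook's distance for the hill-race model
library(MASS)  # provides the hills data

fit  <- lm(time ~ dist, data = hills)
lev  <- hatvalues(fit)       # leverage of each race
cook <- cooks.distance(fit)  # combines leverage and residual size

# Races with the largest values of each measure
head(sort(lev,  decreasing = TRUE), 3)
head(sort(cook, decreasing = TRUE), 3)

# Rule-of-thumb flag (assumed threshold): Cook's D above 4 / n
n       <- nrow(hills)
flagged <- names(cook)[cook > 4 / n]
flagged
```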

Page 27: Introduction to R

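What to do with outliers

attach(hills)
summary(ols <- lm(time ~ dist))

opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
par(opar)
influence.measures(ols)

# Using M estimator
# The fitting calls are missing from the slide; the two lines below are assumptions
lm1 <- ols
rlm1 <- rlm(time ~ dist)  # rlm() is in MASS

plot(dist, time, ylim=c(0,250))
abline(coef(lm1))
abline(coef(rlm1), col="red")
identify(dist, time)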
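A self-contained sketch of the comparison, assuming rlm() from MASS with its default settings as the M-estimator: it fits both models to the hills data, puts the coefficients side by side, and lists the races that the robust fit down-weights most. The object names are illustrative only.

```r
# Ordinary least squares versus robust regression on the hills data
library(MASS)  # hills data and rlm()

ols_fit <- lm(time ~ dist, data = hills)
rob_fit <- rlm(time ~ dist, data = hills)

# Coefficients side by side
rbind(OLS = coef(ols_fit), robust = coef(rob_fit))

# Final weights from the robust fit: values below 1 mark
# observations that were down-weighted as outlying
w <- setNames(rob_fit$w, rownames(hills))
head(sort(w), 3)
```

Plotting both lines with abline(), as on the slide, shows how far the outlying races pull the least-squares line.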


Page 29: Introduction to R

Linear Modelling

I will be around for questions or, for a slower response, email me:

[email protected]