R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University...
-
Upload
tori-bermingham -
Category
Documents
-
view
219 -
download
2
Transcript of R for psychologists Professor Thom Baguley, Psychology, Nottingham Trent University...
R for psychologists
Professor Thom Baguley, Psychology, Nottingham Trent University
2
0. Overview
1. What is R?
2. Why use R?
3. R basics
4. R graphics
5. Linear models in R
6. ANOVA and ANCOVA
6. Writing your own R functions
8. Simulation and bootstrapping
3
1. What is R?
R is a software environment for statistical analysis
“the lingua franca of computational statistics”
De Leeuw & Mair (2007)
4
2. Why use R? It is:
- free- open source- cross-platform (Mac, Linux, PC)
It has:
- excellent basic statistics functions- powerful and versatile graphics
- hundreds of user-contributed packages
- a large community of users
5
3. R basics
Some common R objects:
- characters (text strings e.g., 'a' or ’1’)
- numbers (numbers like 2 or 1e+3)
- vectors (1D set of numbers or other elements)
- data frames (like vectors organized in columns)
- matrices (r by c array of numbers)
- lists (objects that contain other objects)
6
Assignment
Input:
> vector1 <- 6
> vector1
Output:
[1] 6
7
Arithmetic
> 6 + 6 * 2
[1] 18
> vector1^2
[1] 36
8
Calling functions
> c(1, 9, 25) > log(vector1)
[1] 1 9 25 [1] 1.791759
> rnorm(1, 100, 15) ?rnorm
[1] 123.5336 help(rnorm)
9
Loading data
> dat1 <- read.csv('pride.csv')
> library(foreign)
> h05 <- read.spss('Hayden_2005.sav', to.data.frame = TRUE)
10
Addressing the contents of an object
> vector1[1]
[1] 6
> days <- h05$days
> days[1:9]
[1] 2864 1460 2921 2921 2921 1460 2921 1460 31
11
Addressing the contents of an object
> vector1[1]
[1] 6
> days <- h05$days
> days[1:9]
[1] 2864 1460 2921 2921 2921 1460 2921 1460 31
Or combine the steps with h05$days[1:9]
12
Data frames
> is.data.frame(h05)
[1] TRUE
> dim(h05) > names(h05)
[1] 43 2 [1] "names" "days”
- this data frame has 43 rows and 2 named columns
- it can be helpful to think of a data frame as a set of named variables (vectors) bound into columns
13
Matrix objects (matrices)
> cells <- c(3677,56,3924,31)
> cat.names <- list(c("Before", "After"), c("Alive", "Dead"))
> checklist <- matrix(cells, 2, 2, byrow=TRUE, dimnames= cat.names)
> checklist
Alive Dead
Before 3677 56
After 3924 31
14
The power of objects
> chisq.test(checklist)
Pearson's Chi-squared test with Yates' continuity correction
data: checklist
X-squared = 8.1786, df = 1, p-value = 0.004239
> chisq.test(c(2, 18))
Chi-squared test for given probabilities
data: c(2, 18)
X-squared = 12.8, df = 1, p-value = 0.0003466
15
Calling functions: defaults and named arguments
> chisq.test(checklist, correct=FALSE)
Pearson's Chi-squared test
data: checklist
X-squared = 8.8072, df = 1, p-value = 0.003000
- the default argument (for matrix input is) correct=TRUE
- setting correct=FALSE over-rides the default
- because unnamed arguments to functions are matched in order, this argument must be named (otherwise naming the arguments is optional)
16
Exercise 1 (R Basics)
- entering data
- working with objects
- simple statistical functions
17
4. R graphics
n = 43
> boxplot(h05$days)
18
> hist(h05$days) > plot(density(h05$days))
19
Distribution functions
e.g., Normal distribution
dnorm(x, mean = 0, sd = 1) pnorm(q, mean = 0, sd = 1, lower.tail = TRUE) qnorm(p, mean = 0, sd = 1, lower.tail = TRUE) rnorm(n, mean = 0, sd = 1)
> rdat <- rnorm(30, 100, 15)
20
> curve(dnorm(x), xlim=c(-4,4), col='purple')> curve(dt(x, 1), col='red', add=TRUE, lty=3)
21
22
Exercise 2 (R Graphics)
- exploratory plots
- plot parameters
- plotting distribution functions
- plotting a serial position curve (optional)
23
5. Linear models in R
> expenses <- read.csv('expenses.csv')
> t.test(majority ~ problem, data = expenses)
Welch Two Sample t-test
data: majority by problem t = -3.7477, df = 505.923, p-value = 0.000199alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -2044.457 -638.160 sample estimates:mean in group 0 mean in group 1 7092.712 8434.021
Data courtesy of Mark Thompson http://markreckons.blogspot.com/
24
> expenses <- read.csv('expenses.csv')
> lm(majority ~ problem, data = expenses)
Call:
lm(formula = majority ~ problem, data = expenses)
Coefficients:
(Intercept) problem 7093 1341
25
> lm.out <- lm(majority ~ problem, data = expenses)
> max(rstandard(lm.out))
[1] 2.629733
> AIC(lm.out)
[1] 12674.85
> lm.out$coefficients
(Intercept) problem
7092.712 1341.308 > ?lm
26
> summary(lm.out)
Call:lm(formula = majority ~ problem, data = expenses)
Residuals: Min 1Q Median 3Q Max -8397.0 -3403.5 -260.9 3118.8 11543.3
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7092.7 218.9 32.397 < 2e-16 ***problem 1341.3 357.0 3.758 0.000187 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4395 on 644 degrees of freedomMultiple R-squared: 0.02145, Adjusted R-squared: 0.01993 F-statistic: 14.12 on 1 and 644 DF, p-value: 0.0001871
27
> glm(problem~I(majority/10000), family='binomial', data = expenses)
Call: glm(formula = problem ~ I(majority/10000), family = "binomial", data = expenses)
Coefficients:
(Intercept) I(majority/10000)
-1.0394 0.6878
Degrees of Freedom: 645 Total (i.e. Null); 644 Residual
Null Deviance: 855.5
Residual Deviance: 841.6 AIC: 845.6
28
> lrm.out <- glm(problem ~ I(majority/10000), family='binomial', data = expenses)
> plot(election$majority, lrm.out$fitted, col='dark green')
29
Exercise 3 (Linear models)
- using a formula in a call to a model function
- linear models
- plotting a regression line
- generalized linear models (optional)
- plotting confidence bands (optional)
30
6. ANOVA
> factor1 <- gl(10,4)> factor 1
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4Levels: 1 2 3 4
> library(foreign)> diag.data <- read.spss("diagram.sav", to.data.frame = T)
> diag.fit <- lm(descript ~ group, data=diag.data)
31
Regression model (dummy coding)
> diag.fit
Call:
lm(formula = descript ~ group, data = diag.data)
Coefficients:
(Intercept) groupPicture groupFull Diagram groupSegmented
17.8 0.5 4.6 9.5
32
aov()
> aov(diag.fit)
Call:
aov(formula = diag.fit)
Terms:
group Residuals
Sum of Squares 583.7 2440.2
Deg. of Freedom 3 36
33
anova()
> summary(aov(diag.fit))
> anova(diag.fit)
Analysis of Variance Table
Response: descript
Df Sum Sq Mean Sq F value Pr(>F)
group 3 583.7 194.567 2.8704 0.04977 *
Residuals 36 2440.2 67.783
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
34
Factorial ANOVA and ANCOVA
> lm(DV ~ factor1 + factor2 + factor1:factor2, data=data2)
> lm(DV ~ factor1 * factor2, data=data2)
> lm(descript ~ group + time, diag.data)
> lm(descript ~ 0 + group + scale(time, scale=F), diag.data)
> diag.ancov <- lm(descript ~ group + scale(time, scale=F), diag.data)
35
Type I and Type II Sums of squares
> drop1(diag.ancov, test = 'F')
Single term deletions
Model:
descript ~ group + scale(time, scale = F)
Df Sum of Sq RSS AIC F value Pr(F)
<none> 2154.4 169.46
group 3 821.45 2975.9 176.38 4.4484 0.009478 **
scale(time, scale = F) 1 285.78 2440.2 172.44 4.6427 0.038149 *
- sequential sums of squares (Type I) tests are given by default
- hierarchical sums of squares (Type II) tests are given by the drop1() function
[Software such as SAS and SPSS defaults to unique (Type III) SS]
36
Repeated measures (within-subjects) factors
- a weakness in the base R installation, but easily done using packages such as ez, nlme or lme4 (the latter two able to fit a wider range of repeated measures models)
> pride.long <- read.csv("pride_long.csv")
> pride.mixed <- aov(accuracy ~ emotion*condition + Error(factor(participant)/emotion), pride.long)
> summary(pride.mixed)
[Software such as SAS and SPSS defaults to unique (Type III) SS]
37
Multiple comparisons and contrasts
- you can run contrasts by changing default contrasts in R
- Fisher LSD, Tukey’s HSD and p adjustment (e.g., Hochberg, Holm, Bonferroni, FDR) in R base installation
- powerful modified Bonferroni (e.g., Shaffer, Westfall) and general linear contrasts available in multcomp package
- flexible contrast and estimable functions in gmodels package
38
Exercise 4 (ANOVA)
- factors
- aov() and anova()
- ANCOVA
- repeated measures (within-subjects)
- contrasts and multiple comparisons (optional)
39
7. Writing your own functions
sd.pool <- function(sd.1, n.1, sd.2, n.2 = n.1){num <- (n.1-1)*sd.1^2+(n.2-1)*sd.2^2denom <- n.1+n.2-2output <- sqrt(num/denom)output}
> sd.pool(6.1, 20, 5.3, 25)[1] 5.66743
40
Exercise 5 (Write a function)
- pick a simple statistical procedure
- write a function to automate it
e.g., pooled standard deviation
41
8. Simulation and bootstrapping
- distribution functions
> rnorm(100, 10, 1)
- R boot package
- sample() and replicate()
42
Bootstrap CIs for a median or timmed mean
43
The bootstrap (and other Monte Carlo methods)
Bootstrapping involves repeatedly resampling with replacement from a data set
e.g., Data set = {0,1}
7 simulated samples:
{1,0}, {0,1}, {0,0}, {0,0}, {0,1}, {0,0}, {1,1}
Bootstrapping effectively treats the sample distribution as the population distribution
> replicate(7, sample(x,2,replace=TRUE))
44
Bootstrapping using R
> library(boot)> x1.boot <- boot(samp,function(x,i) median(x[i]),R=10000)> boot.ci(x1.boot)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONSBased on 10000 bootstrap replicates
CALL : boot.ci(boot.out = x1.boot)
Intervals : Level Normal Basic 95% (-0.1747, 0.6948 ) (-0.1648, 0.6614 )
Level Percentile BCa 95% (-0.1538, 0.6724 ) (-0.2464, 0.6620 ) Calculations and Intervals on Original Scale
Normal 95% CI for mean (using t) [-2.00, 1.33]
45
Exercise 6 (Bootstrapping)
- resample data
- construct a simple percentile bootstrap
- run a BCa bootstrap using the boot package (optional)
46
47
9. Advanced statistical modelling in R
Multilevel modeling nlme package
lme4 package
MCMCglmm package
Meta-analysis meta package
metafor package
… and many, many more other packages
48
metafor package
> m.e <- c(10.7, 16.24, 10.03, 3.65, 5.73)
> n.e <- c(31, 57, 9, 17, 10)
> m.c <- c(2.87, 6.83, 7.18, 4.65, 7.47)
> n.c <- c(14, 52, 9, 18, 10)
> sd.pooled <- c(7.88, 9.84, 8.67, 6.34, 5.74)
> diff <- m.c - m.e
> se.diffs <- sd.pooled * sqrt(1/n.e + 1/n.c)
49
> install.packages('metafor')> library(metafor)> meta.out <- rma(yi=diff, sei=se.diffs, method = 'FE')> meta.out
Fixed-Effects Model (k = 5)
Test for Heterogeneity: Q(df = 4) = 21.0062, p-val = 0.0003
Model Results:
estimate se zval pval ci.lb ci.ub -4.1003 1.0750 -3.8141 0.0001 -6.2073 -1.9933 ***
---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
50
> funnel(trimfill(meta.out))
51
> forest(rma(yi=diff, sei=se.diffs))
52
53
54
55
56
57