plyr: One data-analytic strategy
Hadley Wickham, Rice University
Friday, 29 May 2009
1. Motivation: Deseasonalising ozone measurements
2. Outline of strategy: split-apply-combine
3. Specifics: input vs. output
4. Fiddly details
5. Thoughts on data analysis
[Figure: 24 x 24 spatial grid of ozone time series; horizontal axis −110 to −60, vertical axis −20 to 30]
24 x 24 x 72 = 41,472
[Figure: value (0.3 to 1.0) against time (0.0 to 1.0)]
[Figure: resid(deseas1) + mean(one$value) (0.3 to 1.0) against time (0.0 to 1.0)]
How can we do this for all 24 x 24 locations?
(assume ozone levels stored in a 24 x 24 x 72 array)
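The slides call deseasf without showing its definition. A minimal sketch of the setup, under loud assumptions: the array below is simulated rather than the real ozone data, and deseasf is a hypothetical version that fits one mean per calendar month with lm from base R. The deck later refers to rlm models, so lm is a stand-in chosen only to avoid a package dependency.

```r
# Simulated stand-in for the 24 x 24 x 72 ozone array (not the real data).
set.seed(1)
month <- factor(rep(1:12, length.out = 72))
seasonal <- sin(2 * pi * as.integer(month) / 12)  # hypothetical seasonal signal
ozone <- array(
  rep(seasonal, each = 24 * 24) + rnorm(24 * 24 * 72, sd = 0.1),
  dim = c(24, 24, 72),
  dimnames = list(lat = 1:24, long = 1:24, time = 1:72)
)

# Hypothetical deseasf: one mean per calendar month, fitted by least squares.
# (lm is an assumption made to stay in base R; the deck mentions rlm models.)
deseasf <- function(value) lm(value ~ month - 1)

# Deseasonalise a single location: the residuals are what is left
# once the monthly pattern has been removed.
one_loc <- ozone[1, 1, ]
deseas1 <- deseasf(one_loc)
length(resid(deseas1))
```

With these two objects defined, the for-loop, apply and plyr versions of the code can all be run against the same input.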
With a for loop:
# Pre-allocate a 24 x 24 list-matrix of models and an array of residuals
models <- as.list(rep(NA, 24 * 24))
dim(models) <- c(24, 24)
deseas <- array(NA, c(24, 24, 72))
dimnames(deseas) <- dimnames(ozone)
# Fit one model per location and store its residuals
for (i in seq_len(24)) {
  for (j in seq_len(24)) {
    mod <- deseasf(ozone[i, j, ])
    models[[i, j]] <- mod
    deseas[i, j, ] <- resid(mod)
  }
}
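With apply:
# One model per location, stored as a 24 x 24 list-matrix
models <- apply(ozone, 1:2, deseasf)
# Residuals come back as one long vector with time varying fastest
resids <- unlist(lapply(models, resid))
dim(resids) <- c(72, 24, 24)
# Move the time dimension back to the end to match ozone
deseas <- aperm(resids, c(2, 3, 1))
dimnames(deseas) <- dimnames(ozone)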
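The dim and aperm steps are the least obvious part of the apply version. A small sketch of why they are needed, using a made-up 2 x 3 x 4 array in place of ozone and an identity function in place of the model fit:

```r
# Small stand-in array: 2 x 3 locations, 4 time points each.
x <- array(1:24, dim = c(2, 3, 4))

# Apply a function that returns a length-4 vector for every [i, j, ] slice.
# apply() puts the function's output first, so the result is 4 x 2 x 3.
res <- apply(x, 1:2, function(v) v)
dim(res)

# aperm() moves that leading dimension to the end, restoring 2 x 3 x 4.
back <- aperm(res, c(2, 3, 1))
dim(back)
all(back == x)
```

The same reordering is what `aperm(resids, c(2, 3, 1))` does for the 72 x 24 x 24 array of residuals.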
With plyr:
models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)
Succinct, but you need to know what aaply does
cf. onomatopoeia, schadenfreude, soliloquy
[Figure: map with horizontal axis −110 to −60 and vertical axis −20 to 30; legend: avg, 250 to 310]
Many problems involve splitting up a large data structure, operating on each piece and joining the results back together:
split-apply-combine
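The three steps can be sketched by hand in base R. This is only an illustration of the pattern on a small made-up data frame, not code from the talk:

```r
# A small hypothetical data set: three groups of measurements.
df <- data.frame(
  group = c("a", "a", "b", "b", "b", "c"),
  value = c(1, 3, 2, 4, 6, 10)
)

# Split: one data frame per group.
pieces <- split(df, df$group)

# Apply: summarise each piece independently of the others.
summaries <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1], n = nrow(piece), mean = mean(piece$value))
})

# Combine: stack the per-piece results back into one data frame.
result <- do.call(rbind, summaries)
result
```

Each step is a separate line here; the plyr functions fold all three into a single call.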
How you split up depends on the type of input: arrays, data frames, lists
How you combine depends on the type of output: arrays, data frames, lists, nothing
Input (rows) by output (columns):
             array   data frame   list    nothing
array        aaply   adply        alply   a_ply
data frame   daply   ddply        dlply   d_ply
list         laply   ldply        llply   l_ply
The same table with base R functions substituted where one exists:
             array    data frame   list     nothing
array        apply    adply        alply    a_ply
data frame   daply    aggregate    by       d_ply
list         sapply   ldply        lapply   l_ply
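To make the list row concrete, here is a base R sketch on a made-up list. The plyr names in the comments are the corresponding cells of the table; the data frame case has no single base R function, so it is assembled by hand as an illustration.

```r
# Hypothetical input: a named list of numeric vectors.
x <- list(a = 1:3, b = 4:6, c = 7:9)

# list in, list out (plyr cell: llply)
as_list <- lapply(x, mean)

# list in, vector/array out (plyr cell: laply)
as_array <- sapply(x, mean)

# list in, data frame out (plyr cell: ldply); built by hand in base R
as_df <- do.call(rbind, lapply(names(x), function(nm) {
  data.frame(id = nm, mean = mean(x[[nm]]))
}))

str(as_list)
as_array
as_df
```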
Split: array, data frame, list
[Diagram: a 2-d array can be split by dimension 1, 2, or 1,2]
[Diagram: a 3-d array can be split by dimension 1, 2, 3, 1,2, 1,3, 2,3, or 1,2,3]
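How many pieces a split produces, and how big each piece is, follows directly from the dimensions chosen. A sketch with base R's apply on a made-up 2 x 3 x 4 array, using length as the function so that each result is the size of one piece:

```r
# Hypothetical 3-d array with dimensions 2 x 3 x 4.
x <- array(1:24, dim = c(2, 3, 4))

# Split by dimension 1: 2 pieces, each a 3 x 4 slice (12 values).
by_1 <- apply(x, 1, length)

# Split by dimensions 1,2: 2 x 3 = 6 pieces, each a vector of 4 values.
by_12 <- apply(x, c(1, 2), length)

# Split by dimensions 1,2,3: 24 pieces, each a single value.
by_123 <- apply(x, 1:3, length)

by_1
by_12
dim(by_123)
```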
models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)
Take a 3-d array and split it up by its first two dimensions.
Splitting up ozone gives 576 vectors of length 72. Splitting up models gives 576 rlm models.
How are they combined?
4D!
Combine: array, data frame, list
Split: array, data frame, list

Original data frame:
name     age  sex
John     13   Male
Mary     15   Female
Alice    14   Female
Peter    13   Male
Roger    14   Male
Phyllis  13   Female

Split by .(sex):
name     age  sex
John     13   Male
Peter    13   Male
Roger    14   Male

name     age  sex
Mary     15   Female
Alice    14   Female
Phyllis  13   Female

Split by .(age):
name     age  sex
John     13   Male
Peter    13   Male
Phyllis  13   Female

name     age  sex
Alice    14   Female
Roger    14   Male

name     age  sex
Mary     15   Female
Combine: array, data frame, list
Applying nrow to each piece:

.(sex)
sex     value
Male    3
Female  3

.(age)
age  value
13   3
14   2
15   1

.(sex, age)
sex     age  value
Male    13   2
Male    14   1
Female  13   1
Female  14   1
Female  15   1
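The counts can be reproduced in base R from the six-row data frame shown on the split slide. The helper count_rows is a hypothetical name introduced for this sketch; in plyr the grouping would be written with .(sex), .(age) or .(sex, age).

```r
# The example data frame from the slides.
people <- data.frame(
  name = c("John", "Mary", "Alice", "Peter", "Roger", "Phyllis"),
  age  = c(13, 15, 14, 13, 14, 13),
  sex  = c("Male", "Female", "Female", "Male", "Male", "Female")
)

# Hypothetical helper: split by one or more columns, then apply nrow
# to each piece. drop = TRUE discards empty combinations such as
# 15-year-old males.
count_rows <- function(df, vars) {
  pieces <- split(df, df[vars], drop = TRUE)
  sapply(pieces, nrow)
}

by_sex  <- count_rows(people, "sex")
by_age  <- count_rows(people, "age")
by_both <- count_rows(people, c("sex", "age"))

by_sex
by_age
by_both
```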
Case study: Baseball
id year team g ab r h
ruthba01 1914 BOS 5 10 1 2
ruthba01 1915 BOS 42 92 16 29
ruthba01 1916 BOS 67 136 18 37
ruthba01 1917 BOS 52 123 14 40
ruthba01 1918 BOS 95 317 50 95
ruthba01 1919 BOS 130 432 103 139
ruthba01 1920 NYA 142 457 158 172
ruthba01 1921 NYA 152 540 177 204
ruthba01 1922 NYA 110 406 94 128
ruthba01 1923 NYA 152 522 151 205
ruthba01 1924 NYA 153 529 143 200
ruthba01 1925 NYA 98 359 61 104
ruthba01 1926 NYA 152 495 139 184
ruthba01 1927 NYA 151 540 158 192
ruthba01 1928 NYA 154 536 163 173
ruthba01 1929 NYA 135 499 121 172
21,699 records
1228 players
15-31 years for each player
How does performance (rbi/ab) change over the course of a career?
First need to add column that gives “career year”
Easy for a single player.
baberuth <- subset(baseball, id == "ruthba01")
baberuth <- transform(baberuth, cyear = year - min(year) + 1)
For many players, use ddply + transform
baseball <- ddply(baseball, "id", transform, cyear = year - min(year) + 1)
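For comparison, the same per-player calculation can be sketched in base R with ave, which applies a function within groups and returns a vector aligned with the input rows. The data frame below is a tiny stand-in: the first four rows use years from the Babe Ruth table, and playerx01 is a hypothetical second player added so that there is more than one group.

```r
# Tiny stand-in for the baseball data (playerx01 is hypothetical).
baseball_small <- data.frame(
  id   = c(rep("ruthba01", 4), rep("playerx01", 3)),
  year = c(1914, 1915, 1916, 1917, 1950, 1951, 1953)
)

# Career year: years since each player's own first season, starting at 1.
# ave() computes min(year) separately within each id.
baseball_small$cyear <- with(
  baseball_small,
  year - ave(year, id, FUN = min) + 1
)

baseball_small
```

Note that a skipped season (1952 for the hypothetical player) leaves a gap in cyear, which is also how the ddply version behaves since both subtract the first year.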
Draw time series for all 1228 players:
baseball <- subset(baseball, ab >= 25)
xlim <- range(baseball$cyear, na.rm = TRUE)
ylim <- range(baseball$rbi / baseball$ab, na.rm = TRUE)
plotpattern <- function(df) {
  qplot(cyear, rbi / ab, data = df, geom = "line",
    xlim = xlim, ylim = ylim)
}
pdf("paths.pdf", width = 8, height = 4)
d_ply(baseball, .(reorder(id, rbi / ab)), failwith(NA, plotpattern),
  .print = TRUE)
dev.off()
[Figure: histogram of rsquare (0.0 to 1.0); vertical axis: count (0 to 200)]
[Figure: two scatterplots of slope and intercept, with rsquare (0.00 to 1.00) shown in the legend; the second covers a narrower range of values]
Fiddly details
Labelling
Progress bars
Consistent argument names
Missing values / Nulls
Data analysis
What other patterns of data analysis are waiting to be discovered?
How can we identify these strategies and then develop software to support them?
Does teaching these patterns make it easier for novices to become experts?