Plyr, one data analytic strategy

Post on 01-Nov-2014

plyr: One data-analytic strategy

Hadley Wickham, Rice University

Friday, 29 May 2009

1. Motivation: Deseasonalising ozone measurements

2. Outline of strategy: split-apply-combine

3. Specifics: input vs. output

4. Fiddly details

5. Thoughts on data analysis

[Figure: the 24 x 24 grid of locations; horizontal axis −110 to −60, vertical axis −20 to 30]

24 x 24 x 72 = 41,472

[Figure: value plotted against time (time from 0.0 to 1.0, value from 0.3 to 1.0)]

[Figure: resid(deseas1) + mean(one$value) plotted against time]

How can we do this for all 24 x 24 locations?

(assume ozone levels stored in a 24 x 24 x 72 array)

With a for loop

    # Pre-allocate a 24 x 24 list-matrix of models and an array of residuals
    models <- as.list(rep(NA, 24 * 24))
    dim(models) <- c(24, 24)
    deseas <- array(NA, c(24, 24, 72))
    dimnames(deseas) <- dimnames(ozone)

    # Fit one model per location and store its residuals
    for (i in seq_len(24)) {
      for (j in seq_len(24)) {
        mod <- deseasf(ozone[i, j, ])
        models[[i, j]] <- mod
        deseas[i, j, ] <- resid(mod)
      }
    }

With apply

    # models is a 24 x 24 list-matrix; the residuals come back time-first
    models <- apply(ozone, 1:2, deseasf)
    resids <- unlist(lapply(models, resid))
    dim(resids) <- c(72, 24, 24)
    # Move the time dimension to the end so the layout matches ozone
    deseas <- aperm(resids, c(2, 3, 1))
    dimnames(deseas) <- dimnames(ozone)
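As a sanity check, the loop and apply versions can be compared on a small synthetic array. This is only a sketch: the array `ozone_toy`, its 4 x 4 x 24 shape, and the `lm`-based `deseasf_toy` are stand-ins invented for illustration (the slides do not show `deseasf`, and mention rlm models rather than `lm`).

```r
# Hypothetical stand-in data: a 4 x 4 grid of locations, 24 monthly values each.
set.seed(1)
ozone_toy <- array(rnorm(4 * 4 * 24), c(4, 4, 24))
month <- factor(rep(1:12, length.out = 24))

# Stand-in for deseasf: one mean per calendar month, fitted with lm
# (an assumption made for this sketch).
deseasf_toy <- function(value) lm(value ~ month - 1)

# Loop version: fill the residual array one location at a time.
deseas_loop <- array(NA_real_, dim(ozone_toy))
for (i in seq_len(4)) {
  for (j in seq_len(4)) {
    deseas_loop[i, j, ] <- resid(deseasf_toy(ozone_toy[i, j, ]))
  }
}

# apply version: models is a 4 x 4 list-matrix of fitted models.
models <- apply(ozone_toy, 1:2, deseasf_toy)
resids <- unlist(lapply(models, resid))
dim(resids) <- c(24, 4, 4)                  # time varies fastest after unlist()
deseas_apply <- aperm(resids, c(2, 3, 1))   # move time to the last dimension

all.equal(deseas_loop, deseas_apply)
```

The `aperm()` step is the part that is easy to get wrong: `unlist()` stacks the residuals with time varying fastest, so the time dimension has to be moved from first to last.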

With plyr

    models <- aaply(ozone, 1:2, deseasf)
    deseas <- aaply(models, 1:2, resid)

Succinct, but you need to know what aaply does

cf. onomatopoeia, schadenfreude, soliloquy

[Figure: avg at each location on the grid (legend from 250 to 310); horizontal axis −110 to −60, vertical axis −20 to 30]

Many problems involve splitting up a large data structure, operating on each piece and joining the results back together:

split-apply-combine
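As a rough illustration of the three steps using only base R (the data frame `scores` and its columns are made up for this sketch):

```r
# Hypothetical toy data, invented for illustration.
scores <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  x     = c(1, 3, 2, 4, 6)
)

# Split: one data frame per group.
pieces <- split(scores, scores$group)

# Apply: summarise each piece independently.
summaries <- lapply(pieces, function(df) {
  data.frame(group = df$group[1], n = nrow(df), mean_x = mean(df$x))
})

# Combine: stack the per-group results into a single data frame.
result <- do.call(rbind, summaries)
result
```

Each step is a separate line here; the plyr functions fold all three into one call.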

How you split up depends on the type of input: arrays, data frames, lists

How you combine depends on the type of output: arrays, data frames, lists, nothing

Rows give the input type, columns give the output type:

input \ output   array    data frame   list     nothing
array            aaply    adply        alply    a_ply
data frame       daply    ddply        dlply    d_ply
list             laply    ldply        llply    l_ply

The same table, with base R functions shown where an equivalent exists:

input \ output   array    data frame   list     nothing
array            apply    adply        alply    a_ply
data frame       daply    aggregate    by       d_ply
list             sapply   ldply        lapply   l_ply
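A short sketch of the naming scheme for list input, where the first letter of the function name is the input type and the second is the output type. The list `samples` is hypothetical, and the plyr package is assumed to be installed.

```r
library(plyr)

# Hypothetical list input, invented for illustration.
samples <- list(a = c(1, 2, 3), b = c(10, 20))

# Same input, same idea, different output containers.
as_list  <- llply(samples, mean)            # list -> list
as_array <- laply(samples, mean)            # list -> array (here a plain vector)
as_df    <- ldply(samples, function(x) {    # list -> data frame
  data.frame(n = length(x), mean = mean(x))
})

# list -> nothing: called only for its side effect (printing).
l_ply(samples, function(x) cat(length(x), "values\n"))
```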

Friday, 29 May 2009

21

1

2 1,2

Split: array, data frame, list

Friday, 29 May 2009

3

21

1 2 3

1,2 1,3 2,31,2,3

Split: array, data frame, list

    models <- aaply(ozone, 1:2, deseasf)
    deseas <- aaply(models, 1:2, resid)

Take the 3d array and split it up by its first two dimensions.

Splitting up ozone gives 576 vectors of length 72. Splitting up models gives 576 rlm models.

How are they combined?

Combine: array, data frame, list

4D!
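A small sketch of how aaply sizes its output: the split dimensions come first, followed by the dimensions of each result. Matrix-valued results from a two-dimensional split therefore give a four-dimensional array, which may be what the "4D!" slide alludes to. The array `arr` is hypothetical, and the plyr package is assumed to be installed.

```r
library(plyr)

# Hypothetical 3 x 4 x 5 array, invented for illustration.
arr <- array(seq_len(3 * 4 * 5), c(3, 4, 5))

# Each piece is a vector of length 5 and each result is a vector of length 5:
# 2 split dimensions + 1 result dimension.
centred <- aaply(arr, 1:2, function(v) v - mean(v))
dim(centred)

# Each result is now a 5 x 5 matrix:
# 2 split dimensions + 2 result dimensions.
outer_prod <- aaply(arr, 1:2, function(v) v %o% v)
dim(outer_prod)
```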

Split: array, data frame, list

The original data frame:

name     age  sex
John     13   Male
Mary     15   Female
Alice    14   Female
Peter    13   Male
Roger    14   Male
Phyllis  13   Female

Split by .(sex):

name     age  sex
John     13   Male
Peter    13   Male
Roger    14   Male

name     age  sex
Mary     15   Female
Alice    14   Female
Phyllis  13   Female

Split by .(age):

name     age  sex
John     13   Male
Peter    13   Male
Phyllis  13   Female

name     age  sex
Alice    14   Female
Roger    14   Male

name     age  sex
Mary     15   Female

Combine: array, data frame, list

Applying nrow to each piece

.(sex)

sex     value
Male    3
Female  3

.(age)

age  value
13   3
14   2
15   1

.(sex, age)

sex     age  value
Male    13   2
Male    14   1
Female  13   1
Female  14   1
Female  15   1
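The counts can be reproduced with ddply on the six-row data frame from the slides. This is a sketch assuming the plyr package is installed; the name `people` is made up here, and ddply labels the count column `V1` rather than `value`.

```r
library(plyr)

# The six-row data frame shown on the slides.
people <- data.frame(
  name = c("John", "Mary", "Alice", "Peter", "Roger", "Phyllis"),
  age  = c(13, 15, 14, 13, 14, 13),
  sex  = c("Male", "Female", "Female", "Male", "Male", "Female")
)

# Split by the variables in .(), apply nrow to each piece,
# and combine the counts into a data frame.
by_sex     <- ddply(people, .(sex), nrow)
by_age     <- ddply(people, .(age), nrow)
by_sex_age <- ddply(people, .(sex, age), nrow)

by_sex
by_age
by_sex_age
```

Only the combinations of sex and age that occur in the data appear in `by_sex_age`, so it has five rows rather than six.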

Case study: Baseball

id        year  team    g   ab    r    h
ruthba01  1914  BOS     5   10    1    2
ruthba01  1915  BOS    42   92   16   29
ruthba01  1916  BOS    67  136   18   37
ruthba01  1917  BOS    52  123   14   40
ruthba01  1918  BOS    95  317   50   95
ruthba01  1919  BOS   130  432  103  139
ruthba01  1920  NYA   142  457  158  172
ruthba01  1921  NYA   152  540  177  204
ruthba01  1922  NYA   110  406   94  128
ruthba01  1923  NYA   152  522  151  205
ruthba01  1924  NYA   153  529  143  200
ruthba01  1925  NYA    98  359   61  104
ruthba01  1926  NYA   152  495  139  184
ruthba01  1927  NYA   151  540  158  192
ruthba01  1928  NYA   154  536  163  173
ruthba01  1929  NYA   135  499  121  172

21,699 records

1228 players

15-31 years for each player

How does performance (rbi/ab) change over the course of a career?

First, add a column that gives the “career year”.

Easy for a single player:

    baberuth <- subset(baseball, id == "ruthba01")
    baberuth <- transform(baberuth, cyear = year - min(year) + 1)

For many players, use ddply + transform:

    baseball <- ddply(baseball, "id", transform, cyear = year - min(year) + 1)
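The ddply + transform idiom can be tried on a tiny data set. This is a sketch assuming the plyr package is installed; the data frame `careers` and its player ids are invented and only reuse the column names from the slides. A base R cross-check with `ave()` is included for comparison.

```r
library(plyr)

# Hypothetical mini data set with the same column names as the slides.
careers <- data.frame(
  id   = c("p1", "p1", "p1", "p2", "p2"),
  year = c(1914, 1915, 1916, 1950, 1952)
)

# For each id, transform() sees only that player's rows,
# so min(year) is the player's first season.
careers <- ddply(careers, "id", transform, cyear = year - min(year) + 1)

# Base R cross-check: the same per-group calculation with ave().
base_cyear <- with(careers, ave(year, id, FUN = function(y) y - min(y) + 1))

careers
```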

Draw time series for all 1228 players

    library(plyr)     # d_ply, failwith
    library(ggplot2)  # qplot

    baseball <- subset(baseball, ab >= 25)

    # Common axis limits so that every player's plot is comparable
    xlim <- range(baseball$cyear, na.rm = TRUE)
    ylim <- range(baseball$rbi / baseball$ab, na.rm = TRUE)

    plotpattern <- function(df) {
      qplot(cyear, rbi / ab, data = df, geom = "line",
            xlim = xlim, ylim = ylim)
    }

    # One page per player; failwith() returns NA instead of stopping on an error
    pdf("paths.pdf", width = 8, height = 4)
    d_ply(baseball, .(reorder(id, rbi / ab)), failwith(NA, plotpattern),
          .print = TRUE)
    dev.off()
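The role of failwith in the call above can be seen in isolation. In this sketch (plyr assumed installed), `strict_sqrt` is a made-up function that fails on some inputs; wrapping it lets the whole run finish, with a default value marking the failures.

```r
library(plyr)

# Hypothetical function that errors on some inputs.
strict_sqrt <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

# failwith() wraps a function so that an error yields the default value (NA)
# instead of aborting; quiet = TRUE suppresses the error message.
safe_sqrt <- failwith(NA, strict_sqrt, quiet = TRUE)

roots <- laply(c(4, -1, 9), safe_sqrt)
roots
```

With many pieces, one bad piece would otherwise stop the loop partway through.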

[Figure: histogram of rsquare (count from 0 to 200, rsquare from 0.0 to 1.0)]

[Figure: intercept plotted against slope, coloured by rsquare (0 to 1); a second panel zooms in on slopes between −0.010 and 0.010]
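The slides do not show how slope, intercept and rsquare were computed. One plausible route, sketched here on invented data, is to fit a linear model per player with dlply and then pull the coefficients into a data frame with ldply. The data frame `toy`, the column `rate` and the model form `rate ~ cyear` are assumptions made for illustration; plyr is assumed to be installed.

```r
library(plyr)

# Hypothetical toy careers with invented numbers.
toy <- data.frame(
  id    = rep(c("p1", "p2", "p3"), each = 5),
  cyear = rep(1:5, times = 3),
  rate  = c(0.10, 0.12, 0.15, 0.14, 0.18,
            0.20, 0.19, 0.17, 0.18, 0.15,
            0.05, 0.09, 0.08, 0.12, 0.11)
)

# Data frame -> list: one linear model per player (assumed model form).
models <- dlply(toy, .(id), function(df) lm(rate ~ cyear, data = df))

# List -> data frame: intercept, slope and R-squared for each model.
coefs <- ldply(models, function(mod) {
  c(intercept = unname(coef(mod)[1]),
    slope     = unname(coef(mod)[2]),
    rsquare   = summary(mod)$r.squared)
})
coefs
```

A data frame like `coefs` is the kind of input the histogram and scatterplots need: one row per player.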

Fiddly details

Labelling

Progress bars

Consistent argument names

Missing values / Nulls
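Two of these details can be seen in a small sketch (plyr assumed installed): the dotted argument names such as `.data`, `.fun` and `.progress` are shared across the family, and `.progress = "text"` draws a text progress bar in the console. The function `slow_square` is a made-up stand-in for a slow step.

```r
library(plyr)

# Hypothetical stand-in for a slow computation.
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

# The same argument names work for laply, llply, ldply and the rest.
squares <- laply(.data = 1:20, .fun = slow_square, .progress = "text")
squares
```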

Data analysis

What other patterns of data analysis are waiting to be discovered?

How can we identify these strategies and then develop software to support them?

Does teaching these patterns make it easier for novices to become experts?

http://had.co.nz/plyr
