Starting R: An Example of Panel Data - Macalester College, kaplan/startingwithr/panel-data.pdf

Starting R: An Example of Panel Data

Danny Kaplan

March 11, 2009

Longitudinal versus Cross-Sectional

CAUTION: Most of this example is about data re-organization. There is some grungy programming. The intention is just to show you some capabilities and give you some examples for your own reference. The statistical analysis is mostly in one slide at the end.

- Cross-sectional data is a snapshot of a population at one time.
- Longitudinal data repeats measurements over time for each individual.
- Other related names: repeated measures, panel data.

An Example: How Runners Age

The data set “ten-mile-race.csv” contains times from the Cherry Blossom Ten Miler run in 2005 in Washington, DC. The variables are:

- net: the time from the start line to the finish line, in seconds
- gun: the time from the start gun to the finish line, in seconds
- sex
- age: the age of the runner
- state: where the runner comes from

Cross-sectional Analysis

How does the net time for runners depend on their age?

> run2005 = read.csv("/Users/kaplan/kaplanfiles/stats-book/DataSets/ten-mile-race.csv")

> m1 = lm(net ~ age, data = run2005)

             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5297.2192    37.6056  140.86   0.0000
age            8.1899     0.9806    8.35   0.0000

Sex is an obvious covariate.

> m2 = lm(net ~ age + sex, data = run2005)

             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5339.1554    35.0487  152.34   0.0000
age           16.8936     0.9444   17.89   0.0000
sexM        -726.6195    20.0181  -36.30   0.0000

Cross-sectional Critique

- The previous analysis didn’t actually involve any individual person’s ageing. Instead, it compared different people of different ages.
- Perhaps this introduces a bias. It might be that the older runners who continue running tend to be the faster runners. After all, it’s discouraging to find yourself being passed by more and more runners as you age. The runners thus discouraged might drop out.
- Fortunately, there is a source of longitudinal data: the race has been run for 10 years and the results have been published on the Internet each year. The data include the name of the runner and so give some possibility to identify individual runners from year to year.
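
The drop-out bias described in this critique can be illustrated with a small simulation. This is only a sketch on made-up data, not the race results: the 0.8 minutes-per-year slowing rate, the spread of baseline times, and the rule that runners slower than 100 minutes stop entering are all assumptions chosen for illustration.

```r
set.seed(1)
n_runners  <- 2000
true_slope <- 0.8                                  # assumed: minutes slower per year of age

# Hypothetical runners: a personal baseline time and a current age
base <- rnorm(n_runners, mean = 80, sd = 12)       # minutes at age 30
age  <- sample(20:60, n_runners, replace = TRUE)
net  <- base + true_slope * (age - 30) + rnorm(n_runners, sd = 3)

# Assumed drop-out rule: anyone slower than 100 minutes stops entering
stays <- net < 100
cross <- data.frame(age = age[stays], net = net[stays])

# Cross-sectional slope among the runners who remain
cs_slope <- unname(coef(lm(net ~ age, data = cross))["age"])
c(true = true_slope, cross_sectional = cs_slope)
```

Because the slower older runners are missing from the sample, the cross-sectional slope comes out smaller than the rate at which each simulated runner actually slows.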

What the data look like.

Each year’s format is slightly different, but years tend to look like this, with separate files for men and women.

Credit Union Cherry Blossom 10 Mile Road Race
Washington, DC
Sunday, April 4, 2004
Official Men's Results

Place Div/Tot   Num   Name            Ag Hometown  Net     Gun
===== ========= ===== =============== == ========= ======= =======
    1    1/2242 13997 Nelson Kiplagat 25 KEN         48:12   48:12
    2    2/2242    39 Samuel Ndereba  25 KEN         48:12   48:14
... and so on ...
 4145    27/428 12663 Stephen Johnson 46 Oakton VA 2:21:35 2:26:23
 4146 2241/2242 12678 George Harrell  30 Laurel MD 2:22:50 2:26:29

Re-Organizing the Data

- Read it in from the separate files and put them all in a data-frame format.
- Give a unique identifier to each runner, across the years, to support the longitudinal analysis.
- Count how many times each runner participated to extract subsets of the data.

With the re-organized data, we can construct the longitudinal analysis.

Reading in the Data I

For each year’s format, write a special-purpose operator that parses the data and puts it in a data frame format. This is nuts-and-bolts computer programming, not so interesting to most people. Example:

read1999 = function(n=-1){
  # read the women's and men's result pages (n = -1 reads every line)
  dF = readLines('cb99f.htm', n=n)
  dM = readLines('cb99m.htm', n=n)
  # drop the four header lines, then split each line into fields
  M = breaklines1999(dM[-(1:4)])
  F = breaklines1999(dF[-(1:4)])
  res = data.frame( rbind(F,M),
    sex=c(rep('F', nrow(F)), rep('M', nrow(M))))
  return(res)
}

Reading in the Data II

breaklines1999 = function(s) {
  # fixed format
  first = substr(s,1,5)
  first = as.numeric(kill.blanks(first))
  second = substr(s,6,10)
  second = as.numeric(kill.blanks(second))
  third = substr(s,12,15)
  third = as.numeric(kill.blanks(third))

  nm = substr(s,17,37)
  nm = kill.blanks(nm)
  age = as.numeric(substr(s,39,40))
  place = substr(s,42,59)
  place = kill.blanks(place)
  gun = substr(s,61,67)
  gun = to.minutes(gun)
  net = rep(NA, length(gun))

Reading in the Data III

  return(data.frame(position=first, division=second, total=third,
                    name=nm, age=age, place=place, net=net, gun=gun,
                    stringsAsFactors=FALSE) )
}
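
The helper kill.blanks is used in breaklines1999 but is not shown in the slides. A minimal stand-in, assuming it only strips leading and trailing whitespace (the author's version may do more, such as collapsing interior blanks):

```r
# Hypothetical stand-in for the author's kill.blanks helper (not from the slides)
kill.blanks <- function(s) {
  # remove whitespace at the start and at the end of each string
  gsub("^[[:space:]]+|[[:space:]]+$", "", s)
}

kill.blanks(c("  Nelson Kiplagat   ", " 13997"))
```

With a definition like this in place, as.numeric(kill.blanks(first)) turns a padded field such as " 13997" into the number 13997.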

Parsing the Net and Gun Time

Translating the hh:mm:ss format into minutes. There are some functions that can help, e.g., strptime.

to.minutes = function(set){
  res = rep(0, length(set))
  for (k in seq_along(set)) {   # seq_along is safe when set is empty
    s = strsplit(set[[k]], ":")[[1]]
    s = as.numeric(s)
    if (length(s)==3 )
      res[k] = s[1]*60 + s[2] + s[3]/60   # hh:mm:ss
    else
      res[k] = s[1] + s[2]/60             # mm:ss
  }
  return(res)
}
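
The same conversion can be written without an explicit loop. A sketch, assuming every entry is either mm:ss or h:mm:ss; the name to_minutes_vec is made up here and is not part of the original code:

```r
# Loop-free variant of to.minutes (hypothetical name)
to_minutes_vec <- function(set) {
  parts <- strsplit(set, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    if (length(p) == 3) {
      p[1] * 60 + p[2] + p[3] / 60   # h:mm:ss
    } else {
      p[1] + p[2] / 60               # mm:ss
    }
  }, numeric(1))
}

to_minutes_vec(c("48:12", "2:21:35"))
```

For the two times shown in the sample results, this gives 48.2 minutes and about 141.58 minutes.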

Assembling all the data I

read.all.data = function(){
  r2008 = read2008()
  r2007 = read2007()
  # ... and so on ...
  r2000 = read2000()
  r1999 = read1999()

  res = rbind( r2008, r2007, r2006, r2005,
               r2004, r2003, r2002, r2001, r2000, r1999)

  res$name = kill.blanks(tolower(res$name))
  res$year = c(rep(2008, nrow(r2008)),
               rep(2007, nrow(r2007)),
               # ... and so on ...
               rep(2000, nrow(r2000)),
               rep(1999, nrow(r1999))

Assembling all the data II

  )
  res = subset(res, age > 10 & ! is.na(age)) # eliminate those without age data
  return(res)
}
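
One way to avoid keeping the rbind order and the rep(year, ...) calls in step by hand is to tag each year's data frame with its year before stacking. A sketch, using two tiny made-up data frames as stand-ins for the output of the per-year readers:

```r
# Made-up stand-ins for what read2008(), read2007(), ... would return
r2008 <- data.frame(name = c("A B", "C D"), age = c(30, 41), stringsAsFactors = FALSE)
r2007 <- data.frame(name = "A B", age = 29, stringsAsFactors = FALSE)

by_year <- list("2008" = r2008, "2007" = r2007)

# Add a year column to each piece, then stack the pieces
tagged <- Map(function(d, y) { d$year <- as.numeric(y); d },
              by_year, names(by_year))
res <- do.call(rbind, tagged)
rownames(res) <- NULL
res
```

Each row carries its own year, so adding or reordering years cannot misalign the year column.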

Using Name as a Unique Identifier I

The person’s name is almost unique, but some names show up more than 10 times, so there must be duplicates.
I show you this to demonstrate the power of being able to chain computations: use the output of one computation as the input of another.

> foo = read.all.data()
> nrow(foo)
[1] 82365
> names(foo)
[1] "position" "division" "total" "name" "age"
[6] "place" "net" "gun" "sex" "year"
> length(unique(foo$name))
[1] 53517

A lot of runners! But how many times does each person run?

Using Name as a Unique Identifier II

> table(foo$name)
    a. gudu memon a. renee callahan       a.j. montes
                1                 2                 1
     aaren pastor     aaron ahlburn    aaron aldridge
                2                 1                 1
     aaron alford       aaron alton      aaron ansell
                1                 4                 1
and so on for 53,517 different names.

Clearly there are people who ran multiple times, but this display is not so useful because it is too long. BUT ... the output of table can be the input to another computation. In this case, table → table does something very sensible.

Using Name as a Unique Identifier III

> table(table(foo$name))

    1     2     3     4     5
38466  8580  3123  1518   803

    6     7     8     9    10
  441   282   162    87    41

   11    12    13    14    22
    8     2     2     1     1
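
A toy illustration of the table → table chain, using a made-up vector of names: the inner table counts appearances per name, and the outer table counts how many names share each appearance count.

```r
runners <- c("ann", "bob", "ann", "cy", "bob", "ann", "dee")

per_name <- table(runners)    # appearances per name: ann 3, bob 2, cy 1, dee 1
per_name

counts <- table(per_name)     # how many names appear once, twice, three times
counts
```

Here two names appear once, one appears twice, and one appears three times, which is the same kind of summary as the output above.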

Creating a Unique ID I

To help separate different people with the same name, we need additional information. Hometown is a possibility, but people may move.
Try year of birth:

> foo$yob = foo$year - foo$age
> foo$id = paste( foo$name, foo$yob )
> head(foo$id)
[1] "lineth chepkurui 1988" "angelina mutuku 1983"
[3] "lidia simon 1974" "catherine ndereba 1973"
[5] "sharon cherop 1984" "aziza aliyu 1986"
> table(table(foo$id))

    1     2     3     4     5     6     7     8     9    10
41117  8457  2954  1384   750   401   247   143    73    25

Creating a Unique ID II

This is plausible, but not proven as a unique ID.
One possible check: There shouldn’t be anyone with the same Name-YOB who ran twice in any one year:

> foo$idWithYear = paste( foo$id, foo$year )
> table(table(foo$idWithYear))

    1     2
82245    60

But there are! Does hometown help?

> foo$idHometown = paste(foo$id, foo$place)
> table(table( foo$idHometown ) )

    1     2     3     4     5     6     7     8
54083  8023  1754   698   401   214    91    32
> foo$idHometownWithYear = paste(foo$idHometown, foo$year)

Creating a Unique ID III

> table(table( foo$idHometownWithYear ) )

    1     2
82307    29

Adding hometown cuts the within-year duplicates from 60 to 29 but does not eliminate them.

Extracting the Multiple Runners I

Count how many times each person runs and put this in a table

> nruns = aggregate(foo$yob, by=list(who = foo$id), length)
> head(nruns)

                      who x
...
7      a. gudu memon 1965 1
8  a. renee callahan 1966 2
9        a.j. montes 1964 1
10      aaren pastor 1991 2
11     aaron ahlburn 1976 1
... and so on.

Do a “join” (a relational database operation) with the original table to add a new variable for each case: how many times that person ran.

> goo = merge(foo, nruns, by.x="id", by.y="who")
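
The count-then-join step can also be done by indexing a table of counts with each row's id. A sketch on made-up data; the column name nruns is an assumption, not the name used in the slides (where the count ends up in a column called x):

```r
toy <- data.frame(id  = c("a 1970", "b 1980", "a 1970", "a 1970"),
                  net = c(80, 95, 82, 85),
                  stringsAsFactors = FALSE)

counts <- table(toy$id)
# look up each row's id in the table of counts
toy$nruns <- as.vector(counts[toy$id])
toy
```

Unlike merge(), which sorts on the key by default, this leaves the rows in their original order.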

Extracting the Multiple Runners II

Finally, look at the subset of runners who have run five times or more:

> five = subset(goo, x>=5)
> nrow(five)
[1] 9936
> table(table(five$id))

  5   6   7   8   9  10
750 401 247 143  73  25

Are a lot of these the duplicate names?

> table(table(five$idWithYear))

   1    2
9920    8

Extracting the Multiple Runners III

I’m not going to worry about these few, but I could exclude them by using the same approach used to count how many times each runner participated.
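
A sketch of what that exclusion might look like, on made-up data: flag every id that appears twice in some year, then drop all of that id's rows. The column names mirror those used in the slides; the data are invented, and this is one possible approach rather than the author's.

```r
toy <- data.frame(
  id   = c("pat lee 1970", "pat lee 1970", "pat lee 1970",
           "sam roe 1980", "sam roe 1980"),
  year = c(2004, 2004, 2005, 2004, 2005),
  stringsAsFactors = FALSE
)
toy$idWithYear <- paste(toy$id, toy$year)

# ids that occur more than once in a single year are ambiguous
bad_ids <- unique(toy$id[duplicated(toy$idWithYear)])

# keep only the runners whose id is never duplicated within a year
clean <- subset(toy, !(id %in% bad_ids))
clean
```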

The Statistical Analysis I

> m3 = lm( net ~ age, data=five )

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  75.6471     0.7001  108.05   0.0000
age           0.2436     0.0155   15.76   0.0000

> m4 = lm( net ~ age + sex, data=five )

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  78.9320     0.6516  121.13   0.0000
age           0.3472     0.0145   23.89   0.0000
sexM        -11.8040     0.3243  -36.40   0.0000

The Statistical Analysis II

Allowing each runner to serve as his or her own control.

> m5 = lm( net ~ age + id, data=five )

                     Estimate Std. Error t value Pr(>|t|)
(Intercept)             59.50       2.69   22.12     0.00
age                      0.83       0.04   19.99     0.00
idabigail grier 1983    18.33       3.81    4.82     0.00
idabiy zewde 1967       13.59       3.40    4.00     0.00
idadam anthony 1966     -8.19       3.81   -2.15     0.03
idadam knapp 1977       22.91       4.15    5.52     0.00
... and so on.

The longitudinal analysis shows that runners slow down much faster than indicated by the cross-sectional analysis: about 0.83 minutes per year of age in m5, versus 0.24 in m3.
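
For the age coefficient, including id as a factor is equivalent to subtracting each runner's own mean from both net and age and regressing the centered values on each other. A sketch on simulated data (all numbers are invented for illustration) showing that the two slopes agree:

```r
set.seed(2)
n_id <- 50; n_rep <- 5
sim <- data.frame(id = factor(rep(seq_len(n_id), each = n_rep)))
# each simulated runner starts at some age and is observed in 5 consecutive years
sim$age <- rep(sample(25:60, n_id, replace = TRUE), each = n_rep) +
           rep(0:(n_rep - 1), times = n_id)
runner_effect <- rnorm(n_id, sd = 12)[as.integer(sim$id)]
sim$net <- 60 + 0.8 * sim$age + runner_effect + rnorm(nrow(sim), sd = 3)

# Dummy-variable fit, in the style of m5
b_dummy <- unname(coef(lm(net ~ age + id, data = sim))["age"])

# Same slope from runner-centered data
net_c <- sim$net - ave(sim$net, sim$id)
age_c <- sim$age - ave(sim$age, sim$id)
b_within <- unname(coef(lm(net_c ~ age_c))["age_c"])

c(dummy = b_dummy, within = b_within)
```

This is the sense in which each runner serves as his or her own control: only changes within a runner contribute to the age slope.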

A Mixed-Effects Model? I

I don’t know much about mixed effects models, but the fact that we’re not interested in the coefficients for individual runners suggests that we should be treating them as random effects.

> library(lme4)
> m6 = lmer( net ~ 1 + age + (1|id), five)

A Mixed-Effects Model? II

> summary(m6)
Linear mixed model fit by REML
   AIC   BIC logLik deviance REMLdev
 52918 52945 -26455    52904   52910
Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 167.489  12.9417
 Residual              35.007   5.9167
Number of obs: 7480, groups: id, 1639

Fixed effects:
            Estimate Std. Error t value
(Intercept) 67.21559    1.14307    58.8
age          0.44137    0.02508    17.6

This age coefficient (0.44) is quite different from the 0.83 estimated by m5.

Common Sense: A Separate Model for Each Runner I

> myfun = function(x){lm(net~age,data=x)$coef[2]}
> ages = group( five, five$id, myfun)
> head(ages)

                 group    result
1     aaron glahe 1974 3.1261905
2   abigail grier 1983 2.4666667
3      abiy zewde 1967 1.6940476
and so on

What’s the mean slope?

Common Sense: A Separate Model for Each Runner II

> t.test(ages$result)

        One Sample t-test

data:  ages$result
t = 11.1364, df = 1638, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.6078309 0.8677138
sample estimates:
mean of x
0.7377724
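
The group() function used for the per-runner fits is not part of base R; it is presumably a helper supplied with the course materials. A base-R sketch of the same per-runner computation using split() and sapply(), run here on simulated data with invented ids and slopes:

```r
set.seed(3)
toy <- data.frame(id  = rep(c("a 1970", "b 1980", "c 1965"), each = 5),
                  age = c(30:34, 20:24, 35:39),
                  stringsAsFactors = FALSE)
# each simulated runner has a different true slope (0.5, 1.0, 1.5)
toy$net <- 70 + rep(c(0.5, 1.0, 1.5), each = 5) * toy$age + rnorm(15, sd = 0.5)

# fit net ~ age separately for each runner and keep the age slope
slope_fun <- function(d) unname(coef(lm(net ~ age, data = d))[2])
slopes <- sapply(split(toy, toy$id), slope_fun)

ages <- data.frame(group = names(slopes), result = unname(slopes),
                   stringsAsFactors = FALSE)
ages
mean(ages$result)   # mean slope across runners
```

The resulting data frame has the same group and result columns as the output of group() shown earlier, so t.test(ages$result) applies unchanged.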

Slowing and Aging I

Going a bit further: See how runners slow as they age.
We need to pull out the YOB for each runner and model the ageing slope versus YOB.

> yob = group(five$yob, five$id, mean, resname='yob')
> noo = merge(yob,ages)
> mm = lm( result ~ yob, data=noo)
> summary(mm)

             Estimate Std. Error t value Pr(>|t|)
(Intercept) 95.760335  12.076539   7.929 4.04e-15 ***
yob         -0.048496   0.006163  -7.868 6.47e-15 ***
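
Plugging years of birth into the fitted line gives the predicted slowing rate for each cohort. A worked example using the two coefficients reported in the summary; the function name predicted_slope is made up for illustration:

```r
b0 <- 95.760335    # intercept from summary(mm)
b1 <- -0.048496    # yob coefficient from summary(mm)

# predicted ageing slope (minutes per year) for a given year of birth
predicted_slope <- function(yob) b0 + b1 * yob

predicted_slope(c(1940, 1960, 1980))
```

Under this fit, a runner born in 1940 is predicted to slow by about 1.68 minutes per year and one born in 1960 by about 0.71, while the prediction for a runner born in 1980 is slightly negative (about -0.26), i.e. still getting faster.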

Slowing and Aging II

[Figure: Minutes per Year (vertical axis, −1 to 4) plotted against Year of Birth (horizontal axis, 1920 to 1980).]

Slowing and Aging III

[Figure: Minutes per Year (vertical axis, −1 to 4) plotted against Year of Birth (horizontal axis, 1920 to 1980), in separate panels for F and M.]