
STATISTICAL RETHINKING HOMEWORK, WEEK 6 SOLUTIONS

1. She sells seashells in the Seychelles

(a) The first thing you need to do is make the new data frame, missing Seychelles. Here’s the code to do so:

library(rethinking)
data(rugged)
d <- rugged
dd <- d[ complete.cases(d$rgdppc_2000) , ]
d2 <- dd[ dd$country!="Seychelles" , ]

So now d2 contains all the complete cases, excepting Seychelles. Fitting the interaction model is straightforward, despite how ugly the code looks:

m3 <- mle2( log(rgdppc_2000) ~ dnorm(
        mean=a + aA*cont_africa + br*rugged + bAr*cont_africa*rugged ,
        sd=sigma ) ,
    data=d2 ,
    start=list(a=mean(log(d2$rgdppc_2000)),aA=0,br=0,bAr=0,
        sigma=sd(log(d2$rgdppc_2000))) )

To ease comparison, I’ll also refit the model using all of the data in dd, which is the fit from the book and lecture. Call this model m3all. Then to compare the estimates:
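The code for m3all isn’t printed here; a minimal sketch, assuming it is the same interaction model with the same start values, just fit to all of dd:

m3all <- mle2( log(rgdppc_2000) ~ dnorm(
        mean=a + aA*cont_africa + br*rugged + bAr*cont_africa*rugged ,
        sd=sigma ) ,
    data=dd ,
    start=list(a=mean(log(dd$rgdppc_2000)),aA=0,br=0,bAr=0,
        sigma=sd(log(dd$rgdppc_2000))) )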

coeftab(m3,m3all)

            m3    m3all
a       9.2233   9.2232
aA     -1.8825  -1.9480
br     -0.2028  -0.2029
bAr     0.2976   0.3934
sigma   0.9253   0.9327
nobs  169.0000 170.0000


So the estimates are very similar, but the magnitude of the interaction parameter, bAr, has gone down after removing Seychelles. Let’s take a look at the standard error and confidence interval:

precis(m3)

        Estimate Std. Error        2.5%       97.5%
a      9.2232699 0.13691265  8.95492599  9.49161373
aA    -1.8825287 0.22539074 -2.32428639 -1.44077094
br    -0.2028380 0.07587515 -0.35155058 -0.05412544
bAr    0.2976138 0.13831228  0.02652667  0.56870085
sigma  0.9253428 0.05033012  0.82669754  1.02398799

Still reliably above zero, so it seems like removing Seychelles didn’t kill the interaction entirely, even though it did diminish its strength. Computing the expected slope in Africa:

βr + βAr(1) ≈ −0.2 + 0.3 = 0.1.

And for the original fit, including Seychelles, we got:

βr + βAr(1) ≈ −0.2 + 0.39 = 0.19.

So the expected slope in Africa has been cut approximately in half, compared to the model that used all of the data.
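If you prefer to pull these slopes straight from the fits rather than rounding by hand, a quick sketch (assuming coef() returns the named MLEs from the mle2 fits):

# expected slope for African nations is br + bAr
coef(m3)["br"] + coef(m3)["bAr"]        # about 0.1, without Seychelles
coef(m3all)["br"] + coef(m3all)["bAr"]  # about 0.19, with Seychelles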

(b) I’m going to plot the predicted relationship for African nations, with a blue regression line for the old fit also on the same plot. Here are the calculations you are accustomed to by now:

post <- sample.naive.posterior( m3 )
rugged.seq <- seq(from=0,to=8,by=0.1)
mu <- sapply( rugged.seq , function(z)
    mean( post$a + post$aA*1 + post$br*z + post$bAr*1*z ) )
mu.ci <- sapply( rugged.seq , function(z)
    PCI( post$a + post$aA*1 + post$br*z + post$bAr*1*z ) )

And here is the code to plot it all:

dplot <- dd[dd$cont_africa==1,]
plot( log(rgdppc_2000) ~ rugged , data=dplot , col="slateblue" ,
    pch=ifelse(dplot$country=="Seychelles",16,1) ,
    ylab="log GDP per capita" , xlab="ruggedness" )
abline( a=9.22-1.95 , b=0.19 , col="slateblue" )
lines( rugged.seq , mu )
lines( rugged.seq , mu.ci[1,] , lty=2 )
lines( rugged.seq , mu.ci[2,] , lty=2 )

And here’s the plot that results:

[Figure: predicted relationship for African nations; x-axis ruggedness, y-axis log GDP per capita.]

Seychelles is shown by the filled point. Just like the estimates suggested, the slope has been cut approximately in half. Notice too that the uncertainty around the line has grown, compared to the plots in lecture and the book chapter. The predicted relationship outside of Africa has not changed: it is still reliably negative.
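That last claim is easy to check from the same samples; a minimal sketch, reusing post and rugged.seq from above and setting cont_africa to zero so the aA and bAr terms drop out:

# predictions for non-African nations (cont_africa = 0)
mu0 <- sapply( rugged.seq , function(z)
    mean( post$a + post$br*z ) )
mu0.ci <- sapply( rugged.seq , function(z)
    PCI( post$a + post$br*z ) )
# the slope outside Africa is just br, and its interval stays below zero
PCI( post$br )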

(c) To fit the other two models now:

m1 <- mle2( log(rgdppc_2000) ~ dnorm(
        mean=a + br*rugged , sd=sigma ) ,
    data=d2 ,
    start=list(a=mean(log(d2$rgdppc_2000)),br=0,
        sigma=sd(log(d2$rgdppc_2000))) )
m2 <- mle2( log(rgdppc_2000) ~ dnorm(
        mean=a + aA*cont_africa + br*rugged , sd=sigma ) ,
    data=d2 ,
    start=list(a=mean(log(d2$rgdppc_2000)),aA=0,br=0,
        sigma=sd(log(d2$rgdppc_2000))) )

And here is the comparison, using both AICc and BIC:

compare(m1,m2,m3,nobs=nrow(d2))

   k     AICc      BIC w.AICc     w.BIC     dAICc       dBIC
m3 5 463.7520 479.0334 0.7726    0.4304  0.000000  0.5601115
m2 4 466.1976 478.4732 0.2274    0.5696  2.445592  0.0000000
m1 3 536.5347 545.7790 <2e-16 1.381e-15 72.782754 67.3057115


So AICc likes the interaction model more than BIC does, but both metrics assign a lot of weight to it.
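In case the w.AICc column looks mysterious, the weights are just exponentiated, rescaled AICc differences; a small sketch, plugging in the dAICc values from the table above:

# Akaike weights: exp(-dAICc/2), normalized to sum to one
dAICc <- c( m3=0 , m2=2.445592 , m1=72.782754 )
w <- exp( -dAICc/2 )
round( w / sum(w) , 4 )   # roughly 0.77, 0.23, and essentially zero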

To generate the AICc-averaged predictions:

p <- sample.naive.posterior( list(m1,m2,m3) , n=20000 ,
    nobs=nrow(d2) )
Ai <- 1
r.seq <- seq(from=0,to=8,by=0.1)
mu <- sapply( r.seq , function(z)
    mean( p$a + p$br*z + p$aA*Ai + p$bAr*z*Ai ) )
mu.ci <- sapply( r.seq , function(z)
    PCI( p$a + p$br*z + p$aA*Ai + p$bAr*z*Ai ) )
dPlot <- d[ d$cont_africa==Ai , ]
plot( log(rgdppc_2000) ~ rugged , data=dPlot ,
    pch=ifelse(dPlot$country=="Seychelles",16,1) ,
    col="slateblue" )
lines( r.seq , mu )
lines( r.seq , mu.ci[1,] , lty=2 )
lines( r.seq , mu.ci[2,] , lty=2 )

You can change the value of Ai (the Ai <- 1 line above) to zero to make a similar plot for nations outside of Africa.

Here are the plots, showing the model-averaged predictions for the data omitting Seychelles:

[Figure: two panels of model-averaged predictions, log(rgdppc_2000) against rugged, one for Africa and one for non-Africa.]

What are we to make of these results? I think the model comparison analysis, whether you use AICc or BIC, suggests that the relationship between ruggedness and GDP is different inside and outside of Africa. However, there is substantial uncertainty about the magnitude of the difference. The most probable difference is shown in the plots above:


about flat within Africa and about −0.2 outside of it. But the data are consistent with both larger and smaller differences.

2. Language diversity and the environment

(a) I’m going to fit three models: (1) a model predicting log-lang-per-capita with only a constant, (2) a model with only mean growing season as a predictor, and (3) a model that includes log(area) as a covariate. Then I’ll compare them, using AICc/BIC.

m0 <- lm( log(lang.per.cap) ~ 1 , data=d )
m1 <- lm( log(lang.per.cap) ~ mean.growing.season , data=d )
m2 <- lm( log(lang.per.cap) ~ log(area) + mean.growing.season ,
    data=d )
compare(m0,m1,m2,nobs=nrow(d))

   k     AICc      BIC  w.AICc   w.BIC    dAICc     dBIC
m1 3 267.1315 273.7009 0.50966 0.72091 0.000000 0.000000
m2 4 267.2458 275.8824 0.48135 0.24220 0.114271 2.181483
m0 2 275.2068 279.6460 0.00899 0.03689 8.075303 5.945081

This is very strong support for using mean growing season as a predictor of language diversity. Controlling for log(area) has little effect, as can be seen by comparing the estimates from models m1 and m2:

coeftab(m1,m2)

                         m1      m2
(Intercept)         -6.6816 -3.8595
mean.growing.season  0.1740  0.1438
log(area)                NA -0.2018
nobs                74.0000 74.0000

You can check the standard errors, too, to ensure that the estimate for mean.growing.season is reliably positive in both cases.
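A minimal sketch of that check, using precis the same way it is used elsewhere in these solutions (it reports the estimate, standard error, and 95% interval for each coefficient):

precis(m1)   # interval for mean.growing.season should sit entirely above zero
precis(m2)   # same check, now with log(area) in the model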

Plotting the predicted relationship, averaging across models:

p <- sample.naive.posterior( list(m0,m1,m2) , n=40000 ,
    nobs=nrow(d) )
colnames(p) <- c( "a" , "bg" , "ba" )
mu.area <- mean(log(d$area))
x.seq <- seq(from=0,to=12,by=0.25)
mu <- sapply( x.seq , function(z)
    mean( p$a + p$bg*z + p$ba*mu.area ) )
mu.ci <- sapply( x.seq , function(z)
    PCI( p$a + p$bg*z + p$ba*mu.area ) )
plot( log(lang.per.cap) ~ mean.growing.season , data=d ,
    col="slateblue" )
lines( x.seq , mu )
lines( x.seq , mu.ci[1,] , lty=2 )
lines( x.seq , mu.ci[2,] , lty=2 )
mtext( paste("log(area) =",round(mu.area,2)) , 3 )

And here’s the plot:

[Figure: log(lang.per.cap) against mean.growing.season with the model-averaged mean and interval; log(area) = 12.93.]

Hm, that relationship doesn’t actually look very linear. Most of the error is above the line, at long growing season values on the right side. In any event, there does seem to be some positive relationship between the two variables, as predicted.

(b) Fitting three analogous models and comparing them:

m0 <- lm( log(lang.per.cap) ~ 1 , data=d )
m1 <- lm( log(lang.per.cap) ~ sd.growing.season , data=d )
m2 <- lm( log(lang.per.cap) ~ log(area) + sd.growing.season ,
    data=d )
compare(m0,m1,m2,nobs=nrow(d))

   k     AICc      BIC w.AICc  w.BIC      dAICc      dBIC
m2 4 272.3905 281.0270 0.4534 0.1756 0.00000000 1.9876670
m1 3 272.4700 279.0394 0.4357 0.4743 0.07954508 0.0000000
m0 2 275.2068 279.6460 0.1109 0.3502 2.81634229 0.6065752

A very similar story, although not as strong a relationship. Take a look at the standard error and confidence interval from m2:


precis(m2)

                    Estimate Std. Error       2.5%      97.5%
(Intercept)       -2.0009096  1.9184850 -5.7610710 1.75925185
log(area)         -0.2396723  0.1595109 -0.5523079 0.07296334
sd.growing.season -0.2092513  0.1904060 -0.5824401 0.16393760

Including log(area) generates considerable uncertainty about the direction of the effect of sd.growing.season. Still, the MLE is rather negative, as predicted.

Plotting the model-averaged relationship:
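Here’s a minimal sketch of that plot, mirroring the mean.growing.season code above; the column labels a, bs, ba are labels I’m assigning, and the x-axis range is read off the figure rather than taken from the original code:

p <- sample.naive.posterior( list(m0,m1,m2) , n=40000 ,
    nobs=nrow(d) )
colnames(p) <- c( "a" , "bs" , "ba" )
mu.area <- mean(log(d$area))
x.seq <- seq(from=0,to=6,by=0.1)
mu <- sapply( x.seq , function(z)
    mean( p$a + p$bs*z + p$ba*mu.area ) )
mu.ci <- sapply( x.seq , function(z)
    PCI( p$a + p$bs*z + p$ba*mu.area ) )
plot( log(lang.per.cap) ~ sd.growing.season , data=d ,
    col="slateblue" )
lines( x.seq , mu )
lines( x.seq , mu.ci[1,] , lty=2 )
lines( x.seq , mu.ci[2,] , lty=2 )
mtext( paste("log(area) =",round(mu.area,2)) , 3 )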

[Figure: log(lang.per.cap) against sd.growing.season with the model-averaged mean and interval; log(area) = 12.93.]

So the relationship isn’t likely flat or positive, but many negative slopes are consistent with the data (and model structure).

(c) My inclination here is to fit a bunch of models and let AICc sort them out. I do this in order to account for overfitting risk. But you’d get essentially the same answer here, just by fitting the interaction model alone.

Here’s my model set:

m0 <- lm( log(lang.per.cap) ~ 1 , data=d )
m1 <- lm( log(lang.per.cap) ~ mean.growing.season , data=d )
m2 <- lm( log(lang.per.cap) ~ log(area) + mean.growing.season ,
    data=d )
m3 <- lm( log(lang.per.cap) ~ sd.growing.season , data=d )
m4 <- lm( log(lang.per.cap) ~ log(area) + sd.growing.season ,
    data=d )
m5 <- lm( log(lang.per.cap) ~
    mean.growing.season + sd.growing.season , data=d )
m6 <- lm( log(lang.per.cap) ~
    log(area) + mean.growing.season + sd.growing.season , data=d )
m7 <- lm( log(lang.per.cap) ~
    mean.growing.season * sd.growing.season , data=d )
m8 <- lm( log(lang.per.cap) ~
    log(area) + mean.growing.season * sd.growing.season , data=d )

The model m0 is just the null model with no predictors. The next block of models, m1 and m2, fits mean growing season with and without the log(area) covariate. The block with m3 and m4 does the same for standard deviation of growing season. Then the block with m5 and m6 evaluates an additive combination of mean and standard deviation of growing season. Finally, the block with m7 and m8 looks at the interaction.

Let’s look at the comparison statistics:

compare(m0,m1,m2,m3,m4,m5,m6,m7,m8,nobs=nrow(d))

   k     AICc      BIC    w.AICc    w.BIC     dAICc      dBIC
m7 5 260.4708 271.1088 0.5897177 0.444887  0.000000  0.000000
m8 6 262.8410 275.4117 0.1802831 0.051747  2.370230  4.302917
m5 4 263.3332 271.9697 0.1409580 0.289267  2.862364  0.860942
m6 5 265.6321 276.2701 0.0446565 0.033689  5.161290  5.161290
m1 3 267.1315 273.7009 0.0211002 0.121726  6.660723  2.592088
m2 4 267.2458 275.8824 0.0199284 0.040896  6.774994  4.773571
m4 4 272.3905 281.0270 0.0015217 0.003123 11.919683  9.918261
m3 3 272.4700 279.0394 0.0014623 0.008436 11.999228  7.930594
m0 2 275.2068 279.6460 0.0003722 0.006229 14.736026  8.537169

Models that include both mean and standard deviation of growing season do the best. Both AICc and BIC prefer the interaction model, especially without log(area) as a covariate.

Okay, so let’s inspect the estimates themselves:

precis(m7)

                     Estimate Std. Error       2.5%      97.5%
(Intercept)        -6.9739181 0.60820348 -8.1659751 -5.7818612
mean.growing.season 0.2996624 0.07411048  0.1544085  0.4449163
sd.growing.season   0.4222498 0.38289686 -0.3282143  1.1727138
mean:sd            -0.1088649 0.04839463 -0.2037167 -0.0140132


Now keep in mind that we haven’t centered the predictor variables here, so interpreting the main effects is hazardous. But the interaction is clearly negative, as predicted: mean and variance synergistically lead to fewer languages.
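If you do want interpretable main effects, one option (a sketch, not something done in the original analysis) is to center the predictors and refit; the interaction estimate is unchanged in a linear model, but the main effects then describe average conditions:

# center predictors so main effects refer to average conditions
d$mgs.c <- d$mean.growing.season - mean(d$mean.growing.season)
d$sgs.c <- d$sd.growing.season - mean(d$sd.growing.season)
m7c <- lm( log(lang.per.cap) ~ mgs.c * sgs.c , data=d )
precis(m7c)   # interaction matches m7; main effects are easier to read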

Let’s see what the predictions look like. I think you know how to plot these things by now, so I’ll just show you the interaction in a panel of six plots. The top row will show the relationship between log-lang-per-capita and mean.growing.season, across values of sd.growing.season. The bottom row will show log-lang-per-capita on sd.growing.season, across values of mean.growing.season. This is just to show the two-way interaction from both perspectives. I’m also going to use transparency, as a function of distance from the value at the top of each plot, to show how the data change through the third, unplotted, dimension.
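For reference, here is a rough sketch of how one top-row panel could be drawn; the sd value is taken from the middle panel below, while the transparency rule and the positional indexing of the posterior columns (intercept, mean, sd, interaction) are my assumptions, not the original plotting code:

# one panel of the top row: predictions at a fixed sd.growing.season value
post <- sample.naive.posterior( m7 )
sd.val <- 1.69   # middle panel's value; change for the other panels
m.seq <- seq(from=0,to=12,by=0.25)
mu <- sapply( m.seq , function(z)
    mean( post[,1] + post[,2]*z + post[,3]*sd.val + post[,4]*z*sd.val ) )
mu.ci <- sapply( m.seq , function(z)
    PCI( post[,1] + post[,2]*z + post[,3]*sd.val + post[,4]*z*sd.val ) )
# fade points by distance from sd.val in the unplotted dimension
a.pt <- 1 - 0.8*abs(d$sd.growing.season - sd.val) /
    max(abs(d$sd.growing.season - sd.val))
rgbv <- col2rgb("slateblue")/255
ptcols <- rgb( rgbv[1] , rgbv[2] , rgbv[3] , alpha=a.pt )
plot( log(lang.per.cap) ~ mean.growing.season , data=d ,
    col=ptcols , pch=16 )
lines( m.seq , mu )
lines( m.seq , mu.ci[1,] , lty=2 )
lines( m.seq , mu.ci[2,] , lty=2 )
mtext( paste("sd.growing.season =",sd.val) , 3 )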

[Figure: six panels. Top row: log(lang.per.cap) against mean.growing.season at sd.growing.season = 0.53, 1.69, and 3.54. Bottom row: log(lang.per.cap) against sd.growing.season at mean.growing.season = 2.45, 7.36, and 10.66.]

So the models suggest that mean growing season increases language diversity, unless the variance in growing season is also high (top row). Simultaneously, variance in growing season decreases language diversity, unless the mean growing season is very short (bottom row).