SC968 Panel data methods for sociologists Lecture 1, part 1


Page 1: SC968 Panel data methods for sociologists Lecture 1, part 1

SC968 Panel data methods for sociologists
Lecture 1, part 1

A review of concepts for regression modelling
Or things you should know already

Page 2: SC968 Panel data methods for sociologists Lecture 1, part 1

Overview

• Models
  - OLS, logit and probit
  - Mathematically and practically
• Interpretation of results, measures of fit and regression diagnostics
• Model specification
• Post-estimation commands
• STATA competence

Page 3: SC968 Panel data methods for sociologists Lecture 1, part 1

Ordinary Least Squares (OLS)

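\[ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_K x_{iK} + \varepsilon_i \]

• \(y_i\): value of dependent variable for individual i (LHS variable)
• \(x_{i1}\): value of explanatory variable 1 for person i
• \(\beta_1\): coefficient on variable 1
• \(\varepsilon_i\): residual (disturbance, error term)
• K: total no. of explanatory variables (RHS variables or regressors)
• Intercept (constant): when included, the coefficient on a regressor that equals 1 for every individual

Examples

y_i = mental health
x_1 = sex
x_2 = age
x_3 = marital status
x_4 = employment status
x_5 = physical health

y_i = hourly pay
x_1 = sex
x_2 = age
x_3 = education
x_4 = job tenure
x_5 = industry
x_6 = region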

Page 4: SC968 Panel data methods for sociologists Lecture 1, part 1

OLS

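\[ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_K x_{iK} + \varepsilon_i \]

In vector form

\[ y_i = x_i'\beta + \varepsilon_i \]

\[ y_i = \begin{pmatrix} x_{i1} & x_{i2} & x_{i3} & \cdots & x_{iK} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \vdots \\ \beta_K \end{pmatrix} + \varepsilon_i \]

• \(x_i\): vector of explanatory variables
• \(\beta\): vector of coefficients

In matrix form

\[ y = X\beta + \varepsilon \]

\[
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ \vdots \\ y_N \end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1K} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2K} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3K} \\
x_{41} & x_{42} & x_{43} & \cdots & x_{4K} \\
x_{51} & x_{52} & x_{53} & \cdots & x_{5K} \\
\vdots & \vdots & \vdots & & \vdots \\
x_{N1} & x_{N2} & x_{N3} & \cdots & x_{NK}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \vdots \\ \beta_K \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \vdots \\ \varepsilon_N \end{pmatrix}
\]

Note: you will often see x’β written as xβ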

Page 5: SC968 Panel data methods for sociologists Lecture 1, part 1

OLS

\[ \min_{\beta} \sum_i \varepsilon_i^2 \]

• Also called “linear regression”
• Assumes the dependent variable is a linear combination of the explanatory variables, plus a disturbance
• “Least squares”: β’s estimated so as to minimise the sum of the squared ε’s.
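To make the least-squares idea concrete, here is a minimal Python sketch (not the course's Stata workflow) for the one-regressor case. The data values and function names are made up for illustration. It computes the intercept and slope from the usual closed-form expressions and checks that nudging the slope away from the estimate raises the sum of squared residuals.

```python
def ols_simple(x, y):
    """Closed-form OLS for y = a + b*x + e with a single regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """The quantity that least squares minimises."""
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = ols_simple(x, y)
ssr_best = sum_squared_residuals(x, y, a, b)
ssr_other = sum_squared_residuals(x, y, a, b + 0.1)  # any other slope fits worse
print(a, b, ssr_best, ssr_other)
```

With K regressors the same minimisation is solved in matrix form; the one-regressor case is shown only because it fits in a few lines.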

Page 6: SC968 Panel data methods for sociologists Lecture 1, part 1

Basic Assumptions

• Residuals have zero mean: \(E(\varepsilon_i \mid X_i) = 0\), \(E(\varepsilon_i) = 0\)
• Follows that ε’s and X’s are uncorrelated: \(E(X_i \varepsilon_i) = 0\)
  - Violated if a regressor is endogenous
  - E.g., number of children in female labour supply models
  - Cure by (e.g.) Instrumental Variables
• Homoscedasticity: all ε’s have same variance: \(\mathrm{Var}(\varepsilon_i) = \sigma^2\)
  - Classic example of a violation: food consumption and income
  - Cure by using weighted least squares
• Nonautocorrelation: ε’s uncorrelated with each other: \(E(\varepsilon_i \varepsilon_j) = 0\) for i ≠ j
  - Violated in data sets where the same individual appears multiple times
  - Adjust standard errors: ‘cluster’ option in STATA
• Disturbances are iid (normally distributed, zero mean, constant variance)

Page 7: SC968 Panel data methods for sociologists Lecture 1, part 1

When is OLS appropriate?

• When you have a continuous dependent variable
  - E.g., you would use it to estimate regressions for height, but not for whether a person has a university degree.
• When the assumptions are not obviously violated
• As a first step in research to get ball-park estimates
  - We will use them a lot for this purpose

Worked examples
• Coefficients, P-values, t-statistics
• Measures of fit (R-squared, adjusted R-squared)
• Thinking about specification
• Post-estimation commands
• Regression diagnostics

A note on the data
• All examples (in lectures and practicals) are drawn from a 20% sample of the British Household Panel Survey (BHPS); more about the data later!

Page 8: SC968 Panel data methods for sociologists Lecture 1, part 1

Summarize monthly earned income

. sum incm if age >= 17 & age <= 64, d

                            incm
-------------------------------------------------------------
      Percentiles      Smallest
 1%           43              1
 5%          156           1.25
10%     268.6667              2       Obs               16696
25%     615.3333       2.416667       Sum of Wgt.       16696

50%     1073.088                      Mean           1282.831
                        Largest       Std. Dev.      1008.308
75%         1690       9207.083
90%     2471.848       9333.333       Variance        1016685
95%     3061.355          10000       Skewness        2.19295
99%     5003.849          10000       Kurtosis       11.94321

Page 9: SC968 Panel data methods for sociologists Lecture 1, part 1

First worked example

Monthly labour income, for people whose labour income is >= £1

For illustrative purposes only. Not an example of good practice.

. do "C:\DOCUME~1\maria\LOCALS~1\Temp\STD03000000.tmp"

. reg incm female age age2 partner ed_sec ed_deg mth_int if age >= 17 & age <= 64

      Source |       SS       df       MS              Number of obs =   16458
-------------+------------------------------           F(  7, 16450) =  957.92
       Model |  4.8145e+09     7   687785597           Prob > F      =  0.0000
    Residual |  1.1811e+10 16450  718000.667           R-squared     =  0.2896
-------------+------------------------------           Adj R-squared =  0.2893
       Total |  1.6626e+10 16457   1010245.5           Root MSE      =  847.35

------------------------------------------------------------------------------
        incm |      Coef.  Std. Err.        t   P>|t|   [95% Conf.   Interval]
-------------+----------------------------------------------------------------
      female |  -594.9641   13.26812   -44.84   0.000    -620.9711   -568.9571
         age |   101.0994   3.859657    26.19   0.000     93.53401    108.6647
        age2 |  -1.155281   .0479992   -24.07   0.000    -1.249364   -1.061197
     partner |   155.7992   16.62703     9.37   0.000     123.2085      188.39
      ed_sec |   380.5032   14.36582    26.49   0.000     352.3446    408.6618
      ed_deg |   1076.674   20.54526    52.40   0.000     1036.403    1116.945
     mth_int |  -5.059072   4.036446    -1.25   0.210    -12.97094      2.8528
       _cons |   -819.931   78.80064   -10.41   0.000    -974.3888   -665.4732
------------------------------------------------------------------------------

Reading the output:
• Source / SS / df / MS block: analysis of variance (ANOVA) table; MS = SS/df
• F and Prob > F: tests whether all coeffs except constant are jointly zero
• R-squared = Model SS / Total SS
• Root MSE = sqrt(Residual MS)
• T-stat = coefficient / standard error
• 95% Conf. Interval: coefficient + or - 1.96 standard errors
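The relationships annotated on this slide can be checked by hand against the reported numbers. The following Python sketch (an illustration outside Stata; variable names are my own) recomputes R-squared, Root MSE, and the t-statistic and approximate confidence interval for the female coefficient. Inputs are the rounded figures printed in the output, so results agree only up to rounding, and the interval uses 1.96 rather than the exact t critical value.

```python
# Figures copied from the regression output above (rounded as printed).
model_ss = 4.8145e9
total_ss = 1.6626e10
residual_ms = 718000.667
coef_female = -594.9641
se_female = 13.26812

r_squared = model_ss / total_ss          # R-squared = Model SS / Total SS
root_mse = residual_ms ** 0.5            # Root MSE = sqrt(Residual MS)
t_stat = coef_female / se_female         # t = coefficient / standard error
ci_low = coef_female - 1.96 * se_female  # approximate 95% interval
ci_high = coef_female + 1.96 * se_female

print(round(r_squared, 4), round(root_mse, 2), round(t_stat, 2))
print(round(ci_low, 2), round(ci_high, 2))
```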

Page 10: SC968 Panel data methods for sociologists Lecture 1, part 1

What do the results tell us?

• All coefficients except month of interview are significant
• 29% of variation explained
• Women earn nearly £600 per month less than men, other things equal
• Income goes up with age and then down
• 16458 observations... oops, this is from panel data, so there are repeated observations on individuals.


Page 11: SC968 Panel data methods for sociologists Lecture 1, part 1

Add ,cluster(pid) as an option

. reg incm female age age2 partner ed_sec ed_deg mth_int if age >= 17 & age <= 64, cluster(pid)

Linear regression                                      Number of obs =   16458
                                                       F(  7,  2465) =  135.26
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2896
                                                       Root MSE      =  847.35

                                 (Std. Err. adjusted for 2466 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.  Std. Err.        t   P>|t|   [95% Conf.   Interval]
-------------+----------------------------------------------------------------
      female |  -594.9641   31.81172   -18.70   0.000    -657.3445   -532.5836
         age |   101.0994   7.323088    13.81   0.000     86.73932    115.4594
        age2 |  -1.155281   .0933813   -12.37   0.000    -1.338395   -.9721666
     partner |   155.7992   30.87227     5.05   0.000     95.26099    216.3375
      ed_sec |   380.5032   30.36746    12.53   0.000     320.9549    440.0516
      ed_deg |   1076.674   64.45131    16.71   0.000     950.2898    1203.058
     mth_int |  -5.059072   4.126102    -1.23   0.220    -13.15006    3.031912
       _cons |   -819.931   132.8455    -6.17   0.000    -1080.431   -559.4306
------------------------------------------------------------------------------

• Coefficients, R-squared etc. are unchanged from previous specification
• But standard errors are adjusted: standard errors are larger, t-statistics are lower

Page 12: SC968 Panel data methods for sociologists Lecture 1, part 1

Let’s get rid of the “month” variable

. reg incm female age age2 partner ed_sec ed_deg if age >= 17 & age <= 64, cluster(pid)

Linear regression                                      Number of obs =   16460
                                                       F(  6,  2466) =  156.78
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2895
                                                       Root MSE      =  847.33

                                 (Std. Err. adjusted for 2467 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.  Std. Err.        t   P>|t|   [95% Conf.   Interval]
-------------+----------------------------------------------------------------
      female |  -594.8596   31.80682   -18.70   0.000    -657.2304   -532.4887
         age |   100.9827   7.325995    13.78   0.000       86.617    115.3485
        age2 |  -1.153834   .0934155   -12.35   0.000    -1.337015   -.9706534
     partner |   155.5618   30.87778     5.04   0.000     95.01275    216.1109
      ed_sec |   381.0247   30.36183    12.55   0.000     321.4874     440.562
      ed_deg |   1076.837   64.44019    16.71   0.000     950.4745    1203.199
       _cons |  -866.2836   125.9787    -6.88   0.000    -1113.319   -619.2486
------------------------------------------------------------------------------

Think about the female coefficient a bit more. Could it be to do with women working shorter hours?

Page 13: SC968 Panel data methods for sociologists Lecture 1, part 1

Control for weekly hours of work

. reg incm female age age2 partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)

Linear regression                                      Number of obs =   13998
                                                       F(  7,  2262) =  247.67
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4580
                                                       Root MSE      =  690.95

                                 (Std. Err. adjusted for 2263 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.  Std. Err.        t   P>|t|   [95% Conf.   Interval]
-------------+----------------------------------------------------------------
      female |  -314.6874   34.32954    -9.17   0.000    -382.0081   -247.3667
         age |   79.55289   6.372918    12.48   0.000     67.05551    92.05027
        age2 |   -.873335   .0817518   -10.68   0.000    -1.033651   -.7130186
     partner |   148.0265   26.07885     5.68   0.000     96.88551    199.1675
      ed_sec |     340.68   26.67171    12.77   0.000     288.3764    392.9835
      ed_deg |   996.7434   59.88369    16.64   0.000     879.3107    1114.176
        hrsm |   5.654682   .2467777    22.91   0.000     5.170747    6.138616
       _cons |  -1495.805   111.8223   -13.38   0.000     -1715.09    -1276.52
------------------------------------------------------------------------------

Is the coefficient on hours of work reasonable? £5.65 for every additional hour worked: certainly in the right ball park.

Page 14: SC968 Panel data methods for sociologists Lecture 1, part 1

Looking at 2 specifications together

                    Without hours              With hours (hrsm)
                  Coef.   Robust SE           Coef.   Robust SE
female        -594.8596    31.80682       -314.6874    34.32954
age            100.9827    7.325995        79.55289    6.372918
age2          -1.153834    .0934155        -.873335    .0817518
partner        155.5618    30.87778        148.0265    26.07885
ed_sec         381.0247    30.36183          340.68    26.67171
ed_deg         1076.837    64.44019        996.7434    59.88369
hrsm                                       5.654682    .2467777
_cons         -866.2836    125.9787       -1495.805    111.8223

Number of obs     16460                       13998
F             F(6, 2466) = 156.78         F(7, 2262) = 247.67
R-squared        0.2895                      0.4580
Root MSE         847.33                      690.95

• R-squared jumps from 29% to 46%
• Coefficient on female goes from -595 to -315
  - Almost half the effect of gender is explained by women’s shorter hours of work
• Age, partner and education coefficients are also reduced in magnitude, for similar reasons
• Number of observations reduces from 16460 to 13998: missing data on hours

Page 15: SC968 Panel data methods for sociologists Lecture 1, part 1

Interesting post-estimation activities

(Both use the regression that controls for hours of work.)

Is the effect of university qualifications statistically different from the effect of secondary education?

. test ed_sec = ed_deg

 ( 1)  ed_sec - ed_deg = 0

       F(  1,  2262) =  110.75
            Prob > F =    0.0000

What age does income peak?

\[ \text{Income} = \text{constant} + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{age}^2 \]

\[ \frac{d(\text{Income})}{d(\text{age})} = \beta_1 + 2\beta_2 \cdot \text{age} \]

Derivative = zero when

\[ \text{age} = -\frac{\beta_1}{2\beta_2} = \frac{-79.552}{2 \times (-0.873)} \approx 45.5 \]
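The turning-point arithmetic is easy to reproduce. A small Python sketch follows (an illustration, not part of the Stata session), using the age and age2 coefficients from the regression that controls for hours; the helper name `age_profile` is hypothetical.

```python
# Coefficients on age and age2 from the regression controlling for hours.
beta_age = 79.55289
beta_age2 = -0.873335

# Setting d(Income)/d(age) = beta_age + 2*beta_age2*age to zero:
peak_age = -beta_age / (2 * beta_age2)


def age_profile(age):
    """Age-related part of predicted income (other regressors held fixed)."""
    return beta_age * age + beta_age2 * age ** 2


print(round(peak_age, 1))
# Because beta_age2 is negative, the profile is lower on either side of the peak.
print(age_profile(peak_age) > age_profile(peak_age - 5))
print(age_profile(peak_age) > age_profile(peak_age + 5))
```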

Page 16: SC968 Panel data methods for sociologists Lecture 1, part 1

A closer look at “partner” coefficient

Page 17: SC968 Panel data methods for sociologists Lecture 1, part 1

. bysort female: reg incm age age2 partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)

-> female = 0

Linear regression                                      Number of obs =    6776
                                                       F(  6,  1095) =  115.53
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3452
                                                       Root MSE      =  787.93

                                 (Std. Err. adjusted for 1096 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.  Std. Err.        t   P>|t|   [95% Conf.   Interval]
-------------+----------------------------------------------------------------
         age |   113.3119   10.56356    10.73   0.000      92.5848     134.039
        age2 |  -1.257366   .1316253    -9.55   0.000    -1.515633      -.9991
     partner |    213.351    46.9817     4.54   0.000     121.1667    305.5354
      ed_sec |   356.7472   41.91151     8.51   0.000     274.5113    438.9832
      ed_deg |   1082.255   89.21241    12.13   0.000     907.2087    1257.302
        hrsm |   3.930412   .3784925    10.38   0.000      3.18776    4.673065
       _cons |  -1907.107   175.5681   -10.86   0.000    -2251.595    -1562.62
------------------------------------------------------------------------------

-> female = 1

Linear regression                                      Number of obs =    7222
                                                       F(  6,  1166) =  125.30
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4830
                                                       Root MSE      =  564.65

                                 (Std. Err. adjusted for 1167 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        incm |      Coef.  Std. Err.        t   P>|t|   [95% Conf.   Interval]
-------------+----------------------------------------------------------------
         age |   56.20989   7.327688     7.67   0.000     41.83296    70.58682
        age2 |  -.6229372   .0961605    -6.48   0.000    -.8116041   -.4342702
     partner |   84.15365   29.27677     2.87   0.004      26.7126    141.5947
      ed_sec |   277.2823   31.66175     8.76   0.000     215.1619    339.4026
      ed_deg |   819.3002   73.74637    11.11   0.000     674.6098    963.9906
        hrsm |   6.806946   .3051015    22.31   0.000     6.208337    7.405556
       _cons |  -1382.844   133.0607   -10.39   0.000    -1643.909   -1121.779
------------------------------------------------------------------------------

Men who are part of a couple earn much more than men who are not; women less so.

Other coefficients also differ between men and women, but with the current specification, we can’t test whether the differences are significant.

Page 18: SC968 Panel data methods for sociologists Lecture 1, part 1

Logit and Probit

• Developed for discrete (categorical) dependent variables
  - E.g., psychological morbidity, whether one has a job... Think of other examples.
• Outcome variable is always 0 or 1. Estimate:

\[ \Pr(Y = 1) = F(X, \beta) \]
\[ \Pr(Y = 0) = 1 - F(X, \beta) \]

• OLS (linear probability model) would set F(X,β) = X’β + ε
• Inappropriate because:
  - Heteroscedasticity: the outcome variable is always 0 or 1, so ε only takes the value -x’β or 1-x’β
  - More seriously, one cannot constrain estimated probabilities to lie between 0 and 1.

Page 19: SC968 Panel data methods for sociologists Lecture 1, part 1

Logit and Probit

• Solution: we need a link function that will transform our dichotomous Y into a continuous form Y’
• Looking for a function which lies between 0 and 1:
  - Cumulative normal distribution: Probit model
    • Z scores assuming the cumulative normal distribution Φ

\[ \Pr(Y = 1) = \int_{-\infty}^{x'\beta} \phi(t)\,dt = \Phi(x'\beta) \]

  - Logistic distribution: Logit (logistic) model
    • Logged odds of probability

\[ \Pr(Y = 1) = \frac{e^{x'\beta}}{1 + e^{x'\beta}} = \Lambda(x'\beta) \]

• They are very similar! Note how they lie between 0 and 1 (vertical axis)
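The two link functions can be evaluated directly. The sketch below is plain Python rather than Stata, and the grid of index values is arbitrary. It codes the logistic CDF and the standard normal CDF (via the error function) and evaluates both at the same values of the linear index x'β, to show that each maps any real number into the interval between 0 and 1.

```python
import math


def logistic_cdf(z):
    """Logit link: Pr(Y=1) = exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))


def normal_cdf(z):
    """Probit link: cumulative standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


grid = [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]  # arbitrary values of x'beta
logit_probs = [logistic_cdf(z) for z in grid]
probit_probs = [normal_cdf(z) for z in grid]

for z, pl, pp in zip(grid, logit_probs, probit_probs):
    print(f"x'b = {z:5.1f}   logit = {pl:.4f}   probit = {pp:.4f}")
```

At the same index value the normal CDF is steeper than the logistic one, which is why estimated logit coefficients come out larger in magnitude than probit coefficients for the same data.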

Page 20: SC968 Panel data methods for sociologists Lecture 1, part 1

Maximum likelihood estimation

• Likelihood function: product of
  - Pr(y=1) = F(x’β) for all observations where y=1
  - Pr(y=0) = 1 - F(x’β) for all observations where y=0
  - (think of the probability of getting exactly four heads and two tails in six coin tosses)
• Log likelihood written as

\[ \ln L = \sum_{j \in S} \ln F(x_j'\beta) + \sum_{j \notin S} \ln\left[1 - F(x_j'\beta)\right] \]

  where S is the set of observations with y = 1.

• Estimated using an iterative procedure
  - STATA chooses starting values for β’s
  - Computes slopes of likelihood function at these values
  - Adjusts β’s accordingly
  - Stops when slope of LF is ≈ 0
  - Can take time!
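The iterative procedure can be sketched for the simplest possible case: a logit with only a constant, fitted to four ones and two zeros (the coin-toss analogy above). This is a Python illustration of Newton-Raphson updating, not a description of Stata's actual optimiser; the function names and the starting value of 0 are assumptions made for the example.

```python
import math


def log_likelihood(y, beta):
    """ln L for a constant-only logit: sum over y=1 of ln p, over y=0 of ln(1-p)."""
    p = 1.0 / (1.0 + math.exp(-beta))
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p) for yi in y)


def fit_constant_only_logit(y, tol=1e-10, max_iter=25):
    """Newton-Raphson: move beta until the slope of ln L is (almost) zero."""
    beta = 0.0                              # starting value (an assumption)
    n, ones = len(y), sum(y)
    for iteration in range(max_iter):
        p = 1.0 / (1.0 + math.exp(-beta))
        slope = ones - n * p                # first derivative of ln L
        if abs(slope) < tol:
            break
        curvature = -n * p * (1.0 - p)      # second derivative of ln L
        beta -= slope / curvature           # adjust beta accordingly
    return beta, iteration


y = [1, 1, 1, 1, 0, 0]                      # four "heads", two "tails"
beta_hat, n_iter = fit_constant_only_logit(y)
p_hat = 1.0 / (1.0 + math.exp(-beta_hat))
print(beta_hat, p_hat, n_iter)
```

The fitted probability equals the sample share of ones (2/3), so the estimate is ln(2); with covariates the same logic applies, but the slope and curvature become a vector and a matrix.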

Page 21: SC968 Panel data methods for sociologists Lecture 1, part 1

Let’s look at whether a person works

. tab jbstat, m

        current |
       economic |
       activity |      Freq.     Percent        Cum.
----------------+-----------------------------------
missing or wild |         13        0.03        0.03
             -7 |         66        0.18        0.21
   not answered |          2        0.01        0.22
  self-employed |      2,204        5.87        6.08
       employed |     14,702       39.15       45.24
     unemployed |      1,120        2.98       48.22
        retired |      4,726       12.59       60.80
maternity leave |        320        0.85       61.66
    family care |      1,964        5.23       66.89
ft studt, school|      1,394        3.71       70.60
lt sick, disabld|      1,057        2.81       73.41
gvt trng scheme |         67        0.18       73.59
          other |        124        0.33       73.92
              . |      9,793       26.08      100.00
----------------+-----------------------------------
          Total |     37,552      100.00

gen byte work = (jbstat == 1 | jbstat == 2) if jbstat >= 1 & jbstat != .

Page 22: SC968 Panel data methods for sociologists Lecture 1, part 1

Logit regression: whether have a job

. logit work female age age2 badhealth partner ed_sec ed_deg nkids if age >= 22 & age <= 60, cluster(pid)

Iteration 0:   log pseudolikelihood = -9174.0313
Iteration 1:   log pseudolikelihood = -7909.9067
Iteration 2:   log pseudolikelihood = -7838.4288
Iteration 3:   log pseudolikelihood = -7838.2372
Iteration 4:   log pseudolikelihood = -7838.2372

Logistic regression                               Number of obs   =      17268
                                                  Wald chi2(8)    =     613.59
                                                  Prob > chi2     =     0.0000
Log pseudolikelihood = -7838.2372                 Pseudo R2       =     0.1456

                                 (Std. Err. adjusted for 2430 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        work |      Coef.  Std. Err.        z   P>|z|   [95% Conf.   Interval]
-------------+----------------------------------------------------------------
      female |  -.8001156    .090802    -8.81   0.000    -.9780842    -.622147
         age |   .3578242   .0282831    12.65   0.000     .3023904    .4132581
        age2 |  -.0046546   .0003504   -13.28   0.000    -.0053414   -.0039679
   badhealth |  -.5213826   .0404068   -12.90   0.000    -.6005785   -.4421868
     partner |   .4681257   .0943383     4.96   0.000     .2832261    .6530253
      ed_sec |    .602653   .0900282     6.69   0.000     .4262009    .7791051
      ed_deg |   .8734892   .1468462     5.95   0.000      .585676    1.161302
       nkids |   -.477714   .0391116   -12.21   0.000    -.5543714   -.4010567
       _cons |  -3.666352   .5216913    -7.03   0.000    -4.688848   -2.643856
------------------------------------------------------------------------------

Reading the output:
• Iteration log: all the iterations
• Chi-squared statistic: in an unclustered logit this is LR chi2 = 2* (LL of this model - LL of null model); with cluster(), STATA reports a Wald chi2 instead
• Pseudo R2: measure of amount explained but less intuitive interpretation
• From these coefficients, can tell
  - Whether estimated effects are positive or negative
  - Whether they’re significant
  - Something about effect sizes, but difficult to draw inferences from coefficients
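The Pseudo R2 can be recovered from the iteration log. This Python sketch (illustrative only; variable names are mine) assumes the usual McFadden definition, 1 minus the ratio of the fitted log likelihood to the constant-only log likelihood shown at iteration 0, which reproduces the reported 0.1456. It also computes 2*(LL of this model - LL of null model) for comparison; note that this differs from the Wald chi2 of 613.59 that Stata prints for the clustered model.

```python
# Log pseudolikelihoods copied from the iteration log above.
ll_null = -9174.0313    # iteration 0: constant-only model
ll_model = -7838.2372   # final iteration

pseudo_r2 = 1.0 - ll_model / ll_null      # McFadden's pseudo R-squared
lr_statistic = 2.0 * (ll_model - ll_null)  # 2*(LL model - LL null)

print(round(pseudo_r2, 4))
print(round(lr_statistic, 2))
```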

Page 23: SC968 Panel data methods for sociologists Lecture 1, part 1

Comparing logit and probit

              Logit    Probit   Probit * 1.6
female       -0.800    -0.455      -0.728
age           0.358     0.206       0.330
age2         -0.005    -0.003      -0.004
badhealth    -0.521    -0.300      -0.479
partner       0.468     0.284       0.455
ed_sec        0.603     0.343       0.548
ed_deg        0.873     0.476       0.762
nkids        -0.478    -0.275      -0.441
_cons        -3.666    -2.112      -3.380

• Scaling factor proposed by Amemiya (1981)
  - Multiply Probit coefficients by 1.6 to get an approximation to Logit
  - Other authors have suggested a factor of 1.8

Page 24: SC968 Panel data methods for sociologists Lecture 1, part 1

Marginal effects

• After logit or probit estimation, use the margins command
  - Calculates marginal effects of each of the RHS variables on the dependent variable
  - Slope of the function for continuous variables
  - Effect of change from 0 to 1 in a dummy variable
  - Can also provide predicted probabilities, linear combinations, plots, and much more!

MEM: Marginal Effects at the Means
  margins, dydx(*) atmeans

AME: Average Marginal Effects
  margins, dydx(*)

MER: Marginal Effects at Representative Values
  margins, dydx(*) at(age=(20 30 40 50))
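For a continuous variable in a logit, the marginal effect at a given point is the coefficient multiplied by p(1-p), where p is the predicted probability there. The Python sketch below illustrates the difference between averaging the effects over observations (AME) and evaluating once at the mean (MEM). It uses the nkids coefficient from the logit output, but the linear-index values are made up, so the numbers are not the lecture's results and this is not how `margins` is implemented.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit_marginal_effect(beta_k, xb):
    """dPr(Y=1)/dx_k at linear index xb, for a continuous x_k in a logit."""
    p = logistic(xb)
    return beta_k * p * (1.0 - p)


beta_nkids = -0.477714                        # coefficient from the logit output
hypothetical_xb = [-1.0, 0.0, 0.5, 1.5, 2.0]  # made-up values of x'beta

# AME: compute the effect for every observation, then average.
effects = [logit_marginal_effect(beta_nkids, xb) for xb in hypothetical_xb]
ame = sum(effects) / len(effects)

# MEM: evaluate once at the means of the regressors, i.e. at the mean index.
mean_xb = sum(hypothetical_xb) / len(hypothetical_xb)
mem = logit_marginal_effect(beta_nkids, mean_xb)

print(round(ame, 4), round(mem, 4))
```

The effect is largest in magnitude where p = 0.5 (it equals the coefficient divided by 4) and shrinks towards zero as p approaches 0 or 1, which is why AME and MEM generally differ.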

Page 25: SC968 Panel data methods for sociologists Lecture 1, part 1

Marginal effects

             Logit    Probit     OLS
female*     -0.118    -0.122   -0.114
age          0.053     0.056    0.057
age2        -0.001    -0.001   -0.001
badhea~h    -0.078    -0.081   -0.086
partner*     0.076     0.082    0.075
ed_sec*      0.087     0.090    0.094
ed_deg*      0.106     0.109    0.118
nkids       -0.071    -0.075   -0.077
Constant                       -0.045

• Logit and Probit mfx are very similar indeed
• OLS is actually not too bad

Page 26: SC968 Panel data methods for sociologists Lecture 1, part 1

Odds ratios

• Only an option with logit
• Type “or” in, after the comma as an option
• Reports odds ratios: that is, how many times higher (or lower) the odds of the outcome become
  - if the variable is 1 rather than 0, in the case of a dichotomous variable
  - for each unit increase of the variable, for a continuous variable
• Results >1 show an increased probability, results <1 show a decrease

. logit work female age age2 badhealth partner ed_sec ed_deg nkids if age >= 22 & age <= 60, cluster(pid) or

Iteration 0:   log pseudolikelihood = -9174.0313
Iteration 1:   log pseudolikelihood = -7909.9067
Iteration 2:   log pseudolikelihood = -7838.4288
Iteration 3:   log pseudolikelihood = -7838.2372
Iteration 4:   log pseudolikelihood = -7838.2372

Logistic regression                               Number of obs   =      17268
                                                  Wald chi2(8)    =     613.59
                                                  Prob > chi2     =     0.0000
Log pseudolikelihood = -7838.2372                 Pseudo R2       =     0.1456

                                 (Std. Err. adjusted for 2430 clusters in pid)
------------------------------------------------------------------------------
             |               Robust
        work | Odds Ratio  Std. Err.        z   P>|z|   [95% Conf.   Interval]
-------------+----------------------------------------------------------------
      female |    .449277   .0407952    -8.81   0.000     .3760308    .5367907
         age |   1.430214   .0404509    12.65   0.000     1.353089    1.511735
        age2 |   .9953562   .0003488   -13.28   0.000     .9946728      .99604
   badhealth |   .5936991   .0239895   -12.90   0.000     .5484942    .6426296
     partner |   1.596998    .150658     4.96   0.000     1.327405    1.921345
      ed_sec |   1.826959   .1644779     6.69   0.000     1.531428    2.179521
      ed_deg |   2.395254   .3517339     5.95   0.000     1.796205    3.194091
       nkids |   .6201995    .024257   -12.21   0.000     .5744333    .6696121
------------------------------------------------------------------------------
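Each odds ratio is simply the exponential of the corresponding logit coefficient. The Python sketch below (illustrative, outside Stata; the dictionary names are mine) checks this against three of the reported values.

```python
import math

# Logit coefficients and the odds ratios reported for the same model.
logit_coefs = {"female": -0.8001156, "age": 0.3578242, "ed_deg": 0.8734892}
reported_odds_ratios = {"female": 0.449277, "age": 1.430214, "ed_deg": 2.395254}

computed = {name: math.exp(b) for name, b in logit_coefs.items()}

for name in logit_coefs:
    print(name, round(computed[name], 6), reported_odds_ratios[name])
```

A negative coefficient therefore always maps to an odds ratio below 1 and a positive coefficient to an odds ratio above 1; the z-statistics and p-values are the same in both displays.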

Page 27: SC968 Panel data methods for sociologists Lecture 1, part 1

Other post-estimation commands

• Likelihood ratio test “lrtest”
  - Adding an extra variable to the RHS always increases the likelihood (or at least never reduces it)
  - But, does it add “enough” to the likelihood?
  - LR test compares the restricted and unrestricted likelihoods, L0/L1 (Lrestricted/Lunrestricted), and calculates the chi-squared stat -2 ln(L0/L1), with d.f. equal to the number of variables you are dropping.
  - Null hypothesis: restricted specification.
  - Only works on nested models, i.e., where the RHS variables in one model are a subset of the RHS variables in the other.
• How to do it
  - Run the full model
  - Type “estimates store NAME”
  - Run a smaller model
  - Type “estimates store ANOTHERNAME”
  - ... and so on for as many models as you like
  - Type “lrtest NAME ANOTHERNAME”
• Be careful...
  - Sample sizes must be the same for both models
  - Won’t happen if the dropped variable is missing for some observations
  - Solve problem by running the biggest model first and using e(sample)

Page 28: SC968 Panel data methods for sociologists Lecture 1, part 1

LR test - example

• Similar but not identical regression to previous examples
• Add regional variables, decide which ones to keep
• Looks as though Scotland might stay, also possibly SW, NW, N

. do "C:\DOCUME~1\maria\LOCALS~1\Temp\STD03000000.tmp"

. logit work age age2 badhealth partner ed_sec ed_deg nkids r_* if age >= 21 & age <= 60 & wave == 15

Iteration 0:   log likelihood = -548.06325
Iteration 1:   log likelihood = -480.90757
Iteration 2:   log likelihood = -477.04783
Iteration 3:   log likelihood = -477.02974
Iteration 4:   log likelihood = -477.02974

Logistic regression                               Number of obs   =       1066
                                                  LR chi2(14)     =     142.07
                                                  Prob > chi2     =     0.0000
Log likelihood = -477.02974                       Pseudo R2       =     0.1296

------------------------------------------------------------------------------
        work |      Coef.  Std. Err.        z   P>|z|   [95% Conf.   Interval]
-------------+----------------------------------------------------------------
         age |   .3093037   .0599295     5.16   0.000     .1918441    .4267633
        age2 |  -.0039101    .000729    -5.36   0.000    -.0053389   -.0024814
   badhealth |  -.5337105   .0893498    -5.97   0.000     -.708833    -.358588
     partner |    .244436   .1984392     1.23   0.218    -.1444977    .6333698
      ed_sec |   .7737744   .1749884     4.42   0.000     .4308035    1.116745
      ed_deg |   1.356818   .2734631     4.96   0.000     .8208405    1.892796
       nkids |  -.3589658   .0813107    -4.41   0.000    -.5183318   -.1995998
       r_lon |  -.5363941   .3247594    -1.65   0.099    -1.172911    .1001226
       r_mid |  -.3796683   .2807851    -1.35   0.176     -.929997    .1706604
        r_sw |  -.7379424     .34484    -2.14   0.032    -1.413816   -.0620684
        r_nw |  -.6369382   .3140179    -2.03   0.043    -1.252402   -.0214744
       r_nth |  -.6270993   .2940544    -2.13   0.033    -1.203435   -.0507633
       r_wls |  -.4251579   .3862621    -1.10   0.271    -1.182218    .3319019
       r_sco |  -1.183256   .3413128    -3.47   0.001    -1.852216   -.5142949
       _cons |  -3.042685   1.133771    -2.68   0.007    -5.264836   -.8205342
------------------------------------------------------------------------------

. estimates store ALL

. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids if e(sample)

. estimates store DROPREG

. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids r_sco r_sw r_nw r_nth if e(sample)

. estimates store KEEP4

. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids r_sco if e(sample)

. estimates store KEEPSCOT

Page 29: SC968 Panel data methods for sociologists Lecture 1, part 1

LR test - example

. lrtest ALL DROPREG

Likelihood-ratio test                                 LR chi2(7)  =     14.19
(Assumption: DROPREG nested in ALL)                   Prob > chi2 =    0.0479
REJECT nested specification

. lrtest ALL KEEP4

Likelihood-ratio test                                 LR chi2(3)  =      3.34
(Assumption: KEEP4 nested in ALL)                     Prob > chi2 =    0.3422
DON’T REJECT nested spec

. lrtest ALL KEEPSCOT

Likelihood-ratio test                                 LR chi2(6)  =      7.60
(Assumption: KEEPSCOT nested in ALL)                  Prob > chi2 =    0.2689
DON’T REJECT nested spec

. lrtest KEEP4 KEEPSCOT

Likelihood-ratio test                                 LR chi2(3)  =      4.26
(Assumption: KEEPSCOT nested in KEEP4)                Prob > chi2 =    0.2347
DON’T REJECT nested spec

. lrtest KEEPSCOT DROPREG

Likelihood-ratio test                                 LR chi2(1)  =      6.59
(Assumption: DROPREG nested in KEEPSCOT)              Prob > chi2 =    0.0102
REJECT nested specification

• Reject dropping all regional variables against keeping full set
• Don’t reject dropping all but 4, over keeping full set
• Don’t reject dropping all but Scotland, over keeping full set
• Don’t reject dropping all but Scotland, over dropping all but 4
• [and just to check: DO reject dropping all regional variables against dropping all but Scotland]
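The p-values that lrtest reports are upper-tail chi-squared probabilities of the LR statistic. As a cross-check outside Stata, this Python sketch implements the chi-squared survival function for integer degrees of freedom using standard closed-form series (the function name `chi2_sf` is my own) and applies it to the five reported statistics. Because the statistics are rounded to two decimals, agreement is only approximate.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability Pr(chi2_df > x) for a positive integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus half-integer power terms.
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total


# (LR chi2, d.f., reported Prob > chi2) from the lrtest output above.
tests = [
    (14.19, 7, 0.0479),  # ALL vs DROPREG
    (3.34, 3, 0.3422),   # ALL vs KEEP4
    (7.60, 6, 0.2689),   # ALL vs KEEPSCOT
    (4.26, 3, 0.2347),   # KEEP4 vs KEEPSCOT
    (6.59, 1, 0.0102),   # KEEPSCOT vs DROPREG
]

for stat, df, reported in tests:
    p = chi2_sf(stat, df)
    verdict = "reject" if p < 0.05 else "don't reject"
    print(f"chi2({df}) = {stat:5.2f}  p = {p:.4f}  (reported {reported})  {verdict}")
```

The 0.05 cut-off used for the verdict is the conventional level, chosen here for illustration; the slides do not state which level they apply.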

Page 30: SC968 Panel data methods for sociologists Lecture 1, part 1

Again, specification is illustrative only

• This is not an example of a “finished” labour supply model!
• How could one improve the model?

Model specification
• Theoretical considerations
• Empirical considerations
• Parsimony
• Stepwise regression techniques

Regression diagnostics

Interpreting results
• Spotting “unreasonable” results

Page 31: SC968 Panel data methods for sociologists Lecture 1, part 1

Other models

Other models to be aware of, but not covered on this course:

• Extensions to logit and probit
  - Ordered models (ologit, oprobit) for ordered outcomes
    • Levels of education
    • Number of children
    • Excellent, good, fair or poor health
  - Multinomial models (mlogit, mprobit) for multiple outcomes with no obvious ordering
    • Working in public, private or voluntary sector
    • Choice of nursery, childminder or playgroup for pre-school care
• Heckman selection model
  - For modelling two-stage procedures
    • Earnings, conditional on having a job at all
    • Having a job is modelled as a probit, earnings are modelled as OLS
    • Used particularly for women’s earnings
• Tobit model for censored or truncated data
  - Typically, for data where there are lots of zeros
    • Expenditure on rarely-purchased items, e.g. cars
    • Children’s weights, in an experiment where the scales broke and gave a minimum reading of 10kg

Page 32: SC968 Panel data methods for sociologists Lecture 1, part 1

Competence in STATA

• Best results in this course if you already know how to use STATA competently.
• Check you know how to
  - Get data into STATA (use and using commands)
  - Manipulate data (merge, append, rename, drop, save)
  - Describe your data (describe, tabulate, table)
  - Create new variables (gen, egen)
  - Work with subsets of data (if, in, by)
  - Do basic regressions (regress, logit, probit)
  - Run sessions interactively and in batch mode
  - Organise your datasets and do-files so you can find them again.
• If you can’t do these, upgrade your knowledge ASAP!
  - Could enroll in STATA net course 101
  - Costs $110
  - ESRC might pay
  - Courses run regularly
  - www.stata.com

Page 33: SC968 Panel data methods for sociologists Lecture 1, part 1

SC968 Panel data methods for sociologists
Lecture 1, part 2

Introducing Longitudinal Data

Page 34: SC968 Panel data methods for sociologists Lecture 1, part 1

Overview

• Cross-sectional and longitudinal data
• Types of longitudinal data
• Types of analysis possible with panel data
• Data management: merging, appending, long and wide forms
• Simple models using longitudinal data

Page 35: SC968 Panel data methods for sociologists Lecture 1, part 1

Cross-sectional and longitudinal data

• First, draw the distinction between macro- and micro-level data
  - Micro level: firms, individuals
  - Macro level: local authorities, travel-to-work areas, countries, commodity prices
  - Both may exist in cross-sectional or longitudinal forms
  - We are interested in micro-level data
  - But macro-level variables are often used in conjunction with micro-data
• Cross-sectional data
  - Contains information collected at a given point in time
  - (More strictly, during a given time window)
    • European Social Survey (ESS)
    • Programme for International Student Assessment (PISA)
  - Many cross-sectional surveys are repeated, but on different individuals
• Longitudinal data
  - Contains repeated observations on the same subjects

Page 36: SC968 Panel data methods for sociologists Lecture 1, part 1

Types of longitudinal data

• Time-series data
  - E.g., commodity prices, exchange rates
• Repeated interviews at irregular intervals
  - UK cohort studies: NCDS (1958), BCS (1970), MCS (2000)
• Repeated interviews at regular intervals
  - “Panel” surveys
  - Usually annual intervals, sometimes two-yearly
  - BHPS, SLID, PSID, SOEP
• Some surveys have both cross-sectional and panel elements
  - Panels more expensive to collect
  - LFS, EU-SILC both have a “rolling panel” element
• Other sources of longitudinal data
  - Retrospective data (e.g. work or relationship history)
  - Linkage with external data (e.g., tax or benefit records), particularly in Scandinavia
  - May be present in both cross-sectional and longitudinal data sets

Page 37: SC968 Panel data methods for sociologists Lecture 1, part 1

Analysis with longitudinal data

• The “snapshot” versus the “movie”
• Essentially, longitudinal data allow us to observe how events evolve
• Study “flows” as well as “stocks”.

Example: unemployment
• Cross-sectional analysis shows steady 5% unemployment rate
• Does this mean that everyone is unemployed one year out of twenty?
• That 5% of people are unemployed all the time?
• Or something in between?
• Very different implications for equality, social policy, etc.
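The unemployment example can be made concrete with two artificial panels. This Python sketch is purely illustrative: the population of 100 people over 20 years is invented. It builds one panel in which everyone is unemployed for exactly one year in twenty and another in which the same five people are always unemployed. Every yearly cross-section shows a 5% rate in both, yet the share of people who are ever unemployed is completely different.

```python
N_PEOPLE = 100   # hypothetical population size
N_YEARS = 20     # hypothetical number of annual waves

# Scenario A: each person is unemployed in exactly one of the twenty years.
scenario_a = [[(person % N_YEARS) == year for year in range(N_YEARS)]
              for person in range(N_PEOPLE)]

# Scenario B: the same five people are unemployed in every year.
scenario_b = [[person < 5 for _ in range(N_YEARS)]
              for person in range(N_PEOPLE)]


def yearly_rates(panel):
    """The 'snapshot': unemployment rate in each year's cross-section."""
    return [sum(row[year] for row in panel) / len(panel)
            for year in range(N_YEARS)]


def share_ever_unemployed(panel):
    """The 'movie': share of people unemployed at least once."""
    return sum(any(row) for row in panel) / len(panel)


print(yearly_rates(scenario_a)[:3], yearly_rates(scenario_b)[:3])
print(share_ever_unemployed(scenario_a), share_ever_unemployed(scenario_b))
```

Only data that follow the same individuals over time can tell these two worlds apart.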

Page 38: SC968 Panel data methods for sociologists Lecture 1, part 1

The BHPS

• Interviews about 10,000 adults in about 6,000 households
• Interviews repeated annually
  - People followed when they move
  - People join the sample if they move in with a sample member
• Household-level information collected from “head of household”
• Individual-level information collected from people aged 17+
• Young people aged 11-16 fill in a youth questionnaire
• BHPS is now part of Understanding Society
  - Much larger and wider-ranging survey
  - 40,000 households
• Data set used for this course is a 20% sample of BHPS, with selected variables

Page 39: SC968 Panel data methods for sociologists Lecture 1, part 1

The BHPS

• All files prefixed with a letter indicating the year
• All variables within each file also prefixed with this letter
  - 1991: a
  - 1992: b
  - ... and so on
• Several files each year, containing different information
  - hhsamp: information on sample households
  - hhresp: household-level information on households that actually responded
  - indall: info on all individuals in responding households
  - indresp: info on respondents to main questionnaire (adults)
  - egoalt: file showing relationship of household members to one another
  - income: incomes
• Extra files each year containing derived variables:
  - Work histories, net income files
• And others with occasional modules, e.g. life histories in wave 2
  - bjobhist, blifemst, bmarriag, bcohabit, bchildnt
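The naming convention lends itself to a small helper. The Python sketch below is hypothetical (the function names are mine, and it assumes the prefix letters simply continue alphabetically from 1991 = a, as the "and so on" suggests). It builds a file name from a survey year and a record type.

```python
def wave_letter(year, first_year=1991):
    """Prefix letter for a survey year, assuming 1991 = a, 1992 = b, and so on."""
    return chr(ord("a") + (year - first_year))


def bhps_filename(year, record_type):
    """File name for a record type (e.g. indresp, hhresp) in a given year."""
    return f"{wave_letter(year)}{record_type}.dta"


print(bhps_filename(1991, "indresp"))
print(bhps_filename(1992, "hhresp"))
print(bhps_filename(1994, "youth"))
```

The same prefix applies to variable names within each file, so a loop over years can strip or add the letter when stacking waves into long form.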

Page 40: SC968 Panel data methods for sociologists Lecture 1, part 1

Some BHPS files

Wave 1 (a)
   768.1k  aindall.dta
    10.7M  aindresp.dta
  1626.3k  ahhresp.dta
   330.6k  ahhsamp.dta
  1066.4k  aincome.dta
   541.3k  aegoalt.dta
   303.8k  ajobhist.dta

Wave 2 (b)
   635.3k  bindsamp.dta      (following sample members)
   978.2k  bindall.dta
    11.0M  bindresp.dta
  1499.7k  bhhresp.dta
   257.1k  bhhsamp.dta
  1073.0k  bincome.dta
   546.5k  begoalt.dta
   237.8k  bjobhist.dta

Extra modules in Wave 2
    23.5k  bchildad.dta
   284.4k  bchildnt.dta
    34.3k  bcohabit.dta
   766.4k  blifemst.dta
   272.4k  bmarriag.dta

Wave 3 (c)
   624.3k  cindsamp.dta
   975.6k  cindall.dta
    11.0M  cindresp.dta
  1539.0k  chhresp.dta
   287.4k  chhsamp.dta
  1008.9k  cincome.dta
   542.2k  cegoalt.dta
   237.8k  cjobhist.dta
  1675.0k  clifejob.dta

Wave 4 (d)
   616.7k  dindsamp.dta
   943.7k  dindall.dta
    11.2M  dindresp.dta
  1508.9k  dhhresp.dta
   301.9k  dhhsamp.dta
  1019.7k  dincome.dta
   531.8k  degoalt.dta
   245.0k  djobhist.dta
   129.0k  dyouth.dta        (youth module introduced 1994)

Cross-wave identifiers
  4977.3k  xwaveid.dta
  1027.7k  xwlsten.dta

Page 41: SC968 Panel data methods for sociologists Lecture 1, part 1

Person and household identifiers

BHPS (along with other panels such as ECHP and SOEP) is a household survey – so everyone living in sample households becomes a member

Need identifiers to
1. Associate the same individual with him- or herself in different waves - the PID identifier
2. Link members of same household with each other in the same wave - the HID identifier

Note: no such thing as a longitudinal household!
Household composition changes, household location changes…
HID is a cross-sectional concept only!
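A minimal sketch of how the two identifiers are used, on a made-up toy data set. The wave-2 hid values are invented purely for illustration; the point is that pid stays fixed across waves while hid is only meaningful within a wave.

```stata
* Toy data: hid values for wave 2 are hypothetical
clear
input long pid long hid byte wave
19057 1604 1
19057 1711 2
51538 4301 1
51562 4301 1
51538 4420 2
51562 4420 2
end

* pid and wave together should uniquely identify a row in long-form data;
* isid stops with an error if they do not
isid pid wave

* link members of the same household within a wave: household size from hid
bysort wave hid: gen hhsize = _N

sort pid wave
list, sepby(pid)
```

Note that hid is never used on its own across waves here, only in combination with wave.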

Page 42: SC968 Panel data methods for sociologists Lecture 1, part 1

What it looks like: 4 waves of data, sorted by pid and wave.
Observations in rows, variables in columns. Blank lines (blue stripes on the slide) show where one individual ends & another begins

. list pid wave hgsex age jbstat mastat in 1/30, clean

            pid   wave    hgsex   age     jbstat     mastat
  1.   10019057      1   female    59    retired   never ma
  2.   10019057      2   female    60    retired   never ma
  3.   10019057      3   female    61    retired   never ma
  4.   10019057      4   female    62    retired   never ma

  5.   10028005      1     male    30   employed   never ma
  6.   10028005      2     male    31   employed   never ma
  7.   10028005      3     male    32   employed   never ma
  8.   10028005      4     male    33   employed   never ma

  9.   10042571      1     male    59   unemploy   never ma
 10.   10042571      3     male    60   lt sick,   never ma
 11.   10042571      4     male    62    retired   never ma

 12.   10051538      1   female    22   unemploy   never ma
 13.   10051538      2   female    23   family c   never ma
 14.   10051538      3   female    24   unemploy   never ma
 15.   10051538      4   female    25   family c   never ma

 16.   10051562      1   female     4          .          .
 17.   10051562      2   female     5          .          .
 18.   10051562      3   female     6          .          .
 19.   10051562      4   female     7          .          .

 20.   10059377      1   female    46   employed   never ma
 21.   10059377      2   female    47   employed   never ma
 22.   10059377      3   female    48   employed   never ma
 23.   10059377      4   female    49   self-emp   never ma

 24.   10064966      1     male    70    retired    widowed
 25.   10064966      2     male    70    retired    widowed
 26.   10064966      3     male    71    retired    widowed
 27.   10064966      4     male    72    retired    widowed

 28.   10076166      1   female    77    retired    widowed
 29.   10076166      2   female    78    retired    widowed
 30.   10076166      3   female    79    retired    widowed

pid 10042571: not present at 2nd wave
pid 10051562: a child, so no data on job or marital status
pid 10064966: surveyed twice in 70th year

Page 43: SC968 Panel data methods for sociologists Lecture 1, part 1

. list pid wave hgsex age jbstat mastat in 1/30, clean nol

            pid   wave   hgsex   age   jbstat   mastat
  1.   10019057      1       2    59        4        6
  2.   10019057      2       2    60        4        6
  3.   10019057      3       2    61        4        6
  4.   10019057      4       2    62        4        6
  5.   10028005      1       1    30        2        6
  6.   10028005      2       1    31        2        6
  7.   10028005      3       1    32        2        6
  8.   10028005      4       1    33        2        6
  9.   10042571      1       1    59        3        6
 10.   10042571      3       1    60        8        6
 11.   10042571      4       1    62        4        6
 12.   10051538      1       2    22        3        6
 13.   10051538      2       2    23        6        6
 14.   10051538      3       2    24        3        6
 15.   10051538      4       2    25        6        6
 16.   10051562      1       2     4        .        .
 17.   10051562      2       2     5        .        .
 18.   10051562      3       2     6        .        .
 19.   10051562      4       2     7        .        .
 20.   10059377      1       2    46        2        6
 21.   10059377      2       2    47        2        6
 22.   10059377      3       2    48        2        6
 23.   10059377      4       2    49        1        6
 24.   10064966      1       1    70        4        3
 25.   10064966      2       1    70        4        3
 26.   10064966      3       1    71        4        3
 27.   10064966      4       1    72        4        3
 28.   10076166      1       2    77        4        3
 29.   10076166      2       2    78        4        3
 30.   10076166      3       2    79        4        3

(Can also use the ,nol option, which lists the numeric codes behind the value labels)
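A small sketch of what the option does, using two toy rows. The label name sexlbl is an assumption made for this example; the codes 1 = male and 2 = female match the two listings above.

```stata
clear
input long pid byte hgsex
10019057 2
10028005 1
end

* attach value labels to the numeric codes (label name is hypothetical)
label define sexlbl 1 "male" 2 "female"
label values hgsex sexlbl

list, clean            // shows "female" and "male"
list, clean nolabel    // shows the underlying codes 2 and 1
```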

Page 44: SC968 Panel data methods for sociologists Lecture 1, part 1

Joining data sets together

            pid   wave    hgsex   age     jbstat     mastat     hlghq1     hlstat
  1.   10019057      1   female    59    retired   never ma          7   excellen
  2.   10019057      2   female    60    retired   never ma         12   excellen
  3.   10019057      3   female    61    retired   never ma         10   excellen
  4.   10019057      4   female    62    retired   never ma         11   excellen
  5.   10028005      1     male    30   employed   never ma          7   excellen
  6.   10028005      2     male    31   employed   never ma          8       fair
  7.   10028005      3     male    32   employed   never ma         12       fair
  8.   10028005      4     male    33   employed   never ma          7       good
  9.   10042571      1     male    59   unemploy   never ma         11       fair
 10.   10042571      3     male    60   lt sick,   never ma          7       good
 11.   10042571      4     male    62    retired   never ma          6       fair
 12.   10051538      1   female    22   unemploy   never ma         11       good
 13.   10051538      2   female    23   family c   never ma          6   excellen
 14.   10051538      3   female    24   unemploy   never ma          8   excellen
 15.   10051538      4   female    25   family c   never ma         10       good
 16.   10051562      1   female     4          .          .          .          .
 17.   10051562      2   female     5          .          .          .          .
 18.   10051562      3   female     6          .          .          .          .
 19.   10051562      4   female     7          .          .          .          .
 20.   10059377      1   female    46   employed   never ma         12       fair
 21.   10059377      2   female    47   employed   never ma         10       good
 22.   10059377      3   female    48   employed   never ma         14       fair
 23.   10059377      4   female    49   self-emp   never ma         17       poor
 24.   10064966      1     male    70    retired    widowed    missing       fair
 25.   10064966      2     male    70    retired    widowed    missing       good
 26.   10064966      3     male    71    retired    widowed         18       good
 27.   10064966      4     male    72    retired    widowed         17       poor
 28.   10076166      1   female    77    retired    widowed          6   excellen
 29.   10076166      2   female    78    retired    widowed          7   excellen
 30.   10076166      3   female    79    retired    widowed          7   excellen
 31.   10076166      4   female    79    retired    widowed   proxy re   excellen
 32.   10081763      1     male    71          .          .          .          .
 33.   10081763      2     male    72          .          .          .          .
 34.   10081763      3     male    73          .          .          .          .
 35.   10081763      4     male    74          .          .          .          .
 36.   10081798      1   female    72    retired    married          8       good
 37.   10081798      2   female    73    retired    married          5   excellen
 38.   10081798      3   female    74    retired    married          7   excellen
 39.   10081798      4   female    75    retired    married    missing   excellen
 40.   10091831      1     male    49          .          .          .          .
 41.   10091831      2     male    50          .          .          .          .
 42.   10091831      3     male    50          .          .          .          .
 43.   10091831      4     male    51          .          .          .          .
 44.   10091866      1   female    48   maternit    married         17       good
 45.   10091866      2   female    48   employed    married         11       good
 46.   10091866      3   female    49   employed    married   proxy re       good
 47.   10091866      4   female    50   family c    married    missing       good
 48.   10091904      1     male    11          .          .          .          .
 49.   10091904      2     male    11          .          .          .          .
 50.   10091904      3     male    12          .          .          .          .

Adding extra observations: “append” command

Adding extra variables: “merge” command

Page 45: SC968 Panel data methods for sociologists Lecture 1, part 1

Whether appending or merging

The data set you are using at the time is called the “master” data
The data set you want to append or merge to it is called the “using” data
Make sure you can identify observations properly beforehand
Make sure you can identify observations uniquely afterwards

Page 46: SC968 Panel data methods for sociologists Lecture 1, part 1

Appending

Use this command to add more observations
Relatively easy
Check first that you are really adding observations you don’t already have (or that if you are adding duplicates, you really want to do this)
Syntax: append using using_data
STATA simply sticks the “using” data on the end of the “master” data
STATA re-orders the variables if necessary
If the using data contain variables not present in the master data, STATA sets the values of these variables to missing for the observations that came from the master data (and vice versa if the master data contain variables not present in the using data)
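A minimal sketch of appending, using two tiny made-up files. The tempfile name wave1 is an assumption; the sex variable exists only in the master data here, so it comes out missing (an empty string) for the appended rows.

```stata
* "using" data: wave 1, without a sex variable
clear
input long pid byte wave byte age
19057 1 59
28005 1 30
end
tempfile wave1
save `wave1'

* "master" data: wave 2, with a sex variable
clear
input long pid byte wave byte age str6 sex
19057 2 60 "female"
28005 2 31 "male"
end

* stick the using data on the end of the master data
append using `wave1'

sort pid wave
list, sepby(pid)
```

After appending, pid and wave together still identify each row uniquely, which is worth checking before going further.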

Page 47: SC968 Panel data methods for sociologists Lecture 1, part 1

Merging is more complicated

Use “merge” to add more variables to a data set

Master data: age.dta

   pid   wave   age
 28005      1    30
 19057      1    59
 28005      2    31
 19057      3    61
 19057      4    62
 28005      4    33

Using data: sex.dta

   pid   wave      sex
 19057      1   female
 19057      3   female
 28005      1     male
 28005      2     male
 28005      4     male
 42571      1     male
 42571      3     male

Notice that the two data sets don’t contain the same observations

• merge 1:1 pid wave using sex

   pid   wave   age      sex   _merge
 19057      1    59   female        3
 19057      3    61   female        3
 19057      4    62        .        1
 28005      1    30     male        3
 28005      2    31     male        3
 28005      4    33     male        3
 42571      1     .     male        2
 42571      3     .     male        2

Page 48: SC968 Panel data methods for sociologists Lecture 1, part 1

Merging

STATA creates a variable called _merge after merging
• 1: observation in master but not using data
• 2: observation in using but not master data
• 3: observation in both data sets

Options available for discarding some observations – see manual
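The age/sex example from the previous slide can be reproduced as a short do-file. This is a sketch: the tempfile stands in for sex.dta, and sex is stored as a string here, so its missing value shows as an empty string rather than a dot.

```stata
* using data (stands in for sex.dta)
clear
input long pid byte wave str6 sex
19057 1 "female"
19057 3 "female"
28005 1 "male"
28005 2 "male"
28005 4 "male"
42571 1 "male"
42571 3 "male"
end
tempfile sex
save `sex'

* master data (stands in for age.dta)
clear
input long pid byte wave byte age
28005 1 30
19057 1 59
28005 2 31
19057 3 61
19057 4 62
28005 4 33
end

merge 1:1 pid wave using `sex'
sort pid wave
list, sepby(pid)
tab _merge

* one way of discarding unmatched observations at merge time:
* merge 1:1 pid wave using `sex', keep(match)
```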

Page 49: SC968 Panel data methods for sociologists Lecture 1, part 1

More on merging

Previous example showed one-to-one merging
Not every observation was in both data sets, but every observation in the master data was matched with a maximum of only one observation in the using data – and vice versa.

Many-to-one merging:
Household-level data sets contain only one observation per household (usually <1 per person)
Regional data (eg, regional unemployment data), usually one observation per region
Sample syntax: merge m:1 hid wave using hhinc_data

Master data (one row per person):

  hid     pid   age
 1604   19057    59
 2341   28005    30
 3569   42571    59
 4301   51538    22
 4301   51562     4
 4956   59377    46
 5421   64966    70
 6363   76166    77
 6827   81763    71
 6827   81798    72

Using data (one row per household):

  hid   h/h income
 1604          780
 2341         1501
 3569          268
 4301          394
 4956         1601
 5421          225
 6363          411
 6827          743

Result:

  hid     pid   age   h/h income
 1604   19057    59          780
 2341   28005    30         1501
 3569   42571    59          268
 4301   51538    22          394
 4301   51562     4          394
 4956   59377    46         1601
 5421   64966    70          225
 6363   76166    77          411
 6827   81763    71          743
 6827   81798    72          743

One-to-many merging
Job and relationship files contain one observation per episode (potentially >1 per person)
Income files contain one observation per source of income (potentially >1 per person)
Sample syntax: merge 1:m pid wave using births_data
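The many-to-one example above, sketched as a do-file. The variable name hhinc and the tempfile are assumptions. Because the toy tables cover a single wave, the key is hid alone; with several waves it would be hid and wave, as in the sample syntax.

```stata
* using data: one row per household
clear
input int hid int hhinc
1604  780
2341 1501
3569  268
4301  394
4956 1601
5421  225
6363  411
6827  743
end
tempfile hhinc
save `hhinc'

* master data: one row per person
clear
input int hid long pid byte age
1604 19057 59
2341 28005 30
3569 42571 59
4301 51538 22
4301 51562  4
4956 59377 46
5421 64966 70
6363 76166 77
6827 81763 71
6827 81798 72
end

* many persons (m) share one household record (1)
merge m:1 hid using `hhinc'
sort hid pid
list hid pid age hhinc _merge, sepby(hid)
```

Both members of household 4301, and both members of 6827, pick up the same household income.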

Page 50: SC968 Panel data methods for sociologists Lecture 1, part 1

Long and wide forms

The data we have here is in “long” form
One row for each person/wave combination
From a few slides back:

            pid   wave    hgsex   age
  1.   10019057      1   female    59
  2.   10019057      2   female    60
  3.   10019057      3   female    61
  4.   10019057      4   female    62
  5.   10028005      1     male    30
  6.   10028005      2     male    31
  7.   10028005      3     male    32
  8.   10028005      4     male    33
  9.   10042571      1     male    59
 10.   10042571      3     male    60
 11.   10042571      4     male    62
 12.   10051538      1   female    22
 13.   10051538      2   female    23
 14.   10051538      3   female    24
 15.   10051538      4   female    25
 16.   10051562      1   female     4
 17.   10051562      2   female     5
 18.   10051562      3   female     6
 19.   10051562      4   female     7
 20.   10059377      1   female    46
 21.   10059377      2   female    47
 22.   10059377      3   female    48
 23.   10059377      4   female    49
 24.   10064966      1     male    70
 25.   10064966      2     male    70
 26.   10064966      3     male    71
 27.   10064966      4     male    72
 28.   10076166      1   female    77
 29.   10076166      2   female    78
 30.   10076166      3   female    79

Page 51: SC968 Panel data methods for sociologists Lecture 1, part 1

Wide form

However, it’s also possible to put longitudinal data into “wide” form
One observation per person, with different variables relating to different years of data

            pid     sex1   age1   age2   age3   age4
  1.   10019057   female     59     60     61     62
  2.   10028005     male     30     31     32     33
  3.   10042571     male     59      .     60     62
  4.   10051538   female     22     23     24     25
  5.   10051562        .      4      5      6      7
  6.   10059377   female     46     47     48     49
  7.   10064966     male     70     70     71     72
  8.   10076166   female     77     78     79     79

age1: age at wave 1, and so on
Sex doesn’t change [usually]

Page 52: SC968 Panel data methods for sociologists Lecture 1, part 1

The reshape command

Switching from long to wide:
• reshape wide [stubnames], i(id) j(year)

In BHPS, this becomes
• reshape wide [stubnames], i(pid) j(wave)

What are stub names?
They are a list of variables which vary between years
Variables like sex or ethnicity would not normally be included in this list

Switching from wide to long: exactly the opposite
• reshape long [stubnames], i(id) j(year)

Lots more info and examples in STATA manual
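A minimal sketch of both directions on four toy rows taken from the long-form listing. Here age is the only stub name; sex is constant within person and so is left out of the list.

```stata
clear
input long pid byte wave str6 sex byte age
10019057 1 "female" 59
10019057 2 "female" 60
10028005 1 "male"   30
10028005 2 "male"   31
end

* long -> wide: one row per person, variables age1 and age2
reshape wide age, i(pid) j(wave)
list

* wide -> long: back to one row per person/wave combination
reshape long age, i(pid) j(wave)
list, sepby(pid)
```

If a variable left out of the stub list does vary within person, reshape wide stops with an error, which is a useful check in itself.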

Page 53: SC968 Panel data methods for sociologists Lecture 1, part 1

Simple models using longitudinal data

Auto-regressive and time-lagged models
Models of change

Page 54: SC968 Panel data methods for sociologists Lecture 1, part 1

But first: the GHQ

We use this for lots of analysis in the lectures and practical sessions
General Health Questionnaire
Different versions: BHPS carries the GHQ-12, with 12 questions.

Have you recently:
  been able to concentrate on whatever you're doing?
  lost much sleep over worry?
  felt that you were playing a useful part in things?
  felt capable of making decisions about things?
  felt constantly under strain?
  felt you couldn't overcome your difficulties?
  been able to enjoy your normal day to day activities?
  been able to face up to problems?
  been feeling unhappy or depressed?
  been losing confidence in yourself?
  been thinking of yourself as a worthless person?
  been feeling reasonably happy, all things considered?

Answer each question on a 4-point scale: not at all - no more than usual - rather more - much more

Page 55: SC968 Panel data methods for sociologists Lecture 1, part 1

HLGHQ1 in BHPS
Sum of scores: LIKERT scale
We recode <0 values to missings, rename LIKERT
Consider as a continuous variable

    GHQ (ghq) 1: likert |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        missing or wild |        582        2.10        2.10
       proxy respondent |      1,202        4.33        6.43
                      0 |         77        0.28        6.70
                      1 |        109        0.39        7.10
                      2 |        149        0.54        7.63
                      3 |        288        1.04        8.67
                      4 |        504        1.82       10.49
                      5 |        867        3.12       13.61
                      6 |      2,229        8.03       21.64
                      7 |      2,265        8.16       29.80
                      8 |      2,355        8.48       38.28
                      9 |      2,426        8.74       47.02
                     10 |      2,259        8.14       55.16
                     11 |      2,228        8.03       63.19
                     12 |      2,478        8.93       72.11
                     13 |      1,316        4.74       76.85
                     14 |      1,115        4.02       80.87
                     15 |        876        3.16       84.03
                     16 |        714        2.57       86.60
                     17 |        635        2.29       88.89
                     18 |        499        1.80       90.68
                     19 |        439        1.58       92.27
                     20 |        381        1.37       93.64
                     21 |        318        1.15       94.78
                     22 |        276        0.99       95.78
                     23 |        264        0.95       96.73
                     24 |        220        0.79       97.52
                     25 |        134        0.48       98.00
                     26 |        103        0.37       98.38
                     27 |         96        0.35       98.72
                     28 |         59        0.21       98.93
                     29 |         66        0.24       99.17
                     30 |         47        0.17       99.34
                     31 |         47        0.17       99.51
                     32 |         35        0.13       99.64
                     33 |         26        0.09       99.73
                     34 |         20        0.07       99.80
                     35 |         29        0.10       99.91
                     36 |         26        0.09      100.00
------------------------+-----------------------------------
                  Total |     27,759      100.00
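The recoding step might look like the sketch below. The toy values and the specific negative codes (-7 and -9) are assumptions for illustration; the only thing taken from the slide is that values below zero become missing and the variable is renamed LIKERT.

```stata
* Toy data: negative values stand for "proxy respondent" / "missing or wild"
* (the exact codes used here are assumed)
clear
input long pid int hlghq1
1  7
2 12
3 -7
4 -9
5 36
end

* recode every value below zero to missing, then rename
recode hlghq1 (min/-1 = .)
rename hlghq1 LIKERT

summarize LIKERT
```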

Page 56: SC968 Panel data methods for sociologists Lecture 1, part 1

GHQ

HLGHQ2: Caseness scale
Recodes answers 3-4 as 1, and adds up
Scores above 2 used to indicate psychological morbidity

   subjective wellbeing |
     (ghq) 2: caseness  |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        missing or wild |        582        2.10        2.10
       proxy respondent |      1,202        4.33        6.43
                      0 |     13,222       47.63       54.06
                      1 |      3,569       12.86       66.92
                      2 |      2,189        7.89       74.80
                      3 |      1,581        5.70       80.50
                      4 |      1,143        4.12       84.61
                      5 |        933        3.36       87.98
                      6 |        720        2.59       90.57
                      7 |        561        2.02       92.59
                      8 |        529        1.91       94.50
                      9 |        417        1.50       96.00
                     10 |        385        1.39       97.38
                     11 |        343        1.24       98.62
                     12 |        383        1.38      100.00
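How a caseness score is built from the 12 answers can be sketched as follows. The item names q1 to q12 and the three response patterns are made up; BHPS already supplies the finished score, so this is only to show the arithmetic (answers 3 or 4 count as 1, everything else as 0).

```stata
* Three hypothetical respondents, 12 items each, answers coded 1-4
clear
input byte(q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12)
1 1 2 2 1 1 2 2 1 1 2 2
3 4 2 1 3 1 2 2 1 1 2 4
4 4 4 4 3 3 3 3 4 4 3 3
end

* count the items answered 3 or 4
gen byte caseness = 0
foreach v of varlist q1-q12 {
    replace caseness = caseness + inrange(`v', 3, 4)
}

* scores above 2 flag possible psychological morbidity
* (no missing answers in this toy data, so no special handling is needed)
gen byte case = caseness > 2

list caseness case
```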

Page 57: SC968 Panel data methods for sociologists Lecture 1, part 1

Time-lagged models

Start with simple OLS model

The Likert score is a measure of psychological wellbeing derived from a battery of questions

. reg LIKERT age age2 female ue_sick partner if age >= 18

      Source |       SS       df       MS              Number of obs =   25108
-------------+------------------------------           F(  5, 25102) =  278.23
       Model |   37842.892     5   7568.5784           Prob > F      =  0.0000
    Residual |  682847.462 25102  27.2029106           R-squared     =  0.0525
-------------+------------------------------           Adj R-squared =  0.0523
       Total |  720690.354 25107  28.7047578           Root MSE      =  5.2156

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0797637   .0111716     7.14   0.000     .0578667    .1016607
        age2 |  -.0007342   .0001119    -6.56   0.000    -.0009535    -.000515
      female |   1.593608   .0661958    24.07   0.000      1.46386    1.723356
     ue_sick |   3.562843   .1249977    28.50   0.000      3.31784    3.807846
     partner |   -.044241   .0788756    -0.56   0.575    -.1988419    .1103598
       _cons |   8.298458   .2374816    34.94   0.000      7.83298    8.763936
------------------------------------------------------------------------------

Page 58: SC968 Panel data methods for sociologists Lecture 1, part 1

Generate lagged variable

. sort pid wave
. capture drop LIKERT_lag
. gen LIKERT_lag = LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
(14738 missing values generated)
. * check:
. list pid wave LIKERT LIKERT_lag in 1/30, clean

            pid   wave   LIKERT   LIKERT~g
  1.   10019057      1        7          .
  2.   10019057      2       12          7
  3.   10019057      3       10         12
  4.   10019057      4       11         10
  5.   10019057      5       12         11
  6.   10019057      6        .         12
  7.   10019057      7       12          .
  8.   10019057      8       12         12
  9.   10019057      9       12         12
 10.   10019057     10       12         12
 11.   10019057     11       12         12
 12.   10019057     12       12         12
 13.   10019057     13        .         12
 14.   10019057     14       11          .
 15.   10019057     15       11         11
 16.   10028005      1        7          .
 17.   10028005      2        8          7
 18.   10028005      3       12          8
 19.   10028005      4        7         12
 20.   10028005      5        8          7
 21.   10028005      6        8          8
 22.   10028005      7        7          8
 23.   10028005      8        9          7
 24.   10028005      9        7          9
 25.   10028005     10        8          7
 26.   10028005     11        7          8
 27.   10028005     12       12          7
 28.   10028005     13        9         12
 29.   10028005     14       13          9
 30.   10042571      1       11          .

NB: the 1/30 here is just so it will fit on the page. You should check many more observations than this!
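The lag can also be built with STATA's time-series operators once the panel structure is declared. This sketch is an alternative to the explicit [_n-1] approach, not what the slides use; it runs on six toy rows and checks that both routes agree, including across the gap where wave 2 is absent. The name LIKERT_L1 is an assumption.

```stata
clear
input long pid byte wave byte LIKERT
10019057 1  7
10019057 2 12
10019057 3 10
10042571 1 11
10042571 3  7
10042571 4  6
end

* explicit-subscript version, as on the slide
sort pid wave
gen LIKERT_lag = LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1

* operator version: declare the panel, then use L.
xtset pid wave
gen LIKERT_L1 = L.LIKERT

* the two should be identical (missing in the same places)
assert LIKERT_lag == LIKERT_L1
list, sepby(pid)
```

Either way, the lag is missing at each person's first wave and after any skipped wave.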

Page 59: SC968 Panel data methods for sociologists Lecture 1, part 1

OLS, with lagged dependent variable

. reg LIKERT LIKERT_lag age age2 female ue_sick partner if age >= 18

      Source |       SS       df       MS              Number of obs =   21463
-------------+------------------------------           F(  6, 21456) = 1285.40
       Model |  163104.604     6  27184.1006           Prob > F      =  0.0000
    Residual |  453760.485 21456  21.1484193           R-squared     =  0.2644
-------------+------------------------------           Adj R-squared =  0.2642
       Total |  616865.089 21462  28.7421997           Root MSE      =  4.5987

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  LIKERT_lag |   .4752892   .0060424    78.66   0.000     .4634456    .4871327
         age |   .0272471   .0108394     2.51   0.012     .0060011    .0484931
        age2 |  -.0002391   .0001079    -2.22   0.027    -.0004506   -.0000276
      female |   .8414271   .0638746    13.17   0.000     .7162282     .966626
     ue_sick |   2.128451   .1222784    17.41   0.000     1.888777    2.368126
     partner |   .0967488   .0759926     1.27   0.203    -.0522022    .2456999
       _cons |   4.593374   .2365749    19.42   0.000      4.12967    5.057079
------------------------------------------------------------------------------

R-squared rockets from 5% to 26%

Big & very significant coefficient on lagged variable

Coeff on “ue_sick” falls from 3.6 to 2.1

Also possible to include lagged explanatory variables
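A sketch of the syntax for lagged explanatory variables, run on simulated data so that it is self-contained. Everything about the data (seed, sample size, effect sizes) is made up, so the estimates carry no substantive meaning; only the use of L. inside the regression command is the point.

```stata
* Simulated panel: 200 hypothetical people, 5 waves each
clear
set seed 968
set obs 200
gen long pid = _n
expand 5
bysort pid: gen byte wave = _n

gen byte ue_sick = runiform() < 0.1
gen LIKERT = 11 + 3*ue_sick + rnormal(0, 5)

* declare the panel so that L. knows what "previous wave" means
xtset pid wave

* lagged dependent variable plus current and lagged ue_sick
reg LIKERT L.LIKERT ue_sick L.ue_sick
```

Each person's first wave drops out of the estimation sample, since it has no lag.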

Page 60: SC968 Panel data methods for sociologists Lecture 1, part 1

Models of change

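Start with OLS model [simplified, but imagine more variables]

$$y_i = \beta x_i + \dots + \varepsilon_i$$

Separate model for each year – suffix denotes year

$$y_{i1} = \beta x_{i1} + \dots + \varepsilon_{i1}$$

$$y_{i2} = \beta x_{i2} + \dots + \varepsilon_{i2}$$

Subtract 1st from 2nd model

$$(y_{i2} - y_{i1}) = \beta (x_{i2} - x_{i1}) + \dots + (\varepsilon_{i2} - \varepsilon_{i1})$$

Or, express in terms of change

$$\Delta y_i = \beta \Delta x_i + \dots + \Delta \varepsilon_i$$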

Page 61: SC968 Panel data methods for sociologists Lecture 1, part 1

Generate difference variables

capture drop dif*
sort pid wave
gen dif_LIKERT = LIKERT - LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_age = age - age[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_age2 = age2 - age2[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_female = female - female[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_ue_sick = ue_sick - ue_sick[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_partner = partner - partner[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1

Check you understand why dif_female will [very nearly] always be zero
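First differences can also be generated with the D. operator after declaring the panel. This is a sketch of an alternative route on five toy rows, not the slides' own code; D.x is x minus its value in the previous wave, and is missing when the previous wave is absent.

```stata
clear
input long pid byte wave byte age byte LIKERT
10019057 1 59  7
10019057 2 60 12
10019057 3 61 10
10042571 1 59 11
10042571 3 60  7
end
gen age2 = age^2

xtset pid wave
gen dif_LIKERT = D.LIKERT
gen dif_age    = D.age
gen dif_age2   = D.age2    // age2 minus last wave's age2

list, sepby(pid)
```

For pid 10019057 at wave 2, dif_age2 is 60² − 59² = 119, and every difference for pid 10042571 at wave 3 is missing because wave 2 is not there.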

Page 62: SC968 Panel data methods for sociologists Lecture 1, part 1

Check for sensible results!

. list pid wave age dif_age in 1/30, clean

            pid   wave   age   dif_age
  1.   10019057      1    59         .
  2.   10019057      2    60         1
  3.   10019057      3    61         1
  4.   10019057      4    62         1
  5.   10019057      5    63         1
  6.   10019057      6    64         1
  7.   10019057      7    65         1
  8.   10019057      8    66         1
  9.   10019057      9    67         1
 10.   10019057     10    67         0
 11.   10019057     11    68         1
 12.   10019057     12    69         1
 13.   10019057     13    71         2
 14.   10019057     14    71         0
 15.   10019057     15    73         2
 16.   10028005      1    30         .
 17.   10028005      2    31         1
 18.   10028005      3    32         1
 19.   10028005      4    33         1
 20.   10028005      5    34         1
 21.   10028005      6    35         1
 22.   10028005      7    36         1
 23.   10028005      8    37         1
 24.   10028005      9    38         1
 25.   10028005     10    39         1
 26.   10028005     11    40         1
 27.   10028005     12    41         1
 28.   10028005     13    42         1
 29.   10028005     14    43         1
 30.   10042571      1    59         .

Page 63: SC968 Panel data methods for sociologists Lecture 1, part 1

More checking….

. list pid wave LIK dif_LIK in 1/30, clean

            pid   wave   LIKERT   dif_LI~T
  1.   10019057      1        7          .
  2.   10019057      2       12          5
  3.   10019057      3       10         -2
  4.   10019057      4       11          1
  5.   10019057      5       12          1
  6.   10019057      6        .          .
  7.   10019057      7       12          .
  8.   10019057      8       12          0
  9.   10019057      9       12          0
 10.   10019057     10       12          0
 11.   10019057     11       12          0
 12.   10019057     12       12          0
 13.   10019057     13        .          .
 14.   10019057     14       11          .
 15.   10019057     15       11          0
 16.   10028005      1        7          .
 17.   10028005      2        8          1
 18.   10028005      3       12          4
 19.   10028005      4        7         -5
 20.   10028005      5        8          1
 21.   10028005      6        8          0
 22.   10028005      7        7         -1
 23.   10028005      8        9          2
 24.   10028005      9        7         -2
 25.   10028005     10        8          1
 26.   10028005     11        7         -1
 27.   10028005     12       12          5
 28.   10028005     13        9         -3
 29.   10028005     14       13          4
 30.   10042571      1       11          .

Page 64: SC968 Panel data methods for sociologists Lecture 1, part 1

Issues

Most differences are zero
Moving into unemployment or partnership is given equal and opposite weighting to moving out. No real reason why this should be the case
There are MUCH better ways to use these data!
Nevertheless, let’s proceed!
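To see what "equal and opposite weighting" means, it may help to split a differenced dummy into separate entry and exit indicators. This is a sketch on one hypothetical person; the names enter_ue and exit_ue are assumptions, and the slides themselves do not go on to estimate such a model.

```stata
* One hypothetical person: becomes unemployed/sick at wave 2, leaves at wave 4
clear
input long pid byte wave byte ue_sick
1 1 0
1 2 1
1 3 1
1 4 0
end

xtset pid wave
gen dif_ue_sick = D.ue_sick      // +1 on entry, -1 on exit, 0 otherwise

* separate indicators, so entry and exit need not be mirror images
gen byte enter_ue = dif_ue_sick ==  1 if !missing(dif_ue_sick)
gen byte exit_ue  = dif_ue_sick == -1 if !missing(dif_ue_sick)

list
```

With a single dif_ue_sick regressor, one coefficient has to serve for both transitions; with two indicators, each transition could in principle have its own.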

Page 65: SC968 Panel data methods for sociologists Lecture 1, part 1

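Results

. reg dif_LIKERT dif_age - dif_partner if age >= 18

      Source |       SS       df       MS              Number of obs =   21461
-------------+------------------------------           F(  4, 21456) =   44.58
       Model |  5058.58867     4  1264.64717           Prob > F      =  0.0000
    Residual |  608713.117 21456  28.3702982           R-squared     =  0.0082
-------------+------------------------------           Adj R-squared =  0.0081
       Total |  613771.706 21460  28.6007319           Root MSE      =  5.3264

------------------------------------------------------------------------------
  dif_LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     dif_age |   .3757715   .1227943     3.06   0.002     .1350856    .6164574
    dif_age2 |  -6.24e-07   .0000212    -0.03   0.977    -.0000422     .000041
 dif_ue_sick |   1.857857    .149281    12.45   0.000     1.565255    2.150458
 dif_partner |  -.5948999   .1780776    -3.34   0.001    -.9439452   -.2458545
       _cons |  -.3104285   .1364378    -2.28   0.023    -.5778567   -.0430003
------------------------------------------------------------------------------

Female drops out
Coeffs on sick and partner significant

. reg LIKERT age age2 female ue_sick partner if age >= 18

      Source |       SS       df       MS              Number of obs =   25108
-------------+------------------------------           F(  5, 25102) =  278.23
       Model |   37842.892     5   7568.5784           Prob > F      =  0.0000
    Residual |  682847.462 25102  27.2029106           R-squared     =  0.0525
-------------+------------------------------           Adj R-squared =  0.0523
       Total |  720690.354 25107  28.7047578           Root MSE      =  5.2156

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0797637   .0111716     7.14   0.000     .0578667    .1016607
        age2 |  -.0007342   .0001119    -6.56   0.000    -.0009535    -.000515
      female |   1.593608   .0661958    24.07   0.000      1.46386    1.723356
     ue_sick |   3.562843   .1249977    28.50   0.000      3.31784    3.807846
     partner |   -.044241   .0788756    -0.56   0.575    -.1988419    .1103598
       _cons |   8.298458   .2374816    34.94   0.000      7.83298    8.763936
------------------------------------------------------------------------------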