SC968 Panel data methods for sociologists Lecture 1, part 1
Transcript of SC968 Panel data methods for sociologists, Lecture 1, part 1
A review of concepts for regression modelling (or things you should know already)
Overview
Models: OLS, logit and probit, mathematically and practically
Interpretation of results, measures of fit and regression diagnostics
Model specification
Post-estimation commands
STATA competence
Ordinary Least Squares (OLS)
y_i = β0 + β1 x_1i + β2 x_2i + β3 x_3i + … + βK x_Ki + ε_i
Value of dependent variable for individual i (LHS variable)
Value of explanatory variable 1 for person i
Coefficient on variable 1
Residual (disturbance, error term)
Total no. of explanatory variables (RHS variables or regressors) is K
Examples
y_i = mental health; x1 = sex; x2 = age; x3 = marital status; x4 = employment status; x5 = physical health
y_i = hourly pay; x1 = sex; x2 = age; x3 = education; x4 = job tenure; x5 = industry; x6 = region
β0 is the intercept (constant)
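To make the least-squares idea concrete before the formal treatment, here is a minimal Python sketch (the course itself uses STATA; the data here are invented for illustration) estimating β0 and β1 in the one-regressor case y_i = β0 + β1 x_i + ε_i:

```python
# One-regressor OLS from the textbook closed form:
#   slope = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   intercept = ybar - slope * xbar
def ols_simple(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    intercept = ybar - slope * xbar
    return intercept, slope

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x plus noise (invented)
b0, b1 = ols_simple(x, y)
```

The estimated slope comes out close to 2 and the intercept close to 0, as the invented data were built that way.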
OLS
In vector form:

  y_i = x_i'β + ε_i

where x_i is the vector of explanatory variables for person i and β is the vector of coefficients.

In matrix form:

  y = Xβ + ε

where y is the N × 1 vector of outcomes, X is the N × K matrix of regressors (one row per observation, one column per regressor), β is the K × 1 vector of coefficients and ε is the N × 1 vector of disturbances.

Note: you will often see x'β written as xβ
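The matrix form leads to the least-squares solution β̂ = (X'X)⁻¹X'y. A self-contained Python sketch (invented data; in practice one would of course use a statistics package such as STATA) that solves the normal equations (X'X)β = X'y:

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small linear system
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    # beta solves the normal equations (X'X) beta = X'y
    K = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(K)] for i in range(K)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(K)]
    return solve(XtX, Xty)

# First column of ones is the constant; data invented so that y = 1 + 2x exactly
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 7]
beta = ols(X, y)
```

With noise-free data the estimator recovers the true coefficients exactly.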
OLS
min over β of Σ_i ε_i²
Also called "linear regression". Assumes the dependent variable is a linear combination of the explanatory variables, plus a disturbance. "Least squares": the β's are estimated so as to minimise the sum of the squared ε's.
Residuals have zero mean
It follows that the ε's and X's are uncorrelated
  Violated if a regressor is endogenous, eg number of children in female labour supply models; cure by (eg) instrumental variables
Homoscedasticity: all ε's have the same variance
  Classic example: food consumption and income; cure by using weighted least squares
Nonautocorrelation: ε's uncorrelated with each other
  Arises in data sets where the same individual appears multiple times; adjust standard errors with the 'cluster' option in STATA
Disturbances are iid (normally distributed, zero mean, constant variance)
Basic Assumptions
E(ε_i | X_i) = 0, E(ε_i) = 0
E(ε_i X_i) = 0
Var(ε_i) = σ²
E(ε_i ε_j) = 0 for i ≠ j
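Two of these properties hold mechanically for fitted OLS residuals, by construction of the estimator: they sum to zero and are orthogonal to the regressors. A quick Python check with invented data:

```python
# Fit one-regressor OLS, then verify sum(e) = 0 and sum(e * x) = 0 (up to
# floating-point error). Data invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_resid = sum(resid) / n                         # zero by construction
cross = sum(e * xi for e, xi in zip(resid, x))      # zero by construction
```

Note this is a property of the fit, not a test of the assumptions: exogeneity concerns the unobserved ε's, which is why endogeneity cannot be detected from residuals alone.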
When is OLS appropriate?
When you have a continuous dependent variable
  Eg, you would use it to estimate regressions for height, but not for whether a person has a university degree
When the assumptions are not obviously violated
As a first step in research, to get ball-park estimates
  We will use them a lot for this purpose
Worked examples: coefficients, p-values, t-statistics
Measures of fit (R-squared, adjusted R-squared)
Thinking about specification
Post-estimation commands
Regression diagnostics
A note on the data All examples (in lectures and practicals) drawn from a 20% sample of the British
Household Panel Survey (BHPS) – more about the data later!
Summarize monthly earned income
. sum incm if age >= 17 & age <= 64, d

                            incm
-------------------------------------------------------------
      Percentiles      Smallest
 1%           43              1
 5%          156           1.25
10%     268.6667              2       Obs              16696
25%     615.3333       2.416667       Sum of Wgt.      16696

50%     1073.088                      Mean          1282.831
                        Largest       Std. Dev.     1008.308
75%         1690       9207.083
90%     2471.848       9333.333       Variance       1016685
95%     3061.355          10000       Skewness       2.19295
99%     5003.849          10000       Kurtosis      11.94321
For illustrative purposes only. Not an example of good practice.
R-squared = Model SS / Total SS
Tests whether all coeffs except constant are jointly zero
MS = SS/df
Root MSE = sqrt(MSR)
Coefficients + or – 1.96 standard errors
T-stat = coefficient / standard error
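These two rules can be verified against any row of a regression table; a small Python sketch using the female row of the worked example (coefficient and standard error taken from the output):

```python
# t-statistic and approximate 95% confidence interval from a coefficient
# and its standard error (female row of the first worked example).
coef, se = -594.9641, 13.26812

t = coef / se                              # reported as -44.84
ci = (coef - 1.96 * se, coef + 1.96 * se)  # reported as [-620.97, -568.96]
```

The tiny discrepancies against the printed interval arise because Stata uses the exact t critical value rather than 1.96.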
First worked example
Analysis of variance (ANOVA) table
. reg incm female age age2 partner ed_sec ed_deg mth_int if age >= 17 & age <= 64

      Source |       SS       df       MS              Number of obs =   16458
-------------+------------------------------           F(  7, 16450) =  957.92
       Model |  4.8145e+09     7   687785597           Prob > F      =  0.0000
    Residual |  1.1811e+10 16450  718000.667           R-squared     =  0.2896
-------------+------------------------------           Adj R-squared =  0.2893
       Total |  1.6626e+10 16457   1010245.5           Root MSE      =  847.35

        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -594.9641   13.26812   -44.84   0.000    -620.9711   -568.9571
         age |   101.0994   3.859657    26.19   0.000     93.53401    108.6647
        age2 |  -1.155281   .0479992   -24.07   0.000    -1.249364   -1.061197
     partner |   155.7992   16.62703     9.37   0.000     123.2085      188.39
      ed_sec |   380.5032   14.36582    26.49   0.000     352.3446    408.6618
      ed_deg |   1076.674   20.54526    52.40   0.000     1036.403    1116.945
     mth_int |  -5.059072   4.036446    -1.25   0.210    -12.97094      2.8528
       _cons |   -819.931   78.80064   -10.41   0.000    -974.3888   -665.4732
Monthly labour income, for people whose labour income is >= £1
All coefficients except month of interview are significant
29% of variation explained
Being female reduces income by nearly £600 per month
Income goes up with age and then down
16458 observations… oops, this is from panel data, so there are repeated observations on individuals.
What do the results tell us?
Coefficients, R-squared etc are unchanged from the previous specification
But standard errors are adjusted: standard errors are larger, t-statistics are lower
Add ,cluster(pid) as an option
. reg incm female age age2 partner ed_sec ed_deg mth_int if age >= 17 & age <= 64, cluster(pid)

Linear regression                                      Number of obs =   16458
                                                       F(  7,  2465) =  135.26
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2896
                                                       Root MSE      =  847.35

                                 (Std. Err. adjusted for 2466 clusters in pid)
              |              Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -594.9641   31.81172   -18.70   0.000    -657.3445   -532.5836
         age |   101.0994   7.323088    13.81   0.000     86.73932    115.4594
        age2 |  -1.155281   .0933813   -12.37   0.000    -1.338395   -.9721666
     partner |   155.7992   30.87227     5.05   0.000     95.26099    216.3375
      ed_sec |   380.5032   30.36746    12.53   0.000     320.9549    440.0516
      ed_deg |   1076.674   64.45131    16.71   0.000     950.2898    1203.058
     mth_int |  -5.059072   4.126102    -1.23   0.220    -13.15006    3.031912
       _cons |   -819.931   132.8455    -6.17   0.000    -1080.431   -559.4306
Let’s get rid of the “month” variable
. reg incm female age age2 partner ed_sec ed_deg if age >= 17 & age <= 64, cluster(pid)

Linear regression                                      Number of obs =   16460
                                                       F(  6,  2466) =  156.78
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2895
                                                       Root MSE      =  847.33

                                 (Std. Err. adjusted for 2467 clusters in pid)
              |              Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -594.8596   31.80682   -18.70   0.000    -657.2304   -532.4887
         age |   100.9827   7.325995    13.78   0.000       86.617    115.3485
        age2 |  -1.153834   .0934155   -12.35   0.000    -1.337015   -.9706534
     partner |   155.5618   30.87778     5.04   0.000     95.01275    216.1109
      ed_sec |   381.0247   30.36183    12.55   0.000     321.4874     440.562
      ed_deg |   1076.837   64.44019    16.71   0.000     950.4745    1203.199
       _cons |  -866.2836   125.9787    -6.88   0.000    -1113.319   -619.2486
Think about the female coefficient a bit more. Could it be to do with women working shorter hours?
Is the coefficient on hours of work reasonable? £5.65 for every additional hour worked – certainly in the right ball park.
Control for weekly hours of work
. reg incm female age age2 partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)

Linear regression                                      Number of obs =   13998
                                                       F(  7,  2262) =  247.67
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4580
                                                       Root MSE      =  690.95

                                 (Std. Err. adjusted for 2263 clusters in pid)
              |              Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -314.6874   34.32954    -9.17   0.000    -382.0081   -247.3667
         age |   79.55289   6.372918    12.48   0.000     67.05551    92.05027
        age2 |   -.873335   .0817518   -10.68   0.000    -1.033651   -.7130186
     partner |   148.0265   26.07885     5.68   0.000     96.88551    199.1675
      ed_sec |     340.68   26.67171    12.77   0.000     288.3764    392.9835
      ed_deg |   996.7434   59.88369    16.64   0.000     879.3107    1114.176
        hrsm |   5.654682   .2467777    22.91   0.000     5.170747    6.138616
       _cons |  -1495.805   111.8223   -13.38   0.000     -1715.09    -1276.52
R-squared jumps from 29% to 46%
Coefficient on female goes from -595 to -315: almost half the effect of gender is explained by women's shorter hours of work
Age, partner and education coefficients are also reduced in magnitude, for similar reasons
Number of observations reduces from 16460 to 13998: missing data on hours
Looking at 2 specifications together
Is the effect of university qualifications statistically different from the effect of secondary education?
What age does income peak?
Income = β0 + … + β1·age + β2·age²
d(Income)/d(age) = β1 + 2·β2·age
The derivative is zero when
  age = -β1/(2·β2)
      = -79.553/(2 × (-0.873))
      = 45.5
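The peak-age arithmetic can be reproduced directly from the table above; a one-line Python check using the full-precision coefficients:

```python
# Turning point of a quadratic in age: d/d(age)[b_age*age + b_age2*age^2] = 0
# gives age* = -b_age / (2 * b_age2). Coefficients from the hours regression.
b_age = 79.55289
b_age2 = -0.873335

peak_age = -b_age / (2 * b_age2)   # about 45.5
```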
Interesting post-estimation activities
. test ed_sec = ed_deg

 ( 1)  ed_sec - ed_deg = 0

       F(  1,  2262) =  110.75
            Prob > F =    0.0000
A closer look at “partner” coefficient
. bysort female: reg incm age age2 partner ed_sec ed_deg hrsm if age >= 17 & age <= 64, cluster(pid)

-> female = 0

Linear regression                                      Number of obs =    6776
                                                       F(  6,  1095) =  115.53
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.3452
                                                       Root MSE      =  787.93

                                 (Std. Err. adjusted for 1096 clusters in pid)
              |              Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   113.3119   10.56356    10.73   0.000      92.5848     134.039
        age2 |  -1.257366   .1316253    -9.55   0.000    -1.515633      -.9991
     partner |    213.351    46.9817     4.54   0.000     121.1667    305.5354
      ed_sec |   356.7472   41.91151     8.51   0.000     274.5113    438.9832
      ed_deg |   1082.255   89.21241    12.13   0.000     907.2087    1257.302
        hrsm |   3.930412   .3784925    10.38   0.000      3.18776    4.673065
       _cons |  -1907.107   175.5681   -10.86   0.000    -2251.595    -1562.62

-> female = 1

Linear regression                                      Number of obs =    7222
                                                       F(  6,  1166) =  125.30
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4830
                                                       Root MSE      =  564.65

                                 (Std. Err. adjusted for 1167 clusters in pid)
              |              Robust
        incm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   56.20989   7.327688     7.67   0.000     41.83296    70.58682
        age2 |  -.6229372   .0961605    -6.48   0.000    -.8116041   -.4342702
     partner |   84.15365   29.27677     2.87   0.004      26.7126    141.5947
      ed_sec |   277.2823   31.66175     8.76   0.000     215.1619    339.4026
      ed_deg |   819.3002   73.74637    11.11   0.000     674.6098    963.9906
        hrsm |   6.806946   .3051015    22.31   0.000     6.208337    7.405556
       _cons |  -1382.844   133.0607   -10.39   0.000    -1643.909   -1121.779
Men who are part of a couple earn much more than men who are not – women less so.
Other coefficients also differ between men and women, but with current specification, we can’t test whether differences are significant.
Developed for discrete (categorical) dependent variables
  Eg, psychological morbidity, whether one has a job… think of other examples
Outcome variable is always 0 or 1. Estimate:
OLS (the linear probability model) would set F(X, β) = X'β, ie Y = X'β + ε. Inappropriate because:
  Heteroscedasticity: the outcome variable is always 0 or 1, so ε only takes the value -x'β or 1-x'β
  More seriously, one cannot constrain estimated probabilities to lie between 0 and 1.
Logit and Probit
Pr(Y = 1) = F(X, β)
Pr(Y = 0) = 1 - F(X, β)
Solution: We need a link function that will transform our dichotomous Y into a continuous form Y’
Looking for a function which lies between 0 and 1:
Cumulative normal distribution: Probit model
  • z-scores, assuming the cumulative normal distribution Φ
Logistic distribution: Logit (logistic) model
  • Logged odds of probability
They are very similar! Note how they lie between 0 and 1 (vertical axis)
Logit and Probit
Probit:  Pr(Y = 1) = Φ(x'β) = ∫ from -∞ to x'β of φ(t) dt
Logit:   Pr(Y = 1) = e^(x'β) / (1 + e^(x'β)) = Λ(x'β)
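Both link functions can be written in a few lines of Python (an outside illustration, not course code; the probit CDF uses the error function from the standard library):

```python
from math import erf, exp, sqrt

def probit_p(xb):
    # Cumulative normal Phi(x'b), via the identity Phi(z) = (1 + erf(z/sqrt(2)))/2
    return 0.5 * (1 + erf(xb / sqrt(2)))

def logit_p(xb):
    # Logistic CDF: e^{x'b} / (1 + e^{x'b}) = 1 / (1 + e^{-x'b})
    return 1 / (1 + exp(-xb))
```

Both return 0.5 at x'β = 0, increase monotonically, and stay strictly between 0 and 1, which is exactly why they cure the linear probability model's out-of-range predictions.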
Likelihood function: product of
  Pr(y=1) = F(x'β) for all observations where y=1
  Pr(y=0) = 1 - F(x'β) for all observations where y=0
(think of the probability of flipping exactly four heads and two tails with six coins)
Log likelihood written as
Estimated using an iterative procedure
  STATA chooses starting values for the β's
  Computes slopes of the likelihood function at these values
  Adjusts the β's accordingly
  Stops when the slope of the LF is ≈ 0
Can take time!
Maximum likelihood estimation
ln L = Σ_j y_j ln F(x_j'β) + Σ_j (1 - y_j) ln[1 - F(x_j'β)]
Let’s look at whether a person works
. tab jbstat, m

current economic |
        activity |      Freq.     Percent        Cum.
-----------------+-----------------------------------
 missing or wild |         13        0.03        0.03
              -7 |         66        0.18        0.21
    not answered |          2        0.01        0.22
   self-employed |      2,204        5.87        6.08
        employed |     14,702       39.15       45.24
      unemployed |      1,120        2.98       48.22
         retired |      4,726       12.59       60.80
 maternity leave |        320        0.85       61.66
     family care |      1,964        5.23       66.89
ft studt, school |      1,394        3.71       70.60
lt sick, disabld |      1,057        2.81       73.41
 gvt trng scheme |         67        0.18       73.59
           other |        124        0.33       73.92
               . |      9,793       26.08      100.00
-----------------+-----------------------------------
           Total |     37,552      100.00

. gen byte work = (jbstat == 1 | jbstat == 2) if jbstat >= 1 & jbstat != .
Logit regression: whether have a job
All the iterations
2* (LL of this model – LL of null model)
Measure of amount explained but less intuitive interpretation
From these coefficients we can tell whether estimated effects are positive or negative, and whether they are significant. They say something about effect sizes, but it is difficult to draw inferences directly from the coefficients.
. logit work female age age2 badhealth partner ed_sec ed_deg nkids if age >= 22 & age <= 60, cluster(pid)

Iteration 0:   log pseudolikelihood = -9174.0313
Iteration 1:   log pseudolikelihood = -7909.9067
Iteration 2:   log pseudolikelihood = -7838.4288
Iteration 3:   log pseudolikelihood = -7838.2372
Iteration 4:   log pseudolikelihood = -7838.2372

Logistic regression                                    Number of obs =   17268
                                                       Wald chi2(8)  =  613.59
                                                       Prob > chi2   =  0.0000
Log pseudolikelihood = -7838.2372                      Pseudo R2     =  0.1456

                                 (Std. Err. adjusted for 2430 clusters in pid)
              |              Robust
        work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -.8001156    .090802    -8.81   0.000    -.9780842    -.622147
         age |   .3578242   .0282831    12.65   0.000     .3023904    .4132581
        age2 |  -.0046546   .0003504   -13.28   0.000    -.0053414   -.0039679
   badhealth |  -.5213826   .0404068   -12.90   0.000    -.6005785   -.4421868
     partner |   .4681257   .0943383     4.96   0.000     .2832261    .6530253
      ed_sec |    .602653   .0900282     6.69   0.000     .4262009    .7791051
      ed_deg |   .8734892   .1468462     5.95   0.000      .585676    1.161302
       nkids |   -.477714   .0391116   -12.21   0.000    -.5543714   -.4010567
       _cons |  -3.666352   .5216913    -7.03   0.000    -4.688848   -2.643856
Comparing logit and probit
Scaling factor proposed by Amemiya (1981): multiply Probit coefficients by 1.6 to get an approximation to Logit. Other authors have suggested a factor of 1.8.
              Logit     Probit    Probit*1.6
female       -0.800     -0.455     -0.728
age           0.358      0.206      0.330
age2         -0.005     -0.003     -0.004
badhealth    -0.521     -0.300     -0.479
partner       0.468      0.284      0.455
ed_sec        0.603      0.343      0.548
ed_deg        0.873      0.476      0.762
nkids        -0.478     -0.275     -0.441
_cons        -3.666     -2.112     -3.380
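The Amemiya approximation can be checked numerically against the table above; a short Python sketch using a few of the reported coefficients:

```python
# Probit and logit coefficients from the worked example; scaling the probit
# coefficients by 1.6 should land close to (not exactly on) the logit ones.
probit = {"female": -0.455, "age": 0.206, "badhealth": -0.300, "nkids": -0.275}
logit  = {"female": -0.800, "age": 0.358, "badhealth": -0.521, "nkids": -0.478}

scaled = {k: round(1.6 * v, 3) for k, v in probit.items()}
```

The scaled values track the logit coefficients only approximately (eg -0.728 versus -0.800 for female), which is why the slide calls it an approximation rather than an identity.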
Marginal effects
After logit or probit estimation, use the margins command
Calculates marginal effects of each of the RHS variables on the dependent variable
  Slope of the function for continuous variables
  Effect of a change from 0 to 1 in a dummy variable
Can also provide predicted probabilities, linear combinations, plots, and much more!
MEM: Marginal Effects at the Means
  margins, dydx(*) atmeans
AME: Average Marginal Effects
  margins, dydx(*)
MER: Marginal Effects at Representative Values
  margins, dydx(*) at(age=(20 30 40 50))
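For a continuous regressor in a logit, the marginal effect at any point is dp/dx = p(1-p)·β; AME averages this over the sample, while MEM evaluates it at the sample means. A hedged Python sketch of the arithmetic margins performs (coefficients and data invented for illustration):

```python
from math import exp

def logistic(z):
    return 1 / (1 + exp(-z))

beta0, beta1 = -1.0, 0.5      # invented logit coefficients
xs = [0, 1, 2, 3, 4]          # invented sample of one continuous regressor

# AME: average of p(1-p)*beta1 across the observed x's
ame = sum(logistic(beta0 + beta1 * x) * (1 - logistic(beta0 + beta1 * x)) * beta1
          for x in xs) / len(xs)

# MEM: the same expression evaluated once, at the mean of x
xbar = sum(xs) / len(xs)
p_at_mean = logistic(beta0 + beta1 * xbar)
mem = p_at_mean * (1 - p_at_mean) * beta1
```

Because p(1-p) is largest at p = 0.5 and falls off in the tails, MEM and AME generally differ; here the mean of x sits where p = 0.5, so MEM exceeds AME.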
Marginal effects
Logit and probit marginal effects are very similar indeed; OLS is actually not too bad
              Logit     Probit      OLS
female*      -0.118     -0.122    -0.114
age           0.053      0.056     0.057
age2         -0.001     -0.001    -0.001
badhea~h     -0.078     -0.081    -0.086
partner*      0.076      0.082     0.075
ed_sec*       0.087      0.090     0.094
ed_deg*       0.106      0.109     0.118
nkids        -0.071     -0.075    -0.077
Constant                          -0.045
Odds ratios
Only an option with logit
Add "or" after the comma as an option
Reports odds ratios: that is, how many times more (or less) likely the outcome becomes
  if the variable is 1 rather than 0, in the case of a dichotomous variable
  for each unit increase of the variable, for a continuous variable
Results > 1 show increased odds, results < 1 show decreased odds
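An odds ratio is simply exp(coefficient), so the odds-ratio output below can be reproduced from the earlier logit coefficients; a Python check on three of them:

```python
from math import exp

# Logit coefficients from the "work" regression; odds ratio = e^coefficient
coefs = {"female": -0.8001156, "age": 0.3578242, "nkids": -0.477714}
odds = {k: exp(v) for k, v in coefs.items()}
```

For example, the female odds ratio comes out at about 0.449: other things equal, the odds of working for women are about 45% of the odds for men.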
. logit work female age age2 badhealth partner ed_sec ed_deg nkids if age >= 22 & age <= 60, cluster(pid) or

Logistic regression                                    Number of obs =   17268
                                                       Wald chi2(8)  =  613.59
                                                       Prob > chi2   =  0.0000
Log pseudolikelihood = -7838.2372                      Pseudo R2     =  0.1456

                                 (Std. Err. adjusted for 2430 clusters in pid)
              |              Robust
        work | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |    .449277   .0407952    -8.81   0.000     .3760308    .5367907
         age |   1.430214   .0404509    12.65   0.000     1.353089    1.511735
        age2 |   .9953562   .0003488   -13.28   0.000     .9946728      .99604
   badhealth |   .5936991   .0239895   -12.90   0.000     .5484942    .6426296
     partner |   1.596998    .150658     4.96   0.000     1.327405    1.921345
      ed_sec |   1.826959   .1644779     6.69   0.000     1.531428    2.179521
      ed_deg |   2.395254   .3517339     5.95   0.000     1.796205    3.194091
       nkids |   .6201995    .024257   -12.21   0.000     .5744333    .6696121
Other post-estimation commands
Likelihood ratio test: "lrtest"
Adding an extra variable to the RHS always increases the likelihood
But does it add "enough" to the likelihood?
The LR test compares Lrestricted with Lunrestricted, computing a chi-squared statistic with d.f. equal to the number of variables you are dropping. Null hypothesis: the restricted specification.
Only works on nested models, ie where the RHS variables in one model are a subset of the RHS variables in the other.
How to do it:
  Run the full model
  Type "estimates store NAME"
  Run a smaller model
  Type "estimates store ANOTHERNAME"
  … and so on for as many models as you like
  Type "lrtest NAME ANOTHERNAME"
Be careful: sample sizes must be the same for both models
  Won't happen if the dropped variable is missing for some observations
  Solve the problem by running the biggest model first and using e(sample)
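The mechanics are simple enough to sketch in Python: the LR statistic is 2(lnL_unrestricted - lnL_restricted), compared against a chi-squared distribution. Here the p-value for 1 degree of freedom uses the standard identity that the chi-squared(1) survival function equals erfc(sqrt(x/2)); the statistic 6.59 is taken from the KEEPSCOT vs DROPREG comparison below:

```python
from math import erfc, sqrt

def chi2_pvalue_df1(x):
    # Survival function of a chi-squared variate with 1 d.f.:
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return erfc(sqrt(x / 2))

# LR chi2(1) from "lrtest KEEPSCOT DROPREG"; in general the statistic is
# 2 * (ll_unrestricted - ll_restricted)
lr = 6.59
p = chi2_pvalue_df1(lr)   # close to the 0.0102 Stata reports
```

The slight difference from Stata's printed 0.0102 comes from rounding the statistic to two decimals before recomputing.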
. logit work age age2 badhealth partner ed_sec ed_deg nkids r_* if age >= 21 & age <= 60 & wave == 15

Iteration 0:   log likelihood = -548.06325
Iteration 1:   log likelihood = -480.90757
Iteration 2:   log likelihood = -477.04783
Iteration 3:   log likelihood = -477.02974
Iteration 4:   log likelihood = -477.02974

Logistic regression                                    Number of obs =    1066
                                                       LR chi2(14)   =  142.07
                                                       Prob > chi2   =  0.0000
Log likelihood = -477.02974                            Pseudo R2     =  0.1296

        work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .3093037   .0599295     5.16   0.000     .1918441    .4267633
        age2 |  -.0039101    .000729    -5.36   0.000    -.0053389   -.0024814
   badhealth |  -.5337105   .0893498    -5.97   0.000     -.708833    -.358588
     partner |    .244436   .1984392     1.23   0.218    -.1444977    .6333698
      ed_sec |   .7737744   .1749884     4.42   0.000     .4308035    1.116745
      ed_deg |   1.356818   .2734631     4.96   0.000     .8208405    1.892796
       nkids |  -.3589658   .0813107    -4.41   0.000    -.5183318   -.1995998
       r_lon |  -.5363941   .3247594    -1.65   0.099    -1.172911    .1001226
       r_mid |  -.3796683   .2807851    -1.35   0.176     -.929997    .1706604
        r_sw |  -.7379424     .34484    -2.14   0.032    -1.413816   -.0620684
        r_nw |  -.6369382   .3140179    -2.03   0.043    -1.252402   -.0214744
       r_nth |  -.6270993   .2940544    -2.13   0.033    -1.203435   -.0507633
       r_wls |  -.4251579   .3862621    -1.10   0.271    -1.182218    .3319019
       r_sco |  -1.183256   .3413128    -3.47   0.001    -1.852216   -.5142949
       _cons |  -3.042685   1.133771    -2.68   0.007    -5.264836   -.8205342

. estimates store ALL

. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids if e(sample)
. estimates store DROPREG

. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids r_sco r_sw r_nw r_nth if e(sample)
. estimates store KEEP4

. quietly logit work age age2 badhealth partner ed_sec ed_deg nkids r_sco if e(sample)
. estimates store KEEPSCOT
LR test - example
Similar but not identical regression to previous examples
Add regional variables, decide which ones to keep
Looks as though Scotland might stay, also possibly SW, NW, N
LR test - example
Reject dropping all regional variables against keeping full set
Don't reject dropping all but 4, over keeping full set
Don't reject dropping all but Scotland, over keeping full set
Don't reject dropping all but Scotland, over dropping all but 4
[And just to check: DO reject dropping all regional variables against dropping all but Scotland]
. lrtest KEEPSCOT DROPREG

Likelihood-ratio test                                 LR chi2(1)  =      6.59
(Assumption: DROPREG nested in KEEPSCOT)              Prob > chi2 =    0.0102

. lrtest KEEP4 KEEPSCOT

Likelihood-ratio test                                 LR chi2(3)  =      4.26
(Assumption: KEEPSCOT nested in KEEP4)                Prob > chi2 =    0.2347

. lrtest ALL KEEPSCOT

Likelihood-ratio test                                 LR chi2(6)  =      7.60
(Assumption: KEEPSCOT nested in ALL)                  Prob > chi2 =    0.2689

. lrtest ALL KEEP4

Likelihood-ratio test                                 LR chi2(3)  =      3.34
(Assumption: KEEP4 nested in ALL)                     Prob > chi2 =    0.3422
DON'T REJECT nested spec

. lrtest ALL DROPREG

Likelihood-ratio test                                 LR chi2(7)  =     14.19
(Assumption: DROPREG nested in ALL)                   Prob > chi2 =    0.0479
REJECT nested specification
Again, specification is illustrative only
This is not an example of a “finished” labour supply model! How could one improve the model?
Model specification: theoretical considerations, empirical considerations, parsimony, stepwise regression techniques
Regression diagnostics: interpreting results, spotting "unreasonable" results
Other models
Other models to be aware of, but not covered on this course:
Extensions to logit and probit
  Ordered models (ologit, oprobit) for ordered outcomes
    • Levels of education
    • Number of children
    • Excellent, good, fair or poor health
  Multinomial models (mlogit, mprobit) for multiple outcomes with no obvious ordering
    • Working in public, private or voluntary sector
    • Choice of nursery, childminder or playgroup for pre-school care
Heckman selection model, for modelling two-stage procedures
  • Earnings, conditional on having a job at all
  • Having a job is modelled as a probit, earnings are modelled as OLS
  • Used particularly for women's earnings
Tobit model, for censored or truncated data; typically, for data where there are lots of zeros
  • Expenditure on rarely-purchased items, eg cars
  • Children's weights, in an experiment where the scales broke and gave a minimum reading of 10kg
Competence in STATA
Best results in this course if you already know how to use STATA competently. Check you know how to:
Get data into STATA (use and using commands)
Manipulate data (merge, append, rename, drop, save)
Describe your data (describe, tabulate, table)
Create new variables (gen, egen)
Work with subsets of data (if, in, by)
Do basic regressions (regress, logit, probit)
Run sessions interactively and in batch mode
Organise your datasets and do-files so you can find them again
If you can't do these, upgrade your knowledge ASAP! Could enroll in STATA NetCourse 101
  Costs $110; the ESRC might pay
  Courses run regularly: www.stata.com
SC968 Panel data methods for sociologists, Lecture 1, part 2
Introducing Longitudinal Data
Overview
Cross-sectional and longitudinal data
Types of longitudinal data
Types of analysis possible with panel data
Data management: merging, appending, long and wide forms
Simple models using longitudinal data
Cross-sectional and longitudinal data
First, draw the distinction between macro- and micro-level data Micro level: firms, individuals Macro level: local authorities, travel-to-work areas, countries, commodity prices Both may exist in cross-sectional or longitudinal forms We are interested in micro-level data But macro-level variables are often used in conjunction with micro-data
Cross-sectional data
  Contains information collected at a given point in time (more strictly, during a given time window)
  • European Social Survey (ESS)
  • Programme for International Student Assessment (PISA)
  Many cross-sectional surveys are repeated, but on different individuals
Longitudinal data
  Contains repeated observations on the same subjects
Types of longitudinal data
Time-series data
  Eg, commodity prices, exchange rates
Repeated interviews at irregular intervals
  UK cohort studies: NCDS (1958), BCS (1970), MCS (2000)
Repeated interviews at regular intervals
  "Panel" surveys; usually annual intervals, sometimes two-yearly
  BHPS, SLID, PSID, SOEP
Some surveys have both cross-sectional and panel elements
  Panels are more expensive to collect
  LFS, EU-SILC both have a "rolling panel" element
Other sources of longitudinal data
  Retrospective data (eg work or relationship history)
  Linkage with external data (eg tax or benefit records), particularly in Scandinavia
  May be present in both cross-sectional and longitudinal data sets
Analysis with longitudinal data
The "snapshot" versus the "movie"
Essentially, longitudinal data allow us to observe how events evolve
Study "flows" as well as "stocks"

Example: unemployment
Cross-sectional analysis shows a steady 5% unemployment rate
  Does this mean that everyone is unemployed one year out of twenty?
  That 5% of people are unemployed all the time?
  Or something in between?
Very different implications for equality, social policy, etc
The BHPS
Interviews about 10,000 adults in about 6,000 households Interviews repeated annually People followed when they move People join the sample if they move in with a sample member
Household-level information collected from “head of household” Individual-level information collected from people aged 17+ Young people aged 11-16 fill in a youth questionnaire
BHPS is now part of Understanding Society
  A much larger and wider-ranging survey: 40,000 households
Data set used for this course is a 20% sample of BHPS, with selected variables
The BHPS
All files prefixed with a letter indicating the year; all variables within each file also prefixed with this letter
  1991: a; 1992: b; … and so on
Several files each year, containing different information
  hhsamp    information on sample households
  hhresp    household-level information on households that actually responded
  indall    info on all individuals in responding households
  indresp   info on respondents to main questionnaire (adults)
  egoalt    file showing relationship of household members to one another
  income    incomes
Extra files each year containing derived variables: work histories, net income files
And others with occasional modules, eg life histories in wave 2: bjobhist, blifemst, bmarriag, bcohabit, bchildnt
Some BHPS files
768.1k aindall.dta 10.7M aindresp.dta 1626.3k ahhresp.dta 330.6k ahhsamp.dta 1066.4k aincome.dta 541.3k aegoalt.dta 303.8k ajobhist.dta
635.3k bindsamp.dta 978.2k bindall.dta 11.0M bindresp.dta 1499.7k bhhresp.dta 257.1k bhhsamp.dta 1073.0k bincome.dta 546.5k begoalt.dta 237.8k bjobhist.dta
23.5k bchildad.dta 284.4k bchildnt.dta 34.3k bcohabit.dta 766.4k blifemst.dta 272.4k bmarriag.dta
624.3k cindsamp.dta 975.6k cindall.dta 11.0M cindresp.dta 1539.0k chhresp.dta 287.4k chhsamp.dta 1008.9k cincome.dta 542.2k cegoalt.dta 237.8k cjobhist.dta 1675.0k clifejob.dta 616.7k dindsamp.dta 943.7k dindall.dta 11.2M dindresp.dta 1508.9k dhhresp.dta 301.9k dhhsamp.dta 1019.7k dincome.dta 531.8k degoalt.dta 245.0k djobhist.dta 129.0k dyouth.dta
4977.3k xwaveid.dta 1027.7k xwlsten.dta
Extra modules in Wave 2
Following sample members
Youth module introduced 1994
Cross-wave identifiers
Person and household identifiers
BHPS (along with other panels such as ECHP and SOEP) is a household survey – so everyone living in sample households becomes a member
Need identifiers to:
1. Associate the same individual with him- or herself in different waves (the PID identifier)
2. Link members of the same household with each other in the same wave (the HID identifier)
Note: no such thing as a longitudinal household!
  Household composition changes, household location changes… HID is a cross-sectional concept only!
What it looks like: 4 waves of data, sorted by pid and wave.

. list pid wave hgsex age jbstat mastat in 1/30, clean

          pid   wave    hgsex   age     jbstat     mastat
  1.  10019057     1   female    59    retired   never ma
  2.  10019057     2   female    60    retired   never ma
  3.  10019057     3   female    61    retired   never ma
  4.  10019057     4   female    62    retired   never ma
  5.  10028005     1     male    30   employed   never ma
  6.  10028005     2     male    31   employed   never ma
  7.  10028005     3     male    32   employed   never ma
  8.  10028005     4     male    33   employed   never ma
  9.  10042571     1     male    59   unemploy   never ma
 10.  10042571     3     male    60   lt sick,   never ma
 11.  10042571     4     male    62    retired   never ma
 12.  10051538     1   female    22   unemploy   never ma
 13.  10051538     2   female    23   family c   never ma
 14.  10051538     3   female    24   unemploy   never ma
 15.  10051538     4   female    25   family c   never ma
 16.  10051562     1   female     4          .          .
 17.  10051562     2   female     5          .          .
 18.  10051562     3   female     6          .          .
 19.  10051562     4   female     7          .          .
 20.  10059377     1   female    46   employed   never ma
 21.  10059377     2   female    47   employed   never ma
 22.  10059377     3   female    48   employed   never ma
 23.  10059377     4   female    49   self-emp   never ma
 24.  10064966     1     male    70    retired    widowed
 25.  10064966     2     male    70    retired    widowed
 26.  10064966     3     male    71    retired    widowed
 27.  10064966     4     male    72    retired    widowed
 28.  10076166     1   female    77    retired    widowed
 29.  10076166     2   female    78    retired    widowed
 30.  10076166     3   female    79    retired    widowed
Not present at 2nd wave
A child, so no data on job or marital status
Observations in rows, variables in columns. Blue stripes show where one individual ends & another begins
Surveyed twice in his 70th year
. list pid wave hgsex age jbstat mastat in 1/30, clean nol

          pid   wave   hgsex   age   jbstat   mastat
  1.  10019057     1       2    59        4        6
  2.  10019057     2       2    60        4        6
  3.  10019057     3       2    61        4        6
  4.  10019057     4       2    62        4        6
  5.  10028005     1       1    30        2        6
  6.  10028005     2       1    31        2        6
  7.  10028005     3       1    32        2        6
  8.  10028005     4       1    33        2        6
  9.  10042571     1       1    59        3        6
 10.  10042571     3       1    60        8        6
 11.  10042571     4       1    62        4        6
 12.  10051538     1       2    22        3        6
 13.  10051538     2       2    23        6        6
 14.  10051538     3       2    24        3        6
 15.  10051538     4       2    25        6        6
 16.  10051562     1       2     4        .        .
 17.  10051562     2       2     5        .        .
 18.  10051562     3       2     6        .        .
 19.  10051562     4       2     7        .        .
 20.  10059377     1       2    46        2        6
 21.  10059377     2       2    47        2        6
 22.  10059377     3       2    48        2        6
 23.  10059377     4       2    49        1        6
 24.  10064966     1       1    70        4        3
 25.  10064966     2       1    70        4        3
 26.  10064966     3       1    71        4        3
 27.  10064966     4       1    72        4        3
 28.  10076166     1       2    77        4        3
 29.  10076166     2       2    78        4        3
 30.  10076166     3       2    79        4        3
(Can also use ,nol option)
Joining data sets together
     pid       wave   hgsex   age   jbstat     mastat     hlghq1     hlstat
 1.  10019057     1   female   59   retired    never ma         7   excellen
 2.  10019057     2   female   60   retired    never ma        12   excellen
 3.  10019057     3   female   61   retired    never ma        10   excellen
 4.  10019057     4   female   62   retired    never ma        11   excellen
 5.  10028005     1   male     30   employed   never ma         7   excellen
 6.  10028005     2   male     31   employed   never ma         8   fair
 7.  10028005     3   male     32   employed   never ma        12   fair
 8.  10028005     4   male     33   employed   never ma         7   good
 9.  10042571     1   male     59   unemploy   never ma        11   fair
10.  10042571     3   male     60   lt sick,   never ma         7   good
11.  10042571     4   male     62   retired    never ma         6   fair
12.  10051538     1   female   22   unemploy   never ma        11   good
13.  10051538     2   female   23   family c   never ma         6   excellen
14.  10051538     3   female   24   unemploy   never ma         8   excellen
15.  10051538     4   female   25   family c   never ma        10   good
16.  10051562     1   female    4   .          .                .   .
17.  10051562     2   female    5   .          .                .   .
18.  10051562     3   female    6   .          .                .   .
19.  10051562     4   female    7   .          .                .   .
20.  10059377     1   female   46   employed   never ma        12   fair
21.  10059377     2   female   47   employed   never ma        10   good
22.  10059377     3   female   48   employed   never ma        14   fair
23.  10059377     4   female   49   self-emp   never ma        17   poor
24.  10064966     1   male     70   retired    widowed    missing   fair
25.  10064966     2   male     70   retired    widowed    missing   good
26.  10064966     3   male     71   retired    widowed         18   good
27.  10064966     4   male     72   retired    widowed         17   poor
28.  10076166     1   female   77   retired    widowed          6   excellen
29.  10076166     2   female   78   retired    widowed          7   excellen
30.  10076166     3   female   79   retired    widowed          7   excellen
31.  10076166     4   female   79   retired    widowed   proxy re   excellen
32.  10081763     1   male     71   .          .                .   .
33.  10081763     2   male     72   .          .                .   .
34.  10081763     3   male     73   .          .                .   .
35.  10081763     4   male     74   .          .                .   .
36.  10081798     1   female   72   retired    married          8   good
37.  10081798     2   female   73   retired    married          5   excellen
38.  10081798     3   female   74   retired    married          7   excellen
39.  10081798     4   female   75   retired    married    missing   excellen
40.  10091831     1   male     49   .          .                .   .
41.  10091831     2   male     50   .          .                .   .
42.  10091831     3   male     50   .          .                .   .
43.  10091831     4   male     51   .          .                .   .
44.  10091866     1   female   48   maternit   married         17   good
45.  10091866     2   female   48   employed   married         11   good
46.  10091866     3   female   49   employed   married   proxy re   good
47.  10091866     4   female   50   family c   married    missing   good
48.  10091904     1   male     11   .          .                .   .
49.  10091904     2   male     11   .          .                .   .
50.  10091904     3   male     12   .          .                .   .
Adding extra observations: “append” command
Adding extra variables:
“merge” command
Whether appending or merging
The data set you are using at the time is called the “master” data
The data set you want to merge it with is called the “using” data
Make sure you can identify observations properly beforehand
Make sure you can identify observations uniquely afterwards
Appending
Use this command to add more observations
Relatively easy
Check first that you are really adding observations you don’t already have (or that if you are adding duplicates, you really want to do this)
Syntax: append using using_data
STATA simply sticks the “using” data on the end of the “master” data
STATA re-orders the variables if necessary
If the using data contain variables not present in the master data, STATA sets the values of these variables to missing in the using data (and vice versa if the master data contain variables not present in the using data)
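A minimal sketch of this workflow, assuming two wave files called wave1.dta and wave2.dta (hypothetical file names) containing the same variables:

```stata
* Load the master data (wave 1)
use wave1, clear

* Stick the wave-2 observations on the end of the master data
append using wave2

* Check that no pid/wave combination now appears more than once
duplicates report pid wave
```

`duplicates report` summarises how many observations share the same values of the listed variables; any surplus here means the two files overlapped.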
Merging is more complicated
Use “merge” to add more variables to a data set

Master data: age.dta
pid wave age
28005 1 30
19057 1 59
28005 2 31
19057 3 61
19057 4 62
28005 4 33
Using data: sex.dta
pid wave sex
19057 1 female
19057 3 female
28005 1 male
28005 2 male
28005 4 male
42571 1 male
42571 3 male
Notice that the two data sets don’t contain the same observations
• merge 1:1 pid wave using sex
pid wave age sex _merge
19057 1 59 female 3
19057 3 61 female 3
19057 4 62 . 1
28005 1 30 male 3
28005 2 31 male 3
28005 4 33 male 3
42571 1 . male 2
42571 3 . male 2
Merging
STATA creates a variable called _merge after merging
• 1: observation in master but not using data
• 2: observation in using but not master data
• 3: observation in both data sets
Options available for discarding some observations – see manual
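A sketch of the 1:1 merge from the example above, assuming age.dta and sex.dta exist as shown:

```stata
use age, clear

* One observation per pid/wave combination in each file
merge 1:1 pid wave using sex

* Inspect the match pattern before discarding anything
tab _merge

* For example, keep only observations present in both data sets
keep if _merge == 3
drop _merge
```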
More on merging
Previous example showed one-to-one merging
Not every observation was in both data sets, but every observation in the master data was matched with at most one observation in the using data – and vice versa
Many-to-one merging:
Household-level data sets contain only one observation per household (usually <1 per person)
Regional data (eg, regional unemployment data), usually one observation per region
Sample syntax: merge m:1 hid wave using hhinc_data

hid pid age
1604 19057 59
2341 28005 30
3569 42571 59
4301 51538 22
4301 51562 4
4956 59377 46
5421 64966 70
6363 76166 77
6827 81763 71
6827 81798 72

hid h/h income
1604 780
2341 1501
3569 268
4301 394
4956 1601
5421 225
6363 411
6827 743

hid pid age h/h income
1604 19057 59 780
2341 28005 30 1501
3569 42571 59 268
4301 51538 22 394
4301 51562 4 394
4956 59377 46 1601
5421 64966 70 225
6363 76166 77 411
6827 81763 71 743
6827 81798 72 743
One-to-many merging
Job and relationship files contain one observation per episode (potentially >1 per person)
Income files contain one observation per source of income (potentially >1 per person)
Sample syntax: merge 1:m pid wave using births_data
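A sketch of the many-to-one merge using the sample syntax above (hhinc_data is assumed to hold one income record per household per wave):

```stata
* Person-level file in memory; many people can share one hid
merge m:1 hid wave using hhinc_data

* Everyone in the same household receives the same h/h income
tab _merge
```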
Long and wide forms
The data we have here is in “long” form One row for each person/wave combination From a few slides back:
     pid       wave   hgsex    age
 1.  10019057     1   female    59
 2.  10019057     2   female    60
 3.  10019057     3   female    61
 4.  10019057     4   female    62
 5.  10028005     1   male      30
 6.  10028005     2   male      31
 7.  10028005     3   male      32
 8.  10028005     4   male      33
 9.  10042571     1   male      59
10.  10042571     3   male      60
11.  10042571     4   male      62
12.  10051538     1   female    22
13.  10051538     2   female    23
14.  10051538     3   female    24
15.  10051538     4   female    25
16.  10051562     1   female     4
17.  10051562     2   female     5
18.  10051562     3   female     6
19.  10051562     4   female     7
20.  10059377     1   female    46
21.  10059377     2   female    47
22.  10059377     3   female    48
23.  10059377     4   female    49
24.  10064966     1   male      70
25.  10064966     2   male      70
26.  10064966     3   male      71
27.  10064966     4   male      72
28.  10076166     1   female    77
29.  10076166     2   female    78
30.  10076166     3   female    79
Wide form
However, it’s also possible to put longitudinal data into “wide” form
One observation per person, with different variables relating to different years of data
    pid       sex1     age1   age2   age3   age4
1.  10019057  female     59     60     61     62
2.  10028005  male       30     31     32     33
3.  10042571  male       59      .     60     62
4.  10051538  female     22     23     24     25
5.  10051562  .           4      5      6      7
6.  10059377  female     46     47     48     49
7.  10064966  male       70     70     71     72
8.  10076166  female     77     78     79     79
Age at wave 1, and so on
Sex doesn’t change [usually]
The reshape command
Switching from long to wide:
• reshape wide [stubnames], i(id) j(year)
In BHPS, this becomes
• reshape wide [stubnames], i(pid) j(wave)
What are stub names?
They are a list of variables which vary between years
Variables like sex or ethnicity would not normally be included in this list
Switching from wide to long: exactly the opposite
• reshape long [stubnames], i(id) j(wave)
Lots more info and examples in STATA manual
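A sketch of a reshape round trip, assuming a long-form file with pid, wave, age and a time-constant sex:

```stata
* Long to wide: age becomes age1, age2, ... ; sex is left as a single variable
reshape wide age, i(pid) j(wave)

* And back to long again
reshape long age, i(pid) j(wave)
```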
Simple models using longitudinal data
Auto-regressive and time-lagged models Models of change
But first: the GHQ
Use this for lots of analysis in the lectures and practical sessions General Health Questionnaire Different versions: BHPS carries the GHQ-12, with 12 questions. Have you recently:
been able to concentrate on whatever you're doing?
lost much sleep over worry?
felt that you were playing a useful part in things?
felt capable of making decisions about things?
felt constantly under strain?
felt you couldn't overcome your difficulties?
been able to enjoy your normal day to day activities?
been able to face up to problems?
been feeling unhappy or depressed?
been losing confidence in yourself?
been thinking of yourself as a worthless person?
been feeling reasonably happy, all things considered?
Answer each question on 4-point scale not at all - no more than usual - rather more - much more
    GHQ (ghq) 1: likert |  Freq.   Percent    Cum.
------------------------+---------------------------
        missing or wild |    582      2.10    2.10
       proxy respondent |  1,202      4.33    6.43
                      0 |     77      0.28    6.70
                      1 |    109      0.39    7.10
                      2 |    149      0.54    7.63
                      3 |    288      1.04    8.67
                      4 |    504      1.82   10.49
                      5 |    867      3.12   13.61
                      6 |  2,229      8.03   21.64
                      7 |  2,265      8.16   29.80
                      8 |  2,355      8.48   38.28
                      9 |  2,426      8.74   47.02
                     10 |  2,259      8.14   55.16
                     11 |  2,228      8.03   63.19
                     12 |  2,478      8.93   72.11
                     13 |  1,316      4.74   76.85
                     14 |  1,115      4.02   80.87
                     15 |    876      3.16   84.03
                     16 |    714      2.57   86.60
                     17 |    635      2.29   88.89
                     18 |    499      1.80   90.68
                     19 |    439      1.58   92.27
                     20 |    381      1.37   93.64
                     21 |    318      1.15   94.78
                     22 |    276      0.99   95.78
                     23 |    264      0.95   96.73
                     24 |    220      0.79   97.52
                     25 |    134      0.48   98.00
                     26 |    103      0.37   98.38
                     27 |     96      0.35   98.72
                     28 |     59      0.21   98.93
                     29 |     66      0.24   99.17
                     30 |     47      0.17   99.34
                     31 |     47      0.17   99.51
                     32 |     35      0.13   99.64
                     33 |     26      0.09   99.73
                     34 |     20      0.07   99.80
                     35 |     29      0.10   99.91
                     36 |     26      0.09  100.00
------------------------+---------------------------
                  Total | 27,759    100.00
HLGHQ1 in BHPS
Sum of scores (LIKERT scale)
We recode <0 values to missing, rename LIKERT
Consider as a continuous variable
GHQ
HLGHQ2
Caseness scale
Recodes answers 3-4 as 1, and adds up
Scores above 2 used to indicate psychological morbidity
 (ghq) 2: caseness |  Freq.   Percent    Cum.
-------------------+---------------------------
   missing or wild |    582      2.10    2.10
  proxy respondent |  1,202      4.33    6.43
                 0 | 13,222     47.63   54.06
                 1 |  3,569     12.86   66.92
                 2 |  2,189      7.89   74.80
                 3 |  1,581      5.70   80.50
                 4 |  1,143      4.12   84.61
                 5 |    933      3.36   87.98
                 6 |    720      2.59   90.57
                 7 |    561      2.02   92.59
                 8 |    529      1.91   94.50
                 9 |    417      1.50   96.00
                10 |    385      1.39   97.38
                11 |    343      1.24   98.62
                12 |    383      1.38  100.00
Time-lagged models
Start with simple OLS model
The Likert score is a measure of psychological wellbeing derived from a battery of questions
. reg LIKERT age age2 female ue_sick partner if age >= 18

      Source |       SS       df       MS         Number of obs =   25108
-------------+------------------------------      F(  5, 25102) =  278.23
       Model |   37842.892     5   7568.5784      Prob > F      =  0.0000
    Residual |  682847.462 25102  27.2029106      R-squared     =  0.0525
-------------+------------------------------      Adj R-squared =  0.0523
       Total |  720690.354 25107  28.7047578      Root MSE      =  5.2156

      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0797637   .0111716     7.14   0.000     .0578667    .1016607
        age2 |  -.0007342   .0001119    -6.56   0.000    -.0009535    -.000515
      female |   1.593608   .0661958    24.07   0.000      1.46386    1.723356
     ue_sick |   3.562843   .1249977    28.50   0.000      3.31784    3.807846
     partner |   -.044241   .0788756    -0.56   0.575    -.1988419    .1103598
       _cons |   8.298458   .2374816    34.94   0.000      7.83298    8.763936
Generate lagged variable

. sort pid wave

. capture drop LIKERT_lag

. gen LIKERT_lag = LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
(14738 missing values generated)

. * check:

. list pid wave LIKERT LIKERT_lag in 1/30, clean

     pid       wave   LIKERT   LIKERT~g
 1.  10019057     1        7          .
 2.  10019057     2       12          7
 3.  10019057     3       10         12
 4.  10019057     4       11         10
 5.  10019057     5       12         11
 6.  10019057     6        .         12
 7.  10019057     7       12          .
 8.  10019057     8       12         12
 9.  10019057     9       12         12
10.  10019057    10       12         12
11.  10019057    11       12         12
12.  10019057    12       12         12
13.  10019057    13        .         12
14.  10019057    14       11          .
15.  10019057    15       11         11
16.  10028005     1        7          .
17.  10028005     2        8          7
18.  10028005     3       12          8
19.  10028005     4        7         12
20.  10028005     5        8          7
21.  10028005     6        8          8
22.  10028005     7        7          8
23.  10028005     8        9          7
24.  10028005     9        7          9
25.  10028005    10        8          7
26.  10028005    11        7          8
27.  10028005    12       12          7
28.  10028005    13        9         12
29.  10028005    14       13          9
30.  10042571     1       11          .
NB: the 1/30 here is just so it will fit on the page. You should check many more observations than this!
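An alternative sketch using STATA's panel time-series operators, which handle gaps in the wave sequence without explicit subscripting:

```stata
* Declare the panel structure once
xtset pid wave

* L.LIKERT is the previous wave's value (missing where that wave is absent)
gen LIKERT_lag = L.LIKERT
```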
OLS, with lagged dependent variable

. reg LIKERT LIKERT_lag age age2 female ue_sick partner if age >= 18

      Source |       SS       df       MS         Number of obs =   21463
-------------+------------------------------      F(  6, 21456) = 1285.40
       Model |  163104.604     6  27184.1006      Prob > F      =  0.0000
    Residual |  453760.485 21456  21.1484193      R-squared     =  0.2644
-------------+------------------------------      Adj R-squared =  0.2642
       Total |  616865.089 21462  28.7421997      Root MSE      =  4.5987

      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  LIKERT_lag |   .4752892   .0060424    78.66   0.000     .4634456    .4871327
         age |   .0272471   .0108394     2.51   0.012     .0060011    .0484931
        age2 |  -.0002391   .0001079    -2.22   0.027    -.0004506   -.0000276
      female |   .8414271   .0638746    13.17   0.000     .7162282     .966626
     ue_sick |   2.128451   .1222784    17.41   0.000     1.888777    2.368126
     partner |   .0967488   .0759926     1.27   0.203    -.0522022    .2456999
       _cons |   4.593374   .2365749    19.42   0.000      4.12967    5.057079
R-squared rockets from 5% to 26%
Big & very significant coefficient on lagged variable
Coeff on “ue_sick” falls from 3.6 to 2.1
Also possible to include lagged explanatory variables
Models of change
Start with OLS model [simplified, but imagine more variables]:
y_i = α + βx_i + ε_i

Separate model for each year – second subscript denotes year:
y_i1 = α + βx_i1 + ε_i1
y_i2 = α + βx_i2 + ε_i2

Subtract 1st from 2nd model:
(y_i2 − y_i1) = β(x_i2 − x_i1) + (ε_i2 − ε_i1)

Or, express in terms of change:
Δy_i = βΔx_i + Δε_i
Generate difference variables

capture drop dif*
sort pid wave
gen dif_LIKERT = LIKERT - LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_age = age - age[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_age2 = age2 - age2[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_female = female - female[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_ue_sick = ue_sick - ue_sick[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen dif_partner = partner - partner[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
Check you understand why dif_female will [very nearly] always be zero
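Once the panel structure is declared, STATA's D. operator offers a more concise sketch of the same differencing (assuming the variables above exist):

```stata
xtset pid wave

* First differences; missing wherever the previous wave is absent
gen dif_LIKERT  = D.LIKERT
gen dif_ue_sick = D.ue_sick
gen dif_partner = D.partner
```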
. list pid wave age dif_age in 1/30, clean

     pid       wave   age   dif_age
 1.  10019057     1    59         .
 2.  10019057     2    60         1
 3.  10019057     3    61         1
 4.  10019057     4    62         1
 5.  10019057     5    63         1
 6.  10019057     6    64         1
 7.  10019057     7    65         1
 8.  10019057     8    66         1
 9.  10019057     9    67         1
10.  10019057    10    67         0
11.  10019057    11    68         1
12.  10019057    12    69         1
13.  10019057    13    71         2
14.  10019057    14    71         0
15.  10019057    15    73         2
16.  10028005     1    30         .
17.  10028005     2    31         1
18.  10028005     3    32         1
19.  10028005     4    33         1
20.  10028005     5    34         1
21.  10028005     6    35         1
22.  10028005     7    36         1
23.  10028005     8    37         1
24.  10028005     9    38         1
25.  10028005    10    39         1
26.  10028005    11    40         1
27.  10028005    12    41         1
28.  10028005    13    42         1
29.  10028005    14    43         1
30.  10042571     1    59         .
Check for sensible results!
More checking….
. list pid wave LIK dif_LIK in 1/30, clean

     pid       wave   LIKERT   dif_LI~T
 1.  10019057     1        7          .
 2.  10019057     2       12          5
 3.  10019057     3       10         -2
 4.  10019057     4       11          1
 5.  10019057     5       12          1
 6.  10019057     6        .          .
 7.  10019057     7       12          .
 8.  10019057     8       12          0
 9.  10019057     9       12          0
10.  10019057    10       12          0
11.  10019057    11       12          0
12.  10019057    12       12          0
13.  10019057    13        .          .
14.  10019057    14       11          .
15.  10019057    15       11          0
16.  10028005     1        7          .
17.  10028005     2        8          1
18.  10028005     3       12          4
19.  10028005     4        7         -5
20.  10028005     5        8          1
21.  10028005     6        8          0
22.  10028005     7        7         -1
23.  10028005     8        9          2
24.  10028005     9        7         -2
25.  10028005    10        8          1
26.  10028005    11        7         -1
27.  10028005    12       12          5
28.  10028005    13        9         -3
29.  10028005    14       13          4
30.  10042571     1       11          .
Issues
Most differences are zero
Moving into unemployment or partnership is given equal and opposite weighting to moving out; no real reason why this should be the case
There are MUCH better ways to use these data!
Nevertheless, let’s proceed!
Results
Female drops out
Coeffs on sick and partner significant
. reg dif_LIKERT dif_age - dif_partner if age >= 18

      Source |       SS       df       MS         Number of obs =   21461
-------------+------------------------------      F(  4, 21456) =   44.58
       Model |  5058.58867     4  1264.64717      Prob > F      =  0.0000
    Residual |  608713.117 21456  28.3702982      R-squared     =  0.0082
-------------+------------------------------      Adj R-squared =  0.0081
       Total |  613771.706 21460  28.6007319      Root MSE      =  5.3264

  dif_LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     dif_age |   .3757715   .1227943     3.06   0.002     .1350856    .6164574
    dif_age2 |  -6.24e-07   .0000212    -0.03   0.977    -.0000422     .000041
 dif_ue_sick |   1.857857    .149281    12.45   0.000     1.565255    2.150458
 dif_partner |  -.5948999   .1780776    -3.34   0.001    -.9439452   -.2458545
       _cons |  -.3104285   .1364378    -2.28   0.023    -.5778567   -.0430003
. reg LIKERT age age2 female ue_sick partner if age >= 18

      Source |       SS       df       MS         Number of obs =   25108
-------------+------------------------------      F(  5, 25102) =  278.23
       Model |   37842.892     5   7568.5784      Prob > F      =  0.0000
    Residual |  682847.462 25102  27.2029106      R-squared     =  0.0525
-------------+------------------------------      Adj R-squared =  0.0523
       Total |  720690.354 25107  28.7047578      Root MSE      =  5.2156

      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0797637   .0111716     7.14   0.000     .0578667    .1016607
        age2 |  -.0007342   .0001119    -6.56   0.000    -.0009535    -.000515
      female |   1.593608   .0661958    24.07   0.000      1.46386    1.723356
     ue_sick |   3.562843   .1249977    28.50   0.000      3.31784    3.807846
     partner |   -.044241   .0788756    -0.56   0.575    -.1988419    .1103598
       _cons |   8.298458   .2374816    34.94   0.000      7.83298    8.763936