A data workshop presented at
The Inter-University Consortium for Political and Social Research (ICPSR)
University of Michigan
Ann Arbor, MI
July 21-23, 2008
DAY 2
Topics to discuss
- Longitudinal analysis
- Weighting
  - Cross-sectional vs. longitudinal
  - Survey
  - EDS
- Useful variables to know about
- Examples of longitudinal research
- Access to restricted data
Longitudinal analysis (survey)
- 3 waves of data
- Study includes separated and new caregivers
- Focal child remains the same, but a child’s residence and/or caregiver may change once or twice
Longitudinal analysis
A researcher may wish:
- to follow the same child over time
- to include only children included at all 3 waves
- to follow the same adult over time, even if her caregiver status changes
- to study young adults who become independent householders by wave 3
Data file structure
- In order to follow separated caregivers AND children with new caregivers, two survey instruments were created for waves 2 and 3
  - The continuing and new caregiver interview collects information about the caregiver and the focal child
  - The separated caregiver interview collects information about the caregiver only
- Continuing and new caregivers appear together in one data file/codebook; separated caregivers have their own data file/codebook
Data file structure
- At wave 2 (or wave 3), there may be two caregiver interviews associated with a given focal child
  - Two caregiver records for the same sampled household, but each record is on a separate file
  - If the two caregiver files are appended, two caregiver records will have the same “household” identifier
- Within a wave, users need a way to identify separate person records from the same “household”
Data file structure
- Over time, a wave 1 caregiver’s status may change
  - With child at wave 1, separated at wave 2, back together at wave 3
  - With child at wave 1, separated at waves 2 and 3
  - With child at waves 1 and 2, separated at wave 3
- For an analysis of caregivers, users need a way to make sure that they are following the same person over time (across waves)
Possible interview scenarios at wave 3, described in the variable SCENARIO
(N from wave 3 focal child interview file, total N=1944)

SCENARIO  N     Description
1         1716  Focal child lived with same caregiver at each wave
2         20    Focal child was with new caregiver at wave 2, with same caregiver at wave 3 (wave 1 caregiver is separated at wave 3)
3         79    Focal child was with wave 1 caregiver at wave 2 and with a new caregiver at wave 3
4         14    Focal child was with a new caregiver at wave 2, reunited with wave 1 caregiver at wave 3 (wave 2 caregiver not followed)
5         114   Focal child is living independently at wave 3 (wave 1 caregiver interviewed as a separated caregiver)
Tools to merge files longitudinally
HHID – a unique, universal household ID
- 7 characters
- Value is the same at every wave, for every interview type
- Think of this as a value attached to the focal child, who is the unit of analysis in the study
Tools to merge files longitudinally
ZRID – an interview-specific unique identifier
- 8 characters
- Like HHID with a new leading character
  - 1=Wave 1 survey interview for caregiver and child, wave 2 focal child
  - 3=Mother EDS (wave 1 and wave 2)
  - 4=Wave 1 Father EDS
  - 5=Wave 1 Child Care EDS (wave 1 and wave 2)
  - 6=Wave 2 continuing caregiver
  - 7=Wave 2 new caregiver
  - 8=Wave 2 separated caregiver
  - R=Wave 3 continuing or new caregiver
  - S=Wave 3 separated caregiver
  - A=Wave 3 focal child
- Think of this as a value attached to the respondent, usually for a particular wave
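Because a ZRID is an HHID with one leading character, the two parts can be separated with simple string slicing. The following is a minimal Python sketch, not part of the study's own tooling. It assumes IDs are stored as strings so that leading zeros are preserved; the names `ZRID_PREFIXES` and `split_zrid` are hypothetical, and the labels are copied from the ZRID slide.

```python
# Leading-character labels copied from the ZRID slide.
# The dictionary and function names are hypothetical.
ZRID_PREFIXES = {
    "1": "Wave 1 survey interview for caregiver and child, wave 2 focal child",
    "3": "Mother EDS (wave 1 and wave 2)",
    "4": "Wave 1 Father EDS",
    "5": "Wave 1 Child Care EDS (wave 1 and wave 2)",
    "6": "Wave 2 continuing caregiver",
    "7": "Wave 2 new caregiver",
    "8": "Wave 2 separated caregiver",
    "R": "Wave 3 continuing or new caregiver",
    "S": "Wave 3 separated caregiver",
    "A": "Wave 3 focal child",
}


def split_zrid(zrid):
    """Split an 8-character ZRID into (interview label, 7-character HHID)."""
    if len(zrid) != 8:
        raise ValueError(f"ZRID must have 8 characters: {zrid!r}")
    prefix, hhid = zrid[0], zrid[1:]
    if prefix not in ZRID_PREFIXES:
        raise ValueError(f"Unknown ZRID leading character: {prefix!r}")
    return ZRID_PREFIXES[prefix], hhid


label, hhid = split_zrid("R0801230")
print(label, hhid)
```

Reading the IDs as numbers would drop the leading zero of an HHID such as 0801230, which is why the sketch keeps everything as text.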
Tools to merge files longitudinally
NEWID – a caregiver-specific unique ID
- Allows users to follow same caregiver over time, even if her status relative to the focal child changes
- Necessary because at wave 3, a “continuing caregiver” could be:
  - The same person who was with the focal child at waves 1 and 2
  - A caregiver who was new to the focal child at wave 2 and continues to be the child’s caregiver at wave 3
  - A wave 1 caregiver who was separated from the child at wave 2 and is reunited at wave 3
Tools to merge files - NEWID
- 9 characters
- Like HHID with 2 characters appended: “01” or “02”
- The original wave 1 caregiver always has “01” in her NEWID value
- A person who was a new caregiver at wave 2 or at wave 3 will have a value of “02” in her NEWID value
- There are no cases where the focal child had different new caregivers at wave 2 and wave 3 – hence, NEWID never ends with “03”
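The NEWID layout described here (7-character HHID plus a 2-character caregiver suffix) can be checked and taken apart the same way. This is a small Python sketch under the assumption that NEWID is stored as a string; `split_newid` and `is_original_caregiver` are hypothetical helper names.

```python
def split_newid(newid):
    """Split a 9-character NEWID into (HHID, caregiver suffix)."""
    if len(newid) != 9:
        raise ValueError(f"NEWID must have 9 characters: {newid!r}")
    hhid, suffix = newid[:7], newid[7:]
    # The slides state that only "01" and "02" occur.
    if suffix not in ("01", "02"):
        raise ValueError(f"Unexpected NEWID suffix: {suffix!r}")
    return hhid, suffix


def is_original_caregiver(newid):
    """True when the NEWID belongs to the original wave 1 caregiver."""
    return split_newid(newid)[1] == "01"


print(split_newid("080123002"))
print(is_original_caregiver("080123001"))
```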
Scenario 1: Child always with same caregiver
HHID: 0801230
ZRID, wave 1: 10801230
ZRID, wave 2: 60801230 (caregiver), 10801230 (child)
ZRID, wave 3: R0801230 (caregiver), A0801230 (child)
NEWID (for caregiver): 080123001
Scenario 2: Child with new CG at wave 2, still with that CG at wave 3
HHID: 0801230
ZRID, wave 1: 10801230
ZRID, wave 2: 70801230 (new caregiver), 80801230 (separated caregiver), 10801230 (child)
ZRID, wave 3: R0801230 (“continuing new” caregiver), S0801230 (separated caregiver), A0801230 (child)
NEWID (for caregiver): 080123001 (original caregiver), 080123002 (caregiver introduced at wave 2)
Scenario 4: Child with new CG at wave 2, reunited with original CG at wave 3
HHID: 0801230
ZRID, wave 1: 10801230
ZRID, wave 2: 70801230 (new caregiver), 80801230 (separated caregiver), 10801230 (child)
ZRID, wave 3: R0801230 (original caregiver), A0801230 (child)
NEWID (for caregiver): 080123001 (original caregiver), 080123002 (caregiver introduced at wave 2, NOT interviewed at wave 3)
Merging
To follow the same focal child over time:
- Merge wave 1 data, wave 2 continuing/new caregiver data, and wave 3 continuing/new caregiver data BY HHID
- Analyses should include a control to indicate whether the child was with a new caregiver at either wave
Merging
To include only children interviewed at each wave:
- Merge wave 1 data, wave 2 continuing/new caregiver data, and wave 3 continuing/new caregiver data BY HHID
- Drop cases that do not appear on all 3 files
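The merge-and-drop logic on this slide amounts to an inner join on HHID. The sketch below illustrates it in plain Python on toy records; it is only an illustration, not the workshop's own code. It assumes each file holds at most one record per HHID, and the function name `merge_by_key` and the variables `w1_var`, `w2_var`, `w3_var` are hypothetical.

```python
def merge_by_key(key, *files):
    """Inner-join lists of record dicts on `key`.

    Only keys that appear in every file are kept, which mirrors
    "drop cases that do not appear on all 3 files".
    """
    indexed = [{row[key]: row for row in records} for records in files]
    common = set(indexed[0]).intersection(*indexed[1:])
    merged = []
    for value in sorted(common):
        combined = {}
        for index in indexed:
            combined.update(index[value])
        merged.append(combined)
    return merged


# Toy data: the second household is missing at wave 2.
wave1 = [{"HHID": "0801230", "w1_var": 1}, {"HHID": "0801231", "w1_var": 2}]
wave2 = [{"HHID": "0801230", "w2_var": 3}]
wave3 = [{"HHID": "0801230", "w3_var": 4}, {"HHID": "0801231", "w3_var": 5}]

panel = merge_by_key("HHID", wave1, wave2, wave3)
print(panel)
```

Variables that share a name across waves would overwrite each other in this sketch, so wave-specific names are used in the toy data.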
Merging
To follow the same adult over time:
- Append wave 2 separated caregiver file to wave 2 new/continuing caregiver file
- Append wave 3 separated caregiver file to wave 3 new/continuing caregiver file
- Merge these appended files to wave 1 file BY NEWID
HINT: A three-wave analysis of the same person would have to include women who were caregivers at wave 1. Hence, all valid values of NEWID would have to end in “01” for such an analysis.
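The append-then-merge sequence for following the same adult can be sketched as follows. This is a plain-Python illustration on toy records shaped like scenario 2 (new caregiver at wave 2, original caregiver separated). It assumes every file carries NEWID; the function name `follow_same_adult` and the `role_w*` variables are hypothetical.

```python
def follow_same_adult(wave1, w2_main, w2_separated, w3_main, w3_separated):
    """Append the separated caregiver files, then inner-join all waves on NEWID."""
    wave2 = w2_main + w2_separated  # append the two wave 2 files
    wave3 = w3_main + w3_separated  # append the two wave 3 files
    indexed = [{row["NEWID"]: row for row in records}
               for records in (wave1, wave2, wave3)]
    common = set(indexed[0]) & set(indexed[1]) & set(indexed[2])
    merged = []
    for newid in sorted(common):
        combined = {}
        for index in indexed:
            combined.update(index[newid])
        merged.append(combined)
    return merged


# Toy records: original caregiver ("01") is separated at waves 2 and 3,
# while a new caregiver ("02") appears from wave 2 onward.
wave1 = [{"NEWID": "080123001", "role_w1": "caregiver"}]
w2_main = [{"NEWID": "080123002", "role_w2": "new"}]
w2_separated = [{"NEWID": "080123001", "role_w2": "separated"}]
w3_main = [{"NEWID": "080123002", "role_w3": "continuing"}]
w3_separated = [{"NEWID": "080123001", "role_w3": "separated"}]

adults = follow_same_adult(wave1, w2_main, w2_separated, w3_main, w3_separated)
print(adults)
```

Consistent with the hint, the only record that survives all three waves in this toy example has a NEWID ending in "01".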
Merging
To follow only children who always had the same caregiver:
- Merge wave 1 data, wave 2 continuing/new caregiver data, and wave 3 continuing/new caregiver data BY NEWID
- Drop cases that do not match on all 3 files
- Do NOT append wave 2 or wave 3 separated caregiver files
Data issues that affect longitudinal analysis
Falsified data
- At wave 1, 56 cases were determined to be falsified and were dropped from the public release (N=2402 instead of 2458)
- Another 11 cases had data removed from the child interview, but not from the adult interview
Falsified cases
- The 56 falsified cases were retained by the survey contractor to re-interview at wave 2
- 45 of the 56 were re-interviewed at wave 2
- However, no data on race/ethnicity or other time-invariant characteristics were collected
Falsified cases
11 falsified cases not found at wave 2 were not followed to wave 3
35 of the 45 falsified households who were tracked provided interviews at wave 3
For these respondents, race/ethnicity was imputed for the caregiver and child from screener data.
Falsified cases
Any 3-wave analysis will exclude the 56 falsified cases (and will also exclude the 11 falsified child interview cases if analysis involves child data)
Any wave 2 analysis using race/ethnicity as a control variable will exclude the falsified cases
Duplicate cases
- At wave 1, nine households were interviewed twice (under separate values of HHID)
- Each household interviewed only once at wave 3
- Values of HHIDs that were not re-interviewed: 1020060, 0950370, 1180350, 1220300, 1540090, 1690550, 1740110, 1920060, 1920090
Weighting
- Sample is clustered and stratified: probability weights were developed at wave 1 for the main interview and for EDS
- There is attrition across waves: weights at waves 2 and 3 account for non-response bias
- Not accounting for weights affects point estimates and standard errors
Wave 1 probability weights for dwelling units
A product of the probabilities that:
- Respondent’s primary unit (neighborhood cluster) was selected for sampling
- Respondent participated in screening
- Household met the various stratifying criteria for inclusion (race/ethnicity, marital status, income level, Medicaid status, etc.)
- Respondent agreed to participate
Wave 1 probability weights for focal children
- Among sampled households, a child’s probability of participating depends on the number of age-eligible children in the household
  - If there is only one child, his/her probability is 1
  - If there are two children, each child’s probability is .5
- Focal child weight = dwelling unit weight × (1 / probability of child’s selection)
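As a worked illustration of the focal child weight formula, the Python sketch below multiplies a dwelling unit weight by the inverse of the child's selection probability. The slides give the probability only for one and two eligible children; extending it to 1/n for n eligible children is an assumption made here, and the function name is hypothetical.

```python
def focal_child_weight(dwelling_unit_weight, n_eligible_children):
    """Dwelling unit weight times 1 / (probability of the child's selection).

    Assumes each age-eligible child is equally likely to be selected,
    so the selection probability is 1 / n_eligible_children.
    """
    if n_eligible_children < 1:
        raise ValueError("A sampled household needs at least one eligible child")
    selection_probability = 1 / n_eligible_children
    return dwelling_unit_weight * (1 / selection_probability)


print(focal_child_weight(10.0, 1))  # one eligible child: weight unchanged
print(focal_child_weight(10.0, 2))  # two eligible children: weight doubles
```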
Dwelling unit weight vs focal child weight
- In most cases, the focal child weight is used, even if an analysis focuses on adults’ employment, earnings, etc.
  - Children are the unit of analysis in the sampling design
  - Not accounting for the number of eligible children gives more weight to larger families
- But if a child did not participate in a given wave and his/her caregiver did, there is only a dwelling unit weight associated with that household for that wave (no focal child weight)
Weights at waves 2 and 3
- Wave 2 and wave 3 weights account for non-response bias
- Equal to the product of the wave 1 weight x probability of remaining in the study at each wave
- In addition, a wave 3 longitudinal weight calculates the unique probability of participating in every wave
  - Distinct from the wave 3 cross-sectional weight, which considers the probability of participating in wave 1 and wave 3, and ignores wave 2
EDS Weights Wave 1
- Age adjusted for eligible 2, 3, 4 year olds
- Separate weight for each piece of EDS (e.g., mother, childcare, father), and for each combination (e.g., mother and father)
- Weights adjusted for EDS nonresponse, stratified differently by interview type
  - race & income for mother
  - residence status for father
  - type for childcare
EDS Weights Wave 2
- Similar technique in wave 2
- Correlational and longitudinal weights created, for each possible combination
- Like main survey, weights trimmed
- Like main survey, should renormalize if necessary
Trimming weights
- The research team agreed to trim dwelling unit and focal child weights at the top 5th percentile (not the bottom 5th)
- Very large weights gave extreme influence to unusual cases (for example, higher income, unmarried Hispanic women)
- Trimming is a conventional solution to the problem of outliers
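The slides do not say exactly how the trimming was carried out. One common reading, assumed here, is that any weight above the 95th percentile is capped at the 95th-percentile value while the bottom of the distribution is left alone. The Python sketch below illustrates that reading with a nearest-rank percentile; the function name and the toy weights are hypothetical.

```python
def trim_weights(weights, percentile=95):
    """Cap each weight at the nearest-rank `percentile` value of the list.

    Assumption: "trimming at the top 5th percentile" means capping,
    not dropping, the largest weights.
    """
    if not weights:
        return []
    ordered = sorted(weights)
    # Integer ceiling of percentile * n / 100, avoiding float rounding.
    rank = -(-percentile * len(ordered) // 100)
    cutoff = ordered[rank - 1]
    return [min(w, cutoff) for w in weights]


toy_weights = [float(w) for w in range(1, 21)]  # 1.0, 2.0, ..., 20.0
trimmed = trim_weights(toy_weights)
print(max(toy_weights), max(trimmed))
```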
Normalizing (equalizing) weights
- The original weights give more influence to respondents from larger cities
  - Each respondent in Chicago has a lower probability of being selected than does a respondent in San Antonio
  - Each respondent in Chicago represents more people in his/her city compared to a respondent in SA
  - Therefore, the sum of weights in Chicago is larger compared to the sum for SA
Normalizing (equalizing) weights
- To give equal consideration to respondents in each city, weights are normalized (aka equalized)
- Normalizing involves dividing each person’s individual weight by the value of the adjusted mean weight for that person’s city
Adjusted mean weight
The adjusted mean weight for a city is the value of the average weight for a given city, where the denominator to compute the average is the average number of people from each city who are in the sample (not the number of people in the sample actually from a given city)
An example
At wave 1:
- 926 respondents are from Boston
- 762 respondents are from Chicago
- 714 respondents are from San Antonio
- 2402 respondents in all
- 2402/3 = 800.67
An example
For each city, the adjusted mean weight is the sum of the weights for each person in that city divided by 800.67:
- Boston: 7626.49/800.67 = 9.525
- Chicago: 48226.57/800.67 = 60.233
- San Antonio: 20435.36/800.67 = 25.52
An example
For each person, their normalized weight is their own weight divided by the adjusted mean weight for their city.
For the sample overall, the normalized weight should have a mean of 1
Normalized weights for the full sample at each wave are available on data files
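The arithmetic in this example can be reproduced in a few lines. The Python sketch below uses the city counts and weight sums quoted on the slides; the variable and function names are hypothetical. It also shows why the overall mean of the normalized weights is 1: within each city the normalized weights sum to the average city sample size, so across three cities they sum to the total N.

```python
# Figures quoted on the example slides (wave 1).
city_counts = {"Boston": 926, "Chicago": 762, "San Antonio": 714}
city_weight_sums = {"Boston": 7626.49, "Chicago": 48226.57, "San Antonio": 20435.36}

total_n = sum(city_counts.values())          # 2402
avg_n_per_city = total_n / len(city_counts)  # 2402 / 3

# Adjusted mean weight: sum of a city's weights / average city sample size.
adjusted_mean = {city: total / avg_n_per_city
                 for city, total in city_weight_sums.items()}


def normalized_weight(weight, city):
    """Divide a respondent's weight by the adjusted mean weight of their city."""
    return weight / adjusted_mean[city]


# Sum of normalized weights per city is sum / adjusted mean = avg_n_per_city.
overall_mean = sum(city_weight_sums[c] / adjusted_mean[c]
                   for c in city_counts) / total_n

for city, value in adjusted_mean.items():
    print(city, round(value, 3))
print("overall mean of normalized weights:", overall_mean)
```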
Re-normalizing
If you are working with a subset of cases whose distribution varies dramatically by city (e.g., foreign-born, women who have hit a time limit), you may need to re-normalize
Check to see if the mean of the original normalized weight on the file is a lot different from 1 (below .95 or above 1.05). If so, need to re-normalize.
Steps to re-normalize
- Find the appropriate trimmed, non-normalized weight on the sample (it will have the character T in place of the character E in its name)
- Determine how many cities are represented in your subset (1, 2, or 3)
  - Create a constant equal to that value
- Determine how many cases are on your file
  - Create a constant equal to that value divided by the number of cities in your subset
Steps to re-normalize, cont’d
- For each city, create a variable that is the sum of the weights for each respondent in that city
  - If there are 3 cities in your subset, this variable will have 3 values
- Divide that variable by the average number of people in each city (the constant you created). This is the adjusted mean weight.
  - Again, this variable will take 3 values where there are 3 cities in your subset
- Divide each respondent’s own weight by their city’s adjusted mean weight: this is your re-normalized weight.
- Check that the average for this is 1 (or within .0001)
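The re-normalizing steps can be sketched as one small function. This is a plain-Python illustration on toy records, not the study's own procedure. It assumes cases with missing weights have already been dropped; the function name and the record keys `city` and `weight` are hypothetical stand-ins for the city variable and the trimmed, non-normalized weight on the analysis file.

```python
from collections import defaultdict


def renormalize(rows, weight_key, city_key):
    """Return re-normalized weights, in the same order as `rows`."""
    n_cities = len({row[city_key] for row in rows})  # 1, 2, or 3
    avg_n_per_city = len(rows) / n_cities            # cases / cities

    # Sum of the trimmed, non-normalized weights within each city.
    city_sums = defaultdict(float)
    for row in rows:
        city_sums[row[city_key]] += row[weight_key]

    # Adjusted mean weight: city sum / average number of people per city.
    adjusted_mean = {city: total / avg_n_per_city
                     for city, total in city_sums.items()}

    return [row[weight_key] / adjusted_mean[row[city_key]] for row in rows]


# Toy subset with two cities represented.
subset = [
    {"city": "Boston", "weight": 2.0},
    {"city": "Boston", "weight": 4.0},
    {"city": "Chicago", "weight": 10.0},
    {"city": "Chicago", "weight": 30.0},
    {"city": "Chicago", "weight": 20.0},
]
new_weights = renormalize(subset, "weight", "city")
mean_weight = sum(new_weights) / len(new_weights)
print(mean_weight)  # should be 1, or within .0001 of it
```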
Clustering
- Because respondents are clustered within primary units, people within primary units may be more alike than are people across primary units
- Standard errors may be artificially deflated
- Taking clustering into account adjusts standard errors to what they would be in a non-clustered sample
[Figure: sampling hierarchy for Boston, Chicago, and San Antonio, showing eight groups labeled B, W, H, B, W, H, B, H, with rows labeled S, DU, and PU beneath them. Variable names shown: SITE, STR, PU, SEGID, SCRID.]
Clustering
To date, the research team has not found that accounting for clustering using survey-based adjustments in Stata significantly influences standard errors.
The overall recommendation is not to take clustering into account
If a user wishes to do so, the necessary variables appear on the data files, and instructions appear in user’s guides for waves 2 and 3
Weighting in caregiver and focal child interviews – bottom line
- Focal child weights used more often than dwelling unit weights
- But FC weights only created where a child completed an interview
- Use normalized (equalized) weights to give each city equal influence
- See weighting document for variable names
Weighting in caregiver and focal child interviews – bottom line

Analysis type                     Equalized FC weight/DU weight   Non-equalized FC weight/DU weight
Wave 1 only                       R1CHE5WT/R1DUE5WT               R1CHT5WT/R1DUT5WT
Wave 1 to wave 2 or wave 2 only   R2CHE5WT/R2DUE5WT               R2CHT5WT/R2DUT5WT
Wave 1 to wave 3 or wave 3 only   R3CHE5WT/R3DUE5WT               R3CHT5WT/R3DUT5WT
Longitudinal (all 3 waves)        R3LCE5WT/R3LDE5WT               R3LCT5WT/R3LDT5WT
Useful variables to know about
Variable description: Wave 1 name | Wave 2 name | Wave 3 name
Calendar date of interview (caregiver/child): COMPDATP/COMPDATY | COMPDATE | COMPDATE
Century month of interview (caregiver): CMINTV11 | CMINTV21 | CMINTV31
Calendar date of last interview (caregiver): -- | IWDATEN | IWDATE
Century month of last interview (caregiver): -- | CMINTV11 | LASTIV31
Household’s participation in incentive experiment: INCENT11
Case was a duplicate interview: DUPLICAT | DUPLICAT
City where respondent was originally interviewed: CITY11 | SCRCITY or CITY21 | SITE or CITY31
Useful variables to know about
Variable description: Wave 1 name | Wave 2 name | Wave 3 name
Interview conducted in-person or by telephone: -- | CAPIMODE | CAPIMODE
Household participated in EDS: -- | EDS | --
Household dropped at wave 1, picked up again at wave 2: -- | NEWCASE | --
Interview is complete (complete=491): SUMSTAT | W2SUMSTA | SUM_STAT
Time between interviews (months): -- | -- | TIMLPS31
Last interview wave R participated in: -- | -- | LIVWAV31
Imputed race for R/child (for cases dropped at wave 1): -- | -- | IMRACE31/IMRACC31/ICRACE31/ICRACC31
Useful variables to know about
Variable description: Wave 1 name | Wave 2 name | Wave 3 name
Interview type (continuing or new caregiver): INTYPE
Total number of interviews R gave over course of study: TOTINT
Incorrect interview type administered (continuing CG received new CG interview): CGTFL
Total # of caregivers associated w/this household at this wave: TOTCG
Focal child is independent (not living w/continuing or new CG): INMAINCS (on caregiver files) | INDEPENDENT (on focal child data file)
Weighted Longitudinal Analysis
Are home renters more likely to eventually buy a home in their neighborhood when collective efficacy in the neighborhood is higher?
Look at caregivers who were renting at wave 1 to determine whether they purchased a home in their neighborhood by wave 3.
Homebuying example
- Use wave 1 and wave 3 data
- From wave 1, keep unique identifiers, home ownership status (PHT2A), collective efficacy scale (COLEFF11), and century-month of caregiver interview (CMINTV11)
- From wave 3, keep unique identifiers, home ownership status (RHT2A), number of years R has lived in neighborhood (NHDYR31), century-month of interview (CMINTV31), equalized and non-equalized 5% trimmed weights, and CITY
Homebuying example
- Merge 2 files by NEWID
- Keep only cases that appear on both files (same caregiver)
- Keep only cases that were renters at wave 1
- Keep only cases that have been in wave 3 neighborhood since before wave 1
- N=618
- Note that 11 cases lack values on focal child weights, so effective N=607
Homebuying example
- Look at normalized weight. Does it have a mean of 1?
- If not, renormalize.

Before re-normalizing:

    . summ r3che5wt

        Variable |   Obs        Mean    Std. Dev.        Min        Max
        r3che5wt |   607    .9135091    1.117513         .02       7.88

After re-normalizing:

    . summ r3che5wt

        Variable |   Obs        Mean    Std. Dev.        Min        Max
        r3che5wt |   607           1    1.905628    .0214276   21.10895
Example - Bivariate logit, weighted and unweighted
Regress OWN31 on COLEFF11, unweighted

    Logistic regression                    Number of obs =    606
                                           LR chi2(1)    =   0.01
                                           Prob > chi2   = 0.9343
    Log likelihood = -175.01546            Pseudo R2     = 0.0000

       own31 |     Coef.   Std. Err.      z    P>|z|   [95% Conf. Interval]
    coleff11 | -.0012978   .0157366   -0.08    0.934   -.0321411   .0295455
       _cons | -2.353912   .4279933   -5.50    0.000   -3.192764  -1.515061
Example - Bivariate logit, weighted and unweighted
Regress OWN31 on COLEFF11, weighted

    Logistic regression                    Number of obs =    606
                                           Wald chi2(1)  =   1.43
                                           Prob > chi2   = 0.2318
    Log pseudolikelihood = -268.76871      Pseudo R2     = 0.0265

             |              Robust
       own31 |     Coef.   Std. Err.      z    P>|z|   [95% Conf. Interval]
    coleff11 |  -.044825   .0374888   -1.20    0.232   -.1183017   .0286517
       _cons | -.4846027   .9864111   -0.49    0.623   -2.417933   1.448727

Effect of collective efficacy is not statistically significant here, but point estimates change dramatically with weights included