A data workshop presented at
The Inter-University Consortium for Political and Social Research (ICPSR)
University of Michigan
Ann Arbor, MI
July 21-23, 2008

DAY 2

Topics to discuss

Longitudinal analysis
Weighting
  Cross-sectional vs. longitudinal
  Survey
  EDS
Useful variables to know about
Examples of longitudinal research
Access to restricted data

Longitudinal analysis (survey)

3 waves of data
Study includes separated and new caregivers
Focal child remains the same, but a child’s residence and/or caregiver may change once or twice

Longitudinal analysis

A researcher may wish:
  to follow the same child over time
  to include only children included at all 3 waves
  to follow the same adult over time, even if her caregiver status changes
  to study young adults who become independent householders by wave 3

Data file structure

In order to follow separated caregivers AND children with new caregivers, two survey instruments were created for waves 2 and 3:
  The continuing and new caregiver interview collects information about the caregiver and the focal child
  The separated caregiver interview collects information about the caregiver only
Continuing and new caregivers appear together in one data file/codebook; separated caregivers have their own data file/codebook

Data file structure

At wave 2 (or wave 3), there may be two caregiver interviews associated with a given focal child
  Two caregiver records for the same sampled household, but each record is on a separate file
  If the two caregiver files are appended, two caregiver records will have the same “household” identifier
Within a wave, you need a way to identify separate person records from the same “household”

Data file structure

Over time, a wave 1 caregiver’s status may change:
  With child at wave 1, separated at wave 2, back together at wave 3
  With child at wave 1, separated at waves 2 and 3
  With child at waves 1 and 2, separated at wave 3
For an analysis of caregivers, you need a way to make sure that you are following the same person over time (across waves)

Possible interview scenarios at wave 3, described in the variable SCENARIO
(N from wave 3 focal child interview file, total N=1944)

SCENARIO | Description | N
1 | Focal child lived with same caregiver at each wave | 1716
2 | Focal child was with new caregiver at wave 2, with same caregiver at wave 3 (wave 1 caregiver is separated at wave 3) | 20
3 | Focal child was with wave 1 caregiver at wave 2 and with a new caregiver at wave 3 | 79
4 | Focal child was with a new caregiver at wave 2, reunited with wave 1 caregiver at wave 3 (wave 2 caregiver not followed) | 14
5 | Focal child is living independently at wave 3 (wave 1 caregiver interviewed as a separated caregiver) | 114

Tools to merge files longitudinally

HHID – a unique, universal household ID
  7 characters
  Value is the same at every wave, for every interview type
  Think of this as a value attached to the focal child, who is the unit of analysis in the study


ZRID – an interview-specific unique identifier
  8 characters
  Like HHID with a new leading character:
    1 = Wave 1 survey interview for caregiver and child, wave 2 focal child
    3 = Mother EDS (wave 1 and wave 2)
    4 = Wave 1 Father EDS
    5 = Wave 1 Child Care EDS (wave 1 and wave 2)
    6 = Wave 2 continuing caregiver
    7 = Wave 2 new caregiver
    8 = Wave 2 separated caregiver
    R = Wave 3 continuing or new caregiver
    S = Wave 3 separated caregiver
    A = Wave 3 focal child
  Think of this as a value attached to the respondent, usually for a particular wave


NEWID – a caregiver-specific unique ID
  Allows users to follow same caregiver over time, even if her status relative to the focal child changes
  Necessary because at wave 3, a “continuing caregiver” could be:
    The same person who was with the focal child at waves 1 and 2
    A caregiver who was new to the focal child at wave 2 and continues to be the child’s caregiver at wave 3
    A wave 1 caregiver who was separated from the child at wave 2 and is reunited at wave 3

Tools to merge files - NEWID

9 characters
Like HHID with 2 characters appended: “01” or “02”
  The original wave 1 caregiver always has “01” in her NEWID value
  A person who was a new caregiver at wave 2 or at wave 3 will have a value of “02” in her NEWID value
  There are no cases where the focal child had different new caregivers at wave 2 and wave 3; hence, NEWID never ends with “03”

Scenario 1: Child always with same caregiver

HHID: 0801230
ZRID, wave 1: 10801230
ZRID, wave 2: 60801230 (caregiver), 10801230 (child)
ZRID, wave 3: R0801230 (caregiver), A0801230 (child)
NEWID (for caregiver): 080123001

Scenario 2: Child with new CG at wave 2, still with that CG at wave 3

HHID: 0801230
ZRID, wave 1: 10801230
ZRID, wave 2: 70801230 (new caregiver), 80801230 (separated caregiver), 10801230 (child)
ZRID, wave 3: R0801230 (“continuing new” caregiver), S0801230 (separated caregiver), A0801230 (child)
NEWID (for caregiver): 080123001 (original caregiver), 080123002 (caregiver introduced at wave 2)

Scenario 4: Child with new CG at wave 2, reunited with original CG at wave 3

HHID: 0801230
ZRID, wave 1: 10801230
ZRID, wave 2: 70801230 (new caregiver), 80801230 (separated caregiver), 10801230 (child)
ZRID, wave 3: R0801230 (original caregiver), A0801230 (child)
NEWID (for caregiver): 080123001 (original caregiver), 080123002 (caregiver introduced at wave 2, NOT interviewed at wave 3)
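
The ID scheme in the scenarios above can be unpacked programmatically. The following is a minimal Python sketch, not part of the study's own tools. It assumes the IDs are stored as strings (so leading zeros survive), and the helper names split_zrid, split_newid, and is_original_caregiver are made up for illustration.

```python
# Leading ZRID characters, as listed on the ZRID slide.
ZRID_PREFIXES = {"1", "3", "4", "5", "6", "7", "8", "R", "S", "A"}


def split_zrid(zrid):
    """Split an 8-character ZRID into (leading character, 7-character HHID)."""
    if len(zrid) != 8 or zrid[0] not in ZRID_PREFIXES:
        raise ValueError(f"not a valid ZRID: {zrid!r}")
    return zrid[0], zrid[1:]


def split_newid(newid):
    """Split a 9-character NEWID into (7-character HHID, '01' or '02')."""
    if len(newid) != 9 or newid[7:] not in ("01", "02"):
        raise ValueError(f"not a valid NEWID: {newid!r}")
    return newid[:7], newid[7:]


def is_original_caregiver(newid):
    """True when the NEWID belongs to the original wave 1 caregiver."""
    return split_newid(newid)[1] == "01"


# Values taken from the Scenario 4 example.
prefix, hhid = split_zrid("R0801230")
same_household = hhid == split_newid("080123002")[0]
```

Here prefix is "R" (wave 3 continuing or new caregiver) and same_household is True, because both IDs embed HHID 0801230.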

Merging

To follow the same focal child over time, merge:
  Wave 1 data, wave 2 continuing/new caregiver data, wave 3 continuing/new caregiver data BY HHID
Analyses should include a control to indicate whether the child was with a new caregiver at either wave

Merging

To include only children interviewed at each wave, merge:
  Wave 1 data, wave 2 continuing/new caregiver data, wave 3 continuing/new caregiver data BY HHID
Drop cases that do not appear on all 3 files
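
As an illustration of this merge, here is a small Python sketch using plain dictionaries rather than a statistical package. It assumes each file has at most one record per HHID. The function name merge_on_hhid, the income field, and the second household ID are hypothetical.

```python
def merge_on_hhid(files, key="HHID"):
    """Merge per-wave files (lists of dict records) on `key`.

    Only cases that appear on every file are kept. Other fields get a
    _w1, _w2, ... suffix so values from different waves do not collide.
    """
    indexed = [{rec[key]: rec for rec in f} for f in files]
    common = set(indexed[0]).intersection(*indexed[1:])
    merged = []
    for k in sorted(common):
        row = {key: k}
        for wave, idx in enumerate(indexed, start=1):
            for name, value in idx[k].items():
                if name != key:
                    row[f"{name}_w{wave}"] = value
        merged.append(row)
    return merged


# Made-up records: household 0801240 is missing at wave 2.
wave1 = [{"HHID": "0801230", "income": 900}, {"HHID": "0801240", "income": 700}]
wave2 = [{"HHID": "0801230", "income": 950}]
wave3 = [{"HHID": "0801230", "income": 1000}, {"HHID": "0801240", "income": 720}]

children_all_waves = merge_on_hhid([wave1, wave2, wave3])
```

Only household 0801230 survives, because 0801240 does not appear on the wave 2 file.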

Merging

To follow the same adult over time:
  Append wave 2 separated caregiver file to wave 2 new/continuing caregiver file
  Append wave 3 separated caregiver file to wave 3 new/continuing caregiver file
  Merge these appended files to wave 1 file BY NEWID
HINT: A three-wave analysis of the same person would have to include women who were caregivers at wave 1. Hence, all valid values of NEWID would have to end in “01” for such an analysis.
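
The append-then-merge sequence for following adults might look like the Python sketch below. The records mimic Scenario 4. The status field and the function names append_files and follow_adults are hypothetical, and the sketch assumes each appended file holds one record per NEWID.

```python
def append_files(*files):
    """Stack several files (lists of dict records) into one list."""
    combined = []
    for f in files:
        combined.extend(f)
    return combined


def follow_adults(wave1, wave2, wave3):
    """Link person records across waves by NEWID.

    Returns {NEWID: [wave 1 record, wave 2 record, wave 3 record]} for
    adults who appear at all three waves.
    """
    by_wave = [{r["NEWID"]: r for r in f} for f in (wave1, wave2, wave3)]
    common = set(by_wave[0]) & set(by_wave[1]) & set(by_wave[2])
    return {newid: [w[newid] for w in by_wave] for newid in sorted(common)}


# Scenario 4: new caregiver at wave 2, original caregiver back at wave 3.
w1 = [{"NEWID": "080123001", "status": "caregiver"}]
w2_main = [{"NEWID": "080123002", "status": "new"}]
w2_separated = [{"NEWID": "080123001", "status": "separated"}]
w3_main = [{"NEWID": "080123001", "status": "continuing"}]
w3_separated = []

linked = follow_adults(
    w1,
    append_files(w2_main, w2_separated),
    append_files(w3_main, w3_separated),
)
statuses = [rec["status"] for rec in linked["080123001"]]
```

The only adult linked across all three waves has a NEWID ending in “01”, consistent with the hint above.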

Merging

To follow only children who always had the same caregiver:
  Merge wave 1 data, wave 2 continuing/new caregiver data, and wave 3 continuing/new caregiver data BY NEWID
  Drop cases that do not match on all 3 files
  Do NOT append wave 2 or wave 3 separated caregiver files

Data issues that affect longitudinal analysis

Falsified data

At wave 1, 56 cases were determined to be falsified and were dropped from the public release (N=2402 instead of 2458)

Another 11 cases had data removed from the child interview, but not from the adult interview

Falsified cases

The 56 falsified cases were retained by the survey contractor to re-interview at wave 2

45 of the 56 were re-interviewed at wave 2

However, no data on race/ethnicity or other time-invariant characteristics were collected

Falsified cases

11 falsified cases not found at wave 2 were not followed to wave 3

35 of the 45 falsified households who were tracked provided interviews at wave 3

For these respondents, race/ethnicity was imputed for the caregiver and child from screener data.

Falsified cases

Any 3-wave analysis will exclude the 56 falsified cases (and will also exclude the 11 falsified child interview cases if analysis involves child data)

Any wave 2 analysis using race/ethnicity as a control variable will exclude the falsified cases

Duplicate cases

At wave 1, nine households were interviewed twice (under separate values of HHID)

Each household was interviewed only once at wave 3

Values of HHID that were not re-interviewed: 1020060, 0950370, 1180350, 1220300, 1540090, 1690550, 1740110, 1920060, 1920090

Weighting

Sample is clustered and stratified, so probability weights were developed at wave 1 for the main interview and for EDS

There is attrition across waves, so weights at waves 2 and 3 account for non-response bias

Not accounting for weights affects point estimates and standard errors

Wave 1 probability weights for dwelling units

A product of the probability of:
  Respondent’s primary unit (neighborhood cluster) selected for sampling
  Respondent participated in screening
  Household met the various stratifying criteria for inclusion (race/ethnicity, marital status, income level, Medicaid status, etc.)
  Respondent agreed to participate

Wave 1 probability weights for focal children

Among sampled households, a child’s probability of participating depends on the number of age-eligible children in the household
  If there is only one child, his/her probability is 1
  If there are two children, each child’s probability is .5
Focal child weight = dwelling unit weight * (1 / probability of child’s selection)
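
The focal child weight formula can be written as a short Python function. This is a sketch of the arithmetic only. It assumes the focal child is chosen with equal probability among the age-eligible children, which matches the one-child and two-child cases given above. The function name and the dwelling unit weight of 12.5 are made up.

```python
def focal_child_weight(dwelling_unit_weight, n_eligible_children):
    """Dwelling unit weight divided by the child's selection probability."""
    if n_eligible_children < 1:
        raise ValueError("a sampled household needs at least one eligible child")
    # Assumption: each eligible child is equally likely to be the focal child.
    selection_probability = 1 / n_eligible_children
    return dwelling_unit_weight / selection_probability


only_child = focal_child_weight(12.5, 1)
one_of_two = focal_child_weight(12.5, 2)
```

With a dwelling unit weight of 12.5, an only child keeps a weight of 12.5, while a child selected from two eligible children gets 25.0.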

Dwelling unit weight vs focal child weight

In most cases, the focal child weight is used, even if an analysis focuses on adults’ employment, earnings, etc.
  Children are the unit of analysis in the sampling design
  Not accounting for the number of eligible children gives more weight to larger families
But if a child did not participate in a given wave and his/her caregiver did, there is only a dwelling unit weight associated with that household for that wave (no focal child weight)

Weights at waves 2 and 3

Wave 2 and wave 3 weights account for non-response bias

Equal to the product of the wave 1 weight x probability of remaining in the study at each wave

In addition, a wave 3 longitudinal weight calculates the unique probability of participating in every wave
  Distinct from the wave 3 cross-sectional weight, which considers the probability of participating in wave 1 and wave 3, and ignores wave 2

EDS Weights Wave 1

Age adjusted for eligible 2, 3, 4 year olds
Separate weight for each piece of EDS (e.g., mother, childcare, father), and for each combination (e.g., mother and father)
Weights adjusted for EDS nonresponse, stratified differently by interview type:
  race & income for mother
  residence status for father
  type for childcare

EDS Weights Wave 2

Similar technique in wave 2
Correlational and longitudinal weights created, for each possible combination
Like main survey, weights trimmed
Like main survey, should renormalize if necessary

Trimming weights

The research team agreed to trim dwelling unit and focal child weights at the top 5th percentile (not the bottom 5th)
  Very large weights gave extreme influence to unusual cases (for example, higher income, unmarried Hispanic women)
Trimming is a conventional solution to the problem of outliers
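
One common way to trim at the top is to cap every weight at the 95th percentile value. The slide does not say whether the research team capped the weights or handled them some other way, nor which percentile rule was used, so the Python sketch below is an assumption on both counts: it caps, and it uses the nearest-rank percentile. The function name trim_top is made up.

```python
import math


def trim_top(weights, top_percent=5):
    """Cap weights above the (100 - top_percent)th percentile at that value.

    Assumptions (not confirmed by the source): trimming means capping, and
    the percentile is found with the nearest-rank rule.
    """
    ordered = sorted(weights)
    rank = math.ceil(len(ordered) * (100 - top_percent) / 100)  # 1-based rank
    cap = ordered[rank - 1]
    return [min(w, cap) for w in weights]


# Made-up weights 1..20: only the largest value lies above the 95th percentile.
raw = [float(w) for w in range(1, 21)]
trimmed = trim_top(raw)
```

In this toy example the single weight of 20.0 is pulled down to 19.0, and the bottom of the distribution is left alone.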

Normalizing (equalizing) weights

The original weights give more influence to respondents from larger cities
  Each respondent in Chicago has a lower probability of being selected than does a respondent in San Antonio
  Each respondent in Chicago represents more people in his/her city compared to a respondent in SA
  Therefore, the sum of weights in Chicago is larger compared to the sum for SA

Normalizing (equalizing) weights

To give equal consideration to respondents in each city, weights are normalized (aka equalized)

Normalizing involves dividing each person’s individual weight by the value of the adjusted mean weight for that person’s city

Adjusted mean weight

The adjusted mean weight for a city is the sum of the weights for that city divided by the average number of sample members per city (total sample size divided by the number of cities), not by the number of sample members actually from that city

An example

At wave 1:
  926 respondents are from Boston
  762 respondents are from Chicago
  714 respondents are from San Antonio
  2402 respondents in all
  2402/3 = 800.67

An example

For each city, the adjusted mean weight is the sum of the weights for each person in that city divided by 800.67:
  Boston: 7626.49/800.67 = 9.525
  Chicago: 48226.57/800.67 = 60.233
  San Antonio: 20435.36/800.67 = 25.52

An example

For each person, their normalized weight is their own weight divided by the adjusted mean weight for their city.

For the sample overall, the normalized weight should have a mean of 1

Normalized weights for the full sample at each wave are available on data files
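
The arithmetic in this example can be reproduced in a few lines of Python. The city counts and weight sums are the ones quoted above; the variable and function names are made up for illustration.

```python
# Figures quoted in the wave 1 example.
city_n = {"Boston": 926, "Chicago": 762, "San Antonio": 714}
city_weight_sum = {"Boston": 7626.49, "Chicago": 48226.57, "San Antonio": 20435.36}

total_n = sum(city_n.values())            # 2402 respondents in all
avg_n_per_city = total_n / len(city_n)    # 2402 / 3, about 800.67

# Adjusted mean weight: city sum of weights over the average city sample size.
adjusted_mean = {
    city: weight_sum / avg_n_per_city
    for city, weight_sum in city_weight_sum.items()
}


def normalized_weight(weight, city):
    """A person's own weight divided by the adjusted mean weight for their city."""
    return weight / adjusted_mean[city]
```

Because each city's normalized weights sum to the same value (about 800.67), the three cities carry equal influence and the normalized weight averages 1 over the full sample.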

Re-normalizing

If you are working with a subset of cases whose distribution varies dramatically by city (e.g., foreign-born, women who have hit a time limit), you may need to re-normalize

Check to see if the mean of the original normalized weight on the file is a lot different from 1 (below .95 or above 1.05). If so, need to re-normalize.

Steps to re-normalize

Find the appropriate trimmed, non-normalized weight on the sample (it will have the character T in place of the character E in its name)
Determine how many cities are represented in your subset (1, 2, or 3)
  Create a constant equal to that value
Determine how many cases are on your file
  Create a constant equal to that value divided by the number of cities in your subset

Steps to re-normalize, cont’d

For each city, create a variable that is the sum of the weights for each respondent in that city
  If there are 3 cities in your subset, this variable will have 3 values
Divide that variable by the average number of people in each city (the constant you created). This is the adjusted mean weight.
  Again, this variable will take 3 values where there are 3 cities in your subset
Divide each respondent’s own weight by their city’s adjusted mean weight; this is your re-normalized weight
Check that the average for this is 1 (or within .0001)
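
The re-normalizing steps can be sketched in Python as follows. The records are made up, the field names city and weight are placeholders for the city variable and the trimmed, non-normalized weight on your file, and the function name renormalize is hypothetical.

```python
from collections import defaultdict


def renormalize(records, weight="weight", city="city"):
    """Re-normalize trimmed, non-normalized weights within a subset of cases."""
    # Number of cities in the subset and average cases per city.
    n_cities = len({r[city] for r in records})
    avg_n_per_city = len(records) / n_cities

    # Sum of weights within each city.
    city_sum = defaultdict(float)
    for r in records:
        city_sum[r[city]] += r[weight]

    # Adjusted mean weight per city, then each person's re-normalized weight.
    adjusted_mean = {c: s / avg_n_per_city for c, s in city_sum.items()}
    return [r[weight] / adjusted_mean[r[city]] for r in records]


# Made-up subset: 6 cases spread unevenly over 3 cities.
subset = [
    {"city": "Boston", "weight": 2.0},
    {"city": "Boston", "weight": 6.0},
    {"city": "Chicago", "weight": 30.0},
    {"city": "Chicago", "weight": 50.0},
    {"city": "Chicago", "weight": 40.0},
    {"city": "San Antonio", "weight": 10.0},
]
new_weights = renormalize(subset)
mean_weight = sum(new_weights) / len(new_weights)
```

The final check from the steps above is that mean_weight equals 1 (or is within .0001 of it).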

Clustering

Because respondents are clustered within primary units, people within primary units may be more alike than are people across primary units

Standard errors may be artificially deflated

Taking clustering into account adjusts standard errors to what they would be in a non-clustered sample

[Figure: sample design hierarchy for Boston, Chicago, and San Antonio. Levels shown: city, stratum (B, W, H in Boston and Chicago; B, H in San Antonio), PU, S, and DU. Variable names listed: SITE, STR, PU, SEGID, SCRID.]

Clustering

To date, the research team has not found that accounting for clustering using survey-based adjustments in Stata significantly influences standard errors.

The overall recommendation is not to take clustering into account

If a user wishes to do so, the necessary variables appear on the data files, and instructions appear in user’s guides for waves 2 and 3

Weighting in caregiver and focal child interviews – bottom line

Focal child weights used more often than dwelling unit weights
  But FC weights only created where a child completed an interview
Use normalized (equalized) weights to give each city equal influence
See weighting document for variable names

Weighting in caregiver and focal child interviews – bottom line

Analysis type | Equalized FC weight/DU weight | Non-equalized FC weight/DU weight
Wave 1 only | R1CHE5WT/R1DUE5WT | R1CHT5WT/R1DUT5WT
Wave 1 to wave 2 or wave 2 only | R2CHE5WT/R2DUE5WT | R2CHT5WT/R2DUT5WT
Wave 1 to wave 3 or wave 3 only | R3CHE5WT/R3DUE5WT | R3CHT5WT/R3DUT5WT
Longitudinal (all 3 waves) | R3LCE5WT/R3LDE5WT | R3LCT5WT/R3LDT5WT

Useful variables to know about

Variable description: Wave 1 name | Wave 2 name | Wave 3 name

Calendar date of interview (caregiver/child): COMPDATP/COMPDATY | COMPDATE | COMPDATE
Century month of interview (caregiver): CMINTV11 | CMINTV21 | CMINTV31
Calendar date of last interview (caregiver): -- | IWDATEN | IWDATE
Century month of last interview (caregiver): -- | CMINTV11 | LASTIV31
Household’s participation in incentive experiment: INCENT11
Case was a duplicate interview: DUPLICAT, DUPLICAT
City where respondent was originally interviewed: CITY11 | SCRCITY or CITY21 | SITE or CITY31


Interview conducted in-person or by telephone: -- | CAPIMODE | CAPIMODE
Household participated in EDS: -- | EDS | --
Household dropped at wave 1, picked up again at wave 2: -- | NEWCASE | --
Interview is complete (complete=491): SUMSTAT | W2SUMSTA | SUM_STAT
Time between interviews (months): -- | -- | TIMLPS31
Last interview wave R participated in: -- | -- | LIVWAV31
Imputed race for R/child (for cases dropped at wave 1): -- | -- | IMRACE31/IMRACC31/ICRACE31/ICRACC31


Interview type (continuing or new caregiver): INTYPE
Total number of interviews R gave over course of study: TOTINT
Incorrect interview type administered (continuing CG received new CG interview): CGTFL
Total # of caregivers associated w/this household at this wave: TOTCG
Focal child is independent (not living w/continuing or new CG): INMAINCS (on caregiver files), INDEPENDENT (on focal child data file)

Weighted Longitudinal Analysis

Are home renters more likely to eventually buy a home in their neighborhood when collective efficacy in the neighborhood is higher?

Look at caregivers who were renting at wave 1 to determine whether they purchased a home in their neighborhood by wave 3.

Homebuying example

Use wave 1 and wave 3 data

From wave 1, keep unique identifiers, home ownership status (PHT2A), collective efficacy scale (COLEFF11), and century-month of caregiver interview (CMINTV11)

From wave 3, keep unique identifiers, home ownership status (RHT2A), number of years R has lived in neighborhood (NHDYR31), century-month of interview (CMINTV31), equalized and non-equalized 5% trimmed weights, and CITY

Homebuying example

Merge 2 files by NEWID
Keep only cases that appear on both files (same caregiver)
Keep only cases that were renters at wave 1
Keep only cases that have been in wave 3 neighborhood since before wave 1
  N=618
Note that 11 cases lack values on focal child weights, so effective N=607
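
The case-selection steps might be sketched in Python as below. Several details are assumptions rather than facts from the study documentation: the renter code for PHT2A (set here to a placeholder value of 2), the treatment of NHDYR31 as whole years, and the rule that "since before wave 1" means years in the neighborhood exceed the time between the two interviews. The records and the function names are made up.

```python
RENTER = 2  # hypothetical PHT2A code for renting; check the codebook


def in_neighborhood_before_wave1(nhdyr31, cmintv11, cmintv31):
    """True if years in the wave 3 neighborhood exceed the time since wave 1."""
    months_between_interviews = cmintv31 - cmintv11
    return nhdyr31 * 12 > months_between_interviews


def select_cases(wave1, wave3):
    """Merge by NEWID, then keep wave 1 renters who have not changed neighborhoods."""
    wave3_by_id = {r["NEWID"]: r for r in wave3}
    kept = []
    for r1 in wave1:
        r3 = wave3_by_id.get(r1["NEWID"])
        if r3 is None:
            continue  # not on both files
        if r1["PHT2A"] != RENTER:
            continue  # not renting at wave 1
        if not in_neighborhood_before_wave1(
            r3["NHDYR31"], r1["CMINTV11"], r3["CMINTV31"]
        ):
            continue  # moved into the wave 3 neighborhood after wave 1
        kept.append({**r1, **r3})
    return kept


# Made-up records: interviews 72 century-months (6 years) apart.
wave1 = [
    {"NEWID": "080123001", "PHT2A": 2, "CMINTV11": 1191},
    {"NEWID": "080124001", "PHT2A": 2, "CMINTV11": 1191},
    {"NEWID": "080125001", "PHT2A": 1, "CMINTV11": 1191},
]
wave3 = [
    {"NEWID": "080123001", "NHDYR31": 10, "CMINTV31": 1263},
    {"NEWID": "080124001", "NHDYR31": 3, "CMINTV31": 1263},
    {"NEWID": "080125001", "NHDYR31": 12, "CMINTV31": 1263},
]
analysis_cases = select_cases(wave1, wave3)
```

Only the first made-up caregiver remains: the second moved into the neighborhood after wave 1, and the third was not a renter at wave 1.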

Homebuying example

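Look at normalized weight. Does it have a mean of 1?

. summ r3che5wt

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    r3che5wt |       607    .9135091    1.117513        .02       7.88

If not, renormalize.

. summ r3che5wt

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    r3che5wt |       607           1    1.905628   .0214276   21.10895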

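Example - Bivariate logit, weighted and unweighted

Regress OWN31 on COLEFF11, unweighted

Logistic regression                               Number of obs   =        606
                                                  LR chi2(1)      =       0.01
                                                  Prob > chi2     =     0.9343
Log likelihood = -175.01546                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
       own31 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    coleff11 |  -.0012978   .0157366    -0.08   0.934    -.0321411    .0295455
       _cons |  -2.353912   .4279933    -5.50   0.000    -3.192764   -1.515061
------------------------------------------------------------------------------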

Example - Bivariate logit, weighted and unweighted

Regress OWN31 on COLEFF11, weighted

Logistic regression                               Number of obs   =        606
                                                  Wald chi2(1)    =       1.43
                                                  Prob > chi2     =     0.2318
Log pseudolikelihood = -268.76871                 Pseudo R2       =     0.0265

------------------------------------------------------------------------------
             |               Robust
       own31 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    coleff11 |   -.044825   .0374888    -1.20   0.232    -.1183017    .0286517
       _cons |  -.4846027   .9864111    -0.49   0.623    -2.417933    1.448727
------------------------------------------------------------------------------

Effect of collective efficacy is not statistically significant here, but point estimates change dramatically with weights included
