INFO 7470/ECON 7400/ILRLE 7400
LEHD, Health, Admin, etc.
John M. Abowd and Lars Vilhuber
March 4, 2013
© John M. Abowd and Lars Vilhuber 2013, all rights reserved
Overview
• Continuation of LEHD micro data
• Register-based statistics
  – LEHD
  – Census 2010
  – Elsewhere in the world
• Health data in the United States
• Novel statistical sources
INFO 7470/ECON 7400/ILRLE 7400
LEHD RDC Data Products
Lars Vilhuber, Cornell University
Based on joint work with Kevin McKinney, U.S. Census Bureau
March 2013
LEHD Infrastructure in the Census Research Data Centers
• Big-picture: http://www.vrdc.cornell.edu/news/lehd-infrastructure-files-in-the-census-rdc-overview/ (overview_master_zero_obs.pdf)
• Contains
  – Detailed file descriptions
  – Attached zero-observation versions of all datasets
Important Features of RDC LEHD Data
• No public-use QWI information (non-perturbed data only)
• no information related to the disclosure-avoidance measures used in QWI and OTM
• Certain lag relative to the publicly available QWI
  – Currently: Snapshot 2008
  – Scheduled for 2013: Snapshot 2011
• Not all states have given permission to use their data for non-core research purposes
Treatment of Federal Tax Information
• Some components have Title-26 protected variables
• Census Bureau Policy office requires IRS approval for all LEHD projects.
• The LEHD Snapshot isolates T26 data, which may allow projects that do not need IRS approval to use LEHD data
Data Flow View of the LEHD Infrastructure
Name | T26 component | Notes
Business Register Bridge (BRB) | (all) |
Employer Characteristics File (ECF) | ECFT26 | CA EIN. Also: do not contain firm names
ES-202 (QCEW) | | Not available to researchers, but may be feasible for special projects
Individual Characteristics File (ICF) | ICFT26 | IRS 1040 residence addresses (only 1999 address; S2011 will have longitudinal addresses)
Geocoded Address List (GAL) | GALT26 | BR records
Name | T26 component | Notes
Quarterly Workforce Indicators (QWI, establishment-level files); “QWI Seinunit”, internally known as UFF_B | | Unfuzzed
Successor-Predecessor Files (SPF) | | Preliminary access (documentation is beta quality)
Unit-to-Worker Impute (U2W) | | Multiply-imputed files; should use multiple-imputation analysis techniques
Determining Data Availability
• Use overview_master_zero_obs.pdf
• Table 1.2 – States for which ANY files are available (permissions given). Caution: the data underlying this is in flux as of 2012/2013
• Table 1.3 – Files that exist for each state (EHF, ECF, and ICF are the core processes)
• Cross-reference Table 1.2 with Table 1.3
• Each process has a table with the available time periods. For example, see table 4.7 on page 127 for the EHF.
Identifiers
• In general, linkages between the different files occur using deterministic match-merge techniques
• Person, firm, and establishment identifiers link all LEHD Infrastructure files among themselves.
• External linkages are generally probabilistic:
  – Linkages to BR, LBD, etc.
    • Link using BRB/LBDB: many-to-many match of establishments
    • Link using upcoming (S2011) ECF: using Census “alpha” or EIN: firm-level to collection of establishments on both sides
  – Linkages to external files by establishment location
  – Linkages to external files by firm name (special projects only)
• Some external linkages seem deterministic
  – Linkages to demographic data are based on PIK, but PIK-assignment is not deterministic
Individual Identifier System (PIK)
• All Social Security Numbers (SSN) have been replaced by Protected Identification Key (PIK) – no SSNs are available anywhere in these data.
• A PIK is a unique 9-digit number that maps to one and only one SSN.
• A PIK is permanently assigned to an SSN, allowing the same types of analyses to be performed, but with greatly improved confidentiality protection.
• Used widely for person-level files within the Census RDC
Additional Identifiers
• Survey IDs
  – CPS
  – SIPP
  – Census 2000
  – ACS
• In general, PIK is on the file, or available as a separate crosswalk
• Note: LEHD ICF contains CPS and SIPP IDs, but they are not the most current – use/request the crosswalks
Firm/Establishment Identifiers
• Firm identifiers are called State employer identification number (SEIN) and generally reflect an entity reporting unemployment insurance (UI) data and taxes to state authorities.
• “Establishments” (more precisely: reporting units or workplaces) are identified by a combination of SEIN and reporting unit (SEINUNIT).
• The firm and establishment identifiers are state-specific – within the LEHD Infrastructure, there is no method of linking units of a nation-wide firm across state borders.
• Federal EIN is available
  – on ECF for most states,
  – on ECFT26 for California
• CFN available on BRB
Reminder: Entities in LEHD Data
• “Person” = PIK
• “Firm” = SEIN (within state), EIN (cross-state)
• “Establishment” = SEIN || SEINUNIT
• “Job” = PIK || SEIN || SEINUNIT
• Smallest observed time unit: calendar year quarter
• Inferred points in time: boundaries of calendar year quarters
LEHD SNAPSHOT: EHF
EHF
• PIK SEIN SEINUNIT YEAR files
  – But: Only Minnesota currently has SEINUNIT (establishment) identifiers.
• Contains person-firm (job) x year information.
  – Quarterly earning records
• No direct measure of labor force attachment; it is inferred from the presence of earnings.
EHF Structure
Wage records (UI), one row per quarter:
PIK | SEIN | SEINUNIT | YEAR | Quarter | Earnings
0123456789 | ABC | 00000 | 2005 | 1 | $1,245
0123456789 | ABC | 00000 | 2005 | 2 | $1,315
EHF record, one row per year:
PIK | SEIN | SEINUNIT | YEAR | Earn Q1 | Earn Q2 | Earn Q3 | Earn Q4
0123456789 | ABC | 00000 | 2005 | $1,245 | $1,315 | |
b_ijt = 1, e_ijt = 0
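The relationship between the quarterly wage records and the yearly EHF record can be sketched in a few lines of Python. This is an illustration only: the field names (`earn_q1` to `earn_q4`), the dictionary layout, and the function names are hypothetical. The flag definitions are an assumption based on the usual QWI convention, which the slide does not spell out: a job counts as held at the beginning of quarter t when it has positive earnings in t-1 and t, and at the end of quarter t when it has positive earnings in t and t+1.

```python
from typing import Dict, List, Tuple

Quarter = Tuple[int, int]  # (year, quarter)


def wide_to_long(rec: Dict[str, object]) -> List[tuple]:
    """Turn one wide PIK-SEIN-SEINUNIT-YEAR record into quarterly rows.

    Quarters without reported earnings (None) produce no row.
    """
    rows = []
    for q in range(1, 5):
        earn = rec.get(f"earn_q{q}")  # hypothetical field name
        if earn is not None and earn > 0:
            rows.append((rec["pik"], rec["sein"], rec["seinunit"], rec["year"], q, earn))
    return rows


def boundary_flags(earnings: Dict[Quarter, float], year: int, quarter: int) -> Tuple[int, int]:
    """Return (b, e) for one job in quarter t under the assumed QWI convention."""
    prev_q = (year - 1, 4) if quarter == 1 else (year, quarter - 1)
    next_q = (year + 1, 1) if quarter == 4 else (year, quarter + 1)

    def worked(key: Quarter) -> bool:
        return earnings.get(key, 0) > 0

    this_q = (year, quarter)
    b = int(worked(prev_q) and worked(this_q))  # employed at the start of t
    e = int(worked(this_q) and worked(next_q))  # employed at the end of t
    return b, e


# The example job from the slide: earnings in 2005 Q1 and Q2 only.
record = {"pik": "0123456789", "sein": "ABC", "seinunit": "00000", "year": 2005,
          "earn_q1": 1245, "earn_q2": 1315, "earn_q3": None, "earn_q4": None}
long_rows = wide_to_long(record)
earnings = {(row[3], row[4]): row[5] for row in long_rows}
print(boundary_flags(earnings, 2005, 2))  # (1, 0)
```

Under this reading, the slide's b = 1, e = 0 describes 2005 Q2: the job was already active in Q1 but has no earnings in Q3.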
EHF Auxiliary Files
• Auxiliary PHF (person-history file): PIK-SEIN-SEINUNIT
  – Wide file, entire job history arrayed out
  – Employment indicator flag (character string, 0001111000…)
  – Earnings for each quarter
• Auxiliary Unit History File (UHF)
  – SEIN-SEINUNIT history of establishment activity, based on ES202
• SEIN History File (SHF)
  – SEIN history of firm activity, based on ES202
• List of unique PIKs (ever appeared in a state)
• Control totals: BLS public-use employment (private only) for equivalent time period, can be used for weighting
LEHD SNAPSHOT: ICF
ICF
• PIK level file
• Contains person level demographic information, primarily from SSA applications for a SSN.
  – Sex
  – Race
  – DOB
  – POB
  – Education
• Completed files: missing characteristics have been imputed (flagged, and multiple imputes available in auxiliary files)
  – Imputation rate particularly high for Education
• Recently, higher quality imputations, especially important for education, have been developed. In use at LEHD, but will only be present in the next Snapshot (S2011)
ICF Edits and Imputations
• Completed files: missing characteristics have been imputed (flagged, and multiple imputes available in auxiliary files)
  – Impute rate particularly high for education
  – The S2004 and S2008 education variable is not recommended for use
• Recently, higher quality imputations, especially important for education, have been developed.
  – In use at LEHD, underlie RH and SE tabulations
  – Will be part of S2011
ICF Auxiliary Files (S2004, S2008)
• Auxiliary files contain array of multiple imputes for those PIKs that had some missing data
  – icf_zz_implicates_age_sex.sas7bdat
  – icf_zz_implicates_county.sas7bdat
  – icf_zz_implicates_education.sas7bdat
• The corresponding field of the ICF holds either the observed value or implicate 1
ICF Auxiliary Files (S2011)
• Auxiliary files contain array of multiple imputes for those PIKs that had some missing data
  – icf_us_implicates_age_sex_pob.sas7bdat
  – icf_us_implicates_addresses.sas7bdat
  – icf_us_implicates_education.sas7bdat
  – icf_us_implicates_race_ethnicity.sas7bdat
• Flag for each variable indicates
  – 1: observed value, no implicates
  – 2: imputed value, implicate 1, implicates 1-10 available
  – 3: observed value, but low-quality, implicates 1-10 available
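Because the implicate files carry several imputed values per PIK, an analysis is normally run once per implicate and the results are then combined. The sketch below uses Rubin's standard combining rules for a scalar estimate. It is an illustration under assumptions, not a procedure prescribed by the slides; the function name and the example numbers are made up.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def rubin_combine(estimates: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Combine per-implicate estimates and their sampling variances.

    Returns (combined estimate, total variance), where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two implicates and one variance per estimate")
    q_bar = mean(estimates)       # combined point estimate
    within = mean(variances)      # average within-implicate variance
    between = variance(estimates) # between-implicate variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical results from three implicates of the same regression coefficient.
estimate, total_var = rubin_combine([0.52, 0.48, 0.50], [0.0004, 0.0004, 0.0004])
```

Using only implicate 1 (the value stored on the main ICF) ignores the between-implicate term and therefore understates the uncertainty of estimates that depend on imputed values.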
LEHD SNAPSHOT: ECF
ECF
• SEIN SEINUNIT YEAR QUARTER files and SEIN YEAR QUARTER files.
• Contains establishment and firm level information.
  – Location
  – Industry
  – Firm Size (employment and payroll)
  – Public / Private
• QCEW-derived point-in-time measures of employment are available for the 12th of every month.
ECF Edits and Imputations
• Completed file: all missing information is imputed, through a variety of methods
  – NAICS/SIC through bi-directional empirical (probabilistic) crosswalks
  – Employment measures based on alternate employment measures (previous quarter, UI <-> QCEW measures)
  – Establishment location based on distribution of similar establishments within state
• Contrary to most LEHD files, no multiple imputation
• Abundance of flags to undo most imputations
  – Since S2008, in an auxiliary file
Some Differences across Snapshots
• S2004:
  – SEIN-level file + SEINUNIT-level file
  – Almost exact copy of file in production: aimed at easy tabulation of QWI
  – Way too many variables
• S2008:
  – Significant effort to make file user-friendly
  – Variable names restructured to distinguish source, better labels
  – Many auxiliary variables shunted into a “…_aux” file
  – Cleanup: all (and only) SEIN-level variables on SEIN file, and all (and only) SEINUNIT-level variables on SEINUNIT file…
• S2011:
  – Same structure as S2008, plus EIN-level firm-age/size variables
LEHD SNAPSHOT: ADDITIONAL FILES
QWI Establishment-level
• Establishment-level (YEAR-QUARTER-SEIN-SEINUNIT) tabulations
• Same statistics as public-use QWI (accessions, separations, etc., by demography), but at the establishment level
• Already incorporate U2W multiple records
Addresses: GAL
• GAL is a list of all unduplicated addresses from
  – Business Register
  – ACS (place of work)
  – AHS
  – ES-202
  – Census Master Address File (MAF)
• All establishment addresses from LEHD Infrastructure have been geocoded to the most accurate level possible
• GAL does not contain residential addresses (see ICF T26 components)
• Cross-walk to the ECF is available
STRUCTURE OF LEHD SNAPSHOT
Internal Consistency of LEHD Infrastructure
• LEHD Infrastructure is constructed to be internally consistent
  – Firms on EHF = Firms on ECF (superset of UI and ES202)
  – Individuals on EHF = Individuals on ICF
• … and complete: all missing data are imputed or edited
  – Addresses (generally to block levels, at least to county level)
  – Periodically missing information on firms
  – (research) periodically missing information on individuals/jobs, due to firm or state-level non-reporting.
Working with Files
• LEHD Infrastructure files are huge when compared to regular research files. In the S2004 version, in all existing states and years combined, there are
  – 6,100,912,201 wage records
  – 754,775,697 unique jobs
  – 226,639,116 quarterly observations on firms
  – Total size of all datasets is about 1.5TB
• Careful planning is required to ensure that adequate resources are available.
• Careful programming is required to make analysis feasible.
Working with Files
• Random variables can be used to select a sub-sample of persons, establishments, or firms.
  – ECF SEIN level: sample_sein
  – ECF SEINUNIT level: sample_seinunit
  – ICF: substr(PIK,1,2)
  – No equivalent variable provided for EHF (jobs)
• Industry or other person / firm characteristics can also be used to create a smaller analysis dataset.
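For the ICF, the slide lists substr(PIK,1,2) among the sampling variables. A minimal Python sketch of that idea follows. It assumes that the first two PIK characters are digits and are effectively random across persons, so that keeping values below a threshold yields roughly that percentage of persons; the function names and record layout are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def in_pik_sample(pik: str, percent: int = 5) -> bool:
    """Keep a PIK when its first two digits, read as 00-99, fall below `percent`."""
    return int(pik[:2]) < percent


def sample_records(records: Iterable[Dict[str, str]], percent: int = 5) -> Iterator[Dict[str, str]]:
    """Lazily yield only the records whose PIK falls in the sample.

    A generator avoids materializing the full file, much like a view.
    """
    for rec in records:
        if in_pik_sample(rec["pik"], percent):
            yield rec


people = [{"pik": "031234567"}, {"pik": "471234567"}, {"pik": "001234567"}]
kept = [rec["pik"] for rec in sample_records(people, percent=5)]
print(kept)  # ['031234567', '001234567']
```

Because the selection depends only on the PIK, the same persons are selected in every file that carries a PIK, which keeps person-level merges consistent across subsamples.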
Sample Program: On-the-fly Sampling
%let state=tx;
libname INLIB "/mixedtmp/lehd/s2004/ecf/&state./";
/* Define a view so that the subsample is selected on the fly when the data are read */
data mydata/view=mydata;
  set INLIB.ecf_&state._seinunit (where=(sample_seinunit <= 0.05));
run;
/* The regression reads only the selected establishments through the view */
proc reg data=mydata;
  model y= x w z;
run;
Common Errors
• Administrative (SEIN or EIN) firms are not the same as an economic firm.
  – Firms are dynamic entities.
  – Single versus multi-unit firms.
  – SEIN firms are state-based entities.
• EHF earnings are at the SEIN (firm) level, while the ECF contains data at the establishment level.
  – U2W can be used to assign workers to an establishment using a multiply-imputed assignment
• Multi-unit reporting is required, but there is no penalty for not breaking out establishments.
Warnings
• BRB bridge links CES Economic data to LEHD at the EIN STATE COUNTY SIC2 level. Multi-unit establishments may not link one to one.
  – LBD Bridge links by NAICS, but does not overlap with BRB
• Current education variable (S2004, S2008) is of very low quality.
• ECF (LEG) and GAL are not internally consistent on S2004 version of the snapshot (corrected in S2008).
• GAL users must handle missing location information themselves.
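The first warning is easier to see with a toy example. The sketch below links two lists of establishments on the key named on the slide (EIN, state, county, two-digit SIC). The record layout, field names, and function name are hypothetical, and the data are invented; the point is only that several units sharing a key on both sides produce every pairwise combination.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, str]


def bridge_key(rec: Record) -> Tuple[str, str, str, str]:
    """The linking key: EIN, state, county, two-digit SIC."""
    return rec["ein"], rec["state"], rec["county"], rec["sic2"]


def link_on_bridge_key(left: List[Record], right: List[Record]) -> List[Tuple[Record, Record]]:
    """Return every (left, right) pair that shares the bridge key (many-to-many)."""
    index = defaultdict(list)
    for rec in right:
        index[bridge_key(rec)].append(rec)
    return [(l, r) for l in left for r in index.get(bridge_key(l), [])]


# Two LEHD units and two Census establishments of one firm in the same county and industry.
lehd = [
    {"id": "L1", "ein": "11", "state": "NY", "county": "109", "sic2": "80"},
    {"id": "L2", "ein": "11", "state": "NY", "county": "109", "sic2": "80"},
]
ces = [
    {"id": "C1", "ein": "11", "state": "NY", "county": "109", "sic2": "80"},
    {"id": "C2", "ein": "11", "state": "NY", "county": "109", "sic2": "80"},
    {"id": "C3", "ein": "11", "state": "NY", "county": "061", "sic2": "80"},
]
pairs = link_on_bridge_key(lehd, ces)
print(len(pairs))  # 4
```

Two units on each side yield four candidate pairs, so any establishment-level analysis has to decide how to weight or resolve them.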
Special Disclosure Avoidance Rules
• In general, only model output is allowed
• Any tabulations are subject to QWI rules
  – National, and aggregates of states are probably acceptable, but may require review by the Census Disclosure Review Board (DRB) – should be addressed in planning phase
  – Tabulations may require access to noise-infusion data, not available on Snapshot – may require inquiry/support from LEHD staff
INFO 7470/ECON 7400/ILRLE 7400
Register-based statistics
John M. Abowd and Lars Vilhuber
March 4, 2013
What Is Different?
• “Classical” Census of population and housing: field enumeration
• “Register” Census: partially or wholly based on administrative data (“registers”)
What Is a Register?
• A register aims to be a complete list of objects
• An administrative register is an (exhaustive) list of objects to be administered
• A statistical register is a transformation of an administrative register for statistical purposes
  Wallgren and Wallgren (2007)
• A register is a repository which stores information about statistical units and is directly updated in the course of events affecting the statistical units
  EU (2011)
What Is a Register-based Census?
• A register-based Census uses data from the administrative records to “fill the forms”
• Most census frames are already maintained through a combination of enumeration and administrative records
  – Population and Housing frames (Session 2)
  – Business and establishment frames (Session 3)
Register-based Censuses
• Many European censuses of population and housing make “growing use of information from administrative registers to compare, complete or even replace information obtained from ‘classical’ field enumerations” (EU, 2011)
  – 2001 censuses: 7 countries used administrative registers as one of the data sources
  – 2011 censuses: 16 countries
Dutch Virtual Census 2001
• Makes use of the Social Statistical Database (SSD)
  – “main source of the 2001 Census” (Statistics Netherlands, 2004)
• Last complete enumeration was in 1971
• Census 2001 tabulated in 2003
• Estimate of the cost of “traditional” Census: $300 million
  – The Virtual Census cost $3 million
Some Details
• Justified by
  – Declining response rate of the population
  – Efficiency
• Brings with it
  – Larger acceptance (claimed)
• Background
  – 1991: Census law “rescinded” – no further legal obligation to produce census!
  – 2003: new statistical law
  – Creation of 2001 Census on a “voluntary” basis: “inconceivable [to] (…) not compile census data (…) like all other European countries”
Issues in the Netherlands Virtual Census
• Issues
  – Start of data work was later than a traditional census – retrospective rather than contemporaneous – waited for all registers to be “complete”
  – But: Census was completed earlier than in comparable European countries
  – Some loss of detail (not present in registers)
  – Source: combination of all municipal population registers
  – Combined with select surveys (Labor Force Survey, etc.)
So is This New?
• For population census in the US:
  – 2000: creation of StARS 1999 (Statistical Administrative Records System)
    • Use in AREX 2000 evaluation (Asher & Fienberg, 2001, 2002)
    • Successor systems under evaluation for use in 2020 Census
• For establishment censuses in the US
  – Economic Census: Most small employers and all non-employers were not sent forms: any data comes from “administrative records of other federal agencies” [2007]
  – QCEW: entirely based on administrative data
So Not Really So New
• For job statistics
  – Since 2003, LEHD – the only “job census” – has created a register of (almost) all jobs
    • Combination of state registers of jobs (but purpose of register is not to count jobs!)
    • Use of administrative information on demographic characteristics, complemented with imputes
Why Not Use It for 2020 Census?
• Research into this is ongoing
• Problem #1: the US doesn’t actually have a Population Register
  – But birth, death, and migration records used for intercensal estimates
• Register-based Censuses also have issues
  – Timeliness may not be better than enumeration
  – Coverage still needs to be assessed
    • Does the register cover all the entities that the census has to cover? (divergence in purposes)
  – Not all variables exist in administrative data
  – Almost always, requires matching of multiple sources -> new source of potential errors
Privacy
• In the Netherlands’ case, Statistics Netherlands claimed higher acceptance for the use of administrative records than willingness to provide survey response
• In the US, some evidence suggests an increase in concerns about privacy (Singer, Bates, van Hoewyk, 2011; Gates, 2011)
Ongoing Work for 2020 Census
• Based on Vitrano & Chapin (2012)
• First Census where administrative records are being considered to replace or supplement respondent data
• 2010 Census Match Study, or the “Virtual Census 2010” – same methods as in the Dutch Virtual Census
  – Compare coverage of addresses
  – Compare coverage of persons
  – Compare characteristics of persons
• Better Census coverage measurement
  – Replace capture-recapture with multi-source measurement
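As background for the last bullet, capture-recapture in its simplest two-list form (the Lincoln-Petersen estimator) can be written in a few lines. This is a textbook sketch with invented numbers, not the Census Bureau's coverage-measurement procedure; the function name is hypothetical.

```python
def lincoln_petersen(n_first: int, n_second: int, n_both: int) -> float:
    """Estimate the population size from two independent lists.

    n_first:  units counted in the first list (e.g. the census)
    n_second: units counted in the second list (e.g. a follow-up survey)
    n_both:   units found in both lists
    """
    if n_both <= 0:
        raise ValueError("at least one unit must appear in both lists")
    if n_both > min(n_first, n_second):
        raise ValueError("the overlap cannot exceed either list")
    return n_first * n_second / n_both


# Invented example: 900 persons enumerated, 100 in the survey, 90 matched.
population = lincoln_petersen(900, 100, 90)
coverage = 900 / population
print(population, coverage)  # 1000.0 0.9
```

The estimator relies on the two lists being independent and on error-free matching; a multi-source approach replaces the single second list with several administrative sources.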
Ongoing Work for 2020 Census
• Alternate sources, including commercial sources
• Better non-response follow-up (NRFU) to reduce cost (most expensive part of the operation)
• Use of administrative records to improve alternate response methods (internet)
• Improving the address frame (see Session 7)
INFO 7470/ECON 7400/ILRLE 7400
Health Data
John M. Abowd and Lars Vilhuber
March 4, 2013
Defining “Health” Data
• Data on individuals' health
  – By age (elder, infants)
  – Disability-related
  – Nutrition
  – Etc.
• Data on health providers
  – Insurance access and utilization
  – Access or utilization of doctors, hospitals, etc.
Scope Here
• Health data related to disability
• Health data in the RDC network
• Other restricted-access health data
Long List of Disability-related Data
• 2000 Decennial Census Long Form: six disability-related questions, and many other questions relevant to participation (e.g., living arrangements, employment, income, transportation utilization, and housing).
• American Community Survey (ACS): since 2003, disability questions reworded in 2008 to new OMB-coordinated standard (also CPS, soon SIPP)
Disability-related Data (2)
• CPS: Prior to 2008, only work limitation measure; since 2008, same question as ACS.
  – Contains extensive information on employment, income, insurance, living arrangements, family status
  – Insurance coverage, marital status, and parental status for youth and young adults with work limitations
Health and Retirement Study (HRS)
• A nationally representative sample of more than 22,000 Americans over the age of 50, surveyed every two years.
• Subjects:
  – antecedents and consequences of retirement;
  – health, health care, income, and wealth over time;
  – work disability, declining health, and institutionalization
  – chronic illness, functional ability, depression, and self-assessed health status,
  – health-related behaviors such as smoking, alcohol use, and exercise
  – cognition, demographics, family relationships, occupations, employment, employer accommodation, and economic circumstances
• Run by Institute for Social Research (ISR)/University of Michigan under a cooperative agreement with the NIA
  http://hrsonline.isr.umich.edu/
HRS (2)
• Restricted-access linkages
  – SSA earnings
  – Detailed geography and industry
  – Medicare claims and summary
• Access and merge to RDC data is feasible, but needs authorization from HRS
Behavioral Risk Factor Surveillance System (BRFSS)
• Annual state-based health survey
  – health risk behaviors,
  – clinical preventative practices,
  – health care use and access focused on chronic disease and injury.
• Collaborative effort between CDC and the state health departments
• Cross-sectional telephone survey that collects data on approximately 350,000 non-institutionalized adults (18 and older).
National Health Interview Survey
• Annual nationally representative cross-sectional survey of approximately 100,000 non-institutionalized civilians
• Conducted by CDC
• Topics/data
  – Highly detailed information on health, functional status, and activity limitations.
  – Education, employment, marriage, parenting, and family income
National Health Interview Survey
• Random selection of disability-related research using NHIS: Stapleton, Livermore, and Kennell (2002); Burkhauser, Houtenville, and Wittenburg (2003); Burkhauser et al. (2002); Kaye (2002); Trupin et al. (1997), Horvath-Rose and Stapleton (2004)
NHIS Linkages - SSA
• 1994-2005 linked
• Numident (limited demographics)
• Master Beneficiary Record (MBR)
• Supplemental Security Record (SSR)
• Payment History Update System (PHUS)
• Reduced sample size, ~ 64,000 in linked file compared to regular analysis files
  – 75% linkage rate in 1994 (92% of eligible), declined to 45% in 2005 (76% of eligible)
• Example analysis: comparisons between self-reports of disability and diagnosis codes contained in the SSA data
NHIS Linkages (CMS)
• CMS Medicare Enrollment and Claims Files
  – CMS Medicare data (1991-2007)
  – Linked to
    • 1994-1998 National Health Interview Survey (NHIS)
    • 1999-2005 National Health Interview Survey (NHIS)
Medical Expenditure Panel Survey (MEPS)
• Set of large-scale surveys of families and individuals, their medical providers (e.g., doctors, hospitals, and pharmacies), and employers
• Household Component (HC) of the MEPS collects data from a sample of families and individuals in selected communities across the United States, drawn from a nationally-representative sub-sample of households that participated in the prior year's NHIS
• Approximately 22,000 individuals
Medical Expenditure Panel Survey (MEPS)
• Topics:
  – Specific health services, frequency of use
  – Cost of these services, how they are paid for
  – Cost, scope, and breadth of health insurance held by and available to U.S. workers
  – Demographic characteristics, health conditions, health status, income, and employment
MEPS-MPC
• MEPS-HC is supplemented by information provided by the Medical Provider Component (MPC)
  – Covers hospitals
  – Physicians
  – Home health care providers
  – Pharmacies
• Identified by household respondents
NHANES
• National Health and Nutrition Examination Survey, since the 1960s
• Topics/Data:
  – Demographic, socioeconomic, dietary, and health-related questions
  – Examination component consists of medical, dental, and physiological measurements
  – Laboratory tests
NHANES linkages
• EPA data (environment on health)
• Using restricted-access data:
  – Lower level geography
  – Indirect identifiers
Medicare Current Beneficiary Survey (MCBS)
• Non-public-use survey of about 12,000 Medicare beneficiaries at any point in time
• Conducted since 1991 by Centers for Medicare and Medicaid Services (CMS)
• Overlapping-cohorts structure, 3 contacts/year, any single individual at most four years
• Nationally representative of all Medicare beneficiaries, including those who are aged, disabled, and institutionalized
Medicare Current Beneficiary Survey (MCBS)
• Data/topics:
  – Health status, health care use and expenditures, and health insurance coverage of Medicare beneficiaries
  – Employment and demographic characteristics
• The MCBS is split into two components: Access to Care, and Cost and Use files
HEALTH PROVIDER DATA (EXAMPLE)
State-specific Data
• Example: Gruber and Kleiner (2012), American Economic Journal: Economic Policy, “Do Strikes Kill? Evidence from New York State” (also NBER 2010 w15855)
• Short-term non-federal hospitals in New York State are required to submit discharge data to the New York State Department of Health through the Statewide Planning and Research Cooperative System (SPARCS)
SPARCS
• Patient level, detailed data on
  – Patient characteristics (e.g. age, sex, race),
  – Diagnoses (several DRG and ICD-9 codes),
  – Treatments (several ICD-9 codes),
  – Services (accommodation), and
  – Total charges
• For every hospital discharge in New York State since 1982.
Gruber and Kleiner (2010)
• Combine with detailed strikes data
• Controlling for hospital-specific heterogeneity, patient demographics and disease severity, the results show that nurses’ strikes increase in-hospital mortality by 19.4% and 30-day readmission by 6.5% for patients admitted during a strike, with little change in patient demographics, disease severity or treatment intensity
INFO 7470/ECON 7400/ILRLE 7400
Alternate Data Sources of the 21st Century
Rising Cost of Decennial Census
(Figure; source: Vitrano and Chapin, 2012)
Where Are Official Statistics Heading?
• On the negative side:
  – Increasing (?) privacy and confidentiality concerns
  – Decreasing response rates
    • Canada’s Census long form is no longer compulsory
  – Increasing costs
• On the positive side:
  – Far more data available in general – petabytes per minute
  – “Big Data” is of commercial interest – ability to buy-in data
Consider Mapping
• In the 1980s
  – Census was the key provider of detailed demographic maps (TIGER)
• In 2013
  – Navteq, Google, Bing, etc., etc. provide highly detailed maps
  – Private provision of satellite imagery (SPOT, 1986)
    • GeoEye-1 provides 0.50m resolution for commercial customers
Google Flu Index
• http://www.google.org/flutrends/
Google Flu Index
• Compares to CDC’s flu tracking http://www.cdc.gov/flu/weekly/overview.htm
Google Flu Index
• How it works:
  – Uses search terms for flu-related topics
  – Validated against historical “high-quality” data (in the US: CDC, see doi:10.1038/nature07634)
• Advantages:
  – Very current (vs. some lag at CDC)
• Disadvantage:
  – May get side-tracked by changes in behavior/indicators
Google Trends by Google
Latest Google Trends
(Figure; source: Butler in Nature, Vol 494, Issue 7436)
The Use of Non-traditional Data in Statistics
• Powerful, timely
• Power obtained in part through validation against “old-style” statistics
• Difference between
  – “raw data” (Google search terms)
  – First-order analytics (Google Flu trends)
  – Representativeness of the data – deep analysis and interpretation
Google Price Index
• Tracks database of web shopping data
• The GPI shows a “pretty good correlation” with the CPI for goods such as cameras and watches that are often sold on the web, but less so for others, such as car parts, that are infrequently traded online. (FT 2010)
• General approach: Choi and Varian (2011)
Billion Prices Project
• http://bpp.mit.edu
  – Alberto Cavallo and Roberto Rigobon (MIT)
• Methodology
  – Collected every day from online retailers
  – DB contains prices on the full array of products sold by these retailers, product descriptions, package sizes, brands, special characteristics (e.g., “organic”), and whether the item is on sale or under price control.
Billion Prices Project
• Availability
  – PriceStats commercially sells data through State Street (for instance, used in the Economist for Argentina)
  – Research data available with a lag
  – Do not claim to cover 100% of CPI of goods and services (depend on online availability)
  – “Standard NSF funding would cover the costs for 1 day of web scraping”
Billion Prices Project
• Use of similar methods by BLS to supplement CPI (Amstat article)
  – Quality adjustment models for televisions, camcorders, cameras, and washing machines.
  – Investigating retail scanner data for use in CPI
Twitter as a Statistical Data Source
• Work at the Michigan NCRN node by Matthew D. Shapiro and co-authors
• Use Twitter activity to create indicators of new job loss
• Benchmarked to weekly UI claims
• Critical: classification, search phrases
Initial UI Claims vs. Twitter Signal
(Figure; source: excerpt from presentation by Shapiro at NCRN annual PI meeting)
New Sources, New Methods, …
• New unstructured data
• Convergence of qualitative and quantitative sciences
• Needs traditional statistics (registers!), but role may increase over time
  – All of the uses here are validating against official statistics where available
• Can supplement or replace official statistics where the latter are of lower quality (BPP for Argentina!)
… New Skill Sets
• Social scientists need to learn new tools
• INFO7470 students 2013
• Who knows how to leverage a 15,000 node Hadoop cluster?
Tool | None | Advanced
SAS | 31% | 11%
Stata | 25% | 16%
R | 56% | 4%
Python | 87% | 1%
C, Fortran | 67% | 2%