INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013


© John M. Abowd and Lars Vilhuber 2013, all rights reserved


Overview

• Continuation of LEHD micro data
• Register-based statistics
  – LEHD
  – Census 2010
  – Elsewhere in the world
• Health data in the United States
• Novel statistical sources


INFO 7470/ECON 7400/ILRLE 7400: LEHD RDC Data Products

Lars Vilhuber, Cornell University

Based on joint work with Kevin McKinney, U.S. Census Bureau

March 2013


LEHD Infrastructure in the Census Research Data Centers

• Big-picture: http://www.vrdc.cornell.edu/news/lehd-infrastructure-files-in-the-census-rdc-overview/ (overview_master_zero_obs.pdf)

• Contains
  – Detailed file descriptions
  – Attached zero-observation versions of all datasets


Important Features of RDC LEHD Data

• No public-use QWI information (non-perturbed data only)

• No information related to the disclosure-avoidance measures used in QWI and OTM

• Certain lag relative to the publicly available QWI
  – Currently: Snapshot 2008
  – Scheduled for 2013: Snapshot 2011

• Not all states have given permission to use their data for non-core research purposes


Treatment of Federal Tax Information

• Some components have Title-26 protected variables

• Census Bureau Policy office requires IRS approval for all LEHD projects.

• The LEHD Snapshot isolates T26 data, which may allow projects that do not need IRS approval to use LEHD data


Data Flow View of the LEHD Infrastructure


Name | T26 component | Notes
Business Register Bridge (BRB) | (all) |
Employer Characteristics File (ECF) | ECFT26 | CA EIN. Also: do not contain firm names
ES-202 (QCEW) | | Not available to researchers, but may be feasible for special projects
Individual Characteristics File (ICF) | ICFT26 | IRS 1040 residence addresses (only 1999 address, S2011 will have longitudinal addresses)
Geocoded Address List (GAL) | GALT26 | BR records


Name | T26 component | Notes
Quarterly Workforce Indicators (QWI, establishment-level files), “QWI Seinunit”, internally known as UFF_B | | Unfuzzed
Successor-Predecessor Files (SPF) | | Preliminary access (documentation is beta quality)
Unit-to-Worker Impute (U2W) | | Multiply-imputed files, should use multiple imputation analysis techniques


Determining Data Availability

• Use overview_master_zero_obs.pdf
• Table 1.2 – States for which ANY files are available (permissions given). Caution: the data underlying this is in flux as of 2012/2013
• Table 1.3 – Files that exist for each state (EHF, ECF, and ICF are the core processes)
• Cross-reference Table 1.2 with Table 1.3
• Each process has a table with the available time periods. For example, see Table 4.7 on page 127 for the EHF.
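
Cross-referencing the two tables amounts to a set intersection with a check on required files. The following is a minimal sketch in Python; the state codes and file lists are invented placeholders, not the contents of Tables 1.2 and 1.3, and the function name is hypothetical.

```python
# Hypothetical stand-ins for Table 1.2 (states with permission) and
# Table 1.3 (files that exist per state). All values are invented.
permitted_states = {"aa", "bb", "cc"}
files_by_state = {
    "aa": {"EHF", "ECF", "ICF"},
    "bb": {"EHF", "ECF"},
    "dd": {"EHF", "ECF", "ICF"},
}
CORE = {"EHF", "ECF", "ICF"}  # the core processes named on the slide


def usable_states(permitted, files, required=CORE):
    """Return states that are permitted AND have every required file."""
    return sorted(
        st for st in permitted
        if required <= files.get(st, set())
    )


print(usable_states(permitted_states, files_by_state))
```

A state listed in only one of the two tables drops out, which is the point of checking both before planning a project.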


Identifiers

• In general, linkages between the different files occur using deterministic match-merge techniques
• Person, firm, and establishment identifiers link all LEHD Infrastructure files among themselves.
• External linkages are generally probabilistic:
  – Linkages to BR, LBD, etc.
    • Link using BRB/LBDB: many-to-many match of establishments
    • Link using upcoming (S2011) ECF: using Census “alpha” or EIN: firm-level to collection of establishments on both sides
  – Linkages to external files by establishment location
  – Linkages to external files by firm name (special projects only)
• Some external linkages seem deterministic
  – Linkages to demographic data are based on PIK, but PIK-assignment is not deterministic
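
A deterministic match-merge joins records on an exact key, with no scoring or thresholds. The sketch below illustrates the idea in Python on toy records; the field names follow the slides, but the values, the function name, and the choice to drop unmatched rows are assumptions made for illustration.

```python
# Toy job records (EHF-like) and person records (ICF-like). Values are invented.
ehf = [
    {"pik": "000000001", "sein": "ABC", "year": 2005, "earn": 1245},
    {"pik": "000000002", "sein": "ABC", "year": 2005, "earn": 900},
    {"pik": "000000001", "sein": "XYZ", "year": 2005, "earn": 300},
]
icf = [
    {"pik": "000000001", "sex": "F"},
    {"pik": "000000002", "sex": "M"},
]


def match_merge(left, right, key):
    """Exact-key (deterministic) many-to-one merge.

    Each left row picks up the single right row sharing its key.
    Left rows without a match are dropped (an illustrative choice).
    """
    lookup = {}
    for row in right:
        if row[key] in lookup:
            raise ValueError(f"duplicate key on right side: {row[key]}")
        lookup[row[key]] = row
    return [{**l, **lookup[l[key]]} for l in left if l[key] in lookup]


merged = match_merge(ehf, icf, "pik")
for row in merged:
    print(row)
```

A probabilistic link, by contrast, would compare several noisy fields (name, address) and accept pairs above a score threshold, which is why the external linkages listed above can produce many-to-many matches.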


Individual Identifier System (PIK)

• All Social Security Numbers (SSN) have been replaced by Protected Identification Key (PIK) – no SSNs are available anywhere in these data.

• A PIK is a unique 9-digit number that maps to one and only one SSN.

• A PIK is permanently assigned to an SSN, allowing the same types of analyses to be performed, but with greatly improved confidentiality protection.

• Used widely for person-level files within the Census RDC


Additional Identifiers

• Survey IDs
  – CPS
  – SIPP
  – Census 2000
  – ACS

• In general, PIK is on the file, or available as a separate crosswalk

• Note: LEHD ICF contains CPS and SIPP IDs, but they are not the most current – use/request the crosswalks


Firm/Establishment Identifiers

• Firm identifiers are called State Employer Identification Number (SEIN) and generally reflect an entity reporting unemployment insurance (UI) data and taxes to state authorities.

• “Establishments” (more precisely: reporting units or workplaces) are identified by a combination of SEIN and reporting unit (SEINUNIT).

• The firm and establishment identifiers are state-specific: within the LEHD Infrastructure, there is no method of linking units of a nation-wide firm across state borders.

• Federal EIN is available
  – on ECF for most states,
  – on ECFT26 for California

• CFN available on BRB


Reminder: Entities in LEHD Data

• “Person” = PIK
• “Firm” = SEIN (within state), EIN (cross-state)
• “Establishment” = SEIN || SEINUNIT
• “Job” = PIK || SEIN || SEINUNIT
• Smallest observed time unit: calendar year quarter
• Inferred points in time: boundaries of calendar year quarters
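
The entity definitions above translate directly into composite keys. Here is a small Python sketch under the assumption that identifiers are stored as strings; the type names and the `quarter_index` helper are hypothetical and are not part of the LEHD files.

```python
from collections import namedtuple

# Hypothetical key types mirroring the slide's entity definitions.
Establishment = namedtuple("Establishment", "sein seinunit")
Job = namedtuple("Job", "pik sein seinunit")


def quarter_index(year, quarter):
    """Map (year, quarter) to a running integer so adjacent quarters differ by 1."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1..4")
    return year * 4 + (quarter - 1)


job = Job(pik="0123456789", sein="ABC", seinunit="00000")
estab = Establishment(job.sein, job.seinunit)

print(job)
print(estab)
# Adjacent quarters across a year boundary are one step apart.
print(quarter_index(2006, 1) - quarter_index(2005, 4))
```

A running quarter index makes "previous quarter" and "next quarter" simple arithmetic, which matters because points in time are only inferred at quarter boundaries.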


LEHD SNAPSHOT: EHF


EHF

• PIK SEIN SEINUNIT YEAR files
  – But: Only Minnesota currently has SEINUNIT (establishment) identifiers.
• Contains person-firm (job) x year information.
  – Quarterly earning records
• No direct measure of labor force attachment; it is inferred by the presence of earnings.


EHF Structure

Wage Record (UI):

PIK | SEIN | SEINUNIT | YEAR | Quarter | Earnings
0123456789 | ABC | 00000 | 2005 | 1 | $1,245
0123456789 | ABC | 00000 | 2005 | 2 | $1,315

EHF record:

PIK | SEIN | SEINUNIT | YEAR | Earn Q1 | Earn Q2 | Earn Q3 | Earn Q4
0123456789 | ABC | 00000 | 2005 | $1,245 | $1,315 | |

b_ijt = 1, e_ijt = 0
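
The slide's example can be reproduced with a short Python sketch that collapses quarterly wage records into one row per job-year and derives the two indicators. The indicator definitions used here are an assumption, since the slide does not state them: b = 1 when earnings are positive in the previous and current quarter, e = 1 when they are positive in the current and next quarter. Function names are hypothetical.

```python
# Toy wage records in the long (UI) layout shown on the slide:
# (PIK, SEIN, SEINUNIT, YEAR, quarter, earnings)
wage_records = [
    ("0123456789", "ABC", "00000", 2005, 1, 1245),
    ("0123456789", "ABC", "00000", 2005, 2, 1315),
]


def to_wide(records):
    """Collapse quarterly records into one row per job-year with four
    earnings slots (None where no wage record exists)."""
    wide = {}
    for pik, sein, unit, year, qtr, earn in records:
        row = wide.setdefault((pik, sein, unit, year), [None] * 4)
        row[qtr - 1] = earn
    return wide


def positive(x):
    return x is not None and x > 0


def b_e_flags(earn, q):
    """Assumed definitions (not given on the slide):
    b = 1 if earnings in q-1 and q; e = 1 if earnings in q and q+1.
    Only handles quarters 2 and 3, where both neighbors are in the same year."""
    if q not in (2, 3):
        raise ValueError("this sketch only handles q = 2 or 3")
    b = int(positive(earn[q - 2]) and positive(earn[q - 1]))
    e = int(positive(earn[q - 1]) and positive(earn[q]))
    return b, e


wide = to_wide(wage_records)
earn_2005 = wide[("0123456789", "ABC", "00000", 2005)]
print(earn_2005)
print(b_e_flags(earn_2005, 2))
```

Under these assumed definitions, the second quarter of 2005 yields b = 1 and e = 0, matching the values on the slide: the job has earnings in Q1 and Q2 but none in Q3.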


EHF Auxiliary Files

• Auxiliary PHF (person-history file): PIK-SEIN-SEINUNIT
  – Wide file, entire job history arrayed out
  – Employment indicator flag (character string, 0001111000…)
  – Earnings for each quarter
• Auxiliary Unit History File (UHF)
  – SEIN-SEINUNIT history of establishment activity, based on ES202
• SEIN History File (SHF)
  – SEIN history of firm activity, based on ES202
• List of unique PIKs (ever appeared in a state)
• Control totals: BLS public-use employment (private only) for equivalent time period, can be used for weighting
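
An employment indicator stored as a character string, such as the one in the PHF, can be scanned for runs of consecutive quarters with employment. This is a minimal Python sketch, assuming one character per quarter with "1" meaning positive earnings; the function name and the zero-based offsets are illustrative choices, not part of the file layout.

```python
from itertools import groupby


def employment_spells(flags):
    """Return (start, end) zero-based quarter offsets for each run of '1's."""
    spells, pos = [], 0
    for value, run in groupby(flags):
        length = len(list(run))
        if value == "1":
            spells.append((pos, pos + length - 1))
        pos += length
    return spells


print(employment_spells("0001111000"))
```

For the string shown on the slide, the job has a single spell covering quarter offsets 3 through 6. Mapping offsets to calendar quarters requires the first quarter covered by the file, which this sketch does not assume.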


LEHD SNAPSHOT: ICF


ICF

• PIK level file
• Contains person level demographic information, primarily from SSA applications for a SSN.
  – Sex
  – Race
  – DOB
  – POB
  – Education

• Completed files: missing characteristics have been imputed (flagged, and multiple imputes available in auxiliary files)
  – Imputation rate particularly high for Education

• Recently, higher quality imputations, especially important for education, have been developed. In use at LEHD, but will only be present in the next Snapshot (S2011)


ICF Edits and Imputations

• Completed files: missing characteristics have been imputed (flagged, and multiple imputes available in auxiliary files)
  – Impute rate particularly high for education
  – The S2004 and S2008 education variable is not recommended for use
• Recently, higher quality imputations, especially important for education, have been developed.
  – In use at LEHD, underlie RH and SE tabulations
  – Will be part of S2011


ICF Auxiliary Files (S2004, S2008)

• Auxiliary files contain array of multiple imputes for those PIKs that had some missing data
  – icf_zz_implicates_age_sex.sas7bdat
  – icf_zz_implicates_county.sas7bdat
  – icf_zz_implicates_education.sas7bdat

• Implicate 1 or observed value in corresponding field of ICF


ICF Auxiliary Files (S2011)

• Auxiliary files contain array of multiple imputes for those PIKs that had some missing data
  – icf_us_implicates_age_sex_pob.sas7bdat
  – icf_us_implicates_addresses.sas7bdat
  – icf_us_implicates_education.sas7bdat
  – icf_us_implicates_race_ethnicity.sas7bdat

• Flag for each variable indicates
  – 1 – observed value, no implicates
  – 2 – imputed value, implicate 1, implicates 1-10 available
  – 3 – observed value, but low-quality, implicates 1-10 available
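
When several implicates are available, a standard way to analyze them is to run the same analysis on each completed dataset and combine the results with Rubin's combining rules. The Python sketch below shows those rules on invented numbers; it is not taken from the LEHD documentation, and the function name is hypothetical.

```python
from statistics import mean, variance


def combine_implicates(estimates, variances):
    """Rubin's combining rules for m completed-data analyses.

    estimates: point estimate from each implicate
    variances: squared standard error from each implicate
    Returns (combined estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least 2 implicates with matching variances")
    q_bar = mean(estimates)        # combined point estimate
    within = mean(variances)       # average within-implicate variance
    between = variance(estimates)  # sample variance across implicates
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Invented results from five implicates of the same regression coefficient.
est = [0.50, 0.54, 0.46, 0.52, 0.48]
var = [0.010, 0.012, 0.011, 0.009, 0.008]
q, t = combine_implicates(est, var)
print(round(q, 4), round(t, 4))
```

Using only implicate 1, the value stored in the main ICF field, ignores the between-implicate component and understates the variance.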


LEHD SNAPSHOT: ECF


ECF

• SEIN SEINUNIT YEAR QUARTER files and SEIN YEAR QUARTER files.

• Contains establishment and firm level information.
  – Location
  – Industry
  – Firm Size (employment and payroll)
  – Public / Private

• QCEW-derived point-in-time measures of employment are available for the 12th of every month.


ECF Edits and Imputations

• Completed file: all missing information is imputed, through a variety of methods
  – NAICS/SIC through bi-directional empirical (probabilistic) crosswalks
  – Employment measures based on alternate employment measures (previous quarter, UI <-> QCEW measures)
  – Establishment location based on distribution of similar establishments within state
• Contrary to most LEHD files, no multiple imputation
• Abundance of flags to undo most imputations
  – Since S2008, in an auxiliary file


Some Differences across Snapshots

• S2004:
  – SEIN-level file + SEINUNIT-level file
  – Almost exact copy of file in production: aimed at easy tabulation of QWI
  – Way too many variables
• S2008:
  – Significant effort to make file user-friendly
  – Variable names restructured to distinguish source, better labels
  – Many auxiliary variables shunted into a “…_aux” file
  – Cleanup: all (and only) SEIN-level variables on SEIN file, and all (and only) SEINUNIT-level variables on SEINUNIT file…
• S2011
  – Same structure as S2008, plus EIN-level firm-age/size variables


LEHD SNAPSHOT: ADDITIONAL FILES


QWI Establishment-level

• Establishment-level (YEAR-QUARTER-SEIN-SEINUNIT) tabulations

• Same statistics as public-use QWI (accessions, separations, etc., by demography), but at the establishment level

• Already incorporate U2W multiple records


Addresses: GAL

• GAL is a list of all unduplicated addresses from
  – Business Register
  – ACS (place of work)
  – AHS
  – ES-202
  – Census Master Address File (MAF)

• All establishment addresses from LEHD Infrastructure have been geocoded to the most accurate level possible

• GAL does not contain residential addresses (see ICF T26 components)

• Cross-walk to the ECF is available


STRUCTURE OF LEHD SNAPSHOT


Internal Consistency of LEHD Infrastructure

• LEHD Infrastructure is constructed to be internally consistent
  – Firms on EHF = Firms on ECF (superset of UI and ES202)
  – Individuals on EHF = Individuals on ICF

• … and complete: all missing data are imputed or edited
  – Addresses (generally to block levels, at least to county level)
  – Periodically missing information on firms
  – (research) periodically missing information on individuals/jobs, due to firm or state-level non-reporting.


Working with Files

• LEHD Infrastructure files are huge when compared to regular research files. In the S2004 version, in all existing states and years combined, there are
  – 6,100,912,201 wage records
  – 754,775,697 unique jobs
  – 226,639,116 quarterly observations on firms.
  – Total size of all datasets is about 1.5TB

• Careful planning is required to ensure that adequate resources are available.

• Careful programming is required to make analysis feasible.


Working with Files

• Random variables can be used to select a sub-sample of persons, establishments, or firms.
  – ECF SEIN level: sample_sein
  – ECF SEINUNIT level: sample_seinunit
  – ICF: substr(PIK,1,2)
  – No equivalent variable provided for EHF (jobs)

• Industry or other person / firm characteristics can also be used to create a smaller analysis dataset.
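
For persons, the slide points to the first two characters of the PIK as the sampling device. The Python sketch below shows one way to turn that prefix into a percentage sample. It assumes the prefix consists of digits that are roughly uniform across persons, which is an assumption rather than a documented property, and the PIK values and function name are invented.

```python
def in_pik_sample(pik, percent=5):
    """Keep a person if the first two characters of the PIK, read as an
    integer from 0 to 99, fall below `percent`.

    Assumes the prefix is numeric and roughly uniform across persons.
    """
    return int(pik[:2]) < percent


# Invented 9-digit PIKs.
piks = ["031234567", "049876543", "050000000", "991234567"]
sample = [p for p in piks if in_pik_sample(p)]
print(sample)
```

Because the rule depends only on the PIK, applying the same filter to job-level records keeps every job of each sampled person, which offers a person-based workaround for the missing EHF sampling variable.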


Sample Program: On-the-fly Sampling

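%let state=tx;
libname INLIB "/mixedtmp/lehd/s2004/ecf/&state./";

/* Define a view (no data are copied) that keeps establishments with   */
/* sample_seinunit <= 0.05, about 5% if that variable is uniform (0,1) */
data mydata / view=mydata;
    set INLIB.ecf_&state._seinunit (where=(sample_seinunit <= 0.05));
run;

/* The subsample is drawn on the fly when the procedure reads the view */
proc reg data=mydata;
    model y = x w z;
run;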


Common Errors

• Administrative (SEIN or EIN) firms are not the same as an economic firm.
  – Firms are dynamic entities.
  – Single versus Multi-unit firms.
  – SEIN Firms are state based entities.

• EHF earnings are at the SEIN (firm) level, while the ECF contains data at the establishment level.
  – U2W can be used to assign workers to an establishment using a multiply-imputed assignment
• Multi-unit reporting is required, but there is no penalty for not breaking out establishments.


Warnings

• BRB bridge links CES Economic data to LEHD at the EIN STATE COUNTY SIC2 level. Multi-unit establishments may not link one to one.
  – LBD Bridge links by NAICS, but does not overlap with BRB

• Current education variable (S2004, S2008) is of very low quality.

• ECF (LEG) and GAL are not internally consistent on S2004 version of the snapshot (corrected in S2008).

• GAL users must handle missing location information themselves.


Special Disclosure Avoidance Rules

• In general, only model output is allowed
• Any tabulations are subject to QWI rules
  – National, and aggregates of states are probably acceptable, but may require review by the Census Disclosure Review Board (DRB) – should be addressed in planning phase
  – Tabulations may require access to noise-infusion data, not available on Snapshot – may require inquiry/support from LEHD staff


INFO 7470/ECON 7400/ILRLE 7400: Register-based statistics

John M. Abowd and Lars Vilhuber, March 4, 2013


What Is Different?

• “Classical” Census of population and housing: field enumeration

• “Register” Census: partially or wholly based on administrative data (“registers”)


What Is a Register?

• A register aims to be a complete list of objects
• An administrative register is an (exhaustive) list of objects to be administered
• A statistical register is a transformation of an administrative register for statistical purposes

Wallgren and Wallgren (2007)

• A register is a repository which stores information about statistical units and is directly updated in the course of events affecting the statistical units

EU (2011)


What Is a Register-based Census?

• A register-based Census uses data from the administrative records to “fill the forms”

• Most census frames are already maintained through a combination of enumeration and administrative records
  – Population and Housing frames (Session 2)
  – Business and establishment frames (Session 3)


Register-based Censuses

• Many European censuses of population and housing make “growing use of information from administrative registers to compare, complete or even replace information obtained from ‘classical’ field enumerations” (EU, 2011)
  – 2001 censuses: 7 countries used administrative registers as one of the data sources
  – 2011 censuses: 16 countries


Dutch Virtual Census 2001

• Makes use of the Social Statistical Database (SSD)
  – “main source of the 2001 Census” (Statistics Netherlands, 2004)

• Last complete enumeration was in 1971
• Census 2001 tabulated in 2003
• Estimate of the cost of “traditional” Census: $300 million
  – The Virtual Census cost $3 million


Some Details

• Justified by
  – Declining response rate of the population
  – Efficiency

• Brings with it
  – Larger acceptance (claimed)

• Background
  – 1991: Census law “rescinded” – no further legal obligation to produce census!
  – 2003: new statistical law
  – Creation of 2001 Census on a “voluntary” basis: “inconceivable [to] (…) not compile census data (…) like all other European countries”


Issues in the Netherlands Virtual Census

• Issues
  – Start of data work was later than a traditional census – retrospective rather than contemporaneous – waited for all registers to be “complete”
  – But: Census was completed earlier than comparable other European countries
  – Some loss of detail (not present in registers)
  – Source: combination of all municipal population registers
  – Combined with select surveys (Labor Force Survey, etc.)


So is This New?

• For population census in the US:
  – 2000: creation of StARS 1999 (Statistical Administrative Records System)
    • Use in AREX 2000 evaluation (Asher & Fienberg, 2001, 2002)
    • Successor systems under evaluation for use in 2020 Census

• For establishment censuses in the US
  – Economic Census: Most small employers and all non-employers were not sent forms: any data comes from “administrative records of other federal agencies” [2007]
  – QCEW: entirely based on administrative data


So Not Really So New

• For job statistics
  – Since 2003, LEHD – the only “job census” – has created a register of (almost) all jobs
    • Combination of state registers of jobs (but the purpose of the register is not to count jobs!)
    • Use of administrative information on demographic characteristics, complemented with imputes


Why Not Use It for 2020 Census?

• Research into this is ongoing
• Problem #1: the US doesn’t actually have a Population Register
  – But birth, death, and migration records used for intercensal estimates

• Register-based Censuses also have issues
  – Timeliness may not be better than enumeration
  – Coverage still needs to be assessed
    • Does the register cover all the entities that the census has to cover? (divergence in purposes)
  – Not all variables exist in administrative data
  – Almost always, requires matching of multiple sources -> new source of potential errors


Privacy

• In the case of the Netherlands, Statistics Netherlands claimed higher acceptance for the use of administrative records than willingness to provide survey responses

• In the US, some evidence suggests an increase in concerns about privacy (Singer, Bates, van Hoewyk, 2011; Gates, 2011)


Ongoing Work for 2020 Census

• Based on Vitrano & Chapin (2012)
• First Census where administrative records are being considered to replace or supplement respondent data
• 2010 Census Match Study, or the “Virtual Census 2010” – same methods as in the Dutch Virtual Census
  – Compare coverage of addresses
  – Compare coverage of persons
  – Compare characteristics of persons

• Better Census coverage measurement
  – Replace capture-recapture with multi-source measurement
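
For context on what would be replaced, the simplest capture-recapture (dual-system) estimator is the Lincoln-Petersen formula: with n1 records in the first source, n2 in an independent second source, and m matched in both, the population estimate is n1 * n2 / m. The Python sketch below uses invented counts and a hypothetical function name; it illustrates the textbook formula only, not the Census Bureau's actual coverage measurement procedure.

```python
def dual_system_estimate(n_first, n_second, n_matched):
    """Lincoln-Petersen estimator: N_hat = n1 * n2 / m.

    Assumes the two sources are independent and matching is error-free.
    """
    if n_matched <= 0:
        raise ValueError("need at least one matched record")
    return n_first * n_second / n_matched


# Invented counts: enumeration, follow-up survey, records found in both.
n_hat = dual_system_estimate(n_first=9000, n_second=1000, n_matched=920)
print(round(n_hat))
```

The estimator relies on exact matching between the two sources, so matching errors feed directly into the coverage estimate; the same concern applies when more than two sources are combined.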


Ongoing Work for 2020 Census

• Alternate sources, including commercial sources

• Better non-response follow-up (NRFU) to reduce cost (most expensive part of the operation)

• Use of administrative records to improve alternate response methods (internet)

• Improving the address frame (see Session 7)


INFO 7470/ECON 7400/ILRLE 7400: Health Data

John M. Abowd and Lars Vilhuber, March 4, 2013


Defining “Health” Data

• Data on individuals' health
  – By age (elder, infants)
  – Disability-related
  – Nutrition
  – Etc.

• Data on health providers
  – Insurance access and utilization
  – Access or utilization of doctors, hospitals, etc.


Scope Here

• Health data related to disability
• Health data in the RDC network
• Other restricted-access health data


Long List of Disability-related Data

• 2000 Decennial Census Long Form: six disability-related questions, and many other questions relevant to participation (e.g., living arrangements, employment, income, transportation utilization, and housing).

• American Community Survey (ACS): since 2003; disability questions reworded in 2008 to new OMB-coordinated standard (also CPS, soon SIPP)


Disability-related Data (2)

• CPS: prior to 2008, only a work limitation measure; since 2008, same question as ACS.
  – Contains extensive information on employment, income, insurance, living arrangements, family status
  – Insurance coverage, marital status, and parental status for youth and young adults with work limitations


Health and Retirement Study (HRS)

• Surveys a nationally representative sample of more than 22,000 Americans over the age of 50 every two years.

• Subjects:
  – antecedents and consequences of retirement;
  – health, health care, income, and wealth over time;
  – work disability, declining health, and institutionalization
  – chronic illness, functional ability, depression, and self-assessed health status,
  – health-related behaviors such as smoking, alcohol use, and exercise
  – cognition, demographics, family relationships, occupations, employment, employer accommodation, and economic circumstances
• Run by Institute for Social Research (ISR)/University of Michigan under a cooperative agreement with the NIA
  http://hrsonline.isr.umich.edu/


HRS (2)

• Restricted-access linkages
  – SSA earnings
  – Detailed geography and industry
  – Medicare claims and summary

• Access and merge to RDC-data is feasible, but needs authorization from HRS


Behavioral Risk Factor Surveillance System (BRFSS)

• Annual state-based health survey
  – health risk behaviors,
  – clinical preventative practices,
  – health care use and access focused on chronic disease and injury.
• Collaborative effort between CDC and the state health departments
• Cross-sectional telephone survey that collects data on approximately 350,000 non-institutionalized adults (18 and older).


National Health Interview Survey

• Annual nationally representative cross-sectional survey of approximately 100,000 non-institutionalized civilians

• Conducted by CDC• Topics/data

– Highly detailed information on health, functional status, and activity limitations.

– Hducation, employment, marriage, parenting, and family income

3/4/2013

Page 63: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

National Health Interview Survey

• Random selection of disability-related research using NHIS: Stapleton, Livermore, and Kennell (2002); Burkhauser, Houtenville, and Wittenburg (2003); Burkhauser et al. (2002); Kaye (2002); Trupin et al. (1997); Horvath-Rose and Stapleton (2004)

Page 64: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

NHIS Linkages - SSA

• 1994-2005 linked
• Numident (limited demographics)
• Master Beneficiary Record (MBR)
• Supplemental Security Record (SSR)
• Payment History Update System (PHUS)
• Reduced sample size, ~64,000 in linked file compared to regular analysis files
  – 75% linkage rate in 1994 (92% of eligible), declined to 45% in 2005 (76% of eligible)
• Example analysis: comparisons between self-reports of disability and diagnosis codes contained in the SSA data
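The two linkage rates quoted for each year imply how large the linkage-eligible group was. A minimal sketch of that arithmetic follows. It assumes the first figure is linked respondents over all respondents and the parenthetical figure is linked respondents over eligible respondents; the slide does not define the denominators, and the function and variable names are illustrative only.

```python
# Linkage rates quoted on the slide, expressed as fractions.
# Assumption: "overall"     = linked / all respondents
#             "of_eligible" = linked / respondents eligible for linkage
rates = {
    1994: {"overall": 0.75, "of_eligible": 0.92},
    2005: {"overall": 0.45, "of_eligible": 0.76},
}


def implied_eligible_share(overall, of_eligible):
    """Share of all respondents who were eligible for linkage."""
    return overall / of_eligible


for year, r in rates.items():
    share = implied_eligible_share(r["overall"], r["of_eligible"])
    print(f"{year}: about {share:.0%} of respondents eligible for linkage")
```

Under that reading, the eligible share falls from roughly 82% to roughly 59%, so the decline in the overall linkage rate reflects both fewer eligible respondents and a lower match rate among those eligible.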

Page 65: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

NHIS Linkages (CMS)

• CMS Medicare Enrollment and Claims Files
  – CMS Medicare data (1991-2007)
  – Linked to
    • 1994-1998 National Health Interview Survey (NHIS)
    • 1999-2005 National Health Interview Survey (NHIS)

Page 66: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Medical Expenditure Panel Survey (MEPS)

• Set of large-scale surveys of families and individuals, their medical providers (e.g., doctors, hospitals, and pharmacies), and employers
• Household Component (HC) of the MEPS collects data from a sample of families and individuals in selected communities across the United States, drawn from a nationally representative sub-sample of households that participated in the prior year's NHIS
• Approximately 22,000 individuals

Page 67: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Medical Expenditure Panel Survey (MEPS)

• Topics:
  – Specific health services, frequency of use
  – Cost of these services, how they are paid for
  – Cost, scope, and breadth of health insurance held by and available to U.S. workers
  – Demographic characteristics, health conditions, health status, income, and employment

Page 68: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

MEPS-MPC

• MEPS-HC is supplemented by information provided by the Medical Provider Component (MPC)
• Covers
  – Hospitals
  – Physicians
  – Home health care providers
  – Pharmacies
• Providers are identified by household respondents

Page 69: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

NHANES

• National Health and Nutrition Examination Survey, since the 1960s
• Topics/Data:
  – Demographic, socioeconomic, dietary, and health-related questions
  – Examination component consists of medical, dental, and physiological measurements
  – Laboratory tests

Page 70: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

NHANES linkages

• EPA data (environment on health)
• Using restricted-access data:
  – Lower level geography
  – Indirect identifiers

Page 71: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Medicare Current Beneficiary Survey (MCBS)

• Non-public-use survey of about 12,000 Medicare beneficiaries at any point in time
• Conducted since 1991 by the Centers for Medicare and Medicaid Services (CMS)
• Overlapping-cohorts structure: 3 contacts per year, with any single individual followed for at most four years
• Nationally representative of all Medicare beneficiaries, including those who are aged, disabled, and institutionalized
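The overlapping-cohorts structure can be made concrete with a small sketch. It assumes that one new cohort enters every calendar year starting in 1991 and that a cohort entering in year c is interviewed in years c through c+3; the slide states only the contact frequency and the four-year maximum, so the entry schedule and all names below are illustrative assumptions.

```python
CONTACTS_PER_YEAR = 3    # from the slide
MAX_YEARS_IN_PANEL = 4   # from the slide
FIRST_SURVEY_YEAR = 1991  # survey start year, from the slide


def active_cohorts(year):
    """Entry years of the cohorts interviewed in `year`.

    Assumption: a new cohort enters each year and stays MAX_YEARS_IN_PANEL years.
    """
    earliest = max(FIRST_SURVEY_YEAR, year - MAX_YEARS_IN_PANEL + 1)
    return list(range(earliest, year + 1))


def max_interviews_per_person():
    """Upper bound on the number of contacts with any one individual."""
    return CONTACTS_PER_YEAR * MAX_YEARS_IN_PANEL


for year in (1991, 1993, 2000):
    print(year, active_cohorts(year))
print("maximum interviews per person:", max_interviews_per_person())
```

Under these assumptions, up to four cohorts are in the field at once after the start-up years, which is what keeps the cross-section near a constant size while individuals rotate out.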

Page 72: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Medicare Current Beneficiary Survey (MCBS)

• Data/topics:
  – Health status, health care use and expenditures, and health insurance coverage of Medicare beneficiaries
  – Employment and demographic characteristics
• The MCBS is split into two components: Access to Care, and Cost and Use files

Page 73: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

HEALTH PROVIDER DATA (EXAMPLE)

Page 74: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

State-specific Data

• Example: Gruber and Kleiner (2012), American Economic Journal: Economic Policy, “Do Strikes Kill? Evidence from New York State” (also NBER Working Paper w15855, 2010)
• Short-term non-federal hospitals in New York State are required to submit discharge data to the New York State Department of Health through the Statewide Planning and Research Cooperative System (SPARCS)

Page 75: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

SPARCS

• Patient-level, detailed data on
  – Patient characteristics (e.g., age, sex, race)
  – Diagnoses (several DRG and ICD-9 codes)
  – Treatments (several ICD-9 codes)
  – Services (accommodation)
  – Total charges
• For every hospital discharge in New York State since 1982

Page 76: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Gruber and Kleiner (2010)

• Combine SPARCS with detailed strikes data
• Controlling for hospital-specific heterogeneity, patient demographics, and disease severity, the results show that nurses’ strikes increase in-hospital mortality by 19.4% and 30-day readmission by 6.5% for patients admitted during a strike, with little change in patient demographics, disease severity, or treatment intensity
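"Controlling for hospital-specific heterogeneity" is commonly done with hospital fixed effects. The sketch below shows the within-hospital (demeaning) estimator on a handful of made-up hospital-period records. It is not the authors' specification, which also conditions on patient demographics and disease severity; the data, names, and single-regressor setup are assumptions for illustration.

```python
from collections import defaultdict

# Synthetic hospital-period records (NOT SPARCS data):
# (hospital_id, strike_in_period (0/1), in_hospital_mortality_rate)
records = [
    ("A", 0, 0.020), ("A", 0, 0.022), ("A", 1, 0.025),
    ("B", 0, 0.010), ("B", 0, 0.012), ("B", 1, 0.014),
    ("C", 0, 0.030), ("C", 0, 0.028),  # hospital C never has a strike
]


def within_estimate(records):
    """Slope of outcome on strike after subtracting each hospital's own means."""
    by_hospital = defaultdict(list)
    for hospital, strike, outcome in records:
        by_hospital[hospital].append((strike, outcome))

    sxy = 0.0
    sxx = 0.0
    for rows in by_hospital.values():
        mean_x = sum(x for x, _ in rows) / len(rows)
        mean_y = sum(y for _, y in rows) / len(rows)
        for x, y in rows:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx


effect = within_estimate(records)
print(f"within-hospital strike effect on mortality rate: {effect:.4f}")
```

Because each hospital is compared only with itself, permanent differences in mortality levels across hospitals drop out, and a hospital that never experiences a strike contributes nothing to the estimate.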

Page 77: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

INFO 7470/ECON 7400/ILRLE 7400
Alternate Data Sources of the 21st Century

Page 78: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Rising Cost of Decennial Census

Source: Vitrano and Chapin, 2012

Page 79: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Where Are Official Statistics Heading?

• On the negative side:
  – Increasing (?) privacy and confidentiality concerns
  – Decreasing response rates
    • Canada’s Census long form is no longer compulsory
  – Increasing costs
• On the positive side:
  – Far more data available in general: petabytes per minute
  – “Big Data” is of commercial interest: ability to buy data

Page 80: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Consider Mapping

• In the 1980s
  – Census was the key provider of detailed demographic maps (TIGER)
• In 2013
  – Navteq, Google, Bing, etc. provide highly detailed maps
  – Private provision of satellite imagery (SPOT, 1986)
    • GeoEye-1 provides 0.50 m resolution for commercial customers

Page 81: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Google Flu Index

• http://www.google.org/flutrends/

Page 82: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Google Flu Index

• Compares to CDC’s flu tracking: http://www.cdc.gov/flu/weekly/overview.htm

Page 84: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Google Flu Index

• How it works:
  – Uses search terms for flu-related topics
  – Validated against historical “high-quality” data (in the US: CDC, see doi:10.1038/nature07634)
• Advantages:
  – Very current (vs. some lag at CDC)
• Disadvantage:
  – May get side-tracked by changes in behavior/indicators
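The "validated against historical data" step can be sketched as a simple calibration: fit a line between the log-odds of the flu-related query share and the log-odds of the share of physician visits for influenza-like illness, then use the fit to nowcast the current week. This is in the spirit of the logit-linear fit reported in the cited Nature paper, not a reproduction of it. All numbers below are synthetic, and the function names are illustrative.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Synthetic weekly history (NOT real Google or CDC figures):
# query_share = fraction of searches that are flu-related
# ili_share   = fraction of physician visits for influenza-like illness
query_share = [0.002, 0.003, 0.005, 0.008, 0.006, 0.004]
ili_share = [0.010, 0.014, 0.022, 0.035, 0.027, 0.018]

a, b = fit_line([logit(q) for q in query_share],
                [logit(p) for p in ili_share])


def nowcast(q):
    """Predicted ILI visit share given this week's query share."""
    return inv_logit(a + b * logit(q))


print(f"slope = {b:.2f}; nowcast for query share 0.007 = {nowcast(0.007):.3f}")
```

The disadvantage noted on the slide shows up directly here: the coefficients are estimated on past behavior, so if people start searching for flu terms for other reasons, the same query share maps to the wrong illness level until the model is recalibrated.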

Page 85: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Google Trends by Google

Page 86: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Latest Google Trends

Source: Butler in Nature, Vol. 494, Issue 7436

Page 87: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

The Use of Non-traditional Data in Statistics

• Powerful, timely
• Power obtained in part through validation against “old-style” statistics
• Difference between
  – “Raw data” (Google search terms)
  – First-order analytics (Google Flu Trends)
  – Representativeness of the data: deep analysis and interpretation

Page 88: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Google Price Index

• Tracks database of web shopping data
• The GPI shows a “pretty good correlation” with the CPI for goods such as cameras and watches that are often sold on the web, but less so for others, such as car parts, that are infrequently traded online (FT 2010)
• General approach: Choi and Varian (2011)

Page 89: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Billion Prices Project

• http://bpp.mit.edu
  – Alberto Cavallo and Roberto Rigobon (MIT)
• Methodology
  – Prices collected every day from online retailers
  – DB contains prices on the full array of products sold by these retailers, product descriptions, package sizes, brands, special characteristics (e.g., “organic”), and whether the item is on sale or under price control
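To illustrate how daily scraped prices can become an index, here is a minimal sketch of a chained, unweighted geometric-mean (Jevons-type) index over products observed on consecutive days. The slide does not describe the project's actual index formula, so this is a stand-in technique rather than the BPP method; product names and prices are made up.

```python
import math

# Hypothetical scraped prices: {day: {product_id: price}}
prices = {
    "2013-03-01": {"milk_1l": 1.00, "bread": 2.00, "coffee": 5.00},
    "2013-03-02": {"milk_1l": 1.10, "bread": 2.00, "coffee": 5.00},
    "2013-03-03": {"milk_1l": 1.10, "bread": 2.20},  # coffee not observed
}


def jevons_link(prev, curr):
    """Geometric mean of price relatives over products seen on both days."""
    common = sorted(set(prev) & set(curr))
    if not common:
        return 1.0  # no matched products: carry the index forward unchanged
    log_sum = sum(math.log(curr[p] / prev[p]) for p in common)
    return math.exp(log_sum / len(common))


def chained_index(prices, base=100.0):
    """Chain the daily links, starting from `base` on the first day."""
    days = sorted(prices)
    index = {days[0]: base}
    for prev_day, day in zip(days, days[1:]):
        link = jevons_link(prices[prev_day], prices[day])
        index[day] = index[prev_day] * link
    return index


for day, value in chained_index(prices).items():
    print(day, round(value, 2))
```

Matching products across days before averaging is what lets the index tolerate items that appear and disappear from a retailer's site. A production index would also need expenditure weights, which scraped data do not supply.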

Page 90: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Billion Prices Project

• Availability
  – PriceStats sells data commercially through State Street (for instance, used in The Economist for Argentina)
  – Research data available with a lag
  – Does not claim to cover 100% of CPI goods and services (depends on online availability)
  – “Standard NSF funding would cover the costs for 1 day of web scraping”

Page 91: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Billion Prices Project

• Use of similar methods by BLS to supplement CPI (Amstat article)
  – Quality adjustment models for televisions, camcorders, cameras, and washing machines
  – Investigating retail scanner data for use in CPI

Page 92: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Twitter as a Statistical Data Source

• Work at the Michigan NCRN node by Matthew D. Shapiro and co-authors
• Use Twitter activity to create indicators of new job loss
• Benchmarked to weekly UI claims
• Critical: classification, search phrases
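The pipeline described above (classify tweets by search phrase, aggregate weekly, benchmark against UI claims) can be sketched in a few lines. The phrases, tweets, and claims figures below are invented for illustration; the actual phrases and classifier used by the Michigan team are not given in these slides.

```python
# Hypothetical search phrases (assumption, not the project's actual list).
JOB_LOSS_PHRASES = ("lost my job", "laid off", "got fired")


def is_job_loss(text):
    """Crude phrase-match classifier for job-loss tweets."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in JOB_LOSS_PHRASES)


def weekly_signal(tweets):
    """tweets: iterable of (week, text). Returns {week: share of matching tweets}."""
    totals, hits = {}, {}
    for week, text in tweets:
        totals[week] = totals.get(week, 0) + 1
        hits[week] = hits.get(week, 0) + int(is_job_loss(text))
    return {week: hits[week] / totals[week] for week in totals}


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Synthetic inputs (NOT real tweets or real UI claims).
tweets = [
    (1, "lost my job today"), (1, "nice weather"),
    (2, "laid off again"), (2, "Got fired, ugh"), (2, "lunch"),
    (3, "coffee time"), (3, "new phone"), (3, "I was laid off"),
]
ui_claims = {1: 350, 2: 380, 3: 330}  # made-up weekly initial claims, thousands

signal = weekly_signal(tweets)
weeks = sorted(signal)
r = pearson([signal[w] for w in weeks], [ui_claims[w] for w in weeks])
print(f"correlation between Twitter signal and UI claims: {r:.2f}")
```

The slide's warning about classification and search phrases applies to `is_job_loss`: plain substring matching also catches jokes, quotes, and news headlines, so the choice of phrases largely determines the quality of the signal.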

Page 93: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

Initial UI Claims vs. Twitter Signal

Excerpt from presentation by Shapiro at NCRN annual PI meeting

Page 94: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

New Sources, New Methods, …

• New unstructured data
• Convergence of qualitative and quantitative sciences
• Need traditional statistics (registers!), but the role of new sources may increase over time
  – All of the uses here are validated against official statistics where available
• Can supplement or replace official statistics where the latter are of lower quality (BPP for Argentina!)

Page 95: INFO 7470/ECON 7400/ILRLE 7400 LEHD, Health, Admin, etc. John M. Abowd and Lars Vilhuber March 4, 2013.

… New Skill Sets

• Social scientists need to learn new tools
• INFO7470 students 2013:

  Tool         None   Advanced
  SAS          31%    11%
  Stata        25%    16%
  R            56%     4%
  Python       87%     1%
  C, Fortran   67%     2%

• Who knows how to leverage a 15,000-node Hadoop cluster?