Documenting Data Transformations

Post on 21-Mar-2017


“Provenance and Social Science Data”, 15 March 2017

Documenting Data Transformations

George Alter, University of Michigan

Why Metadata?

• Data are useless without Metadata – “data about data”
• Metadata should:
– Include all information about data creation
– Describe transformations to variables
– Be easy to create
• Our goal: Automated capture of metadata

A few words about ICPSR

• World’s largest archive of social science data

• Consortium established 1962

• 760+ member institutions around the world

• Founding member and home office for the DDI Alliance

Powered by DDI Metadata

ICPSR is building search tools based upon Data Documentation Initiative (DDI) XML

Codebooks (pdf and online) are rendered from the DDI.

Searchable database of 4.5M variables

Click here for online codebook

Online codebook shows variable in context of dataset

Link to online crosstab tool

What question was asked?

How was the question coded?

Link to online graph tool


Click here for variable comparison

Variable comparison display


Search for datasets with 3 desired variables

Check boxes for variable comparison

Crosswalk for American National Election Study (ANES) and General Social Survey (GSS)

Columns link to 70 datasets

134 tags in 8 lists


Variables linked to online codebooks

Metadata for the American National Election Study

What question was asked?

Who answered this question?

How was the question coded?


How do we know who answered the question?

It’s in the pdf.

When data arrive at the archive…

• No question text
• No interview flow (question order, skip pattern)
• No variable provenance
• Data transformations are not documented.

How is research data created?

• Most surveys are conducted with computer assisted interview software (CAI)
– CATI – Computer-assisted Telephone Interview
– CAPI – Computer-assisted Personal Interview
– CAWI – Computer Aided Web Interview
• There is no paper questionnaire
• The CAI program is the questionnaire
– i.e. the program is the metadata

[Diagram: Computer Assisted Interviewing (CAI) produces the original data. A “CAI to DDI” conversion (Colectica, MQDS, others) produces the original metadata as DDI XML.]

We already have tools to convert CAI to machine-readable metadata.

[Diagram: Computer Assisted Interviewing (CAI) produces the original data, and a “CAI to DDI” conversion (Colectica, MQDS, others) produces the original metadata as DDI XML. Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data.]

What happens when a project modifies the data.

The modified data no longer match the metadata.

[Diagram: Command scripts run in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data. A “Stat Package to DDI” step extracts metadata from the SPSS/SAS/Stata/RData file into a new DDI XML file (extracted metadata), separate from the original metadata (DDI XML) converted from the CAI.]

Metadata are re-created after the data are transformed.

Transformations are documented by hand.

Statistics packages have limited metadata

• Variable names
• Variable labels
• Value labels
• No provenance

[Diagram: The earlier flow (CAI, original data, original metadata in DDI XML, command scripts in SPSS/SAS/Stata/R, revised data) with new components added: a Script Parser that reads the command scripts, SDTL (Standard Data Transformation Language), and an XML Updater that produces revised metadata in DDI XML.]

Automating the capture of transformation metadata.

Missing links that we will build.
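To make the script-parser idea concrete, here is a minimal sketch in Python. It is not the project's parser, and the record layout is a hypothetical stand-in rather than the actual SDTL format. It assumes only the simple conditional assignments used as examples in this talk, and shows a Stata command and an SPSS command being reduced to the same package-neutral record.

```python
import re

NUM = r"-?\d+(?:\.\d+)?"

# Stata form:  generate Y=9 if X>3
STATA_RE = re.compile(
    rf"^\s*generate\s+(\w+)\s*=\s*({NUM})\s+if\s+(\w+)\s*(<=|>=|==|<|>)\s*({NUM})\s*$"
)
# SPSS form:   IF (X > 3) Y=9.
SPSS_RE = re.compile(
    rf"^\s*IF\s*\(\s*(\w+)\s*(<=|>=|=|<|>)\s*({NUM})\s*\)\s*(\w+)\s*=\s*({NUM})\s*\.\s*$",
    re.IGNORECASE,
)


def make_record(target, value, left, op, right):
    """Build a hypothetical package-neutral record (NOT the real SDTL schema)."""
    return {
        "command": "compute_if",
        "target": target,
        "value": float(value),
        "condition": {"left": left, "op": op, "right": float(right)},
    }


def parse(line, language):
    """Translate one conditional assignment into the neutral record."""
    if language == "stata":
        m = STATA_RE.match(line)
        if m:
            target, value, left, op, right = m.groups()
            return make_record(target, value, left, op, right)
    elif language == "spss":
        m = SPSS_RE.match(line)
        if m:
            left, op, right, target, value = m.groups()
            op = "==" if op == "=" else op  # SPSS writes equality as "="
            return make_record(target, value, left, op, right)
    raise ValueError(f"cannot parse {language!r} command: {line!r}")


if __name__ == "__main__":
    print(parse("generate Y=9 if X>3", "stata"))
    print(parse("IF (X > 3) Y=9.", "spss"))
```

Both calls yield the same record, which is the point of a common language: downstream tools only need to understand one representation. A record like this captures the syntax of a command but not how each package evaluates it, so a real transformation language also has to pin down semantics such as the treatment of missing values.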

What statistics packages should be covered?

ICPSR Downloads by Format

Format            All downloads    Studies with all formats
Delimited text    43%              29%
SPSS              22%              24%
SAS               10%              12%
Stata             19%              23%
R                 5%               12%
Excel             0%               1%
Other             0%               0%
Total             100%             100%
Number            378,007          154,663


Why do we need an SDTL?

SPSS
    MISSING VALUES X(-1).
    IF (X > 3) Y=9.
    IF (X < 3) Z=8.

    Input Data    Output Data
    X             X     Y     Z
    2             2           8
    3             3
    4             4     9
    -1            -1

Stata
    replace X=. if X==-1
    generate Y=9 if X>3
    generate Z=8 if X<3

    Input Data    Output Data
    X             X     Y     Z
    2             2           8
    3             3
    4             4     9
    -1                  9

SAS
    if X=-1 then X=.;
    if X>3 then Y=9;
    if X<3 then Z=8;

    Input Data    Output Data
    X             X     Y     Z
    2             2     .     8
    3             3     .     .
    4             4     9     .
    -1            .     .     8

What happens when a missing value is in a logical comparison?

• SPSS
– Logical expressions including a missing value are considered “Missing.” Usually, “Missing” is equivalent to “False.”
• Stata
– Missing values are treated as numbers equal to infinity. So, any number is less than a missing value.
• SAS
– Missing values are treated as numbers equal to minus infinity. So, any number is greater than a missing value.

Missing Values in Comparisons

SPSS
    MISSING VALUES X(-1).
    IF (X > 3) Y=9.
    IF (X < 3) Z=8.

    Input Data    Output Data
    X             X     Y     Z
    2             2           8
    3             3
    4             4     9
    -1            NULL

Stata
    replace X=. if X==-1
    generate Y=9 if X>3
    generate Z=8 if X<3

    Input Data    Output Data
    X             X     Y     Z
    2             2           8
    3             3
    4             4     9
    -1            ∞     9

SAS
    if X=-1 then X=.;
    if X>3 then Y=9;
    if X<3 then Z=8;

    Input Data    Output Data
    X             X     Y     Z
    2             2     .     8
    3             3     .     .
    4             4     9     .
    -1            -∞    .     8
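The three missing-value rules can be checked with a small simulation. The Python sketch below is an illustration only, not code from any of the packages: the function names are made up, `None` stands in for a missing value, and the SPSS detail that a user-missing code keeps its stored value of -1 is simplified to plain “missing.” It applies the same three steps (recode -1 to missing, set Y=9 if X>3, set Z=8 if X<3) under each package's comparison rule.

```python
import math

MISSING = None  # stand-in for a missing value in this sketch
PACKAGES = ("spss", "stata", "sas")


def compare(x, op, threshold, package):
    """Evaluate `x op threshold` under one package's missing-value rule."""
    if package not in PACKAGES:
        raise ValueError(f"unknown package: {package!r}")
    if x is MISSING:
        if package == "spss":
            return False      # result is "Missing", which IF treats as False
        if package == "stata":
            x = math.inf      # missing sorts above every number
        else:
            x = -math.inf     # SAS: missing sorts below every number
    return x > threshold if op == ">" else x < threshold


def transform(xs, package):
    """Recode -1 to missing, then Y=9 if X>3 and Z=8 if X<3."""
    rows = []
    for x in xs:
        x_out = MISSING if x == -1 else x
        y = 9 if compare(x_out, ">", 3, package) else MISSING
        z = 8 if compare(x_out, "<", 3, package) else MISSING
        rows.append((x_out, y, z))
    return rows


if __name__ == "__main__":
    for package in PACKAGES:
        print(package, transform([2, 3, 4, -1], package))
```

The rows for X = 2, 3 and 4 are identical across packages. Only the last row differs: Y and Z both stay missing under the SPSS rule, Y becomes 9 under the Stata rule, and Z becomes 8 under the SAS rule, matching the output tables above.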

Benefits of automated metadata capture

• Metadata will be better
– All the information in the CAI can be included.
– Variable transformations can be described
• Automation will lower costs
– Metadata will not be discarded and re-created
• All metadata will be standardized and machine readable
– Codebooks with rich information can be rendered at will
• If we make it easy and beneficial, researchers will use it.

Continuous Capture of Metadata for Statistical Data (NSF ACI-1640575)

Project Partners
• Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
• Colectica
• Metadata Technology North America
• Norwegian Centre for Research Data
• General Social Survey, NORC, University of Chicago
• American National Election Study, University of Michigan

Questions?

George Alter
altergc@umich.edu