Stephenson - Data Curation for Quantitative Social Science Research

Libbie Stephenson, Data Archivist (Retired)
UCLA Social Science Data Archive
libbie@g.ucla.edu
https://dataverse.harvard.edu/dataverse/ssda_ucla

Data Curation for Quantitative Social Science Research: A Case Study

NISO Virtual Conference: Data Curation – Cultivating Past Research Data for Future Consumption
August 31, 2016


Page 2: Stephenson - Data Curation for Quantitative Social Science Research

DISCLAIMER

I am retired from UCLA so my comments reflect my own experience and expertise. They do not necessarily reflect the ideas, opinions or practices of anyone at UCLA.

These materials are free for you to use, but please cite accordingly.

NISO - AUGUST 31, 2016


Page 3: Stephenson - Data Curation for Quantitative Social Science Research

OVERVIEW

About the Archive

About the data we manage

What we are trying to do

What we actually do

Some illustrations


Page 4: Stephenson - Data Curation for Quantitative Social Science Research

ABOUT THE ARCHIVE

Operating since 1964 -- before email, PCs, the Internet, laptops, and smartphones; manages survey/quantitative data stored on media from punch cards to the cloud

Staff have library science degrees, statistical and technical expertise, and quantitative social science backgrounds

Serve all UCLA quantitative researchers: provide reference, cataloging/metadata, and long term archiving; support data rescue, management, and security

https://dataverse.harvard.edu/dataverse/ssda_ucla

Page 5: Stephenson - Data Curation for Quantitative Social Science Research

SURVEY/QUANTITATIVE

RESEARCH

Carried out in the U.S. since the 1940s -- post WW2

1960s-70s -- ICPSR and academic archives

1970s -- growth of data-oriented professional associations (IASSIST, APDU, IFDO, CESSDA)

Focused on society and social norms

Predict outcomes; test assumptions; study change over time; run experiments


Note: in any discipline we also need to understand the work flow of the research and the way individuals approach their work.

Page 6: Stephenson - Data Curation for Quantitative Social Science Research

CURATION GOALS

Researcher driven philosophy of open access, data sharing, reuse

Collaborative, multi-unit or multi-institutional

Ensure data conservation and long term usability, as well as discovery and access

Processes and work flows support disaster planning

Use of best and trusted digital repository policies, models, practices, and work flows

Reflect values of accountability and integrity

Page 7: Stephenson - Data Curation for Quantitative Social Science Research

POLICIES SUPPORT PRACTICE

Foundational, essential to a strong data curation infrastructure.

Encompass what is acquired/collected and the levels and scope of curation; ensure long term usability; drive processes and work flows

Social Science Data Archive policy

TOOL : Policy-making for Research Data in Repositories by Ann Green, Stuart Macdonald and Robin Rice.


Page 8: Stephenson - Data Curation for Quantitative Social Science Research

OUR STEPS IN CURATION

Initial contact

Data Quality Review and Appraisal

Ingest: verification, metadata, physical storage

Access

Preservation


Page 9: Stephenson - Data Curation for Quantitative Social Science Research

INITIAL CONTACT

Data Curation Profile

Data Management Plan

Guide to Social Science Data Preparation and Archiving


Page 10: Stephenson - Data Curation for Quantitative Social Science Research

APPRAISAL

Archival Collection Policy

Also depends on:

Resources to process

Long term resources

Fitness, usefulness

Data Deposit Form signatures and completeness; commitment to share data; privacy and confidentiality


Page 11: Stephenson - Data Curation for Quantitative Social Science Research

DATA QUALITY REVIEW

Use of statistical packages, emulator, Adobe Pro, Excel, Colectica, text editor

Verify deposit package and checksums; run frequencies; compare data to documentation

Completeness of codebook, question text, sampling, weighting, recodes, methods

Disclosure analysis, check for personal identifiers and assess privacy/confidentiality of respondents

Documentation converted to PDF/A
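
One part of the disclosure analysis listed above can be sketched in code. A common heuristic is to flag categories with very few respondents, since rare combinations can make a person identifiable. This is a minimal sketch only: the function name, the threshold of 5, and the toy records are all assumptions for illustration, not the Archive's actual procedure.

```python
from collections import Counter

def small_cells(records, variables, threshold=5):
    """List (variable, value, count) for categories with fewer than
    `threshold` respondents. The threshold is an assumed example value,
    not an archive rule."""
    flagged = []
    for var in variables:
        counts = Counter(rec[var] for rec in records)
        for value, n in sorted(counts.items()):
            if n < threshold:
                flagged.append((var, value, n))
    return flagged

# Toy records, invented for illustration only.
records = (
    [{"region": "West", "occupation": "teacher"}] * 12
    + [{"region": "West", "occupation": "judge"}] * 2
    + [{"region": "East", "occupation": "teacher"}] * 9
)

flagged = small_cells(records, ["region", "occupation"])
for var, value, n in flagged:
    print(f"Review {var}={value!r}: only {n} respondents")
```

A flagged category is a prompt for human review (recode, collapse, or restrict access), not an automatic decision.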


Page 12: Stephenson - Data Curation for Quantitative Social Science Research

EXAMPLE: WHAT KIND OF DATA?


Page 13: Stephenson - Data Curation for Quantitative Social Science Research

CODEBOOK DOCUMENTS THE

COLUMNS


5002 01 01 302000 001 101 10004B121068965

Each item is called a variable. We refer to the numeric content of each item as a value.
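
To illustrate how a codebook turns a raw record into variables, here is a minimal sketch that slices the record shown above by column position. The variable names and column ranges in `LAYOUT` are hypothetical; the real positions would come from the study's codebook.

```python
# Hypothetical codebook layout: (variable name, start column, end column).
# Columns are 1-based and inclusive, as codebooks usually list them.
# These names and positions are invented for illustration only.
LAYOUT = [
    ("CASEID", 1, 4),
    ("CARD", 6, 7),
    ("WAVE", 9, 10),
]

def parse_record(line, layout):
    """Slice one fixed-width record into a dict of variable -> raw value."""
    return {name: line[start - 1:end] for name, start, end in layout}

record = "5002 01 01 302000 001 101 10004B121068965"
row = parse_record(record, LAYOUT)
print(row)
```

Values are kept as strings on purpose: leading zeros and letter codes (such as the `B` in this record) are lost if a field is converted to a number too early.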

Page 14: Stephenson - Data Curation for Quantitative Social Science Research

COMPARE FREQS TO CODEBOOK


[Figure callouts: Variable; Values; Value labels]

Page 15: Stephenson - Data Curation for Quantitative Social Science Research

RUN MARGINALS/FREQUENCIES


Sex of Respondent

                                              Frequency  Percent  Valid Percent  Cumulative Percent
MALE                                                856     45.1           45.1                45.1
FEMALE                                             1041     54.9           54.9               100.0
Total                                              1897    100.0          100.0

What is your race - ethnicity

                                              Frequency  Percent  Valid Percent  Cumulative Percent
White                                               618     32.6           32.6                32.6
Hispanic                                            475     25.0           25.0                57.6
Black                                               474     25.0           25.0                82.6
Asian or Pacific Islander                           282     14.9           14.9                97.5
Native American or Alaskan native                    17      0.9            0.9                98.4
Identifies more than one of the above groups         20      1.1            1.1                99.4
DON'T KNOW                                            2      0.1            0.1                99.5
REFUSED                                               9      0.5            0.5               100.0
Total                                              1897    100.0          100.0
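
The same check can be scripted outside a statistical package. The sketch below computes a frequency table and lists any codes that appear in the data but not in the codebook. The counts (856 and 1041) are taken from the output above; the numeric codes 1 and 2 and the function name are assumptions for illustration.

```python
from collections import Counter

def frequencies(values, value_labels):
    """Return (rows, undocumented) for one variable.

    rows: list of (code, label, count, percent, cumulative percent)
    undocumented: codes present in the data but absent from the codebook
    """
    counts = Counter(values)
    total = sum(counts.values())
    rows, running = [], 0
    for code in sorted(counts):
        running += counts[code]
        rows.append((
            code,
            value_labels.get(code, "<not in codebook>"),
            counts[code],
            round(100 * counts[code] / total, 1),
            round(100 * running / total, 1),
        ))
    undocumented = sorted(set(counts) - set(value_labels))
    return rows, undocumented

# Counts from the frequency output; the codes 1 and 2 are assumed.
sex_labels = {1: "MALE", 2: "FEMALE"}
sex_values = [1] * 856 + [2] * 1041

rows, undocumented = frequencies(sex_values, sex_labels)
for row in rows:
    print(row)
print("Codes missing from codebook:", undocumented)
```

An empty `undocumented` list means every code found in the data has a value label; anything else goes back to the depositor as a question.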

Page 16: Stephenson - Data Curation for Quantitative Social Science Research

INGEST – PHYSICAL FORMATS

Virus check, run checksums, address versioning, fixity, file naming conventions

Convert files to archival formats if required

Back up copies to external media

Copy datasets to Dataverse; Safe Archive tool

Use of secure file transfer client

SQL/PHP scripts for local holdings file

Compression software (7-zip)


Address disaster plan and file access (public and local); Security requirements; LOCKSS
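
The checksum and fixity steps on this slide can be sketched as a manifest that is built at ingest and re-checked later. This is an illustrative sketch, not the Archive's tooling: SHA-256 and the function names are assumptions, and the slide does not say which algorithm or software is used.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }

def verify_manifest(folder, manifest):
    """Return the files that are missing, new, or changed since the manifest."""
    current = build_manifest(folder)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )
```

Calling `build_manifest` on a deposit folder at ingest and `verify_manifest` on each later copy gives a simple fixity check; an empty list means nothing has changed.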

Page 17: Stephenson - Data Curation for Quantitative Social Science Research

INGEST– BIBLIOGRAPHIC METADATA

Bibliographic metadata enables search and discovery:

Establish bibliographic-level identity for unique items

Bibliographic record to WorldCat/Voyager

Add record to holdings database (SQL)

Create Dataverse record; Assign persistent identifier


Produce and review with investigator
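
As a rough illustration of the holdings database step, the sketch below stores one bibliographic record in SQL and searches it. The slides mention SQL and PHP scripts; Python's built-in `sqlite3` is swapped in here only to keep the example self-contained. The table layout, the study identifier, and the DOI-style identifier are hypothetical.

```python
import sqlite3

# In-memory database for illustration; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE holdings (
           study_id      TEXT PRIMARY KEY,
           title         TEXT NOT NULL,
           persistent_id TEXT,
           deposited     TEXT
       )"""
)

def add_holding(conn, study_id, title, persistent_id=None, deposited=None):
    """Insert one holdings record; the primary key rejects duplicate study IDs."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO holdings VALUES (?, ?, ?, ?)",
            (study_id, title, persistent_id, deposited),
        )

# Placeholder values, not a real study or identifier.
add_holding(conn, "SSDA-0001", "Example Survey, 1994",
            "doi:10.0000/EXAMPLE", "2016-08-31")

titles = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM holdings WHERE title LIKE ?", ("%Survey%",)
    )
]
print(titles)
```

Using the study identifier as the primary key is one way to enforce the "bibliographic-level identity for unique items" mentioned above.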

Page 18: Stephenson - Data Curation for Quantitative Social Science Research

WHAT ELSE DO WE NEED TO

KNOW ABOUT THE DATA?

Description of the study

Citation

Funding source

Methodology

Sampling

Publications


Page 19: Stephenson - Data Curation for Quantitative Social Science Research

EXAMPLE - DATAVERSE


Links to tools to manage collections

Navigate to and search for studies

Studies can be downloaded or analyzed online

Page 20: Stephenson - Data Curation for Quantitative Social Science Research

VARIABLE LEVEL SEARCH

CAPABILITIES

Enables searching across many studies at once.

Enables searching shared catalogs of multiple archives

TOOLS: Colectica Repository and NESSTAR

Requires local or remote hosting of software.

Can share the metadata files for repurposing.


Page 21: Stephenson - Data Curation for Quantitative Social Science Research

DATA DOCUMENTATION

INITIATIVE

Document, Discover, and Interoperate

“International standard for describing data that result from observational methods in the social, behavioral, economic, and health sciences”

“Facilitates interpretation and understanding -- both by humans and computers”

http://www.ddialliance.org/

Page 22: Stephenson - Data Curation for Quantitative Social Science Research

INGEST-VARIABLE LEVEL METADATA

Descriptive metadata with detailed information about the data makes the data understandable and reusable:

Create variable-level metadata, using Colectica or NESSTAR to produce standardized metadata records

Create DDI record; full DDI codebook

Migrate DDI to Colectica Repository


Produce and review with investigator

NESSTAR
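
To give a sense of what a variable-level record contains, the sketch below builds a simplified XML fragment with element names modeled on DDI Codebook (`var`, `labl`, `qstn`, `qstnLit`, `catgry`, `catValu`). It is not schema-validated and omits namespaces and most required context, so treat it as an illustration rather than a valid DDI file. The variable name `SEX`, the codes, and the question text placeholder are assumptions; in practice Colectica or NESSTAR generates these records.

```python
import xml.etree.ElementTree as ET

def variable_element(name, label, question, categories):
    """Build one simplified DDI-style <var> element."""
    var = ET.Element("var", {"name": name})
    ET.SubElement(var, "labl").text = label
    qstn = ET.SubElement(var, "qstn")
    ET.SubElement(qstn, "qstnLit").text = question
    for code, cat_label in categories:
        catgry = ET.SubElement(var, "catgry")
        ET.SubElement(catgry, "catValu").text = str(code)
        ET.SubElement(catgry, "labl").text = cat_label
    return var

code_book = ET.Element("codeBook")
data_dscr = ET.SubElement(code_book, "dataDscr")
data_dscr.append(
    variable_element(
        "SEX",                                   # hypothetical variable name
        "Sex of Respondent",
        "(question text from the questionnaire)",
        [(1, "MALE"), (2, "FEMALE")],            # assumed codes
    )
)

xml_text = ET.tostring(code_book, encoding="unicode")
print(xml_text)
```

Because the variable name, label, question text, and codes sit in separate elements, other tools can search or repurpose them without parsing a PDF codebook.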

Page 23: Stephenson - Data Curation for Quantitative Social Science Research

EXAMPLE - IMPORTING DATA

Use the Data tab to import files from SPSS or STATA formats.


Page 24: Stephenson - Data Curation for Quantitative Social Science Research

Label

Question text

Numeric values

Variable Details include variable name, label, description or question text, and types of coding.


Page 25: Stephenson - Data Curation for Quantitative Social Science Research

EXAMPLE DDI FROM COLECTICA


DDI fields are in red; used to create documentation; can be repurposed

Page 26: Stephenson - Data Curation for Quantitative Social Science Research

PRESERVATION AND CURATION

Continuous monitoring of file formats; migrate to new formats when there is a new operating system, a new version of statistical software, a new mode of file transfer, or a code change

Monitoring of database function; software updates or redesigns

Monitoring of servers, external media health; replace as needed

Data forensics; checksums; validation; authentication; version control; format migration; refresh media; record preservation metadata -- DDI

Review disaster plan and collection policy at regular intervals

Review new or revised regulations for intellectual property; security; data producers/distributors; funding agencies

Review with original depositor, their data management plans, changes in access or user permissions

Focus is on functional-level preservation and long term usability through use of DDI and continuous review.
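
Format monitoring can be supported by a simple inventory of what the holdings contain. The sketch below counts files by extension and lists those on a watch list of software-dependent formats. The watch list and the file names are hypothetical examples; which formats need migration is a judgment the archive makes, not something this code decides.

```python
from collections import Counter
from pathlib import Path

# Hypothetical watch list of extensions to review for migration.
WATCH_LIST = {".sav", ".por", ".dta", ".xls"}

def format_inventory(paths):
    """Count files by lower-cased extension."""
    return Counter(Path(p).suffix.lower() or "<none>" for p in paths)

def needs_review(paths, watch_list=WATCH_LIST):
    """Return the files whose extension is on the watch list."""
    return sorted(str(p) for p in paths if Path(p).suffix.lower() in watch_list)

# Invented file names for illustration.
holdings = ["study1/data.sav", "study1/codebook.pdf",
            "study2/DATA.DTA", "study2/README"]

print(format_inventory(holdings))
print(needs_review(holdings))
```

Running an inventory like this after each software or operating system change gives a concrete list to work from when planning migrations.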

Page 27: Stephenson - Data Curation for Quantitative Social Science Research

UNCOMFORTABLE TRUTHS

Data management in institutions requires high level administrative participation; new, sustained funding; and differently trained staff

Data management planning is not a static event but a continuous process to ensure that research remains independently understandable and open to informed reuse over the long term

There is an urgent need for standards, tools, and best practice models for many different file formats and disciplines


Page 28: Stephenson - Data Curation for Quantitative Social Science Research

NEXT STEPS FOR PRACTITIONERS

“Crucial metadata about data are not always being captured or created and linked to data in repositories. Storage and persistence of data submissions isn't enough. We need data archivists and librarians to commit to partnering with researchers to curate data -- to review incoming data for usability, confidentiality, and completeness of descriptive information.”


Ann Green (2016), email communication. Used with permission.

Page 29: Stephenson - Data Curation for Quantitative Social Science Research

ANY QUESTIONS?

THANK YOU!

Social Science Data Archive, UCLA

Box 951484, Los Angeles, CA 90095-1484
310-825-0716


Page 30: Stephenson - Data Curation for Quantitative Social Science Research

LINKS

Social Science Data Archive dataverse.harvard.edu/dataverse/ssda_ucla

Data Seal of Approval www.datasealofapproval.org/en/

National Digital Stewardship Alliance ndsa.org/activities/levels-of-digital-preservation/

Open Archival Information System www.oclc.org/research/publications/library/2000/lavoie-oais.html

Social Science Data Archive Policy data-archive.library.ucla.edu/SSDA_collectionAndArchivingPolicy.pdf?_ga=1.3255478.786669706.1378228281

Data Curation Profile datacurationprofiles.org/

Data Management Planning at ICPSR www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/index.html

ICPSR Guide to Data Preparation www.icpsr.umich.edu/icpsrweb/content/deposit/guide/

Colectica www.colectica.com/

NESSTAR www.nesstar.com/index.html

DDI www.ddialliance.org/

Dataverse dataverse.org/
