Privacy in Research Data Management - Use Cases

Prepared for: Integrating Approaches to Privacy across the Research Lifecycle Sept 2013 Introduction to Research Data Privacy Use Cases Micah Altman <[email protected]> Director of Research, MIT Libraries Non-Resident Senior Fellow, Brookings Institution


From Integrating Approaches to Privacy across the Research Lifecycle http://privacytools.seas.harvard.edu/fall-2013-workshop This workshop will consider how emerging tools and perspectives from a variety of disciplines, such as computer science, social science, law, and the health sciences, should be integrated in the management of confidential research data. Multidisciplinary discussion groups will grapple with these issues in the context of exemplar research use cases.

Transcript of Privacy in Research Data Management - Use Cases

Page 1: Privacy in Research Data Management - Use Cases

Prepared for:

Integrating Approaches to Privacy across the Research Lifecycle

Sept 2013

Introduction to Research Data Privacy Use Cases

Micah Altman <[email protected]>

Director of Research, MIT Libraries
Non-Resident Senior Fellow, Brookings Institution

Page 2: Privacy in Research Data Management - Use Cases

DISCLAIMER

These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.

Secondary disclaimer:

“It’s tough to make predictions, especially about the future!”

-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Page 3: Privacy in Research Data Management - Use Cases

About the “use cases”?

Technical definition:

A summary of a pattern of interactions between external actors and a system under consideration to accomplish a goal.

Working definition:

Who does what, when; and what do they wish to accomplish?

Complemented by:

• User stories – simple generalized descriptions of specific interactions
• Scenarios – variations on a theme
• Examples/fact patterns – real life examples of the abstract use case

Page 4: Privacy in Research Data Management - Use Cases

Data Input/Output Model

[Diagram: DATA flows into Published Outputs. Masked sample records shown in the diagram: "* Jones * * 1961 021*" and "* Jones * * 1972 9404*".]

Published Outputs
• “The correlation between X and Y was large and statistically significant”
• Summary statistics
• Contingency table
• Public use sample microdata
• Information Visualization
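As a rough illustration of the input/output model, the sketch below derives two of the listed output types (summary statistics and a contingency table) from a small microdata table. The records, the field names, and the `summarize` and `contingency_table` helpers are hypothetical, invented only for illustration; they do not come from the slides.

```python
from collections import Counter
from statistics import mean

# Hypothetical microdata: one record per individual (invented values).
MICRODATA = [
    {"region": "north", "employed": True, "income": 42000},
    {"region": "north", "employed": False, "income": 0},
    {"region": "south", "employed": True, "income": 51000},
    {"region": "south", "employed": True, "income": 39000},
    {"region": "south", "employed": False, "income": 0},
]


def summarize(records, field):
    """Return simple summary statistics for one numeric field."""
    values = [r[field] for r in records]
    return {"n": len(values), "mean": mean(values),
            "min": min(values), "max": max(values)}


def contingency_table(records, row_field, col_field):
    """Count records for each (row value, column value) pair."""
    return Counter((r[row_field], r[col_field]) for r in records)


stats = summarize(MICRODATA, "income")
table = contingency_table(MICRODATA, "region", "employed")
```

Aggregated outputs like these can still be disclosive: in this toy table the cell `("north", False)` has a count of 1, so it describes a single individual.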

Page 5: Privacy in Research Data Management - Use Cases

Information Life Cycle Model

Lifecycle stages:
• Creation/Collection
• Storage/Ingest
• Processing
• Internal Sharing
• Analysis
• External dissemination/publication
• Re-use (Scientometric, Education, Scientific, Policy)
• Long-term access

Also shown in the diagram:
• Research methods
• Data Management Systems
• Legal / Policy Frameworks
• Statistical / Computational Frameworks

Page 6: Privacy in Research Data Management - Use Cases

Legal/Policy Frameworks

[Diagram: regions labeled Contract, Intellectual Property, Access Rights, and Confidentiality, with the following terms placed among them.]

• Copyright
• Fair Use
• DMCA
• Database Rights
• Moral Rights
• Intellectual Attribution
• Trade Secret
• Patent
• Trademark
• Common Rule (45 CFR 46)
• HIPAA
• FERPA
• EU Privacy Directive
• Privacy Torts (Invasion, Defamation)
• Rights of Publicity
• Sensitive but Unclassified
• Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …)
• Classified
• FOIA
• CIPSEA
• State Privacy Laws
• EAR
• State FOI Laws
• Journal Replication Requirements
• Funder Open Access
• Contract
• License
• Click-Wrap TOU
• ITAR
• Export Restrictions

Page 7: Privacy in Research Data Management - Use Cases

Example: Stakeholder Concerns Across Lifecycle

Stakeholders:
• Research sources: research subjects; owners of subject material; owners of supplementary data
• Research sponsors: home institution; funding sources
• Project personnel: investigators; research staff
• Research publishers: print publishers; research archives
• Research consumers: readers; secondary researchers

Legal Issues (in slide order):
• Licensing; Copyright; DMCA; Informed Consent; Privacy; Trade secrets
• Licensing; Freedom of Information; Copyright
• Copyright
• Copyright; Licensing
• Fair Use

Information Transfer

Stakeholder Concerns (in slide order):
• Privacy; Confidentiality; Intellectual Property
• Replicable Research; Policy Relevance; Accessibility of Research; Protect IP; Avoid third party IP/Privacy Issues
• Replicable Research; Publish; Promote use of Publications; Track use
• Replicable research; Promote use of their publications; Protect publisher IP; Avoid third party IP/Privacy Issues
• Replicate and extend; Secondary analysis; Link research

Page 8: Privacy in Research Data Management - Use Cases

Systems Policy Research questions deriving from Information Lifecycle Analysis

• Infrastructure requirements analysis
– Data acquisition, storage, dissemination
– Identification, authorization, authentication
– Metadata, protocols
• System design: potential implementation cost of differential privacy:
– Information security: hardening
– Information security: certification & auditing
– Model server development, provisioning, maintenance, reliability, availability
• System design: information security tradeoffs of interactive privacy mechanisms:
– Availability risks: denial of service attack
– Availability/integrity risks: privacy budget exhaustion attacks
– Integrity risks: modification of delivered results (e.g. man-in-the-middle attacks)
– Secrecy/privacy: breach of authentication/authorization layer
• System design: optimizing privacy & utility across lifecycle
– When does limiting disclosive data collection dominate methods at the data analysis stage?
– When do restricted virtual data enclaves + public synthetic data dominate interactive mechanisms?
• System design: information use/reuse
– Support of scientific analysis use cases (model diagnostics, exploratory data analysis, integration of external data) within interactive privacy systems.
– Aligning informational assumptions across stages and incorporating informative priors?
– Requirements for scientific replication/verification of results produced by model servers?
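The "privacy budget exhaustion" risk listed for interactive mechanisms can be made concrete with a toy model server. The following is a minimal sketch, assuming a Laplace mechanism for counting queries and simple sequential composition of epsilon; the class and method names (`CountingQueryServer`, `noisy_count`, `BudgetExhausted`) are hypothetical, and the code is not hardened for real use (for example, it ignores floating-point side channels).

```python
import random


class BudgetExhausted(Exception):
    """Raised when a query would exceed the remaining privacy budget."""


class CountingQueryServer:
    """Toy interactive mechanism: noisy counts under a fixed total epsilon."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self.remaining = total_epsilon
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        """Answer one counting query, spending `epsilon` of the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise BudgetExhausted("privacy budget exhausted")
        self.remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # A counting query has sensitivity 1, so the Laplace scale is 1/epsilon.
        # The difference of two Exp(rate=epsilon) draws is Laplace(0, 1/epsilon).
        noise = self._rng.expovariate(epsilon) - self._rng.expovariate(epsilon)
        return true_count + noise


server = CountingQueryServer(range(100), total_epsilon=1.0, seed=0)
first = server.noisy_count(lambda r: r < 50, epsilon=0.5)
second = server.noisy_count(lambda r: r < 50, epsilon=0.5)
```

After these two queries the budget is spent, so any further call raises `BudgetExhausted` for every user of the server. That shared, finite budget is what makes exhaustion an availability risk: a single client can deliberately consume it.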

Page 9: Privacy in Research Data Management - Use Cases

Modeling Features

Features - Characteristics

Data - Structure; source; unit of observation; attribute types; dimensionality; number of observations; homogeneity; frequency of updates; quality characteristics
Analytic Results - Form of output; analysis methodology; analysis/inferential goal; utility/loss/quality
Disclosure scenario - Source of threat; areas of vulnerability; attacker objectives, background knowledge, capability; breach criteria/disclosure concept
Stakeholders - Stakeholder types; capacities; trust relationships; budgets
Lifecycle characteristics - Lifecycle stages controlled/in scope; policies used; stakeholders involved at each stage
Current privacy management approach - Regulation/policy; legal controls; statistical/computational disclosure methods; information security controls
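One way to make this feature list operational is to record each use case in a small structured profile. The sketch below is an assumption about how that might look, not something the slides prescribe: the `UseCaseProfile` class, its field names, and the `undocumented` helper are hypothetical, and the example values are taken from the social media exemplar in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class UseCaseProfile:
    """One use case described along the six feature groups listed above."""

    name: str
    data: dict = field(default_factory=dict)
    analytic_results: dict = field(default_factory=dict)
    disclosure_scenario: dict = field(default_factory=dict)
    stakeholders: dict = field(default_factory=dict)
    lifecycle: dict = field(default_factory=dict)
    current_privacy_management: dict = field(default_factory=dict)

    def undocumented(self):
        """Return the feature groups that have not been filled in yet."""
        return [k for k, v in vars(self).items() if k != "name" and not v]


social_media = UseCaseProfile(
    name="Social media analysis",
    data={"structure": "network", "observations": "10M-1B"},
    analytic_results={
        "form_of_output": ["summary scalars", "summary table", "visualization"]
    },
)
gaps = social_media.undocumented()
```

Here `gaps` lists the four feature groups still empty for this profile, which gives a simple checklist of what remains to be characterized before use cases are compared.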

Page 10: Privacy in Research Data Management - Use Cases

Exemplar: Social Media Analysis


Attribute Type - Examples

Data: Structure - Network
Data: Attribute Types - Continuous/discrete; scale: ratio/interval/ordinal/nominal
Data: Performance Characteristics - 10M-1B observations; sample from stream of continuously updated corpus; dozens of dimensions/measures
Measurement: Unit of Observation - Individuals; interactions
Measurement: Measurement Type - Observational
Measurement: Performance Characteristic - High volume; complex network structure; sparsity; systematic and sparse metadata
Management Constraints - License; replication
Analysis Methods - Bespoke algorithms (clustering); nonlinear optimization; Bayesian methods
Desired Outputs - Summary scalars (model coefficients); summary table; static/interactive visualization

More Information

• Grimmer, Justin, and Gary King. "General purpose computer-assisted clustering and conceptualization." Proceedings of the National Academy of Sciences 108.7 (2011): 2643-2650.
• King, Gary, Jennifer Pan, and Molly Roberts. "How censorship in China allows government criticism but silences collective expression." APSA 2012 Annual Meeting Paper. 2012.
• Lazer, David, et al. "Life in the network: the coming age of computational social science." Science (New York, NY) 323.5915 (2009): 721.

Page 11: Privacy in Research Data Management - Use Cases

Mapping the “Space” of Research Data Privacy

• Many different types of potentially relevant features
• Many types of stakeholders
• Many lifecycle stages

So we can’t be exhaustive.

Heuristic: Choose some points -- combinations of characteristics -- that are near various corners of the (hyper-) space and that represent substantively important examples. Document these…

Discuss. Think. Repeat.
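The heuristic can be sketched in a few lines: enumerate a coarsened feature space, note how quickly it grows, and keep only the combinations that sit at its corners. The axes and values below are hypothetical placeholders, not a taxonomy proposed in the slides, and deciding which corners are substantively important remains a human judgment that this code does not attempt.

```python
from itertools import product

# Hypothetical, coarsened feature axes; each list runs from one extreme to the other.
AXES = {
    "data_structure": ["tabular", "relational", "network"],
    "observations": ["thousands", "millions", "billions"],
    "output": ["summary statistics", "microdata", "interactive queries"],
    "collector_resources": ["low", "high"],
}


def all_points(axes):
    """Every combination of characteristics (the full space)."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


def corner_points(axes):
    """Combinations that take the first or last value on every axis."""
    names = list(axes)
    extremes = [(values[0], values[-1]) for values in axes.values()]
    return [dict(zip(names, combo)) for combo in product(*extremes)]


space = all_points(AXES)
corners = corner_points(AXES)
```

Even with only four coarse axes the full space has 54 points while there are 16 corners; each additional axis multiplies the first number by its length, which is why exhaustive documentation is impractical.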

Page 12: Privacy in Research Data Management - Use Cases

Example Use Cases

Name/Description - Examples

Comparison case: Official Statistics
Well-resourced data collector summarizes tables/relational data in the form of summary statistics and contingency tables.
Examples:
• U.S. Census dissemination
• European statistical agencies

Privacy-Aware Journal Replication Policies
Scholarly journals adopting policies for deposit and disposition of data for verification and replication. How to balance privacy and replicability without intensive review?
Examples:
• Data Sharing Systems for Open Access Journals
• American Political Science Association Data Access and Research Transparency [DART] Policy Initiative

Long-term Longitudinal Data Collection
Data collections tracking individual subjects (and possibly friends and relations) over decades.
Examples:
• National Longitudinal Study of Adolescent Health (Add Health)
• Framingham Heart Study
• Panel Study of Income Dynamics

Computational Social Science
“Big” data. New forms and sources of data. Cutting-edge analytical methods and algorithms.
Examples: Analyzing …
• Netflix
• Facebook
• Hubway
• GPS
• Blogs

Page 13: Privacy in Research Data Management - Use Cases

Proposed Discussion Questions (for tomorrow)

• Characterization.
• Current approaches.
• Enhancing approaches.
• Integrating approaches.
• Utility.
• Privacy.
• Methodological Barriers.
• Incentives.
• Future.
• Prior work.

• Are these summaries useful as descriptive models?

• What is missing from the big picture?

• What are the opportunities for research, practice & policy?

(What one wants to know) (What one asks)

Page 14: Privacy in Research Data Management - Use Cases

Selected Bibliography

• L. Willenborg and T. de Waal. Elements of Statistical Disclosure Control, volume 155 of Lecture Notes in Statistics. Springer Verlag, New York, NY, 2001.
• Higgins, Sarah. "The DCC curation lifecycle model." International Journal of Digital Curation 3.1 (2008): 134-140. www.dcc.ac.uk/resources/curation-lifecycle-model
• ESSNET. Handbook on Statistical Disclosure Control. 2011. neon.vb.cbs.nl/casc/SDC_Handbook.pdf
• Fung, Benjamin, et al. "Privacy-preserving data publishing: A survey of recent developments." ACM Computing Surveys (CSUR) 42.4 (2010): 14.
• Altman, M. (2012). “Mitigating Threats To Data Quality Throughout the Curation Lifecycle.” In G. Marchionini, C. Lee, & H. Bowden (Eds.), Curating For Quality. datacuration.web.unc.edu

Page 15: Privacy in Research Data Management - Use Cases

Questions?

E-mail: [email protected]
Web: informatics.mit.edu
Twitter: @drmaltman

Page 16: Privacy in Research Data Management - Use Cases

Appendix: Full Questions

• Characterization.
– Are there key additional characteristics of the use case that should be noted? How do these characteristics change the analysis and treatment of privacy in these cases?
• Current approaches.
– How is this use case treated now -- what's the state of the art & practice? How is success measured?
• Enhancing approaches.
– Are any of the approaches discussed yesterday used? How could the tools and approaches mentioned earlier or other existing tools be used at particular stages of the research lifecycle to enhance utility and privacy?
• Integrating approaches.
– Are approaches that have been developed and used in different communities compatible with each other? How should legal, computational, policy, and statistical tools be integrated so as to be most effective?
• Utility.
– What things would stakeholders like to do with the data that the toolset doesn't restrict or obstruct? Where is social benefit sub-optimal? How is utility measured/perceived by the stakeholders?
• Privacy.
– What sorts of data/outputs are considered particularly sensitive? What are the most important real and perceived risks -- what harms could occur if data is released and reidentified, how severe are these harms and how likely?
• Methodological Barriers.
– What are technical, methodological, computational or infrastructural barriers to improving privacy and utility in the management of this data? What particular characteristics of the use case contribute barriers?
• Incentives.
– If better tools already exist, why aren't they used? What are barriers to adoption of new tools and methods? What are the specific "market failures" in this area -- such as perverse incentives, lack/asymmetry of information, lack of well-developed market, irrational behavior, transaction cost, network effects, etc.? What particular characteristics of the use case most influence incentives?
• Future.
– How is this use case likely to evolve over time? What are threats to stability/scalability/robustness/resilience of the proposed/current solutions?
• Prior work.
– Are there key additional examples of the use case that should be noted? Are there additional key references or writings that should be noted?