Privacy Statistics and Data Linkage
![Page 1: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/1.jpg)
Privacy Statistics and Data Linkage
Mark Elliot
Confidentiality and Privacy Group
University of Manchester
![Page 2: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/2.jpg)
Overview
• The disclosure risk problem
• Some e-science possibilities
– Monitored data access
– Grid based Data Environment Analysis
• The meaning of privacy
![Page 3: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/3.jpg)
Data Data Everywhere…
• Massive and exponential increase in data; Mackey and Purdam (2002); Purdam and Elliot (2002).
– These studies have led to the setting up of the data monitoring service.
• Singer (1999) noted three behavioural tendencies:
– Collect more information on each population unit
– Replace aggregate data with person specific databases
– Given the opportunity, collect personal information
• Purdam and Elliot add:
– Link data whenever you can
![Page 4: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/4.jpg)
Disclosure Risk I: Microdata
![Page 5: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/5.jpg)
The Disclosure Risk Problem: Type I: Identification
[Diagram: two files that share key variables]
Identification file: Name, Address (ID variables); Sex, Age, .. (key variables)
Target file: Sex, Age, .. (key variables); Income, .. (target variables)
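The identification scenario in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the `link_on_keys` helper are hypothetical, and the rule used (a key combination that occurs exactly once in each file) is the simplest possible matching rule, not a full risk assessment.

```python
from collections import defaultdict


def link_on_keys(identification_file, target_file, key_vars):
    """Pair names with target records whose key values are unique in both files."""

    def index(records):
        # Group records by their combination of key-variable values.
        groups = defaultdict(list)
        for rec in records:
            groups[tuple(rec[k] for k in key_vars)].append(rec)
        return groups

    id_index = index(identification_file)
    target_index = index(target_file)
    matches = []
    for key, id_recs in id_index.items():
        target_recs = target_index.get(key, [])
        # Only an unambiguous one-to-one match counts as an identification here.
        if len(id_recs) == 1 and len(target_recs) == 1:
            matches.append((id_recs[0]["name"], target_recs[0]))
    return matches


# Hypothetical identification file: ID variables plus key variables.
identification_file = [
    {"name": "A. Example", "address": "1 High St", "sex": "F", "age": 34},
    {"name": "B. Example", "address": "2 Low Rd", "sex": "M", "age": 51},
    {"name": "C. Example", "address": "3 Mid Ln", "sex": "M", "age": 51},
]
# Hypothetical target file: key variables plus a target variable.
target_file = [
    {"sex": "F", "age": 34, "income": "High"},
    {"sex": "M", "age": 51, "income": "Low"},
]

matches = link_on_keys(identification_file, target_file, ["sex", "age"])
for name, record in matches:
    print(name, "->", record["income"])
```

Only the record that is unique on sex and age in both files is linked; the two people who share a key combination are left unmatched because the link would be ambiguous.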
![Page 6: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/6.jpg)
Disclosure Risk II: Aggregate Tables of Counts
![Page 7: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/7.jpg)
The Disclosure Risk Problem: Type II: Attribution

| | High | Medium | Low | Total |
|---|---|---|---|---|
| Academics | 0 | 100 | 50 | 150 |
| Lawyers | 100 | 50 | 5 | 155 |
| Total | 100 | 150 | 55 | 305 |

Income levels for two occupations
![Page 8: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/8.jpg)
The Disclosure Risk Problem: Type II: Attribution

| | High | Medium | Low | Total |
|---|---|---|---|---|
| Academics | 1 | 100 | 50 | 150 |
| Lawyers | 100 | 50 | 5 | 155 |
| Total | 100 | 150 | 55 | 305 |

Income levels for two occupations
![Page 9: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/9.jpg)
The Disclosure Risk Problem: Type II: Attribution

| | High | Medium | Low | Total |
|---|---|---|---|---|
| Academics | 0 | 100 | 50 | 150 |
| Lawyers | 100 | 50 | 5 | 155 |
| Total | 100 | 150 | 55 | 305 |

Income levels for two occupations
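A zero cell in a table of counts lets an intruder attribute something to every member of a group without identifying anyone. The sketch below is a rough illustration using the counts from the table with a zero in the Academics/High cell; the function name `attribution_disclosures` and the output layout are assumptions made for this example.

```python
def attribution_disclosures(table):
    """Report, per group, the levels ruled out by zero counts and any level
    that holds the whole group (which would be known with certainty)."""
    findings = {}
    for group, counts in table.items():
        total = sum(counts.values())
        if total == 0:
            continue  # an empty group discloses nothing about anyone
        ruled_out = [level for level, n in counts.items() if n == 0]
        certain = [level for level, n in counts.items() if n == total]
        if ruled_out or certain:
            findings[group] = {"ruled_out": ruled_out, "certain": certain}
    return findings


# Counts taken from the table of income levels for two occupations.
table = {
    "Academics": {"High": 0, "Medium": 100, "Low": 50},
    "Lawyers": {"High": 100, "Medium": 50, "Low": 5},
}

findings = attribution_disclosures(table)
print(findings)
```

Knowing only that someone is an academic is enough to learn that they do not have a high income, which is the attribution problem in its simplest form.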
![Page 10: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/10.jpg)
Multiple datasets
• Disclosure Risk assessment for single datasets is a reasonably understood problem.
• But what happens with multiple datasets?
![Page 11: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/11.jpg)
Data Mining and the Grid
• Traditional Data Mining examines and identifies patterns on single (if massive) datasets.
• But Data Mining is really a method/approach/technology that has been waiting for the grid to happen.
![Page 12: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/12.jpg)
• Smith and Elliot (2005,06,07)
• Increases in data availability lead inexorably to an increase in disclosure risk
• My ability to make linkages (disclosive or otherwise) between datasets X and Y is facilitated by the copresence of dataset Z.
• It’s all about information!
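The point about dataset Z can be made concrete with a toy example. In this sketch, which uses invented records and a hypothetical `join` helper, X and Y share no variables and so cannot be linked directly, but Z overlaps with both and acts as a bridge.

```python
def join(left, right):
    """Natural join of two record lists on the variables they have in common."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    if not shared:
        return []  # nothing in common, so no linkage is possible
    joined = []
    for l_rec in left:
        for r_rec in right:
            if all(l_rec[k] == r_rec[k] for k in shared):
                joined.append({**l_rec, **r_rec})
    return joined


# Hypothetical datasets: X holds identities, Y holds the sensitive variable.
dataset_x = [{"name": "A. Example", "postcode": "M13", "birth_year": 1970}]
dataset_y = [{"employer": "Acme", "job_title": "Analyst", "income": "High"}]
# Z shares variables with both X and Y.
dataset_z = [
    {"postcode": "M13", "birth_year": 1970, "employer": "Acme", "job_title": "Analyst"}
]

direct = join(dataset_x, dataset_y)
via_z = join(join(dataset_x, dataset_z), dataset_y)
print(len(direct), len(via_z))
```

Neither X nor Y has changed between the two attempts; the only difference is that Z is available.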
![Page 13: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/13.jpg)
CLEF: Clinical e-Science Framework
A solution involving monitored access
![Page 14: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/14.jpg)
CLEF Consortium
Approximately 40 Staff from
• University of Manchester
• University of Sheffield
• University College London
• University of Brighton
• Royal Marsden Hospital, London
![Page 15: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/15.jpg)
Purpose
• To provide a system for allowing research access to patient data, whilst maintaining privacy.
• Patient records
– Database
• Texts such as referral letters and other clinical texts
– A text mining system converts these to microdata
![Page 16: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/16.jpg)
CLEF: one possible architecture
[Architecture diagram. Components: Raw Data; Pre-access SDRA/SDC; Pre-access DQI Monitor; Treated Data; Firewall; Workbench; Data Intrusion Sentry; Pre-output SDRA/SDC; Pre-output DQI Monitor]
![Page 17: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/17.jpg)
Data Sentry: an AI system
• Monitors patterns of analytical requests
– 3 levels: users, institution, world.
– Looking for intrusive patterns.
– Numbers of requests
• Stores Analytical requests for future use.
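As a very rough sketch of the monitoring idea, the class below logs each analytical request and counts requests at the three levels named above. Everything in it is an assumption for illustration: the class name `RequestSentry`, the fixed thresholds, and the use of simple counts. It covers only the "numbers of requests" part; detecting intrusive patterns would need far more than this.

```python
from collections import Counter


class RequestSentry:
    """Toy request monitor: stores requests and counts them per level."""

    def __init__(self, limits):
        # limits is a hypothetical configuration, e.g.
        # {"user": 2, "institution": 3, "world": 100}
        self.limits = limits
        self.log = []  # stored analytical requests, kept for future use
        self.counts = {
            "user": Counter(),
            "institution": Counter(),
            "world": Counter(),
        }

    def record(self, user, institution, request):
        """Store a request and return the levels whose limit is now exceeded."""
        self.log.append((user, institution, request))
        keys = {"user": user, "institution": institution, "world": "all"}
        alerts = []
        for level, key in keys.items():
            self.counts[level][key] += 1
            if self.counts[level][key] > self.limits[level]:
                alerts.append(level)
        return alerts


sentry = RequestSentry({"user": 2, "institution": 3, "world": 100})
alerts = [sentry.record("u1", "inst_a", f"table {i}") for i in range(3)]
later = sentry.record("u2", "inst_a", "table 3")
print(alerts, later)
```

The fourth request comes from a different user, so the user-level limit is not triggered, but the institution-level count is, which is why monitoring at more than one level matters.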
![Page 18: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/18.jpg)
CLEF Proposed Architecture
[Architecture diagram. Components: Raw Data; Pre-access SDRA/SDC; Pre-access DQI Monitor; Treated Data; Firewall; Workbench; Data Intrusion Sentry; Pre-output SDRA/SDC; Pre-output DQI Monitor]
![Page 19: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/19.jpg)
Data Quality
• User analyses are run on both treated and untreated data.
– Outputs are compared and assessed for difference.
– Major research area – Knowledge Engineering
• Analyses are stored and collectively run over pre and post SDC files for assessment of impact.
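The comparison step might look something like the sketch below, which runs one analysis on both versions of a file and reports how far the outputs diverge. The data, the choice of top-coding as the disclosure control treatment, and the `output_difference` helper are all assumptions made for illustration.

```python
from statistics import mean


def output_difference(analysis, raw, treated):
    """Run the same analysis on raw and treated data and compare the outputs."""
    raw_out = analysis(raw)
    treated_out = analysis(treated)
    absolute = abs(raw_out - treated_out)
    # Relative difference is undefined when the raw output is zero.
    relative = absolute / abs(raw_out) if raw_out != 0 else None
    return {
        "raw": raw_out,
        "treated": treated_out,
        "absolute_difference": absolute,
        "relative_difference": relative,
    }


# Hypothetical incomes (in thousands) and a top-coded version capped at 100.
raw_incomes = [20, 30, 40, 250]
treated_incomes = [min(value, 100) for value in raw_incomes]

result = output_difference(mean, raw_incomes, treated_incomes)
print(result)
```

A stored set of such analyses could be rerun in the same way over pre- and post-SDC files to summarise the overall impact of a treatment.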
![Page 20: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/20.jpg)
The Grid: the context for massive combining.
• “Integrated infrastructure for high-performance distributed computation” Cannataro and Talia (2002)
– Grid middleware handles the technical issues: communication, security, access/authentication etc. Cole et al (2002)
• Data grid
• Knowledge grid
![Page 21: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/21.jpg)
Grid based Data Environment Analysis
![Page 22: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/22.jpg)
What’s it about?
• Disclosure risk analysis is forever constrained by the fact that we tend to look only at the release object.
– This is a bit like evaluating the risk of a house being vulnerable to flooding without looking at where it is located!
• Data Environment Analysis aims to remedy that situation and, in so doing, completely change the face of disclosure control.
![Page 23: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/23.jpg)
What would it involve?
• Web Crawling
• Data Monitoring
• Synthetic Data Generation
• Grid based disclosure risk analysis
![Page 24: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/24.jpg)
Web crawling
• Untrained Screen scraping of all web sites that collect personal data.
• Generic info gathering of web published personal info (personal web pages, MySpace etc.)
![Page 25: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/25.jpg)
Data Monitoring
• The development of sophisticated metadatabases representing available info fields
• Combined Database of web available data.
– Involves intelligent interpretation of web data, record linkage and other AI crossover techniques.
![Page 26: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/26.jpg)
Architecture
[Architecture diagram. Components: Repository (Data & Metadata); Data monitor; Synthesiser; SDRA system; five Web Crawlers]
![Page 27: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/27.jpg)
What next?
• Decide on roles.
• Identify funder.
• Develop grant application.
![Page 28: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/28.jpg)
Synthetic Data Generation
• Uses techniques like multiple imputation to generate artificial data from the metadata generated by the data monitors and from data stored and accessed through data repositories.
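To give a feel for generating artificial records from metadata alone, the sketch below draws each variable independently from a marginal distribution. This is a deliberately simpler stand-in and is not multiple imputation: it ignores relationships between variables, and the metadata values and the `synthesise` function are invented for the example.

```python
import random


def synthesise(metadata, n, seed=0):
    """Draw n artificial records, sampling each variable independently
    from the marginal distribution recorded in the metadata."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    records = []
    for _ in range(n):
        record = {}
        for variable, distribution in metadata.items():
            categories = list(distribution)
            weights = list(distribution.values())
            record[variable] = rng.choices(categories, weights=weights)[0]
        records.append(record)
    return records


# Hypothetical metadata: variable -> {category: proportion}.
metadata = {
    "sex": {"F": 0.5, "M": 0.5},
    "income": {"High": 0.3, "Medium": 0.5, "Low": 0.2},
}

synthetic = synthesise(metadata, n=5)
for record in synthetic:
    print(record)
```

A more realistic synthesiser would model the joint distribution, which is where imputation-style methods come in.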
![Page 29: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/29.jpg)
Closing thoughts
![Page 30: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/30.jpg)
A Blurring of Concepts
• The boundaries between data and processes become less distinct.
• Cyberidentities
– I am my data?
• The distinction between informational and physical privacy becomes less distinct.
![Page 31: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/31.jpg)
Data Growth
• There is no reason to suppose that data growth will not continue at the same breakneck pace
– The data environment will become increasingly rich
• In this context the meaning of “privacy” will undoubtedly change.
– But how?
![Page 32: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/32.jpg)
The meaning of Privacy
• Do people care about privacy in an orthodox, absolute sense?
– What does a blog mean?
• Private-public: Public Privacy
– Control and ownership are more important than the absolute right to secrecy.
![Page 33: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/33.jpg)
From Data Subjects to Data Citizens
• A data-actualised individual: in control of, and self-aware about, their own data.
• What would data citizens be concerned about?
– Ownership
– The use/abuse of their data
– Harm
– Permission/Consent
• This suggests that the law should focus on data abuse rather than privacy per se.
![Page 34: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/34.jpg)
Summary
• Statistical Disclosure presents a problem for the use of data
• Multiple linkable datasets exacerbate that problem.
• E-science provides some tools for new modes of data access
![Page 35: Privacy Statistics and Data Linkage](https://reader034.fdocuments.in/reader034/viewer/2022051821/5681582e550346895dc594fc/html5/thumbnails/35.jpg)
But…..
• Assuming that the global culture continues to feed and be fed by the information explosion:
– Our view of ourselves/our data will/must change.
– The meaning of privacy must change with it.
• The key question is what sort of society we are constructing; the meaning of privacy will reflect this.