Privacy Protection
& the SAIL Databank
David Ford
ECCONET Data Linkage Workshop
Bergen 15 – 17 June 2011
Overview
1. The SAIL system and how it operates
2. Privacy Protection Issues and Drivers
3. Privacy Protection approach
4. Current developments
5. Examples of research studies
6. Future work
What are HIRU and SAIL?
HIRU – the Health Information Research Unit
SAIL – Secure Anonymous Information Linkage
Main aim of HIRU is to realise the potential of
electronically-held, routinely-collected, person-based
data to conduct and support health-related studies
The SAIL databank already holds over 1 billion
anonymised and encrypted individual-level records, from
a range of sources relevant to health and well-being
Is SAIL a Cohort?
Perhaps!
Total population databank for the 3 million people of
Wales
Multi source data (administrative, clinical, research)
Many nested e-cohorts within SAIL (such as WECC)
Data Linkage is the key!
Data linkage (at a person level) is essential to reap the
benefits of routine data
Good quality data linkage needs some form of consistent
personal identifiers on which to link
In the UK, multi-source data do not share a common ID
number. Names, addresses and dates of birth ARE, however,
normally collected.
In the beginning . . .
There was a real opportunity to create this data resource
We had established how linked, routine data was useful
for research
We knew there were numerous technical (computing)
challenges to overcome
But the idea required data owners / guardians to feel able
to provide their data to SAIL.
Constructing the circumstances that enabled data
guardians to supply data to SAIL became the single
biggest challenge!
The issues facing SAIL
Data guardians across Wales:
Wanted to participate and saw the potential benefits
Were nervous of breaching the Data Protection Act
Did not have clear guidelines that helped
Needed a way of guaranteeing the privacy of their
data
Were nervous about the uses to which the data might
be put
Wanted access to be controlled (in some way)
The issues facing SAIL
Researchers (across the UK) wanted:
As much data as they wanted, whenever they wanted
it
To avoid detailing what they wanted to do
Data delivered to them
Data to arrive quickly
No admin, no approvals, no constraints
Clean, easy and consistent data
Simple, flat data structures
Our response
Set a series of objectives
Undertook pilot work
Consulted very widely
Understood relevant legislation and good practice
guidance (Information Governance)
Developed the approaches over time
Continued to consult and have external inspection
Continuous improvement process
The initial IG challenge
Matching up the same people in different datasets (data
linkage) is very inaccurate without access to identifiers
(Matching with imperfect identifiers is still a challenge!)
Sophisticated but standardised matching was therefore
required.
Data owners felt able to part with “anonymised at source”
data. However, including identifiers in the supply was seen
as illegal without consent
Setting out
Pilot to prove the concept
One health economy area – Swansea (pop. c. 250k)
Data: General Practices (36), Patient Episode Database
Wales (PEDW) and social services data extracts
Purpose: to develop, review, refine technical and
procedural methodologies.
Setting out
Consultations with regulatory and professional bodies (local and
national)
Suitability of system
In the public interest
Protection of patient privacy
Ethics and governance
Usefulness to enhance research and inform policy
Value for money
Exhaustive (and exhausting!) efforts
File of evidence of acceptability
The base level
Response: development of “Split-file anonymisation”
technique
Using the “separation principle”
No flow of identifiable information to SAIL
No flow of identifiable confidential information to
ANYONE
Clear, written, formal data sharing agreements
Clarity about use cases, conditions and exception
clauses
Other design constraints
A pledge to data providers that no data will ever leave the
databank
They can ask for it to be deleted
They know who has accessed it
They know what it has been used for
HIRU methodology (illustration)
[Diagram: anonymisation process. Labels recoverable from the figure: Data Provider; Demographic data only, sent to Health Solutions Wales (Validate, Construct ALF); Clinical / activity data, sent to HIRU (Blue C); Validated, anonymised data; Recombine; Other recombined data; Encrypt and load; Operational system.]
Available Computing infrastructure
Blue C supercomputer, one of the fastest computers in
Europe dedicated to Life Science research
Strategic partnership with IBM (through School of
Medicine’s Institute of Life Sciences initiative)
Advanced software toolset (database, data mining,
GIS)
Objectives
1) Secure data transportation
2) Reliable matching process
3) Anonymisation and encryption
4) Disclosure control
5) Data access controls
6) Scrutiny of data utilisation proposals
7) External verification of compliance with IG
Objective 1
1) Secure data transportation
Data transported using HTTPS (Hyper-Text Transfer
Protocol Secure)
DPOs split datasets at source
Clinical details to HIRU (none to HSW)
Demographics to HSW for matching and anonymisation
(Brown Envelope principle)
Linking key – re-join after anonymisation
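The split-at-source flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the `link_key` column, and the `split_record` and `recombine` helpers are hypothetical and are not taken from the SAIL implementation. The sketch shows the one property the slide describes: the demographic and clinical parts travel separately and share nothing but a linking key, which is used to re-join them after anonymisation.

```python
import secrets

# Hypothetical list of identifying fields; the real split is agreed per dataset.
DEMOGRAPHIC_FIELDS = {"nhs_number", "first_name", "surname", "dob", "gender", "postcode"}


def split_record(record, key=None):
    """Split one source record into a demographic part and a clinical part.

    Both parts carry the same random linking key and no other shared field.
    """
    key = key or secrets.token_hex(8)
    demographic = {"link_key": key}
    clinical = {"link_key": key}
    for field, value in record.items():
        target = demographic if field in DEMOGRAPHIC_FIELDS else clinical
        target[field] = value
    return demographic, clinical


def recombine(anonymised_rows, clinical_rows):
    """Re-join clinical rows to their ALF via the linking key, then drop the key."""
    alf_by_key = {row["link_key"]: row["alf"] for row in anonymised_rows}
    joined_rows = []
    for row in clinical_rows:
        joined = {k: v for k, v in row.items() if k != "link_key"}
        joined["alf"] = alf_by_key[row["link_key"]]
        joined_rows.append(joined)
    return joined_rows


# Example with made-up values: the demographic part would go to HSW,
# the clinical part to HIRU, and HSW would return only link_key + ALF.
source = {"nhs_number": "0000000000", "surname": "Jones", "dob": "1970-01-01",
          "diagnosis": "J45", "admitted": "2010-03-02"}
demographic_part, clinical_part = split_record(source, key="k1")
from_ttp = [{"link_key": "k1", "alf": "ALF-0001"}]
linked = recombine(from_ttp, [clinical_part])
```

In this sketch the recombined row holds clinical fields and an ALF only; the identifiers never reach the party that holds the clinical data.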
Objective 2
2) Reliable matching process
• Partnership with Trusted Third Party – HSW
• HSW = NHS Agency with right to hold identifiers for NHS
admin purposes
• Use the Welsh Demographic Service administrative register
as gold standard
• MACRAL (Matching Algorithm for Consistent Results in
Anonymous Linkage) - SQL-based algorithm – sequential
passes
• Deterministic and probabilistic record linkage
MACRAL
Exact match on valid NHS
number
Exact match on firstname,
surname, d.o.b, gender, postcode
Soundexing
Lexicon matching
Assigns match probability on
Bayesian model
Informs analysts
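The sequential-pass idea can be illustrated with a small Python sketch. MACRAL itself is described as SQL-based, so this is not its code: the record layout, the `match` function and the pass names are assumptions made for illustration, and the lexicon and Bayesian probability passes are left out. The sketch covers the first three passes listed above (exact NHS number, exact demographics, Soundex on surname) and stops at the first pass that yields exactly one candidate.

```python
EXACT_FIELDS = ("first_name", "surname", "dob", "gender", "postcode")

_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(name):
    """Standard four-character American Soundex code, or '' for an empty name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            previous = code
    return (result + "000")[:4]


def _same_nhs_number(a, b):
    return bool(a.get("nhs_number")) and a.get("nhs_number") == b.get("nhs_number")


def _same_demographics(a, b):
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in EXACT_FIELDS)


def _same_soundex(a, b):
    code = soundex(a.get("surname", ""))
    rest = ("dob", "gender", "postcode")
    return (code != "" and code == soundex(b.get("surname", ""))
            and all(a.get(f) is not None and a.get(f) == b.get(f) for f in rest))


PASSES = (("nhs_number", _same_nhs_number),
          ("exact_demographics", _same_demographics),
          ("soundex", _same_soundex))


def match(record, register):
    """Return (register id, pass name) from the first pass with a single hit."""
    for pass_name, rule in PASSES:
        hits = [entry for entry in register if rule(record, entry)]
        if len(hits) == 1:
            return hits[0]["id"], pass_name
    return None, "unmatched"
```

Ordering the passes from strictest to loosest means a record is only ever matched by the most reliable rule that applies; records left unmatched here are the ones a probabilistic pass would then score.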
Validation and optimisation
Firstly – Validation exercise
Obtained specificity values >99.8% and sensitivity >94.6%, with error rates <0.2%
Then – Optimised techniques for matching a variety of datasets:
primary care (GP), hospital/secondary care (PEDW), and social care (PARIS)
Matching rates: levels of matched records

                                               Primary Care          Secondary Care         Social Services
                                               General Practice      Hospital Admissions    (PARIS database)
                                               (GP dataset)          (PEDW dataset)
                                               Number     %          Number     %           Number    %
Sample size                                    229,127               290,650                18,540
Valid NHS Number                               229,117    99.996%    264,868    91.13%      -         0.00%
Valid NHS Number plus DRL                      229,123    99.998%    280,729    96.59%      14,158    76.36%
Valid NHS Number plus PRL (99% cut off)        229,125    99.999%    287,572    98.94%      17,095    92.21%
Valid NHS Number plus PRL (95% cut off)        229,125    99.999%    288,186    99.15%      17,431    94.02%
Valid NHS Number plus PRL (90% cut off)        229,125    99.999%    288,424    99.23%      17,553    94.68%
Valid NHS Number plus PRL (50% cut off)        229,125    99.999%    288,670    99.32%      17,639    95.14%
Overall combining Valid NHS, DRL & PRL (50%)   229,125    99.999%    288,683    99.32%      17,642    95.16%
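The percentages in the table are the matched count divided by the sample size for each dataset. As a quick check, the short Python sketch below recomputes the overall row from the figures on the slide; the function and variable names are illustrative only.

```python
# Sample sizes and overall matched counts, copied from the matching rates table.
SAMPLE_SIZE = {"GP": 229_127, "PEDW": 290_650, "PARIS": 18_540}
OVERALL_MATCHED = {"GP": 229_125, "PEDW": 288_683, "PARIS": 17_642}


def match_rate(matched, sample, digits=2):
    """Matched records as a percentage of the sample, rounded to `digits` places."""
    return round(100 * matched / sample, digits)


def unmatched(matched, sample):
    """Number of records left without a match."""
    return sample - matched


overall = {
    name: (match_rate(OVERALL_MATCHED[name], SAMPLE_SIZE[name]),
           unmatched(OVERALL_MATCHED[name], SAMPLE_SIZE[name]))
    for name in SAMPLE_SIZE
}
```

Looking at the unmatched counts rather than the percentages makes the difference between sources more visible: 2 GP records remain unmatched, against 1,967 for PEDW and 898 for PARIS.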
Objective 3
3) Anonymisation and encryption
Anonymous Linking Field (ALF)
One person – one ALF
Aggregation and categorisation
Further processing at HIRU
Into ALF_E
Recombination
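The slides do not say how the ALF or ALF_E is computed, so the following Python sketch is purely an assumption about one way a two-stage scheme of this shape could work. It uses a keyed hash (HMAC-SHA256 from the standard library): the trusted third party derives an ALF from a register identifier with its own secret, and the databank then derives ALF_E from the ALF with a second secret. The function names, secrets and identifier format are hypothetical.

```python
import hashlib
import hmac


def keyed_token(secret: bytes, value: str) -> str:
    """Deterministic keyed hash of `value`; cannot be recomputed without `secret`."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def make_alf(ttp_secret: bytes, register_id: str) -> str:
    """Stage 1 (assumed to run at the trusted third party): one person, one ALF."""
    return keyed_token(ttp_secret, register_id)


def make_alf_e(databank_secret: bytes, alf: str) -> str:
    """Stage 2 (assumed to run at the databank): re-encode the ALF as ALF_E."""
    return keyed_token(databank_secret, alf)


# Made-up secrets and register identifier, for illustration only.
TTP_SECRET = b"held-only-by-the-trusted-third-party"
DATABANK_SECRET = b"held-only-by-the-databank"

alf = make_alf(TTP_SECRET, "register-id-000123")
alf_e = make_alf_e(DATABANK_SECRET, alf)
```

The design point the sketch illustrates is separation: because each stage uses a different secret, neither party on its own can go from a person's identifiers to the value stored in the databank.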
Objective 4
4) Disclosure controls
Assessment of Uniques and low-copy numbers
Data reduced to minimum required for study
Operated at various stages:
When the data view is created
Before dissemination
Numerical Evaluation of Multiple Outputs
Combination of expert review and machine processes
Numerical Evaluation of Multiple Outputs
NEMO
SQL-based algorithm
Counts unique and low-copy number records
Allows the judicious application of suppression and/or
aggregation
Manual review
Linkage/Homogeneity attack
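Counting unique and low-copy-number records can be sketched as follows. NEMO is described as SQL-based, so this Python version is an illustration of the counting idea only, not NEMO itself; the column names, the default threshold of 5 and the function names are assumptions. The second function is a simple check for the homogeneity problem named above, where every record in a group shares the same sensitive value.

```python
from collections import Counter, defaultdict


def low_copy_groups(rows, keys, threshold=5):
    """Return combinations of `keys` that occur fewer than `threshold` times.

    A count of 1 is a unique record; other small counts are low-copy numbers.
    """
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < threshold}


def homogeneous_groups(rows, keys, sensitive):
    """Return multi-record groups in which `sensitive` takes a single value."""
    values = defaultdict(set)
    sizes = Counter()
    for row in rows:
        combo = tuple(row[k] for k in keys)
        values[combo].add(row[sensitive])
        sizes[combo] += 1
    return [combo for combo, seen in values.items()
            if len(seen) == 1 and sizes[combo] > 1]
```

Groups flagged by either function are candidates for suppression or aggregation (for example, widening an age band), with the final decision left to manual review as the slide describes.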
Objective 5
5) Data access controls
Technical and permission-based control
Policies and Standard Operating Procedures (SOPs)
User agreements – clarity + penalties
Project-based approvals and linked access
Physical restrictions - technology
Time-limited, specific data views per approved project
SAIL Gateway
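A time-limited, project-specific data view can be modelled as a small permission check. The Python sketch below is an assumption about how such a rule might be expressed, not a description of the SAIL access control system; the `ViewGrant` type, its fields and the `may_query` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ViewGrant:
    """Approval for named users to query one data view until an expiry date."""
    project_id: str
    view_name: str
    users: frozenset
    expires: date


def may_query(grant: ViewGrant, user: str, view_name: str, today: date) -> bool:
    """Allow access only to the approved view, for approved users, before expiry."""
    return (user in grant.users
            and view_name == grant.view_name
            and today <= grant.expires)
```

Tying each grant to a single view and an expiry date means access lapses by default when a project ends, rather than needing to be revoked.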
Improving access: SAIL Gateway
[Network diagram: SAIL Gateway architecture. Labels recoverable from the figure: firewalled zones (SAIL Perimeter, SAIL DMZ, SAIL Inner, SAIL NHS, HIRU) and an NHS controlled firewall; web servers and the SAIL Gateway web server (help resources, configuration details); AD authentication; VPN servers; remote desktops reached over RDP; secure file transfer in and out; Guardian; project app server; internal file servers; VMware servers; WSUS server; a DB2 cluster (DB2 co-ordinator and DB2 nodes) behind a DB2 & SSH firewall, with DB2 database ports only; ftp and ssh links. Users shown: remote user (off campus), local analysis (on campus), NHS user, local tech team.]
SAIL Gateway: Critical features
Firewalled network
Windows XP desktops, one per user, running in a virtualised environment (VPN)
All desktops and servers are members of Active Directory, with specific group policies applied
Only remote desktop (RDP) allowed through the firewall to the Windows XP desktops
Localised file storage for Windows XP desktops, both private and shared between desktops within the Gateway
Ability to host application servers within the environment
Automated one-way transfer of data into the environment
Authorised limited transfer of data out of the environment
Objective 6
6) Scrutiny of data utilisation proposals
Collaboration Review System – applies to all uses
Information Governance Review Panel (IGRP)
British Medical Association
Public Health Wales
National Research Ethics Service
Informing Healthcare
Involving People
Objective 7
7) External verification of compliance with IG (Audit)
Important to:
Reassure DPOs and other partners
Gain recommendations for improvement
Conduct:
Policies and SOPs
Interviews
System verification
The SAIL system
[Diagram: the SAIL system. Labels recoverable from the figure: Data Sources (National Datasets, NHS, Social care, Others); Anonymisation service; Masking and encryption; SAIL databank; Views; Access controls; Disclosure control; SOPs and Policies; Project Request; Project View; Data Users. Parties shown: HSW, HIRU, IGRP.]
Subsequent refinements
• Role based access
• Technical, Senior Analyst, Approved Analyst, User / statistician, HSW technical
• SAIL Gateway
• Uploading, tool selection, performance
• Results out / approvals
• Wiki, help, training materials, code of conduct, messaging
• Tighter user agreements (line management sign off) & Clearer sanctions for misuse
• Purpose-built virtual IGRP committee technology
Data
• Data on 3 million people, ≈ 2 billion records, and growing!
• Historical data 5 – 20 years
• Maintains address history for full period (exposures)
• Most codified using ICD, Read codes, OPCS codes, SNOMED, etc.
• Many hundreds of separate data suppliers
• Free text a real (IG) challenge
• Unknown use of identifiers
• Potential for ‘risky’ comments
• Hard to analyse in quantity
• Now a major work stream
National datasets - examples
PEDW - in-patients & day cases and out-patients
National Community Child Health Database
NHS Direct Wales
Cancer incidence registry for Wales
National screening programmes
Congenital abnormalities
Ambulance service data
National Pupil Database (performance and attendance of children at School)
And much more…
Local datasets - examples
General Practice
Pathology
A&E departments
Social services
Local authority housing data – RALFs
And more….
Research datasets
Data collected as part of research studies where the aim is
to use routine data as well
Permissions, consent and regulatory approvals
Do not release SAIL data to researchers to link to study
datasets
Treat as dataset from DPO – study dataset anonymised
and loaded into SAIL for linkage with SAIL data
Clinical systems
Introduced new clinical systems to send data direct to SAIL
(via standard mechanisms)
Working with NHS Wales to introduce new national
systems
SAIL now central part of the NHS’s “secondary uses”
approaches – new data from new national systems e.g. -
radiology, pathology, emergency, etc.
Other advancements
Data collected directly from the people of Wales (and beyond) via
internet portals. Currently disease cohort specific, moving to all-Wales
SAIL data now linked to local histopathology sample archive (tissue
bank), with potential to link to national cancer (tissue) bank
Flow of imaging data (MRI, ECG, etc.) from local NHS providers
Set up of a public advocates group
Linkage of national (cross-sectional) surveys – consent issues
Genomics data under consideration (special IG issues!)
Increasingly used by NHS to monitor and plan services – change of use
Residential Anonymous Linking Fields (RALFs) . . .
RALFs
Desire to know more about:
The properties people live in (characteristics, proximity
to geographical features)
Who they live with (household relationships, familial
relationships etc)
A real problem to do while maintaining anonymity
Our Solution: RALFs
An ALF has a RALF, all RALFs have 1+ ALFs (usually)
[Diagram: RALF creation process. Steps recoverable from the figure:
a. Create environment metrics
b. KEY and addresses with environment metrics
c. Match incoming address data and attach RALFs
d. RALFs and environment metrics
e. Combination of RALFs with ALFs
Parties and inputs shown: OS Data, HIRU GIS, WDS, HSW, HIRU, SAIL, with two Encrypt steps.]
Residential Anonymous Linking Fields - RALFs
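Because each person (ALF) has a residence (RALF) and a residence can hold one or more people, household structure can be derived without any address ever being visible. The Python sketch below illustrates that grouping step only; the `(alf, ralf)` pair layout and the function names are assumptions, and the identifiers are made up.

```python
from collections import defaultdict


def households(residency):
    """Group ALFs by the RALF they live at.

    `residency` is an iterable of (alf, ralf) pairs for one point in time.
    """
    by_ralf = defaultdict(set)
    for alf, ralf in residency:
        by_ralf[ralf].add(alf)
    return dict(by_ralf)


def cohabitants(alf, residency):
    """Return the other ALFs sharing a RALF with `alf` (empty set if none)."""
    for members in households(residency).values():
        if alf in members:
            return members - {alf}
    return set()


# Made-up anonymised identifiers.
pairs = [("ALF1", "RALF_A"), ("ALF2", "RALF_A"), ("ALF3", "RALF_B")]
by_home = households(pairs)
```

Environment metrics attached to a RALF (step d in the process above) could then be joined onto every ALF in the same household through the same key.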
Summary
Privacy is not just about the individual – it sometimes
relates to the organisation
Preserving privacy reduces research utility
Finding the balance between privacy protection and
research utility is the key
There is no perfect balance
Thanks
Data providers - NHS organisations, local authorities and government agencies, and more
Health Solutions Wales
NHS Wales Informatics Service
National Institute for Social Care and Health Research
Welsh Government
Information Governance Review Panel
Researchers of Wales and beyond
And to you for listening!