Privacy Protection & the SAIL Databank David Ford ECCONET Data Linkage Workshop Bergen 15 – 17 June 2011

Page 1:

Privacy Protection & the SAIL Databank

David Ford

ECCONET Data Linkage Workshop

Bergen 15 – 17 June 2011

Page 2:

Overview

1. The SAIL system and how it operates

2. Privacy Protection Issues and Drivers

3. Privacy Protection approach

4. Current developments

5. Examples of research studies

6. Future work

Page 3:

What are HIRU and SAIL?

HIRU – the Health Information Research Unit

SAIL – Secure Anonymous Information Linkage

The main aim of HIRU is to realise the potential of electronically-held, routinely-collected, person-based data to conduct and support health-related studies

The SAIL databank already holds over 1 billion anonymised and encrypted individual-level records, from a range of sources relevant to health and well-being

Page 4:

Is SAIL a Cohort?

Perhaps!

Total population databank for the 3 million people of Wales

Multi-source data (administrative, clinical, research)

Many nested e-cohorts within SAIL (such as WECC)

Page 5:

Data Linkage is the key!

Data linkage (at a person level) is essential to reap the benefits of routine data

Good quality data linkage needs some form of consistent personal identifiers on which to link

In the UK, multi-source data do not share a common ID number. Names, address and date of birth ARE, however, normally collected.

Page 6:

In the beginning . . .

There was a real opportunity to create this data resource

We had established how linked, routine data was useful for research

We knew there were numerous technical (computing) challenges to overcome

But the idea required data owners / guardians to feel able to provide their data to SAIL.

Constructing the circumstances that enabled data guardians to supply data to SAIL became the single biggest challenge!

Page 7:

The issues facing SAIL

Data guardians across Wales:

Wanted to participate and saw the potential benefits

Were nervous of breaching the Data Protection Act

Did not have clear guidelines that helped

Needed a way of guaranteeing the privacy of their data

Were nervous about the uses to which the data might be put

Wanted access to be controlled (in some way)

Page 8:

The issues facing SAIL

Researchers (across the UK) wanted:

As much data as they wanted, whenever they wanted it

To avoid detailing what they wanted to do

Data delivered to them

Data to arrive quickly

No admin, no approvals, no constraints

Clean, easy and consistent data

Simple, flat data structures

Page 9:

Our response

Set a series of objectives

Undertook pilot work

Consulted very widely

Understood relevant legislation and good practice guidance (Information Governance)

Developed the approaches over time

Continued to consult and have external inspection

Continuous improvement process

Page 10:

The initial IG challenge

Matching up the same people in different datasets (data linkage) is very inaccurate without access to identifiers

(Matching with imperfect identifiers is still a challenge!)

Sophisticated but standardised matching was therefore required.

Data owners felt able to part with “anonymised at source” data. However, including identifiers in the supply was seen to be illegal without consent

Page 11:

Setting out

Pilot to prove the concept

One health economy area – Swansea (pop. c. 250k)

Data: General Practices (36), Patient Episode Database Wales (PEDW) and social services data extracts

Purpose: to develop, review and refine technical and procedural methodologies.

Page 12:

Setting out

Consultations with regulatory and professional bodies (local and national)

Suitability of system

In the public interest

Protection of patient privacy

Ethics and governance

Usefulness to enhance research and inform policy

Value for money

Exhaustive (and exhausting!) efforts

File of evidence of acceptability

Page 13:

The base level

Response: development of “Split-file anonymisation” technique

Using the “separation principle”

No flow of identifiable information to SAIL

No flow of identifiable confidential information to ANYONE

Clear, written, formal data sharing agreements

Clarity about use cases, conditions and exception clauses

Page 14:

Other design constraints

A pledge to data providers that no data will ever leave the databank

They can ask for it to be deleted

They know who has accessed it

They know what it has been used for

Page 15:

HIRU methodology (illustration)

[Figure: HIRU methodology. Recoverable labels: Data Provider (operational system); Demographic data only, to Health Solutions Wales (anonymisation process: construct ALF); Clinical / activity data, to HIRU (Blue C) (validate); Recombine; Other recombined data; Validated, anonymised data; Encrypt and load.]

Page 16:

Available Computing infrastructure

Blue C supercomputer, one of the fastest computers in Europe dedicated to Life Science research

Strategic partnership with IBM (through School of Medicine’s Institute of Life Sciences initiative)

Advanced software toolset (database, data mining, GIS)

Page 17:

Objectives

1) Secure data transportation

2) Reliable matching process

3) Anonymisation and encryption

4) Disclosure control

5) Data access controls

6) Scrutiny of data utilisation proposals

7) External verification of compliance with IG

Page 18:

Objective 1

1) Secure data transportation

Data transported using HTTPS (Hypertext Transfer Protocol Secure)

DPOs split datasets at source

Clinical details to HIRU (none to HSW)

Demographics to HSW for matching and anonymisation (Brown Envelope principle)

Linking key – re-join after anonymisation
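The split at source can be pictured with a short sketch. This is an illustration only, not the code used by the data providers, HIRU or HSW: the field names, the `split_record` helper and the choice of a random UUID as the linking key are all assumptions made for the example.

```python
import uuid

# Hypothetical field names; real supplier datasets will differ.
DEMOGRAPHIC_FIELDS = {"nhs_number", "first_name", "surname", "dob", "gender", "postcode"}


def split_record(record):
    """Split one source record into a demographic part and a clinical part.

    Both parts carry the same randomly generated linking key, so they can be
    re-joined once the demographic part has been replaced by an anonymous field.
    """
    link_key = uuid.uuid4().hex  # assumption: any unguessable per-record key would do
    demographic = {k: v for k, v in record.items() if k in DEMOGRAPHIC_FIELDS}
    clinical = {k: v for k, v in record.items() if k not in DEMOGRAPHIC_FIELDS}
    demographic["link_key"] = link_key
    clinical["link_key"] = link_key
    return demographic, clinical


# A made-up record for a fictitious person.
record = {
    "nhs_number": "1234567890",
    "first_name": "Gwen",
    "surname": "Evans",
    "dob": "1970-01-31",
    "gender": "F",
    "postcode": "ZZ1 1ZZ",
    "diagnosis": "J45",
    "admission_date": "2010-03-02",
}

to_hsw, to_hiru = split_record(record)  # demographics one way, clinical details the other
```

In this sketch neither recipient sees the whole record: `to_hsw` holds identifiers but no clinical content, and `to_hiru` holds clinical content but no identifiers.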

Page 19:

Objective 2

2) Reliable matching process

• Partnership with Trusted Third Party – HSW

• HSW = NHS Agency with right to hold identifiers for NHS admin purposes

• Use the Welsh Demographic Service administrative register as gold standard

• MACRAL (Matching Algorithm for Consistent Results in Anonymous Linkage) - SQL-based algorithm – sequential passes

• Deterministic and probabilistic record linkage

Page 20:

MACRAL

Exact match on valid NHS number

Exact match on firstname, surname, d.o.b, gender, postcode

Soundexing

Lexicon matching

Assigns match probability on Bayesian model

Informs analysts
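The idea of sequential passes can be sketched as below. This is not MACRAL itself (which the slides describe as SQL-based): it is a small Python stand-in covering only the first three passes, with hypothetical field names, a hypothetical `match` helper and a standard Soundex implementation. Lexicon matching and the Bayesian match probability are left out.

```python
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Standard four-character Soundex code (e.g. Robert and Rupert share a code)."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (result + "000")[:4]


EXACT_FIELDS = ("first_name", "surname", "dob", "gender", "postcode")


def match(record, register):
    """Try each pass in turn; return (person_id, pass_name) or (None, None)."""
    # Pass 1: exact match on NHS number, when one is supplied.
    if record.get("nhs_number"):
        for person in register:
            if person["nhs_number"] == record["nhs_number"]:
                return person["person_id"], "nhs_number"
    # Pass 2: exact match on all demographic fields.
    for person in register:
        if all(record.get(f) == person[f] for f in EXACT_FIELDS):
            return person["person_id"], "exact_demographics"
    # Pass 3: Soundex on names, exact on date of birth and gender.
    for person in register:
        if (soundex(record.get("surname", "")) == soundex(person["surname"])
                and soundex(record.get("first_name", "")) == soundex(person["first_name"])
                and record.get("dob") == person["dob"]
                and record.get("gender") == person["gender"]):
            return person["person_id"], "soundex"
    return None, None


# Made-up register entry and incoming record for a fictitious person.
register = [{"person_id": "P1", "nhs_number": "1234567890", "first_name": "Gwen",
             "surname": "Evans", "dob": "1970-01-31", "gender": "F", "postcode": "ZZ1 1ZZ"}]
incoming = {"nhs_number": "", "first_name": "Gwen", "surname": "Evens",
            "dob": "1970-01-31", "gender": "F", "postcode": "ZZ9 9ZZ"}

result = match(incoming, register)
```

Here the incoming record has no NHS number and a misspelt surname, so the first two passes fail and the record is picked up by the Soundex pass. Recording which pass produced the match is one way the result could inform analysts about match quality.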

Page 21:

Validation and optimisation

Firstly – Validation exercise: obtained specificity values >99.8% and sensitivity >94.6%, with error rates <0.2%

Then – Optimised techniques for matching a variety of datasets: primary care (GP), hospital/secondary care (PEDW), and social care (PARIS)
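For readers less familiar with the two measures, the sketch below shows how sensitivity and specificity are conventionally computed from a validation exercise in which the true match status is known. The counts are invented purely for illustration and are not the SAIL validation data.

```python
def sensitivity(true_pos, false_neg):
    """Share of genuine matches that the algorithm found."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of genuine non-matches that the algorithm correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Invented counts, for illustration only.
sens = sensitivity(true_pos=950, false_neg=50)
spec = specificity(true_neg=998, false_pos=2)
```

A linkage system tuned for high specificity, as here, accepts missing some true matches in exchange for very few false links.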

Page 22:

Matching rates

GP = Primary Care General Practice (GP dataset); PEDW = Secondary Care Hospital Admissions (PEDW dataset); PARIS = Social Services (PARIS database). DRL = deterministic record linkage; PRL = probabilistic record linkage.

Levels of matched records                     GP Number     GP %          PEDW Number   PEDW %        PARIS Number  PARIS %
Sample size                                   229,127                     290,650                     18,540
Valid NHS Number                              229,117       99.996%       264,868       91.13%        -             0.00%
Valid NHS Number plus DRL                     229,123       99.998%       280,729       96.59%        14,158        76.36%
Valid NHS Number plus PRL (99% cut off)       229,125       99.999%       287,572       98.94%        17,095        92.21%
Valid NHS Number plus PRL (95% cut off)       229,125       99.999%       288,186       99.15%        17,431        94.02%
Valid NHS Number plus PRL (90% cut off)       229,125       99.999%       288,424       99.23%        17,553        94.68%
Valid NHS Number plus PRL (50% cut off)       229,125       99.999%       288,670       99.32%        17,639        95.14%
Overall combining Valid NHS, DRL & PRL (50%)  229,125       99.999%       288,683       99.32%        17,642        95.16%
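The percentages are simply matched records divided by sample size. As a quick check, the sketch below recomputes the overall rates from the counts on the slide; the variable names are ours.

```python
# Counts taken from the slide: sample sizes and overall matched records.
sample_size = {"GP": 229_127, "PEDW": 290_650, "PARIS": 18_540}
matched_overall = {"GP": 229_125, "PEDW": 288_683, "PARIS": 17_642}

# Match rate as a percentage of the sample, per dataset.
match_rate = {
    name: 100 * matched_overall[name] / sample_size[name]
    for name in sample_size
}

for name, rate in match_rate.items():
    print(f"{name}: {rate:.3f}%")
```

The recomputed values agree with the slide (about 99.999%, 99.32% and 95.16%). The social services data start with no valid NHS numbers at all, so every PARIS match comes from the deterministic and probabilistic passes.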

Page 23:

Objective 3

3) Anonymisation and encryption

Anonymous Linking Field (ALF)

One person – one ALF

Aggregation and categorisation

Further processing at HIRU

Into ALF_E

Recombination
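The slides do not say how the ALF or ALF_E is derived, so the following is only one plausible way to picture a two-stage scheme. It assumes a keyed hash (HMAC-SHA256) at each stage, with the first key held by the trusted third party and the second by the databank; the key values and the `pseudonymise` helper are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Return a keyed hash of the identifier; the same input and key always give the same output."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical keys; in practice each would be held by one organisation only.
THIRD_PARTY_KEY = b"known-only-to-the-trusted-third-party"
DATABANK_KEY = b"known-only-to-the-databank"

# Stage 1 (assumed to happen at the trusted third party): identifier -> ALF.
alf = pseudonymise("1234567890", THIRD_PARTY_KEY)

# Stage 2 (assumed to happen at the databank): ALF -> ALF_E.
alf_e = pseudonymise(alf, DATABANK_KEY)
```

Under these assumptions the "one person – one ALF" property holds because the hash is deterministic, and neither organisation alone can map an identifier to the ALF_E stored in the databank.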

Page 24:

Objective 4

4) Disclosure controls

Assessment of Uniques and low-copy numbers

Data reduced to minimum required for study

Operated at various stages:

When the data view is created

Before dissemination

Numerical Evaluation of Multiple Outputs

Combination of expert review and machine processes

Page 25:

Numerical Evaluation of Multiple Outputs

NEMO

SQL-based algorithm

Counts unique and low-copy number records

Allows the judicious application of suppression and/or aggregation

Manual review

Linkage/Homogeneity attack
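Counting unique and low-copy-number records can be sketched as follows. NEMO itself is SQL-based and its details are not given in the slides; this Python stand-in, the `low_copy_groups` helper, the column names and the threshold of 5 are all assumptions for illustration.

```python
from collections import Counter


def low_copy_groups(rows, keys, threshold=5):
    """Count rows per combination of the given columns and return the
    combinations that occur fewer than `threshold` times."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < threshold}


# Invented data view: six people in one group, two in another, one unique.
rows = (
    [{"age_band": "60-69", "area": "A"}] * 6
    + [{"age_band": "30-39", "area": "A"}] * 2
    + [{"age_band": "90+", "area": "B"}]
)

flagged = low_copy_groups(rows, keys=("age_band", "area"))
```

Combinations returned in `flagged` are candidates for suppression or for aggregation into wider bands before a view is created or results are released, subject to the manual review mentioned above.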

Page 26:

Objective 5

5) Data access controls

Technical and permission-based control

Policies and Standard Operating Procedures (SOPs)

User agreements – clarity + penalties

Project-based approvals and linked access

Physical restrictions - technology

Time-limited, specific data views per approved project

SAIL Gateway

Page 27:

Improving access: SAIL Gateway

[Figure: SAIL Gateway network diagram. Recoverable labels: SAIL Perimeter firewall; SAIL DMZ firewall; SAIL Inner firewall; SAIL NHS firewall; NHS controlled firewall; DB2 & SSH firewall (DB2 database ports only); DB2 Co-ord and DB2 Nodes; Web Servers; SAIL Gateway Web server (Help Resources, Configuration Details); AD authentication; Remote Desktops (RDP); VPN Servers; Project App Server; Internal Fileservers; VMware Servers; WSUS Server; Secure file transfer In and Out (ftp, ssh); Guardian; Local Tec Team; HIRU; Remote User (off campus); Local Analysis (on campus); NHS User.]

Page 28:

SAIL Gateway: Critical features

Firewalled network

Windows XP desktops, one per user, running in a virtualised environment (VPN)

All desktops and servers are members of Active Directory, with specific group policies applied

Only remote desktop (RDP) allowed through the firewall to the Windows XP desktops

Localised file storage for Windows XP desktops, both private and shared between desktops within the Gateway

Ability to host application servers within the environment

Automated one-way transfer of data into the environment

Authorised limited transfer of data out of the environment

Page 29:

Objective 6

6) Scrutiny of data utilisation proposals

Collaboration Review System – applies to all uses

Information Governance Review Panel (IGRP)

British Medical Association

Public Health Wales

National Research Ethics Service

Informing Healthcare

Involving People

Page 30:

Objective 7

7) External verification of compliance with IG (Audit)

Important to:

Reassure DPOs and other partners

Gain recommendations for improvement

Conduct:

Policies and SOPs

Interviews

System verification

Page 31:

The SAIL system

[Figure: SAIL system overview. Recoverable labels: Data Sources; NHS; Social care; National Datasets; Others; HSW; Anonymisation service; HIRU; Masking and encryption; SAIL databank; Views; Access controls; Disclosure control; SOPs and Policies; HIRU & IGRP; IGRP; Project Request; Project View; Data Users.]

Page 32:

Subsequent refinements

• Role based access

• Technical, Senior Analyst, Approved Analyst, User / statistician, HSW technical

• SAIL Gateway

• Uploading, tool selection, performance

• Results out / approvals

• Wiki, help, training materials, code of conduct, messaging

• Tighter user agreements (line management sign off) & Clearer sanctions for misuse

• Purpose-built virtual IGRP committee technology

Page 33:

Data

• Data on 3 million people, ≈ 2 billion records, and growing!

• Historical data 5 – 20 years

• Maintains address history for full period (exposures)

• Most codified using ICD, Read codes, OPCS codes, SNOMED, etc.

• Many hundreds of separate data suppliers

• Free text a real (IG) challenge

• Unknown use of identifiers

• Potential for ‘risky’ comments

• Hard to analyse in quantity

• Now a major work stream

Page 34:

National datasets - examples

PEDW - in-patients & day cases and out-patients

National Community Child Health Database

NHS Direct Wales

Cancer incidence registry for Wales

National screening programmes

Congenital abnormalities

Ambulance service data

National Pupil Database (performance and attendance of children at School)

And much more…

Page 35:

Local datasets - examples

General Practice

Pathology

A&E departments

Social services

Local authority housing data – RALFs

And more….

Page 36:

Research datasets

Data collected as part of research studies where the aim is to use routine data as well

Permissions, consent and regulatory approvals

Do not release SAIL data to researchers to link to study datasets

Treat as dataset from DPO – study dataset anonymised and loaded into SAIL for linkage with SAIL data

Page 37:

Clinical systems

Introduced new clinical systems to send data direct to SAIL (via standard mechanisms)

Working with NHS Wales to introduce new national systems

SAIL now central part of the NHS’s “secondary uses” approaches – new data from new national systems, e.g. radiology, pathology, emergency, etc.

Page 38:

Other advancements

Data collected directly from the people of Wales (and beyond) via internet portals. Currently disease cohort specific, moving to all-Wales

SAIL data now linked to local histopathology sample archive (tissue bank), with potential to link to national cancer (tissue) bank

Flow of imaging data (MRI, ECG, etc.) from local NHS providers

Set up of a public advocates group

Linkage of national (cross-sectional) surveys – consent issues

Genomics data under consideration (special IG issues!)

Increasingly used by NHS to monitor and plan services – change of use

Residential Anonymous Linking Fields (RALFs) . . .

Page 39:

RALFs

Desire to know more about:

The properties people live in (characteristics, proximity to geographical features)

Who they live with (household relationships, familial relationships etc)

A real problem to do while maintaining anonymity

Our Solution: RALFs

An ALF has a RALF, all RALFs have 1+ ALFs (usually)
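The relationship between ALFs and RALFs can be pictured as a simple lookup. The sketch below is illustrative only: the identifiers are invented, and the `households` helper is our own name rather than anything in SAIL.

```python
from collections import defaultdict

# Invented anonymous identifiers: each person (ALF) maps to one residence (RALF).
alf_to_ralf = {
    "alf_001": "ralf_A",
    "alf_002": "ralf_A",
    "alf_003": "ralf_B",
}


def households(alf_to_ralf):
    """Group ALFs by RALF, giving the set of people sharing each residence."""
    groups = defaultdict(set)
    for alf, ralf in alf_to_ralf.items():
        groups[ralf].add(alf)
    return dict(groups)


by_residence = households(alf_to_ralf)
```

Because both fields are anonymous, household membership and property-level characteristics can be analysed without an address ever appearing in the data view.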

Page 40:

Residential Anonymous Linking Fields - RALFs

[Figure: RALF creation flow. Recoverable labels: OS Data; HIRU GIS; WDS; HSW; HIRU; Encrypt; SAIL. Steps: a. Create environment metrics; b. KEY and addresses with environment metrics; c. Match incoming address data and attach RALFs; d. RALFs and environment metrics; e. Combination of RALFs with ALFs.]

Page 41:

Methodology references - Architecture

Page 42:

Methodology references - Matching

Page 43:

Methodology references - RALFs

Page 44:

Summary

Privacy is not just about the individual – it sometimes relates to the organisation

Preserving privacy reduces research utility

Finding the balance between privacy protection and research utility is the key

There is no perfect balance

Page 45:

Thanks

Data providers - NHS organisations, local authorities and government agencies, and more

Health Solutions Wales

NHS Wales Informatics Service

National Institute for Social Care and Health Research

Welsh Government

Information Governance Review Panel

Researchers of Wales and beyond

And to you for listening!

Page 46:

Thanks