IBM Research IMIA Conference – Security in Health Information Systems | April 29, 2006 Securing...

29
IBM Research IMIA Conference – Security in Health Information Systems | April 29, 2006 Securing Electronic Health Records without Impeding the Flow of Information Christopher Johnson IBM Almaden Research Center San Jose, CA [email protected] Rakesh Agrawal * Microsoft Search Labs Mountain View, CA rakesha@microsoft. com * Based on work done while author was at IBM Almaden

Transcript of IBM Research IMIA Conference – Security in Health Information Systems | April 29, 2006 Securing...

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Securing Electronic Health Records without Impeding the Flow of Information

Christopher JohnsonIBM Almaden Research CenterSan Jose, [email protected]

Rakesh Agrawal*

Microsoft Search LabsMountain View, [email protected]

* Based on work done while author was at IBM Almaden

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Based on joint work with

Roberto Bayardo Alvin Cheung Alexandre Evfimievski Tyrone Grandison Jerry Kiernan Kristen Lefevre Ramakrishnan Srikant Yirong Xu

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Thesis

Technology alone cannot solve the complex problem of securely managing the health information; at the same time, policy and law needs to be informed of what is technically feasible and in what timeframe.

By advancing technology, we can:– change the mix of legislation, societal norms,

market forces, and technology comprising the solution; and

– improve the overall quality of the solution.

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Outline

Illustrate thesis with technology examples based on Hippocratic database work

Recommendations for– Policy designers and legislators

– Solution developers

– Scientists and researchers

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Hippocratic Database Technologies

Sovereign Information Integration

Selective, minimal sharing across autonomous data sources, without trusted

third party

Optimal k-anonymization

De-identifies records in a way that maintains truthful

data but is not prone to data linkage attacks

GOAL

Create a new generation of information systems that protect the privacy, security, and ownership of data while not impeding the flow of information.

Active EnforcementData item level

enforcement of disclosure policies and patient

preferences

Compliance AuditingDetermine whether data has been accessed in violation

of specified policies

Privacy-Preserving Data Mining

Preserves privacy at individual level, allowing

accurate data mining models at aggregate level

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

DATABASE

Application DataRetrieval

EnforcementJDBC/ODBC Driver

Patient Records

Patient Preferences& Data Collection

NegotiationPatient Preferences& Policy Matching

Installed Policy

Policy

Creation

InstallationPolicyParser

-40Daniel4

333-3333-Bob3

111-111125Adam1

PhoneAgeName#

• Provides cell-level disclosure control.

• Application modification not required.

• Database agnostic; does not require changes to the database engine.

• Active Enforcement system intercepts and rewrites incoming queries to comply with policies, subject choices, and context.

• Rewritten queries benefit from all of the optimizations and performance enhancements provided by the underlying engine (e.g. parallelism).

VLDB 02, WWW 03, VLDB 04

Active Enforcement• Privacy Policy: Organizations define a set of policies describing who may access data (users or roles), for what purposes the data may be accessed (purposes) and to whom the data may be disclosed (recipients).

• Consent: Data subjects are given control, through opt-in and opt-out choices, over who may see their data and under what circumstances

• Disclosure Control: Database enforces privacy policies and data subject consent choices with respect to all data access.

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Query Modification Example (Disclose Name only of Patients who have opted-in)

SELECT Name FROM Patients WHERE Age < 20

SELECTCASE WHEN EXISTS

(SELECT Name_Choice FROM Patient_Choices WHERE Patients.Patient# = Patient_Choices.Patient# AND Patient_Choices.Name_Choice = 1)

THEN Name ELSE null ENDFROM PatientsWHERE Age < 20AND EXISTS

(SELECT Patient#_Choice FROM Patient_Choices WHERE Patients.Patient# = Patient_Choices.Patient# AND Patient_Choices.Patient#_Choice = 1)

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

0

10

20

30

40

0 20 40 60 80 100Choice Selectivity (%)

Ela

pse

d T

ime

(sec

on

ds)

Modified External MultipleUnmodified

Modified Internal

Measured performance of a query selecting all records from a 5 million-record table Compared performance of original and modified queries for varied choice selectivity Not surprisingly, performance actually better for modified queries when we use

privacy enforcement as an additional selection condition– Able to use indexes on choice values

Shows the importance of database-level privacy enforcement for performance

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Audit Scenario

Jane complains to the department of Health and Human Services saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes

The doctor must now review disclosures of Jane’s information in order to understand the circumstances of the disclosure, and take appropriate action

Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over the counter diabetes tests

Jane has not been feeling well and decides to consult her doctor

The doctor uncovers that Jane’s blood sugar level is high and suspects diabetes

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Audit Expression

audit T.disease

from Customer C, Treatment T

where C.cid=T.pcid and C.name = ‘Jane’

Who has accessed Jane’s disease information?

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Problem Statement

Given– A log of queries

– An audit expression specifying sensitive data

NOT Given– Log of data accesses

Precisely and Efficiently identify– Those queries that accessed the data

specified by the audit expression in the past

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Compliance Auditing

DataTables

2004-02…

2004-02…

Timestamp

S. RobertsTreatmentS. RobertsSelect …2

PharmaCo.MarketingB. JonesSelect …1

RecipientPurposeUserQueryID

Query Audit Log

DatabaseLayer

Query with purpose, recipient

Generate audit recordfor each query

Updates, inserts, delete

Backlog

Database triggers track updates to base tables

Audit

DatabaseLayer

Audit queryIDs of log queries having accessed data specified by the audit query

• Audits whether particular data has been disclosed in violation of the specified policies.

• Audit expression specifies what potential data disclosures need monitoring.

• Identifies logged queries that accessed the specified data.

• Auditors can analyze the circumstances of violations.

• Make necessary corrections to procedures, policies, security.

VLDB 04

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Overhead on Updates

0

50

100

150

200

250

5 20 35 50

# of versions per tuple

Tim

e (

min

ute

s)

CompositeSimpleNo IndexNo Triggers

7x if all tuples are updates3x if a single tuple is updated

Negligibleby usingRecovery

Log to buildBacklog tables

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Audit Query Execution Time

1

10

100

1000

# versions per tuple

Tim

e (

mse

c.)

Simple-ISimple-CComposite-IComposite-C

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Privacy Preserving Data Mining

0

200

400

600

800

1000

1200

2 10 18 26 34 42 50 58 66 74 82

Original Randomized Reconstructed

128 | 130 | ... 126 | 210 | ...

Randomizer Randomizer

161 | 165 | ... 129 | 190 | ...

Reconstructdistribution

of LDL

Reconstructdistributionof weight

Data Mining Algorithms

Data Mining Model

Kevin’s LDL

Kevin’s weight

Julie’s LDL

126+35

0

20

40

60

80

100

120

10 20 40 60 80 100 150 200

Randomization Level

Original Randomized Reconstructed

Preserves privacy at the individual patient level, but allows accurate data mining models to be constructed at the aggregate level.

Adds random noise to individual values to protect patient privacy.

EM algorithm estimates original distribution of values given randomized values + randomization function.

Algorithms for building classification models and discovering association rules on top of privacy-preserved data with only small loss of accuracy.

Sigmod00, KDD02, Sigmod05

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Optimal k-Anonymization

Goal: De-identify patient data such that it retains its integrity, but is resistant to data linkage attacks.

Motivation: Naïve de-identification methods are prone to data linkage attacks, which combine subject data with publicly available information to re-identify represented individuals.

Samarati and Sweeney k-Anonymity* Method

– A k-anonymized data set has the property that each record is indistinguishable from at least k-1 other records within the data set.

Optimal k-Anonymization

– We have developed a k-anonymization algorithm that finds optimal k-anonymizations under two representative cost measures and variations of k.

* P. Samarati and L. Sweeney. “Generalizing Data to Provide Anonymity when Disclosing Information.” In Proc. of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on the Principles of Database Systems, 188, 1998.

Process of k-Anonymization

• Data Suppression - Involves deleting particular cell values or entire tuples.

• Value Generalization - Entails replacing specific values, such as a telephone number, with more general ones, such as the area code alone.

Advantages of Optimal k-anonymization

• Truthful - Unlike other disclosure protection techniques that use data scrambling, swapping, or adding noise, all information within a k-anonymized dataset is truthful.

• Secure - More secure than other de-identification methods, which may inadvertently reveal confidential information.

(k=2, on name, address, age)

13, rue des Canettes

Name

Eric

Paul Hypertens.

2821, rue du Mont DoreHenri

7, rue du Mont Dore

Marc

42

26

Diagnosis AgeAddress

Diabetes

Influenza

Asthma

4748, rue du Four

Paris

City

Paris

Paris

Paris

6th Arrond.

Name

*

* Hypertens.

20-2917th Arrond.*

17th Arrond.

*

40-49

20-29

Diagnosis AgeAddress

Diabetes

Influenza

Asthma

40-496th Arrond.

Paris

City

Paris

Paris

Paris

ICDE05

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Sigmod 03, DIVO 04

Sovereign Information Integration Separate databases due to statutory,

competitive, or security reasons. Selective, minimal sharing on a

need-to-know basis.

Example: Among those patients who took a particular drug, how many with a specified DNA sequence had an adverse reaction? Researchers must not learn anything

beyond counts.

• Algorithms for computing joins and join counts while revealing minimal additional information.

Minimal Necessary Sharing

R S R must not

know that S has b and y

S must not know that R has a and x

v

u

RS

x

v

u

a

y

v

u

b

R

S

Count (R S) R and S do not learn

anything except that the result is 2.Medical

ResearchInst.

DNA Sequences

DrugReactions

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Recommendations

Policy Makers & legislators– Continuous technology monitoring and understanding to inform policies

and laws (current and new)

– Invest in research

Solution Developers (Technologists)– Design-in ethical considerations (e.g. respect for privacy, safeguard

against misuse); they can’t be afterthoughts

– Engage in dialog with policy makers and legislators to educate them on performance implications of the policies/laws

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Recommendations for Researchers

Asking questions is easy:it's answering them that's

hard.

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Policy Specification

How to determine if the policy specification accurately captures the intent of the policy maker? (The person specifying the policy is usually not a computer scientist.)

How to help the patient understand the policy and the implications of his or her choices?

How to design a policy language that reconciles the goals of understandability and efficient computation?

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Sticky Policies

Healthcare organizations should be assured that original policy controls will be enforced over data after transfer to other entities.

Transferees of patient data should be capable of applying source disclosure policies to any information in its database.

Database should enforce source and enterprise policies and resolve any conflicts among policies.

Patient data + policy annotations

Hospital 1

policies

patient data

Hospital 2

Patient Records DB

Data compliantwith source and enterprise policies

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Data Pointillism

Name Phone

Bob 394-1015

Alice 396-1012

Alice 396-1112

Phone Address City

396-1012 Maple St Chatham

394-1015 - Madison

396-1112 Maple St Madison

Patient Policy#

Alice AAA1035

Bob AAA1035

Alice UHG1035

Bob 394-1015 Maple St Madison AAA1035

Alice 396-1012 Maple St Chatham UHG1035

Alice 396-1112 Maple St Madison AAA1035

Pointillist • > 14B records with Choicepoint

• Data from > 22,000 sources in RDC’s GRID

• >550 companies compiling databases of pvt information

• Accuracy? Limits?

• How to allow someone to verify data?

•Identifying and correcting errors?

• Usage control?

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Massively Distributed Data Management

What if patient data is stored on personal devices? Pervasive monitoring devices will also collect patient data. How to protect the security of these devices? Enable selective sharing of information stored on devices? Distributed backup in the network to prevent data loss?

512MB SanDisk Cruzer $47.99

Transcend 40GB Portable Hard Disk USB 95mm x 71.5mm x 15mm, $189

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Data Life Cycle Management

Healthcare organizations must define data retention policies based on legal requirements and patient specifications:

–HIPAA: 6 years (21 years for pediatric care).

–Medicare: 5 to 7 years

–AHA & AHIMA: at least 10 years

Data compression vs. encryption

How to remove expired data and forget persistent data?

How to establish truthfulness of data?

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Interoperability

Sovereign health information systems must be able to communicate among one another, using standard data formats and clinical vocabularies.

Examples of current efforts include:–HL7 messaging standards

–SNOMED-CT vocabularies

–CDA and CCR document standards

Much work remains to be done to make systems interoperable.

Mass collaboration might be useful in defining clinical vocabularies and taxonomies.

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Concluding Remarks

Hippocratic Database technologies protect the security of electronic health records and patient privacy without impeding the flow of information.

We need not sacrifice security or privacy to gain value from EHRs for diagnosis, treatment, and research.

We must focus on:– Deriving value from bits we know how

to manage.

– Demonstrating what could not be done before.

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

Thank you!

Papers: rakesh.agrawal-family.comCollaborations:

[email protected]@us.ibm.com

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

ReferencesActive Enforcement

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. “Hippocratic Databases.” 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.

K. Lefevre, R. Agrawal, V. Ercegovac, R. Ramakrishnan, Y. Xu, D. DeWitt. "Limiting Disclosure in Hippocratic Databases". Proc. of the 30th Int'l Conf. on Very Large Databases (VLDB 2004), Toronto, Canada, August 2004.

Compliance Auditing

R. Agrawal, R. Bayardo, C. Faloutsos, J. Kiernan, R. Rantzau and R. Srikant. “Auditing Compliance with a Hippocratic Database.” Proc. of the 30th Int'l Conf. on Very Large Databases (VLDB 2004), Toronto, Canada, August 2004.

Privacy-Preserving Data Mining

R. Agrawal and R. Srikant. "Privacy-Preserving Data Mining". Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.

A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. "Privacy Preserving Mining of Association Rules". Proc. of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery in Databases and Data Mining, Edmonton, Canada, July 2002.

IBM Research

IMIA Conference – Security in Health Information Systems | April 29, 2006

ReferencesOptimal k-Anonymization

R. J. Bayardo and R. Agrawal. "Data Privacy Through Optimal k-Anonymization". To appear in Proc. of the 21st Int'l Conf. on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005.

Sovereign Information Integration

R. Agrawal, A. Evfimievski, R. Srikant. “Information Sharing Across Private Databases.” ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003.

R. Agrawal, D. Asonov and R. Srikant. "Enabling Sovereign Information Sharing Using Web Services". Proc. of the ACM SIGMOD Conference on Management of Data, Paris, France, June 2004.