IBM Research IMIA Conference – Security in Health Information Systems | April 29, 2006 Securing...
-
Upload
harvey-nelson -
Category
Documents
-
view
213 -
download
0
Transcript of IBM Research IMIA Conference – Security in Health Information Systems | April 29, 2006 Securing...
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Securing Electronic Health Records without Impeding the Flow of Information
Christopher JohnsonIBM Almaden Research CenterSan Jose, [email protected]
Rakesh Agrawal*
Microsoft Search LabsMountain View, [email protected]
* Based on work done while author was at IBM Almaden
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Based on joint work with
Roberto Bayardo Alvin Cheung Alexandre Evfimievski Tyrone Grandison Jerry Kiernan Kristen Lefevre Ramakrishnan Srikant Yirong Xu
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Thesis
Technology alone cannot solve the complex problem of securely managing the health information; at the same time, policy and law needs to be informed of what is technically feasible and in what timeframe.
By advancing technology, we can:– change the mix of legislation, societal norms,
market forces, and technology comprising the solution; and
– improve the overall quality of the solution.
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Outline
Illustrate thesis with technology examples based on Hippocratic database work
Recommendations for– Policy designers and legislators
– Solution developers
– Scientists and researchers
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Hippocratic Database Technologies
Sovereign Information Integration
Selective, minimal sharing across autonomous data sources, without trusted
third party
Optimal k-anonymization
De-identifies records in a way that maintains truthful
data but is not prone to data linkage attacks
GOAL
Create a new generation of information systems that protect the privacy, security, and ownership of data while not impeding the flow of information.
Active EnforcementData item level
enforcement of disclosure policies and patient
preferences
Compliance AuditingDetermine whether data has been accessed in violation
of specified policies
Privacy-Preserving Data Mining
Preserves privacy at individual level, allowing
accurate data mining models at aggregate level
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
DATABASE
Application DataRetrieval
EnforcementJDBC/ODBC Driver
Patient Records
Patient Preferences& Data Collection
NegotiationPatient Preferences& Policy Matching
Installed Policy
Policy
Creation
InstallationPolicyParser
-40Daniel4
333-3333-Bob3
111-111125Adam1
PhoneAgeName#
• Provides cell-level disclosure control.
• Application modification not required.
• Database agnostic; does not require changes to the database engine.
• Active Enforcement system intercepts and rewrites incoming queries to comply with policies, subject choices, and context.
• Rewritten queries benefit from all of the optimizations and performance enhancements provided by the underlying engine (e.g. parallelism).
VLDB 02, WWW 03, VLDB 04
Active Enforcement• Privacy Policy: Organizations define a set of policies describing who may access data (users or roles), for what purposes the data may be accessed (purposes) and to whom the data may be disclosed (recipients).
• Consent: Data subjects are given control, through opt-in and opt-out choices, over who may see their data and under what circumstances
• Disclosure Control: Database enforces privacy policies and data subject consent choices with respect to all data access.
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Query Modification Example (Disclose Name only of Patients who have opted-in)
SELECT Name FROM Patients WHERE Age < 20
SELECTCASE WHEN EXISTS
(SELECT Name_Choice FROM Patient_Choices WHERE Patients.Patient# = Patient_Choices.Patient# AND Patient_Choices.Name_Choice = 1)
THEN Name ELSE null ENDFROM PatientsWHERE Age < 20AND EXISTS
(SELECT Patient#_Choice FROM Patient_Choices WHERE Patients.Patient# = Patient_Choices.Patient# AND Patient_Choices.Patient#_Choice = 1)
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
0
10
20
30
40
0 20 40 60 80 100Choice Selectivity (%)
Ela
pse
d T
ime
(sec
on
ds)
Modified External MultipleUnmodified
Modified Internal
Measured performance of a query selecting all records from a 5 million-record table Compared performance of original and modified queries for varied choice selectivity Not surprisingly, performance actually better for modified queries when we use
privacy enforcement as an additional selection condition– Able to use indexes on choice values
Shows the importance of database-level privacy enforcement for performance
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Audit Scenario
Jane complains to the department of Health and Human Services saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes
The doctor must now review disclosures of Jane’s information in order to understand the circumstances of the disclosure, and take appropriate action
Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over the counter diabetes tests
Jane has not been feeling well and decides to consult her doctor
The doctor uncovers that Jane’s blood sugar level is high and suspects diabetes
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Audit Expression
audit T.disease
from Customer C, Treatment T
where C.cid=T.pcid and C.name = ‘Jane’
Who has accessed Jane’s disease information?
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Problem Statement
Given– A log of queries
– An audit expression specifying sensitive data
NOT Given– Log of data accesses
Precisely and Efficiently identify– Those queries that accessed the data
specified by the audit expression in the past
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Compliance Auditing
DataTables
2004-02…
2004-02…
Timestamp
S. RobertsTreatmentS. RobertsSelect …2
PharmaCo.MarketingB. JonesSelect …1
RecipientPurposeUserQueryID
Query Audit Log
DatabaseLayer
Query with purpose, recipient
Generate audit recordfor each query
Updates, inserts, delete
Backlog
Database triggers track updates to base tables
Audit
DatabaseLayer
Audit queryIDs of log queries having accessed data specified by the audit query
• Audits whether particular data has been disclosed in violation of the specified policies.
• Audit expression specifies what potential data disclosures need monitoring.
• Identifies logged queries that accessed the specified data.
• Auditors can analyze the circumstances of violations.
• Make necessary corrections to procedures, policies, security.
VLDB 04
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Overhead on Updates
0
50
100
150
200
250
5 20 35 50
# of versions per tuple
Tim
e (
min
ute
s)
CompositeSimpleNo IndexNo Triggers
7x if all tuples are updates3x if a single tuple is updated
Negligibleby usingRecovery
Log to buildBacklog tables
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Audit Query Execution Time
1
10
100
1000
# versions per tuple
Tim
e (
mse
c.)
Simple-ISimple-CComposite-IComposite-C
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Privacy Preserving Data Mining
0
200
400
600
800
1000
1200
2 10 18 26 34 42 50 58 66 74 82
Original Randomized Reconstructed
128 | 130 | ... 126 | 210 | ...
Randomizer Randomizer
161 | 165 | ... 129 | 190 | ...
Reconstructdistribution
of LDL
Reconstructdistributionof weight
Data Mining Algorithms
Data Mining Model
Kevin’s LDL
Kevin’s weight
Julie’s LDL
126+35
0
20
40
60
80
100
120
10 20 40 60 80 100 150 200
Randomization Level
Original Randomized Reconstructed
Preserves privacy at the individual patient level, but allows accurate data mining models to be constructed at the aggregate level.
Adds random noise to individual values to protect patient privacy.
EM algorithm estimates original distribution of values given randomized values + randomization function.
Algorithms for building classification models and discovering association rules on top of privacy-preserved data with only small loss of accuracy.
Sigmod00, KDD02, Sigmod05
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Optimal k-Anonymization
Goal: De-identify patient data such that it retains its integrity, but is resistant to data linkage attacks.
Motivation: Naïve de-identification methods are prone to data linkage attacks, which combine subject data with publicly available information to re-identify represented individuals.
Samarati and Sweeney k-Anonymity* Method
– A k-anonymized data set has the property that each record is indistinguishable from at least k-1 other records within the data set.
Optimal k-Anonymization
– We have developed a k-anonymization algorithm that finds optimal k-anonymizations under two representative cost measures and variations of k.
* P. Samarati and L. Sweeney. “Generalizing Data to Provide Anonymity when Disclosing Information.” In Proc. of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on the Principles of Database Systems, 188, 1998.
Process of k-Anonymization
• Data Suppression - Involves deleting particular cell values or entire tuples.
• Value Generalization - Entails replacing specific values, such as a telephone number, with more general ones, such as the area code alone.
Advantages of Optimal k-anonymization
• Truthful - Unlike other disclosure protection techniques that use data scrambling, swapping, or adding noise, all information within a k-anonymized dataset is truthful.
• Secure - More secure than other de-identification methods, which may inadvertently reveal confidential information.
(k=2, on name, address, age)
13, rue des Canettes
Name
Eric
Paul Hypertens.
2821, rue du Mont DoreHenri
7, rue du Mont Dore
Marc
42
26
Diagnosis AgeAddress
Diabetes
Influenza
Asthma
4748, rue du Four
Paris
City
Paris
Paris
Paris
6th Arrond.
Name
*
* Hypertens.
20-2917th Arrond.*
17th Arrond.
*
40-49
20-29
Diagnosis AgeAddress
Diabetes
Influenza
Asthma
40-496th Arrond.
Paris
City
Paris
Paris
Paris
ICDE05
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Sigmod 03, DIVO 04
Sovereign Information Integration Separate databases due to statutory,
competitive, or security reasons. Selective, minimal sharing on a
need-to-know basis.
Example: Among those patients who took a particular drug, how many with a specified DNA sequence had an adverse reaction? Researchers must not learn anything
beyond counts.
• Algorithms for computing joins and join counts while revealing minimal additional information.
Minimal Necessary Sharing
R S R must not
know that S has b and y
S must not know that R has a and x
v
u
RS
x
v
u
a
y
v
u
b
R
S
Count (R S) R and S do not learn
anything except that the result is 2.Medical
ResearchInst.
DNA Sequences
DrugReactions
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Recommendations
Policy Makers & legislators– Continuous technology monitoring and understanding to inform policies
and laws (current and new)
– Invest in research
Solution Developers (Technologists)– Design-in ethical considerations (e.g. respect for privacy, safeguard
against misuse); they can’t be afterthoughts
– Engage in dialog with policy makers and legislators to educate them on performance implications of the policies/laws
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Recommendations for Researchers
Asking questions is easy:it's answering them that's
hard.
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Policy Specification
How to determine if the policy specification accurately captures the intent of the policy maker? (The person specifying the policy is usually not a computer scientist.)
How to help the patient understand the policy and the implications of his or her choices?
How to design a policy language that reconciles the goals of understandability and efficient computation?
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Sticky Policies
Healthcare organizations should be assured that original policy controls will be enforced over data after transfer to other entities.
Transferees of patient data should be capable of applying source disclosure policies to any information in its database.
Database should enforce source and enterprise policies and resolve any conflicts among policies.
Patient data + policy annotations
Hospital 1
policies
patient data
Hospital 2
Patient Records DB
Data compliantwith source and enterprise policies
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Data Pointillism
Name Phone
Bob 394-1015
Alice 396-1012
Alice 396-1112
Phone Address City
396-1012 Maple St Chatham
394-1015 - Madison
396-1112 Maple St Madison
Patient Policy#
Alice AAA1035
Bob AAA1035
Alice UHG1035
Bob 394-1015 Maple St Madison AAA1035
Alice 396-1012 Maple St Chatham UHG1035
Alice 396-1112 Maple St Madison AAA1035
Pointillist • > 14B records with Choicepoint
• Data from > 22,000 sources in RDC’s GRID
• >550 companies compiling databases of pvt information
• Accuracy? Limits?
• How to allow someone to verify data?
•Identifying and correcting errors?
• Usage control?
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Massively Distributed Data Management
What if patient data is stored on personal devices? Pervasive monitoring devices will also collect patient data. How to protect the security of these devices? Enable selective sharing of information stored on devices? Distributed backup in the network to prevent data loss?
512MB SanDisk Cruzer $47.99
Transcend 40GB Portable Hard Disk USB 95mm x 71.5mm x 15mm, $189
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Data Life Cycle Management
Healthcare organizations must define data retention policies based on legal requirements and patient specifications:
–HIPAA: 6 years (21 years for pediatric care).
–Medicare: 5 to 7 years
–AHA & AHIMA: at least 10 years
Data compression vs. encryption
How to remove expired data and forget persistent data?
How to establish truthfulness of data?
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Interoperability
Sovereign health information systems must be able to communicate among one another, using standard data formats and clinical vocabularies.
Examples of current efforts include:–HL7 messaging standards
–SNOMED-CT vocabularies
–CDA and CCR document standards
Much work remains to be done to make systems interoperable.
Mass collaboration might be useful in defining clinical vocabularies and taxonomies.
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Concluding Remarks
Hippocratic Database technologies protect the security of electronic health records and patient privacy without impeding the flow of information.
We need not sacrifice security or privacy to gain value from EHRs for diagnosis, treatment, and research.
We must focus on:– Deriving value from bits we know how
to manage.
– Demonstrating what could not be done before.
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
Thank you!
Papers: rakesh.agrawal-family.comCollaborations:
[email protected]@us.ibm.com
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
ReferencesActive Enforcement
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. “Hippocratic Databases.” 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
K. Lefevre, R. Agrawal, V. Ercegovac, R. Ramakrishnan, Y. Xu, D. DeWitt. "Limiting Disclosure in Hippocratic Databases". Proc. of the 30th Int'l Conf. on Very Large Databases (VLDB 2004), Toronto, Canada, August 2004.
Compliance Auditing
R. Agrawal, R. Bayardo, C. Faloutsos, J. Kiernan, R. Rantzau and R. Srikant. “Auditing Compliance with a Hippocratic Database.” Proc. of the 30th Int'l Conf. on Very Large Databases (VLDB 2004), Toronto, Canada, August 2004.
Privacy-Preserving Data Mining
R. Agrawal and R. Srikant. "Privacy-Preserving Data Mining". Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.
A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke. "Privacy Preserving Mining of Association Rules". Proc. of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery in Databases and Data Mining, Edmonton, Canada, July 2002.
IBM Research
IMIA Conference – Security in Health Information Systems | April 29, 2006
ReferencesOptimal k-Anonymization
R. J. Bayardo and R. Agrawal. "Data Privacy Through Optimal k-Anonymization". To appear in Proc. of the 21st Int'l Conf. on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005.
Sovereign Information Integration
R. Agrawal, A. Evfimievski, R. Srikant. “Information Sharing Across Private Databases.” ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003.
R. Agrawal, D. Asonov and R. Srikant. "Enabling Sovereign Information Sharing Using Web Services". Proc. of the ACM SIGMOD Conference on Management of Data, Paris, France, June 2004.