Preserving privacy in data sharing - AAMC · Data sharing in multi-center studies • Benefits and...
Preserving privacy in data sharing
Darren Toh, ScD
Department of Population Medicine
Harvard Medical School & Harvard Pilgrim Health Care Institute
December 7, 2017
Disclosures
• The work presented here is/was supported by
  • Patient-Centered Outcomes Research Institute (ME-1403-11305)
  • Office of the Assistant Secretary for Planning and Evaluation
  • Food and Drug Administration (HHSF223200910006I)
  • National Institutes of Health (U01EB023683)
  • Agency for Healthcare Research and Quality (R01HS019912)
• All statements in this presentation, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of AHRQ, ASPE, FDA, NIH, PCORI, or PCORI’s Board of Governors or Methodology Committee
2
Overview
• Data sharing in multi-center studies
  • Benefits and challenges
• Ways to facilitate data sharing while protecting privacy
  • Stakeholders’ views on data sharing
  • Use of distributed data networks
  • Use of privacy-protecting analytic and data-sharing methods
• Discussion
3
Multi-database studies
• Many studies are now done in multi-database settings
5
Benefits of multi-database studies
• Larger sample sizes
  • Allow studies of rare treatments or rare outcomes
  • Allow studies in specific subpopulations
  • Allow studies to be done more quickly
• More diverse populations
  • Allow more generalizable findings
  • Allow assessment of treatment effect heterogeneity
6
Types of data shared
• Insurance claims
• Electronic health records (inpatient & outpatient)
• Registries (e.g., birth, immunization, disease, treatment)
• Genomic data
• Patient-generated data
7
Multi-database studies
Analysis center
Site 1
Site 2
Site 3
Pooling the entire databases
10
Concerns about data sharing
• Patient privacy and confidentiality
• Data security
• Unauthorized use of data
• Inaccurate analysis or interpretation of data
• Disclosure of sensitive institutional or corporate info
• Contractual restrictions
11
Data sharing in multi-database studies
Data we need to conduct the desired analysis
What data partners are willing or able to share
12
Understand factors that influence data sharing
• Semi-structured interviews with key stakeholders
• Identify factors that facilitate data sharing
• Identify concerns that discourage data sharing
14
Stakeholders interviewed
Mazor et al, J Comp Eff Res, 2017;6(6):537-547 15
Stakeholder interview domains
Mazor et al, J Comp Eff Res, 2017;6(6):537-547 16
Findings from stakeholder interviews
Mazor et al, J Comp Eff Res, 2017;6(6):537-547 17
Distributed data network (DDN) architecture
• No pooling of the entire databases from all sites
• Data partners maintain physical control of their data
• Data partners have ability to opt out of any request
• Only transfer minimal necessary information
19
Distributed data networks – Vanilla version
Analysis center
Site 1
Site 2
Site 3
Pooling study-specific individual-level datasets
21
Examples of distributed data networks
22
Typical analytic datasets shared in DDNs
Site 1
PatID  Exposure  Outcome  Time  X1  X2  X3  X4  X5  …
001    1         0        312   0   1   0   1   1   …
002    0         0        40    1   1   0   1   0   …
003    0         0        365   1   0   0   0   0   …
004    0         0        200   2   0   1   0   0   …
005    0         1        2     3   0   0   1   0   …
006    1         1        15    3   1   0   0   1   …
007    1         0        4     1   1   1   0   1   …
008    1         0        145   0   0   1   0   0   …
009    0         1        33    2   1   0   0   0   …
010    0         0        98    1   1   0   0   0   …
011    0         0        34    1   0   0   0   0   …
…      …         …        …     …   …   …   …   …   …
Each row represents an individual
Each column represents a variable
25
Standardizing databases
Adapted from: http://www.hcsrn.org/asset/b9efb268-eb86-400e-8c74-2d42ac57fa4F/VDW.Infographic031511.jpg
Individual data partners: Site 1, Site 2, Site 3, Site 4
Data standardization (common data model): Site 1, Site 2, Site 3, Site 4
Data accessible to research projects
• Research projects
• Programs written against common data model
Data quality improvement feedback loop
26
Distributed analysis
Analysis Center
Secure Network Portal
Data Partner 1 and Data Partner 2, each with:
• Local data: Enrollment, Demographics, Utilization, Pharmacy, Etc
• Review & Run Query
• Review & Return Output
1- User creates and submits query
2- Data Partners retrieve query
3- Data Partners review and run query against their local data
4- Data Partners review output
5- Data Partners return outputs via secure network
6- Outputs are aggregated and analyzed
https://www.sentinelinitiative.org/privacy-and-security 32
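The six-step workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DataPartner`, `count_by_exposure`, and `run_distributed` are hypothetical names, not part of the Sentinel software, and the review steps are reduced to an opt-in flag.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]                              # one patient-level record
Query = Callable[[List[Row]], Dict[str, int]]     # code sent to the data


@dataclass
class DataPartner:
    """Hypothetical stand-in for a site; not the Sentinel software's API."""
    name: str
    local_rows: List[Row]      # patient-level data stay behind this object
    opted_in: bool = True      # partners can decline any request

    def handle(self, query: Query) -> Optional[Dict[str, int]]:
        # Steps 2-3: retrieve, review, and run the query on local data.
        if not self.opted_in:
            return None
        output = query(self.local_rows)
        # Steps 4-5: review the output, then return only the summary.
        return output


def count_by_exposure(rows: List[Row]) -> Dict[str, int]:
    """Example query: patients and outcome events per exposure group."""
    out = {"exposed_n": 0, "exposed_events": 0,
           "unexposed_n": 0, "unexposed_events": 0}
    for r in rows:
        group = "exposed" if r["exposure"] == 1 else "unexposed"
        out[group + "_n"] += 1
        out[group + "_events"] += r["outcome"]
    return out


def run_distributed(query: Query, partners: List[DataPartner]) -> Dict[str, int]:
    # Step 1: the user submits `query`; each partner handles it locally.
    outputs = [p.handle(query) for p in partners]
    # Step 6: aggregate whatever summaries came back.
    total: Dict[str, int] = {}
    for out in outputs:
        if out is None:
            continue
        for key, value in out.items():
            total[key] = total.get(key, 0) + value
    return total


# Toy rows, invented for illustration only.
partners = [
    DataPartner("Data Partner 1", [{"exposure": 1, "outcome": 0},
                                   {"exposure": 0, "outcome": 1},
                                   {"exposure": 1, "outcome": 1}]),
    DataPartner("Data Partner 2", [{"exposure": 0, "outcome": 0},
                                   {"exposure": 1, "outcome": 1}]),
    DataPartner("Data Partner 3", [{"exposure": 1, "outcome": 1}],
                opted_in=False),
]
totals = run_distributed(count_by_exposure, partners)
```

The analysis center only ever sees the dictionaries returned by `handle`; a partner that opts out contributes nothing.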
Sharing patient-level datasets in DDNs
• Patient-level info can generally be de-identified to avoid sharing of sensitive patient info
• But even so, several concerns may still persist
• Sometimes it is not possible to share patient-level info due to these concerns or other reasons
33
Challenges in de-identifying patient information
34
Question 1
• Do we have other ways to share data?
35
Question 2
• Can we perform the analysis we want without sharing potentially identifiable patient-level data?
36
Question 3
• Better yet, can we perform the analysis we want without sharing patient-level data at all?
37
Not using privacy-protecting analytic methods
Analysis center
Site 1
Site 2
Site 3
39
Using privacy-protecting analytic methods
Analysis center
Site 1
Site 2
Site 3
Pooling study-specific summary-level datasets
41
Confounder summary scores
Confounders: Age, Sex, Race, Dx, Px, Tx, …
Summarized as: Propensity Score (PS) or Disease Risk Score (DRS)
Diagram nodes: Confounders, PS, DRS, Treatment, Outcome
42
Using confounder summary scores
PatID Exposure Outcome Time PS
001 1 0 312 0.33
002 0 0 40 0.04
003 0 0 365 0.05
004 0 0 200 0.54
005 0 1 2 0.22
006 1 1 15 0.45
007 1 0 4 0.09
008 1 0 145 0.79
009 0 1 33 0.21
010 0 0 98 0.01
011 0 0 34 0.38
… … … … …
Site 1
Toh et al, Med Care, 2013;51:S4-S10 44
Summary score-matched analysis
PatID Exposure Outcome Time PS
001 1 0 312 0.33
002 0 0 40 0.04
003 0 0 365 0.05
004 0 0 200 0.54
005 0 1 2 0.22
006 1 1 15 0.45
007 1 0 4 0.09
008 1 0 145 0.79
009 0 1 33 0.21
010 0 0 98 0.01
011 0 0 34 0.38
… … … … …
Site 1
PT in Exposed  PT in Unexposed  Event in Exposed  Event in Unexposed
355.6          233.4            40                35
• Only four numbers are needed (in 1:1 matching)
• Lead team uses data from all sites to obtain overall results
Toh et al, Med Care, 2013;51:S4-S10 45
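A minimal sketch of how a lead team might combine the four numbers from each site. The slide does not say which estimator is used, so this assumes the simplest option, a crude incidence rate ratio on summed events and person-time. `SiteSummary` and `pooled_rate_ratio` are hypothetical names, and the Site 2 values are invented for illustration.

```python
from typing import List, NamedTuple


class SiteSummary(NamedTuple):
    pt_exposed: float        # person-time in the exposed
    pt_unexposed: float      # person-time in the unexposed
    events_exposed: int
    events_unexposed: int


def pooled_rate_ratio(sites: List[SiteSummary]) -> float:
    """Crude rate ratio after summing the four numbers across sites."""
    pt1 = sum(s.pt_exposed for s in sites)
    pt0 = sum(s.pt_unexposed for s in sites)
    e1 = sum(s.events_exposed for s in sites)
    e0 = sum(s.events_unexposed for s in sites)
    return (e1 / pt1) / (e0 / pt0)


site1 = SiteSummary(355.6, 233.4, 40, 35)   # values shown on the slide
site2 = SiteSummary(120.0, 110.0, 12, 15)   # hypothetical second site

rr_site1 = pooled_rate_ratio([site1])
rr_all = pooled_rate_ratio([site1, site2])
```

A site-stratified estimator would be a natural alternative when baseline rates differ across sites.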
Summary score-stratified analysis
PatID Exposure Outcome Time PS
001 1 0 312 0.33
002 0 0 40 0.04
003 0 0 365 0.05
004 0 0 200 0.54
005 0 1 2 0.22
006 1 1 15 0.45
007 1 0 4 0.09
008 1 0 145 0.79
009 0 1 33 0.21
010 0 0 98 0.01
011 0 0 34 0.38
… … … … …
Site 1
• Each record is a summary score-based stratum
• Lead team uses methods, e.g., the Mantel-Haenszel method, to obtain overall results
PS stratum  PT in Exposed  PT in Unexposed  Event in Exposed  Event in Unexposed
1           34.5           70.1             10                8
2           32.4           32.6             7                 21
3           56.2           44.2             9                 10
4           12.8           56.2             12                6
Toh et al, Med Care, 2013;51:S4-S10 46
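As a sketch of the lead team's step, the standard Mantel-Haenszel rate ratio for person-time data can be computed directly from stratum records like those above. The function and type names are hypothetical. In a multi-site study the lead team would presumably stack the strata from every site, and only Site 1's strata from the slide are used here.

```python
from typing import List, NamedTuple


class Stratum(NamedTuple):
    pt_exposed: float
    pt_unexposed: float
    events_exposed: int
    events_unexposed: int


def mh_rate_ratio(strata: List[Stratum]) -> float:
    """Mantel-Haenszel rate ratio for person-time data.

    RR_MH = sum(a_i * T0_i / T_i) / sum(b_i * T1_i / T_i), where a_i and b_i
    are exposed and unexposed events, T1_i and T0_i are exposed and unexposed
    person-time, and T_i = T1_i + T0_i.
    """
    num = 0.0
    den = 0.0
    for s in strata:
        total_pt = s.pt_exposed + s.pt_unexposed
        num += s.events_exposed * s.pt_unexposed / total_pt
        den += s.events_unexposed * s.pt_exposed / total_pt
    return num / den


# The four PS strata shown for Site 1.
site1_strata = [
    Stratum(34.5, 70.1, 10, 8),
    Stratum(32.4, 32.6, 7, 21),
    Stratum(56.2, 44.2, 9, 10),
    Stratum(12.8, 56.2, 12, 6),
]
rr_mh = mh_rate_ratio(site1_strata)
```

With a single stratum the estimator reduces to the crude rate ratio, which is a quick sanity check.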
Meta-analysis
Site 1
• Each record is an effect estimate and its 95% CI
• There is only 1 record per site
• Lead team uses meta-analytic approach to obtain overall results
HR    Lower 95% CI  Upper 95% CI
2.97  1.95          4.52
Toh et al, Med Care, 2013;51:S4-S10
PatID Exposure Outcome Time X1 X2 X3 X4 X5 …
001 1 0 312 0 1 0 1 1 …
002 0 0 40 1 1 0 1 0 …
003 0 0 365 1 0 0 0 0 …
004 0 0 200 2 0 1 0 0 …
005 0 1 2 3 0 0 1 0 …
006 1 1 15 3 1 0 0 1 …
007 1 0 4 1 1 1 0 1 …
008 1 0 145 0 0 1 0 0 …
009 0 1 33 2 1 0 0 0 …
010 0 0 98 1 1 0 0 0 …
011 0 0 34 1 0 0 0 0 …
… … … … … … … … … …
47
Distributed regression
PatID Exposure Outcome Time X1 X2 X3 X4 X5 …
001 1 0 312 0 1 0 1 1 …
002 0 0 40 1 1 0 1 0 …
003 0 0 365 1 0 0 0 0 …
004 0 0 200 2 0 1 0 0 …
005 0 1 2 3 0 0 1 0 …
006 1 1 15 3 1 0 0 1 …
007 1 0 4 1 1 1 0 1 …
008 1 0 145 0 0 1 0 0 …
009 0 1 33 2 1 0 0 0 …
010 0 0 98 1 1 0 0 0 …
011 0 0 34 1 0 0 0 0 …
… … … … … … … … … …
Site 1
Type  Name  INT     X1      X2
SSCP  INT   152.45  56.74   121.65
SSCP  X1    56.74   342.45  88.55
SSCP  X2    121.65  88.55   422.32
Mean        1.00    3.45    65.78
STD         0.00    4.65    22.34
N           500     500     500
Karr et al, J Comput Graph Stat, 2005;14:263-279
• Each record is a summary statistic
• Lead team uses the summary statistics to perform regression analysis
48
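To illustrate why summary statistics are enough for linear regression, the sketch below has each site compute its sums of squares and cross-products (X'X and X'y), and the lead team add them up and solve the normal equations. This is a minimal illustration of the idea, not the implementation described by Karr et al; all function names and the toy data are assumptions. Logistic and Cox models typically need iterative exchanges of summaries, which is not shown.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Summary = Tuple[Matrix, List[float], int]


def site_summary(xs: Sequence[Sequence[float]], ys: Sequence[float]) -> Summary:
    """Run at a site: return (X'X, X'y, n), with an intercept column added."""
    p = len(xs[0]) + 1
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in zip(xs, ys):
        row = [1.0] + list(x)
        for i in range(p):
            xty[i] += row[i] * y
            for j in range(p):
                xtx[i][j] += row[i] * row[j]
    return xtx, xty, len(ys)


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def combine_and_fit(summaries: List[Summary]) -> List[float]:
    """Run by the lead team: sum site summaries, then fit least squares."""
    p = len(summaries[0][1])
    xtx = [[sum(s[0][i][j] for s in summaries) for j in range(p)]
           for i in range(p)]
    xty = [sum(s[1][i] for s in summaries) for i in range(p)]
    return solve(xtx, xty)


# Toy data generated from y = 1 + 2*x1 - 3*x2, split across two sites.
site_a = site_summary([(0, 0), (1, 0), (0, 1)], [1, 3, -2])
site_b = site_summary([(1, 1), (2, 1), (3, 5)], [0, 2, -8])
coefs = combine_and_fit([site_a, site_b])   # intercept, x1, x2
```

Because X'X and X'y are sums over rows, adding the site-level matrices gives the same normal equations as stacking all patient-level rows, so the coefficients match a pooled fit.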
Example: A comparative effectiveness study
• A Scalable Partnering Network for CER (SPAN) project
• Risk of long-term re-hospitalization with lap band vs. bypass procedure
• Included 7 of 11 data partners
Toh et al, Med Care, 2014;52:664-668 50
Study setting
http://www.hopkinsmedicine.org/healthlibrary/test_procedures/gastroenterology/laparoscopic_adjustable_gastric_banding_135,63/
http://www.hopkinsmedicine.org/healthlibrary/test_procedures/gastroenterology/roux-en-y_gastric_bypass_weight-loss_surgery_135,65/
51
Study design
Study period: 1/1/2005 to 12/31/2010
Baseline: 365 days before the index bariatric hospitalization
• ≥21 years at time of bariatric surgery
• ≥1 BMI of 35 kg/m2 or greater
• Continuous enrollment w/ benefits
• No prior bariatric surgery
• No prior diagnosis of study outcome
Start of follow up (discharge date); contributing person-time until:
• Re-hospitalization
• Death
• Health plan disenrollment
• 12/31/2010
• 730 days of follow-up
Toh et al, Med Care, 2014;52:664-668 52
Confounders
• Age
• Sex
• Race/ethnicity
• Diabetes*
• Baseline BMI*
• Year of procedure
• Charlson comorbidity score*
• Atrial fibrillation*
• GERD*
• Hypertension*
• Sleep Apnea*
• Asthma*
• Deep vein thrombosis*
• Pulmonary embolism*
• Congestive heart failure*
• Hyperlipidemia*
• Coronary artery disease*
• Oxygen use*
• Assistive walking device*
• Smoking status*
• Blood pressure*
• Length of stay assoc. with procedure
*Identified during the 365-day baseline period prior to the index bariatric hospitalization
Toh et al, Med Care, 2014;52:664-668 53
Statistical analysis
• Propensity score stratification
• Analysis
  • Pooled patient-level data analysis (benchmark)
  • Risk set-based analysis
  • PS-stratified analysis (by quintile)
  • Meta-analysis of site-specific effect estimates
Toh et al, Med Care, 2014;52:664-668 54
Selected baseline patient characteristics
Characteristics              Adjustable gastric band (n=1,550)  Roux-en-Y gastric bypass (n=5,792)
                             N        %*                        N        %*
Mean age (SD)                46.7     11.2                      45.7     10.7
Age > 65 years               76       4.9                       141      2.4
Female sex                   1,266    81.7                      4,823    83.3
Race/ethnicity
  Black or African American  137      8.8                       522      9.0
  White                      1,130    72.9                      3,840    66.3
  Hispanic                   142      9.2                       769      13.3
  Other                      62       4.0                       280      4.8
  Unknown                    79       5.1                       381      6.6
Baseline BMI
  30-34.9                    96       6.2                       174      3.0
  35-39.9                    480      31.0                      1,410    24.3
  40-49.9                    813      52.4                      3,126    54.0
  ≥50                        161      10.4                      1,082    18.7
Toh et al, Med Care, 2014;52:664-668 55
Patient-level data analysis, by site
Site    Adjusted HR  95% CI
Site 1  0.68         0.45, 1.02
Site 2  0.65         0.37, 1.15
Site 3  0.52         0.26, 1.04
Site 4  0.72         0.35, 1.50
Site 5  0.82         0.46, 1.48
Site 6  0.32         0.13, 0.75
Site 7  0.79         0.62, 1.01
Toh et al, Med Care, 2014;52:664-668 56
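The site-specific estimates above are the kind of input a meta-analysis needs: one hazard ratio and confidence interval per site. The sketch below pools them with fixed-effect inverse-variance weighting on the log scale, recovering each standard error from the width of the 95% CI. The slides do not say which meta-analytic model the study used, so the fixed-effect choice and the function name are assumptions.

```python
import math
from typing import List, Tuple

Z = 1.96  # normal quantile for a 95% confidence interval

# (adjusted HR, lower 95% CI, upper 95% CI) for Sites 1-7, from the table.
site_estimates: List[Tuple[float, float, float]] = [
    (0.68, 0.45, 1.02),
    (0.65, 0.37, 1.15),
    (0.52, 0.26, 1.04),
    (0.72, 0.35, 1.50),
    (0.82, 0.46, 1.48),
    (0.32, 0.13, 0.75),
    (0.79, 0.62, 1.01),
]


def fixed_effect_pool(
    estimates: List[Tuple[float, float, float]]
) -> Tuple[float, float, float]:
    """Inverse-variance pooled HR with its 95% CI (fixed-effect model)."""
    sum_w = 0.0
    sum_w_log_hr = 0.0
    for hr, lower, upper in estimates:
        # Standard error of log(HR), recovered from the CI width.
        se = (math.log(upper) - math.log(lower)) / (2 * Z)
        weight = 1.0 / se ** 2
        sum_w += weight
        sum_w_log_hr += weight * math.log(hr)
    pooled_log_hr = sum_w_log_hr / sum_w
    pooled_se = math.sqrt(1.0 / sum_w)
    return (
        math.exp(pooled_log_hr),
        math.exp(pooled_log_hr - Z * pooled_se),
        math.exp(pooled_log_hr + Z * pooled_se),
    )


pooled_hr, pooled_lower, pooled_upper = fixed_effect_pool(site_estimates)
```

Under these assumptions the pooled estimate comes out at roughly 0.71 (0.60, 0.84). Because the inputs are rounded to two decimals, small differences from a published analysis would be expected.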
Overall results by method
Toh et al, Med Care, 2014;52:664-668 57
Results by method
Method Adjusted HR 95% CI
Benchmark 0.71 0.59, 0.84
Risk set analysis 0.71 0.59, 0.84
PS stratification 0.70 0.59, 0.83
Meta-analysis 0.71 0.60, 0.84
Toh et al, Med Care, 2014;52:664-668 58
Pooled patient-level linear regression (from PROC REG)
Distributed linear regression
59
Pooled patient-level logistic regression (from PROC LOGISTIC)
Distributed logistic regression
60
Pooled patient-level Cox PH regression (from PROC PHREG)
Distributed Cox PH regression
61
Data-sharing methods in multi-database studies
Data shared across sites
• Patient-level data
  • Individual covariates
  • Confounder summary scores
  • A hybrid of above
• Summary-level data
  • Stratum-specific counts
  • Risk-set data
  • Intermediate statistics
  • Database-specific effect estimates
62
Analytic flexibility vs. privacy protection
Figure: analytic flexibility (vertical axis) plotted against privacy protection (horizontal axis) for each data-sharing method:
• Patient-level info with individual covariates
• Database-specific effect estimates
• Patient-level info with summary scores
• Stratum-specific counts
• Risk-set data
• Summary statistics
* Approximation
64
Discussion
• Stakeholders are willing to share data if:
  • Benefits of research outweigh the risks
  • Risks are minimized
  • Cost is reasonable
• Although we did not spend too much time on it here, proper governance on data sharing is critical
66
Discussion
• Use of distributed data network structure and privacy-protecting analytic methods allows analysis of multiple databases while protecting patient privacy
67
A national DDN infrastructure for evidence generation
Participating organizations shown: Health Plans 1-9, Hospitals 1-6, Outpatient clinics 1-6
• Each organization can participate in multiple networks
• Each network controls its governance and coordination
• Networks share infrastructure, analytics, lessons, security, software
68
Summary
• Sending analysis to the data
• Sharing information, not data
• Getting more by asking for less
69
Acknowledgments
• HPHCI
  • Jeffrey Brown
  • Mia Gallagher
  • Qoua Her
  • Xiaojuan Li
  • Sarah Malek
  • Jessica Malenfant
  • Richard Platt
  • Yury Vilk
  • Jessica Young
  • Zilu Zhang
• Others
  • Susan Gruber
  • Bruce Fireman
  • Lingling Li
  • Kazuki Yoshida
• Penn State University
  • Aleksandra Slavković
  • Yuji Samizo
• PCORnet
  • David Arterburn
  • Jason Block
  • Jane Anau
  • Yates Coley
  • Casie Horgan
  • Kathleen McTigue
  • Erick Moyneur
  • Roy Pardee
  • Juliane Reynolds
  • Sheryl Rifas-Shiman
  • Jessica Sturtevant
  • Robert Wellman
  • Many others
70