Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases
Marcos Barreto (1), Mauricio Barreto (2), Spiros Denaxas (3)
1. Computer Science Dept., Federal University of Bahia (UFBA), Salvador, Bahia, Brazil
2. Institute Gonçalo Moniz, Oswaldo Cruz Foundation (FIOCRUZ), Salvador, Bahia, Brazil
3. Farr Institute of Health Informatics Research, UCL, London
Outline
– Projects’ scopes
– Platform under development: linkage methods / accuracy results
– Proposed approach: initial issues / preliminary results
– Current work
The 100 million cohort project
Aim: develop a platform to support a population-based cohort built from CadastroÚnico (socioeconomic database) and assess the impact of several social protection programmes on health, education, work etc.
Social Programmes using CadastroÚnico
Databases Coverage
CadastroÚnico 2007 - 2015
Bolsa Família (PBF) 2007 - 2015
SIH (hospitalization) 1998 - 2012
SINAN (notifiable diseases) 2000 - 2012
SIM (mortality) 2000 - 2012
SINASC (live births) 2001 - 2012
[Figure: # of lines (millions); 114 million]
Long-term monitoring platform for Zika
Aims:
– Systematic and longitudinal monitoring of children born and registered in SINASC (live births) between July/2016 and July/2017
– Assess the impact of microcephaly and outcomes (mortality, hospitalization etc.) related to Zika virus
– Assess outcomes in cognitive ability through school performance studies
– Possible linkage with other databases (retroactively to 2001): ≃ 2,800,000 births / year
– Possible introduction of other outcomes (Dengue, Chikungunya)
Bahia
Notifications of Chikungunya: Jan/Dec 2015: 24,308; Jan/Jul 2016: 47,092
Aedes aegypti infestation index: 1.4% (WHO suggested threshold: 1%)
Notifications of Dengue + Zika + Chikungunya (1 January – 6 August): 161,883
Proposed platform
[Diagram components]
– Users (scientists, government etc.)
– Web portal
– Linkage pipeline
– Original data sets and dedicated resources
– Developers (Computing, Statistics, Epidemiology)
– Anonymized data marts
– Metadata / indexing
– Cohort management
– Yemoja supercomputer (#2 in LatAm)
– Safe room + medium-scale clusters
– Dedicated fiber optics connection (2 km)
Record linkage pipeline
1. Data quality assessment: CadU baseline + SUS files; metrics for qualitative analysis; candidate attributes for linkage
2. Data conditioning: ETL-based routines (cleansing, standardization); anonymization (Bloom filter); blocking routines; comparison blocks
3. Record linkage: linkage parameters; linkage routines (deterministic and probabilistic); data marts
4. Accuracy assessment: assessment metrics (sensitivity, specificity, positive predictive value (VPP) etc.); controlled scenarios; accuracy results
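The blocking step of the pipeline can be illustrated with a small sketch. The slides do not say which blocking key is used, so the key below (municipality plus year of birth) and all field names are assumptions chosen only for illustration. The point is that records are compared only inside blocks sharing the same key, instead of comparing every record against every other.

```python
from collections import defaultdict

def blocking_key(record):
    # Hypothetical blocking key: municipality plus year of birth.
    return (record["municipality"], record["birth_year"])

def candidate_pairs(left, right, key=blocking_key):
    """Yield (left_record, right_record) pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[key(rec)].append(rec)
    for rec in left:
        for other in blocks.get(key(rec), []):
            yield rec, other

# Toy records (invented values, not taken from any real database).
left = [
    {"id": "L1", "municipality": "A", "birth_year": 2010},
    {"id": "L2", "municipality": "B", "birth_year": 2011},
]
right = [
    {"id": "R1", "municipality": "A", "birth_year": 2010},
    {"id": "R2", "municipality": "A", "birth_year": 2010},
    {"id": "R3", "municipality": "B", "birth_year": 2009},
]

pairs = list(candidate_pairs(left, right))
print(len(pairs), "candidate pairs instead of", len(left) * len(right))
```

With these toy records only the pairs L1-R1 and L1-R2 are compared, rather than all six combinations.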
A Spark-based workflow for probabilistic record linkage of healthcare data. PITA, R.; PINTO, C.; MELO, P.; SILVA, M.; BARRETO, M.; RASELLA, D. (BeyondMR - EDBT/ICDT 2015)
AtyImo
[Diagram components]
– CadastroÚnico baseline
– Payments from Bolsa Família (PBF)
– SUS (National Unified Health System): SIH (hospitalization), SINASC (live births), SIM (mortality), SINAN (notifiable diseases)
– Linkage: deterministic, probabilistic
Record linkage methods
Full probabilistic: Sørensen (Dice) index applied to Bloom filters.
$$D_{a,b} = \frac{2h}{|a| + |b|}, \qquad D_{a,b} \in [0, 1]$$
h = number of 1's at the same position in both Bloom filters
|a| = number of 1's in Bloom filter A
|b| = number of 1's in Bloom filter B
[Figure: Bloom filters A and B]
Hybrid approach: individual comparison of attributes based on different rules
Correlação probabilística de bases de dados governamentais. PINTO, C.; PITA, R.; MELO, P.; SENA, S.; BARRETO, M. (Brazilian Symposium on Databases – SBBD 2015)
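The Dice index on Bloom filters used by the full probabilistic method can be sketched as follows. The `dice` function follows the formula on the slide (twice the number of shared 1's divided by the total number of 1's in both filters). The encoding part is an assumption: the filter size, the number of hash functions, the use of bigrams and the SHA-256 hashing are illustrative choices, not AtyImo's actual parameters.

```python
import hashlib

def bigrams(text):
    """Split a padded, lower-cased string into overlapping 2-character tokens."""
    padded = f"_{text.strip().lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def bloom_filter(text, size=64, num_hashes=2):
    """Encode a string as a list of bits (size and num_hashes are assumed values)."""
    bits = [0] * size
    for token in bigrams(text):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).hexdigest()
            bits[int(digest, 16) % size] = 1
    return bits

def dice(filter_a, filter_b):
    """Dice index: 2h / (|a| + |b|), with h = positions set to 1 in both filters."""
    h = sum(x & y for x, y in zip(filter_a, filter_b))
    total = sum(filter_a) + sum(filter_b)
    return 2 * h / total if total else 0.0

# Identical strings produce identical filters, so the index is 1.0;
# similar strings share most bigrams and score close to 1.
same = dice(bloom_filter("maria silva"), bloom_filter("maria silva"))
close = dice(bloom_filter("maria silva"), bloom_filter("maria sylva"))
print(same, round(close, 2))
```

Because only the bit arrays are compared, the original attribute values never need to leave the data conditioning stage.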
Record linkage methods – accuracy results
Controlled scenario: 2 databases
4 simulated scenarios with different percentages of changes in records
Main metrics: sensitivity ('sensibilidade') and positive predictive value (VPP)

Databases                                                Number of records   True matches
Rotavirus (diarrhea)                                     686                 486 (positive exams)
Other causes (children treated at outpatient clinics)    9,678

[Figure: results for full prob. without blocking, full prob. with blocking, hybrid prob. without blocking and hybrid prob. with blocking; panels: blocking, without blocking]
Record linkage methods – accuracy results
Uncontrolled scenario: BCG vaccination X SIM (mortality), Manaus state
[Map label: MA]

Databases                                                   Linked pairs   True positives
BCG vaccination (156,331 records) X SIM (16,260 records)    2,247          2,169 (96.53%)
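As a worked check of the figures above, the sketch below computes the positive predictive value from the 2,247 linked pairs and 2,169 true positives. It assumes the reported percentage is true positives divided by linked pairs, which the arithmetic supports. Sensitivity is defined as well, but the number of missed matches (false negatives) is not given for this scenario, so it is not evaluated on these data.

```python
def sensitivity(true_positives, false_negatives):
    """Share of real matches that the linkage found."""
    return true_positives / (true_positives + false_negatives)

def positive_predictive_value(true_positives, false_positives):
    """Share of linked pairs that are real matches (VPP on the slides)."""
    return true_positives / (true_positives + false_positives)

linked_pairs = 2247      # BCG vaccination X SIM
true_positives = 2169
false_positives = linked_pairs - true_positives

ppv = positive_predictive_value(true_positives, false_positives)
print(f"PPV = {ppv:.2%}")  # prints "PPV = 96.53%"
```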
Record linkage methods – accuracy results
Uncontrolled scenario:
– CadastroÚnico (2011 extraction)
– Hospitalizations (SIH) by tuberculosis: Sergipe (SE), Santa Catarina (SC) and Rondônia (RO)
– Notifications (SINAN) from Santa Catarina (SC)
[Figure: map highlighting SC, SE and RO (label: 624); result panels: CadastroÚnico X SIH (SE), CadastroÚnico X SIH (RO), CadastroÚnico X SIH (SC), CadastroÚnico X SINAN (SC)]
Approach being discussed
Heterogeneity treated inside the pipeline
How to learn from our data sources and linkage/accuracy results to understand the probabilistic behaviour of our scenarios / projects?
Use of ‘possible worlds’ (pw) abstraction to model these uncertain relationships and create reference (‘gold’) standards to assess and certify accuracy
Data conditioning: ETL-based routines (cleansing, standardization); anonymization (Bloom filter); blocking routines; comparison blocks
S = (pname, email-addr, home-addr, office-addr)
T = (name, mailing-addr)

Possible mapping                                  Probability
{(pname, name), (home-addr, mailing-addr)}        0.5
{(pname, name), (office-addr, mailing-addr)}      0.4
{(pname, name), (email-addr, mailing-addr)}       0.1

Source instance (S):
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

Possible world pw1, Pr(pw1) = 0.5:
name    mailing-addr
Alice   Mountain View
Bob     Sunnyvale

Possible world pw2, Pr(pw2) = 0.4:
name    mailing-addr
Alice   Sunnyvale
Bob     Sunnyvale

Possible world pw3, Pr(pw3) = 0.1:
name    mailing-addr
Alice   alice@
Bob     bob@
DOAN, Anhai et al. Principles of Data Integration, Morgan Kaufmann, 2012.
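The possible-worlds example above can be reproduced with a short sketch. Each possible mapping is applied to the whole source table, giving one target instance (possible world) with the mapping's probability. The probability of a target tuple is then the sum over the worlds that contain it. The function names are illustrative, and applying one mapping to the entire table is the reading assumed here.

```python
source_rows = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# Each mapping sends source attributes to target attributes, with a probability.
possible_mappings = [
    ({"pname": "name", "home-addr": "mailing-addr"}, 0.5),
    ({"pname": "name", "office-addr": "mailing-addr"}, 0.4),
    ({"pname": "name", "email-addr": "mailing-addr"}, 0.1),
]

def apply_mapping(rows, mapping):
    """Build the target instance obtained by applying one mapping to every row."""
    return [{target: row[source] for source, target in mapping.items()}
            for row in rows]

possible_worlds = [(apply_mapping(source_rows, mapping), prob)
                   for mapping, prob in possible_mappings]

def probability(name, address):
    """Sum the probabilities of the worlds that contain (name, address)."""
    wanted = {"name": name, "mailing-addr": address}
    return sum(prob for world, prob in possible_worlds if wanted in world)

print(probability("Alice", "Mountain View"))
print(probability("Bob", "Sunnyvale"))
```

Bob's home and office addresses coincide, so the tuple (Bob, Sunnyvale) appears in both pw1 and pw2 and accumulates the probability of both worlds, while (Alice, Mountain View) appears only in pw1.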
Learning from our data sources
IDB 2012 (indicators and basic data)
TABNET, provided by DATASUS (http://datasus.saude.gov.br/)
Learning from our linkage results
Dice coefficients with good sensitivity and VPP vary significantly depending on the databases involved
Usage of supervised and unsupervised machine learning techniques to analyze the accuracy results and (try to) provide a way to eliminate manual review
– Supervised: ID3 and Naïve-Bayes
– Unsupervised: partitional (k-means, CLARA) and hierarchical (AGNES, DIANA)
– Spark MLlib, R cluster / clusteval
– Data mart:
  • CadastroÚnico (2011) x SINAN 2011 (tuberculosis): 4,910 records
– Cross-validation based on a sliding window from block #0 to block #9 as training data
Block #0 – 491 records
Block #1 – 491 records
...
Block #9 – 491 records
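The block layout above (4,910 records in ten blocks of 491) can be sketched as follows. The slides do not detail how the sliding window moves, so this sketch assumes one plausible reading: each block is held out once as test data while the other nine serve as training data. The integer records are stand-ins for the data mart rows.

```python
def split_into_blocks(records, num_blocks=10):
    """Cut the records into equally sized consecutive blocks."""
    block_size = len(records) // num_blocks
    return [records[i * block_size:(i + 1) * block_size]
            for i in range(num_blocks)]

def rotating_folds(blocks):
    """Yield (training, test) pairs, holding out one block at a time."""
    for held_out in range(len(blocks)):
        training = [rec
                    for i, block in enumerate(blocks) if i != held_out
                    for rec in block]
        yield training, blocks[held_out]

records = list(range(4910))          # stand-in for the 4,910 linked pairs
blocks = split_into_blocks(records)  # block #0 ... block #9

for number, (training, test) in enumerate(rotating_folds(blocks)):
    print(f"fold {number}: {len(training)} training / {len(test)} test records")
```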
Learning from our linkage results
Metrics:
– Dice coefficient (from linkage), edit distance of name (complete), given name and surname, equality on gender, municipality and on birth date (day, month and year)
– Name (complete): low (0-2), medium (3-4), high (>=5)
– Given name: low (0-2), high (>=3)
– Surname: low (0-2), high (>=3)
– Day, month, year, gender, municipality: equal (true), different (false)
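The discretization of these metrics can be sketched as below. The thresholds come from the slide. The choice of Levenshtein distance as the edit distance, and all field names, are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (assumed edit distance)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def full_name_level(distance):
    """Name (complete): low (0-2), medium (3-4), high (>=5)."""
    if distance <= 2:
        return "low"
    return "medium" if distance <= 4 else "high"

def name_part_level(distance):
    """Given name and surname: low (0-2), high (>=3)."""
    return "low" if distance <= 2 else "high"

def pair_features(rec_a, rec_b, dice):
    """Build the feature vector for one linked pair (hypothetical field names)."""
    features = {
        "dice": dice,
        "name": full_name_level(edit_distance(rec_a["name"], rec_b["name"])),
        "given_name": name_part_level(
            edit_distance(rec_a["given_name"], rec_b["given_name"])),
        "surname": name_part_level(
            edit_distance(rec_a["surname"], rec_b["surname"])),
    }
    for field in ("birth_day", "birth_month", "birth_year",
                  "gender", "municipality"):
        features[field] = rec_a[field] == rec_b[field]
    return features
```

A vector built this way holds one numeric value (the Dice coefficient) plus categorical and boolean attributes, which suits decision-tree learners such as ID3.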
Supervised methods: ID3
[Figure: cross-validation of ID3 (average of 10 executions); series: partitioning (training/test) and expected (manual review)]
Unsupervised methods
Current work
Detailed study on models and metrics to deal with uncertainty in probabilistic data linkage scenarios
Generation of new data marts from AtyImo v2 (full + hybrid approach) + accuracy assessment
Generation of training/test data from these data marts
New tests with DataFrame-based API in the spark.ml package.
Thanks! (Obrigado!)
Marcos Barreto
Cohort setup / management
Longitudinal merge of CadastroÚnico based on NIS (social ID) attribute
[Figure: # of lines (millions); 114 million]
[Table columns: Table, File size, # of records, Version]
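A longitudinal merge keyed on NIS can be pictured with the minimal sketch below. The slides only state that the merge is based on the NIS attribute, so the record layout, the field names and the sample values are all hypothetical. Each yearly extraction contributes one entry per individual, and the result maps every NIS to its history across years.

```python
from collections import defaultdict

def longitudinal_merge(extracts):
    """Merge yearly extractions into one history per NIS.

    extracts: {year: [record, ...]}, each record carrying a 'nis' field.
    Returns {nis: {year: record}}.
    """
    cohort = defaultdict(dict)
    for year in sorted(extracts):
        for record in extracts[year]:
            cohort[record["nis"]][year] = record
    return dict(cohort)

# Invented toy extractions (not real CadastroÚnico data).
extracts = {
    2007: [{"nis": "0001", "municipality": "A"}],
    2008: [{"nis": "0001", "municipality": "B"},
           {"nis": "0002", "municipality": "A"}],
}

cohort = longitudinal_merge(extracts)
for nis, history in cohort.items():
    print(nis, sorted(history))
```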
Metadata and indexing
[Diagram: individuals (1 ... 114,000,000 ... N; cohort + other databases) indexed by year (2007, 2008, 2009, 2010, 2011 ... 2015); exposure: payments received; outcomes: SINAN, SIM, SINASC, SIH]
Metadata and indexing
[Diagram]
– Baseline: CadastroÚnico (2007, 2008 ... 2015) and Bolsa Família (PBF) (2007, 2008 ... 2015)
– Cohort profile
– Health data (SUS): SINASC (2007 ... 2012), SIH (2007 ... 2012), SINAN (2007 ... 2012), SIM (2007 ... 2012)