Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases

Marcos Barreto (1), Mauricio Barreto (2), Spiros Denaxas (3)

1. Computer Science Dept., Federal University of Bahia (UFBA), Salvador, Bahia, Brazil
2. Institute Gonçalo Moniz, Oswaldo Cruz Foundation (FIOCRUZ), Salvador, Bahia, Brazil
3. Farr Institute of Health Informatics Research, UCL, London

Outline

– Projects' scopes
– Platform under development: linkage methods / accuracy results
– Proposed approach: initial issues / preliminary results
– Current work

The 100 million cohort project

Aim: develop a platform to support a population-based cohort built from CadastroÚnico (socioeconomic database) and assess the impact of several social protection programmes on health, education, work etc.

Social Programmes using CadastroÚnico

Databases                      Coverage
CadastroÚnico                  2007 - 2015
Bolsa Família (PBF)            2007 - 2015
SIH (hospitalization)          1998 - 2012
SINAN (notifiable diseases)    2000 - 2012
SIM (mortality)                2000 - 2012
SINASC (live births)           2001 - 2012

[Chart: # of lines (millions); annotated value: 114 million]

Long-term monitoring platform for Zika

Aims:

– Systematic and longitudinal monitoring of children born and registered in SINASC (live births) between July/2016 and July/2017.
– Assess the impact of microcephaly and outcomes (mortality, hospitalization etc.) related to Zika virus.
– Assess outcomes in cognitive ability through school performance studies.

Possible linkage with other databases (retroactively to 2001): ≃ 2,800,000 births / year

Possible introduction of other outcomes (Dengue, Chikungunya)

Bahia

Notifications of Chikungunya:
– Jan/Dec 2015: 24,308
– Jan/Jul 2016: 47,092

Aedes aegypti infestation index: 1.4% (WHO suggested threshold: 1%)

Notifications of Dengue + Zika + Chikungunya (1 January – 6 August): 161,883

Proposed platform

– Users (scientists, government etc.)
– Web portal
– Linkage pipeline
– Original data sets and dedicated resources
– Developers (Computing, Statistics, Epidemiology)
– Anonymized data marts
– Metadata / indexing
– Cohort management
– Yemoja supercomputer (#2 in LatAm)
– Safe room + medium-scale clusters
– Dedicated fiber optics connection (2 km)

Record linkage pipeline

1. Data quality assessment: CadU baseline + SUS files; metrics for qualitative analysis; candidate attributes for linkage
2. Data conditioning: ETL-based routines (cleansing, standardization); anonymization (Bloom filter); blocking routines; comparison blocks
3. Record linkage: linkage parameters; linkage routines (deterministic and probabilistic); data marts
4. Accuracy assessment: assessment metrics (sensitivity, specificity, PPV etc.); controlled scenarios; accuracy results
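The blocking routines and comparison blocks mentioned in the pipeline can be illustrated with a minimal Python sketch. It assumes records are plain dictionaries and that the blocking key is a municipality field. The field names, the choice of key and the helper names `block_by` and `candidate_pairs` are hypothetical, since the slides do not state which attributes are used for blocking.

```python
from collections import defaultdict
from itertools import product


def block_by(records, key):
    """Group records by the value of the blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def candidate_pairs(left, right, key):
    """Yield only the pairs whose records fall in the same block."""
    left_blocks = block_by(left, key)
    right_blocks = block_by(right, key)
    # Only block values present on both sides can produce comparisons.
    for value in left_blocks.keys() & right_blocks.keys():
        yield from product(left_blocks[value], right_blocks[value])


# Hypothetical toy records, for illustration only.
left = [
    {"id": 1, "municipality": "A"},
    {"id": 2, "municipality": "B"},
    {"id": 3, "municipality": "A"},
]
right = [
    {"id": 10, "municipality": "A"},
    {"id": 11, "municipality": "C"},
]

pairs = list(candidate_pairs(left, right, "municipality"))
print(len(pairs), "comparisons with blocking,", len(left) * len(right), "without")
```

Blocking trades a smaller comparison space for the risk of missing true matches whose blocking key differs between the two databases, which is one reason to report accuracy both with and without blocking.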

A Spark-based workflow for probabilistic record linkage of healthcare data. PITA, R.; PINTO, C.; MELO, P.; SILVA, M.; BARRETO, M.; RASELLA, D. (BeyondMR - EDBT/ICDT 2015)

ATYIMO

– CadastroÚnico baseline
– Payments from Bolsa Família (PBF)
– SUS (National Unified Health System):
  • SIH (hospitalization)
  • SINASC (live births)
  • SIM (mortality)
  • SINAN (notifiable diseases)
– Linkage: deterministic, probabilistic

Record linkage methods

Full probabilistic: Sørensen (Dice) index applied to Bloom filters.

$$D_{a,b} = \frac{2h}{|a| + |b|} \in [0, 1]$$

where
– h = number of 1's at the same position in both Bloom filters
– |a| = number of 1's in Bloom filter A
– |b| = number of 1's in Bloom filter B
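A minimal Python sketch of the Dice index over Bloom filters follows. The `dice` function is a direct reading of the definitions of h, a and b given above. The Bloom filter construction is an assumption made for illustration only: bigram tokens, a 128-bit filter, two SHA-256 based hash functions and the helper names are all hypothetical and are not taken from the slides.

```python
import hashlib


def bigrams(text):
    """Split a padded, upper-cased string into its set of 2-character tokens."""
    padded = f"_{text.strip().upper()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_filter(text, size=128, num_hashes=2):
    """Encode a string as a list of bits (assumed parameters, see lead-in)."""
    bits = [0] * size
    for gram in bigrams(text):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).hexdigest()
            bits[int(digest, 16) % size] = 1
    return bits


def dice(filter_a, filter_b):
    """Dice index: 2h / (|a| + |b|), a value between 0 and 1."""
    h = sum(1 for x, y in zip(filter_a, filter_b) if x == 1 and y == 1)
    a = sum(filter_a)
    b = sum(filter_b)
    if a + b == 0:
        return 0.0  # two empty filters: no evidence of similarity
    return 2 * h / (a + b)


# Hypothetical names, for illustration only.
reference = bloom_filter("MARIA SILVA")
print(dice(reference, bloom_filter("MARIA SILVIA")))
print(dice(reference, bloom_filter("JOSE SANTOS")))
```

Because only the bit vectors are compared, the similarity can be computed without exposing the original attribute values, which is why the anonymization and comparison steps fit together.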

Hybrid approach: individual comparison of attributes based on different rules

Correlação probabilística de bases de dados governamentais. PINTO, C.; PITA, R.; MELO, P.; SENA, S.; BARRETO, M. (Brazilian Symposium on Databases – SBBD 2015)

Record linkage methods – accuracy results

Controlled scenario: 2 databases

4 simulated scenarios with different percentages of changes in records

Main metrics: sensitivity and positive predictive value (PPV; VPP in Portuguese)

Databases                                                Number of records    True matches
Rotavirus (diarrhea; positive exams)                     686                  486
Other causes (children treated at outpatient clinics)    9,678

[Figure: accuracy results for four methods (full prob. without blocking; full prob. with blocking; hybrid prob. without blocking; hybrid prob. with blocking); panels: blocking, without blocking]

Record linkage methods – accuracy results

Uncontrolled scenario: BCG vaccination X SIM (mortality), Manaus state

[Map label: MA]

Databases                                                    Linked pairs    True positives
BCG vaccination (156,331 records) X SIM (16,260 records)     2,247           2,169 (96.53%)
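The percentage reported for this scenario can be reproduced with a short Python sketch, assuming it is the positive predictive value, that is, true positives divided by all linked pairs. The function names are illustrative. Sensitivity is defined as well, but it cannot be computed from the figures on this slide because the number of missed matches (false negatives) is not reported.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: share of linked pairs that are true matches."""
    return true_positives / (true_positives + false_positives)


def sensitivity(true_positives, false_negatives):
    """Sensitivity: share of true matches that the linkage actually found."""
    return true_positives / (true_positives + false_negatives)


# Figures from the BCG vaccination X SIM scenario.
linked_pairs = 2247
true_positives = 2169
false_positives = linked_pairs - true_positives  # linked pairs that are not true matches

print(f"PPV = {ppv(true_positives, false_positives):.2%}")
```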

Record linkage methods – accuracy results

Uncontrolled scenario: CadastroÚnico (2011 extraction)

– Hospitalizations (SIH) by tuberculosis: Sergipe (SE), Santa Catarina (SC) and Rondônia (RO)
– Notifications (SINAN) from Santa Catarina (SC)

[Figure: map labels SC, SE, RO (value shown: 624); result panels: CadastroÚnico X SIH (SE), CadastroÚnico X SIH (RO), CadastroÚnico X SIH (SC), CadastroÚnico X SINAN (SC)]

Approach being discussed

Heterogeneity treated inside the pipeline

How to learn from our data sources and linkage/accuracy results to understand the probabilistic behaviour of our scenarios / projects?

Use of ‘possible worlds’ (pw) abstraction to model these uncertain relationships and create reference (‘gold’) standards to assess and certify accuracy

Data conditioning:
– ETL-based routines (cleansing, standardization)
– Anonymization (Bloom filter)
– Blocking routines
– Comparison blocks

S = (pname, email-addr, home-addr, office-addr)
T = (name, mailing-addr)

Possible mapping                                 Probability
{(pname, name), (home-addr, mailing-addr)}       0.5
{(pname, name), (office-addr, mailing-addr)}     0.4
{(pname, name), (email-addr, mailing-addr)}      0.1

Source instance (S):

pname    email-addr    home-addr        office-addr
Alice    alice@        Mountain View    Sunnyvale
Bob      bob@          Sunnyvale        Sunnyvale

Possible world pw1, Pr(pw1) = 0.5:

name     mailing-addr
Alice    Mountain View
Bob      Sunnyvale

Possible world pw2, Pr(pw2) = 0.4:

name     mailing-addr
Alice    Sunnyvale
Bob      Sunnyvale

Possible world pw3, Pr(pw3) = 0.1:

name     mailing-addr
Alice    alice@
Bob      bob@

DOAN, Anhai et al. Principles of Data Integration, Morgan Kaufmann, 2012.
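The possible-worlds example above can be replayed in a few lines of Python. This is a minimal sketch that builds one target instance per possible mapping and sums the mapping probabilities for each answer, assuming each mapping applies to the whole source table at once. The function names are illustrative and are not taken from the cited book or from the platform.

```python
from collections import defaultdict

# Source instance S and the three possible mappings from the example.
SOURCE_ROWS = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

POSSIBLE_MAPPINGS = [
    ({"pname": "name", "home-addr": "mailing-addr"}, 0.5),
    ({"pname": "name", "office-addr": "mailing-addr"}, 0.4),
    ({"pname": "name", "email-addr": "mailing-addr"}, 0.1),
]


def possible_world(mapping):
    """Target instance T obtained by applying one mapping to every source row."""
    return [
        {target: row[source] for source, target in mapping.items()}
        for row in SOURCE_ROWS
    ]


def mailing_addr_distribution(name):
    """Probability of each mailing address for a person, summed over all worlds."""
    distribution = defaultdict(float)
    for mapping, probability in POSSIBLE_MAPPINGS:
        for row in possible_world(mapping):
            if row["name"] == name:
                distribution[row["mailing-addr"]] += probability
    return dict(distribution)


print(mailing_addr_distribution("Alice"))
print(mailing_addr_distribution("Bob"))
```

For Bob, the home and office addresses coincide, so the two corresponding worlds give the same answer and their probabilities add up.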

Learning from our data sources

IDB 2012 (indicators and basic data) TABNET, provided by DATASUS (http://datasus.saude.gov.br/)

Learning from our linkage results

Dice coefficients with good sensitivity and PPV vary significantly depending on the databases involved.

Usage of supervised and unsupervised machine learning techniques to analyze the accuracy results and (try to) provide a way to eliminate manual review:

– Supervised: ID3 and Naïve Bayes
– Unsupervised: partitional (k-means, CLARA) and hierarchical (AGNES, DIANA)
– Spark MLlib, R cluster / clusteval
– Data mart:
  • CadastroÚnico (2011) x SINAN 2011 (tuberculosis): 4,910 records
– Cross-validation based on a sliding window from block #0 to block #9 as training data

Block #0 – 491 records
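One way to read the sliding-window cross-validation is sketched below in Python. The 4,910 records are split into ten blocks of 491, a training window moves from block #0 to block #9, and the remaining blocks are used for testing. The window width of one block, the use of all other blocks as test data and the function names are assumptions, since the slides do not give these details.

```python
def make_blocks(n_records, n_blocks):
    """Split record indices into equally sized consecutive blocks.

    Any remainder is dropped; 4,910 records divide evenly into 10 blocks.
    """
    size = n_records // n_blocks
    return [list(range(i * size, (i + 1) * size)) for i in range(n_blocks)]


def sliding_window_splits(blocks, window=1):
    """Yield (start, train, test) as the training window slides over the blocks."""
    for start in range(len(blocks) - window + 1):
        in_window = range(start, start + window)
        train = [idx for i in in_window for idx in blocks[i]]
        test = [
            idx
            for i, block in enumerate(blocks)
            if i not in in_window
            for idx in block
        ]
        yield start, train, test


blocks = make_blocks(4910, 10)
for start, train, test in sliding_window_splits(blocks):
    print(f"window at block #{start}: {len(train)} training, {len(test)} test records")
```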



Learning from our linkage results

Metrics:

– Dice coefficient (from linkage), edit distance of name (complete), given name and surname, equality on gender, municipality and birth date (day, month and year)
– Name (complete): low (0-2), medium (3-4), high (>=5)
– Given name: low (0-2), high (>=3)
– Surname: low (0-2), high (>=3)
– Day, month, year, gender, municipality: equal (true), different (false)
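The discretized metrics listed above can be sketched in Python as follows. The sketch assumes the edit distance is the Levenshtein distance, that the given name is the first token of the full name and the surname is the last one, and that names are non-empty. The record layout and function names are hypothetical.

```python
from datetime import date


def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, 1):
        current = [i]
        for j, char_t in enumerate(t, 1):
            cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def band_full_name(distance):
    """Complete name: low (0-2), medium (3-4), high (>=5)."""
    if distance <= 2:
        return "low"
    return "medium" if distance <= 4 else "high"


def band_name_part(distance):
    """Given name or surname: low (0-2), high (>=3)."""
    return "low" if distance <= 2 else "high"


def comparison_features(rec_a, rec_b):
    """Discretized attributes for one candidate pair of records."""
    tokens_a, tokens_b = rec_a["name"].split(), rec_b["name"].split()
    return {
        "name": band_full_name(edit_distance(rec_a["name"], rec_b["name"])),
        "given_name": band_name_part(edit_distance(tokens_a[0], tokens_b[0])),
        "surname": band_name_part(edit_distance(tokens_a[-1], tokens_b[-1])),
        "gender": rec_a["gender"] == rec_b["gender"],
        "municipality": rec_a["municipality"] == rec_b["municipality"],
        "day": rec_a["birth"].day == rec_b["birth"].day,
        "month": rec_a["birth"].month == rec_b["birth"].month,
        "year": rec_a["birth"].year == rec_b["birth"].year,
    }


# Hypothetical pair of records, for illustration only.
a = {"name": "MARIA SOUZA SILVA", "gender": "F",
     "municipality": "SALVADOR", "birth": date(2010, 5, 17)}
b = {"name": "MARIA SOUSA SILVA", "gender": "F",
     "municipality": "SALVADOR", "birth": date(2010, 5, 18)}
print(comparison_features(a, b))
```

Together with the Dice coefficient from the linkage step, a dictionary like this would form one row of the training or test data for the classifiers.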

Supervised methods: ID3

[Figure: cross-validation with ID3 (average of 10 executions); partitioning (training/test) compared with expected results (manual review)]

Unsupervised methods

Current work

Detailed study on models and metrics to deal with uncertainty in probabilistic data linkage scenarios

Generation of new data marts from AtyImo v2 (full + hybrid approach) + accuracy assessment

Generation of training/test data from these data marts

New tests with DataFrame-based API in the spark.ml package.

Thanks! (Obrigado!)

Marcos Barreto

Cohort setup / management

Longitudinal merge of CadastroÚnico based on NIS (social ID) attribute

[Chart: # of lines (millions); annotated value: 114 million]

Metadata and indexing

Metadata table columns: table, filesize, # of records, version

[Diagram: individuals 1 ... N (114,000,000; cohort + other databases) indexed against exposure (payments received) for 2007, 2008, 2009, 2010, 2011, ... 2015 and against outcomes from SINAN, SIM, SINASC and SIH]

Metadata and indexing

Baseline:
– CadastroÚnico: 2007, 2008, ... 2015
– Bolsa Família (PBF): 2007, 2008, ... 2015

Cohort profile

Health data (SUS):
– SINASC: 2007 ... 2012
– SIH: 2007 ... 2012
– SINAN: 2007 ... 2012
– SIM: 2007 ... 2012
