Graphical models for combining multiple data sources
description
Transcript of Graphical models for combining multiple data sources
![Page 1: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/1.jpg)
Graphical models for combining multiple data sources
Nicky BestSylvia Richardson
Chris Jackson
Imperial College BIAS node
with thanks to Peter Green
![Page 2: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/2.jpg)
Outline
• Overview of graphical modelling• Case study 1: Water disinfection byproducts and
adverse birth outcomes – Modelling multiple sources of bias in observational
studies
• Case study 2: Socioeconomic factors and limiting long term illness– Combining individual and aggregate level data– Simulation study – Application to Census and Health Survey for England
![Page 3: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/3.jpg)
Graphical modelling
Modelling
Inference
Mathematics
Algorithms
![Page 4: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/4.jpg)
1. Mathematics
Modelling
Inference
Mathematics
Algorithms
• Key idea: conditional independence• X and Y are conditionally independent given Z if, knowing
Z, discovering Y tells you nothing more about XP(X | Y, Z) = P(X | Z)
![Page 5: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/5.jpg)
Example: Mendelian inheritance
• Z = genotype of parents • X, Y = genotypes of 2 children• If we know the genotype of the parents, then the
children’s genotypes are conditionally independent
Z
X Y
![Page 6: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/6.jpg)
Joint distributions and graphical models
Use ideas from graph theory to: • represent structure of a joint probability
distribution…..• …..by encoding conditional independencies
• Factorization thm:
Jt distribution P(V) = P(v | parents[v])
D
EB
CA
F
![Page 7: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/7.jpg)
Where does the graph come from?
• Genetics– pedigree (family tree)
• Physical, biological, social systems– supposed causal effects
• Contingency tables– hypothesis tests on data
• Gaussian case– non-zeros in inverse covariance matrix
![Page 8: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/8.jpg)
• Conditional independence provides mathematical basis for splitting up large system into smaller components
D
EB
CA
F
![Page 9: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/9.jpg)
• Conditional independence provides mathematical basis for splitting up large system into smaller components
D
EB
C
D
E
F
CA
![Page 10: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/10.jpg)
2. Modelling
• Graphical models provide framework for building probabilistic models for empirical data
Modelling
Inference
Mathematics
Algorithms
![Page 11: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/11.jpg)
Building complex models
Key idea• understand complex system• through global model• built from small pieces
– comprehensible– each with only a few variables– modular
![Page 12: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/12.jpg)
Example: Case study 1
• Epidemiological study of birth defects and mothers’ exposure to water disinfection byproducts
• Background– Chlorine added to tap water supply for disinfection– Reacts with natural organic matter in water to form
unwanted byproducts (including trihalomethanes, THMs)– Some evidence of adverse health effects (cancer, birth
defects) associated with exposure to high levels of THM– We are carrying out study in Great Britain using routine
data, to investigate risk of birth defects associated with exposure to different THM levels
![Page 13: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/13.jpg)
Data sources
• National postcoded births register• National and local congenital anomalies registers• Routinely monitored THM concentrations in tap
water samples for each water supply zone within 14 different water company regions
• Census data – area level socioeconomic factors• Millenium cohort study (MCS) – individual level
outcomes and confounder data on sample of mothers
• Literature relating to factors affecting personal exposure (uptake factors, water consumption, etc.)
![Page 14: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/14.jpg)
Model for combining data sources
yzk
[c]
[T]
pzk
yzi
pzi
czk
z
czi
THMzk[pers]
THMzt[tap]
THMztj[raw]
THMzi[pers]
![Page 15: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/15.jpg)
Model for combining data sources
yzk
[c]
[T]
pzk
yzi
pzi
czk
z
czi
THMzk[pers]
THMzt[tap]
THMztj[raw]
THMzi[pers]
Regression model for national data relating risk of
birth defects (pzk) to mother’s THM exposure
and other confounders (czk)
![Page 16: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/16.jpg)
Model for combining data sources
yzk
[c]
[T]
pzk
yzi
pzi
czk
z
czi
THMzk[pers]
THMzt[tap]
THMztj[raw]
THMzi[pers]
Regression model for MCS data relating risk of birth defects (pzi) to mother’s
THM exposure and other confounders (czi)
![Page 17: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/17.jpg)
Model for combining data sources
yzk
[c]
[T]
pzk
yzi
pzi
czk
z
czi
THMzk[pers]
THMzt[tap]
THMztj[raw]
THMzi[pers]
Missing data model to estimate confounders (czk)
for mothers in national data, using information on within area distribution of
confounders in MCS
![Page 18: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/18.jpg)
Model for combining data sources
yzk
[c]
[T]
pzk
yzi
pzi
czk
z
czi
THMzk[pers]
THMzt[tap]
THMztj[raw]
THMzi[pers]
Model to estimate true tap water THM concentration
from raw data
![Page 19: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/19.jpg)
Model for combining data sources
yzk
[c]
[T]
pzk
yzi
pzi
czk
z
czi
THMzk[pers]
THMzt[tap]
THMztj[raw]
THMzi[pers]
Model to predict personal exposure using estimated tap water THM level and
literature on distribution of factors affecting individual
uptake of THM
![Page 20: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/20.jpg)
3. Inference
Modelling
Inference
Mathematics
Algorithms
![Page 21: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/21.jpg)
Bayesian
![Page 22: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/22.jpg)
… or non Bayesian
![Page 23: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/23.jpg)
Bayesian Full Probability Modelling
• Graphical approach to building complex models lends itself naturally to Bayesian inferential process
• Graph defines joint probability distribution on all the ‘nodes’ in the model
• Condition on parts of graph that are observed (data)
• Update probabilities of remaining nodes using Bayes theorem
• Automatically propagates all sources of uncertainty
![Page 24: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/24.jpg)
4. Algorithms
• Many algorithms, including MCMC, are able to exploit graphical structure
• MCMC: subgroups of variables updated randomly• Ensemble converges to equilibrium (e.g. posterior) dist.
Modelling
Inference
Mathematics
Algorithms
![Page 25: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/25.jpg)
MCMC
?
?Updating - need only look at neighbours
Key idea exploited by WinBUGS software
![Page 26: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/26.jpg)
Case study 2
• Socioeconomic factors affecting health• Background
– Interested in individual versus contextual effects of socioeconomic determinants of health
– Often investigated using multi-level studies (individuals within areas)
– Ecological studies also widely used in epidemiology and social sciences due to availability of small-area data
• investigate relationships at level of group, rather than individual• outcome and exposures are available as group-level summaries• usual aim is to transfer inference to individual level
![Page 27: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/27.jpg)
Building the model
x[c]ik
yik
[b]
[c]
i
pikx[b]ik
Multilevel model for individual data
![Page 28: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/28.jpg)
Building the model
Multilevel model for individual data
yik ~ Bernoulli(pik), person k, area i
x[c]ik
yik
[b]
[c]
i
pikx[b]ik
![Page 29: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/29.jpg)
Building the model
Multilevel model for individual data
yik ~ Bernoulli(pik), person k, area i
x[c]ik
yik
[b]
[c]
i
pikx[b]ik
log pik = i + [c] x[c]ik + [b] x[b]ik
![Page 30: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/30.jpg)
Building the model
Multilevel model for individual data
yik ~ Bernoulli(pik), person k, area i
x[c]ik
yik
[b]
[c]
i
pikx[b]ik
log pik = i + [c] x[c]ik + [b] x[b]ik
i ~ Normal(0, )
![Page 31: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/31.jpg)
Building the model
Multilevel model for individual data
yik ~ Bernoulli(pik), person k, area i
x[c]ik
yik
[b]
[c]
i
pikx[b]ik
log pik = i + [c] x[c]ik + [b] x[b]ik
i ~ Normal(0, )
Prior distributions on 2, [c], [b]
![Page 32: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/32.jpg)
Building the model
X[c]i
Yi
[b]
[c]
i
i X[b]i
V[c]i
Ni
Ecological model
![Page 33: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/33.jpg)
Building the model
X[c]i
Yi
[b]
[c]
i
i X[b]i
V[c]i
Ecological model
Yi ~ Binomial(i, Ni), area i
Ni
![Page 34: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/34.jpg)
Building the model
X[c]i
Yi
[b]
i
i X[b]i
V[c]i
Ecological model
Yi ~ Binomial(i, Ni), area i
i = pik(x[b], x[c]) fi(x[b], x[c]) dx[b]dx[c]
Ni
[c]
![Page 35: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/35.jpg)
Assuming x[b], x[c] independent, with
X[b]i = proportion exposed to ‘b’ in area
i and fi(x[c]) = Normal(X[c]i, V[c]i), then
i = q0i(1-X[b]i) + q1iX[b]i
where
q0i = marginal prob of disease for unexposed
= exp(i + [c]X[c]I + 2[c]V[c]i/2)
Building the model
X[c]i
Yi
[b]
i
i X[b]i
V[c]i
Ecological model
Yi ~ Binomial(i, Ni), area i
i = pik(x[b], x[c]) fi(x[b], x[c]) dx[c]dx[c]
Ni
[c]
![Page 36: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/36.jpg)
Assuming x[b], x[c] independent, with
X[b]i = proportion exposed to ‘b’ in area
i and fi(x[c]) = Normal(X[c]i, V[c]i), then
i = q0i(1-X[b]i) + q1iX[b]i
where
q1i = marginal prob of disease for exposed
= exp(i + [b] + [c]X[c]I + 2[c]V[c]i/2)
Building the model
X[c]i
Yi
[b]
i
i X[b]i
V[c]i
Ecological model
Yi ~ Binomial(i, Ni), area i
i = pik(x[b], x[c]) fi(x[b], x[c]) dx[b]dx[c]
Ni
[c]
![Page 37: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/37.jpg)
Building the model
X[c]i
Yi
[b]
[c]
i
i X[b]i
V[c]i
Ecological model
Yi ~ Binomial(i, Ni), area i
i = pik(x[b], x[c]) fi(x[b], x[c]) dx[b]dx[c]
Ni
i ~ Normal(0, )
![Page 38: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/38.jpg)
Building the model
X[c]i
Yi
[b]
[c]
i
i X[b]i
V[c]i
Ecological model
Yi ~ Binomial(i, Ni), area i
i = pik(x[b], x[c]) fi(x[b], x[c]) dx[b]dx[c]
Ni
i ~ Normal(0, )
Prior distributions on 2, [b], [c]
![Page 39: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/39.jpg)
Combining individual and aggregate data
• Individual level survey data often lack power to inform about contextual and/or individual-level effects
• Even when correct (integrated) model used, ecological data often contain little information about some or all effects of interest
• Can we improve inference by combining both types of model / data?
![Page 40: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/40.jpg)
Combining individual and aggregate data
X[c]i
Yi
[b]
[c]
i
i X[b]i
V[c]i
Ni
x[c]ik
yik
[b]
[c]
i
pikx[b]ik
Multilevel model for individual data
Ecological model
![Page 41: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/41.jpg)
Combining individual and aggregate data
X[c]i
Yi
[b]
[c]
i
i X[b]i
V[c]i
Ni
x[c]ik
yik
pikx[b]ik
Hierarchical Related Regression (HRR) model
![Page 42: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/42.jpg)
Simulation Study
Posterior mean beta[b](log OR for binary cov.)
0.2 0.4 0.6 0.8 1.00.2 0.4 0.6 0.8 1.0Ecological+ sample of 10
Wider continuousexposure range
Wider binaryexposure range
Basic case,individual data only
Basic case
Posterior mean beta[c](log OR for cts cov.)
0.5 0.7 0.9 1.10.5 0.7 0.9 1.1
![Page 43: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/43.jpg)
Simulation Study
Posterior mean beta[b](log OR for binary cov.)
0.0 0.2 0.4 0.6 0.8 1.00.0 0.2 0.4 0.6 0.8 1.0Ecological+ sample of 10
Correlation,modelled
Correlation,ignored
Misspecified distribution
Estimated within-area variance
Posterior mean beta[c](log OR for cts cov.)
0.5 0.7 0.9 1.10.5 0.7 0.9 1.1
![Page 44: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/44.jpg)
Simulation Study
Posterior mean beta[b](log OR for binary cov.)
0.0 0.2 0.4 0.6 0.8 1.00.0 0.2 0.4 0.6 0.8 1.0Ecological+ sample of 10
Large ME, individualdata only
Large measurementerror
Small measurementerror
Posterior mean beta[c](log OR for cts cov.)
0.0 0.2 0.4 0.6 0.8 1.00.0 0.2 0.4 0.6 0.8 1.0
![Page 45: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/45.jpg)
Comments
• Inference from aggregate data can be unbiased provided exposure contrasts between areas are high (and appropriate integrated model used)
• Combining aggregate data with small samples of individual data can reduce bias when exposure contrasts are low
• Combining individual and aggregate data can reduce MSE of estimated compared to individual data alone
• Individual data cannot help if individual-level model is misspecified
![Page 46: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/46.jpg)
Application to LLTI
• Health outcome– Limiting Long Term Illness (LLTI) in men aged 40-59 yrs
living in London
• Exposures– ethnicity (white/non-white), income, area deprivation
• Data sources– Aggregate: 1991 Census aggregated to ward level and
ACORN (CACI) ward level income data – Individual: Health Survey for England (with ward identifier)
• 1-9 observations per ward (median 1.6)
![Page 47: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/47.jpg)
Ward level data
Deprivation
Deprivation
Dep
rivat
ion
% non white
% non white% non white
Mean income
Mea
n in
com
e
Mea
n in
com
e
Pre
vale
nce
of
LLT
I
Pre
vale
nce
of
LLT
I
Pre
vale
nce
of
LLT
I
![Page 48: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/48.jpg)
ResultsModel Non-white Log income Deprivation Between-
area variance
Individual -0.36
(-0.98, 0.23)
-0.55
(-0.80, -0.32)
-0.022
(-0.032, 0.074)
0.18
(0.052, 0.64)
Ecological 0.50
(0.27, 0.72)
-0.72
(-0.93, -0.51)
0.063
(0.054, 0.073)
0.19
(0.17, 0.21)
Combined 0.48
(0.23, 0.72)
-0.70
(-0.91, -0.50)
0.064
(0.054, 0.074)
0.19
(0.17, 0.22)
Combined (correlation modelled)
0.50
(0.24, 0.73)
-0.71
(-0.91, -0.51)
0.064
(0.054, 0.073)
0.19
(0.17, 0.21)
![Page 49: Graphical models for combining multiple data sources](https://reader035.fdocuments.in/reader035/viewer/2022062410/56815999550346895dc6e02d/html5/thumbnails/49.jpg)
Concluding Remarks
• Graphical models are powerful and flexible tool for building realistic statistical models for complex problems– Applicable in many domains– Allow exploiting of subject matter knowledge– Allow formal combining of multiple data sources– Built on rigorous mathematics– Principled inferential methods
Thank you for your attention!