Predictive statistical modelling approach to …...The confidence intervals of 3 Nigeria subnational...
Predictive statistical modelling approach to estimating TB prevalence
Sandra Alba, Ente Rood, Masja Straetemans and Mirjam Bakker
Model inputs and outputs
Dependent variable
- Bacteriologically confirmed (BC) TB prevalence
- Surveys conducted after 2007 (“Lime book” methodology)
- Subnational TB prevalence estimates
Independent variables
- TB data, programmatic factors, co-morbidities and socio-environmental predictors
- National level data: TME, WB, GHR, UNICEF, IDF
- Subnational data: NTP, DHS, MICS, CBS and other representative surveys
- Predictors only available nationally: national average used at subnational level
- Total used in univariate analyses: 37
Training set
- 30 datapoints in total
Countries to predict
- 2013 estimates
- 25 low and 49 middle income countries
- without prevalence survey
- expected prevalence <0.1% according to WHO estimates
Total 30 data points
13 national prevalence surveys
• 2007 Philippines
• 2007 Vietnam
• 2008 Bangladesh
• 2009 Myanmar
• 2010 China
• 2011 Pakistan
• 2011 Cambodia
• 2011 Ethiopia
• 2011 Lao PDR
• 2012 Gambia
• 2012 Nigeria
• 2012 Rwanda
• 2012 Thailand
Waiting for Tanzania, Ghana, Malawi, Sudan, Zambia and Indonesia
Subnational estimates from 5 countries
• Vietnam (3 areas)
• Myanmar (2)
• China (3)
• Pakistan (6)
• Nigeria (6)
2 district level surveys in India
• 2009 Jabalpur (Madhya Pradesh)
• 2009 Bangalore Rural (Karnataka)
• 2007 Thiruvallur (Tamil Nadu): dropped - methodology?
On the lookout for reports of surveys conducted in Wardha, Agra (Jalma) and Faridabad districts
Training set vs. predictions
Prevalence estimates in training set, by country
[Figure: one panel per survey (2007 PHL, 2007 VNM, 2008 BGD, 2009 IND, 2009 MMR, 2010 CHN, 2011 ETH, 2011 KHM, 2011 LAO, 2011 PAK, 2012 GMB, 2012 NGA, 2012 RWA, 2012 THA). Y-axis: rate per 100,000 (0 to 1000). X-axis: subnational area (0 to 6), where subnational area = 0 refers to the national estimate. Markers: point estimate with 95% CI.]
Numerators and denominators
Candidate models for this task included GLMs, for which the numerator and denominator need to be specified explicitly.
Prevalence surveys report
- numerators (BC TB)
- denominators (number of participants in survey)
- estimated prevalence resulting from models
However
1. The ratio between these two does not equal the final estimated prevalence: models take into account population weighting, clustering, non-participation and missing values.
2. Subnational data: numerators and denominators are sometimes not available.
Adjusted numerators and denominators
Solution: adjusted number of BC cases and participants based on
- prevalence estimates and confidence intervals
- the average of
  - n1=(p*(1-p))/(((ul-p)/1.96)^2)
  - n2=(p*(1-p))/(((ll-p)/1.96)^2)
  where p is the prevalence estimate and ll and ul are the lower and upper limits of its 95% confidence interval
Very crude method, needs to be revised at a later stage
- to adequately capture the asymmetrical nature of the CI for a proportion
- Arcsine transformation?
Note:
- Adjusted numerators and denominators are approximately half of the number of cases and participants in the survey
- Consistent with a design effect = 2
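The back-calculation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `adjusted_counts` is hypothetical, the adjusted numerator is assumed to be p times the adjusted denominator (the slide does not spell this out), and the example survey values are invented.

```python
Z_95 = 1.96  # normal quantile used on the slide for a 95% CI


def adjusted_counts(p, ll, ul):
    """Back-calculate an effective denominator and numerator from a
    prevalence estimate p and its 95% confidence limits (all proportions)."""
    n1 = p * (1 - p) / ((ul - p) / Z_95) ** 2  # sample size implied by the upper limit
    n2 = p * (1 - p) / ((ll - p) / Z_95) ** 2  # sample size implied by the lower limit
    n_adj = (n1 + n2) / 2                      # average of the two, as on the slide
    cases_adj = p * n_adj                      # assumption: numerator = p * denominator
    return n_adj, cases_adj


# Hypothetical survey: 250 per 100,000 with 95% CI 200 to 310 per 100,000.
n_adj, cases_adj = adjusted_counts(p=0.0025, ll=0.0020, ul=0.0031)
print(round(n_adj), round(cases_adj, 1))
```

Because the interval for a small proportion is asymmetric, n1 and n2 differ, which is the weakness the slide itself flags.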
Model fitting
Two types of GLM considered
- binomial (logistic link)
- negative binomial (offset: log adjusted number of participants)
+ A random effect to account for clustering by country.
Model building strategy:
- Univariate models fitted against 37 predictor variables (complete data)
- Fit assessed by AIC
- Multivariate model: 10 cases/covariate to avoid overfitting, so 30 datapoints allow 3 predictors
- Variables dropped by backward elimination (p<0.05)
- Principal components analysis for variable reduction
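As a rough illustration of two steps in this strategy, the sketch below computes the predictor budget from the 10-cases-per-covariate rule and compares two candidate fits by AIC using a binomial log-likelihood. The helper names and the two-datapoint example are hypothetical, and the fitted probabilities are supplied by hand rather than estimated by a GLM routine.

```python
import math


def binomial_loglik(cases, participants, fitted_p):
    """Binomial log-likelihood summed over data points."""
    total = 0.0
    for y, n, p in zip(cases, participants, fitted_p):
        # log of the binomial coefficient; lgamma also accepts non-integer counts
        log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
        total += log_choose + y * math.log(p) + (n - y) * math.log(1 - p)
    return total


def aic(loglik, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik


# Rule of thumb from the slide: 10 data points per covariate, 30 data points.
max_predictors = 30 // 10

# Hypothetical data: two surveys, adjusted cases and participants.
cases = [20, 50]
participants = [10000, 10000]
ll_pooled = binomial_loglik(cases, participants, [0.0035, 0.0035])    # one shared prevalence
ll_separate = binomial_loglik(cases, participants, [0.0020, 0.0050])  # one prevalence each
print(max_predictors, aic(ll_pooled, 1), aic(ll_separate, 2))
```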
Final model
Best fitting final model:
• Binomial model (logistic link)
• Without 3 subnational estimates in Nigeria with very large confidence intervals (North Central, North West and South South)
• lower AIC
Climatic score:
• PCA score: average temperature, maximum temperature in warmest month, average rainfall
• higher values indicate warmer, wetter countries (tropical/subtropical countries)
• First component explains 77% of variation
Final model
Final multivariate model coefficients (binomial), logistic scale

Model predictors                Coefficient   Strength
(Intercept)                     -3.03588
Climate score                    0.16039       160
New laboratory confirmed rate    0.00812         8
BCG coverage                    -0.03610       -36
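To show how these coefficients turn into a prevalence, here is a minimal sketch that applies the inverse logit to the linear predictor. The coefficients are those in the table; the function name, the input units (new laboratory confirmed rate per 100,000, BCG coverage in percent) and the example inputs are assumptions, and the country-level random effect is ignored.

```python
import math

# Coefficients from the final model table (logistic scale).
COEF = {
    "intercept": -3.03588,
    "climate_score": 0.16039,
    "new_lab_confirmed_rate": 0.00812,
    "bcg_coverage": -0.03610,
}


def predict_prevalence(climate_score, new_lab_confirmed_rate, bcg_coverage):
    """Predicted prevalence as a proportion: inverse logit of the linear predictor."""
    eta = (
        COEF["intercept"]
        + COEF["climate_score"] * climate_score
        + COEF["new_lab_confirmed_rate"] * new_lab_confirmed_rate
        + COEF["bcg_coverage"] * bcg_coverage
    )
    return 1 / (1 + math.exp(-eta))


# Hypothetical inputs: climate score 0, 50 new lab-confirmed cases per 100,000,
# 90% BCG coverage (assumed units, not taken from the slides).
p = predict_prevalence(0.0, 50.0, 90.0)
print(f"{p * 100_000:.0f} per 100,000")
```

The signs behave as the table suggests: a higher climate score or notification rate raises the prediction, and higher BCG coverage lowers it.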
Predicted vs. observed (training set)
[Figure: predicted prevalence (y-axis, 0 to .008) against observed prevalence (x-axis, 0 to .008) for the training set.]
Model fit
Cross validation, k=2, repeated 5 times: R-sq (mean) = 0.76
Cross validation, k=2, repeated 1000 times: R-sq (median) = 0.57
[Figures: density of the deviance residuals (x-axis, -5 to 5); fitted prevalence p_hat (y-axis, 0 to .008) against deviance residual (x-axis, -5 to 5).]
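The repeated 2-fold cross-validation quoted on this slide could be organised along the following lines. This is a sketch under several assumptions: a simple least-squares line stands in for the binomial GLM, R-squared is taken as 1 minus the ratio of residual to total sum of squares on the held-out fold (the slides do not say how it was computed), and the data are synthetic.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares slope and intercept (stand-in for the real model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def r_squared(obs, pred):
    """1 - SS_res / SS_tot, evaluated on held-out observations."""
    mean_obs = statistics.fmean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def repeated_two_fold(xs, ys, repeats, seed=0):
    """Return one held-out R-squared per fold per repeat."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(idx)
        half = len(idx) // 2
        folds = [idx[:half], idx[half:]]
        # each half serves once as the test fold and once as the training fold
        for test_fold, train_fold in (folds, folds[::-1]):
            slope, intercept = fit_line(
                [xs[i] for i in train_fold], [ys[i] for i in train_fold]
            )
            pred = [intercept + slope * xs[i] for i in test_fold]
            scores.append(r_squared([ys[i] for i in test_fold], pred))
    return scores


# Synthetic data with 30 points, mirroring only the size of the training set.
data_rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + data_rng.gauss(0, 5) for x in xs]
scores = repeated_two_fold(xs, ys, repeats=1000)
print(statistics.fmean(scores), statistics.median(scores))
```

With only 15 points per fold, individual splits can give very different R-squared values, which may be why the slide reports a median for the 1000-repeat run.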
WHO estimates vs. model predictions
[Figure: model predictions (y-axis, 0 to 3000) against WHO estimates (x-axis, 0 to 1000). Labelled points: CAF, NER, SOM, ZAF.]
Outliers
[Figures: density plots of the three model predictors, with SOM, CAF and NER marked: climate score (scores for component 1, -4 to 2; β=0.160), new lab confirmed rate (new_labconfr, 0 to 200; β=0.008) and BCG (bcg, 20 to 100; β=-0.036).]
“Bland and Altman” plot of agreement
Model predictions greater than WHO estimates
• mean difference = 55 cases per 100,000 (excl. 3 outliers)
• random scatter around this difference (apart from outliers)
[Figure: Bland and Altman plot. Y-axis: difference (WHO estimate - model predictions), -3000 to 1000. X-axis: mean (WHO estimate, model predictions), 0 to 2000. Labelled points: NER, ZAF, SOM, CAF.]
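For readers unfamiliar with this kind of agreement plot, the quantities behind it can be computed as in the sketch below. The function name and the four paired rates are hypothetical, and the 95% limits of agreement are the standard Bland and Altman addition rather than something reported on the slides.

```python
import statistics


def bland_altman(who, model):
    """Per-pair means and differences, mean difference (bias) and
    95% limits of agreement for paired estimates."""
    diffs = [w - m for w, m in zip(who, model)]        # WHO estimate - model prediction
    means = [(w + m) / 2 for w, m in zip(who, model)]  # x-axis of the plot
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits


# Hypothetical rates per 100,000 for four countries (not values from the slides).
who = [100.0, 250.0, 400.0, 600.0]
model = [140.0, 320.0, 430.0, 690.0]
means, diffs, bias, limits = bland_altman(who, model)
print(bias, limits)
```

A negative bias under this sign convention corresponds to model predictions that are, on average, higher than the WHO estimates.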
BC in adults vs. all forms all ages
WHO estimates: all forms, all ages
Model predictions: BC in adults
→ model predictions "too high"
Solution? WHO estimates of BC in adults?
→ keep model "free" from WHO assumptions
Crude adjustment: correct BC in adults by a factor of 0.83
→ ratio from TME prevalence survey dataset
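The crude adjustment amounts to a single rescaling, sketched here with a hypothetical function name and an invented input value; the factor 0.83 is the one quoted on the slide.

```python
BC_ADULT_TO_ALL_FORMS = 0.83  # ratio quoted on the slide (TME prevalence survey dataset)


def to_all_forms_all_ages(bc_adult_rate_100k):
    """Blanket rescaling of a BC-in-adults rate to an all-forms, all-ages rate."""
    return bc_adult_rate_100k * BC_ADULT_TO_ALL_FORMS


# Hypothetical model prediction of 300 BC cases in adults per 100,000.
print(to_all_forms_all_ages(300.0))
```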
“Bland and Altman” plot of agreement - adjusted estimates
[Figure: Bland and Altman plot for the adjusted estimates. Y-axis: difference (WHO estimate - model predictions), -2000 to 500. X-axis: mean (WHO estimate, model predictions), 0 to 1500. Labelled points: CAF, NER, SOM, ZAF.]
Model predictions greater than WHO estimates
• mean difference = 3 cases per 100,000 (excl. 3 outliers)
Limitations of this correction:
• too crude, blanket correction for all estimates after model prediction
• better to compare with WHO BC in adults estimates
Discussion
Prevalence model successfully fitted
• More datapoints with less precision vs. fewer datapoints with more precision → sensitivity analysis
• Model predictions broadly in line with WHO estimates
• Model estimates heavily reliant on climatic score. Useful?
• CAR and Somalia → sensitivity analyses excl. climate score
Methodological improvements
• More precise estimates of adjusted BC and participant numbers
• Confidence intervals, propagation of error
• How to factor in time (lags, repeat surveys)
• Predictions for high vs. low prevalence estimates (overestimate/underestimate low prevalences with logistic model?)
• Include survey specific variables (coverage, participation rate) as random effects to filter out nuisance variability induced by these factors
• Consider fitting two models (Asia and Africa)
Data wishlist
From WHO
• BC adults estimates using WHO estimation methods
• Estimates from more recent prevalence surveys
• China disaggregated NTP data
• Reports for all India district level surveys
Note: in addition we will also include the following:
• Disaggregated data for climate, population density
• New data recently compiled (large cities, prevalence of high risk groups)
Questions?
Comments?
Suggestions for improvement?
Extra slide: adjustments to prevalence estimates
The Bangladesh survey only reported SS+, so BC was estimated based on the ratio between SS+ and BC from prevalence surveys conducted in the WPR and SEA regions in 2007 (year of Bangladesh survey). The surveys used for the calculation were thus: China, Cambodia, Lao People's Democratic Republic, Myanmar, Philippines, Thailand and Viet Nam. The ratio was about 0.456, so the prevalence of BC was estimated as follows: prev_bc_100k=prev_sp_100k/0.4565.
The report from the Jabalpur survey concluded that BC estimates from the survey should be corrected by a factor of 1.7 to account for the absence of x-ray screening; this correction was applied.
The confidence intervals of 3 Nigeria subnational estimates were very wide. Given the paucity of datapoints for model 2 these were kept for modelling, but their impact on model fit was assessed after all modelling.
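The two survey-level adjustments described on this slide are simple rescalings and can be written out as below. The constants come from the slide; the function names and the example input are hypothetical.

```python
SS_TO_BC_RATIO = 0.4565  # smear-positive (SS+) to BC ratio used in the slide's formula
NO_XRAY_FACTOR = 1.7     # correction recommended in the Jabalpur survey report


def bc_from_smear_positive(prev_sp_100k):
    """Estimate BC prevalence per 100,000 from SS+ prevalence (Bangladesh survey)."""
    return prev_sp_100k / SS_TO_BC_RATIO


def correct_for_no_xray(prev_bc_100k):
    """Scale up a BC estimate from a survey without x-ray screening (Jabalpur survey)."""
    return prev_bc_100k * NO_XRAY_FACTOR


# Hypothetical SS+ prevalence of 80 per 100,000.
print(round(bc_from_smear_positive(80.0), 1))
```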