Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada
Taha Kass-Hout, MD, MS
Nicolás di Tada
October 2008
MACHINE LEARNING AND DISEASE SURVEILLANCE
Image source: http://www.birds.cornell.edu/crows/images/deadcrow.jpg
Image source: http://farm3.static.flickr.com/2029/2239605500_6ef2fd2295.jpg?v=0
LATE DETECTION – RESPONSE
[Figure: cases by day, with the opportunity for control marked]
EARLY DETECTION AND RESPONSE
[Figure: cases by day, with the opportunity for control marked]
INFORMATION SOURCES
Event-based – ad-hoc unstructured reports issued by formal or informal sources
Indicator-based – (number of cases, rates, proportion of strains…)
PUBLIC HEALTH MEASURES
Representativeness
Completeness
Predictive Value
Timeliness
PUBLIC HEALTH MEASURES
1000 Malaria infections (100%)
50 Malaria notifications (5%)
Get as close to the bottom of the pyramid as possible
Urge frequent reporting: weekly, daily, immediately
Specificity / Reliability
Sensitivity / Timeliness
• Main attributes
  o Representativeness
  o Completeness
  o Predictive value positive
Analyze and interpret
Signal as early as possible
Automated analysis/thresholds
Time
• Main attributes
  o Timeliness
PUBLIC HEALTH MEASURES
Health care hotline
THE PROBLEM SPACE
Current systems design, analysis and evaluation have been geared towards specific data sources and detection algorithms, not humans
We have systems in place for those threats we have been faced with before
PUBLIC HEALTH – TWO PERSPECTIVES
Case management
• Individual cases of notifiable diseases
• Relationship networks (contact tracing)
Population surveillance
• Larger risk patterns
CASE MANAGEMENT
Questions/problems:
• Is a case due to recent transmission?
• If so, does the case share any feature with other, recent cases?
Ways it's being done:
• Investigations/interviews
• Meeting with other investigators
POPULATION SURVEILLANCE
Questions/problems:
• Are more cases happening than expected?
• Does an excess suggest ongoing transmission in a specific region?
Way it's being done:
• Semi-automated routine temporal and space-time statistical analysis
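A minimal sketch of the "more cases than expected?" question, assuming the simplest possible temporal rule: flag a count that exceeds the baseline mean plus k standard deviations. The function name, the value of k and the weekly counts are illustrative assumptions, not taken from the slides, and real systems use more careful temporal and space-time statistics.

```python
from statistics import mean, stdev

def flag_excess(baseline_counts, current_count, k=2.0):
    """Return True when current_count exceeds baseline mean + k sample std devs."""
    threshold = mean(baseline_counts) + k * stdev(baseline_counts)
    return current_count > threshold

# Hypothetical weekly case counts for one region (illustrative numbers only).
baseline = [12, 15, 11, 14, 13, 12, 16, 14]

print(flag_excess(baseline, 15))  # within normal variation for this baseline
print(flag_excess(baseline, 25))  # well above the threshold
```

Raising k trades sensitivity for specificity, which is the same tension the earlier slides describe between timeliness and reliability.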
WHY LOCATION MATTERS – CASE MANAGEMENT
If you are studying a case of a certain disease that was just declared, it is harder to picture the situation by looking at something like this...
[Image not preserved in transcript]
...than by looking at this.
[Image not preserved in transcript]
WHY LOCATION MATTERS – POP SURVEILLANCE
If you are studying the spatial distribution of a set of disease clusters, this would seem more difficult...
[Image not preserved in transcript]
...than this.
[Image not preserved in transcript]
MODERN DISEASE SURVEILLANCE
In the past two decades, much disease surveillance research has focused on developing analytical methods for automatically detecting anomalous patterns in data
Modern methods can achieve timely detection of anomalies by incorporating temporal, spatial, and multivariate information
9/20, 15213, cough/cold, …
9/21, 15207, antifever, …
9/22, 15213, CC = cough, ...
1,000,000 more records…
Huge mass of data
Detection algorithm
Too many alerts
“What are we supposed to do with this?”
MODERN DISEASE SURVEILLANCE
Huge mass of data
Feedback loop
ADVANTAGES OF MACHINE LEARNING
P(malaria) = 22%
P(influenza) = 13%
P(other ILI) = 33%
MACHINE LEARNING TECHNIQUES
• Classifiers
• Clustering
• Bayesian Statistics
• Neural Networks
• Genetic Algorithms
HOW TO REPRESENT A DOCUMENT?
“This morning I woke up with fever, I might have a flu.”
“I had a flu last month. […] I had a flu early this week.”
[Figure: documents plotted on two axes, "flu" and "fever"]
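One way to read this slide is as a term-count ("bag of words") representation: each document becomes a vector with one coordinate per vocabulary term. The sketch below assumes a two-term vocabulary matching the two axes; the tokenizer and function name are illustrative choices, not the authors' method.

```python
import re
from collections import Counter

VOCABULARY = ["flu", "fever"]  # assumption: the two axes shown on the slide

def to_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc1 = "This morning I woke up with fever, I might have a flu."
doc2 = "I had a flu last month. […] I had a flu early this week."

print(to_vector(doc1))  # [1, 1]
print(to_vector(doc2))  # [2, 0]
```

Both documents mention flu, but only the first also mentions fever, so the two land at different points in the flu/fever plane.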
CLASSIFIERS – PROBLEM DEFINITION
• Map items to vectors (feature extraction)
• Normalize those vectors
• Train the classifier
• Measure the results with new information
• Feed the results back to the classifier
• Separate classes in feature space
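The steps above can be sketched end to end with a very small linear classifier. This example swaps in a perceptron (not the SVM discussed on the following slides) because it fits in a few lines of plain Python. The three-term feature vectors, the labels and all function names are assumptions made for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Learn weights and bias so that sign(w.x + b) matches the labels."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if score > 0 else -1
            if predicted != y:  # update only on mistakes
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# Hypothetical term counts (flu, fever, holiday); +1 = health report, -1 = other.
raw_train = [[2, 1, 0], [1, 2, 0], [1, 1, 0], [0, 0, 2], [0, 1, 3], [1, 0, 3]]
train_y = [1, 1, 1, -1, -1, -1]
train_x = [normalize(v) for v in raw_train]

weights, bias = train_perceptron(train_x, train_y)

# Measure the result on documents the classifier has not seen.
print(predict(weights, bias, normalize([3, 1, 0])))
print(predict(weights, bias, normalize([0, 0, 1])))
```

The feedback step would correspond to adding misclassified new documents, with corrected labels, to the training set and retraining.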
CLASSIFIERS - SVM
SVM – MARGIN MAXIMIZATION
Support vectors define the separator
SVM – NON LINEAR?
Φ: x → φ(x)
Map to higher-dimension space
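A small illustration of why mapping to a higher-dimensional space helps, assuming the hypothetical map φ(x) = (x, x²) and made-up one-dimensional data. No single threshold separates the classes on the original axis, but one threshold on the new x² coordinate does.

```python
def phi(x):
    """Hypothetical feature map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)

def separable_by_threshold(values, labels):
    """True if one cut point on `values` puts each class on its own side."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    return max(neg) < min(pos) or max(pos) < min(neg)

# Illustrative data: the positive class sits on both sides of the negative class.
points = [-3, -2, -1, 0, 1, 2, 3]
labels = [1, 1, -1, -1, -1, 1, 1]

print(separable_by_threshold(points, labels))                       # False
print(separable_by_threshold([phi(x)[1] for x in points], labels))  # True
```

A threshold on the second coordinate is a straight line in the mapped space, which is the kind of separator a linear SVM can find.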
SVM – FILTERING OR CLASSIFYING
[Figure: training documents feed a classifier; Document 1, Document 2 and Document 3 are sorted into Positives and Negatives]
CLUSTERING – PROBLEM DEFINITION
• Map items to vectors (feature extraction)
• Normalization
• Agglomerative and partitional
CLUSTERING - AGGLOMERATIVE
CLUSTERING - PARTITIONAL
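The slide's figure is not preserved, so as one common example of partitional clustering here is a minimal k-means sketch. The choice of k-means, the deterministic initialization from the first k points and the sample coordinates are all assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=10):
    """Partition points into k clusters; returns (assignments, centroids)."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(1, 1), (8, 9), (1, 2), (2, 1), (9, 8), (9, 9)]
assignments, centroids = k_means(points, k=2)
print(assignments)  # [0, 1, 0, 0, 1, 1]
```

An agglomerative method would instead start with every point in its own cluster and repeatedly merge the two closest clusters.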
BAYESIAN STATISTICS
$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$
P(A | B): probability of disease A (flu) once symptoms B (fever) are observed
P(B | A): probability of fever once flu is confirmed
P(A): probability of flu (prior or marginal)
P(B): probability of fever (prior or marginal)
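A worked instance of the rule, with A = flu and B = fever. The three input probabilities are invented for illustration and are not from the slides.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Illustrative numbers only (assumptions, not surveillance data).
p_fever_given_flu = 0.90  # P(B|A)
p_flu = 0.05              # P(A)
p_fever = 0.15            # P(B)

p_flu_given_fever = posterior(p_fever_given_flu, p_flu, p_fever)
print(round(p_flu_given_fever, 2))  # 0.3
```

Under these assumed numbers, observing fever raises the probability of flu from 5% to about 30%.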
NEURAL NETWORKS
Given a set of stimuli, train a system to produce a given output
[Figure: input layer, hidden layer, output layer]
NEURAL NETWORKS - STRUCTURE
[…]
[…]
{I0, I1, …, In}
{O0, O1, …, On}
Weight
$H_n = \sum_{i=0}^{I} \left( I_i \cdot w_{in} \right)$
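A minimal sketch of the weighted sum on this slide, where each hidden unit adds up every input multiplied by its connecting weight. The sigmoid activation, the layer sizes and the weight values are assumptions, since the slide shows only the weighted sum.

```python
import math

def weighted_sums(inputs, weights):
    """weights[i][n] is the weight from input i to unit n; returns one sum per unit."""
    n_units = len(weights[0])
    return [
        sum(inputs[i] * weights[i][n] for i in range(len(inputs)))
        for n in range(n_units)
    ]

def sigmoid(x):
    """Assumed activation function; not specified on the slide."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_output):
    """Input layer -> hidden layer -> output layer."""
    hidden = [sigmoid(h) for h in weighted_sums(inputs, w_hidden)]
    return [sigmoid(o) for o in weighted_sums(hidden, w_output)]

# Hypothetical network: 2 inputs, 2 hidden units, 1 output.
inputs = [1.0, 0.5]
w_hidden = [[0.2, -0.4], [0.6, 0.8]]
w_output = [[1.0], [-1.0]]

hidden_sums = weighted_sums(inputs, w_hidden)  # approximately [0.5, 0.0]
output = forward(inputs, w_hidden, w_output)
print(hidden_sums, output)
```

Training would adjust the weights so that the output approaches the desired value for each stimulus. That step is omitted here.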
NEURAL NETWORK - APPLICATION
Event?
GENETIC ALGORITHM - BASICS
• Define the model that you want to optimize
• Create the fitness function
• Evolve the gene pool, testing against the fitness function
• Select the best individual
GENETIC ALGORITHM – MODEL
Model the transmission process using a set of parameters:
• Onset time between an infection and illness
• Latency period
• Incubation period
• Symptomatic period
• Infectious period
(Onset, Latency, Incubation, Symptomatic, Infectious)
(2 days, 3 days, 1 day, 4 days, 3 days)
GENETIC ALGORITHM – MODEL FITNESS
Fitness = 1/Area
GENETIC ALGORITHM – PROCESS
1. Create an initial population of candidates
2. Use operators to generate new candidates (mating and mutation)
3. Discard the worst individuals or select the best individuals in the generation
4. Repeat from 2 until you find a candidate that is a satisfactory solution
(4,5,6,3,5) (4,3,6,2,5)
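The four steps can be sketched on five-parameter candidates like the ones shown. The slides define fitness as 1/Area, but the figure that defines Area is not preserved, so this sketch substitutes a hypothetical error measure: distance to an assumed target tuple. The target, population size, mutation rate and value range of 1 to 6 days are all illustrative assumptions.

```python
import random

# Hypothetical "true" parameters, used only to score candidates in this sketch.
TARGET = (2, 3, 1, 4, 3)

def fitness(candidate):
    """Stand-in for 1/Area: higher when the candidate is closer to TARGET."""
    error = sum(abs(c - t) for c, t in zip(candidate, TARGET))
    return 1.0 / (1.0 + error)

def mate(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]

def mutate(candidate, rng, rate=0.2):
    """Replace each gene with a random value with probability `rate`."""
    return tuple(rng.randint(1, 6) if rng.random() < rate else g for g in candidate)

def evolve(pop_size=20, generations=200, seed=0):
    rng = random.Random(seed)
    # 1. Create an initial population of candidates.
    population = [tuple(rng.randint(1, 6) for _ in range(5)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == 1.0:  # 4. stop once a candidate matches exactly
            break
        survivors = population[: pop_size // 2]  # 3. discard the worst half
        children = [  # 2. generate new candidates by mating and mutation
            mutate(mate(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.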
GENETIC ALGORITHM - PROCESS
(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)
(2,3,4,6,5) (3,4,5,2,6)
(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)
(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)
(5,3,2,6,5)
(3,4,4,6,2)
(5,3,2,6,5)
(3,4,4,6,2)
RESULTS – IMPROVED SURVEILLANCE
Q&A
THANK YOU!
Taha Kass-Hout, MD, MS
http://www.instedd.org
[email protected]
http://taha.instedd.org
Nicolás di Tada
http://[email protected]
http://weblogs.manas.com.ar/ndt/
BACKUP SLIDES
REFERENCES
• Izadi, M. and Buckeridge, D., Decision Theoretic Analysis of Improving Epidemic Detection, AMIA 2007, Symposium Proceedings 2007
• EpiNorth-based material (http://www.epinorth.org):
  o Mereckiene, J., Outbreak Investigation Operational Aspects. Jurmala, Latvia, 2006
  o Bagdonaite, J., and Mereckiene, J., Outbreak Investigation Methodological Aspects. Jurmala, Latvia, 2006
• Epidemic Intelligence: Signals from surveillance systems, Anne Mazick, Statens Serum Institut, Denmark, EpiTrain III, Jurmala, August 2006
• Daniel Neil, Incorporating Learning into Disease Surveillance Systems
REFERENCES
Algorithms
• Complex Event Processing Over Uncertain Data in Wasserkrug (2008)
• Outbreak detection through automated surveillance: A review of the determinants of detection in Buckeridge (2007)
• Approaches to the evaluation of outbreak detection methods in Watkins (2006)
• Algorithms for rapid outbreak detection: a research synthesis in Buckeridge (2004)
• Data mining in bioinformatics using Weka in Frank (2004)
REFERENCES
Automating Laboratory Reporting
• Automatic Electronic Laboratory-Based Reporting in Panackal (2002)
• Benefits and Barriers to Electronic Laboratory Results Reporting for Notifiable Diseases in Nguyen (2007)
Using EMR Data for Disease Surveillance
• Using Electronic Medical Records to Enhance Detection and Reporting of Vaccine Adverse Events in Hinrichsen (2007)
• Electronic Medical Record Support for PH in Klompas (2007)
• A knowledgebase to support notifiable disease surveillance in Doyle (2005)
• Automated Detection of Tuberculosis Using Electronic Medical Record Data in Calderwood (2007)
Misc Readings
• Breakthrough in modeling emerging disease hotspots in Jones (2008)
• Use of data mining techniques to investigate disease risk classification as a proxy for compromised biosecurity of cattle herds in Wales in Ortiz-Pelaez (2008)
RELATED PROJECTS
• InSTEDD RNA (or Event Evolution): Collaborative Analytics and Environment for Linking Early Health-Related Event Detection to an Effective Response (http://taha.instedd.org/2008/09/collaborative-analytics-and-environment.html)
• ALPACA: "ALPACA Light Parsing And Classifying Application (ALPACA) is a classifying tool designed for use in community-oriented software as well as in Academia. The application consists of two parts: a parsing tool for transforming raw documents into readable data, and a classifying tool for categorizing documents into user-provided classes. The application provides a user-friendly interface and a Plug-in functionality to provide a simple way to add more parsers/classifiers to the application." http://2008.hfoss.org/ALPACA
• Surveillance Project: An open source R-package disease surveillance framework for "...the development and the evaluation of outbreak detection algorithms in univariate and multivariate routine collected public health surveillance data." http://surveillance.r-forge.r-project.org/
• Weka: An open source "...collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes." http://www.cs.waikato.ac.nz/~ml/weka/