Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

Taha Kass-Hout, MD, MS Nicolás di Tada October 2008 MACHINE LEARNING AND DISEASE SURVEILLANCE

description

The majority of the designs, analyses and evaluations of early detection (or biosurveillance) systems have been geared towards specific data sources and detection algorithms. Much less effort has been focused on how these systems will "interact" with humans. For example, consider multiple domain experts working at different levels across different organizations in an environment where numerous biosurveillance algorithms may provide contradictory interpretations of ongoing events. We present a framework consisting of a collection of autonomous, machine learning-enabled analytic processes, services and tools that, for the first time, will seamlessly integrate surveillance and response systems with human experts.

Transcript of Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

Page 1: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

Taha Kass-Hout, MD, MS
Nicolás di Tada
October 2008

MACHINE LEARNING AND DISEASE SURVEILLANCE

Page 2: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

Image source: http://www.birds.cornell.edu/crows/images/deadcrow.jpg
Image source: http://farm3.static.flickr.com/2029/2239605500_6ef2fd2295.jpg?v=0

Page 3: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

LATE DETECTION – RESPONSE

[Figure: chart with axis labels DAY and CASES; annotation: Opportunity for control]

Page 4: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

EARLY DETECTION AND RESPONSE

[Figure: chart with axis labels DAY and CASES; annotation: Opportunity for control]

Page 5: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

INFORMATION SOURCES

Event-based – ad-hoc unstructured reports issued by formal or informal sources

Indicator-based – (number of cases, rates, proportion of strains…)

Page 6: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

PUBLIC HEALTH MEASURES

Representativeness

Completeness

Predictive Value

Timeliness

Page 7: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

PUBLIC HEALTH MEASURES

1000 Malaria infections (100%)

50 Malaria notifications (5%)

Get as close to the bottom of the pyramid as possible

Urge frequent reporting: weekly, daily, immediately

Specificity / Reliability

Sensitivity / Timeliness

• Main attributes
  o Representativeness
  o Completeness
  o Predictive value positive

Page 8: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

Analyze and interpret

Signal as early as possible

Automated analysis/thresholds

Time

• Main attributes
  o Timeliness

PUBLIC HEALTH MEASURES

Health care hotline

Page 9: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

THE PROBLEM SPACE

The design, analysis and evaluation of current systems have been geared towards specific data sources and detection algorithms – not humans

We have systems in place for those threats we have been faced with before

Page 10: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

PUBLIC HEALTH – TWO PERSPECTIVES

Case management
• Individual cases of notifiable diseases
• Relationship networks (contact tracing)

Population surveillance
• Larger risk patterns

Page 11: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

CASE MANAGEMENT

Questions/problems:
• Is a case due to recent transmission?
• If so, does the case share any feature with other, recent cases?

Ways it's being done:
• Investigations/interviews
• Meeting with other investigators

Page 12: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

POPULATION SURVEILLANCE

Questions/problems:
• Are more cases happening than expected?
• Does an excess suggest ongoing transmission in a specific region?

Way it's being done:
• Semi-automated routine temporal and space-time statistical analysis

Page 13: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

WHY LOCATION MATTERS – CASE MANAGEMENT

If you are studying a case of a certain disease that was just declared

It is harder to picture the situation by looking at something like this..

Page 14: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

WHY LOCATION MATTERS – CASE MANAGEMENT

Page 15: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

WHY LOCATION MATTERS – CASE MANAGEMENT

Than by looking at this..

Page 16: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

WHY LOCATION MATTERS – CASE MANAGEMENT

Page 17: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

WHY LOCATION MATTERS – POP SURVEILLANCE

If you are studying the spatial distribution of a set of disease clusters

This would seem more difficult..

Page 18: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

WHY LOCATION MATTERS – POP SURVEILLANCE

Page 19: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

WHY LOCATION MATTERS – POP SURVEILLANCE

Than this..

Page 20: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

WHY LOCATION MATTERS – POP SURVEILLANCE

Page 21: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

MODERN DISEASE SURVEILLANCE

In the past two decades, much disease surveillance research has focused on developing analytical methods for automatically detecting anomalous patterns in data

Modern methods can achieve timely detection of anomalies by incorporating temporal, spatial, and multivariate information
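As a rough illustration of the temporal part of that statement, here is a minimal sketch of threshold-based anomaly detection on daily case counts. The function name, the 7-day baseline window, the threshold of 3 standard deviations, and the data are all assumptions made for this example; none of them come from the slides, and real surveillance systems generally use more robust methods.

```python
from statistics import mean, stdev

def flag_anomalies(counts, window=7, threshold=3.0):
    """Return indices of days whose count exceeds baseline mean + threshold * sd.

    The baseline for each day is the preceding `window` days (an assumed rule).
    """
    flagged = []
    for day in range(window, len(counts)):
        baseline = counts[day - window:day]
        mu = mean(baseline)
        # Floor the spread at 1.0 so a flat baseline does not flag tiny changes.
        sd = max(stdev(baseline), 1.0)
        if counts[day] > mu + threshold * sd:
            flagged.append(day)
    return flagged

daily_counts = [10, 12, 11, 9, 10, 13, 11, 12, 10, 40, 11]  # made-up data
print(flag_anomalies(daily_counts))  # [9]
```

Only the spike on day 9 is flagged; the day after is not, because the spike itself inflates the next baseline. That side effect is one reason a simple rule like this is only a starting point.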

Page 22: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

9/20, 15213, cough/cold, …
9/21, 15207, antifever, …
9/22, 15213, CC = cough, ...
1,000,000 more records…

Huge mass of data

Detection algorithm

“What are we supposed to do with this?”

Too many alerts

MODERN DISEASE SURVEILLANCE

Page 23: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

9/20, 15213, cough/cold, …
9/21, 15207, antifever, …
9/22, 15213, CC = cough, ...
1,000,000 more records…

Huge mass of data

Feedback loop

MODERN DISEASE SURVEILLANCE

Page 24: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

ADVANTAGES OF MACHINE LEARNING

P(malaria) = 22%
P(influenza) = 13%
P(other ILI) = 33%

Page 25: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

MACHINE LEARNING TECHNIQUES

Classifiers Clustering Bayesian Statistics Neural Networks Genetic Algorithms

Page 26: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

HOW TO REPRESENT A DOCUMENT?

“This morning I woke up with fever, I might have a flu.”

“I had a flu last month. […] I had a flu early this week.”

flu

fever
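One common way to represent the two example sentences is a term-count ("bag of words") vector over the terms "flu" and "fever". The sketch below assumes that simple representation; the vocabulary order, the tokenizer and the function name are choices made for illustration, not something the slides specify.

```python
import re
from collections import Counter

VOCABULARY = ["flu", "fever"]  # the two terms shown on the slide

def to_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc1 = "This morning I woke up with fever, I might have a flu."
doc2 = "I had a flu last month. [...] I had a flu early this week."

print(to_vector(doc1))  # [1, 1]
print(to_vector(doc2))  # [2, 0]
```

Each document becomes a point in a two-dimensional "flu"/"fever" space, which is the form that classifiers and clustering methods operate on.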

Page 27: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

CLASSIFIERS – PROBLEM DEFINITION

• Map items to vectors (feature extraction)
• Normalize those vectors
• Train the classifier
• Measure the results with new information
• Feed the results back to the classifier

Separate classes in feature space
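The steps above can be sketched end to end with a very small linear classifier. To stay dependency-free, this example swaps in a perceptron rather than the SVM discussed on the following slides. The feature meanings, the training data and the function names are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def train_perceptron(samples, labels, epochs=20, rate=1.0):
    """Learn weights and a bias separating labels +1 and -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified: nudge the separator
                weights = [w + rate * y * xi for w, xi in zip(weights, x)]
                bias += rate * y
    return weights, bias

def predict(model, vector):
    """Classify a raw (unnormalized) feature vector as +1 or -1."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, normalize(vector))) + bias
    return 1 if score > 0 else -1

# Hypothetical features: [symptom-term count, unrelated-term count].
raw = [[3, 0], [2, 1], [4, 1], [0, 3], [1, 4], [0, 2]]
labels = [1, 1, 1, -1, -1, -1]

model = train_perceptron([normalize(v) for v in raw], labels)

# "Measure the results with new information": items not seen during training.
print(predict(model, [5, 1]), predict(model, [1, 6]))
```

Misclassified new items would be added to the training set with their correct labels and the classifier retrained, which is the feedback step in the list.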

Page 28: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

CLASSIFIERS - SVM

Page 29: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

SVM – MARGIN MAXIMIZATION

Support vectors define the separator

Page 30: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

SVM – NON LINEAR?

Φ: x → φ(x)

Map to higher-dimension space
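A minimal sketch of why mapping to a higher-dimensional space helps, assuming one-dimensional points whose positive class sits at both extremes. The mapping φ(x) = (x, x²), the data and the hand-picked threshold are illustrative assumptions. No SVM is trained here, and practical SVMs usually apply such mappings implicitly through a kernel function.

```python
def phi(x):
    """Map a 1-D point into 2-D: x -> (x, x^2)."""
    return (x, x * x)

# No single cut on the x axis separates these classes...
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

def classify(x, threshold=1.0):
    """Linear rule in the mapped space: the sign of (x^2 - threshold)."""
    _, squared = phi(x)
    return 1 if squared - threshold > 0 else -1

# ...but a straight line in the (x, x^2) plane does.
predictions = [classify(x) for x in points]
print(predictions == labels)  # True
```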

Page 31: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

SVM – FILTERING OR CLASSIFYING

[Diagram labels: Training Documents; Classifier; Document 1, Document 2, Document 3; Positives; Negatives]

Page 32: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

CLUSTERING – PROBLEM DEFINITION

• Map items to vectors (feature extraction)
• Normalization
• Agglomerative and partitional
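As one example of the partitional family, here is a minimal k-means sketch. The slides do not name a specific algorithm, so k-means is an assumption. So are the data, the fixed iteration count, and the choice to seed the centroids with the first k points, which keeps the example deterministic but is not a recommended initialization.

```python
import math

def kmeans(points, k, iterations=10):
    """Partition points into k clusters; returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive, deterministic seeding
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Made-up feature vectors forming two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

Agglomerative clustering works in the opposite direction: it starts with every item in its own cluster and repeatedly merges the closest pair.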

Page 33: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

CLUSTERING - AGGLOMERATIVE

Page 34: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

CLUSTERING - PARTITIONAL

Page 35: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

BAYESIAN STATISTICS

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

P(A | B): Probability of disease A (flu) once symptoms B (fever) are observed

P(B | A): Probability of fever once flu is confirmed

P(A): Probability of flu (prior or marginal)

P(B): Probability of fever (prior or marginal)
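A short worked example of the rule, using made-up probabilities chosen only to make the arithmetic easy to follow. They are not epidemiological estimates, and the function name is hypothetical.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

p_fever_given_flu = 0.90  # hypothetical: most flu cases present with fever
p_flu = 0.05              # hypothetical prior probability of flu
p_fever = 0.10            # hypothetical overall probability of fever

p_flu_given_fever = posterior(p_fever_given_flu, p_flu, p_fever)
print(f"P(flu | fever) = {p_flu_given_fever:.2f}")  # P(flu | fever) = 0.45
```

With these numbers, observing fever raises the probability of flu from the 5% prior to 45%.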

Page 36: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

NEURAL NETWORKS

Given a set of stimuli, train a system to produce a given output

Page 37: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

NEURAL NETWORKS - STRUCTURE

Input Layer: {I0, I1, …, In}

Hidden Layer: […]

Output Layer: {O0, O1, …, On}

Weight (label on the connections between layers)

$$H_n = \sum_{i=0}^{I} I_i \cdot w_{in}$$
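The weighted sum for a hidden unit can be sketched as a small forward pass. The layer sizes, weight values and function names below are hypothetical. The sigmoid activation is also an assumption: the slide shows only the weighted sum, while most networks pass that sum through a nonlinearity.

```python
import math

def layer_output(inputs, weights):
    """Weighted sums for one layer: H_n = sum over i of inputs[i] * weights[i][n]."""
    n_units = len(weights[0])
    return [
        sum(inputs[i] * weights[i][n] for i in range(len(inputs)))
        for n in range(n_units)
    ]

def sigmoid(x):
    """Assumed activation function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

inputs = [1.0, 0.5, -1.0]                          # I0, I1, I2 (made up)
w_hidden = [[0.2, -0.4], [0.6, 0.1], [-0.5, 0.3]]  # 3 inputs -> 2 hidden units
w_output = [[1.0], [-1.0]]                         # 2 hidden units -> 1 output

hidden_sums = layer_output(inputs, w_hidden)  # approximately [1.0, -0.65]
hidden = [sigmoid(h) for h in hidden_sums]
output = layer_output(hidden, w_output)
print(hidden_sums, output)
```

Training consists of adjusting the weights until the outputs match the desired outputs for the training stimuli. That step is not shown here.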

Page 38: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

NEURAL NETWORK - APPLICATION

Event?

Event?

Page 39: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

GENETIC ALGORITHM - BASICS

• Define the model that you want to optimize
• Create the fitness function
• Evolve the gene pool, testing against the fitness function
• Select the best individual

Page 40: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

GENETIC ALGORITHM – MODEL

Model the transmission process using a set of parameters:
• Onset time between an infection and illness
• Latency period
• Incubation period
• Symptomatic period
• Infectious period

(Onset, Latency, Incubation, Symptomatic, Infectious)

(2 days, 3 days, 1 day, 4 days, 3 days)

Page 41: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

GENETIC ALGORITHM – MODEL FITNESS

Fitness = 1/Area

Page 42: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

GENETIC ALGORITHM – PROCESS

1. Create an initial population of candidates
2. Use operators to generate new candidates (mating and mutation)
3. Discard worst individuals or select best individuals in generation
4. Repeat from 2 until you find a candidate that satisfies the solution searched
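The four steps can be sketched as follows for five-parameter candidates like those in the model. The fitness function is a stand-in: it rewards closeness to a fixed target tuple, because the "Area" used in the slides' fitness measure is not defined in this transcript. The gene range of 1 to 6 days, the mutation rate, the population size and all function names are assumptions.

```python
import random

# Stand-in optimum: the example parameter tuple from the model slide.
TARGET = (2, 3, 1, 4, 3)

def fitness(candidate):
    """Hypothetical fitness: higher when the candidate is closer to TARGET."""
    distance = sum(abs(c - t) for c, t in zip(candidate, TARGET))
    return 1.0 / (1.0 + distance)

def mate(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]

def mutate(candidate, rng, rate=0.2):
    """Replace each gene with a random value with probability `rate`."""
    return tuple(
        rng.randint(1, 6) if rng.random() < rate else gene
        for gene in candidate
    )

def evolve(generations=50, size=20, seed=0):
    rng = random.Random(seed)
    # Step 1: create an initial population of candidates.
    population = [tuple(rng.randint(1, 6) for _ in TARGET) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == 1.0:  # Step 4: stop once satisfied.
            break
        # Step 3: discard the worst half, keep the best individuals.
        survivors = population[: size // 2]
        # Step 2: generate new candidates by mating and mutation.
        children = [
            mutate(mate(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

Because the best individuals always survive unchanged, the best fitness never decreases from one generation to the next. Reaching the target within the generation limit is not guaranteed.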

Page 43: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

(4,5,6,3,5) (4,3,6,2,5)

GENETIC ALGORITHM - PROCESS

(5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2)

(2,3,4,6,5) (3,4,5,2,6)

(3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6)

(4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)

(5,3,2,6,5)

(3,4,4,6,2)


Page 44: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

RESULTS – IMPROVED SURVEILLANCE

Page 45: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

Q&A

Page 46: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

THANK YOU!

Taha Kass-Hout, MD, MS
http://www.instedd.org
[email protected]
http://taha.instedd.org

Nicolás di Tada
[email protected]
http://weblogs.manas.com.ar/ndt/

Page 47: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

BACKUP SLIDES


Page 51: Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

RELATED PROJECTS

InSTEDD RNA (or Event Evolution): Collaborative Analytics and Environment for Linking Early Health-Related Event Detection to an Effective Response
http://taha.instedd.org/2008/09/collaborative-analytics-and-environment.html

ALPACA: "ALPACA Light Parsing And Classifying Application (ALPACA) is a classifying tool designed for use in community-oriented software as well as in Academia. The application consists of two parts: a parsing tool for transforming raw documents into readable data, and a classifying tool for categorizing documents into user-provided classes. The application provides a user-friendly interface and a Plug-in functionality to provide a simple way to add more parsers/classifiers to the application."
http://2008.hfoss.org/ALPACA

Surveillance Project: An open source R-package disease surveillance framework for "...the development and the evaluation of outbreak detection algorithms in univariate and multivariate routine collected public health surveillance data."
http://surveillance.r-forge.r-project.org/

Weka: An open source "...collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes."
http://www.cs.waikato.ac.nz/~ml/weka/
