IEEE CBMS’06, DM Track, Salt Lake City, Utah, 22.06.06
“Dynamic Integration of Classifiers for Handling Concept Drift” by A. Tsymbal, M. Pechenizkiy, P. Cunningham and S. Puuronen
Dynamic Integration of Classifiers for Handling Concept Drift
Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland
Seppo Puuronen, Dept. of CS and IS, University of Jyväskylä, Finland
Mykola Pechenizkiy, Dept. of Mathematical IT, University of Jyväskylä, Finland
Pádraig Cunningham, Department of Computer Science, Trinity College Dublin, Ireland
IEEE CBMS’06: DM Track, Salt Lake City, Utah, USA, June 21-23, 2006
Outline
Introduction
– Supervised Learning
– The Problem of Concept Drift (CD)
Approaches to Handle CD:
– Instance selection; instance weighting; and ensemble learning
Dynamic Integration of Classifiers for Handling CD
– Dynamic Selection, Dynamic Integration, and their mix
Domain of Antibiotic resistance
– How resistance occurs, concept drift context
Experiments design
– C4.5 ensembles with static and dynamic integration
Results and Conclusion
The task of classification
J classes, n training observations, p features
Given n training instances (xi, yi), where xi are the values of the attributes and yi is the class
Goal: given a new instance x0, predict its class y0
[Diagram: Training Set and new instance to be classified → CLASSIFICATION → class membership of the new instance]
Examples:
- diagnosis of thyroid diseases;
- heart attack prediction, etc.
The Task of Classification
Predicting Antibiotic Resistance
– predict the sensitivity of a pathogen to an antibiotic based on data about the antibiotic, the isolated pathogen, and the demographic and clinical features of the patient.
The Problem of Concept Drift
Changes in the hidden context can induce more or less radical (gradual or abrupt) changes in the target concept
– A typical example is antibiotic resistance:
• pathogen sensitivity may change over time as new pathogen strains develop resistance to antibiotics that were previously effective
– Even in the most strictly controlled environments some unexpected changes may happen due to:
• failure and replacement of some medical equipment, or
• changes in personnel, making it necessary to change the model.
– The need to change the current model due to a change in the data distribution is called virtual concept drift
An effective learner should be able to track such changes and to quickly adapt to them.
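The definition of virtual concept drift above can be stated in probabilistic terms. The following is a sketch in notation that does not appear on the slides: $p_t(x, y)$ is assumed to denote the joint distribution of the feature vector $x$ and the class $y$ at time $t$, and "real concept drift" is the term commonly used in the literature for a change in the target concept itself.

```latex
\begin{align*}
p_t(x, y) &= p_t(x)\, p_t(y \mid x) \\
\text{real concept drift:} \quad & p_{t+1}(y \mid x) \neq p_t(y \mid x) \ \text{for some } x \\
\text{virtual concept drift:} \quad & p_{t+1}(x) \neq p_t(x) \ \text{while } p_{t+1}(y \mid x) = p_t(y \mid x)
\end{align*}
```

Under this reading, either kind of change can make the current model inadequate, and the two may well occur together.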
Approaches to Handle Concept Drift
instance selection:
– select instances relevant to the current concept;
– generalize from a moving window and use the learnt concepts for prediction only in the immediate future;
– case-base editing strategies in CBR that delete noisy, irrelevant and redundant cases;
instance weighting:
– weighting according to “age” and competence with respect to the current concept;
– weighting techniques handle CD worse than analogous instance selection techniques (due to overfitting the data);
ensemble learning:
– maintains a set of concept descriptions, the predictions of which are combined using e.g. a form of voting;
– divides the data into sequential blocks of fixed size and builds an ensemble on them.
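The first two approaches can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `SlidingWindow` and `age_weights`, the fixed window size, and the exponential decay used as the "age" weighting are not taken from the slides.

```python
from collections import deque
from typing import Deque, List, Tuple

Instance = Tuple[List[float], str]  # (feature vector, class label)


class SlidingWindow:
    """Instance selection: keep only the most recent `size` instances."""

    def __init__(self, size: int) -> None:
        self.items: Deque[Instance] = deque(maxlen=size)

    def add(self, instance: Instance) -> None:
        # When the window is full, the oldest instance is dropped automatically.
        self.items.append(instance)

    def training_set(self) -> List[Instance]:
        return list(self.items)


def age_weights(n: int, decay: float = 0.9) -> List[float]:
    """Instance weighting: weight = decay ** age, where the newest instance has age 0.

    Weights are returned in arrival order (oldest first). The exponential
    form is an assumed example of weighting by age.
    """
    return [decay ** (n - 1 - i) for i in range(n)]
```

With a window of size 3, only the last three instances of a stream would be used for training, whereas `age_weights` keeps every instance but lets older ones count for less.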
Handling Concept Drift with Ensembles
An ensemble is constructed as a set of concept descriptions corresponding to different time intervals:
[Diagram: data blocks along a time axis; training set for next base classifier]
Usually simple voting is used for model combination
– does not work in complex domains with local concept drift
Our basic idea: use local accuracies for model combination in order to handle local concept drift
– adapts to concept drift better (e.g. with antibiotic resistance data)
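The block-based construction and the two static combination rules can be sketched as follows. This is an illustration under stated assumptions, not the setup of the talk: `train_majority` is a trivial stand-in for a real base learner such as C4.5, and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
Instance = Tuple[Vector, str]          # (feature vector, class label)
Classifier = Callable[[Vector], str]


def split_into_blocks(stream: Sequence[Instance], block_size: int) -> List[List[Instance]]:
    """Divide the data stream into sequential blocks of fixed size."""
    return [list(stream[i:i + block_size]) for i in range(0, len(stream), block_size)]


def train_majority(block: Sequence[Instance]) -> Classifier:
    """Stand-in base learner (assumption): always predicts the block's most frequent class."""
    label = Counter(y for _, y in block).most_common(1)[0][0]
    return lambda x: label


def simple_vote(ensemble: Sequence[Classifier], x: Vector) -> str:
    """Static integration: every base classifier has one equal vote."""
    return Counter(clf(x) for clf in ensemble).most_common(1)[0][0]


def weighted_vote(ensemble: Sequence[Classifier], weights: Sequence[float], x: Vector) -> str:
    """Static integration: each vote counts with a global weight, e.g. accuracy on the latest block."""
    scores: Dict[str, float] = {}
    for clf, w in zip(ensemble, weights):
        label = clf(x)
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=lambda label: scores[label])
```

Both rules use the same weights for every test instance, which is why they cannot react to drift that affects only part of the instance space.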
Local Concept Drift
In the real world, concept drift may often be local:
– changes in the concept or data distribution occur in some regions of the instance space only,
• only particular bacteria may develop resistance to certain antibiotics, while resistance to the others could remain the same.
– the type and severity of changes may depend on the location in the instance space.
Local CD: changes in concept and data distribution occurring at an instance level rather than at a data set level.
– Local CD occurs between two consecutive time points
• if there is a sub-space of the whole instance space whose changes of concept and/or data distribution differ from those in the rest of the data.
– This is reflected by a different change in the (local) predictive performance of the currently used model in this sub-space.
Stability of Regions: Rotating Hyperplane
Base models of an ensemble should not be discarded if their global accuracy on the current block of data falls but they are still good experts in the stable parts of the data.
One solution to this problem is the use of DIC (dynamic integration of classifiers):
- the models are integrated at an instance level according to their local accuracies.
[Figure: Stability of regions in the rotating hyperplane problem, shown at time points t1, t2, t3, t4]
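Integration by local accuracy can be sketched in Python as below. This is a hedged illustration, not the authors' implementation: the function names, the Euclidean distance, the `1 / (1 + distance)` neighbour weighting and the midpoint threshold used for the selection-plus-voting mix are all assumptions. Only the general idea (estimate each model's accuracy near the test instance, then select or vote accordingly) comes from the slides.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Instance = Tuple[Vector, str]          # (feature vector, class label)
Classifier = Callable[[Vector], str]


def local_accuracies(ensemble: Sequence[Classifier], validation: Sequence[Instance],
                     x: Vector, k: int) -> List[float]:
    """Distance-weighted accuracy of each classifier on the k validation instances nearest to x."""
    neighbours = sorted(validation, key=lambda inst: math.dist(inst[0], x))[:k]
    # Assumed weighting: closer neighbours influence the estimate more.
    weights = [1.0 / (1.0 + math.dist(xi, x)) for xi, _ in neighbours]
    total = sum(weights)
    return [
        sum(w for w, (xi, yi) in zip(weights, neighbours) if clf(xi) == yi) / total
        for clf in ensemble
    ]


def dynamic_selection(ensemble: Sequence[Classifier], accs: Sequence[float], x: Vector) -> str:
    """DS: use only the classifier with the highest local accuracy."""
    best = max(range(len(ensemble)), key=lambda i: accs[i])
    return ensemble[best](x)


def dynamic_voting(ensemble: Sequence[Classifier], accs: Sequence[float], x: Vector) -> str:
    """DV: weighted voting with local accuracies as weights."""
    scores: Dict[str, float] = {}
    for clf, a in zip(ensemble, accs):
        label = clf(x)
        scores[label] = scores.get(label, 0.0) + a
    return max(scores, key=lambda label: scores[label])


def dynamic_voting_with_selection(ensemble: Sequence[Classifier], accs: Sequence[float],
                                  x: Vector) -> str:
    """DVS (assumed form): drop locally weak classifiers, then apply DV to the rest."""
    threshold = (min(accs) + max(accs)) / 2.0
    kept = [(clf, a) for clf, a in zip(ensemble, accs) if a >= threshold]
    return dynamic_voting([c for c, _ in kept], [a for _, a in kept], x)
```

A model that has become globally poor can still win here for test instances that fall in a region where it remains accurate, which is the behaviour argued for above.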
Local Concept Drift
Most gradual CDs may be considered local, if:
– the velocity of changes is small relative to the arriving instances in the data stream;
– most regions of the data remain stable.
Most abrupt CDs are
– not local unless substantial sub-areas remain stable between the two changing concepts.
– local, if the change relates to a subgroup of the whole population.
CD may also be complex: the concept or data distribution changes differently (potentially also in different ways!) in different clusters
– changes in antibiotic resistance (AR) and data distribution are usually different for different bacteria in the AR problem.
Local CD occurs at an instance level, so its treatment should be at that level as well!
Potential approaches to handle local CD:
– CBR: a case base is updated at an instance level;
– a hybrid of ensemble learning and instance selection;
– ensemble integration based on local accuracies.
How Antibiotic Resistance Happens
In spontaneous DNA mutation, bacterial DNA may mutate spontaneously. Drug-resistant tuberculosis arises this way.
In a form of microbial sex called transformation, one bacterium may take up DNA from another bacterium. Penicillin-resistant gonorrhea results from transformation.
Resistance can also be acquired from a small circle of DNA called a plasmid, which can flit from one type of bacterium to another.
– A single plasmid can provide a slew of different resistances.
– In 1968, 12,500 people in Guatemala died in an epidemic of Shigella diarrhea. The microbe harbored a plasmid carrying resistances to four antibiotics!
Data Collection & Organization
N.N. Burdenko Institute of Neurosurgery
Bacterial analyzer “Vitek-60” (by “bioMérieux”)
Information Systems: "Microbiologist" & "Microbe"
Each instance: one sensitivity test:
– the pathogen that is isolated during the bacterial identification analysis,
– the antibiotic that is used in the sensitivity test,
– the result of the sensitivity test itself (sensitive, resistant or intermediate), obtained from “Vitek” according to the guidelines of the NCCLS.
– The above information is connected with the patient, his or her demographic data (sex, age) and hospitalization in the Institute (main department, days spent in ICU, days spent in the hospital before the test, etc.).
4430 sensitivity tests corresponding to a single specimen (liquor), including the meningitis cases of the years 2002-2004.
Classification over Sequential Data Blocks
[Figure: accuracy for C4.5 ensembles; x-axis 1 to 27, y-axis 0.3 to 0.9; series: v, wv, ds, dv, dvs]
Weighted Average of Classification Accuracy
[Figure: min, aver and max accuracy of C4.5 ensembles for v, wv, ds, dv and dvs; y-axis 0.4 to 0.9]
Summary and Conclusions
In the real world concepts are often not stable but change with time, which is known as the problem of concept drift (CD).
Among the most popular and effective approaches to handling CD is ensemble learning:
– a set of concept descriptions built on data blocks corresponding to different time intervals is maintained, and
– the final prediction is the aggregated prediction of ensemble members.
We suggested a dynamic integration approach for ensembles (DIC) used in handling CD:
– integrates the base classifiers at an instance level, assigning to them weights proportional to their local accuracy on each instance considered.
We considered an example of CD from the area of antibiotic resistance.
We demonstrated that DIC often results in better accuracy with the considered data set than the more commonly used weighted voting:
– this supports our hypothesis that favors DIC for handling CD, especially in the presence of local CD.
Contact Info
Mykola Pechenizkiy
Department of Mathematical Information Technology,
University of Jyväskylä, FINLAND
E-mail: [email protected]
http://www.cs.jyu.fi/~mpechen
THANK YOU!
MS Power Point slides of this and other recent talks and full texts of selected publications are available online at: http://www.cs.jyu.fi/~mpechen
Additional Slides …
Antibiotic Resistance in Nosocomial Infections
3-40% of patients admitted to hospital acquire an infection during their stay, and the risk of hospital-acquired infection, or nosocomial infection, has risen steadily in recent decades.
The frequency depends mostly on the type of operation conducted, being greater for “dirty” operations (10-40%) and smaller for “pure” operations (3-7%). For example, such a serious infectious complication as postoperative meningitis is often the result of nosocomial infection.
Antibiotics are the drugs that are commonly used to fight infections caused by bacteria.
According to Centers for Disease Control and Prevention (CDC) statistics, more than 70% of the bacteria that cause hospital-acquired infections are resistant to at least one of the antibiotics most commonly used to treat infections.
Analysis of the microbiological data included in antibiograms collected in different institutions over different periods of time is considered one of the most important activities to restrain the spread of antibiotic resistance and to avoid the negative consequences of this phenomenon.
Antibiotic sensitivity of different bacteria
Comparing the antibiotic sensitivity of different bacteria
© Jim Deacon, Institute of Cell and Molecular Biology, The University of Edinburgh
The emergence of antibiotic resistance
Effects of different antibiotics on growth of a Bacillus strain. The right-hand image shows a close-up of the novobiocin disk (marked by an arrow on the whole plate). In this case some individual mutant cells in the bacterial population were resistant to the antibiotic and have given rise to small colonies in the zone of inhibition.
© Jim Deacon, Institute of Cell and Molecular Biology, The University of Edinburgh
How Antibiotic Resistance Happens
Horizontal Gene Transfer (© Grace Yim and Fan Sozzi)
Mechanisms of Antibiotic Resistance
© Grace Yim and Fan Sozzi
Mechanisms of Antibiotic Resistance
Antibiotic | Method of resistance
Chloramphenicol | reduced uptake into cell
Tetracycline | active efflux from the cell
β-lactams, Erythromycin, Lincomycin | eliminates or reduces binding of antibiotic to target
β-lactams, Erythromycin | hydrolysis
Aminoglycosides, Chloramphenicol, Fosfomycin, Lincomycin | inactivation of antibiotic by enzymatic modification
β-lactams, Fusidic Acid | sequestering of the antibiotic by protein binding
Sulfonamides, Trimethoprim | metabolic bypass of inhibited reaction
Sulfonamides, Trimethoprim | overproduction of antibiotic target (titration)
Bleomycin | binding of specific immunity protein to antibiotic
Dataset Characteristics
Patient and hospitalization related
Sex | {Male, Female}
Age | Integer
Recurring stay | {True,False}
Days of stay in NSI | Integer
Days of stay in ICU | Integer
Days of stay in NSI before specimen was received | Integer
Bacterium is isolated when patient is in ICU | {True,False}
Main department | {1,…,10}
Department of stay (departments + ICU) | {1,…,11}
Pathogen and pathogen groups
Pathogen name | {Pat_name1, …, Pat_name17}
Gram(+/-) | {True,False}
Staphylococcus | {True,False}
Enterococcus | {True,False}
Enterobacteria | {True,False}
Nonfermenters | {True,False}
Antibiotic and antibiotic groups
Antibiotic name | {Ant_name1, …, Ant_name39}
Group1 | {True,False}
… | …
Group15 | {True,False}
sensitivity | {Sensitive, Intermediate, Resistant}
Experiment design
In Naïve Bayes, a normal distribution was assumed for numeric features, and the Laplace correction with a multiplicative factor of 1 was used in probability estimation for categorical features.
C4.5 decision trees were built using 0.25 as the confidence factor for pruning and 2 as the minimum number of instances per leaf.
With all ensembles considered here we use the simple so-called “replace the loser” ensemble pruning strategy.
– if the ensemble size is greater than or equal to 25, the worst classifier, according to the current validation estimates, is replaced with a new one trained on the most recent data.
We experimented with 5 different neighbourhood sizes k: 7, 15, 31, 63, and 127.
– Naturally, accuracy usually decreases as the neighbourhood size increases, becoming closer to that of static voting.
– Our experiments demonstrated that DIC was not very sensitive to the size of the neighbourhood.
– A reason for that is the locally weighted learning scheme used, with which the more distant an instance is from the current test instance, the less influence it has on the prediction of local performance.
– However, the smaller neighbourhoods (7 and 15) sometimes result in noisy performance estimates and inferior accuracies (especially with DS).
– We continue our analysis of experimental results focusing on the neighbourhood size of 31, as it usually gives the best improvement due to DIC in the problems considered.
WEKA3 environment: Data Mining Software in Java:
– http://www.cs.waikato.ac.nz/ml/weka/
– Default settings were used in the WEKA learning algorithms used in our experiments.
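The “replace the loser” pruning strategy described on this slide can be sketched in a few lines of Python. This is an illustration only: the experiments themselves used WEKA (Java), the function names are hypothetical, and plain accuracy on a validation set is assumed as the "validation estimate". The size limit of 25 is the one stated above.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Instance = Tuple[Vector, str]          # (feature vector, class label)
Classifier = Callable[[Vector], str]

MAX_ENSEMBLE_SIZE = 25  # size limit stated on the slide


def accuracy(clf: Classifier, validation: Sequence[Instance]) -> float:
    """Fraction of validation instances classified correctly (assumed validation estimate)."""
    if not validation:
        return 0.0
    return sum(1 for x, y in validation if clf(x) == y) / len(validation)


def replace_the_loser(ensemble: Sequence[Classifier], new_clf: Classifier,
                      validation: Sequence[Instance],
                      max_size: int = MAX_ENSEMBLE_SIZE) -> List[Classifier]:
    """Return the updated ensemble after a new base classifier has been trained.

    While there is room, the new classifier is simply added. Once the ensemble
    has reached `max_size`, the classifier with the lowest validation accuracy
    is replaced by the new one.
    """
    updated = list(ensemble)
    if len(updated) < max_size:
        updated.append(new_clf)
        return updated
    worst = min(range(len(updated)), key=lambda i: accuracy(updated[i], validation))
    updated[worst] = new_clf
    return updated
```

Calling `replace_the_loser` once per new data block keeps the ensemble size bounded while retaining the members that still perform well on recent data.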