An Effective Combination based on Class-Wise Expertise of Diverse Classifiers for Predictive Toxicology Data Mining
ADMA 2006, Xi’an, China
Dr. Daniel NEAGU, UK
Dr. Gongde GUO, Dept. of Computer Science, Fujian Normal University, China
Ms. Shanshan WANG, Dept. of Computer Science, Nanjing University of Aeronautics and Astronautics, China
Bradford, UK
Bradford, West Yorkshire
National Museum of Film and Television
School of Informatics, University of Bradford
Overview (1)
• Introduction to ML applications to KDD
• Proposal of Combination Operators
  • Model Construction and Classification Algorithms
• Model Library for Predictive Toxicology
  • Collection of datasets
  • Central store for models and results
  • Formal structure to speed access and improve organisation; reduce 'misplaced' files
• Remote Access
  • Secure access to data from remote locations possible in the future
Overview (2)
• Comparative Studies
  • Results from UoB Model Library
  • Study of different Machine Learning techniques
  • Variety of Feature Selection techniques
  • Many datasets and endpoints
  • Large variation in accuracy of created models
  • One aim is to automatically build ensembles based on best class-wise models
• Results and Conclusions
Current Context
Nowadays more scientific data is generated and flows within systems:
• Manpower/laboratories
• Techniques and computational power (Moore's Law)
• Funds/legislation
More data is stored and available:
• Storage technology faster and cheaper (Storage Law)
• DBMS capable of handling bigger DBs
• Web/online access to distributed data
Consequences:
• Human expert is overloaded: very little data is checked
• Knowledge Discovery is NEEDED for data understanding and use
[Figure labels: Hardware; SW (Algorithms); Data collection/management]
General definitions
Data is defined as facts regarding things (such as people, objects, events) which can be digitally transmitted or processed.
Information is generally defined as data that have been processed and presented in a form suitable for human interpretation, with the purpose of revealing meaning (such as patterns or rules).
Models are representations of patterns.
Knowledge: the theoretical and practical comprehension of a certain domain that supports making decisions.
Intelligence: the capability of learning, understanding and finding solutions for problems in a specific domain.
• 1234567.89 is data.
• "Your bank balance has jumped 80.87% to £1234567.89" is information.
• "Nobody owes me that much money" is knowledge.
• "I'd better talk to the bank before I spend it, because of what has happened to other people" is intelligence.
http://foldoc.doc.ic.ac.uk
Knowledge Discovery in Databases (KDD)
[Figure: the KDD process, with stages Data sources → Select/preprocess → Transform → Data mining → Interpret/Evaluate/Assimilate → Knowledge. Other labels: Data preparation, Feature Selection, Models, Extracted information.]
KDD is the nontrivial process of identifying valid, novel, potentially useful and, ultimately, understandable patterns in data.
It involves the following steps:
• understanding the application domain and definition of the goals
• selecting the target data set
• data cleaning and pre-processing
• data reduction and projection
• choosing the function of data modelling and the algorithm
• data mining
• interpretation, evaluation and utilization of the discovered knowledge
Predictive Data Mining
The processes of data classification/ regression having the goal to obtain predictive models for a specific target, based on predictive relationships among large number of input variables.
Classification identifies characteristics of data and identifies a data item as member of one of several predefined categorical classes.
Regression uses the existing numerical data values and maps them to a real valued prediction (target) variable.
Machine Learning Applications in Data Mining Dynamics (ISI Thomson Web of Knowledge)
References to Machine Learning techniques with applications in Predictive Data Mining:
[Figure: chart of yearly reference counts, 1995 to 2004 (vertical axis 0 to 3000), with one series each for ANNs, GAs, ILP, RI, DTs and k-NN.]
[Figure: share of references by technique: ANNs 55%, GAs 30%, DTs 10%, k-NN 3%, RI 1%, ILP 1%.]
Multi-Classifier Systems
Different classifiers potentially offer complementary or at least additional information about patterns to be classified
Various approaches to classifier combinations:
• Majority voting [4]
• Entropy-based combination [5]
• Dempster-Shafer theory-based combination [6], [7]
• Bayesian classifier combination [8]
• Similarity-based classifier combination [9]
• Fuzzy inference [10]
• Gating networks [11]
• Statistical models [2]
We propose a hybrid classifier combination scheme that makes use of the class-wise expertise of diverse classifiers (a priori knowledge obtained from the training set) to achieve potentially better performance.
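The Proposed Effective Combination Scheme
Two operators are proposed:
$$M_j = \arg\max_{i} \left\{ \frac{TP_{ij}}{TP_{ij} + FP_{ij}} \;\middle|\; i = 1, 2, \ldots, m \right\}, \quad j = 1, 2, \ldots, L$$
$$M = \arg\max_{i} \left\{ CA_i \mid i = 1, 2, \ldots, m \right\}$$
Here i indexes the m classifiers and j the L classes; TP_ij and FP_ij are the true and false positives of classifier i on class j, and CA_i is the classification accuracy of classifier i.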
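The two operators can be illustrated with a minimal Python sketch. It assumes the first operator picks, for each class, the classifier with the highest TP/(TP+FP) on the training set, and the second picks the classifier with the highest overall accuracy (CA). The function names, the classifier names A1 to A3 and the toy labels are hypothetical, not taken from the slides.

```python
def class_wise_precision(y_true, y_pred, label):
    """TP / (TP + FP) for one class; 0.0 if the class is never predicted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best_models(y_true, predictions, labels):
    """predictions maps a classifier name to its predicted labels on the training set.

    Returns (best classifier per class, best classifier overall).
    Ties are resolved in favour of the classifier listed first (an assumption).
    """
    names = list(predictions)
    best_per_class = {
        c: max(names, key=lambda n: class_wise_precision(y_true, predictions[n], c))
        for c in labels
    }
    best_overall = max(names, key=lambda n: accuracy(y_true, predictions[n]))
    return best_per_class, best_overall


# Toy training labels and predictions (hypothetical data).
y_train = ["toxic", "toxic", "safe", "safe", "safe", "toxic"]
train_predictions = {
    "A1": ["toxic", "toxic", "toxic", "safe", "safe", "toxic"],
    "A2": ["toxic", "safe", "safe", "safe", "safe", "safe"],
    "A3": ["toxic", "toxic", "toxic", "toxic", "safe", "safe"],
}
per_class, overall = select_best_models(y_train, train_predictions, ["toxic", "safe"])
print(per_class, overall)
```

In this toy data A2 is the most precise classifier for "toxic" even though A1 has the higher overall accuracy, which is the situation a class-wise selection is meant to exploit.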
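Architecture of the Effective Multiple Classifier System
[Figure: Training data → Data pre-processing → classifiers A1, A2, A3, …, Am → Best Model for Class 1, Best Model for Class 2, …, Best Model for Class L, and Best Model for All Classes. A testing instance x is passed to the class-wise best models in turn: if x is classified as C1 by the Best Model for Class 1, that label is the Output; if not, x goes on to the Best Model for Class 2 (C2), and so on up to Class L (CL); otherwise the Best Model for All Classes produces the Output.]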
Model construction algorithm
Classification Algorithm
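The slide content for the two algorithms did not survive in this transcript. The sketch below is only one possible reading of the architecture figure: an instance x is offered to the class-wise best models in turn, the first model that assigns x to its own class decides, and otherwise the best model for all classes decides. The order in which the classes are tried, the function names and the toy threshold models are assumptions.

```python
def classify(x, best_per_class, best_overall):
    """Cascade classification.

    best_per_class: list of (class_label, model) pairs in the order they are tried;
    best_overall: fallback model. Every model is a callable x -> label.
    """
    for label, model in best_per_class:
        # Accept the decision only when the class expert predicts its own class.
        if model(x) == label:
            return label
    # No class expert claimed x: fall back to the best model for all classes.
    return best_overall(x)


# Hypothetical one-dimensional toy models standing in for trained classifiers.
def expert_c1(x):
    return "C1" if x < 1.0 else "C2"


def expert_c2(x):
    return "C2" if x > 3.0 else "C1"


def overall_model(x):
    return "C1" if x < 2.0 else "C2"


cascade = [("C1", expert_c1), ("C2", expert_c2)]
for value in (0.5, 4.0, 2.5, 1.5):
    print(value, classify(value, cascade, overall_model))
```

For 0.5 and 4.0 a class expert decides; for 2.5 and 1.5 neither expert claims the instance, so the fallback model gives the label.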
ML applications for Predictive Toxicology
The EC proposal for the REACH regulation indicates that the information requirements under REACH can be (partially) fulfilled by using scientifically valid (Q)SAR models.
To guide the validation of computer-based methods, five OECD principles for the validation of (Quantitative) Structure-Activity Relationships were adopted:
1. a defined endpoint
2. an unambiguous algorithm
3. a defined domain of applicability
4. appropriate measures of goodness-of-fit, robustness and predictivity
5. a mechanistic interpretation, if possible
Datasets (1)
DEMETRA*
1. LC50 96h Rainbow Trout acute toxicity (ppm): 282 compounds
2. EC50 48h Water Flea acute toxicity (ppm): 264 compounds
3. LD50 14d Oral Bobwhite Quail (mg/kg): 116 compounds
4. LC50 8d Dietary Bobwhite Quail (ppm): 123 compounds
5. LD50 48h Contact Honey Bee (μg/bee): 105 compounds
*http://www.demetra-tox.net
Datasets (2)
CSL APC* Datasets
• 5 endpoints
• A single endpoint/descriptor set used for our experiments:
  • Mallard Duck LD50 toxicity value
  • 60 organophosphates
  • 248 descriptors
*http://www.csl.gov.uk
Datasets (3)
TETRATOX*/LJMU** Dataset
• Tetrahymena pyriformis inhibition of growth (IGC50)
• Phenols data
• 250 phenolic compounds
• 187 descriptors
* http://www.vet.utk.edu/tetratox/
** http://www.ljmu.ac.uk
Descriptors
• Multiple descriptor types
• Various software packages to calculate 2D and 3D attributes*
* http://www.demetra-tox.net
Model Library
Algorithms chosen for their representativeness and diversity, and for easy, simple and fast access:
• Instance-based Learning algorithm (IBL)
• Decision Tree learning algorithm (DT)
• Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
• Multi-Layer Perceptrons (MLPs)
• Support Vector Machine (SVM)
Dimensionality
[Figure: four panels, Dataset One to Dataset Four, each with the axes Algorithms and Feature Selection; callouts: Model, Results file, Parameter file.]
Organisation
[Figure: directory hierarchy of the model library, by level:]
• Source: CSL, DEMETRA, TETRATOX/LJMU
• Endpoint/Descriptors: Oral Quail, Water Flea, Dietary Quail, Bee, Trout, APC Mallard_Duck, PHENOLS
• Feature Selection: GR, IG, ReliefF, SVM, CS, Chi, KNNMFS, CFS, Raw
• File Type: Feature Subsets, Models, Parameters, Results
• Files: Model 1, Model 2, Model 3, …, Model n
Comparison of performance of combination schemes on seven data sets
MCS:
• Majority Voting-based Combination (MVC)
• Maximal Probability-based Combination (MPC)
• Average Probability-based Combination (APC)
• Classifier Combination based on Dempster Rule of Combination (DRC)
• CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers)
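The three simplest baselines in this comparison can be sketched as follows. The slides do not give their exact definitions, so this assumes the common readings: MVC takes the class predicted by most classifiers, MPC takes the class holding the single highest probability across all classifiers, and APC takes the class with the highest mean probability. Function names and the toy probabilities are hypothetical; DRC and CSCEDC are not covered here.

```python
from collections import Counter


def mvc(probs):
    """Majority voting: each classifier votes for its most probable class."""
    votes = Counter(max(p, key=p.get) for p in probs)
    return votes.most_common(1)[0][0]


def mpc(probs):
    """Maximal probability: class with the single highest probability overall."""
    pairs = ((label, value) for p in probs for label, value in p.items())
    return max(pairs, key=lambda pair: pair[1])[0]


def apc(probs):
    """Average probability: class with the highest mean probability."""
    labels = probs[0].keys()
    return max(labels, key=lambda c: sum(p[c] for p in probs) / len(probs))


# Class-probability outputs of three hypothetical classifiers for one compound.
outputs = [
    {"toxic": 0.55, "non-toxic": 0.45},
    {"toxic": 0.60, "non-toxic": 0.40},
    {"toxic": 0.05, "non-toxic": 0.95},
]
print(mvc(outputs), mpc(outputs), apc(outputs))
```

On this toy input the schemes disagree: two weakly confident classifiers win the vote for "toxic", while one highly confident classifier dominates both probability-based schemes.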
Conclusions
The proposed combination scheme CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers):
• not only makes use of the expertise of the best individual classifiers but also removes their negative influences;
• therefore, the results presented previously show a significant improvement in global performance.
Acknowledgements
This work is part-funded by:
• EPSRC GR/T02508/01: Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach, http://pythia.inf.brad.ac.uk/
• EU FP5 Quality of Life DEMETRA QLRT-2001-00691: Development of Environmental Modules for Evaluation of Toxicity of pesticide Residues in Agriculture, http://www.demetra-tox.net
Special thanks also to:
• Dr. Q. Chaudhry (CSL York)
• Dr. Mark Cronin (LJMU)
and PhD students:
• Ms. Ladan Malazizi, BSc, PhD student. Research Theme: Development of Artificial Intelligence-based in-silico toxicity models for use in pesticide risk assessment
• Mr. Paul Trundle, BSc, PhD student. Research Theme: Hybrid Intelligent Systems applied to predict Pesticide Toxicity
• Ms. Areej Shhab, BEng, MPhil. Research Theme: Applications of Machine Learning in Knowledge Discovery and Data Mining
• Mr. M. Craciun (University of Galati), BSc, MSc