An Effective Combination based on Class-Wise Expertise of Diverse Classifiers for Predictive Toxicology Data Mining ADMA 2006, Xi’an, China Dr. Daniel NEAGU, UK Dr. Gongde GUO Dept. of Computer Science, Fujian Normal University, China Ms. Shanshan WANG Dept. of Computer Science, Nanjing University of Aeronautics and Astronautics, China



An Effective Combination based on Class-Wise Expertise of Diverse Classifiers for Predictive Toxicology Data Mining

ADMA 2006, Xi’an, China

Dr. Daniel NEAGU, UK

Dr. Gongde GUO, Dept. of Computer Science, Fujian Normal University, China

Ms. Shanshan WANG, Dept. of Computer Science, Nanjing University of Aeronautics and Astronautics, China


Bradford, UK

Bradford, West Yorkshire

National Museum of Film and Television

School of Informatics, University of Bradford


Overview (1)

Introduction to ML applications to KDD
Proposal of Combination Operators
Model Construction and Classification Algorithms
Model Library for Predictive Toxicology
  Collection of datasets
  Central store for models and results
  Formal structure to speed access and improve organisation; reduce ‘misplaced’ files
Remote Access
  Secure access to data from remote locations possible in the future


Overview (2)

Comparative Studies
Results from UoB Model Library
  Study of different Machine Learning techniques
  Variety of Feature Selection techniques
  Many datasets and endpoints
  Large variation in accuracy of created models
  One aim is to automatically build ensembles based on best class-wise models
Results and Conclusions


Current Context

Nowadays more scientific data is generated and flows within systems:
  Manpower / laboratories
  Techniques and computational power (Moore’s Law)
  Funds / legislation

More data is stored and available:
  Storage technology faster and cheaper (Storage Law)
  DBMS capable of handling bigger DBs
  Web / online access to distributed data

Consequences:
  Human expert is overloaded: very little data is checked
  Knowledge Discovery is NEEDED for data understanding and use

[Figure labels: Hardware; SW (Algorithms); Data collection/management]


General definitions

Data is defined as facts regarding things (such as people, objects, events) which can be digitally transmitted or processed.

Information is generally defined as data that have been processed and presented in a form suitable for human interpretation with the purpose of revealing meanings (such as patterns or rules).

Models are defined as representations of patterns.

Knowledge: the theoretical and practical comprehension of a certain domain, that supports making decisions.

Intelligence: the capability of learning, understanding and finding solutions for problems in a specific domain.

Example:
  1234567.89 is data.
  "Your bank balance has jumped 80.87% to £1234567.89" is information.
  "Nobody owes me that much money" is knowledge.
  "I'd better talk to the bank before I spend it, because of what has happened to other people" is intelligence.

Source: http://foldoc.doc.ic.ac.uk


Knowledge Discovery in Databases (KDD)

[Figure: the KDD process, from data sources to knowledge. Stages: Select/preprocess, Transform, Data mining, Interpret/Evaluate/Assimilate. Labels: Data preparation, Feature Selection, Models, Extracted information.]

The nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.

Involves the following steps:
  understanding the application domain and definition of the goals
  selecting the target data set
  data cleaning and pre-processing
  data reduction and projection
  choosing the function of data modelling and the algorithm
  data mining
  interpretation
  evaluation and utilization of the discovered knowledge
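The steps listed above can be sketched end to end on toy data. This is only an illustrative sketch: the records, the descriptor names `a`, `b`, `c`, the function names and the nearest-centroid model are all assumptions made for the example, not part of the slides. Accuracy is measured on the training records purely to keep the example short.

```python
from statistics import mean

# Hypothetical toy records (descriptor names and values are made up).
RAW = [
    {"a": 1.0, "b": 2.0, "c": 5.0, "label": "toxic"},
    {"a": 1.2, "b": None, "c": 5.0, "label": "toxic"},
    {"a": 3.9, "b": 0.5, "c": 5.0, "label": "safe"},
    {"a": 4.1, "b": 0.4, "c": 5.0, "label": "safe"},
]

def select(records, features):
    """Select the target data set: keep only the chosen columns."""
    return [{k: r[k] for k in features + ["label"]} for r in records]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def reduce_features(records, features):
    """Data reduction: drop features whose value never varies."""
    return [f for f in features if len({r[f] for r in records}) > 1]

def mine(records, features):
    """Data mining: build a nearest-centroid model (mean vector per class)."""
    model = {}
    for label in {r["label"] for r in records}:
        rows = [r for r in records if r["label"] == label]
        model[label] = [mean(r[f] for r in rows) for f in features]
    return model

def predict(model, features, record):
    """Assign the class whose centroid is closest to the record."""
    def sq_dist(centroid):
        return sum((record[f] - centroid[i]) ** 2 for i, f in enumerate(features))
    return min(model, key=lambda label: sq_dist(model[label]))

def evaluate(model, features, records):
    """Evaluation: fraction of records classified correctly."""
    hits = sum(predict(model, features, r) == r["label"] for r in records)
    return hits / len(records)

features = ["a", "b", "c"]
data = clean(select(RAW, features))      # one record with a missing value is dropped
kept = reduce_features(data, features)   # the constant descriptor "c" is dropped
model = mine(data, kept)
accuracy = evaluate(model, kept, data)
print(kept, accuracy)
```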


Predictive Data Mining

The process of data classification or regression whose goal is to obtain predictive models for a specific target, based on predictive relationships among a large number of input variables.

Classification identifies characteristics of the data and assigns a data item to one of several predefined categorical classes.

Regression uses the existing numerical data values and maps them to a real-valued prediction (target) variable.
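To make the distinction concrete, here is a small sketch in which the same nearest-neighbour search produces either a categorical class or a real-valued prediction. The k-NN technique, the toy training pairs and the function names are chosen for illustration only and are assumptions, not content from the slides.

```python
from collections import Counter

def k_nearest(train, x, k):
    """Return the k training pairs (features, target) closest to x."""
    def sq_dist(pair):
        return sum((a - b) ** 2 for a, b in zip(pair[0], x))
    return sorted(train, key=sq_dist)[:k]

def knn_classify(train, x, k=3):
    """Classification: majority class among the k nearest neighbours."""
    votes = Counter(target for _, target in k_nearest(train, x, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, x, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    neighbours = k_nearest(train, x, k)
    return sum(target for _, target in neighbours) / len(neighbours)

# Hypothetical one-descriptor training sets.
class_train = [((0.0,), "low"), ((1.0,), "low"), ((2.0,), "low"),
               ((8.0,), "high"), ((9.0,), "high"), ((10.0,), "high")]
value_train = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0),
               ((8.0,), 9.0), ((9.0,), 10.0), ((10.0,), 11.0)]

predicted_class = knn_classify(class_train, (1.5,))   # a categorical label
predicted_value = knn_regress(value_train, (1.0,))    # a real number
print(predicted_class, predicted_value)
```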


Machine Learning Applications in Data Mining Dynamics (ISI Thomson Web of Knowledge)

References to Machine Learning techniques with applications in Predictive Data Mining:

[Figure: number of references per year, 1995 to 2004 (axis from 0 to 3000), for ANNs, GAs, ILP, RI, DTs and k-NN]

[Figure: share of references by technique: ANNs 55%, GAs 30%, DTs 10%, k-NN 3%, RI 1%, ILP 1%]


Multi-Classifier Systems

Different classifiers potentially offer complementary or at least additional information about patterns to be classified.

Various approaches to classifier combinations:
  Majority voting [4]
  Entropy-based combination [5]
  Dempster-Shafer theory-based combination [6], [7]
  Bayesian classifier combination [8]
  Similarity-based classifier combination [9]
  Fuzzy inference [10]
  Gating networks [11]
  Statistical models [2]
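As an illustration of the simplest approach in the list, majority voting can be sketched as below. The three threshold classifiers, the label names and the tie-breaking rule (the label seen first wins) are assumptions made for this example, not details given in the slides.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; on a tie, the label seen first wins."""
    return Counter(labels).most_common(1)[0][0]

def combine_by_vote(classifiers, x):
    """Ask every classifier for a label and return the majority decision."""
    return majority_vote([clf(x) for clf in classifiers])

# Hypothetical base classifiers: simple threshold rules on two descriptors.
classifiers = [
    lambda x: "toxic" if x[0] > 0.5 else "non-toxic",
    lambda x: "toxic" if x[1] > 0.5 else "non-toxic",
    lambda x: "toxic" if x[0] + x[1] > 1.2 else "non-toxic",
]

decision_split = combine_by_vote(classifiers, (0.9, 0.2))      # votes are 1 to 2
decision_unanimous = combine_by_vote(classifiers, (0.9, 0.8))  # votes are 3 to 0
print(decision_split, decision_unanimous)
```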


The Proposed Effective Combination Scheme

We propose a hybrid classifier combination scheme which makes use of the class-wise expertise of diverse classifiers (a priori knowledge obtained from the training set) to achieve potentially better performance.

Two operators are proposed:

$$\arg\max_{M_i} \left\{ \frac{TP_{ij}}{TP_{ij} + FP_{ij}} \;\middle|\; i = 1, 2, \ldots, m \right\}, \quad j = 1, 2, \ldots, L$$

$$\arg\max_{M_i} \left\{ CA_i \mid i = 1, 2, \ldots, m \right\}$$
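A minimal sketch of how the two operators could be evaluated, assuming that TP_ij and FP_ij are the true-positive and false-positive counts of classifier M_i on class j, and that CA_i is the overall classification accuracy of M_i on the training set. These readings of the symbols, the function names and the toy predictions are assumptions made for illustration; the slides do not spell them out.

```python
def class_precision(y_true, y_pred, label):
    """TP / (TP + FP) for one class; 0.0 if the class is never predicted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
    return tp / (tp + fp) if tp + fp else 0.0

def accuracy(y_true, y_pred):
    """Fraction of instances classified correctly (assumed meaning of CA)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def best_per_class(y_true, predictions, labels):
    """First operator: for each class j, the classifier with the highest TP/(TP+FP)."""
    return {
        label: max(predictions,
                   key=lambda name: class_precision(y_true, predictions[name], label))
        for label in labels
    }

def best_overall(y_true, predictions):
    """Second operator: the classifier with the highest overall accuracy."""
    return max(predictions, key=lambda name: accuracy(y_true, predictions[name]))

# Hypothetical training labels and the predictions of three classifiers on them.
y_true = ["A", "A", "A", "B", "B", "B"]
predictions = {
    "M1": ["A", "A", "B", "B", "B", "A"],
    "M2": ["A", "B", "B", "B", "B", "B"],
    "M3": ["A", "A", "A", "A", "B", "B"],
}

experts = best_per_class(y_true, predictions, ["A", "B"])
generalist = best_overall(y_true, predictions)
print(experts, generalist)
```

In this toy case M2 is selected for class A because every instance it labels A really is A, even though its overall accuracy is lower than that of M3.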


Architecture of the Effective Multiple Classifier System

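[Figure: Training data passes through Data Pre-processing to the learning algorithms A1, A2, A3, ..., Am, which yield the Best Model for Class 1, the Best Model for Class 2, ..., the Best Model for Class L, and the Best Model for All Classes. An instance x from the Testing data goes to the Best Model for Class 1; if x is classified as C1, the Output is class 1. If not, x goes to the Best Model for Class 2; if x is classified as C2, the Output is class 2; and so on up to class L. Otherwise, x is classified by the Best Model for All Classes.]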
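One possible reading of the architecture is a cascade: each class expert is consulted in turn, its answer is accepted only when it predicts its own class, and the best overall model decides otherwise. The sketch below follows that reading; the function name, the ordering of the experts and the toy threshold models are assumptions for illustration and should not be taken as the authors' algorithm.

```python
def cascade_classify(x, class_experts, fallback):
    """Classify x with a cascade of class experts.

    class_experts: ordered list of (class_label, model) pairs, where each
    model is a callable mapping an instance to a label.
    fallback: the model used when no expert claims the instance.
    """
    for label, model in class_experts:
        if model(x) == label:      # the expert recognises its own class
            return label
    return fallback(x)             # "Otherwise" branch of the figure

# Hypothetical experts for a two-class problem on a single descriptor.
class_experts = [
    ("toxic", lambda x: "toxic" if x > 0.8 else "non-toxic"),
    ("non-toxic", lambda x: "non-toxic" if x < 0.2 else "toxic"),
]
fallback = lambda x: "toxic" if x >= 0.5 else "non-toxic"

results = [cascade_classify(x, class_experts, fallback) for x in (0.9, 0.1, 0.6, 0.4)]
print(results)
```

Here 0.9 and 0.1 are settled by the first and second expert respectively, while 0.6 and 0.4 fall through to the fallback model.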


Model construction algorithm


Classification Algorithm


ML applications for Predictive Toxicology

The EC proposal for the REACH regulation indicates that the information requirements under REACH can be (partially) fulfilled by using scientifically valid (Q)SAR models.

To guide the validation of computer-based methods, five OECD principles for the validation of (Quantitative) Structure-Activity Relationships were adopted:
  a defined endpoint
  an unambiguous algorithm
  a defined domain of applicability
  appropriate measures of goodness-of-fit, robustness and predictivity
  a mechanistic interpretation, if possible


Datasets (1)

DEMETRA*
  1. LC50 96h Rainbow Trout acute toxicity (ppm): 282 compounds
  2. EC50 48h Water Flea acute toxicity (ppm): 264 compounds
  3. LD50 14d Oral Bobwhite Quail (mg/kg): 116 compounds
  4. LC50 8d Dietary Bobwhite Quail (ppm): 123 compounds
  5. LD50 48h Contact Honey Bee (μg/bee): 105 compounds

*http://www.demetra-tox.net


Datasets (2)

CSL APC* Datasets
  5 endpoints
  A single endpoint/descriptor set used for our experiments
  Mallard Duck LD50 toxicity value
  60 organophosphates
  248 descriptors

*http://www.csl.gov.uk


Datasets (3)

TETRATOX*/LJMU** Dataset
  Tetrahymena pyriformis inhibition of growth (IGC50)
  Phenols data
  250 phenolic compounds
  187 descriptors

*http://www.vet.utk.edu/tetratox/

**http://www.ljmu.ac.uk


Descriptors

Multiple descriptor types
Various software packages to calculate 2D and 3D attributes*

*http://www.demetra-tox.net


Model Library

Algorithms chosen for their representability and diversity, and for easy, simple and fast access:
  Instance-based Learning algorithm (IBL)
  Decision Tree learning algorithm (DT)
  Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
  Multi-Layer Perceptrons (MLPs)
  Support Vector Machine (SVM)


Dimensionality

[Figure: four grids, Dataset One to Dataset Four, each with the axes Algorithms and Feature Selection; labelled items: Model, Parameter file, Results file]


Organisation

[Figure: organisation of the model library as a hierarchy, with the levels below]
  Source: CSL, DEMETRA, TETRATOX/LJMU
  Endpoint/Descriptors: Oral Quail, Water Flea, Dietary Quail, Bee, Trout, APC Mallard_Duck, PHENOLS
  Feature Selection: GR, IG, ReliefF, SVM, CS, Chi, KNNMFS, CFS, Raw
  File Type: Feature Subsets, Models, Parameters, Results
  Files: Model 1, Model 2, Model 3, ..., Model n


Comparison of performance of combination schemes on seven data sets

MCS:
  Majority Voting-based Combination (MVC)
  Maximal Probability-based Combination (MPC)
  Average Probability-based Combination (APC)
  Classifier Combination based on Dempster Rule of Combination (DRC)
  CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers)
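The slides name the baseline schemes without defining them. As a hedged illustration, the sketch below uses common textbook readings of the two probability-based baselines: MPC takes the class of the single most confident classifier output, and APC takes the class with the highest mean probability across classifiers. Those definitions, the function names and the toy probabilities are assumptions, not taken from the slides.

```python
def max_probability(prob_outputs):
    """MPC (assumed definition): label with the largest single probability."""
    best_label, best_p = None, -1.0
    for probs in prob_outputs:
        for label, p in probs.items():
            if p > best_p:
                best_label, best_p = label, p
    return best_label

def average_probability(prob_outputs):
    """APC (assumed definition): label with the highest mean probability."""
    labels = list(prob_outputs[0])
    mean_p = {
        label: sum(probs[label] for probs in prob_outputs) / len(prob_outputs)
        for label in labels
    }
    return max(mean_p, key=mean_p.get)

# Hypothetical class-probability outputs of three classifiers for one instance.
outputs = [
    {"toxic": 0.9, "non-toxic": 0.1},
    {"toxic": 0.3, "non-toxic": 0.7},
    {"toxic": 0.2, "non-toxic": 0.8},
]

mpc_label = max_probability(outputs)
apc_label = average_probability(outputs)
print(mpc_label, apc_label)
```

On this toy instance the two rules disagree: one very confident classifier decides under MPC, while the two moderately confident ones outweigh it under APC.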


Conclusions

The proposed combination scheme CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers):
  not only makes use of the expertise of the best individual classifiers but also removes their negative influences
  therefore, the results presented previously show a significant improvement in global performance


Acknowledgements

This work is part-funded by:
  EPSRC GR/T02508/01: Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach, http://pythia.inf.brad.ac.uk/
  EU FP5 Quality of Life DEMETRA QLRT-2001-00691: Development of Environmental Modules for Evaluation of Toxicity of pesticide Residues in Agriculture, http://www.demetra-tox.net

Special thanks also to:
  Dr. Q. Chaudhry (CSL York)
  Dr. Mark Cronin (LJMU)

and PhD students:
  Ms. Ladan Malazizi, BSc, PhD student. Research Theme: Development of Artificial Intelligence-based in-silico toxicity models for use in pesticide risk assessment
  Mr. Paul Trundle, BSc, PhD student. Research Theme: Hybrid Intelligent Systems applied to predict Pesticide Toxicity
  Ms. Areej Shhab, BEng, MPhil. Research Theme: Applications of Machine Learning in Knowledge Discovery and Data Mining
  Mr. M. Craciun (University of Galati), BSc, MSc