Hierarchical multilabel classification trees for gene function prediction

Hierarchical multilabel classification trees for gene function prediction Leander Schietgat Hendrik Blockeel Jan Struyf Katholieke Universiteit Leuven (Belgium) Amanda Clare University of Aberystwyth (Wales) Sašo Džeroski Jožef Stefan Institute Ljubljana (Slovenia) Probabilistic Modeling and Machine Learning in Structural and Systems Biology Tuusula, Finland, 17-18 June 2006


Page 1: Hierarchical multilabel classification trees for gene function prediction

Hierarchical multilabel classification trees for gene function prediction

Leander Schietgat, Hendrik Blockeel, Jan Struyf
Katholieke Universiteit Leuven (Belgium)

Amanda Clare
University of Aberystwyth (Wales)

Sašo Džeroski
Jožef Stefan Institute, Ljubljana (Slovenia)

Probabilistic Modeling and Machine Learning in Structural and Systems Biology

Tuusula, Finland, 17-18 June 2006

Page 2: Hierarchical multilabel classification trees for gene function prediction

Overview

The application
  gene function prediction

The machine learning context
  hierarchical multilabel classification

Decision trees for HMC
  the algorithm: Clus-HMC

Experimental results

Conclusions

PMSB

2006

Page 3: Hierarchical multilabel classification trees for gene function prediction

Gene Function Prediction

Task
  Given a data set with descriptions of genes and the functions they have
  Learn a model that can predict for a new gene what functions it performs

Genes can have multiple functions

These functions are hierarchically organised


[Figure: example class hierarchy with top-level classes c1, c2 and c3; c2 has the subclasses c21 and c22]

Page 4: Hierarchical multilabel classification trees for gene function prediction

Machine Learning

Classifier
  predicts for unseen instances the class to which they belong
  is learned from already classified training examples

Different techniques
  decision trees
  support vector machines
  Bayesian networks
  …


Page 5: Hierarchical multilabel classification trees for gene function prediction

Hierarchical Multilabel Classification

Normal classification setting
  only predicts a single class

HMC
  predict multiple classes at once
  classes are organized in a hierarchy

Hierarchy constraint
  instances of a class must be instances of its superclasses
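The hierarchy constraint can be made concrete with a small sketch. It reuses the class names c1, c2, c3, c21 and c22 from the earlier hierarchy example; the `PARENT` mapping and all function names are illustrative assumptions, not part of Clus.

```python
# Hypothetical toy hierarchy (child -> parent); top-level classes have no entry.
PARENT = {"c21": "c2", "c22": "c2"}


def ancestors(label):
    """Return all superclasses of a label, nearest first."""
    result = []
    while label in PARENT:
        label = PARENT[label]
        result.append(label)
    return result


def satisfies_hierarchy(labels):
    """True if every label's superclasses are also in the label set."""
    return all(a in labels for l in labels for a in ancestors(l))


def close_under_hierarchy(labels):
    """Add any missing superclasses so that the constraint holds."""
    closed = set(labels)
    for l in labels:
        closed.update(ancestors(l))
    return closed


print(satisfies_hierarchy({"c1", "c2", "c21"}))  # c21 comes with its parent c2
print(satisfies_hierarchy({"c21"}))              # c2 is missing
print(close_under_hierarchy({"c21"}))
```

A label set such as {c21} violates the constraint until its superclass c2 is added.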


Page 6: Hierarchical multilabel classification trees for gene function prediction

Two HMC approaches

1. Learn a model for each class and combine the predictions

Advantage
  a lot of machine learning algorithms available

Disadvantages
  efficiency
  skewed class distributions
  hierarchical relationships


[Figure: separate models m1, m2, …, mn, each answering one question: c1? c2? … cn?]

Page 7: Hierarchical multilabel classification trees for gene function prediction

Two HMC approaches (cont’d)

2. Learn a single model that predicts all the classes together

Advantages
  faster to learn
  easier to interpret
  hierarchy constraint automatically imposed
  selection of features relevant for all classes

Disadvantage
  may have worse predictive performance

[Figure: a single model M outputs the vector [c1, c2, …, cn]]
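The difference between the two approaches shows up in how the training targets are laid out. The following is only a sketch: the class order in `CLASSES` and the three genes are made up for illustration.

```python
# Hypothetical class order and made-up gene annotations.
CLASSES = ["c1", "c2", "c21", "c22", "c3"]
genes = {"G1": {"c1"}, "G2": {"c2", "c21"}, "G3": {"c1", "c3"}}


def to_vector(labels):
    """Encode a label set as a 0/1 vector over CLASSES."""
    return [1 if c in labels else 0 for c in CLASSES]


# Approach 2: a single model is trained on one target vector per gene.
targets_single_model = {g: to_vector(ls) for g, ls in genes.items()}

# Approach 1: one binary target per class, i.e. one learning problem per class.
targets_per_class = {
    c: {g: int(c in ls) for g, ls in genes.items()} for c in CLASSES
}

print(targets_single_model["G2"])
print(targets_per_class["c1"])
```

With n classes, approach 1 needs n of the per-class problems, while approach 2 learns from the vectors directly.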


Page 8: Hierarchical multilabel classification trees for gene function prediction

Related work on HMC

Barutcuoglu et al. (2006)
  learn classes separately with SVMs and combine the predictions with Naïve Bayes

Clare (2003)
  extension of the C4.5 decision tree method that learns all classes together

A lot of work in the area of text classification
  Rousu et al. (2005) give an overview of SVM methods that learn a single model for all classes


            Gene function prediction   Text classification
Approach 1  Barutcuoglu et al.         …
Approach 2  Clare                      …

Page 9: Hierarchical multilabel classification trees for gene function prediction

Why decision trees?

fast to build
fast to use
accurate predictions
easy to interpret

Training examples

Gene  ND  HS  …  MF?
G1    25  29  …
G2    32  40  …  +
G3    19   0  …
G4    44  45  …  +
…     …   …   …  …

[Figure: decision tree learned from the training examples. The nodes test "Nitrogen depletion <= -2.74?" and "Heat shock > 1.28?", each with a yes and a no branch; the leaves predict Positive or Negative.]

Page 10: Hierarchical multilabel classification trees for gene function prediction

Decision trees for HMC

The Clus system
  created by Jan Struyf
  propositional DT learner, implemented in Java
  uses ideas of:
    C4.5 [Quinlan93] and CART [Breiman84]
    Predictive Clustering Trees [Blockeel98]

Heuristic for HMC
  look for the test that minimizes the intra-cluster variance (= generalisation of CART)
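A minimal sketch of such a variance-based heuristic is given below. It assumes that each example's class set is encoded as a 0/1 vector and that variance is the mean squared Euclidean distance to the mean vector; the slides do not specify the distance measure, so this choice, like all function names, is an assumption and not the Clus implementation.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def variance(vectors):
    """Mean squared Euclidean distance of the vectors to their mean."""
    if not vectors:
        return 0.0
    m = mean_vector(vectors)
    total = sum(sum((x - mx) ** 2 for x, mx in zip(v, m)) for v in vectors)
    return total / len(vectors)


def split_score(values, vectors, threshold):
    """Size-weighted variance of the two clusters induced by `value > threshold`."""
    yes = [v for x, v in zip(values, vectors) if x > threshold]
    no = [v for x, v in zip(values, vectors) if x <= threshold]
    return (len(yes) * variance(yes) + len(no) * variance(no)) / len(vectors)


def best_threshold(values, vectors):
    """Threshold with the lowest intra-cluster variance, or None if no split exists."""
    candidates = sorted(set(values))[:-1]  # the maximum would leave one side empty
    if not candidates:
        return None
    return min(candidates, key=lambda t: split_score(values, vectors, t))


# Made-up data: one attribute, label vectors over two classes.
values = [0.1, 0.2, 0.8, 0.9]
vectors = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(best_threshold(values, vectors))
```

On this toy data the split between 0.2 and 0.8 separates the two label patterns completely, so both resulting clusters have zero variance.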


Page 11: Hierarchical multilabel classification trees for gene function prediction

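Decision trees for HMC (cont’d)

Clus can be used for HMC (Clus-HMC) …

… as well as binary classification (Clus-SC ~ CART)

[Figure: Clus-HMC grows one tree whose leaves hold sets of classes (leaf labels as extracted: c1; c1,c21,c22; c2,c21,c22; c1; c1,c2,c21; c1,c3). Clus-SC grows n separate trees (1, 2, …, n), one per question: c1? c2? … cn?]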


Page 12: Hierarchical multilabel classification trees for gene function prediction

Experiments in yeast functional genomics

Saccharomyces cerevisiae or baker’s/brewer’s yeast

MIPS FunCat hierarchy
  250 functions of yeast genes

  1 METABOLISM
    1/1 amino acid metabolism
    1/2 nitrogen and sulfur metabolism
  2 ENERGY
    2/1 glycolysis and gluconeogenesis
  …

12 datasets [Clare03]
  Sequence structure (seq)
  Phenotype growth (pheno)
  Secondary structure (struc)
  Homology search (hom)
  Microarray data
    cellcycle, church, derisi, eisen, gasch1, gasch2, spo, expr (all)


Page 13: Hierarchical multilabel classification trees for gene function prediction

Example run

each leaf contains multiple classes

which classes to predict?

problem: different class frequencies

use of threshold

precision-recall curves: independent of a specific threshold


[Figure: example run. A data table lists the genes G1, G2, … with their description (attributes A1 … An) and their functions (columns 1, …, 5, 5/1, …, 40, 40/3, 40/16, …; an x marks each function a gene has). A tree with the tests "nitrogen_depletion > 5" and "37C_to_25C_shock > 1.28" sorts the genes into three leaves; the label sets shown are {1,5,5/1,3,3/5}, {1,5}, {5,5/1,40}, {5,5/1,40,40/3}, {40,40/3,40/16} and {40,40/16}.]

Predictions per leaf for different thresholds p

p      Leaf 1          Leaf 2          Leaf 3
0%     40,40/3,40/16   5,5/1,40,40/3   1,5,5/1,3,3/5
50%    40,40/3,40/16   5,5/1,40        1,5
100%   40,40/16        5,5/1,40        1,5
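The thresholding step can be sketched as follows. The leaf contents are an assumption chosen for illustration (four genes, two annotated with {40, 40/3, 40/16} and two with {40, 40/16}), and predicting a class when its proportion in the leaf is at least the threshold is one plausible reading of the slide, not a confirmed detail of Clus-HMC.

```python
from collections import Counter


def class_proportions(leaf_label_sets):
    """Fraction of the leaf's training genes that carry each class."""
    counts = Counter(c for labels in leaf_label_sets for c in labels)
    n = len(leaf_label_sets)
    return {c: k / n for c, k in counts.items()}


def predict(leaf_label_sets, threshold):
    """Predict every class whose proportion in the leaf reaches the threshold."""
    props = class_proportions(leaf_label_sets)
    return {c for c, p in props.items() if p >= threshold}


# Assumed leaf contents, for illustration only.
leaf = [
    {"40", "40/3", "40/16"},
    {"40", "40/3", "40/16"},
    {"40", "40/16"},
    {"40", "40/16"},
]

for t in (0.0, 0.5, 1.0):
    print(t, sorted(predict(leaf, t)))
```

Raising the threshold drops the rarer class 40/3 from the prediction, which is the trade-off that the precision-recall curves summarize across all thresholds.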

Page 14: Hierarchical multilabel classification trees for gene function prediction

Comparison of Clus-HMC with [Clare03]

Average precision-recall curves


PRECISION = proportion of (instance, class) predictions that are correct

RECALL = proportion of true (instance, class) cases that are predicted
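Both definitions count (instance, class) pairs, which a short sketch makes explicit. The gene and class names in the example are made up, and returning 1.0 for an empty set of predictions or of true cases is an assumed convention.

```python
def precision_recall(predicted, actual):
    """Precision and recall over sets of (instance, class) pairs."""
    true_positives = len(predicted & actual)
    # Assumed convention: an empty denominator yields 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Made-up example: four true pairs, three predicted pairs, two of them correct.
actual = {("G1", "40"), ("G1", "40/3"), ("G2", "5"), ("G2", "5/1")}
predicted = {("G1", "40"), ("G2", "5"), ("G2", "40")}

precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

Here two of the three predicted pairs are correct and two of the four true pairs are found, so precision is 2/3 and recall is 1/2.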

Page 15: Hierarchical multilabel classification trees for gene function prediction

Extracting rules

e.g. predictions for class 40/3 in “gasch1” dataset

IF Nitrogen_Depletion_8_h <= -2.74 AND

Nitrogen_Depletion_2_h > -1.94 AND

1point5_mM_diamide_5_min > -0.03 AND

1M_sorbitol___45_min_ > -0.36 AND

37C_to_25C_shock___60_min > 1.28

THEN 40,40/3

Precision: 0.97

Recall: 0.15
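Read as code, the rule is a conjunction of the tests on the path from the root to one leaf. The sketch below copies the attribute names and thresholds from the rule above; the dictionary representation of a gene and the example expression values are assumptions.

```python
def rule_40_3(gene):
    """Return the predicted classes if all five conditions hold, else an empty set."""
    fires = (
        gene["Nitrogen_Depletion_8_h"] <= -2.74
        and gene["Nitrogen_Depletion_2_h"] > -1.94
        and gene["1point5_mM_diamide_5_min"] > -0.03
        and gene["1M_sorbitol___45_min_"] > -0.36
        and gene["37C_to_25C_shock___60_min"] > 1.28
    )
    return {"40", "40/3"} if fires else set()


# Made-up expression values that satisfy every condition.
example_gene = {
    "Nitrogen_Depletion_8_h": -3.0,
    "Nitrogen_Depletion_2_h": -1.0,
    "1point5_mM_diamide_5_min": 0.1,
    "1M_sorbitol___45_min_": 0.0,
    "37C_to_25C_shock___60_min": 1.5,
}

print(rule_40_3(example_gene))
```

The high precision and low recall reported for this rule fit its shape: five conditions must all hold, so it fires for few genes.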


Page 16: Hierarchical multilabel classification trees for gene function prediction

HMC vs. single classification

Tree sizes
  on average
    HMC tree: 24 nodes
    SC tree: 33 nodes (250 such trees)

Time to grow trees
  a single SC tree is grown faster than a single HMC tree
  but 250 single trees have to be built
  HMC is on average 37 times faster

Predictive performance
  next slide
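A quick back-of-the-envelope calculation puts these averages side by side. It treats the reported averages as exact and assumes that the factor of 37 compares one HMC tree against all 250 SC trees together, so the derived numbers are rough indications only.

```python
N_CLASSES = 250  # number of SC trees, one per class
HMC_NODES = 24   # average size of the single HMC tree
SC_NODES = 33    # average size of one SC tree
SPEEDUP = 37     # HMC vs. all SC trees together (assumed interpretation)

# Total model size of the SC approach compared with the single HMC tree.
total_sc_nodes = N_CLASSES * SC_NODES
size_ratio = total_sc_nodes / HMC_NODES

# If 250 SC trees take 37 times as long as one HMC tree,
# one HMC tree takes about 250 / 37 times as long as one SC tree.
single_tree_time_ratio = N_CLASSES / SPEEDUP

print(total_sc_nodes, size_ratio, round(single_tree_time_ratio, 2))
```

Under these assumptions the SC approach produces 8250 nodes in total against 24 for HMC, and a single HMC tree takes roughly seven times as long to grow as a single SC tree.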


Page 17: Hierarchical multilabel classification trees for gene function prediction

HMC vs. single classification

Average precision-recall curves


Page 18: Hierarchical multilabel classification trees for gene function prediction

Explanation of the results

The classes are not independent

different trees for different classes actually share structure
  explains some complexity reduction achieved by Clus-HMC

one class carries information on other classes
  this increases the signal-to-noise ratio
  provides better guidance when learning the tree (explaining good predictive performance)
  avoids overfitting (explaining further reduction of tree size)

this was confirmed empirically


Page 19: Hierarchical multilabel classification trees for gene function prediction

Conclusions

HMC decision trees are a useful tool for gene function prediction
  fast to learn
  high interpretability

Compared to regular tree learning, HMC tree learning:
  is even faster
  yields trees that:
    are smaller
    are easier to interpret
    have equal or better predictive performance


Page 20: Hierarchical multilabel classification trees for gene function prediction

Further work

Comparison to other HMC learning algorithms
  kernel methods studied by Rousu et al. and Barutcuoglu et al.
  other suggestions are welcome!

Use a more advanced hierarchy such as Gene Ontology
  thousands of classes, spread over 19 levels
  how to handle the part_of relationship?
    if a function A is part_of a function B, does a gene with function A also have function B?
    gene “has” function B X vs. gene “is involved” in function B


Page 21: Hierarchical multilabel classification trees for gene function prediction

Questions?
