Predicting gene functions using hierarchical multi-label decision tree ensembles

Celine Vens, Leander Schietgat, Jan Struyf, Hendrik Blockeel, Dragi Kocev, Sašo Džeroski

K.U.Leuven, Department of Computer Science



Hierarchical Multi-Label Classification (HMC) for Gene Function Prediction

• Classification: a common machine learning task, e.g.,
  • Given: genes with known function
  • Task: predict function for new genes
• Special case: hierarchical multi-label classification (HMC)
  • gene can have multiple functions
  • functions are organized in a hierarchy
    • tree (e.g., MIPS FunCat)
    • DAG (e.g., Gene Ontology)

Hierarchy constraint: if gene is labeled with function X, then it is also labeled with all parents of X
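As an illustration of the hierarchy constraint, the sketch below closes a gene's label set under the parent relation. The `parents` mapping and the FunCat-style codes in it are hypothetical, chosen only to show the idea. Because a class may list several parents, the same function covers both tree and DAG hierarchies.

```python
def close_under_hierarchy(labels, parents):
    """Return the label set extended with every ancestor of each label.

    `parents` maps a class to the list of its direct parents
    (one parent in a tree, possibly several in a DAG).
    """
    closed = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in closed:
            continue  # already handled, also avoids revisiting shared ancestors
        closed.add(label)
        stack.extend(parents.get(label, []))
    return closed


# Hypothetical toy hierarchy using FunCat-style codes (illustration only).
toy_parents = {"40/3": ["40"], "40/16": ["40"], "5/1": ["5"]}
gene_labels = close_under_hierarchy({"40/3", "5/1"}, toy_parents)
```

Here `gene_labels` contains the two annotated functions together with their parents `40` and `5`.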


Predictions in Functional Genomics

• S. cerevisiae (13 datasets) and A. thaliana (12 datasets)
  • two of biology’s model organisms
  • most genes are annotated, ideal for testing purposes
  • method can be applied to other organisms
• Data
  • based on sequence statistics, phenotype, secondary structure, homology, microarray data, …


Predictive Clustering Trees

• Our focus is on decision trees
• Advantages: fast to build, noise-resistant, fast to apply, accurate predictions, easy to interpret
• General framework: predictive clustering trees (PCTs)

[Diagram: Input (genes with features and known functions) → Algorithm (PCT-algo: top-down induction of PCTs) → Output (PCT)]

[Input table: one row per gene (G1, G2, …, G6, …), feature columns Name, A1, A2, …, An, and one column per function class (1, …, 5, 5/1, …, 40, 40/3, 40/16, …), with an x marking each function the gene is annotated with]
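Top-down induction grows a tree by repeatedly choosing the test that makes the resulting clusters of genes most homogeneous in their function vectors. The sketch below scores a candidate test by variance reduction over 0/1 function vectors, assuming plain squared Euclidean distance. The function names and toy data are illustrative, and this is not the Clus implementation.

```python
def variance(label_vectors):
    """Mean squared Euclidean distance of the label vectors to their mean."""
    n = len(label_vectors)
    if n == 0:
        return 0.0
    dims = len(label_vectors[0])
    mean = [sum(v[j] for v in label_vectors) / n for j in range(dims)]
    return sum(
        sum((v[j] - mean[j]) ** 2 for j in range(dims)) for v in label_vectors
    ) / n


def variance_reduction(label_vectors, test_outcomes):
    """Variance of the parent node minus the size-weighted variance of the
    two subsets induced by a boolean test (one outcome per gene)."""
    left = [v for v, passed in zip(label_vectors, test_outcomes) if passed]
    right = [v for v, passed in zip(label_vectors, test_outcomes) if not passed]
    n = len(label_vectors)
    return (
        variance(label_vectors)
        - len(left) / n * variance(left)
        - len(right) / n * variance(right)
    )


# Toy data: four genes, two function classes (illustration only).
toy_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
good_split = variance_reduction(toy_labels, [True, True, False, False])
poor_split = variance_reduction(toy_labels, [True, False, True, False])
```

A test that separates the two groups of genes cleanly (`good_split`) scores higher than one that mixes them (`poor_split`), so the induction procedure would prefer it.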


Decision Trees for HMC: Different Approaches

• Clus-SC (standard approach): learns one tree per class
• Clus-HSC (special-purpose approach): learns one tree per class + hierarchy constraint
• Clus-HMC (our approach): learns one single tree for all classes

[Table comparing Clus-SC, Clus-HSC and Clus-HMC on: hierarchy constraint, identifies global features, predictive performance, model size, efficiency]


Predictive Clustering Forests

• Ensembles
• Less interpretability
• Better performance
• Algorithm: Clus-HMC-Ens

[Diagram: training set → 50 bootstrap replicates (1, 2, 3, …, n) → Clus-HMC applied to each replicate → 50 PCTs → applied to the test set → 50 predictions (L1, L2, L3, …, Ln) → combined prediction L]
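The ensemble procedure in the diagram can be sketched as bagging: draw bootstrap replicates of the training set, learn one model per replicate, and combine the individual predictions. In the sketch below, combining by averaging the per-class vectors is an assumption, and `learn_class_frequencies` is a hypothetical stand-in for Clus-HMC that only makes the example runnable.

```python
import random


def bootstrap_replicate(training_set, rng):
    """Sample len(training_set) examples with replacement."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]


def train_ensemble(training_set, learn, n_models=50, seed=0):
    """Learn one model per bootstrap replicate of the training set."""
    rng = random.Random(seed)
    return [learn(bootstrap_replicate(training_set, rng)) for _ in range(n_models)]


def combined_prediction(models, instance):
    """Average the per-class prediction vectors of all models (assumed rule)."""
    predictions = [model(instance) for model in models]
    n_classes = len(predictions[0])
    return [
        sum(p[j] for p in predictions) / len(predictions) for j in range(n_classes)
    ]


def learn_class_frequencies(sample):
    """Hypothetical stand-in for Clus-HMC: ignores the features and
    predicts the class frequencies observed in its training sample."""
    n_classes = len(sample[0][1])
    mean = [
        sum(labels[j] for _, labels in sample) / len(sample)
        for j in range(n_classes)
    ]
    return lambda instance: mean


# Toy training set of (features, 0/1 label vector) pairs (illustration only).
toy_training_set = [((0.1,), [1, 0]), ((0.4,), [1, 1]), ((0.8,), [0, 1]), ((0.9,), [0, 0])]
forest = train_ensemble(toy_training_set, learn_class_frequencies)
prediction = combined_prediction(forest, (0.5,))
```

Replacing `learn_class_frequencies` with a real tree learner leaves the rest of the sketch unchanged, which is the sense in which the forest is a variant of the single-tree approach.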


Decision Trees for HMC: Different Approaches

• Clus-SC (standard approach): learns one tree per class
• Clus-HSC (special-purpose approach): learns one tree per class + hierarchy constraint
• Clus-HMC (our approach): learns one single tree for all classes
• Clus-HMC-Ens (variant of our approach): learns forest

[Table comparing Clus-SC, Clus-HSC, Clus-HMC and Clus-HMC-Ens on: hierarchy constraint, identifies global features, predictive performance, model size, efficiency]


Evaluation

• Evaluation: precision-recall
  • precision: percentage of predicted functions that are correct (TP/(TP+FP))
  • recall: percentage of actual functions predicted by the algorithm (TP/(TP+FN))
• Average PR curve
  – Consider (instance, class) couples
  – A couple is (predicted) positive if the instance has (is predicted to have) the class

                   predicted positive   predicted negative
actual positive    TP                   FN
actual negative    FP                   TN
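To make the couple-based counting concrete, the sketch below computes precision and recall over all (instance, class) couples at a given prediction threshold. Sweeping the threshold yields the points of a PR curve. The toy scores are invented for illustration, and returning a precision of 1.0 when nothing is predicted is a convention assumed here.

```python
def precision_recall(true_labels, scores, threshold):
    """Precision and recall over all (instance, class) couples.

    A couple is predicted positive when its score reaches the threshold.
    """
    tp = fp = fn = 0
    for truth_row, score_row in zip(true_labels, scores):
        for truth, score in zip(truth_row, score_row):
            predicted = score >= threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
    # Assumed convention: precision is 1.0 when no couple is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: two genes, three classes (illustration only).
toy_truth = [[1, 0, 1], [0, 1, 0]]
toy_scores = [[0.9, 0.6, 0.4], [0.2, 0.8, 0.1]]
pr_points = [precision_recall(toy_truth, toy_scores, t) for t in (0.3, 0.5, 0.95)]
```

Lowering the threshold predicts more couples as positive, which raises recall and typically lowers precision. That trade-off is what the PR curve displays.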


[Figure: results on four datasets: S. cerevisiae-FunCat (hom), A. thaliana-GO (seq), S. cerevisiae-FunCat (expr), A. thaliana-GO (interpro)]

• Clus-HMC-Ens better than Clus-HMC (average AUC improvement of 7%)
• Clus-HMC better than C4.5H, a state-of-the-art system for HMC (at the same recall as C4.5H, average precision improvement of 20.9%)



• Comparison with SVMs (Barutcuoglu et al.)
  – Learn one SVM per class
  – Correct for hierarchy constraint (HC) violations with a Bayesian model
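When each class gets its own classifier, nothing forces the outputs to respect the hierarchy. The sketch below only detects such violations, flagging every child class that is scored as more likely than one of its parents. It does not reproduce the Bayesian correction of Barutcuoglu et al., and the class codes and scores are hypothetical.

```python
def hierarchy_violations(probabilities, parents):
    """List (child, parent) pairs where the child is scored as more likely
    than its parent, which contradicts the hierarchy constraint."""
    violations = []
    for child, child_parents in parents.items():
        for parent in child_parents:
            if probabilities[child] > probabilities[parent]:
                violations.append((child, parent))
    return sorted(violations)


# Hypothetical per-class outputs for one gene (illustration only).
toy_parents = {"40/3": ["40"], "40/16": ["40"]}
per_class_outputs = {"40": 0.30, "40/3": 0.70, "40/16": 0.10}
violations = hierarchy_violations(per_class_outputs, toy_parents)
```

Here class `40/3` is scored above its parent `40`, so the pair is reported. A model that learns a single tree for all classes can avoid this situation by construction.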



Conclusions

• Clus-HMC outperforms (or is comparable to) state-of-the-art methods on functional genomics tasks
• Ensembles of Clus-HMC are able to boost performance, if the user is willing to give up on interpretability
• “Revenge of the decision trees”