Application and Efficacy of Random Forest Method for QSAR Analysis presented by Pavel Polishchuk


Page 1

Application and Efficacy of Random Forest Method for QSAR Analysis

Presented by Pavel Polishchuk

Page 2

Random Forest – consensus modelling

A Random Forest model is an ensemble of single decision trees.

Rules for model construction

1. Each tree is grown on a separate bootstrap sample of the initial training set compounds.

2. At each node, only a small, fixed number of randomly chosen descriptors is considered.

3. Each tree is grown to its maximum depth (no pruning).
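Rules 1 and 2 can be illustrated with a minimal Python sketch. It is not the implementation used in the presented work: the function names and the `mtry` parameter name are hypothetical, and the sizes (644 compounds, 6021 descriptors, 2000 descriptors per node) are borrowed from the toxicity example later in the talk purely for illustration. Tree growing itself (rule 3) is not shown.

```python
import random


def bootstrap_sample(n_compounds, rng):
    """Rule 1: draw n_compounds indices with replacement.

    Compounds that are never drawn form the out-of-bag set for this tree.
    """
    in_bag = [rng.randrange(n_compounds) for _ in range(n_compounds)]
    out_of_bag = sorted(set(range(n_compounds)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_descriptors(n_descriptors, mtry, rng):
    """Rule 2: pick mtry distinct descriptor indices to try at one node."""
    return rng.sample(range(n_descriptors), mtry)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(644, rng)
subset = candidate_descriptors(6021, 2000, rng)

print("in-bag draws:", len(in_bag))
print("distinct in-bag compounds:", len(set(in_bag)))
print("out-of-bag compounds:", len(out_of_bag))
print("descriptors tried at this node:", len(subset))
```

Because sampling is done with replacement, some compounds appear several times in a bootstrap sample and others are left out entirely; a new descriptor subset would be drawn at every node of every tree.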

Page 3

Random Forest algorithm

[Diagram: three bootstrap samples are drawn from the initial dataset; each sample is used to grow one tree (Tree 1, Tree 2, Tree 3); the outputs of the trees are merged into a combined prediction.]
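The slide does not say how the tree outputs are merged. A minimal sketch, assuming the usual Random Forest convention (averaging for regression, majority vote for classification); the function names and example values are hypothetical:

```python
from collections import Counter
from statistics import mean


def combine_regression(tree_predictions):
    """Combined prediction for a continuous property: mean over trees."""
    return mean(tree_predictions)


def combine_classification(tree_votes):
    """Combined prediction for a class label: the most frequent vote.

    Ties are resolved in favour of the label encountered first.
    """
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical outputs of Tree 1, Tree 2 and Tree 3 for one compound.
print(combine_regression([0.9, 1.2, 1.5]))
print(combine_classification(["mutagen", "non-mutagen", "mutagen"]))
```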

Page 4

Random Forest advantages:

1. RF models are robust to over-fitting.

2. No pre-selection of variables is needed.

3. RF has its own reliable procedure for estimating the predictive ability of a model (out-of-bag estimation).

4. RF models are robust to "noise" in the training dataset.

5. RF allows the importance of variables for the target property to be estimated (interpretability of the RF model).

6. RF allows compounds with different mechanisms of action to be analyzed.

7. The RF method is very fast and effective when working with huge datasets.
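The built-in procedure in point 3 is commonly the out-of-bag (oob) estimate: each compound is predicted only by the trees whose bootstrap sample did not contain it. A minimal sketch, assuming the per-tree predictions and in-bag sets are already available; the data layout and all values are hypothetical:

```python
def oob_predictions(tree_outputs, in_bag_sets, n_compounds):
    """Average, for every compound, the outputs of the trees that never saw it.

    tree_outputs[t][i] is the prediction of tree t for compound i.
    in_bag_sets[t] is the set of compound indices used to grow tree t.
    A compound that was in-bag for every tree gets None.
    """
    predictions = []
    for i in range(n_compounds):
        votes = [
            tree_outputs[t][i]
            for t in range(len(tree_outputs))
            if i not in in_bag_sets[t]
        ]
        predictions.append(sum(votes) / len(votes) if votes else None)
    return predictions


# Toy example: 3 trees, 4 compounds.
tree_outputs = [
    [1.0, 2.0, 3.0, 4.0],
    [1.5, 2.5, 3.5, 4.5],
    [2.0, 3.0, 4.0, 5.0],
]
in_bag_sets = [{0, 1}, {1, 2}, {0, 1, 3}]
result = oob_predictions(tree_outputs, in_bag_sets, 4)
print(result)
```

Comparing these out-of-bag predictions with the observed values gives an error estimate without setting aside a separate validation set.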

Page 5

Several examples of solutions to real QSAR tasks

Page 6

Toxicity of chemical compounds for T. pyriformis#

Diverse datasets:

training set = 644 compounds

test set 1 (ts1) = 339 compounds

test set 2 (ts2) = 110 compounds

Total number of 2D simplex descriptors = 6021

Toxicity was expressed as the logarithm of the inverse of the concentration causing 50% inhibition of Tetrahymena pyriformis growth (pIGC50).

# Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
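As a worked illustration of the target property, pIGC50 can be computed from a concentration as the negative base-10 logarithm. The function name is hypothetical and the base-10 logarithm is an assumption (the usual "p" convention); the concentration units are whatever the dataset uses and are not restated on the slide.

```python
import math


def p_igc50(igc50):
    """Return pIGC50 = log10(1 / IGC50) = -log10(IGC50).

    igc50 must be a positive concentration; its unit is assumed to be
    the one used by the dataset.
    """
    if igc50 <= 0:
        raise ValueError("IGC50 must be positive")
    return -math.log10(igc50)


# A tenfold lower inhibitory concentration raises pIGC50 by one unit,
# so larger values mean higher toxicity.
print(p_igc50(0.1), p_igc50(0.01))
```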

Page 7

Comparison of RF model with other consensus models

             RF#            Consensus PLS    Consensus
             (2D simplex)   (2D simplex)     literature##
R2(ws)       0.99           0.85             0.92
R2(oob)      0.81           ---              ---
R2(ts1)      0.83           0.80             0.85
R2(ts2)      0.74           0.69             0.67
MAE(ts1)     0.30           0.33             0.29
MAE(ts2)     0.38           0.41             0.39

MAE is the mean absolute error of prediction:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|$$

RF model (trees=500, vars=2000)#

# Polishchuk, P.G., et al., J. Chem. Inf. Model., 2009. 49: p. 2481-2488.

## Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
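The two statistics reported on this slide can be computed as in the sketch below. MAE follows the definition given on the slide. The slide does not define R2, so the coefficient of determination, 1 - SS_res / SS_tot, is used here as an assumption; the function names and toy values are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |Y_i - predicted Y_i| over n compounds."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition).

    Not defined when all observed values are identical (SS_tot = 0).
    """
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy observed and predicted values, not taken from the study.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(mae(observed, predicted), r2(observed, predicted))
```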

Page 8

Estimation of mutagenic potential of chemical compounds (Ames test)

training set = 4361 compounds

test set = 2181 compounds

Model                    Descriptors        Accuracy   Accuracy      Accuracy
                                            (oob)      (5-fold CV)   (test set)
2D RF                    Simplex + Dragon   0.827      0.823         0.813
2D RF                    Simplex            0.823      0.810         0.814
2D RF                    Dragon             0.815      0.803         0.805
Consensus# (32 models)   ---                ---        0.828         0.823

# Results of collaboration of 13 scientific groups (not published yet)
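For orientation, the accuracy measure and a 5-fold cross-validation split of the 4361 training compounds can be sketched as follows. The splitting scheme (shuffle, then deal indices into five folds) is an assumption; the slide does not describe how the folds were built, and the function names are hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of compounds whose predicted class equals the observed class."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if y == p)
    return hits / len(y_true)


def k_fold_indices(n_compounds, k, seed=0):
    """Shuffle compound indices and deal them into k disjoint folds."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


folds = k_fold_indices(4361, 5)
for held_out in folds:
    # Each fold is held out once; a model would be trained on the rest.
    n_train = 4361 - len(held_out)
    print("held out:", len(held_out), "used for training:", n_train)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```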

Page 9

Solution of a QSPR task: solubility in water#

training set = 2537 compounds

test set = 301 compounds

training set R2 = 0.99

out-of-bag set R2 = 0.88

test set R2 = 0.82

# Kovdienko, N.A., et al., Molecular Informatics, 2010. 29: p. 394-406.

Page 10

Leo Breiman (27.01.1928 – 07.07.2005), author of Random Forest

«Random Forest is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.»