KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

34
KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI Candida Manelfi Molecular Modelling Lab. Angelini Research Center “4th KNIME Users Group Meeting”, Zurich March 3rd 2011

Transcript of KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

Page 1: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

KNIME WORKFLOWS FOR CHEMOINFORMATICS

TASKS AT ANGELINI

Candida Manelfi

Molecular Modelling Lab.

Angelini Research Center

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011

Page 2: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

1 ANGELINI

Outline

Goal

Generate a representative subset of a proprietary library

Materials and Methods

• Fingerprints

• Similarity measure

• Subset generation methods

Process for subset selection

Subset characterization

Conclusions

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

KNIME

WORKFLOWS

Page 3: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

2 ANGELINI

Select a representative and structure-diverse sub-library of compounds of our proprietary library capable of:

fully mimicking the profile of the entire library

maximizing the structure diversity

The approach used is based on the “similar property principle”:

structurally similar molecules have similar physicochemical

properties and possibly similar biological activities

Goal

… then maximum coverage of the activity space should be

achieved by selecting a structurally diverse set of compounds.

Goal

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 4: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

3 ANGELINI

Tools

Knime

Materials

Schrodinger Knime Extensions (Canvas)

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 5: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

4 ANGELINI

KNIME workflow

Materials

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 6: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

5 ANGELINI

KNIME workflow

Materials

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 7: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

6 ANGELINI

Clustering of molecules

Diversity analysis and selection of molecules

QSAR model building

Similarity search of a database

etc…

Applications of fingerprint methods

Fingerprints are binary string representations of the molecules

as a sequence of “0” and “1”, encoding the absence or

presence of a set of structural fragments (or chemical features).

Diversity analysis and selection of molecules

Clustering of molecules

Methods

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 8: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

7 ANGELINI

-Structure keys

Each bit of a structure key maps to a specific,

predefined substructure or molecular feature

specified using a fragment dictionary.

-Hashed fingerprints

Bits are enumerated from the molecules

generating all possible fragments according to

the fingerprint type.

- May be applied to any type of molecule.

- May map two distinct fragments to the same

bit (collision).

Fingerprints: descriptions of molecules

Methods

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 9: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

8 ANGELINI

- Hashed fingerprints Linear

Dendritic

Radial (Circular fingerprints)

Molprint2D (Circular fingerprints)

Pairwise

Triplet

Torsion

- Structure keys MACCS keys

Linear Dendritic

Radial Molprint2D

Fingerprints in Canvas

Pairwise

Methods

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 10: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

9 ANGELINI

Atom Typing Schemes 1 - All atoms equivalent; all bonds equivalent.

2 - Atoms distinguished by HB acceptor/donor;

all bonds equivalent.

3 - Atoms distinguished by hybridization state;

all bonds equivalent.

4 - Atoms distinguished by functional type:

{H}, {C}, {F,Cl}, {Br,I}, {N,0}, {S}, {other};

bonds by hybridization. (Radial)

5 - Mol2 atom types; all bonds equivalent.

(MOLPRINT2D)

6 - Atoms distinguished by whether terminal,

halogen,HB acceptor/donor; bonds distinguished

by bond order.

7 - Atomic number and bond order.

8 - Atoms distinguished by ring size, aromaticity,

HB acceptor/donor, ionization potential, whether

terminal, whether halogen; bonds distinguished

by bond order.

9 - Carhart atom types (atom-pairs approach).

(Pairs, Triplet)

10 - Daylight invariant atom types; bonds

distinguished by bond order. (Linear, Dendritic,

Torsion)

FP Scaling options 0 - no scaling (default)

1 - scale counts by feature size to unity

2 - scale counts by feature size to feature size

3 - scale counts by feature size to molecule size

4 - scale squares of counts by feature size to unity

5 - scale squares of counts by feature size to feature size

6 - scale squares of counts by feature size to molecule size

7 - scale sqrt of counts by feature size to unity

8 - scale sqrt of counts by feature size to feature size

9 - scale sqrt of counts by feature size to molecule size

10 - use raw feature counts

11 - use raw feature counts squared

12 - use sqrt of raw feature counts

Metrics Buser, Cosine, Dice, Hamann, Hamming, Dixon,

Euclidean, Kulczynski, Matching PatternDifference, Tanimoto,

modifiedTan, Shape, Size, Simpson, Petke, Tversky, Yuel

Variance, Soergel, rogersT, Pearson, Minmax

8 Fingerprints 25,344 combinations!

Methods

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 11: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

10 ANGELINI

Among the wide variety of possible 2D fingerprint methods there is no

one that, a priori, consistently performed better than the others.

In the similarity searching based on 2D fingerprints it is rarely known

which fingerprint is most suitable for the particular target and query.

Several investigations into the information content of fingerprints and

atom-typing schemes revealed that no approach works better in all

cases.

The choice should be validate case by case exploring different

fingerprinting methods.

Chemical and molecular similarity are strongly dependent on how

the molecular structure is measured and described.

Choice of fingerprints

Methods

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 12: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

11 ANGELINI

- Hashed fingerprints

Linear

Dendritic

Radial (Circular fingerprints)

Molprint2D (Circular fingerprints)

Pairwise

Triplet

Torsion

- Structure keys

MACCS keys

Fingerprints in Canvas

Methods

- Combined fingerprints

Dendridic + Radial

Dendritic + Triplet

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 13: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

12 ANGELINI

Atom Typing Schemes 1 - All atoms equivalent; all bonds equivalent.

2 - Atoms distinguished by HB acceptor/donor;

all bonds equivalent.

3 - Atoms distinguished by hybridization state;

all bonds equivalent.

4 - Atoms distinguished by functional type:

{H}, {C}, {F,Cl}, {Br,I}, {N,0}, {S}, {other};

bonds by hybridization.

5 - Mol2 atom types; all bonds equivalent.

6 - Atoms distinguished by whether terminal,

halogen,HB acceptor/donor; bonds distinguished

by bond order.

7 - Atomic number and bond order.

8 - Atoms distinguished by ring size, aromaticity,

HB acceptor/donor, ionization potential, whether

terminal, whether halogen; bonds distinguished

by bond order.

9 - Carhart atom types (atom-pairs approach).

10 - Daylight invariant atom types; bonds

distinguished by bond order.

(Radial, Triplet, MOLPRINT2D)

FP Scaling options 0 - no scaling (default)

1 - scale counts by feature size to unity

2 - scale counts by feature size to feature size

3 - scale counts by feature size to molecule size

4 - scale squares of counts by feature size to unity

5 - scale squares of counts by feature size to feature size

6 - scale squares of counts by feature size to molecule size

7 - scale sqrt of counts by feature size to unity

8 - scale sqrt of counts by feature size to feature size

9 - scale sqrt of counts by feature size to molecule size

10 - use raw feature counts

11 - use raw feature counts squared

12 - use sqrt of raw feature counts

Metrics Buser, Cosine, Dice, Hamann, Hamming, Dixon,

Euclidean, Kulczynski, Matching, PatternDifference, Tanimoto,

modifiedTan, Shape, Size, Simpson, Petke, Tversky, Yuel

Variance, Soergel, rogersT, Pearson, Minmax, McConnaughey

8 Fingerprints 25,344 combinations!

10 - use raw feature counts

Tanimoto

Atom Typing Schemes 1 - All atoms equivalent; all bonds equivalent.

2 - Atoms distinguished by HB acceptor/donor;

all bonds equivalent.

3 - Atoms distinguished by hybridization state;

all bonds equivalent.

4 - Atoms distinguished by functional type:

{H}, {C}, {F,Cl}, {Br,I}, {N,0}, {S}, {other};

bonds by hybridization. (Radial)

5 - Mol2 atom types; all bonds equivalent.

(MOLPRINT2D) 6 - Atoms distinguished by whether terminal,

halogen,HB acceptor/donor; bonds distinguished

by bond order.

7 - Atomic number and bond order.

8 - Atoms distinguished by ring size, aromaticity,

HB acceptor/donor, ionization potential, whether

terminal, whether halogen; bonds distinguished

by bond order.

9 - Carhart atom types (atom-pairs approach).

(Triplet)

10 - Daylight invariant atom types; bonds

distinguished by bond order.

(Linear, Dendritic, Torsion)

Methods

0 – no scaling (default)

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 14: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

13 ANGELINI

Similarity measures are generally based on the presence and/or absence of

features in two molecules, estimating the number of matches or overlap

between them.

The Tanimoto coefficient is the most widely used coefficient for binary

fingerprints:

a : number of bits=1 present in A and absent in B

b : number of bits=1 present in B and absent in A

c : number of bits=1 common to both 0 ≤ T ≤ 1 T =

c

a + b + c

Similarity measure: Tanimoto coefficient

Methods

b c a

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 15: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

14 ANGELINI

Materials

KNIME workflow

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 16: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

15 ANGELINI

Hierarchical Clustering It is a non-overlapping clustering method. Clusters increase in size,

starting from a situation of a single compound per cluster.

Agglomerative methods start at the bottom and merge similar clusters.

A clustering algorithm divide a group of objects into clusters where

molecules within each cluster are more closely related to one

another than objects assigned to different clusters.

“Distance-based” methods: - Cluster analysis

- Dissimilarity-based compound selection

Ward’s method Clusters are formed to minimize the variance

A representative subset was selected by taking one compound

(centroid) from each cluster.

Methods

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

“Distance-based” methods:

- Cluster analysis - Dissimilarity-based compound selection

Page 17: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

16 ANGELINI

Materials

KNIME workflow

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 18: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

17 ANGELINI

Materials

KNIME workflow

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 19: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

18 ANGELINI

Materials

KNIME workflow

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 20: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

19 ANGELINI

Attempt to identify a diverse set of compounds directly.

1) The desired size of the final subset was decided.

2) A selection method to choose the first compound of the subset

was selected.

3) Dissimilarity was calculated with every other compounds in the

subset according to a dissimilarity algorithm.

4) The next compound was chosen as the most dissimilar to those in

the subset.

Most Dissimilar takes the compound with the

smallest sum of similarities

to the other molecules.

MaxMin chooses the compound with the

maximum distance to its closest

neighbour in the subset.

Methods

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

“Distance-based” methods: - Cluster analysis

- Dissimilarity-based compound selection

Page 21: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

20 ANGELINI

Materials

KNIME workflow

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 22: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

21 ANGELINI

Fingerprints calculation • dendritic

• radial

• triplet

• molprint2D

• maccs

• radial-dendritic

• triplet-dendritic

Pairwise similarity

measure

Angelini

Library

Subsets generation

Cluster analysis: 7 subsets

Dissimilarity-based method: 7 subsets

1st step: generation of several subsets

Process

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 23: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

22 ANGELINI

Self-similarity matrix calculation

and graphical representation

Angelini

Library

Subset selection

The smaller the mean Tanimoto

coefficient, the greater the

subset structure diversity.

Subsets

Mol Max Tc

AF1 0,26

AF2 0,35

AF3 0,42

AFn 0,42

Mol AF1 AF2 AF3 AFn

AF1 1 0,12 0,04 0,26

AF2 0,12 1 0,35 0,23

AF3 0,04 0,35 1 0,42

AFn 0,26 0,23 0,42 1

2nd step: subset selection by self-similarity

Process

Similarity Frequency

0.1 0,26

0.2 0,55

.. 0,19

1 0

0.2 0,55

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 24: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

23 ANGELINI

Materials

KNIME workflow

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 25: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

24 ANGELINI

Materials

KNIME workflow

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 26: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

25 ANGELINI

Materials

KNIME workflow

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 27: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

26 ANGELINI

Mean Tc

library = 0.75

0.2 < subsets < 0.4

0

0,2

0,4

0,6

0,8

1

librarymolprint2D

maccstriplet

dendritic-tripletdendritic

dendritic-radialradial

0%

3%

5%

8%

10%

13%

15%

18%

20%

Perc

enta

ge

of

co

mp

ou

nd

s

Tanimoto coefficient Fingerprints

0%

5%

10%

15%

20%

25%

30%

0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1

Tanimoto coefficient

Perc

en

tag

e o

f co

mp

ou

nd

s

librarymolprint2Dmaccs tripletdendritic-tripletdendriticdendritic-radialradial

Subsets of AF library by Cluster Analysis

Subset selection

Library

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 28: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

27 ANGELINI

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1

Tanimoto coefficient

Perc

en

tag

e o

f co

mp

ou

nd

s

librarymolprint2Dmaccstripletdendritic-tripletdendriticdendritic-radialradial

00,1

0,20,3

0,40,5

0,60,7

0,80,9

1

librarymolprint2D

maccstriplet

dendritic-tripletdendritic

dendritic-radialradial

0%

10%

20%

30%

40%

50%

60%

70%

80%

Perc

enta

ge

of

co

mp

ou

nd

s

Tanimoto coefficient Fingerprints

Mean Tc

library = 0.75

subsets = 0.1

Dendritic

-Radial Dendritic

-Radial

Subsets of AF library by Dissimilarity

Subset selection

Library

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 29: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

28 ANGELINI

Subsets of AF library by random selection

Random1-2: draw randomly

Random3: stratified sampling

by name

Random4: linear sampling

Subset selection

Library

Diverse-subset

Random-subsets

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 30: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

29 ANGELINI

Subset

physicochemical characterization

-Molecular Weight

- Log P

- Rotatable Bond

-Polar Surface Area

Process

3rd step: subset characterization

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 31: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

30 ANGELINI

Subset physicochemical profile:

Subset characterization

Compounds M.W LogP RB PSA

Subset 50-700 -2 – 8 0-14 0-180

Library 50-700 -2 – 8 0-14 0-180

Drug-like 200–500 -2 – 5 0-8 0-120

-10%

0%

10%

20%

30%

40%

50%

60%

0 100 200 300 400 500 600 700 800 900 1000

Molecular weight

Perc

en

tag

e o

f co

mp

ou

nd

s

library

dendridic-radial

Molecular weight

Subset Library

LogP

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

-4 -2 0 2 4 6 8 10 12LogP

Perc

en

tag

e o

f co

mp

ou

nd

s

library

dendridic-radial

Subset Library

Rotatable bond

0%

10%

20%

30%

40%

50%

60%

0 2 4 6 8 10 12 14 16 18 20Rotable bond

Perc

en

tag

e o

f co

mp

ou

nd

s

library

dendridic-radialSubset Library

Polar Surface Area

-10%

0%

10%

20%

30%

40%

50%

60%

0 30 60 90 120 150 180 210 240 270

Polar surface area

Perc

en

tag

e o

f co

mp

ou

nd

s

library

dendridic-radial

Subset

Library

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 32: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

31 ANGELINI

Subset chemical space profile compared to

the entire library

Subset characterization

PCA MFS map

-4.0

-3.0

-2.0

-1.0

0.0

1.0

2.0

3.0

4.0

-5 -4 -3 -2 -1 0 1 2 3 4

t[2]

t[1]

finalsubset-library_properties_AF-Asghate.M6 (PCA-X), acraf subset no outliers

t[Comp. 1]/t[Comp. 2]

Colored according to Obs ID ("SMILES")

R2X[1] = 0.422392 R2X[2] = 0.29364 Ellipse: Hotelling T2 (0.95)

acraf

acraf_subset

SIMCA-P+ 12 - 2009-12-15 14:04:52 (UTC+1)

Library

Subset Library

Subset

0

0,1

0,2

0,3

0,4

0,5

0,6

0,7

0,8

0,9

1

0 0,03 0,06 0,09 0,12 0,15 0,18

mean

ma

x

library

subset

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 33: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

32 ANGELINI

We demostrated that in general no approach works better than

the other but we identified the best exploring and validating

different fingerprinting methods.

Using knime allowed us to easily investigate different

combinations of methods.

In the particular case of our system the best choice, in terms of

self-similarity index distribution, is the Diversity Analysis using a

combination of fingerprints Dendritic-Radial with the specific

atom typing scheme Daylight.

Conclusion

Conclusions

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini

Page 34: KNIME WORKFLOWS FOR CHEMOINFORMATICS TASKS AT ANGELINI

33 ANGELINI

R. Ombrato (Head of Molecular Modelling Lab)

N. Cazzolla (Head of Chemistry Department)

The group of the Medicinal Chemistry Dept

E. Fioravanzo, A. Bassan (S-IN Soluzioni Informatiche)

And all of you for your attention!!

Acknowledgements

“4th KNIME Users Group Meeting”, Zurich March 3rd 2011 - KNIME workflows for chemoinformatics tasks at Angelini