Ultrafast Shape Recognition for Similarity Search in...

Date posted: 18-Jun-2020

  • Introduction | Ultrafast Shape Recognition | Ligand-based Virtual Screening | Future Work | Conclusions

    Ultrafast Shape Recognition for Similarity Search in Molecular Databases

    Pedro J. Ballester

    NFCR Centre for Computational Drug Discovery

    University of Oxford

    Pedro J. Ballester USR for Similarity Search 1

  • Outline

    1 Introduction

    2 Ultrafast Shape Recognition

      Foundations

      Encoding

      Comparing Molecular Shapes

      Effectiveness

      Efficiency

    3 Ligand-based Virtual Screening

      Experimental Setup

      Enrichment Plots

      USR virtual query

    4 Future Work

    5 Conclusions

  • Introduction

  • Virtual Screening

    Ligand-based Virtual Screening

    Goal: Identifying drug-like molecules likely to be biologically active.

    Principle: Molecules with similar patterns are likely to have similar biological activity.

    Template: e.g. a molecule of known biological activity.

    Strategy: Search a database of molecules for those with a pattern similar to that of the template.

  • Molecular Shape Comparison

    Molecular shape has been widely highlighted as an important pattern for which to search.

    Shape complementarity between ligand and receptor is necessary for binding.

    Additional advantage: chemical structure is not specified and therefore novel chemical scaffolds may be found.

  • Challenges

    Alignment

    Some methods require alignment of the molecules before comparing their shapes.

    Essentially: a multimodal optimisation problem with a very limited number of objective function evaluations available.

    May lead to suboptimal molecular alignment and thus errors in the comparison.

  • Challenges

    Efficiency

    Shape information regarded as difficult to encode efficiently and use in database searching (e.g. Zauhar et al. 2003).

    Increasing size of molecular databases poses a serious limitation for current shape comparison methods.

      • The more conformations, the less likely to miss molecules that can adopt the template’s shape.
      • The more compounds, the more likely to find innovative bioactive molecules.

    Consequently, the speed of molecular shape comparison methods is highly important.

  • Ultrafast Shape Recognition (USR)

    Foundations | Encoding | Comparing Molecular Shapes | Effectiveness | Efficiency

    Ballester, P.J., US Patent Application filed on 25 May 2007

    Ballester, P.J. and Richards, W.G. (2007) J Comput Chem

    Ballester, P.J. and Richards, W.G. (2007) Proc R Soc A

  • Foundations

    USR is based on the observation that the shape of a molecule is uniquely determined by the relative position of its atoms.

    Such positions are in turn determined by the set of all inter-atomic distances.

    No need for alignment or translation of the molecule, as this set of distances is independent of molecular orientation or position.
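The invariance claim can be checked numerically. The following is a minimal sketch, not code from the talk: the four-atom geometry, the rotation angle and the shift are made-up values, and the helper names are hypothetical. It rotates and translates a set of coordinates and confirms that every inter-atomic distance is unchanged.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Return all inter-atomic distances, in a fixed pair order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def rotate_z_and_translate(coords, angle, shift):
    """Rotate about the z axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]

# Hypothetical 4-atom geometry (arbitrary units), for illustration only.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0), (0.3, 0.9, 1.1)]
moved = rotate_z_and_translate(atoms, angle=0.7, shift=(4.0, -2.0, 3.5))

before = pairwise_distances(atoms)
after = pairwise_distances(moved)

# The rigid motion changes every coordinate but none of the distances.
unchanged = all(math.isclose(p, q, abs_tol=1e-9) for p, q in zip(before, after))
print(unchanged)
```

Because the distances are the same before and after the move, any descriptor built only from them needs no prior alignment step.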

  • Further Considerations

    Furthermore, values of inter-atomic distances are heavily constrained:

      • Distances between bonded atoms depend strongly on which atoms these are.
      • Other inter-atomic distances depend on the flexibility of the molecule.

    The set of all inter-atomic distances may contain more information than needed for accurate description of shape.

    Strategy: encoding shape from a subset of these distances.

  • Representation of molecular shape

    Reference Locations

    Distances from two close atoms are similar and thus contain similar information → consider sets of atomic distances from reference locations which are far from each other.

    Four reference locations: ctd, cst, fct and ftf.

    Each conformer is represented now by 4N distances.
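The slide does not expand the four abbreviations. The sketch below assumes the definitions used in the USR publications cited in this talk: ctd is the molecular centroid, cst the atom closest to ctd, fct the atom farthest from ctd, and ftf the atom farthest from fct. The function names and the toy coordinates are hypothetical and serve only to show how a conformer with N atoms yields 4N distances.

```python
import math

def centroid(coords):
    """Unweighted mean position of all atoms."""
    n = len(coords)
    return tuple(sum(p[k] for p in coords) / n for k in range(3))

def reference_locations(coords):
    """Return the four reference locations under the assumed definitions."""
    ctd = centroid(coords)
    cst = min(coords, key=lambda p: math.dist(p, ctd))  # closest atom to ctd
    fct = max(coords, key=lambda p: math.dist(p, ctd))  # farthest atom from ctd
    ftf = max(coords, key=lambda p: math.dist(p, fct))  # farthest atom from fct
    return {"ctd": ctd, "cst": cst, "fct": fct, "ftf": ftf}

def distance_sets(coords):
    """One list of N atomic distances per reference location (4N in total)."""
    refs = reference_locations(coords)
    return {name: [math.dist(p, ref) for p in coords] for name, ref in refs.items()}

# Toy linear arrangement of 4 atoms (arbitrary units), for illustration only.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
refs = reference_locations(atoms)
sets = distance_sets(atoms)
total = sum(len(d) for d in sets.values())
print(refs)
print(total)
```

For this toy input N = 4, so the four sets together hold 16 distances.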

  • Encoding

    Moments of atomic distance distributions

    But: how do we compare molecules with different N?

    Histogram of each distribution of distances has a number of well-known drawbacks:

      • Difficulty of selecting a bin size suitable for all compared molecules.
      • Relatively large storage needed for the histograms.
      • Relatively large computing cost.

    A distribution is completely determined by its moments (e.g. Hall, 1983).

    Idea: describe the distribution of atomic distances by its first moments (avoids histogram calculation).
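For concreteness, one common way to write the first three moments of a set of N atomic distances d_1, ..., d_N is given below. The slides do not state the exact normalisation, so the 1/N factors and the standardised form of the skewness are assumptions rather than the talk's own definitions.

```latex
\[
\mu_1 = \frac{1}{N}\sum_{j=1}^{N} d_j, \qquad
\mu_2 = \frac{1}{N}\sum_{j=1}^{N} \left(d_j - \mu_1\right)^2, \qquad
\mu_3 = \frac{1}{\mu_2^{3/2}} \cdot \frac{1}{N}\sum_{j=1}^{N} \left(d_j - \mu_1\right)^3
\]
```

Whatever the value of N, each set of distances is summarised by the same three numbers, which is what makes molecules with different atom counts directly comparable.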

  • USR descriptors

    12 descriptors: 4 reference locations × 3 first moments (i.e. mean, variance and skewness of each set of atomic distances).

    Excellent compromise between effectiveness and efficiency.

    Warning: if moments are poorly estimated, no reason to expect the resulting implementation of USR to be effective!
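A minimal sketch of how the 12 descriptors could be assembled is shown below. It is not the author's implementation: the population variance, the standardised skewness, the ordering ctd, cst, fct, ftf, and the function names are all assumptions, and the distance values are made up. The point it illustrates is that conformers with different numbers of atoms map to vectors of the same length.

```python
def first_three_moments(distances):
    """Mean, population variance and standardised skewness of one distance set."""
    n = len(distances)
    mean = sum(distances) / n
    variance = sum((d - mean) ** 2 for d in distances) / n
    third = sum((d - mean) ** 3 for d in distances) / n
    # Guard against a degenerate set in which all distances are equal.
    skewness = third / variance ** 1.5 if variance > 0 else 0.0
    return [mean, variance, skewness]

def usr_descriptors(distance_sets):
    """Concatenate the moments of the four sets in the order ctd, cst, fct, ftf."""
    descriptors = []
    for name in ("ctd", "cst", "fct", "ftf"):
        descriptors.extend(first_three_moments(distance_sets[name]))
    return descriptors

# Made-up distance sets for two hypothetical conformers with N = 4 and N = 6 atoms.
names = ("ctd", "cst", "fct", "ftf")
small = {name: [1.0, 2.0, 3.0, 4.0] for name in names}
large = {name: [1.0, 1.0, 2.0, 3.0, 5.0, 6.0] for name in names}

print(len(usr_descriptors(small)), len(usr_descriptors(large)))  # 12 12
```

The warning on the slide applies here too: a different estimator for the variance or skewness changes the descriptor values, so query and database descriptors must be computed with the same definitions.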

  • Molecular Shape Comparison

    USR similarity score

    Score to quantify the similarity between the query (q) and the ith database conformer.

    \[ S_{qi} = \frac{1}{1 + \frac{1}{12}\sum_{l=1}^{12} \left| M_l^{q} - M_l^{i} \right|} \]
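The score is the inverse of one plus the mean absolute difference between the two 12-descriptor vectors, so it equals 1 for identical descriptors and decreases towards 0 as they diverge. The sketch below is a direct transcription of that formula; the function name and the example vectors are hypothetical.

```python
def usr_similarity(query, candidate):
    """Similarity score between two 12-descriptor vectors, in the range (0, 1]."""
    if len(query) != 12 or len(candidate) != 12:
        raise ValueError("expected 12 descriptors per conformer")
    mean_abs_diff = sum(abs(q - c) for q, c in zip(query, candidate)) / 12
    return 1.0 / (1.0 + mean_abs_diff)

# Made-up descriptor vectors, for illustration only.
query = [1.0] * 12
identical = list(query)
shifted = [2.0] * 12  # every descriptor differs from the query by 1

print(usr_similarity(query, identical))  # 1.0
print(usr_similarity(query, shifted))    # 0.5
```

Ranking a database then amounts to computing this score for each stored conformer and sorting in descending order.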

  • Comparing Molecular Shapes: Example 1

    [Figure: example of a molecular shape comparison; image not recovered in the transcript.]

  • Query 4 on a database with 2.5 million compounds

    [Figure: the four top-ranked hits (1st to 4th) retrieved for Query 4 by USR and by ESshape3D.]

  • Most Important Feature of USR

    Efficiency

    Compare USR with three state-of-the-art methods: ESshape3D, Shape Signatures and ROCS.

    Precalculating descriptors for a database (only once; in s/c):

    USR (1.18 · 10⁻⁴ s/c; Intel Core2 2.0 GHz).

    ESshape3D (9.15 · 10⁻⁴ s/c; Intel Core2 2.0 GHz).

    Shape Signatures (50.82 s/c; 450 MHz Intel Pentium III).

    ROCS (none).

    As many queries are carried out, the important performance measure is how many conformers can be compared per second.
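
    To give a feel for these figures, the sketch below scales the quoted costs to a whole database. It reads "s/c" as seconds per conformer, which is an assumption, and uses the 666 892 conformers of the DrugBank-3D set described later in the talk as an example size. The timings were measured on different hardware, so the output is only an order-of-magnitude guide.

```python
# Precalculation costs quoted on the slide, in seconds per conformer
# (interpreting "s/c" this way is an assumption of this sketch).
SECONDS_PER_CONFORMER = {
    "USR": 1.18e-4,
    "ESshape3D": 9.15e-4,
    "Shape Signatures": 50.82,
}


def precalc_seconds(n_conformers):
    """Estimated one-off descriptor precalculation time per method, in seconds."""
    return {name: cost * n_conformers
            for name, cost in SECONDS_PER_CONFORMER.items()}


if __name__ == "__main__":
    for name, seconds in precalc_seconds(666_892).items():
        print(f"{name}: {seconds:,.1f} s ({seconds / 86_400:.2f} days)")
```

    For this database size the estimate is roughly 79 s for USR and 610 s for ESshape3D, against about 392 days for Shape Signatures on its much older processor.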

  • Efficiency Comparison with Descriptor-based Methods

    USR is 1 546 times faster than ESshape3D.

    USR is 2 038 times faster than Shape Signatures.

  • Efficiency Comparison with Superposition Methods

    USR is 14 238 times faster than ROCS.

    Based on ROCS’s reported comparison rate on a modern workstation (USR on a 2.93 GHz Intel Core2 processor).

  • Ligand-based Virtual Screening

    Ballester, P.J., Finn, P.W. and Richards, W.G. (2007?)

  • Virtual Screening Validation

    Aim

    How good will the method be at identifying molecules with a given activity?

    Retrospective virtual screening experiment.

    DrugBank-3D Test Database

    Publicly available resource: DrugBank (http://redpoll.pharmacy.ualberta.ca/drugbank/index.html).

    Input: set of 3 764 chemical structures formed by FDA-approved (708) and experimental (3 056) drugs.

    MOE’s conformer generator → an average of about 200 conformations per compound (3 330 chemical structures).

    DrugBank-3D: 666 892 conformers in 3D MDL SD format.

  • Query Selection

    Adopted Procedure (Nicholls et al., 2004)

    For each activity class:

    1 Consider the lowest energy conformer of each active compound.

    2 Perform hierarchical agglomerative clustering on the USR similarity matrix of these conformers (S_threshold = 0.75).

    3 Main cluster ≡ that with highest number of actives.

    4 Identify the closest of these conformers to the centroid of the main cluster (consensus shape template).

    5 Use shape template as query against the whole database.
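
    Steps 2 to 4 can be sketched as follows. The slide does not state the linkage criterion, so this example assumes complete linkage (two clusters merge only while their least similar pair of members still scores at least the threshold). It also assumes that the centroid is the component-wise mean of the descriptor vectors and that closeness is measured with the USR score. All function names are hypothetical.

```python
def usr_similarity(a, b):
    """USR score between two descriptor vectors of equal length."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))


def cluster(descs, threshold=0.75):
    """Agglomerative clustering on the similarity matrix (complete linkage assumed)."""
    sim = [[usr_similarity(a, b) for b in descs] for a in descs]
    clusters = [[i] for i in range(len(descs))]

    def linkage(c1, c2):
        # Complete linkage: similarity of the least similar cross-cluster pair.
        return min(sim[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best, pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(clusters[a], clusters[b])
                if s >= best:
                    best, pair = s, (a, b)
        if pair is None:  # no pair of clusters reaches the threshold
            break
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters


def select_query(descs, threshold=0.75):
    """Return (query index, main cluster members, centroid of the main cluster)."""
    main = max(cluster(descs, threshold), key=len)
    dim = len(descs[0])
    centroid = [sum(descs[i][k] for i in main) / len(main) for k in range(dim)]
    query = max(main, key=lambda i: usr_similarity(descs[i], centroid))
    return query, main, centroid
```

    Here `descs` would hold one descriptor vector per active compound, taken from its lowest energy conformer (step 1). The returned index identifies the consensus shape template that is then used as the query in step 5.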

  • Query Selection

    Used Activity Classes

    Activity   Name                             # of Actives (A_i)
    TK         Thymidine Kinase                 13
    HH1R       Histamine H1 Receptor            41
    COX-2      Cyclooxygenase-2                 28
    NM         Neuraminidase                    8
    5-HT-2A    5-HT-2A Receptor                 15
    ER         Estrogen Receptor                24
    PR         Progesterone Receptor            12
    TKTL       Transketolase                    3
    AT1        Type-1 Angiotensin II Receptor   8
    HIV-1      HIV-1 Protease                   6

  • Activity Class: 5-HT-2A Receptor

    [Figure: illustration for the 5-HT-2A receptor activity class; image not recovered in the transcript.]

  • Selected Queries for each Activity

    Activity   Query         # of Heavy Atoms (N)
    TK         EXPT01835-1   17
    HH1R       APRD00587-1   19
    COX-2      APRD01060-1   19
    NM         EXPT00332-1   20
    5-HT-2A    APRD00033-1   22
    ER         APRD00754-1   23
    PR         APRD00941-1   23
    TKTL       EXPT02273-1   26
    AT1        APRD00052-1   30
    HIV-1      APRD00623-1   49

  • Shape of Selected Queries

    [Figure: shapes of the ten selected queries, in two rows of panels: TK, HH1R, COX-2, NM, 5-HT-2A; ER, PR, TKTL, AT1, HIV-1.]

  • Enrichment Plot: 5-HT-2A Receptor

    [Figure: enrichment E(x%) against top x% of the ranked database (x from 0 to 5, E from 0 to 50) for the 5-HT-2A query APRD00033-1; curves: USR and ESshape3D.]

  • Comparing USR and ESshape3D on the 10 Activities

    Mean Enrichment   Top 1%   Top 3%   Top 5%
    USR               25.5     9.9      7.5
    ESshape3D         15.8     8.0      6.0
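
    The slides do not spell out how E(x%) is computed. The sketch below assumes the usual enrichment factor: the fraction of actives among the top x% of the ranked list, divided by the fraction of actives in the whole database. The function name, the truncating rule for the size of the top x%, and the input layout are all assumptions of this example.

```python
def enrichment(ranked_ids, actives, top_percent):
    """Enrichment factor at the top `top_percent` of a ranked list (best first)."""
    n_total = len(ranked_ids)
    if n_total == 0 or not actives:
        raise ValueError("need a non-empty ranking and at least one active")
    # Size of the selected top fraction; truncation is a choice made for this sketch.
    n_top = max(1, int(n_total * top_percent / 100))
    hits = sum(1 for cid in ranked_ids[:n_top] if cid in actives)
    return (hits / n_top) / (len(actives) / n_total)
```

    Under this definition a value of 1 corresponds to random selection, so a mean enrichment of 25.5 at the top 1% would mean that actives are about 25 times more concentrated there than in the database as a whole.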

  • Enrichment Plot: Thymidine Kinase

    [Figure: enrichment E(x%) against top x% of the ranked database (x from 0 to 5, E from 0 to 60) for the Thymidine Kinase query EXPT01835-1; curves: ESshape3D and USR.]

  • A Possible Explanation

    [Figure: the Thymidine Kinase enrichment plot (query EXPT01835-1; E(x%) against top x%, x from 0 to 5, E from 0 to 60) with the ESshape3D and USR curves, annotated with "+" and "x" markers whose meaning is not recoverable from the transcript.]

  • A Possible Explanation

    [Figure: the same annotated Thymidine Kinase enrichment plot (query EXPT01835-1) with a third curve, USR(C1-CTD), added to ESshape3D and USR.]

  • Enrichment Plot: 5-HT-2A Receptor

    [Figure: enrichment E(x%) against top x% of the ranked database (x from 0 to 5, E from 0 to 50) for the 5-HT-2A query APRD00033-1; curves: USR, ESshape3D and USR(C1-CTD).]

  • Comparing USR and ESshape3D on the 10 Activities

    Mean Enrichment   Top 1%   Top 3%   Top 5%
    USR-CTD           36.9     14.4     10.2
    USR               25.5     9.9      7.5
    ESshape3D         15.8     8.0      6.0

  • Future Work

    In Collaboration with:

    GlaxoSmithKline (Harlow, UK)

    Pfizer (Groton, USA)

    University of Oxford (Dept. of Pharmacology)

  • Topics for Future Research

    Virtual Screening (VS): Validating generalisation ability of VS methods using biological screening data.

    Prospective VS on selected activities.

    Clustering molecular databases in terms of shape:

    USR-based clustering on multi-million databases.

    Screening strategies for Docking and HTS.

    Combining USR with other VS methods (electrostatics).

    Conclusions

    A novel molecular shape comparison approach (USR) has been proposed.

    Effective at identifying similarly shaped conformers in a database.

    Effective in retrospective virtual screening across a set of diverse activities:

    Adopted query selection procedure works well.

    Proposed USR virtual query improves performance.

    These results suggest that USR will be effective in prospective virtual screening.

    Extremely fast: more than 3 orders of magnitude faster than the fastest existing methods!

    USR could be adapted to other shape recognition problems (e.g. an internet search engine for 3D shapes).

    Overall, this work has attracted the attention of the media.

    Acknowledgements

    Graham Richards (University of Oxford): feedback.

    Paul Finn (Inhibox Ltd.): feedback and preparing molecular databases.

    US National Foundation of Cancer Research: funding.

    Thank You
