Molecular Representation, Similarity and Search

Rajarshi Guha NIH Chemical Genomics Center

December 3rd, 2009

Outline

•  How can we represent molecules on a computer?

•  How do we decide when molecules are similar?

•  What can we do using similarity?

Molecular Representations

•  Explicit
   – Indicate what the atoms are and which atoms are connected to which other atoms
   – Differing levels of explicitness
      •  Do we need to show hydrogens?
      •  Do we need to indicate actual bonds?

•  Implicit
   – Usually very compact (e.g., SMILES)
   – Need to know the assumptions involved
      •  In SMILES, no bond symbol between two atoms implies a single (or aromatic) bond

2D Representations - Topological

•  (Usually) indicates what types of atoms are present

•  Indicates which atoms are connected to which other atoms

•  No indication of where these atoms are located in space

•  Very easy to store and manipulate
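
A topological representation can be sketched as a list of atoms plus a list of bonds. The example below is only an illustration: the molecule (chloroethane, SMILES CCCl, hydrogens left implicit) and the helper name `neighbors` are assumptions, not something taken from the slides.

```python
# Toy topological (2D) representation: atom types and connectivity, no coordinates.
# Chloroethane (SMILES: CCCl) with hydrogens left implicit (illustrative choice).
atoms = ["C", "C", "Cl"]            # atom index -> element symbol
bonds = [(0, 1, 1), (1, 2, 1)]      # (atom index, atom index, bond order)


def neighbors(atom_index, bonds):
    """Return the indices of the atoms bonded to atom_index."""
    result = []
    for a, b, _order in bonds:
        if a == atom_index:
            result.append(b)
        elif b == atom_index:
            result.append(a)
    return result


# Elements attached to the central carbon (atom 1)
central_neighbors = [atoms[i] for i in neighbors(1, bonds)]
print(central_neighbors)
```

Nothing here says where the atoms sit in space, which is why this form is so cheap to store and manipulate.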

3D Representations - Geometric

•  Similar to 2D, but now has explicit 3D coordinates

•  More complex
   – A molecule can have multiple sets of 3D coordinates (conformations)
   – Which is the correct one?

•  Takes more space to store and is time-consuming to generate

Molecular Similarity

•  Many, many ways to determine how similar two molecules are

•  A simple, manual approach is to look at a 2D depiction

•  But what are we looking at?

Willett, J Chem Inf Comput Sci, 1998, 38, 983-996
Sheridan et al, Drug Discov Today, 2002, 7, 903-911

Molecular Similarity

•  But 2D can be misleading

•  Molecules that look identical in 2D are not necessarily identical in 3D

How Do We Quantify Similarity?

•  1D similarity can be computed just by using SMILES, similar to sequence alignment
   – LINGO, Holograms

•  2D similarity is commonly measured using binary fingerprints
   – Key-based fingerprints
   – Hashed fingerprints
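
As a rough illustration of the hashed idea, the sketch below folds fragments into a fixed number of bit positions. It makes two loud assumptions: real hashed fingerprints enumerate paths in the molecular graph, whereas this toy uses SMILES substrings as a stand-in, and the function name and parameters (`hashed_fingerprint`, `n_bits`, `max_len`) are hypothetical.

```python
import zlib


def hashed_fingerprint(smiles, n_bits=64, max_len=4):
    """Return the set of on-bit positions for a toy hashed fingerprint.

    Assumption: SMILES substrings stand in for the graph paths that a
    real hashed fingerprint would enumerate.
    """
    on_bits = set()
    for length in range(1, max_len + 1):
        for start in range(len(smiles) - length + 1):
            fragment = smiles[start:start + length]
            # crc32 is deterministic across runs, unlike Python's built-in str hash
            on_bits.add(zlib.crc32(fragment.encode("ascii")) % n_bits)
    return on_bits


fp_ethanol = hashed_fingerprint("CCO")
fp_propanol = hashed_fingerprint("CCCO")

# Every fragment of "CCO" also occurs in "CCCO", so its bits are a subset
print(fp_ethanol <= fp_propanol)
```

Because different fragments can hash to the same position, a set bit does not identify a unique fragment. A key-based fingerprint avoids this by assigning each bit to a predefined structural feature.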


•  Given two fingerprints, we can then calculate a variety of similarity functions

•  Tanimoto is the most commonly used
   – Ranges from 0 to 1
   – The number of bits set in both fingerprints, divided by the number of bits set in either
   – See Daylight for more details

•  Can also be extended to 3D similarities
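
For binary fingerprints, the Tanimoto coefficient can be written as Tc = c / (a + b - c), where a and b are the numbers of bits set in each fingerprint and c is the number set in both. A minimal sketch, assuming fingerprints are stored as Python sets of on-bit positions (the example bit positions are made up):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit positions."""
    common = len(fp_a & fp_b)
    total = len(fp_a) + len(fp_b) - common
    # Two empty fingerprints have no defined similarity; 0.0 is a convention chosen here
    return common / total if total else 0.0


fp_query = {1, 4, 7, 9}
fp_candidate = {1, 4, 8, 9, 12}
print(tanimoto(fp_query, fp_candidate))  # 3 common bits / 6 distinct bits = 0.5
```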


•  3D similarity is more complex

•  Most methods require you to align two 3D structures

•  Then determine the “volume overlap”
   – To what extent do the two structures occupy the same region in space?

•  The best-known tool for this is ROCS
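
To make "volume overlap" concrete, here is a deliberately crude sketch that counts shared grid cells. It is not how ROCS computes overlap. It assumes the two structures are already aligned, treats every atom as a hard sphere with the same radius (1.7 Å, an arbitrary choice), and all names and coordinates are hypothetical.

```python
import itertools


def occupied_cells(atoms, radius=1.7, spacing=0.5):
    """Return the grid cells whose centre lies within `radius` of any atom."""
    cells = set()
    reach = int(radius / spacing) + 1
    for x, y, z in atoms:
        ix, iy, iz = round(x / spacing), round(y / spacing), round(z / spacing)
        for dx, dy, dz in itertools.product(range(-reach, reach + 1), repeat=3):
            cell = (ix + dx, iy + dy, iz + dz)
            cx, cy, cz = (c * spacing for c in cell)
            if (cx - x) ** 2 + (cy - y) ** 2 + (cz - z) ** 2 <= radius ** 2:
                cells.add(cell)
    return cells


def shape_tanimoto(atoms_a, atoms_b):
    """Shared volume divided by total volume, both counted in grid cells."""
    cells_a, cells_b = occupied_cells(atoms_a), occupied_cells(atoms_b)
    shared = len(cells_a & cells_b)
    return shared / (len(cells_a) + len(cells_b) - shared)


# Two-atom toy "molecule" and a copy shifted by 1 Å along x (made-up coordinates)
mol = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in mol]
print(shape_tanimoto(mol, shifted))
```

The score depends on how the structures were aligned beforehand, so in practice the alignment step is the hard part.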


•  Property-based similarity will use various physical properties or biological activities
   – If two molecules exhibit similar activity across multiple cell lines, they are likely similar
   – If two molecules have a set of similar physical properties (computed or experimental), they are likely similar
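
One way to compare activity profiles is to correlate them. The sketch below uses a Pearson correlation over made-up activity values in four hypothetical cell lines. The choice of Pearson correlation is an assumption for illustration, and other measures (for example Euclidean distance on scaled properties) would be equally reasonable.

```python
import math


def pearson(profile_a, profile_b):
    """Pearson correlation between two equal-length activity profiles."""
    n = len(profile_a)
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    var_a = sum((a - mean_a) ** 2 for a in profile_a)
    var_b = sum((b - mean_b) ** 2 for b in profile_b)
    return cov / math.sqrt(var_a * var_b)


# Made-up activities in four cell lines (not real data)
mol_a = [0.9, 0.1, 0.8, 0.2]
mol_b = [1.8, 0.2, 1.6, 0.4]   # same pattern as mol_a, twice the magnitude
mol_c = [0.1, 0.9, 0.2, 0.8]   # opposite pattern to mol_a

print(pearson(mol_a, mol_b), pearson(mol_a, mol_c))
```

Here mol_a and mol_b correlate strongly even though their absolute values differ, which is the sense in which their profiles are "similar".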

2D or 3D?

2D
•  Fast and easy
•  Not always biologically relevant
•  But surprisingly useful

3D
•  More “accurate”
•  Computationally more expensive
•  Which conformation is the correct one?

Different representations and similarity methods will, in general, lead to different results (hits)

What Can We Do With Similarity?

•  Searching databases – exact substructure searching is not always useful

•  Using the benzodiazepine substructure would miss midazolam

•  But the 2D similarity between these two structures is relatively high

[Figure: 2D structures of the query and midazolam]

But 2D Only Goes So Far …

•  Using the traditional benzodiazepine core won’t let you retrieve atypical benzodiazepines

•  In this case, the 2D similarity between this and the usual core is low

•  But in terms of shape they are quite similar

•  (Ambien occupies the same region of the GABA receptor as traditional benzodiazepines)

[Figure: structure of Ambien]

Virtual Screening

•  In many cases the question we’re asking is: “Find me other active molecules”

•  A good starting point is to look for structurally similar molecules

•  We assume that molecules with similar structures will exhibit similar activities
   – J. Med. Chem., 2002, 45, 4350-4358
   – The basis of predictive modeling
   – But lots and lots of exceptions!

Sheridan et al, Drug Discov Today, 2002, 7, 903-911

Virtual Screening

•  2D similarity is a cheap, easy and fast way to perform this type of task

•  Can “screen” databases of many millions of molecules extremely rapidly

•  Usually only consider “very similar” (Tc >= 0.85) hits

•  It works …
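
A 2D similarity screen reduces to scoring every database fingerprint against the query and keeping those above the cutoff. The sketch below is a minimal, linear-scan version under stated assumptions: fingerprints are sets of on-bit positions, and the database contents and names (`similarity_search`, `mol_1`, ...) are made up for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit positions."""
    common = len(fp_a & fp_b)
    total = len(fp_a) + len(fp_b) - common
    return common / total if total else 0.0


def similarity_search(query_fp, database, threshold=0.85):
    """Return (name, Tc) pairs with Tc >= threshold, most similar first."""
    hits = []
    for name, fp in database.items():
        score = tanimoto(query_fp, fp)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


query = set(range(0, 20))
database = {
    "mol_1": set(range(0, 20)),    # identical bits to the query
    "mol_2": set(range(0, 18)),    # 18 shared bits / 20 distinct bits = 0.9
    "mol_3": set(range(10, 30)),   # 10 shared bits / 30 distinct bits
}

hits = similarity_search(query, database)
print([name for name, _score in hits])  # ['mol_1', 'mol_2']
```

Each comparison is a handful of set (or bitwise) operations, which is why millions of molecules can be screened quickly.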

Virtual Screening

•  But can be of limited use if used naively
   – Similarity is usually supplanted by machine learning
   – Still, the only way out if there is no receptor and only a few (or a single) known actives

•  Main drawback is that the hits are structurally similar
   – D’oh!
   – Not great if you’re trying to find a molecule that someone else hasn’t already developed

Scaffold Hopping

•  Ideally, we’d like to find a molecule that is as active as our query, but with a different core structure

•  Solving this usually requires us to go to 3D
   – Structures can differ in connectivity
   – But exhibit similar shapes

•  Being able to do this in 2D is an interesting research topic (cf. reduced graphs)

Bergmann et al, J Chem Inf Model, 2009, 49, 658-669

Dissimilarity & Library Design

•  Chemical libraries form the basis of high throughput screening and other discovery methods

•  Sizes can range from a few hundred molecules to millions (or billions for virtual libraries)

•  In most cases, we want to cover as much of chemical space as possible
   – How do we compare coverage?
   – So if we want to add new molecules, how do we choose them?

Dissimilarity & Library Design

•  Brute force
   – Evaluate similarity between new molecules and the library and keep those with low Tc

•  Sophisticated
   – Use statistical techniques to effectively sample different regions of a chemical space
   – Fill in the “holes”
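
The brute-force approach can be sketched as follows: score each candidate against every library member and keep it only if its nearest neighbour is dissimilar enough. The cutoff of 0.4, the function name `select_dissimilar`, and the fingerprints are all assumptions chosen for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit positions."""
    common = len(fp_a & fp_b)
    total = len(fp_a) + len(fp_b) - common
    return common / total if total else 0.0


def select_dissimilar(candidates, library, max_tc=0.4):
    """Keep candidates whose highest Tc to any library member is at most max_tc."""
    selected = {}
    for name, fp in candidates.items():
        nearest = max((tanimoto(fp, lib_fp) for lib_fp in library.values()), default=0.0)
        if nearest <= max_tc:
            selected[name] = fp
    return selected


library = {"lib_1": {1, 2, 3, 4}, "lib_2": {5, 6, 7, 8}}
candidates = {
    "new_1": {1, 2, 3, 9},       # Tc to lib_1 = 3/5, too similar
    "new_2": {10, 11, 12, 13},   # shares no bits with the library
    "new_3": {4, 5, 20, 21},     # Tc = 1/7 to each library member
}

print(list(select_dissimilar(candidates, library)))  # ['new_2', 'new_3']
```

This version compares candidates only with the existing library, not with each other, so two near-identical candidates could both be kept. Adding each accepted candidate to the library before scoring the next one would be one way to address that.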

Summary

•  Similarity (and dissimilarity) are fundamental concepts
   – Simple on the outside, complex on the inside

•  A wide variety of methods available
   – Need to consider pros/cons in terms of computational expense, chemical utility, …

•  Visualizing similarity is useful

•  Many problems can be recast in terms of similarity or dissimilarity
