
Molecular Similarity in Medicinal Chemistry

Miniperspective

Gerald Maggiora,*,†,‡ Martin Vogt,§ Dagmar Stumpfe,§ and Jürgen Bajorath*,§

†College of Pharmacy and BIO5 Institute, University of Arizona, 1295 North Martin, P.O. Box 210202, Tucson, Arizona 85721, United States
‡Translational Genomics Research Institute, 445 North Fifth Street, Phoenix, Arizona 85004, United States
§Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany

ABSTRACT: Similarity is a subjective and multifaceted concept, regardless of whether compounds or any other objects are considered. Despite its intrinsically subjective nature, attempts to quantify the similarity of compounds have a long history in chemical informatics and drug discovery. Many computational methods employ similarity measures to identify new compounds for pharmaceutical research. However, chemoinformaticians and medicinal chemists typically perceive similarity in different ways. Similarity methods and numerical readouts of similarity calculations are probably among the most misunderstood computational approaches in medicinal chemistry. Herein, we evaluate different similarity concepts, highlight key aspects of molecular similarity analysis, and address some potential misunderstandings. In addition, a number of practical aspects concerning similarity calculations are discussed.

■ INTRODUCTION

Molecular similarity is one of the most heavily explored and exploited concepts in chemical informatics and is also a central theme in medicinal chemistry.1−3 Many computational similarity methods have been (and continue to be) introduced.1,2 Why do we apparently care so much about similarity in the molecular world? Simply put, comparing compounds and their properties, especially activity, is one of the most frequent exercises in chemical and pharmaceutical research but often for rather different reasons. In medicinal chemistry, questions are asked such as the following: Can a similar follow-up candidate compound be identified for a liability-associated lead? Is a candidate too similar to a competitor’s compound to establish an intellectual property position? How can we complement our compound collection with different (i.e., dissimilar) compounds? Providing answers to these and other questions requires the assessment of similarity (or dissimilarity) in one way or another.

As will be discussed throughout this review, three basic components are required to construct suitable computational measures of molecular similarity: (1) a representation whose components encode the molecular and/or chemical features relevant for similarity assessment, (2) a potential weighting of representation features, and (3) a similarity function (also called a similarity coefficient) that combines the information contained in the representations to yield an appropriate similarity value. This value usually lies between 0 and 1, where 1 results from the complete identity of the molecular representations (but not necessarily the compounds). Representation features typically are different types of molecular descriptors. A weighting scheme will be required if contributions of these features should be differently prioritized for similarity assessment (otherwise, if all selected features should be equally considered, no weighting is required).

Applications in chemical informatics that involve systematic comparisons of compounds and the quantification of their similarity provide a stimulating intellectual setting for method development. Quantitative readouts of similarity are also of practical relevance in, for example, the identification of new candidate compounds on the basis of known actives via virtual screening,4,5 for which similarity searching is one of the most popular approaches.6,7
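The three components named above (representation, optional weighting, and similarity function) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the feature labels are invented rather than taken from any real descriptor set, the function name `weighted_similarity` is hypothetical, and a weighted Tanimoto-style ratio is used only as one possible similarity function. The sketch also shows that a value of 1 reflects identical representations, not necessarily identical compounds.

```python
from typing import Dict, FrozenSet, Optional

# (1) Representation: each compound is reduced to a set of feature labels.
# The labels below are invented for illustration and ignore stereochemistry.
Features = FrozenSet[str]

R_IBUPROFEN: Features = frozenset({"aryl_ring", "carboxylic_acid", "isobutyl", "methyl"})
S_IBUPROFEN: Features = frozenset({"aryl_ring", "carboxylic_acid", "isobutyl", "methyl"})
PARACETAMOL: Features = frozenset({"aryl_ring", "amide", "phenol_oh", "methyl"})


def weighted_similarity(
    rep_a: Features,
    rep_b: Features,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """(3) Similarity function: weighted shared features over weighted union.

    (2) Weighting: features missing from `weights` default to 1.0, so passing
    no weights treats all features equally. Weights are assumed non-negative.
    """
    w = weights or {}

    def total(features: Features) -> float:
        return sum(w.get(f, 1.0) for f in features)

    union = total(rep_a | rep_b)
    if union == 0:
        # Assumed convention: two empty (hence identical) representations.
        return 1.0
    return total(rep_a & rep_b) / union


if __name__ == "__main__":
    # Different compounds, identical representations: similarity is 1.
    print(weighted_similarity(R_IBUPROFEN, S_IBUPROFEN))
    # Equal weighting versus emphasizing one feature changes the value.
    print(weighted_similarity(R_IBUPROFEN, PARACETAMOL))
    print(weighted_similarity(R_IBUPROFEN, PARACETAMOL, {"carboxylic_acid": 3.0}))
```

Keeping the weights as a separate, optional argument mirrors the point made in the text that weighting is only needed when features should be prioritized differently.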

Why is similarity assessment a complicated problem? Two compounds that share a common substructure can be detected unambiguously, or all compounds sharing this substructure can be retrieved from a compound database. However, as illustrated in Figure 1, it cannot be said with certainty if two compounds are similar to each other, what their degree of similarity might be, and how similarity should be assessed. In this case, the catch is that it is difficult to rationalize relationships that are principally subjective in nature. First and foremost, similarity, like beauty, is more or less in the eye of the beholder. The difficulty of the problem increases further when attempting to describe similarity relationships in a formally consistent manner and to quantify them with the aid of computational methods, as further detailed below. Although similarity is difficult to rationalize and quantify, computational decision support in similarity assessment is nevertheless often requested in medicinal chemistry; unfortunately, it fails more often than not. Why is this so? Herein, different similarity concepts and computational approaches for similarity assessment are discussed. In addition, an attempt is made to rationalize why there is often a discrepancy between computational and medicinal chemical views of similarity and to address some common misunderstandings. Finally, the use and interpretation of similarity calculations in the practice of medicinal chemistry are discussed.

Received: September 12, 2013
Published: October 23, 2013

Perspective

pubs.acs.org/jmc

© 2013 American Chemical Society | dx.doi.org/10.1021/jm401411z | J. Med. Chem. 2014, 57, 3186−3204

■ DO SIMILAR STRUCTURES HAVE SIMILAR PROPERTIES?

In the context of a seminal book publication8 that appeared in the early 1990s when molecular similarity analysis first became popular, the similarity property principle (SPP) emerged, which stated that similar compounds should have similar properties, the most frequently studied property being biological activity. Although this fundamental principle sounds simple enough, it is very difficult to capture methodologically. At the heart of the problem is the requirement to clearly define and consistently account for similarity. As illustrated in Figure 2, compounds that might not be considered similar often share similar activity (horizontal compound relationship) or other property values. In contrast, compounds that likely would be considered very similar might not do so (vertical compound relationship), clearly illustrating the limitations of the SPP. Structure−activity relationship (SAR) discontinuity, i.e., small chemical modifications that lead to significant changes in biological activity, represents a major limitation of the SPP. The extreme form of SAR discontinuity is provided by “activity cliffs”.9−11

A key aspect associated with the SPP that strongly influences nearly all considerations of similarity in chemical informatics and medicinal chemistry is that molecular similarity values are rarely of interest per se. Rather, they are used as a basis for correlating similarity, however assessed, with compound-dependent properties such as biological activity. Despite its fundamental importance, this aspect is surprisingly often not considered in computational similarity analysis.

Figure 1. Similarity perception and concepts. Two exemplary vascular endothelial growth factor receptor 2 ligands are shown, and different ways to assess their similarity are illustrated.


■ SIMILARITY HAS MANY DIFFERENT MEANINGS

It is evident that similarity is a widely used concept that is of relevance for recognizing and organizing all components of the physical environment as well as many other aspects of life. However, even in the more narrowly confined molecular world, similarity may have many different meanings or interpretations depending on our individual perspective. Hence, if the ultimate aim is to formally describe similarity in a consistent manner despite its intrinsic limitations, it is of critical importance to first distinguish between different similarity criteria and concepts, as illustrated in Figure 1.

Chemical or Molecular Similarity? Although the terms chemical and molecular similarity are often used synonymously, this may not be entirely accurate. Chemical similarity is based primarily on the physicochemical characteristics of compounds (e.g., solubility, boiling point, log P, molecular weight, electron densities, dipole moments, etc.) while molecular similarity focuses primarily on the structural features (e.g., shared substructures, ring systems, topologies, etc.) of compounds and their representation. Physicochemical properties and structural features are typically accounted for by different types of descriptors. Such descriptors are generally defined as mathematical functions or models of chemical properties or molecular structure. For chemical similarity assessment, reaction information and different functional groups can also be considered. In the current work, the focus is more on molecular than chemical similarity.

2D versus 3D Similarity. Similarity can be evaluated on the basis of 2D and 3D molecular representations. 2D similarity methods rely on information deduced from molecular graphs. Direct graph comparisons12 and graph similarity calculations are computationally demanding and not widely applied in molecular similarity analysis at present. By contrast, molecular descriptors that capture graph information such as fragment13 or topological atom environment fingerprints14 are very popular. Fingerprints are generally defined as bit string13 or feature set14 representations of molecular structure and properties. Such molecular representations can be efficiently compared computationally, thus enabling similarity calculations on a large scale. Because compounds are inherently three-dimensional and their molecular conformations generally have higher information content than their corresponding molecular graphs, one might anticipate that 3D similarity, which involves the comparison of molecular conformations and associated properties,15,16 should be generally preferred to 2D similarity. However, this is not the case for two principal reasons. First, chemists are trained on the basis of molecular graphs (i.e., 2D structural representations) and in general are more comfortable with basing their considerations on graphs than on the 3D structures of compounds. Molecular graphs typically used by chemists often also contain conformational and stereochemical information. Second, given the uncertainties associated with identifying biologically active conformations in vast conformational ensembles of test compounds, 2D approaches are typically more robust, despite their relative simplicity, and often yield superior results in SAR analysis and activity prediction.17,18 Many current similarity methods preferentially utilize 2D molecular representations; most, however, do not contain any stereochemical information, which limits their ability to properly treat enantiomeric compounds. Since such compounds have identical atom connectivity, their similarity values will be unity if stereoinsensitive molecular representations are used. Furthermore, as will be discussed below in detail, similarity calculations on the basis of 2D molecular representations have a number of other intrinsic limitations.

In the following, we will base our discussion of similarity calculations and similarity measures on 2D approaches, in particular, fingerprint similarity searching, for several reasons. As pointed out above, chemists are generally more familiar with 2D than 3D representations of compounds and consider similarity mostly on the basis of 2D molecular graphs. Furthermore, many of the conclusions drawn from the analysis of simple similarity searching readily apply to more complex similarity methods. In this context, our preference for 2D similarity assessment should not be interpreted as a disregard of 3D similarity concepts and methods. Given the medicinal chemistry focus of our presentation, we mostly adhere to 2D similarity considerations herein.

Figure 2. Similarity versus activity. Three vascular endothelial growth factor receptor 2 ligands are shown that represent different (vertical vs horizontal) similarity−activity (potency) relationships.

Molecular versus Biological Similarity. Another similarity concept that requires consideration is the biological similarity of compounds, which departs from the conceptual framework of the SPP. Instead, the usual structural or physicochemical property descriptors are replaced by the activities of the compounds against a panel of reference targets, generally proteins, that provide “biological signatures”19,20 analogous to the structure- or property-based representations extensively discussed herein. In this case, the activity profiles corresponding to the biological signatures of the compounds are compared using an appropriate similarity function as a measure of pairwise similarity, irrespective of the structural features of the compounds. Hence, in this case, biological similarity is assessed in target space rather than chemical space.

For SAR analysis and medicinal chemistry programs, biological similarity is generally more difficult to implement than structure- or property-based representations because specific activity values might not be available for compounds of interest.

In addition to their use as molecular similarity measures, biological signatures can also provide an approximate measure of compound promiscuity.21 For example, summing the individual values in a binary biological signature (active = 1 or inactive = 0) yields the number of targets against which the associated compound exhibits activity.

Global versus Local Similarity. A very important criterion for similarity analysis is distinguishing between global and local similarity views. For example, the comparison of pharmacophore models in drug design focuses only on selected atoms, groups, or functionalities that are known or hypothesized to be responsible for activity. This represents a local view of similarity, in contrast to the more global view typically found in chemical informatics, where compounds are considered in their entirety. In the latter case, the calculated property or structural descriptors typically used to compute molecular similarities are generally derived from structural information associated with entire compounds. For example, if we translate the structural information of a compound into a fragment fingerprint, a global molecular representation is obtained. This whole-compound view of similarity is characteristic of the perspective of chemoinformaticians.

Medicinal Chemistry Perspective. In addition to local and global views, however, special attention must also be paid to a medicinal chemist’s perspective in this context. Consider, for example, the set of well-known cyclooxygenase (COX) inhibitors compared in Figure 3. All of these inhibitors are approved drugs except lumiracoxib, which lost its United States approval in 2007. If we apply a whole-compound view, compounds such as the ibuprofen enantiomers, ibuprofen and paracetamol, or diclofenac and lumiracoxib appear visibly similar. From a medicinal chemistry point of view, however, this assessment may not be generally agreed upon since small chemical differences can lead to important changes in specificity profiles (e.g., diclofenac vs lumiracoxib) or compounds containing different functional groups can be synthesized or derivatized in different ways (e.g., ibuprofen vs paracetamol). Hence, a medicinal chemist’s view of similarity might again be more local in nature and/or take chemical reaction information directly into account. Moreover, these COX inhibitors are involved in highly complex similarity−activity relationships that also cannot easily be separated from a medicinal chemistry perspective. For example, the (R)-(−)-enantiomers of ibuprofen and naproxen are inactive, but under physiological conditions the (R)-(−)-enantiomer of ibuprofen is converted into the active (S)-(+)-enantiomer by the enzyme 2-arylpropionyl-CoA epimerase. Furthermore, paracetamol and lumiracoxib are selective for COX-2, but the other inhibitors are active against both COX-1 and COX-2, the former activity giving rise to gastrointestinal side effects. Moreover, naproxen alone is also active against hormone-sensitive lipase.

Such examples illustrate that considerations of chemical and functional criteria might readily alter the perception of global molecular resemblance. Clearly, such similarity considerations fall into a gray zone, as they are influenced by subjective criteria as well as the experience of the investigator, and hence, there is no generally accepted way to judge such similarity relationships. Accordingly, relations between the cognitive and computational aspects of molecular similarity are discussed in more detail in the following section.

Figure 3. Complex similarity relationships. Cyclooxygenase (COX) inhibitors and their activity profiles are compared. HSL stands for hormone-sensitive lipase.
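The binary biological signatures described under Molecular versus Biological Similarity lend themselves to a short Python sketch. The five-target panel and the two activity profiles below are invented for illustration and do not correspond to any measured data; the helper names are hypothetical. The sketch shows the promiscuity estimate obtained by summing a signature and, as one possible way to compare profiles in target space, a count of the targets against which both compounds are active.

```python
from typing import Sequence


def promiscuity(signature: Sequence[int]) -> int:
    """Number of panel targets hit: the sum of a binary (1/0) signature."""
    return sum(signature)


def shared_targets(sig_a: Sequence[int], sig_b: Sequence[int]) -> int:
    """Number of targets against which both compounds are active."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must be defined over the same target panel")
    return sum(1 for x, y in zip(sig_a, sig_b) if x == 1 and y == 1)


# Hypothetical activity profiles over a five-target reference panel
# (active = 1, inactive = 0); the values are made up for this example.
COMPOUND_X = [1, 0, 1, 1, 0]
COMPOUND_Y = [1, 0, 0, 1, 0]

if __name__ == "__main__":
    print(promiscuity(COMPOUND_X), promiscuity(COMPOUND_Y))
    print(shared_targets(COMPOUND_X, COMPOUND_Y))
```

Because the comparison uses only activity profiles, two structurally unrelated compounds can appear highly similar here, which is the sense in which biological similarity is assessed in target space rather than chemical space.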


■ COGNITIVE VERSUS COMPUTATIONAL ASPECTS OF MOLECULAR SIMILARITY

While similarity as perceived by trained medicinal chemists is decidedly not the same as similarity obtained by computational means, there are some aspects of the two that are comparable. For example, in both cases, some type of symbolic representation is required to characterize the structural information of the compounds being compared, although in the former case the representation is not explicitly stated. Regardless of their details, however, both types of symbolic representation must make molecular information comprehensible in such a way that structural/feature patterns can be identified and recognized. In general terms, pattern recognition refers to the ability to detect recurrent themes, organization principles, relationships, and rules in large data sets,22 an essential requirement for decision making by humans as well as for computational learning.22,23

The identification of patterns within data forms a basis for classification and directly applies to our molecular world. More than anything else, the recognition of molecular patterns, based on human or computational exploration, provides a basis for arriving at decisions as to whether two compounds are similar to each other or not. Since data complexity generally scales with the number of patterns that can be discovered, it quickly becomes impossible for humans to consider them in a comprehensive manner. Therefore, humans intuitively, and often unconsciously, reduce patterns to simpler ones that contain the essential feature(s) of the original pattern. But unlike applications of computational pattern recognition, the precise nature of these key patterns in human pattern recognition is unknown. For instance, to cross a road safely, we need to recognize patterns associated with moving objects and/or engine noise but are not required to understand which type of car or motorbike is approaching. This intuitive reductionist approach to pattern recognition is clearly reflected by decision-making by medicinal chemists, as further discussed below.

Selecting key patterns, regardless of whether they are mathematically defined or expressed in terms of vague conscious or subconscious mental constructs, is the most crucial element in any assessment of molecular similarity. The key patterns used by humans or computers will generally vary from individual to individual or from algorithm to algorithm, a situation that most likely will yield results with varying degrees of agreement for the same set of data. This follows because the representations used by humans and by computers, which most likely are significantly different, are crucial components in determining what can be understood about relationships of objects to each other, whether they are physical objects, concepts, ideas, or compounds. Despite the common search for key patterns, the use of representations to determine similarity in machine computation compared to human perception of similarity by medicinal chemists differs significantly,2 as schematically illustrated in Figure 4.

In the case of machine computation, algorithms have been developed for constructing suitable representations of the structural information in compounds and for evaluating similarity functions or coefficients associated with these representations.4,24,25 However, since there is no unique or invariant way to represent molecular and chemical information, constructing representations suitable for a given task or goal depends on what the task or goal is.

As noted earlier and discussed further below, mathematical functions that are designed to reflect the degree of molecular similarity typically yield values that lie on the unit interval [0, 1] of the real line. But as is also discussed below, the form of these functions also influences the similarity values because they usually differ even when identical representations are used, although in some cases they are linearly or monotonically related.2

Figure 4. Similarity assessment through pattern recognition. Exemplary computer- and human-based pattern recognition processes for similarity assessment are illustrated.

Role of Chemical Intuition and Experience. Although well-defined, computed values may not account for the degree of similarity in a way that is consistent with the perceptions of medicinal chemists because human perception of similarity is a much more complicated, varied, and subtle task (vide supra). Moreover, the “cognitive algorithms” by which medicinal chemists perceive similarity are largely unknown, although some recent work has begun to address this question.26−28 These studies clearly show that chemical intuition and experience play major roles in decision making in medicinal chemistry. Surprisingly, there is typically little consensus between experienced medicinal chemists in judging preferred compounds and assessing favorable or unfavorable molecular features.26−28 Furthermore, it has been shown that perception of molecular structures is strongly context-dependent; i.e., depending on the order in which we view compounds and how they are grouped, different conclusions are drawn.27 This points to a potential advantage of computational similarity assessment because compound representations or patterns are constant and context-independent. It has also been shown that medicinal chemists often have difficulties comprehending the nature and meaning of the parameters they might have considered and the scientific criteria upon which decisions on compounds are based.28 Medicinal chemists typically base their compound decisions on very few patterns or parameters, fewer than they believe,28 a fact that clearly reflects the pattern-reduction approach referred to above. Decision parameters generally result from feature reduction and pattern reduction, which also provides a foundation of machine learning approaches.22,23 Computational methods such as neural networks29 or support vector machines30 are essentially designed for pattern-based similarity assessment, which requires training data, the use of which also renders these computational modeling efforts context-dependent. The resulting computational models have the often cited “black box character”, which means that they cannot be interpreted in chemical terms. In some ways, this provides an interesting analogy to medicinal chemists who do not realize upon which parameters their compound decisions might be based.28

Although it may not be possible to rationalize our judgments, we are typically more content with our own decisions than those obtained computationally that, in many cases, can be difficult to interpret. Accordingly, machine learning methods such as decision trees31 or emerging chemical patterns32 are often favored in practice because they yield interpretable patterns, even though they may be based on rather abstract representations of molecular and chemical information. In light of the above, it is clear that judgments of molecular similarity can be influenced by a number of cognitive aspects. Lastly, with regard to the SPP, it should be re-emphasized that mere assessment of molecular similarity is generally not the ultimate goal. Rather, in many cases, it is the identification of similar compounds that, based on the SPP, are presumed to have similar properties (especially biological activities) to known reference or target compounds. This adds additional layers of complexity to our perception of similarity and can further complicate our judgments.

Similarity Coefficients. The question then arises as to whether it is reasonable to assume that any “rationalization” of similarity, or that any consistent computational representation and comparison of compounds that yields a numerical readout, will increase our own consensus and be superior to subjective decisions. The Tanimoto coefficient (Tc)24,33 is introduced to help answer this and related questions and to provide an illustration of how molecular similarity can be quantified. Although it may not be the best procedure, it is by far the most popular and, because of its ease of implementation and speed, is in widespread use today in chemical informatics and computational medicinal chemistry. As detailed in the sequel, a variety of other similarity measures,24,25 most of which did not originate in chemical informatics but in other scientific fields (such as statistics, ecology, and psychology), have also been used to compare specific molecular representations.

The Tanimoto coefficient is generally defined by

$$\mathrm{Tc}(A,B) = \frac{c}{a + b - c} \tag{1}$$

where a and b are the number of features present in compounds A and B, respectively, and c is the number of features shared by A and B. Hence, Tc quantifies the ratio of the number of features common to A and B to the total number of features of A or B, where the c term in the denominator corrects for double counting of the shared features.

Another perhaps more intuitive way to interpret Tanimoto similarity is based on an alternative form of the denominator of eq 1, i.e.,

$$a + b - c = (a - c) + (b - c) + c \tag{2}$$

Here the terms (a − c) and (b − c) are the number of features unique to A or B, respectively. Substituting eq 2 into eq 1 yields the numerically equivalent form of Tc,

$$\mathrm{Tc}(A,B) = \frac{c}{(a - c) + (b - c) + c} \tag{3}$$

Dividing numerator and denominator by (a − c) + (b − c) gives

$$\mathrm{Tc}(A,B) = \frac{R(a,b,c)}{1 + R(a,b,c)} \tag{4}$$

where

$$R(a,b,c) = \frac{c}{(a - c) + (b - c)} \tag{5}$$

which can be interpreted as the ratio of the number of features shared by A and B to the number of their unique features.

As A and B become more similar, the number of shared features approaches the number of features in A and B (i.e., c → a, b) and the number of unique features in both compounds approaches zero (i.e., (a − c) → 0 and (b − c) → 0) because in the limit the number of shared features and number of features in A and B become equal (i.e., a = b = c). Thus, their ratio goes to infinity (i.e., R(a,b,c) → ∞), which in the limit gives Tc(A,B) = 1. Conversely, as A and B become less similar, the number of shared features approaches zero and consequently all of the features of A and B are unique, and thus, the ratio of these features also goes to zero (i.e., c → 0, (a − c) → a, (b − c) → b, and R(a,b,c) → 0); thus, in the limit, Tc(A,B) = 0. In the intermediate region where the number of shared features is greater than zero but less than the lesser of the number of features in A and B (i.e., 0 < c < min(a,b)) and where the number of unique features is less than the total number of possible features (i.e., (a − c) + (b − c) < a + b), the Tanimoto similarity will lie between the extremes of the unit interval of the real line, i.e., 0 < Tc(A,B) < 1. One way to think about this is to note that as the number of shared features between two compounds increases, their number of unique features must correspondingly decrease. Thus, there is interplay between the number of shared features and the number of unique features exemplified by their ratio R(a,b,c).

The calculation of Tanimoto similarity is typically based on representations called “molecular fingerprints”,4,6,7 which can be viewed as classical sets or binary vectors whose elements have values of “1” or “0” corresponding, respectively, to the presence or absence of specific features (e.g., molecular

Journal of Medicinal Chemistry Perspective

dx.doi.org/10.1021/jm401411z | J. Med. Chem. 2014, 57, 3186−32043191


fragments). In some cases, elements with value “1” are called “on-bits” and those with value “0” are called “off-bits”, hence, the description of molecular fingerprints as “bit strings” or “bit vectors”. Note that the molecular fingerprints described above do not account for multiple occurrences of the different features, only whether they occur at least once in a given compound. However, feature counts can be added to fingerprints by using integer values to represent features instead of a binary format. Fingerprints of different design and complexity are available,7 as further discussed below. For similarity searching, fingerprints are among the original and to this date most popular descriptors.

Dissimilarity can be quantified in a complementary manner such that small values indicate similarity and large values dissimilarity. Accordingly, a dissimilarity measure can be derived from the Tc by taking the appropriate complement known as the Soergel distance (Sg),24 i.e.,

$$\mathrm{Sg}(A,B) = 1 - \mathrm{Tc}(A,B) = 1 - \frac{c}{a + b - c} \tag{6}$$

that can be rewritten as

$$\mathrm{Sg}(A,B) = 1 - \frac{c}{a + b - c} = \frac{(a + b - c) - c}{a + b - c} = \frac{(a - c) + (b - c)}{a + b - c} \tag{7}$$
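The relationships in eqs 1, 4, 5, and 6 can be checked numerically. The following is a minimal sketch, assuming each fingerprint is represented as a Python set of feature identifiers; the function names, the toy feature sets, and the conventions chosen for empty inputs are illustrative assumptions and not part of the original text.

```python
def tanimoto(a_feats: set, b_feats: set) -> float:
    """Tc(A,B) = c / (a + b - c), following eq 1."""
    a, b = len(a_feats), len(b_feats)
    c = len(a_feats & b_feats)          # number of shared features
    denom = a + b - c                   # number of features in A or B
    # Returning 0.0 for two empty sets is an assumed convention.
    return c / denom if denom else 0.0


def tanimoto_via_ratio(a_feats: set, b_feats: set) -> float:
    """Tc(A,B) = R / (1 + R) with R = c / ((a - c) + (b - c)), eqs 4 and 5."""
    a, b = len(a_feats), len(b_feats)
    c = len(a_feats & b_feats)
    unique = (a - c) + (b - c)          # features unique to A or to B
    if unique == 0:
        # R goes to infinity, so Tc goes to 1 (0.0 for empty sets is assumed).
        return 1.0 if c else 0.0
    r = c / unique
    return r / (1.0 + r)


def soergel(a_feats: set, b_feats: set) -> float:
    """Sg(A,B) = 1 - Tc(A,B), following eq 6."""
    return 1.0 - tanimoto(a_feats, b_feats)


# Hypothetical feature identifiers: a = 4, b = 6, c = 2.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8}
print(tanimoto(A, B))            # 0.25
print(tanimoto_via_ratio(A, B))  # approximately 0.25
print(soergel(A, B))             # 0.75
```

For this toy pair, eq 7 gives the same Soergel distance directly as ((a − c) + (b − c))/(a + b − c) = 6/8.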

As noted above, the denominators in eqs 1 and 3, a + b − c and (a − c) + (b − c) + c, respectively, represent the number of features that occur in either A or B, and the Tc can then be rationalized as the percentage of shared features, whereas the Soergel distance corresponds to the percentage of features unique to A or B given by (a − c) and (b − c), respectively.

Another similarity measure that is growing in usage is the Tversky coefficient (Tv),34 which is given by

$$\mathrm{Tv}_{\alpha,\beta}(A,B) = \frac{c}{\alpha (a - c) + \beta (b - c) + c} \tag{8}$$

The denominator is closely related to that given for Tanimoto similarity in eq 3 except for the two parameters α and β that weight the number of features unique to A or B, (a − c) and (b − c), respectively. As defined by Tversky,34 α and β are non-negative. In chemical informatics and computational medicinal chemistry applications, these parameters are typically chosen to lie within the unit interval [0, 1] of the real line. In either case, zero and unity bound the value of Tv. The larger α is compared to β, the more weight is put on the unique features of reference compound A and the less on database compound B and vice versa. Thus, in the case of Tv, whose values also range from 0 to 1, the similarity values change as the two weights vary. This makes it possible to study the relative importance of common and unique features for compound ranking with respect to the reference and database compounds.

As discussed further below, the weighting scheme can be applied to introduce asymmetry into similarity calculations. For the special case α = β = 1, where the unique features of both compounds are weighted equally, Tv is identical to Tc. In the case where α = β = 0.5, Tv is identical to the Dice coefficient (Dc)24

$$\mathrm{Dc}(A,B) = \frac{c}{\tfrac{1}{2}(a + b)} \tag{9}$$

written here in a form that clearly shows that the denominator is the arithmetic mean of the number of features in A and B. Since 1/2(a + b) ≤ (a + b) − c, it follows that Tc(A,B) ≤ Dc(A,B), as illustrated by the distributions depicted in Figure 6.

Both similarity coefficients are symmetric, since the similarity of A with respect to B is the same as the similarity of B with respect to A. In fact, any Tv in which α = β yields a symmetric similarity coefficient such that $\mathrm{Tv}_{\alpha=\beta}(A,B) = \mathrm{Tv}_{\alpha=\beta}(B,A)$. Tversky similarity coefficients with two unequal weighting factors (α ≠ β) are, on the other hand, asymmetric, their degree of asymmetry depending on the relative magnitudes of the weighting factors.

Similarity coefficients can be classified according to their compound ranking characteristics. Coefficients that always produce the same ranking of compounds, although their absolute similarity values might differ, are said to be monotonic. For example, $\mathrm{Tv}_{\alpha,\beta}(A,B)$ and $\mathrm{Tv}_{\alpha',\beta'}(A,B)$ are monotonically related if the parameters have the same ratio so that α′ = kα and β′ = kβ for some k > 0. These coefficients can be converted into each other by the monotonic function

$$\mathrm{Tv}_{\alpha',\beta'}(A,B) = \frac{1}{\dfrac{k}{\mathrm{Tv}_{\alpha,\beta}(A,B)} + 1 - k} \tag{10}$$

which can be verified by elementary algebraic transformations. Thus, normalization of the parameters imposes no restriction on the ranking and, hence, the generality of Tv.

In the following, the sum of the weighting parameter values is restricted to unity, i.e., α + β = 1. Replacing β in eq 8 by 1 − α yields

$$\mathrm{Tv}_{\alpha}(A,B) = \frac{c}{\alpha (a - c) + (1 - \alpha)(b - c) + c} = \frac{c}{\alpha a + (1 - \alpha) b} \tag{11}$$
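The algebra behind the rescaling relation (eq 10) and the simplified denominator of eq 11 is short. A sketch of the steps follows, using the shorthand u = a − c and v = b − c (introduced here only for brevity) and assuming c > 0 and k > 0:

```latex
\begin{align*}
\frac{1}{\mathrm{Tv}_{\alpha,\beta}(A,B)} - 1
  &= \frac{\alpha u + \beta v + c}{c} - 1
   = \frac{\alpha u + \beta v}{c}, \\[4pt]
\mathrm{Tv}_{k\alpha,k\beta}(A,B)
  &= \frac{c}{k(\alpha u + \beta v) + c}
   = \frac{1}{1 + k\left(\dfrac{1}{\mathrm{Tv}_{\alpha,\beta}(A,B)} - 1\right)}
   = \frac{1}{\dfrac{k}{\mathrm{Tv}_{\alpha,\beta}(A,B)} + 1 - k}, \\[4pt]
\alpha (a - c) + (1 - \alpha)(b - c) + c
  &= \alpha a + (1 - \alpha) b - \alpha c - (1 - \alpha) c + c
   = \alpha a + (1 - \alpha) b .
\end{align*}
```

Because the map x ↦ x / (k + (1 − k)x) has derivative k / (k + (1 − k)x)², which is positive for k > 0, a larger value of the original coefficient always gives a larger value of the rescaled one, so compound rankings are preserved.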

Thus, Tversky similarity now only depends on the single parameter α. Note that differences in the numerical distribution of the normalized Tv and Tc are to a large extent due to the fact that the Tc corresponds to a non-normalized Tv under the condition α + β = 2. Furthermore, as clearly shown in Figure 6,

$$\mathrm{Tv}_{\alpha=1,\beta=1}(A,B) = \mathrm{Tc}(A,B) \le \mathrm{Tv}_{\alpha=1/2,\beta=1/2}(A,B) = \mathrm{Dc}(A,B) \tag{12}$$
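The special cases collected in eq 12 can be reproduced with a few lines of code. This is a minimal sketch, assuming set-based fingerprints with hypothetical integer feature identifiers; the function names are illustrative and the zero-denominator convention is an assumption.

```python
import math


def tversky(a_feats: set, b_feats: set, alpha: float, beta: float) -> float:
    """Tv(A,B) = c / (alpha*(a - c) + beta*(b - c) + c), following eq 8."""
    c = len(a_feats & b_feats)           # shared features
    unique_a = len(a_feats - b_feats)    # a - c
    unique_b = len(b_feats - a_feats)    # b - c
    denom = alpha * unique_a + beta * unique_b + c
    # Returning 0.0 when the denominator vanishes is an assumed convention.
    return c / denom if denom else 0.0


def tanimoto(a_feats: set, b_feats: set) -> float:
    """Tc as the special case alpha = beta = 1."""
    return tversky(a_feats, b_feats, 1.0, 1.0)


def dice(a_feats: set, b_feats: set) -> float:
    """Dc(A,B) = c / (0.5 * (a + b)), following eq 9."""
    total = len(a_feats) + len(b_feats)
    return len(a_feats & b_feats) / (0.5 * total) if total else 0.0


# Toy example with a = 4, b = 6, c = 2.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8}
tc = tanimoto(A, B)                 # 2 / 8
dc = dice(A, B)                     # 2 / 5
tv_half = tversky(A, B, 0.5, 0.5)   # should agree with Dice
assert math.isclose(tv_half, dc)
assert tc <= dc                     # the inequality stated in eq 12
```

Swapping the arguments leaves both values unchanged, which reflects the symmetry of any Tv with α = β.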

An extreme form of Tv occurs when the reference compound A is weighted (α = 1) and the database compound is not (β = 0), in which case eq 8 becomes

$$\mathrm{Tv}_{\alpha=1,\beta=0}(A,B) = \frac{c}{a} \tag{13}$$

In this case, the Tversky similarity coefficient provides a measure of how similar A is to B, which can be interpreted as the fraction of the features in the reference compound A that are matched by database compound B. Interchanging the values of the weighting factors so that now α = 0 and β = 1 places the entire weighting on the database compound B and gives

$$\mathrm{Tv}_{\alpha=0,\beta=1}(A,B) = \frac{c}{b} \tag{14}$$

which in this case can be interpreted as the fraction of the database compound B that is similar to the reference compound A. These two forms of Tv represent extreme forms of Tversky similarity coefficients.


Increasing molecular size or complexity generally leads to increasing fingerprint bit densities, which are defined for a given compound A as

$$\rho_{\mathrm{FP}}(A) = \frac{\text{number of on-bits}}{\text{total number of fingerprint bits}} \tag{15}$$

Such increases in the bit density ρFP(A) have a statistical tendency to yield higher similarity values for larger compounds,35 a well-known complication in similarity searching7 and a cause of apparent asymmetry in distributions of similarity values.36 Molecular complexity effects can be balanced or eliminated in different ways, for example, by equally taking into account bits that are set on or off in similarity calculations37,38 or by combining binary fingerprint representations with their complements, i.e., adding the complement to the original bit string, thereby producing a constant fingerprint bit density for compounds of any size.39
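Why appending the complement yields a constant bit density can be seen from a small example. The sketch below assumes fingerprints stored as Python lists of 0/1 values; the helper names and the toy bit strings are hypothetical and only illustrate eq 15 and the complement construction described above.

```python
def bit_density(bits: list[int]) -> float:
    """Fraction of on-bits in a binary fingerprint, following eq 15."""
    return sum(bits) / len(bits)


def with_complement(bits: list[int]) -> list[int]:
    """Append the bitwise complement to the original bit string."""
    return bits + [1 - x for x in bits]


# A sparse toy fingerprint (small compound) and a dense one (large compound).
small = [1, 0, 0, 0, 0, 0, 0, 0]
large = [1, 1, 1, 0, 1, 1, 0, 1]

print(bit_density(small))                   # 0.125
print(bit_density(large))                   # 0.75
# After appending the complement, n_on + (n - n_on) = n of the 2n bits are on.
print(bit_density(with_complement(small)))  # 0.5
print(bit_density(with_complement(large)))  # 0.5
```

The combined string has twice the length of the original, and every original position contributes exactly one on-bit, so the density is 0.5 regardless of compound size.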

Calculating Tanimoto, Tversky, or Dice similarity has an assumed advantage that numerical values can now be used to distinguish similarity relationships in a consistent manner. How does this numerical approach from chemical informatics relate to, and perhaps influence, the more subjective assessment of similarity in medicinal chemistry? Are calculated similarity values suitable to replace chemical intuition and judgment?

Computed versus Intuitive Similarity. There are a number of issues that arise when comparing computed similarity values with those assigned by medicinal chemists. One issue is that the similarity scale employed by medicinal chemists is not uniform. The following argument, which depends on the complementary nature of similarity and dissimilarity, illustrates this point. In computations the degree of dissimilarity is typically taken as the complement of similarity:

$$\text{dissimilarity} = 1 - \text{similarity}$$

Hence, the more dissimilar two compounds are, the less similar they are to each other and vice versa. Importantly, such complementary behavior between computed similarity and dissimilarity values does not, however, apply in the case of human perception. For example, humans can better assess similarity the more similar compared objects are to each other. By contrast, as objects become less and less similar, a point is reached where it is generally difficult for humans to assess their degree of similarity or dissimilarity. Recall that in the former case one is dealing with features that are common to both compounds, whereas in the latter case one is dealing with features that are unique to each of the compounds. This follows from the basic psychophysics of human perception because it is easier for humans to make comparative judgments of objects with common features than between objects whose features are unique.

Since computed similarity values do not suffer from these problems, a divergence between human perceptions and computed values of similarity likely arises. In most cases, this is not a problem for medicinal chemists who typically want to synthesize and test compounds that are similar to known actives. Then, high calculated similarity values have an intuitive meaning. However, if similarity values are decreasing in size, boundaries between similarity and dissimilarity become rather diffuse and one is often unable to interpret such values.

The question of symmetry vs asymmetry of similarity, as formally discussed above, should also be considered from an intuitive perspective. Tversky similarity, which originated in psychology (not informatics), is conceptually based on a number of asymmetric characteristics that are associated with human perceptions of similarity. An example given by Tversky involves a comparison of Korea and China; the similarity of Korea to China is usually considered to be greater than the similarity of China to Korea. This view, which is rather general, suggests that relative size, however accounted for, has a significant influence on the perceived asymmetry of the similarity of entities, including compounds, when compared by humans. Moreover, this can also be interpreted in terms of eqs 13 and 14, since the “fraction” of Korea that is similar to China is definitely not the same as the “fraction” of China that is similar to Korea. Often it is not considered that the Tversky similarity coefficient is parametrized to account for asymmetric aspects of similarity by capturing the asymmetric characteristics inherent in many different types of objects under comparison.

To understand, in light of the above, how human perception of the similarity of two compounds might be asymmetric, it is necessary to distinguish the compounds being compared. Let us consider an ordered pair in which A is a reference compound and B a database compound. If the reference A is a small compound and a substructure of a larger compound B, A is rather similar to B. This follows because A is a close match to a part of B. However, if the situation is reversed, i.e., B is now used as the reference and A is the database compound, the similarity will be lower because most of B differs from A. This is a molecular example of the size effect described above in the case of the perceived asymmetric similarity comparisons of Korea and China. Equations 13 and 14 and the accompanying discussion fully support this analysis.

In Tc calculations, this perceived asymmetric similarity relationship is not reflected, but Tv calculations offer this possibility as a consequence of appropriate weighting. Importantly, perceived relative size-dependent asymmetric similarity is distinct from representation-dependent molecular size or complexity effects mentioned above, which systematically bias similarity calculations by producing large values for larger and topologically more complex compounds.
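The substructure argument above can be made concrete with a toy calculation. In this sketch, which assumes set-based fingerprints, the small compound A is modeled as 5 hypothetical features that are all contained in the 20 features of the larger compound B; the numbers are illustrative only and do not come from any real compound.

```python
def tversky(ref: set, db: set, alpha: float, beta: float) -> float:
    """Tv(ref, db) = c / (alpha*(a - c) + beta*(b - c) + c), following eq 8."""
    c = len(ref & db)
    denom = alpha * len(ref - db) + beta * len(db - ref) + c
    # Returning 0.0 when the denominator vanishes is an assumed convention.
    return c / denom if denom else 0.0


A = set(range(5))    # small compound: 5 features
B = set(range(20))   # larger compound containing all features of A

# Reference-weighted extreme (alpha = 1, beta = 0), which reduces to c / a (eq 13).
print(tversky(A, B, 1.0, 0.0))   # 1.0  (all of A is matched by B)
print(tversky(B, A, 1.0, 0.0))   # 0.25 (only a quarter of B is matched by A)

# Tanimoto (alpha = beta = 1) is the same in both directions.
print(tversky(A, B, 1.0, 1.0))   # 0.25
print(tversky(B, A, 1.0, 1.0))   # 0.25
```

With A as reference the asymmetric coefficient reports a perfect match, whereas with B as reference it drops to 0.25, while Tc cannot distinguish the two directions.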

Human Perception. The assessment of similarity on the basis of human perception is considerably more complicated than reflected by the examples given above because a number of other conscious and subconscious factors also play a role. For example, a key factor in similarity assessment is the ability of humans, in general, and medicinal chemists, in particular, to intuitively reduce the complexity of the problem at hand (vide supra). This need to reduce complexity largely depends on the fact, as pointed out by numerous psychologists, that humans can only hold a relatively small number of things in their working memory at any point in time.40,41 Working memory is that part of memory that actively holds multiple pieces of transitory information that can be manipulated by verbal and nonverbal tasks, such as reasoning and comprehension, and makes the results of these tasks available for further information-processing. In the case of medicinal chemists this means that only structural features perceived to be most essential, or some simplified representation of them, might be retained and considered for similarity assessment, very consistent with the results obtained by Kutchukian et al.,28 indicating the partly unconscious use of only one or two chemical parameters by medicinal chemists in compound evaluation and decision making. Understanding these criteria, which will undoubtedly differ from medicinal chemist to


medicinal chemist, is a nontrivial task. Thus, computed similarity values and judgments by medicinal chemists are both influenced by dependencies on molecular size and complexity, but the effect is much more pronounced and difficult to predict in the case of medicinal chemists’ assessments of similarity. The inconsistency of humans when confronted with complex decision tasks42 is well reflected by generally observed changes in medicinal chemists’ judgment about the quality of the same compounds when presented in different orders (vide supra).27 It is evident that medicinal chemists are often left with conscious or subconscious “impressions”, which they fold into their assessments of similarity in some implicit way, being intuitively aware of the complexity of the problem at hand, which then automatically leads to a reductionist approach in decision making. It is therefore not surprising that similarity calculations are attractive in medicinal chemistry because they reduce complex molecular comparisons to a simple numerical readout. Then, however, the key question becomes what such computed values actually mean.

■ CHARACTERISTICS OF SIMILARITY CALCULATIONS

In the following section, we highlight opportunities and limitations of similarity calculations in light of the above discussion. Thereby, we evaluate the apparent attractiveness of numerical similarity measures as a complement, or replacement, of human perception and study relationships between calculated and perceived similarities.

Similarity Property Principle Revisited. A critically important aspect to realize is that most similarity methods do not explicitly take biological activity into account. Thus, similarity values generally reflect the similarity of chosen molecular representations. Yet this is hardly of interest in medicinal chemistry. Instead, chemoinformaticians and medicinal chemists typically attempt to bridge between calculated similarity and biological activity, well in accord with the SPP discussed above. In fact, the key question asked in this context typically is “Which Tc value reliably indicates that compound B has the same activity as reference compound A?” In other words, “How similar must A and B be to have the same activity?” This is the major attraction of reducing complex similarity relationships to simple numbers and the source of some profound misunderstandings of similarity calculation.

The 0.85 Myth. In a seminal study quantifying chemical neighborhood behavior, investigators from Tripos established, using their in-house fingerprints and sets of active compounds, that a Tc value of 0.85 reflected a high probability that two compounds shared the same activity.43 For more than 15 years, this Tc value has propagated in the literature as a general threshold for bioactivity and has been applied in many practical applications, although the value is not reliable when other molecular representations are used for similarity calculations.4,7,44 Neighborhood behavior and calculated similarity values are strongly dependent on chosen molecular representations and similarity measures.4 While this is generally well-known, it is often underappreciated in medicinal chemistry even today. The often-observed use of putative Tc threshold values of biological activity reflects common misunderstandings of similarity calculations. In the following, we present and discuss exemplary similarity calculations to highlight several characteristic features.

Fingerprints of Different Design. In the following, two conceptually different fingerprints are compared that are popular in computational medicinal chemistry. The molecular access system (MACCS) fingerprint,13 also termed MACCS structural keys, is a prototypic fragment-based fingerprint that consists of 166 structural fragments with 1−10 non-hydrogen atoms and is one of the original and most popular similarity search tools.6,7 Its design is simple. Each bit position is assigned to one particular structural fragment or key and its presence or absence in a compound is detected.

By contrast, we use the extended connectivity fingerprint (ECFP) with bond diameter four (ECFP4) that currently is one of the most popular fingerprints for similarity searching.14 ECFPs account for the local bond topologies, which describe the connectivity of atoms in the neighborhood of each non-hydrogen atom in a molecule. The size of the neighborhood depends on the so-called bond diameter given by the maximum number of bonds considered. The ECFP design is much more complex than MACCS because many different atom environment features can be generated. Different from MACCS, ECFP4 consists of sets of compound-specific features whose overlap is quantified as a measure of molecular similarity. Although many different atom environments can in principle exist, feature sets derived for individual compounds are often relatively small (e.g., containing less than 100 features), depending on their topology.

Similarity Value Distributions. Although the definition of Tc yields an interpretable value as “the percentage of fingerprint features shared between two compounds”, it is very difficult to judge whether a given Tc value indicates the

Figure 5. Frequency of fingerprint features. The relative frequency of occurrence of the 150 most frequent features of (a) MACCS and (b) ECFP4 is calculated for a random subset of 1 million ZINC database compounds.


presence of “significant similarity” or not. This is the case because the coefficient value does not tell us anything about the specific features under comparison. For instance, many MACCS bit positions refer to structural features that are often found in compounds, whereas ECFP4 systematically encodes atom environments, many of which are infrequently found in compound data sets. For this reason, ECFP4 Tc values are generally smaller than MACCS Tc values. This difference in feature frequencies is illustrated in Figure 5 that reports the relative frequencies of the 150 most frequently detected MACCS and ECFP4 features in 1 000 000 compounds randomly selected from the ZINC (version 12) database.45 MACCS and ECFP4 fingerprints were calculated with the Molecular Operating Environment (MOE).46 Overall the ZINC subset contained 183 476 different ECFP4 features, but only 632 of these features occurred in more than 1% of the compounds. Considering the sparseness of most ECFP4 features, it is not surprising that some molecules that are structurally similar contain a significant number of unique features. Importantly, the differences in feature distribution between MACCS and ECFP4 lead to very different distributions of similarity coefficient values. To illustrate these differences 10 000 000 similarity values were calculated for randomly chosen pairs of ZINC compounds. The results are shown for MACCS and ECFP4 Tc and Dc calculations in Figure 6, where it is clear that the Dc distributions are shifted toward higher values and are less symmetrical than the comparable Tc distributions. These effects are due to the normalization (α + β = 1) of the Dc and can be rationalized based on the discussions associated with eqs 9 and 11. Similar effects are, in general, observed for Tv, yielding distributions very similar to those of the Dc, regardless of the value of the parameter α. The figure shows that different combinations of fingerprints and similarity coefficients produce different similarity value distributions, further emphasizing the critically important point that calculated similarity has no absolute meaning.

Figure 6. Similarity coefficient distributions. Distributions of similarity values resulting from 10 million comparisons of randomly chosen ZINC compounds are reported for the Tanimoto and Dice coefficient and the (a) MACCS and (b) ECFP4 fingerprint.

Figure 7. Comparison of similarity coefficients. For two thrombin inhibitors Dice, Tanimoto, and Tversky coefficients are compared using MACCS and ECFP4. Tversky similarity calculations were carried out using different parameter settings.


Although the global distribution of Tv values does not significantly depend on the settings of α, this parameter determines how similarity relative to a given reference molecule is perceived. If more weight is put on features (bit settings) of the reference molecule (i.e., if α > 0.5), different similarity relationships evolve. Compounds that contain most of the reference features plus some additional ones are considered to be more similar to the reference molecule than compounds that contain fewer of the reference features but also fewer additional features, although the percentage of shared features might be the same for both molecules. How different representations and similarity coefficients affect computed similarity values is illustrated in Figure 7, using two exemplary thrombin inhibitors taken from the ChEMBL (version 15)47 database. Both molecules contain more ECFP4 features than MACCS features, but the number of shared features is lower for ECFP4, as expected on the basis of the feature distributions discussed above. Consequently, the different coefficients produce significantly lower similarity values for ECFP4. Dc is increased compared to Tc as shown in the discussions related to eqs 9 and 12. Because Dc is identical to normalized Tv with α = 0.5, its value can be numerically compared to the asymmetrical Tv values with parameters α = 0.1 and α = 0.9, respectively. It can be observed that Tv decreases for α = 0.1 and increases for α = 0.9. In the first case, more weight is put on the features exclusive to molecule B, and in the second case, less weight is put on these features. Thus, the influence of these features on the similarity value is either increasing or decreasing compared to Dc. Changing α has an effect on computed similarity values. More importantly, however, the parameter also influences how similarity is perceived in a search when database compounds are ranked in the order of decreasing similarity to a reference molecule. Here, the absolute value of similarity is not of interest, especially if the value cannot be interpreted in a meaningful way. Rather, the rank positions of compounds with the desired properties determine the usefulness of a similarity coefficient. Figure 8 illustrates the effect that the choice of different similarity coefficients has on the ranking of compounds in a similarity search. Molecule A in Figure 7 was taken as a reference, and 1 000 000 ZINC compounds together with molecule B and 24 other thrombin inhibitors were searched and ranked using different similarity coefficients. In Figure 8, the ranks are displayed on the x-axis from low to high ranks on a logarithmic scale. On the y-axis, the corresponding coefficient values are reported. For each similarity coefficient, the position of the 25 thrombin inhibitors is marked. The graphs illustrate that compound ranks significantly vary depending on the coefficient and representation used. In this example, MACCS in combination with Tv and α = 0.9 yields the largest number of thrombin inhibitors within the top 1000 database compounds (corresponding to 0.1% of the screened database). However, it is stressed that no general conclusions about the relative performance of individual coefficients and fingerprints can be drawn from a single example given the strong compound class dependence of similarity calculations (vide infra).
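The ranking effect of α described above, where two database compounds share the same percentage of features with the reference yet are ordered differently, can be reproduced with a toy search. The sketch below assumes set-based fingerprints and the normalized Tversky form of eq 11; the compounds "X" and "Y", their feature identifiers, and the function names are hypothetical and are not the thrombin inhibitors of Figure 7.

```python
def tanimoto(ref: set, db: set) -> float:
    """Tc = c / (a + b - c), following eq 1 (assumes at least one non-empty set)."""
    c = len(ref & db)
    return c / (len(ref) + len(db) - c)


def tversky_normalized(ref: set, db: set, alpha: float) -> float:
    """Tv = c / (alpha*a + (1 - alpha)*b), following eq 11."""
    c = len(ref & db)
    return c / (alpha * len(ref) + (1.0 - alpha) * len(db))


def rank(ref: set, database: dict[str, set], alpha: float) -> list[str]:
    """Return compound names ordered by decreasing normalized Tversky similarity."""
    return sorted(database,
                  key=lambda name: tversky_normalized(ref, database[name], alpha),
                  reverse=True)


reference = set(range(10))                       # 10 reference features
database = {
    # X: 8 of the reference features plus 6 additional ones (b = 14, c = 8)
    "X": set(range(8)) | set(range(100, 106)),
    # Y: only 5 of the reference features and nothing else (b = 5, c = 5)
    "Y": set(range(5)),
}

# Both compounds have the same Tanimoto similarity to the reference (0.5).
print({name: tanimoto(reference, fp) for name, fp in database.items()})

print(rank(reference, database, alpha=0.9))  # ['X', 'Y']
print(rank(reference, database, alpha=0.1))  # ['Y', 'X']
```

With α = 0.9, missing reference features are penalized strongly and X is ranked first; with α = 0.1, the additional features of the database compound dominate the denominator and the order is reversed.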

Similarity Threshold Values. Considering the global distributions of similarity values, it is of interest to derive threshold values that indicate a statistically significant level of similarity. Significance analysis of similarity values can be used, for instance, to determine if similarities between compounds sharing a property like biological activity might simply occur by chance or if compound similarity is likely to be associated with the shared property. For this purpose, conventional p-values can be calculated. For example, a Tc threshold value at a significance level of p = 0.01 would indicate a probability of 1% that the Tc value calculated for two randomly chosen compounds meets or exceeds the threshold. Threshold values can be estimated from the distribution of a large sample of similarity values obtained by randomly selecting pairs of compounds and calculating their similarity coefficient. The cumulative distribution function F(t) of the values then relates a similarity value t to the fraction of similarity values less than or equal to t, and the significance is given by p = 1 − F(t). If such threshold values are generally applicable in the context of similarity searching, i.e., if a similarity value exceeding a threshold value is a rare event and thus indicates significant similarity, they must be largely independent of the selected reference compound. It is emphasized at this point that only calculated similarity values and their statistics are considered; accounting for compound activity according to the SPP is addressed in the next subsection.
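The estimation procedure described above, sampling random pairs, building the empirical cumulative distribution F(t), and reading off p = 1 − F(t), can be sketched as follows. The fingerprints here are synthetic random bit sets generated only for illustration (they are not MACCS or ECFP4 fingerprints and not ZINC data), and all function names, the bit density, and the sample sizes are assumptions.

```python
import random
from bisect import bisect_right


def tanimoto(x: frozenset, y: frozenset) -> float:
    """Tc = c / (a + b - c); 0.0 for two empty sets is an assumed convention."""
    c = len(x & y)
    denom = len(x) + len(y) - c
    return c / denom if denom else 0.0


def random_fingerprint(rng: random.Random, n_bits: int, density: float) -> frozenset:
    """Synthetic fingerprint: each bit is switched on independently."""
    return frozenset(i for i in range(n_bits) if rng.random() < density)


def sample_similarities(fps: list, n_pairs: int, rng: random.Random) -> list[float]:
    """Sorted Tc values for randomly chosen pairs of distinct fingerprints."""
    values = []
    for _ in range(n_pairs):
        x, y = rng.sample(fps, 2)
        values.append(tanimoto(x, y))
    return sorted(values)


def p_value(sorted_values: list[float], t: float) -> float:
    """p = 1 - F(t), where F(t) is the fraction of sampled values <= t."""
    return 1.0 - bisect_right(sorted_values, t) / len(sorted_values)


def threshold_for(sorted_values: list[float], p: float) -> float:
    """Smallest sampled similarity value t with 1 - F(t) <= p."""
    for t in sorted_values:
        if p_value(sorted_values, t) <= p:
            return t
    return sorted_values[-1]  # not reached: the maximum always has p-value 0


rng = random.Random(0)  # fixed seed so that the toy run is reproducible
fingerprints = [random_fingerprint(rng, n_bits=166, density=0.3) for _ in range(500)]
values = sample_similarities(fingerprints, n_pairs=20000, rng=rng)
t_001 = threshold_for(values, p=0.01)
print(t_001, p_value(values, t_001))
```

Repeating the sampling with a different fingerprint design or coefficient would shift the resulting threshold, which is the point made in the text about thresholds not being transferable.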

Figure 8. Similarity searching using different fingerprints and similarity coefficients. By use of compound A from Figure 7 as a reference, similarity values were calculated for 1 million ZINC compounds and 25 thrombin inhibitors (including compound B from Figure 7) using the Tanimoto and Tversky (α = 0.1 and α = 0.9) coefficients and the (a) MACCS and (b) ECFP4 fingerprints. The similarity coefficient is plotted as a function of the rank (reported on a logarithmic scale). The positions of the 25 thrombin inhibitors are marked on each curve.


To illustrate the influence of different reference compounds on similarity calculations, search profiles were generated for 100 compounds randomly chosen from the ZINC database. In each case, their similarity to all remaining ZINC compounds in a ZINC sample was calculated. Not surprisingly, search profiles for individual reference compounds generally differed. On the

Figure 9. Threshold values of similarity coefficients versus significance levels. Cumulative distribution functions were generated for different similarity coefficients and two fingerprints by selecting 100 random reference compounds from the ZINC database and calculating the similarity to the remaining ZINC compounds from the subset of selected compounds according to Figure 5. The graphs on the left show the median as well as first and third quartile cumulative distribution function F(t) derived from the 100 sampled distributions. On the right, threshold values (y-axis) are shown depending on different levels of significance (x-axis) on a logarithmic scale. The median threshold values as well as the first and third quartile threshold values are reported: (a) MACCS and Tc; (b) MACCS and Tv(α=0.9); (c) ECFP4 and Tc; (d) ECFP4 and Tv(α=0.9).


basis of these profiles, a significance level (p-value), given bythe ratio of the number of compounds whose similarity valueswith respect to the reference compound exceed the giventhreshold, was assigned to every reference compound for eachthreshold value in the range 0−1. This yielded 100 curvesrelating threshold values to p-values. Figure 9 reportscumulative distribution functions and threshold values as afunction of the significance level for different similaritycoefficients with respect to the MACCS and ECFP4 finger-prints. The graphs on the left depict the median as well as firstand third quartile sampled cumulative distribution functions,while the graphs on the right report the Tc threshold value as afunction of the p-value. These graphs are obtained from thecumulative distribution function by exchanging the x- and y-axisand by plotting the p-values on a logarithmic scale in order toenhance the visual resolution for low p-values indicating highsignificance. Shown are the median threshold values and theinterquartile ranges of the thresholds obtained from the original100 curves. From the curves, it is apparent that statisticallysignificant similarity threshold values strongly depend on thefingerprint representation and the similarity coefficient that areused. In addition, there are large variations in threshold valuedepending on the reference compounds. Thus, althoughthreshold values might be associated with statistically significantsimilarity, without taking activity into account, they are nottransferable and are associated with large margins of error, dueto the dependence on reference compounds, as illustrated inFigure 9.
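The relationship between a similarity threshold and its p-value described above can be illustrated with a minimal sketch. It assumes that the search profile of one reference compound is available as a plain list of similarity values; the function names `p_value` and `threshold_for` and the toy profile are hypothetical and are not taken from the original study.

```python
def p_value(similarities, threshold):
    """Fraction of database compounds whose similarity exceeds the threshold."""
    exceeding = sum(1 for s in similarities if s > threshold)
    return exceeding / len(similarities)


def threshold_for(similarities, p):
    """Smallest observed similarity value whose p-value is at most p (p >= 0).

    This inverts the threshold-to-p-value relationship, which corresponds to
    exchanging the two axes of the cumulative distribution plot.
    """
    for t in sorted(set(similarities)):
        if p_value(similarities, t) <= p:
            return t


# Toy similarity values of ten database compounds to one reference compound
# (made-up numbers for illustration only).
profile = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

print(p_value(profile, 0.8))        # two of the ten values exceed 0.8
print(threshold_for(profile, 0.2))  # threshold reached at p = 0.2
```

Repeating this calculation for many reference compounds and taking the median and quartiles of the resulting thresholds at each p-value would give curves of the kind summarized in Figure 9.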

Do Activity-Relevant Similarity Threshold Values Exist? Although the above considerations highlight the principal limitations of similarity calculations from a statistical point of view, they do not consider similarity from the perspective of a medicinal chemist. In this case, the SPP takes center stage and raises the issue of whether calculated similarities can serve as indicators of activity similarity. This directly relates to the “0.85 myth” discussed above and represents one of the most important applications of quantitative molecular similarity analysis in medicinal chemistry.

Figure 10. Similarity value distributions for active compounds. The distribution of similarity coefficient values for compounds sharing the same activity is reported for 10 exemplary compound activity classes taken from the ChEMBL database. The boxplot representations provide quartile and median similarity values for all compound comparisons. The whiskers represent the most extreme data points within 1.5 times the interquartile range for the lower and upper quartiles, respectively. Data points falling outside this range are not shown. On the x-axis, the ChEMBL target identifiers (Ids) are provided for each class: 11, thrombin; 43, β-2 adrenergic receptor; 72, dopamine D2 receptor; 86, monoamine oxidase A; 194, coagulation factor X; 214, muscarinic acetylcholine receptor M4; 10498, cathepsin L; 11003, melanocortin receptor 3; 11060, carbonic anhydrase VII; 11627, acyl coenzyme cholesterol acyltransferase.

To address this question, similarity calculations must be carried out for compounds having different specific activities. Therefore, 10 exemplary compound activity classes were taken from ChEMBL (version 15).47 Each compound was required to have a pKi value of at least 7 for its designated target (thus limiting the analysis to potent compounds with available high-confidence activity measurements). Similarity values were then calculated for all pairs of compounds sharing the same activity. The results of these calculations are reported in Figure 10. Regardless of the fingerprint representations and similarity coefficients used, the observed similarity value distributions for active compounds strongly depended on the compound activity class. For example, median MACCS Tc values varied from ∼0.3 to ∼0.75 depending on the class. As shown in Figure 9a, a MACCS Tc threshold value of ∼0.65 corresponds to a statistically significant similarity at the level of p = 0.01. It follows that most compounds active against a given target yielded similarity values that varied greatly and were not statistically significant. Equivalent observations were made for all combinations of fingerprints and similarity coefficients. These findings illustrate that generally applicable similarity threshold values as a potential indicator of activity similarity do not exist. Such values also cannot be derived with any certainty for individual compound classes, as revealed by the variability of similarity values and lack of general statistical significance.
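The pairwise comparison of compounds within one activity class can be sketched as follows. This is a minimal illustration, assuming that each fingerprint is represented as a Python set of "on" bit positions; the helper names and the three toy fingerprints are hypothetical and do not reproduce the ChEMBL analysis.

```python
from itertools import combinations
from statistics import median


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pairwise_similarities(fingerprints):
    """Tc values for all pairs of compounds in one activity class."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


def intra_class_median(fingerprints):
    """Median Tc over all compound pairs sharing the same activity."""
    return median(pairwise_similarities(fingerprints))


def fraction_at_or_above(fingerprints, threshold):
    """Fraction of compound pairs that meet or exceed a Tc threshold."""
    sims = pairwise_similarities(fingerprints)
    return sum(s >= threshold for s in sims) / len(sims)


# Toy activity class with three compounds (made-up bit sets).
toy_class = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}]

print(intra_class_median(toy_class))
print(fraction_at_or_above(toy_class, 0.65))
```

Comparing the fraction of intra-class pairs above a statistically derived threshold across several classes is one simple way to see how strongly such a threshold depends on the compound class.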

■ PRACTICAL CONSIDERATIONS

Calculated similarities, regardless of how we perceive them from a medicinal chemistry perspective, strongly depend on the compound classes under study as well as the molecular representations (descriptors) and similarity measures used.48,49 If multiple reference compounds are employed, the results of similarity calculations must be combined in some way, typically through the application of data fusion techniques,50 which further complicates matters. The results discussed above illustrate that calculated similarity values do not enable us to relate molecular and activity similarity to each other in a meaningful way and that it is impossible at present to establish generally applicable threshold values indicating that two compounds share the same activity. Does all this mean that similarity calculations have no utility in medicinal chemistry? The answer is no. The key issue is to understand what similarity calculations can and cannot provide. As long as one believes that the magnitude of computed similarity measures has immediate implications for activity and that their values scale, in one way or another, with a probability of activity, little can be expected. Meaningful applications of similarity calculations can, however, be considered if one is aware of these limitations.

Computed Similarities on a Relative Scale. One of the major applications of similarity analysis is ligand-based virtual screening, where one or more active reference compounds are used to search databases to identify other compounds with similar structures and, by the SPP, hopefully, with similar activities.4−7 Such searches can be carried out on the basis of local or global similarity methods. For example, pharmacophore searching51 is based on local similarity and attempts to identify all database compounds that match a predefined pharmacophore query, regardless of the remaining substructures. Such calculations can be carried out to identify structurally diverse compounds having similar activities, a procedure commonly referred to as scaffold hopping.52 The horizontal compound relationship in Figure 2 represents an example of a scaffold hop. Pharmacophore searching typically produces a “pass−fail” readout and identifies a set of compounds that match the query. It is assumed that close resemblance of pharmacophore elements corresponds to a high probability that reference and database compounds share the same activity.

In contrast to pharmacophore-based searching, fingerprint similarity searching, which is based on a whole-molecule assessment of similarity, does not require pharmacophore hypotheses or specific knowledge about activity-relevant features of compounds. It is applicable when very little is known except the activities of the reference compound(s) used in the search. No activity information associated with specific substructural features in the fingerprints is required, only the assumption that the SPP is applicable. Importantly, similarity searching produces a ranking of database compounds in the order of decreasing computed similarity values relative to the reference compound(s). In this case, absolute similarity values are not relevant except on a relative scale for ranking of compounds.

A database ranking starts with compounds that are most similar to the reference compound(s), typically closely related analogues, and as we proceed further down the ranking, database compounds become increasingly dissimilar but might nonetheless be active. In a study designed to assess the scaffold hopping potential of similarity searching,53 it has been shown that Tc threshold values indicating a significant enrichment of structurally diverse active compounds in database rankings cannot be determined. Figure 11 summarizes search results for MACCS and ECFP4 over different activity classes.53 Average Tc values are reported for the fraction of active database compound series with distinct scaffolds for which at least one active compound was detected. Error bars are shown for Tc values at which compounds represented by 25%, 50%, or 75% of all “active scaffolds” were detected. The numbers at the error bars indicate the median ranks of these active compounds. For example, to detect active compounds for 25% of all available scaffolds, ∼1% (5488) of all database compounds had to be selected on average for MACCS and ∼0.5% (2360) for ECFP4. The large error bars indicate that it was not possible to define Tc threshold values for the retrieval of structurally diverse active compounds across different compound classes. In essentially all calculations, however, a few active compounds with scaffolds different from the reference compounds were found at relatively high rank positions, as shown in Figure 12. Thus, the calculations show that scaffold hops can be detected, although large numbers of other database compounds had to be selected to achieve a significant scaffold recall of 25% or more. These findings illustrate the resolution limits of whole-molecule similarity searching. Nevertheless, similarity searching is relevant for many practical applications.

Figure 11. Average Tc threshold values for scaffold recall rates. For MACCS (blue) and ECFP4 (red), the average Tc threshold value required to achieve a specified scaffold recall rate is reported. The variations of these Tc values across all trials are reported as error bars for recall rates of 25%, 50%, and 75%. Numbers next to the error bars give the median database selection set size for which the recall rate is achieved. The figure was adapted from ref 53.

Figure 12. Early enrichment of active compounds with different scaffolds. Two exemplary reference compounds and a set of active compounds having different scaffolds are shown that were found in the 100 top-ranked database compounds (individual ranks are reported). At the top, κ opioid receptor ligands are shown and at the bottom human immunodeficiency virus type 1 protease inhibitors.

The attractiveness of similarity-based compound rankings in medicinal chemistry is that they provide a continuum of compound similarity relationships that can be intuitively assessed. Although we do not know precisely where active compounds with different scaffolds might be found in similarity-based ranked lists, inspecting the rankings enables compound selection on the basis of chemical intuition and experience. In this case, the chemical informatics and medicinal chemistry perspectives meet.

Figure 13. Compound ranks in virtual screening. Results from virtual screening trials are shown leading to the identification of new inhibitors of the Sec-7 domain of cytohesins (Secin 16, 87, and 144). Two reference compounds are shown. For each of the three hits, rank positions are reported for four alternative search strategies including support vector machine (SVM) calculations with two fingerprints (FP 1 and FP 2) as descriptors as well as similarity searching with two fingerprints using a single reference compound (FP 1, reference 1 and FP 3, reference 2).

Figure 13 shows the results of a practical virtual screening application54 that exemplifies the opportunities of similarity searching. The study was designed to identify new inhibitors of cytohesins,55 a family of small guanine nucleotide exchange factors, by virtually screening a large compound database containing 3.7 million compounds. Three newly discovered structurally diverse inhibitors54 and their database ranks produced by four related yet distinct search strategies are reported. The positions of the inhibitors in the database rankings show a remarkable spread. Two of these active compounds were highly ranked by one search strategy (ranks 7 and 35, respectively) but vanished in the database background when the others were applied. The highest rank obtained for the third inhibitor was 354, and this compound could only be selected on the basis of visual inspection of rankings and intuition because a total of only 145 compounds taken from different rankings were experimentally tested.54
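The ranking-based use of similarity searching discussed in this section can be sketched in a few lines. The example assumes set-based fingerprints and, for multiple reference compounds, combines the scores by taking the maximum over all references, which is only one of several possible data fusion rules and not necessarily the one used in the cited studies. All names and the toy database are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_database(references, database):
    """Rank database compounds by decreasing similarity to the references.

    Each compound is scored by its highest Tc to any reference compound
    (maximum fusion, an assumption of this sketch). Ties are broken by name.
    """
    scored = []
    for name, fingerprint in database.items():
        score = max(tanimoto(ref, fingerprint) for ref in references)
        scored.append((name, score))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Toy reference compounds and database (made-up bit sets).
references = [{1, 2, 3, 4}, {7, 8, 9}]
database = {"a": {1, 2, 3, 4}, "b": {1, 2, 3}, "c": {1, 2, 5, 6}, "d": {7, 8}}

ranking = rank_database(references, database)
for rank, (name, score) in enumerate(ranking, start=1):
    print(rank, name, round(score, 3))
```

Only the order of the compounds is used downstream; the absolute scores serve to sort the list and carry no activity-related meaning on their own, which is the point made in the text.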

Nearest Neighbor Analysis. Another application of similarity calculations that is relevant to medicinal chemistry and is also independent of absolute similarity values is the mapping of the chemical neighborhood of compounds. Similarity calculations can easily retrieve the k nearest neighbors, i.e., the k most similar compounds, to a given compound from any collection.50 The similarity radius, i.e., the range of similarity values considered with respect to a specific reference compound, can be easily adjusted, thereby increasing or decreasing the number of compounds for inspection. Such nearest neighbor calculations enable chemical interpretation of limited numbers of similarity relationships and are useful, for example, in support of hit expansion studies or in the generation of focused compound libraries. Since the mapping of chemical neighborhoods does not require sophisticated molecular representations, simple fragment-based fingerprints can be used effectively.
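Both ways of delimiting a chemical neighborhood mentioned above, a fixed number of neighbors and an adjustable similarity radius, can be sketched as follows. The sketch assumes set-based fingerprints; the function names and the toy collection are hypothetical.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def k_nearest(query, collection, k):
    """Return the k most similar compounds as (name, Tc), most similar first."""
    scored = ((tanimoto(query, fp), name) for name, fp in collection.items())
    return [(name, score) for score, name in heapq.nlargest(k, scored)]


def within_radius(query, collection, min_similarity):
    """Names of all compounds whose Tc to the query is at least min_similarity."""
    return sorted(
        name
        for name, fp in collection.items()
        if tanimoto(query, fp) >= min_similarity
    )


# Toy query and compound collection (made-up bit sets).
query = {1, 2, 3, 4}
collection = {"a": {1, 2, 3, 4}, "b": {1, 2, 3}, "c": {1, 2, 5, 6}, "d": {7, 8}}

print(k_nearest(query, collection, 2))
print(within_radius(query, collection, 0.3))  # wider radius, more compounds
print(within_radius(query, collection, 0.7))  # narrower radius, fewer compounds
```

Lowering `min_similarity` widens the neighborhood and raising it narrows the set for inspection, without attaching any activity-related meaning to the cutoff itself.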

Rendering Fingerprint Calculations Comparable. Although activity-relevant similarity threshold values do not exist, it is possible to determine corresponding Tc or other related coefficient values for different fingerprints that are met or exceeded by the same proportion of compound pairs in large databases. For example, in systematic similarity-based search calculations on 128 compound data sets taken from ChEMBL, 12% of all possible compound pairs reached or exceeded a MACCS Tc value of 0.70.56 The same proportion of compound pairs was obtained for an ECFP4 Tc of 0.31, thus establishing an approximate correspondence of these Tc values for the two fingerprints. Following this approach, it is possible to map corresponding Tc values for different fingerprints that select the same percentage of compound pairs.56 Such correspondences depend to some extent on the composition of the compound collection under study. Figure 14 reports correspondence between MACCS Tc and ECFP4 Tc values established on the basis of the randomly sampled distributions shown in Figure 6. Selected points on the curve are highlighted that correspond to certain fractions of compound comparisons meeting or exceeding the corresponding MACCS Tc (x-axis) or ECFP4 Tc (y-axis) values. The curve illustrates the representation dependence of similarity values. Furthermore, it provides a guideline for assessing the significance of similarity values obtained for a newly introduced fingerprint on the basis of a standard fingerprint (such as MACCS) with which many investigators are familiar.
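The mapping of corresponding threshold values between two fingerprints can be sketched as a simple quantile-matching step. The example assumes that the Tc values of the same set of compound pairs have already been calculated with both fingerprints; the function names and all numbers are made up for illustration and are not the values reported in ref 56.

```python
def fraction_meeting(values, threshold):
    """Fraction of compound pairs whose similarity meets or exceeds threshold."""
    return sum(v >= threshold for v in values) / len(values)


def threshold_at_fraction(values, fraction):
    """Largest observed value met or exceeded by about `fraction` of all pairs."""
    ordered = sorted(values, reverse=True)
    count = max(1, round(fraction * len(ordered)))
    return ordered[count - 1]


def corresponding_threshold(values_a, threshold_a, values_b):
    """Threshold for fingerprint B selecting the same share of pairs as
    threshold_a does for fingerprint A."""
    return threshold_at_fraction(values_b, fraction_meeting(values_a, threshold_a))


# Toy Tc distributions for two hypothetical fingerprints A and B.
values_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
values_b = [0.6, 0.45, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.02]

print(fraction_meeting(values_a, 0.7))
print(corresponding_threshold(values_a, 0.7, values_b))
```

Because the mapping is derived from empirical distributions, it inherits their dependence on the compound collection, as noted in the text.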

Figure 14. Corresponding Tc values for MACCS and ECFP4. Distributions of MACCS and ECFP4 Tc values were determined by conducting 10 million comparisons between randomly selected ZINC compounds (according to Figure 6). Correspondence between MACCS and ECFP4 Tc values was established by relating those Tc values to each other that were met or exceeded by the same percentage of comparisons (indicated as labeled points on the curve).

Dissimilarity Selection. The selection of compounds that are most dissimilar to those of an existing collection has a long history in compound acquisition in the pharmaceutical industry.57,58 It is also a meaningful application of similarity/dissimilarity calculations. In this case, the interest is in the extreme values of a distribution of similarity values, not the largest ones as in the case of nearest neighbor analysis but rather the smallest ones because of the complementary relationship between similarities and dissimilarities. Different algorithms have been developed for dissimilarity selection.57,58 Regardless of their specific details, many of these methods are based upon pairwise similarity calculations of library and external candidate compounds.
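As an illustration of how pairwise similarity calculations between library and candidate compounds can drive dissimilarity selection, the following sketch uses a greedy MaxMin-style procedure. This is one commonly used variant chosen here for brevity, not necessarily the algorithm of refs 57 and 58; set-based fingerprints, the function names, and the toy data are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_dissimilar(library, candidates, n_select):
    """Greedily pick candidates that are least similar to the library.

    In each round, the candidate whose highest Tc to any already covered
    compound (library plus earlier picks) is smallest gets selected.
    Ties are broken alphabetically by candidate name.
    """
    covered = list(library)
    pool = dict(candidates)
    selected = []
    while pool and len(selected) < n_select:
        best = min(
            sorted(pool),
            key=lambda name: max(
                (tanimoto(pool[name], fp) for fp in covered), default=0.0
            ),
        )
        covered.append(pool.pop(best))
        selected.append(best)
    return selected


# Toy in-house library and external candidates (made-up bit sets).
library = [{1, 2, 3, 4}]
candidates = {"x": {1, 2, 3}, "y": {5, 6, 7}, "z": {5, 6, 8}}

print(select_dissimilar(library, candidates, 2))
```

Adding each pick to the covered set keeps later selections dissimilar not only to the library but also to one another.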

■ CONCLUDING REMARKS

The present review provides an overview of the foundations of molecular similarity analysis and describes a number of different similarity-based concepts relevant to medicinal chemistry. As is well-known, the principal difficulty associated with similarity analysis is that similarity itself is an inherently subjective concept so that absolute standards do not exist. Nonetheless, a wide variety of computational approaches have been developed in an attempt to account for molecular similarity in a formally consistent and unbiased manner. Although this may be a daunting task, it remains a critically important endeavor because of the power that the concept of molecular similarity brings to the practice of chemistry in general and to medicinal chemistry in particular. Long before computational methods for treating molecular similarity were developed, chemists employed similarity in a number of areas of chemistry, a particularly noteworthy example being the development of the periodic table.59

The similarity concept provides a framework, albeit an imperfect one, for assessing the similarity of compounds, which is one of the central tasks in medicinal chemistry. Since an individual’s capacity to judge similarity relationships is limited to fairly small numbers of relatively simple compounds, computational approaches are indeed essential in modern medicinal chemistry, despite their limitations. This raises a key issue, namely, how medicinal chemists perceive molecular similarity and how this perception relates to similarity evaluated computationally. A brief discussion is provided here describing some of the cognitive aspects of similarity perception and its strong association with human pattern recognition and reduction because they affect the subjective decisions of medicinal chemists. Clearly, similarity considerations strongly influence which compounds are made, and these compounds then essentially reflect our views of similarity. This might often limit the spectrum of compounds that are considered and prevent the exploration of chemically unusual ones that fall outside our similarity perception. On the other hand, for many therapeutic targets there is a large number of structurally diverse active compounds available,60 a knowledge base that is more often considered in chemical informatics than in medicinal chemistry.

Given the medicinal chemistry focus of our presentation, we have based our methodological considerations on 2D similarity calculations. However, from a computational perspective, 3D similarity methods are of course equally relevant.61,62 Regardless of the methods used, however, 3D similarity assessment in drug design remains affected by the uncertainties associated with extrapolating from computed to often unknown bioactive compound conformations.

Without doubt, similarity is often viewed differently in chemical informatics and medicinal chemistry. This is exemplified by global and local comparisons of compounds. We have rationalized that similarity relationships might fundamentally change depending on whether one applies a whole-molecule view, as is often done in chemical informatics, or focuses on pharmacophores or functional groups (i.e., local molecular information), as is typically the case in medicinal chemistry. Furthermore, the modeling of activity landscapes, which integrates compound similarity and activity relationships,63 often leads to rather different interpretations by computational and medicinal chemists when calculated similarity values are used. Moreover, attempts to interpret calculated similarity values and differences between them in structural terms might often cause confusion. Similarity calculations are nevertheless of considerable interest in medicinal chemistry. In addition, because human assessment depends significantly on the knowledge and experience of medicinal chemists, it is not surprising that calculated similarity values are often seen as an attractive means of decision support. However, we also note that many similarity search and benchmark studies reported in the computational literature lack proper statistical assessment, which complicates the comparison and interpretation of calculated similarity values.

Probably the largest conceptual roadblock to computational similarity analysis is that the quantitation of chemical or molecular similarity is generally not of interest per se but rather the extrapolation from calculated similarity values to other molecular properties, in particular, biological activity. There are no well-defined relationships between calculated similarity and activity similarity and no similarity threshold values that reliably indicate whether a test compound shares the activity of a reference compound, a situation that is further confounded by the presence, albeit rare, of activity cliffs.10,11 These issues frequently give rise to misunderstandings in medicinal chemistry. Moreover, similarity calculations are strongly dependent on compound classes, molecular representations, and similarity measures, which complicates their interpretation and practical application. If one is aware of these caveats, computational similarity analysis provides a number of meaningful and useful medicinal chemistry-relevant applications. For example, similarity calculations often aid in compound selection if the focus is not on the absolute magnitude of the similarity values but rather on their relative magnitudes, which determine the ranking of compounds and are decidedly more robust to differences in similarity values that arise from the use of different similarity measures.

Despite current limitations, computational similarity analysis has its place in drug development, if applied in a considered manner, to complement and further expand medicinal chemists’ perception(s) of molecular similarity. For fundamental reasons, it is not possible to eliminate subjective elements from similarity assessments, which puts strong emphasis on the careful interpretation of computational results. The development of computational similarity methods with reduced compound class dependence will be an important topic for future research. In addition, the exploration of new concepts to account for biological similarity of small compounds will be equally attractive.

■ AUTHOR INFORMATION

Corresponding Authors

*G.M.: phone, 520-405-4736; e-mail, [email protected].

*J.B.: phone, 49-228-2699-306; e-mail, [email protected].

The authors declare no competing financial interest.

Biographies

Gerald Maggiora studied chemistry and biophysics at the University of California at Davis, earning a Ph.D. in biophysics. He spent more than 20 years as Professor of Chemistry and Biochemistry, University of Kansas, and Professor of Pharmaceutical Sciences, University of Arizona. He spent an equal amount of time in the pharmaceutical industry as a Director of Computer-Aided Drug Discovery and Senior Research Scientist. His interests include molecular and mathematical modeling, scientific applications of computer-aided decision making, drug design, and applications of fuzzy mathematics and rough set theory to biological and medical problems. For more than 2 decades he has focused on chemical informatics and molecular similarity. In 2008 he received the Herman Skolnik Award, Division of Chemical Information of the American Chemical Society.

Martin Vogt studied mathematics and computer science at the University of Bonn, Germany, and holds a degree in computer science. He currently is a Research Associate in the Department of Life Science Informatics at the University of Bonn where he also completed his doctoral thesis on Bayesian methods for virtual screening under the guidance of Prof. Jürgen Bajorath. Previously, he was employed at the Fraunhofer Institute for Applied Information Technology (FIT) where he worked on image recognition algorithms for bioinformatics applications. His research interests include algorithmic method development in chemoinformatics, especially focusing on data mining and machine learning methods.

Dagmar Stumpfe studied biology at the University of Bonn, Germany. In 2006, she joined the Department of Life Science Informatics at the University of Bonn headed by Prof. Jürgen Bajorath for her Ph.D. thesis, where she worked on methods for computer-aided chemical biology with a focus on the exploration of compound selectivity. Since 2009, Dagmar has been working as a Postdoctoral Fellow in the department, and her current research interests include computational chemical biology and large-scale structure−activity relationship analysis.

Jürgen Bajorath studied biochemistry at the Free University, Berlin. Beginning with postdoctoral studies in San Diego, CA, he spent more than 15 years in the United States. He currently is Professor and Chair of Life Science Informatics at the University of Bonn, Germany. He is also an Affiliate Professor in the Department of Biological Structure at the University of Washington, Seattle. His research interests include drug discovery, computer-aided medicinal chemistry and chemical biology, and chemoinformatics (http://www.lifescienceinformatics.uni-bonn.de).

■ ACKNOWLEDGMENTS

D.S. is supported by Sonderforschungsbereich 704 of the Deutsche Forschungsgemeinschaft.

■ ABBREVIATIONS USED

COX, cyclooxygenase; Dc, Dice coefficient; ECFP, extended-connectivity fingerprint; FP, fingerprint; HSL, hormone-sensitive lipase; MACCS, molecular access system; SAR, structure−activity relationship; Sg, Soergel distance; SPP, similarity property principle; SVM, support vector machine; Tc, Tanimoto coefficient; Tv, Tversky coefficient; 2D, two-dimensional; 3D, three-dimensional

■ REFERENCES

(1) Bender, A.; Glen, R. B. Molecular Similarity: A Key Technique in Molecular Informatics. Org. Biomol. Chem. 2004, 2, 3204−3218.
(2) Medina-Franco, J. L.; Maggiora, G. M. Molecular Similarity Analysis. In Chemoinformatics for Drug Discovery; Bajorath, J., Ed.; John Wiley and Sons: Hoboken, NJ, in press.
(3) Kubinyi, H. Similarity and Dissimilarity: A Medicinal Chemist’s View. Perspect. Drug Discovery Des. 1998, 9−11, 225−232.
(4) Eckert, H.; Bajorath, J. Molecular Similarity Analysis in Virtual Screening: Foundations, Limitations and Novel Approaches. Drug Discovery Today 2007, 12, 225−233.
(5) Koeppen, H. Virtual Screening - What Does It Give Us? Curr. Opin. Drug Discovery Dev. 2009, 12, 397−407.
(6) Willett, P. Similarity-Based Virtual Screening Using 2D Fingerprints. Drug Discovery Today 2006, 11, 1046−1053.
(7) Stumpfe, D.; Bajorath, J. Similarity Searching. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2011, 1, 260−282.
(8) Johnson, M.; Maggiora, G. M., Eds. Concepts and Applications of Molecular Similarity; John Wiley & Sons: New York, 1990.
(9) Maggiora, G. M. On Outliers and Activity Cliffs - Why QSAR Often Disappoints. J. Chem. Inf. Model. 2006, 46, 1535−1535.
(10) Stumpfe, D.; Bajorath, J. Exploring Activity Cliffs in Medicinal Chemistry. J. Med. Chem. 2012, 55, 2932−2942.
(11) Stumpfe, D.; Hu, Y.; Dimova, D.; Bajorath, J. Recent Progress in Understanding Activity Cliffs and their Utility in Medicinal Chemistry. J. Med. Chem. [Online early access]. DOI: 10.1021/jm401120g. Published Online: Aug 27, 2013.
(12) Raymond, J. W.; Willett, P. Maximum Common Subgraph Isomorphism Algorithms for the Matching of Chemical Structures. J. Comput.-Aided Mol. Des. 2002, 16, 521−533.
(13) MACCS Structural Keys; Accelrys: San Diego, CA.
(14) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(15) Good, A. C.; Richards, W. G. Explicit Calculation of 3D Molecular Similarity. Perspect. Drug Discovery Des. 1998, 9−11, 321−338.
(16) Rush, T. S.; Grant, J. A.; Mosyak, L.; Nicholls, A. A Shape-Based 3-D Scaffold Hopping Method and Its Application to a Bacterial Protein−Protein Interaction. J. Med. Chem. 2005, 48, 1489−1495.
(17) Brown, R. D.; Martin, Y. C. The Information Content of 2D and 3D Structural Descriptors Relevant to Ligand−Receptor Binding. J. Chem. Inf. Model. 1997, 37, 1−9.
(18) McGaughey, G. B.; Sheridan, R. P.; Bayly, C. I.; Culberson, J. C.; Kreatsoulas, C.; Lindsley, S.; Maiorov, V.; Truchon, J.-F.; Cornell, W. D. Comparison of Topological, Shape, and Docking Methods in Virtual Screening. J. Chem. Inf. Model. 2007, 47, 1504−1519.
(19) Fliri, A.; Loging, W.; Thadeio, P. F.; Volkmann, R. Biological Spectra Analysis: Linking Biological Activity Profiles to Molecular Structure. Proc. Natl. Acad. Sci. U.S.A. 2005, 102, 261−266.
(20) Petrone, P. M.; Simms, B.; Nigsch, F.; Lounkine, E.; Kutchukian, P.; Cornett, A.; Deng, Z.; Davies, J. W.; Jenkins, J. L.; Glick, M. Rethinking Molecular Similarity: Comparing Compounds on the Basis of Biological Activity. ACS Chem. Biol. 2012, 7, 1399−1409.
(21) Hu, Y.; Bajorath, J. Compound Promiscuity: What Can We Learn from Current Data? Drug Discovery Today 2013, 18, 644−650.
(22) Duda, R. O.; Hart, P. E.; Stork, D. G. Pattern Classification; Wiley: New York, 2001.
(23) Bishop, C. M. Pattern Recognition and Machine Learning; Springer: Berlin, 2006.
(24) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 1998, 38, 983−996.
(25) Maggiora, G. M.; Shanmugasundaram, V. Molecular Similarity Measures. Methods Mol. Biol. 2004, 275, 1−50.
(26) Takaoka, Y.; Endo, Y.; Yamanobe, S.; Kakinuma, H.; Okubo, T.; Shimazaki, Y.; Ota, T.; Sumiya, S.; Yoshikawa, K. Development of a Method for Evaluating Drug-likeness and Ease of Synthesis Using a Dataset in Which Compounds Are Assigned Scores Based on Chemists’ Intuition. J. Chem. Inf. Comput. Sci. 2003, 43, 1269−1275.
(27) Lajiness, M. S.; Maggiora, G. M.; Shanmugasundaram, V. Assessment of the Consistency of Medicinal Chemists in Reviewing Sets of Compounds. J. Med. Chem. 2004, 47, 4891−4896.
(28) Kutchukian, P. S.; Vasilyeva, N. Y.; Xu, J.; Lindvall, M. K.; Dillon, M. P.; Glick, M.; Cooley, J. D.; Brooijmans, N. Inside the Mind of a Medicinal Chemist: The Role of Human Bias in Compound Prioritization during Drug Discovery. PLoS One 2012, 7, e48476.
(29) Gasteiger, J.; Teckentrup, A.; Terfloth, L.; Spycher, S. Neural Networks as Data Mining Tools in Drug Design. J. Phys. Org. Chem. 2003, 16, 232−245.
(30) Burges, C. J. C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Discovery 1998, 2, 121−167.
(31) Rusinko, A., III; Farmen, M. W.; Lambert, C. G.; Brown, P. L.; Young, S. S. Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning. J. Chem. Inf. Comput. Sci. 1999, 39, 1017−1026.
(32) Auer, J.; Bajorath, J. Emerging Chemical Patterns: A New Methodology for Molecular Classification and Compound Selection. J. Chem. Inf. Model. 2006, 46, 2502−2514.
(33) Tanimoto, T. T. IBM Internal Report; IBM Corporation: Armonk, NY, Nov 17, 1957.
(34) Tversky, A. Features of Similarity. Psychol. Rev. 1977, 84, 327−352.
(35) Flower, D. R. On the Properties of Bit String-Based Measures of Chemical Similarity. J. Chem. Inf. Comput. Sci. 1998, 38, 379−386.
(36) Wang, Y.; Eckert, H.; Bajorath, J. Apparent Asymmetry in Fingerprint Similarity Searching Is a Direct Consequence of Differences in Bit Densities and Molecular Size. ChemMedChem 2007, 2, 1037−1042.
(37) Fligner, M.; Verducci, J.; Blower, P. A Modification of the Jaccard−Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings. Technometrics 2002, 44, 110−119.
(38) Wang, Y.; Bajorath, J. Advanced Fingerprint Methods for Similarity Searching: Balancing Molecular Size Effects. Comb. Chem. High Throughput Screening 2010, 13, 220−228.
(39) Nisius, B.; Bajorath, J. Rendering Conventional Molecular Fingerprints for Virtual Screening Independent of Molecular Complexity and Size Effects. ChemMedChem 2010, 5, 859−868.
(40) Becker, J. T.; Morris, R. G. Working Memory(s). Brain Cognit. 1999, 41, 1−8.
(41) Cowan, N. What Are the Differences between Long-Term, Short-Term, and Working Memory? Prog. Brain Res. 2008, 169, 323−338.
(42) Hodgetts, C. J.; Hahn, U. Similarity-Based Asymmetries in Perceptual Matching. Acta Psychol. 2012, 139, 291−299.
(43) Patterson, D. E.; Cramer, R. D.; Ferguson, A. M.; Clark, R. D.; Weinberger, L. E. Neighborhood Behavior: A Useful Concept for Validation of Molecular Diversity Descriptors. J. Med. Chem. 1996, 39, 3049−3059.
(44) Martin, Y. C.; Kofron, J. L.; Traphagen, L. M. Do Structurally Similar Compounds Have Similar Biological Activity? J. Med. Chem. 2002, 45, 4350−4358.
(45) Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: A Free Tool To Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757−1768.
(46) Molecular Operating Environment (MOE); Chemical Computing Group Inc.: Montreal, Quebec, Canada.
(47) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2011, 40, D1100−D1107.
(48) Bender, A. How Similar Are Those Molecules After All? Use Two Descriptors and You Will Have Three Different Answers. Expert Opin. Drug Discovery 2010, 5, 1141−1151.
(49) Sheridan, R. P. Similarity Searching: When Is Complexity Justified? Expert Opin. Drug Discovery 2007, 2, 423−430.
(50) Willett, P. Combinations of Similarity Rankings Using Data Fusion. J. Chem. Inf. Model. 2013, 53, 1−10.
(51) Mason, J. S.; Good, A. C.; Martin, E. J. 3-D Pharmacophores in Drug Discovery. Curr. Pharm. Des. 2001, 7, 567−597.
(52) Renner, S.; Schneider, G. Scaffold-Hopping Potential of Ligand-Based Similarity Concepts. ChemMedChem 2006, 1, 181−185.
(53) Vogt, M.; Stumpfe, D.; Geppert, H.; Bajorath, J. Scaffold Hopping Using Two-Dimensional Fingerprints: True Potential, Black Magic, or a Hopeless Endeavor? Guidelines for Virtual Screening. J. Med. Chem. 2010, 53, 5707−5715.
(54) Stumpfe, D.; Bill, A.; Novak, N.; Loch, G.; Blockus, H.; Geppert, H.; Becker, T.; Hoch, M.; Schmitz, A.; Kolanus, W.; Famulok, M.; Bajorath, J. Targeting Multi-Functional Proteins by Virtual Screening: Structurally Diverse Cytohesin Inhibitors with Differentiated Biological Functions. ACS Chem. Biol. 2010, 5, 839−849.
(55) Kolanus, W. Guanine Nucleotide Exchange Factors of the Cytohesin Family and Their Roles in Signal Transduction. Immunol. Rev. 2007, 218, 102−113.
(56) Dimova, D.; Stumpfe, D.; Bajorath, J. Quantifying the Fingerprint Descriptor Dependence of Structure−Activity Relationship Information on a Large Scale. J. Chem. Inf. Model. 2013, 53, 2275−2281.
(57) Lajiness, M. S. Dissimilarity-Based Compound Selection Techniques. Perspect. Drug Discovery Des. 1997, 7−8, 65−84.
(58) Gillet, V. J. Diversity Selection Algorithms. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2011, 1, 580−589.
(59) Rouvray, D. H. The Evolution of the Concept of Molecular Similarity. In Concepts and Applications of Molecular Similarity; Johnson, M., Maggiora, G. M., Eds.; John Wiley & Sons: New York, 1990; pp 15−42.
(60) Hu, Y.; Bajorath, J. Global Assessment of Scaffold Hopping Potential for Current Pharmaceutical Targets. Med. Chem. Commun. 2010, 1, 339−344.
(61) Moffat, K.; Gillet, V. J.; Whittle, M.; Bravi, G.; Leach, A. R. A Comparison of Field-Based Similarity Searching Methods: CatShape, FBSS, and ROCS. J. Chem. Inf. Model. 2008, 48, 719−729.
(62) Tresadern, G.; Bemporad, D. Modeling Approaches for Ligand-Based 3D Similarity. Future Med. Chem. 2010, 2, 1547−1561.
(63) Wassermann, A. M.; Wawer, M.; Bajorath, J. Activity Landscape Representations for Structure−Activity Relationship Analysis. J. Med. Chem. 2010, 53, 8209−8223.

Journal of Medicinal Chemistry Perspective

dx.doi.org/10.1021/jm401411z | J. Med. Chem. 2014, 57, 3186−3204