
Research Collection

Doctoral Thesis

Receptor-Based Pharmacophores in Dynamic Protein Models

Author(s): Kunze, Jens

Publication Date: 2015

Permanent Link: https://doi.org/10.3929/ethz-a-010551034

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library


DISS. ETH NO. 22741

Receptor-Based Pharmacophores in Dynamic Protein Models

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH ZURICH

(Dr. sc. ETH Zurich)

presented by

JENS KUNZE

Dipl.-Bioinf., Goethe University Frankfurt am Main

born on 17.11.1986

citizen of Germany

accepted on the recommendation of

Prof. Dr. Gisbert Schneider, examiner

Prof. Dr. Gerd Folkers, co-examiner

2015


To my family.


Table of Contents

Abbreviations
Summary
Zusammenfassung

1 Introduction
1.1 Virtual screening
1.1.1 Ligand-based virtual screening
1.1.2 Receptor-based virtual screening
1.2 Pocket detection and comparison
1.2.1 Geometry-based
1.2.2 Energy-based
1.2.3 Sequence-based
1.2.4 Template-based
1.2.5 Characterization of a binding site
1.2.6 Binding site comparison
1.2.7 Binding pocket flexibility
1.3 Pharmacophores, shape and alignments
1.3.1 Molecular alignments and shape comparison
1.3.2 Pharmacophore searches
1.4 X-ray crystallography
1.5 Goals of this thesis

2 Materials and methods

3 Results
3.1 Introduction of the RBVS tool based on a showcase
3.1.1 Parameter file
3.1.2 Initiation
3.1.3 Pocket adjustment
3.1.4 Potential pharmacophore point definition
3.1.5 Potential pharmacophore point hotspots
3.1.6 Pharmacophore descriptor calculation
3.2 Retrospective analysis for the DUD-E database
3.3 Prospective studies
3.3.1 HIV-1 protease
3.3.2 A. thaliana IspD

4 Discussion

5 Conclusion and outlook

6 Acknowledgments

7 References

8 Appendix
Appendix I: Feature definition file for ligand-based pharmacophores
Appendix II: Parameter file for the showcase of the pharmacophore search workflow
Appendix III: Lipophilic cut-off values for the DUD-E evaluations
Appendix IV: Geometric interaction rules for PPP description

Abbreviations

1D             One-Dimensional
2D             Two-Dimensional
3D             Three-Dimensional
AIDS           Acquired Immunodeficiency Syndrome
ANN            Artificial Neural Networks
ATIspD         Arabidopsis thaliana IspD
AUC            Area Under the Curve
BEDROC         Boltzmann-Enhanced Discrimination of ROC
CADD           Computer Assisted Drug Design
CASP           Critical Assessment of protein Structure Prediction
CATS           Chemically Advanced Template Search
CDD            Cambridge Crystallographic Database
CDE-ME         4-Diphosphocytidyl-2C-methyl-D-erythritol
CoMFA          Comparative Molecular Field Analysis
CTP            Cytidine Triphosphate
DUD-E          Database of Useful Decoys: Enhanced
EF             Enrichment Factor
FDeF           Feature Definition File
FEP            Free Energy Perturbation
FLAP           Fingerprints for Ligands And Proteins
FP             False Positive
GASP           Genetic Algorithm Superposition Program
GP             Gaussian Process
GPCR           G Protein-Coupled Receptor
HAART          Highly Active Antiretroviral Therapy
HBA            Hydrogen-Bridge Acceptor
HBD            Hydrogen-Bridge Donor
HIV-1          Human Immunodeficiency Virus 1
HTS            High-Throughput Screening
IspD           4-Diphosphocytidyl-2C-methyl-D-erythritol synthase
ITC            Isothermal Titration Calorimetry
IUPAC          International Union of Pure and Applied Chemistry
KNIME          Konstanz Information Miner
LBVS           Ligand-Based Virtual Screening
LIQUID         Ligand-based Quantification of Interaction Distributions
LUDI           Let Us Design Inhibitors
MCS            Maximum Common Subgraph
MCSMD          Multi Copy Stochastic Molecular Dynamics Simulations
MCSS           Multi Copy Simultaneous Search
MD-simulation  Molecular Dynamics Simulation
MEP            2C-methyl-D-erythritol-4-phosphate
MIF            Molecular Interaction Field
MOE            Molecular Operating Environment
MSA            Multiple Sequence Alignment
MSCS           Multi Solvent Crystal Structures
NCE            New Chemical Entity
NIH            National Institutes of Health
NMA            Normal Mode Analysis
PAINS          Pan Assay Interference Compounds
PASS           Putative Active Site with Spheres
PCA            Principal Component Analysis
PDB            Protein Data Bank
PESD           Property-Encoded Shape Distributions
PLA            Pocket-Lining Atoms
PPP            Potential Pharmacophore Point
PSP            Protein-Solvent-Protein
RBVS           Receptor-Based Virtual Screening
REOS           Rapid Elimination Of Swill
ROC            Receiver Operating Characteristic
ROCS           Rapid Overlay of Chemical Structures
RoF            Rule of Five
SBVS           Structure-Based Virtual Screening
SIFt           Structural Interaction Fingerprint
SR-BI          Scavenger Receptor class B, type I
SVM            Support Vector Machine
TI             Thermodynamic Integration
TP             True Positive
TS             Tabu Search
UniProt        Universal Protein Resource
USR            Ultrafast Shape Recognition
vdW            Van der Waals
VS             Virtual Screening
WHO            World Health Organization


Summary

The concept of pharmacophores has a long-standing history in pharmaceutical research and remains one of the most widely applied abstractions of receptor-ligand interactions. In contrast to numerous other approaches successfully applied in computer-assisted drug design, pharmacophore modeling creates interpretable, illustrative representations of the results. Pharmacophore models are therefore popular among researchers of various disciplines, as critical discussions are not limited to computational experts, and refinements are often reasonable and intuitive. Recent developments in binding site detection, pharmacophore feature point definition and pharmacophoric descriptor calculation are mostly driven by computational efficiency or focused on target-specific performance. As a consequence, it becomes harder for non-computational users to understand, interpret or influence the results.

In this work, a novel modular receptor-based pharmacophore workflow is presented. It is designed to include as much information as possible and to apply the algorithms best suited to each task. Furthermore, uncertainties in the receptor atom positions due to crystallographic modeling errors or observed flexibility can be included, and uncertainty-corrected pharmacophore models can be generated.

The workflow is interactive; all individual steps are visualized in the same PyMOL session, and the user is asked to provide additional information during the calculations, while the given input is processed on the fly. Structured data files for the description of pockets and of potential pharmacophore points (PPPs) are presented to facilitate the interchange of the individual workflow segments. Potential binding pockets are calculated by software tools like PocketPicker or MDpocket and may be adjusted according to the chosen uncertainty measurement. The PPPs are generated based on geometric interaction rules evaluated on every pocket grid point, and the receptor flexibility can be reflected at this stage as well. After a semi-automated hotspot selection, the PPPs are transformed into a "virtual ligand"; the corresponding LIQUID descriptor is calculated and applied as the query for pharmacophore-based virtual screenings.


The workflow was retrospectively evaluated with the DUD-E dataset. A parameter optimization for the pharmacophore descriptor calculation was performed, revealing the difficulties of generalizing software parameters over a diverse set of proteins. In a proof-of-concept study, the applicability of structure-based pharmacophore searches to transient binding pockets is presented, taking HIV-1 protease as an example. A follow-up study on the HIV-1 protease was conducted, including the receptor flexibility for the pharmacophore feature definitions based on MD simulations. One of the two active substances found inhibits the protease with an IC50 of 80 µM. For a known allosteric pocket of isoprenoid synthase D (IspD), all three flexibility levels (none, crystallographic temperature factor and MD simulation) were applied. In addition, a transient binding pocket was detected, predicted to be ligandable and therefore chosen as a prospective test case.


Zusammenfassung

The concept of pharmacophores has long been applied successfully to drug discovery in the pharmaceutical sciences and remains one of the most widely used methods for abstracting receptor-ligand interactions. In contrast to many other, equally successful methods of computer-assisted drug design, pharmacophore models are easy to interpret and to visualize. These properties enable communication between scientists from a wide range of disciplines; the modifications discussed often appear intuitive and well motivated. Many of the currently developed computational methods for the prediction of binding pockets, the description of potential pharmacophore points and the calculation of descriptors are very efficient and tailored to specific problems. The focus on particular proteins or on complex algorithms makes it harder for users to understand the relationship between the chosen method and its result. The influence of the user on the result, and the interpretability of that result, decline.

In this work, a new modularly structured workflow is presented that is designed so that as much information as possible enters the calculation of the pharmacophore points. Furthermore, the user should be able to choose the algorithms for the individual steps. A new approach allows errors in the description of the receptor atom positions to be taken into account. These errors may be crystallographic modeling errors or flexibility observed in simulations, and they are used to generate error-corrected models.

All steps of the workflow are visualized with PyMOL. Adjustments made by the user in PyMOL and additionally supplied information are processed in real time and integrated into the subsequent calculations. Clearly structured file formats for the description of binding pockets and pharmacophore points are presented. On this basis, the individual steps of the workflow can be exchanged. In the software, binding pockets are extracted with PocketPicker or MDpocket and optionally adjusted using the error values calculated from crystallography or from the simulations. Geometry-based rules are evaluated between the protein and every pocket point, and potential pharmacophore points are generated at these positions when the thresholds are met. These points can likewise be corrected with the error values of the atom positions. This is followed by a semi-automated prioritization of the points, after which a "virtual ligand" of the binding pocket is calculated. This LIQUID descriptor is used for the subsequent pharmacophore-based similarity search for potential binders of the selected pocket.

The workflow was evaluated with the DUD-E dataset, and the descriptor parameters were optimized. This optimization highlights the diversity of the proteins and the associated problems of generalizing software parameters. In a preliminary study on HIV-1 protease, the applicability of pharmacophore-based virtual similarity searches to transient binding pockets was tested. The follow-up study shows how receptor flexibility derived from simulations can be successfully integrated into the search with the developed method. The IC50 of one of the two inhibiting substances is 80 µM. For a known allosteric binding site of isoprenoid synthase D (IspD), the three options of the workflow for integrating errors in the atom positions were tested (no error consideration, crystallographic error, MD flexibility). In addition, a transient pocket was detected, predicted to be capable of ligand binding, and pursued as a further prospective application.


1 Introduction

Computer-assisted drug design (CADD) is an interdisciplinary research field linking the natural and life sciences to complement purely experimental methodologies and to support the drug development process. Mathematical and statistical approaches are applied for the discovery of New Chemical Entities (NCEs) that have a desired therapeutic effect [1,2,3]. Technical advances in combinatorial chemistry, genomic science, and biological and biophysical assay technology allow for directed drug development, in contrast to earlier times when drug discovery was mainly driven by serendipity. The state-of-the-art method for detecting potential chemical starting points is automated biochemical High-Throughput Screening (HTS), whereby readily available compound libraries are tested [4].

Virtually every important biochemical process in living organisms is affected by proteins. Proteins are crucial for the catalysis of chemical reactions, intercellular signaling and molecular transport, and they serve as biochemical engines and structural elements, among other functions [5]. Of the four major macromolecular classes, namely proteins, polysaccharides, lipids, and nucleic acids, proteins are currently the most successfully targeted by drugs. The reasons lie in the diversity of protein functions and the associated therapeutic opportunities, whereas additional difficulties in ligand development have been reported for the remaining classes [6].

Historically, Berzelius and Mulder introduced the term “protein” in the 1830s. Around ninety years later, Sumner was the first to show the catalytic activity of urease and its ability to crystallize [7]. In 1953, the first complete protein sequence was published, followed by the first protein structure, that of myoglobin, at an atomic resolution by Kendrew in 1958 [8-10]. The number of newly solved protein structures per year is continuously increasing (Figure 1), which has led to increased efforts in receptor-based drug design projects (see Chapter 1.2.2) [11]. One of the largest compilations of structural data is the Protein Data Bank (PDB) [12], with 108,124 entries as of 20 April 2015 (Figure 1).

A recently observed decline in the efficiency of pharmaceutical R&D suggests that drug discovery by HTS has reached its limit and that new technologies are required [13]. Computational methods have been developed [3,14,15] that are based on knowledge of active entities and/or receptor information and often yield findings complementary to those of bench experiments. Some of the basic ideas are discussed in the following chapters.

Figure 1. Development of Protein Data Bank (PDB) entries over time. A) Deposited structures in the PDB, starting with 13 structures in 1976. As of 20 April 2015, 108,124 structures are indexed. B) Newly published structures per year. More than half of the structures in the PDB were published within the last seven years. Axes: year (1976 to 2014) and protein structures in the PDB (× 1000).

1.1 Virtual screening

Virtual screening (VS) is the computational counterpart to HTS and represents an alternative for hit and lead compound identification [3,16,17]. Although automated approaches have been developed to improve the reproducibility and reduce the time consumption of HTS campaigns, HTS suffers from the fact that chemical space is too vast to be covered, for financial, logistic, and compound availability reasons [18]. VS works on virtual representations of the molecules, so academic groups and small research organizations are also able to build up their own virtual screening libraries, perform VS campaigns and overcome some of the mentioned problems of HTS [19]. The first step in a VS run, called negative design, is to remove compounds with unwanted chemical structures or undesired properties from the screening library [20]. One approach to remove undesired reactive groups is the Rapid Elimination of Swill (REOS) [21] filter, whereas promiscuous binders or “frequent hitters” [22] can be detected with the Pan Assay Interference Compounds (PAINS) substructure filters [23]. Lipinski’s Rule-of-Five (RoF) [24], which increases the chance of oral bioavailability for drug-like compounds, and the Rule-of-Three [25] for lead-like fragments are widely applied property filters to tailor the screening library towards lead-like properties [26]. The RoF is often applied as a filter for drug-like molecules, but its general validity has been questioned [27]. It has been reported that many approved drugs do not pass the filter [28]. The RoF should thus be seen as a guideline, rather than a strict set of rules, for removing compounds with potentially poor oral bioavailability from the database. Adding target-specific information is the starting point for positive design [3, p. 163, Figure 4.9], and at that point virtual screening can be divided into two principal approaches:

1. Ligand-based virtual screening (LBVS), if at least one reference ligand is known.

2. Receptor-based virtual screening (RBVS), also referred to as "structure-based virtual screening", when structural target information is available.
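The Rule-of-Five property filter mentioned above, used as a guideline during negative design, might be sketched as follows. This is a minimal illustration and not part of the thesis workflow: the cut-offs are the commonly quoted Lipinski values (the text above does not list them), the properties are assumed to be precomputed by some other tool, and the names `Properties`, `rof_violations`, `passes_rof` and the default `max_violations=1` are hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Properties:
    """Precomputed properties of one compound (supplied by the caller)."""
    mol_weight: float   # molecular weight in Da
    log_p: float        # octanol/water partition coefficient
    h_donors: int       # number of hydrogen-bond donors
    h_acceptors: int    # number of hydrogen-bond acceptors


def rof_violations(p: Properties) -> int:
    """Count violated Rule-of-Five criteria (commonly quoted cut-offs)."""
    return sum([
        p.mol_weight > 500,
        p.log_p > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])


def passes_rof(p: Properties, max_violations: int = 1) -> bool:
    """Treat the rule as a guideline: tolerate up to max_violations (assumed default)."""
    return rof_violations(p) <= max_violations


# Illustrative property values only, not measured data.
small = Properties(mol_weight=180.2, log_p=1.2, h_donors=1, h_acceptors=4)
large = Properties(mol_weight=1202.6, log_p=3.0, h_donors=5, h_acceptors=12)
library = [c for c in (small, large) if passes_rof(c)]
```

Keeping the violation count separate from the pass/fail decision makes it easy to relax or tighten the filter, in line with the view that the RoF is a guideline rather than a strict rule set.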

1.1.1 Ligand-based virtual screening

LBVS is based on the Chemical Similarity Principle, introduced to the field of CADD by Johnson and Maggiora in 1990 [24]. They proposed that structurally similar compounds should have an increased probability of exhibiting similar properties. This concept is often transferred to biological activity, in the sense that small structural changes should most likely have only little influence on the biological activity. Several studies came to the conclusion that the Chemical Similarity Principle does not hold in every case [30,31]. The idea of so-called “activity islands” came up, describing a set of bioactive molecules with high similarity according to the chosen description [347]. Leaving an “activity island” and observing a drop in activity although the investigated compounds are chemically similar is referred to as an “activity cliff” [32]. This effect can sometimes be caused by adding or removing a single methyl group and is therefore known as the “magic methyl rule” [33, p. 116]. It would be illusory to believe in a perfect correlation between biological activity and structural similarity, but many examples support the argument that the correlation is high enough to enrich active molecules via substructure-based LBVS, even when individual compounds might be biologically inactive [3,32]. An additional dimension is added to the process by changing the chosen chemical representation of the molecules, which also changes the neighborhood relationships in chemical space [34]. Maggiora described this as “the lack of invariance of chemical space” [32]. The choice of a well-motivated, context-dependent molecular representation is crucial for a successful LBVS campaign [32].

Molecular descriptors

A molecular descriptor represents any chosen set of molecular properties. In 2000, Todeschini and Consonni defined descriptors as “the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.” [35]. Based on this definition, molecular descriptors can be classified into two main categories: (I) experimental measurements (such as melting point, logP, dipole moment and light absorbance), and (II) theoretical molecular descriptors, which mainly represent physicochemical properties derived from symbolic molecular representations. The Handbook of Molecular Descriptors contains an extensive overview of more than 2000 theoretical molecular descriptors [35]. As an alternative to the classification proposed by Todeschini and Consonni, descriptors may also be divided into three classes according to the dimensionality in which the chemical structure is analyzed (Table 1) [36]. Molecular weight and atom counts are one-dimensional (1D) descriptions of global molecular properties and can be derived from the molecular formula. The most populated group of descriptors for virtual screening contains the two-dimensional (2D) variants. They are based on the topological graph or the connectivity table and translate into various descriptor types, e.g. topological indices as single-valued descriptors [37-39], topological fingerprints encoding the presence and absence of features [40], or topological correlation descriptors in real-valued vector space [39,41]. Three-dimensional (3D) descriptors are calculated from molecular conformations and add a further dimension of information. One- and two-dimensional descriptors work with defined molecular representations, so the results for the same molecule should always be the same, whereas molecules can adopt multiple conformations that can result in greatly varying 3D descriptor values. The lowest-energy conformation is not necessarily the bioactive conformation [42,43]. This fact, combined with the poor consideration of the dynamic, time-dependent character of ligand-receptor interactions, can be seen as one reason for the observation that one- and two-dimensional descriptors can outperform 3D variants, although the ligand binding process is, in fact, three-dimensional [19,44,45]. One example of three-dimensional descriptors are pharmacophore descriptors, which reduce the structure of the molecules to potential protein-ligand interaction points [46]. The large variety of ideas and algorithms will be discussed in detail, together with molecular shape descriptors, in Chapter 1.3.

Table 1. Molecular descriptor categories (adapted from Refs. 36, 47)

Dimension | Type | Examples
One [1D] | Global | Molecular weight, dipole moment, atom and bond counts (e.g. number of hydrogen-bond donors/acceptors, number of rings, number of carbons, log P)
Two [2D] | Topological | Topological and connectivity indices, substructures (e.g. maximum common substructures), topological fingerprints (e.g. structural keys)
Three [3D] | Conformational | 3- or 4-point pharmacophores, molecular shape, 3D fingerprints
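As a small illustration of the 1D class, atom counts and molecular weight can be derived from nothing more than the molecular formula. The sketch below assumes a simple formula string without brackets or charges; the helper names `atom_counts` and `molecular_weight` and the short table of rounded standard atomic weights are illustrative assumptions, not taken from the thesis.

```python
import re

# Rounded standard atomic weights; only a few elements are listed for the example.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def atom_counts(formula: str) -> dict:
    """Count atoms per element in a flat formula such as 'C9H8O4'."""
    counts = {}
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula: str) -> float:
    """Sum the atomic weights of all atoms in the formula."""
    return sum(ATOMIC_WEIGHTS[s] * n for s, n in atom_counts(formula).items())


counts = atom_counts("C9H8O4")       # formula of acetylsalicylic acid
weight = molecular_weight("C9H8O4")
```

Because the result depends only on the formula, it is identical for every conformation of the molecule, which is the reproducibility property of 1D and 2D descriptors noted above.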

Molecular similarity searching

Applying the idea of the Chemical Similarity Principle to pairs of molecules represented in

terms of a numerical molecular descriptor requires a way to determine the similarity

between the molecules according to the chosen descriptor. Leach and Gillet summarized

several recent chemical similarity indices and metrics [48]. While the Tanimoto

coefficient (Eq. 1) is widely applied for binary fingerprint descriptors [16,49,50], for real-

valued descriptor vectors the Euclidean (Eq. 3) and Manhattan (Eq. 2) distances are typical

examples [44,46].


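\[ T_{A,B} = \frac{\sum_{i=1}^{n} x_{iA}\,x_{iB}}{\sum_{i=1}^{n} x_{iA}^{2} + \sum_{i=1}^{n} x_{iB}^{2} - \sum_{i=1}^{n} x_{iA}\,x_{iB}}, \tag{1} \]

\[ D_{A,B} = \sum_{i=1}^{n} \left| x_{iA} - x_{iB} \right|, \tag{2} \]

\[ D_{A,B} = \left( \sum_{i=1}^{n} \left( x_{iA} - x_{iB} \right)^{2} \right)^{0.5}, \tag{3} \]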

where A and B are molecules, x is a molecular descriptor matrix (binary fingerprints for

Eq. 1, continuous vectors for Eq. 2 and Eq. 3), n is the number of compared descriptor

features, and \(x_{iA}\) is the descriptor value for the ith feature describing molecule A. \(T_{A,B}\)

describes the similarity between molecule A and molecule B, with \(T \in [0,1]\) for binary

fingerprints and non-negative attribute descriptors. \(D_{A,B}\) is the sum over all absolute

descriptor feature value differences (Eq. 2) or the square root of the sum over all squared

feature value differences (Eq. 3) and is a non-negative real number, \(D_{A,B} \in \mathbb{R}_{0}^{+}\).
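The three measures of Eq. 1 to Eq. 3 can be illustrated with a minimal Python sketch. The function names and the two example fingerprints are hypothetical and chosen only for illustration; treating two all-zero vectors as identical is an assumption that the equations themselves do not cover.

```python
import math


def tanimoto(a, b):
    """Tanimoto coefficient (Eq. 1) for two equal-length, non-negative vectors."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    shared = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - shared
    # Assumption: two all-zero vectors are treated as identical.
    return 1.0 if denominator == 0 else shared / denominator


def manhattan(a, b):
    """Manhattan distance (Eq. 2): sum of absolute feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance (Eq. 3): square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 8-bit fingerprints of two molecules A and B.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 0, 1, 1, 0]

print(tanimoto(fp_a, fp_b))   # 3 shared bits / (4 + 4 - 3) = 0.6
print(manhattan(fp_a, fp_b))  # the fingerprints differ in 2 bits
print(euclidean(fp_a, fp_b))
```

For binary fingerprints the squared terms in the denominator reduce to simple bit counts, which is why the Tanimoto coefficient is often described as the number of shared bits divided by the number of bits set in either fingerprint.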

Applying similarity measurements during a virtual screening campaign will assign a value

describing the similarity to each molecule pair. Ranking the molecules according to this

score is the next logical step. One of the challenges in virtual screening is to distinguish

scores assigned to random molecules from those achieved by bioactive ones.

Comparing virtual screening results retrospectively is an active research area focusing on

different aspects concerning the molecule ranking [51,52]. Transferring the problem into

a binary classification problem (active, inactive) allows the application of numerous statistical

methods to compare the performance of various methods. The Receiver Operating

Characteristic (ROC) [53] plots the true positive (TP) rate against the false positive (FP)

rate of an approach trying to solve the binary classification problem. ROC plots are

analyzed calculating the area under the ROC curve (AUC). The integral will become one

when all bioactive molecules are found at the beginning of the list, while 0.5 indicates that

the approach’s discrimination equals a random classification. ROC curve analysis can be

challenged by the fact that in most projects only a small fraction of the compound

database has actually been tested experimentally [52]. As ROC is looking for the overall

enrichment of actives, it is not necessarily the case that bioactive molecules are enriched

in the early fraction of the ranked list. Additional evaluation systems have been


introduced to follow the idea of early enrichment [51,52]. The Enrichment factor (EF)

[51], for example, measures the enrichment of bioactives in a specified fraction of the

ranked list. Owing to its simplicity, the EF takes into account neither the exact position

of the active molecules within the chosen fraction nor the ratio between active and

inactive molecules [51]. The Boltzmann-Enhanced Discrimination of Receiver Operating

Characteristic (BEDROC) [52] metric (Eq. 4) overcomes the named weaknesses of the EF.

It complements the ROC AUC by adding an exponential weighting term dependent on the

rank of the actives.

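\[ \mathrm{BEDROC} = \mathrm{RIE} \times \frac{\frac{n_{+}}{n} \sinh\left(\frac{\alpha}{2}\right)}{\cosh\left(\frac{\alpha}{2}\right) - \cosh\left(\frac{\alpha}{2} - \alpha \frac{n_{+}}{n}\right)} + \frac{1}{1 - e^{\alpha \left(1 - \frac{n_{+}}{n}\right)}}, \tag{4} \]

where

\[ \mathrm{RIE} = \frac{\sum_{i=1}^{n_{+}} e^{-\alpha x_{i}}}{\frac{n_{+}}{n} \left( \frac{1 - e^{-\alpha}}{e^{\alpha / n} - 1} \right)}, \]

with \(n_{+}\) being the number of actives, \(n\) the total number of ranked molecules, \(x_{i}\) the relative rank of the ith active, and α

the introduced tuning parameter. Commonly the α value is set to 20, so that 80% of the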

final BEDROC score is based on the first 8% of the ranked list. Comparing different VS

methods using the BEDROC metric is always coupled with a justification of the choice of

the α value, which adds an additional degree of freedom to the equation.
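The retrospective metrics discussed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and toy rankings are hypothetical, the ranked list is assumed to be free of score ties, the ROC AUC is computed by counting correctly ordered active/inactive pairs, and the EF uses the common definition (hit rate in the top fraction divided by the hit rate in the whole list), which is an assumption because the text does not give a formula. The BEDROC function follows Eq. 4 with relative ranks x_i = r_i / n.

```python
import math


def roc_auc(ranked_labels):
    """ROC AUC of a ranked list (1 = active, 0 = inactive, best score first)."""
    n_act = sum(ranked_labels)
    n_inact = len(ranked_labels) - n_act
    if n_act == 0 or n_inact == 0:
        raise ValueError("need at least one active and one inactive molecule")
    inactives_seen = 0
    misordered_pairs = 0
    for label in ranked_labels:
        if label:
            # Every inactive ranked above this active is a misordered pair.
            misordered_pairs += inactives_seen
        else:
            inactives_seen += 1
    return 1.0 - misordered_pairs / (n_act * n_inact)


def enrichment_factor(ranked_labels, fraction):
    """Hit rate in the top fraction relative to the hit rate of the full list."""
    n = len(ranked_labels)
    top = max(1, int(round(fraction * n)))
    return (sum(ranked_labels[:top]) / top) / (sum(ranked_labels) / n)


def bedroc(ranked_labels, alpha=20.0):
    """BEDROC according to Eq. 4; alpha is the early-recognition parameter."""
    n = len(ranked_labels)
    ratio = sum(ranked_labels) / n  # n+ / n
    # Sum over the actives of exp(-alpha * x_i), with relative rank x_i = r_i / n.
    weighted = sum(math.exp(-alpha * (rank + 1) / n)
                   for rank, label in enumerate(ranked_labels) if label)
    rie = weighted / (ratio * (1 - math.exp(-alpha)) / (math.exp(alpha / n) - 1))
    scale = ratio * math.sinh(alpha / 2) / (
        math.cosh(alpha / 2) - math.cosh(alpha / 2 - alpha * ratio))
    return rie * scale + 1 / (1 - math.exp(alpha * (1 - ratio)))


# Hypothetical rankings of ten molecules, two of which are active.
perfect = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
mixed = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

for name, ranking in (("perfect", perfect), ("mixed", mixed)):
    print(name, roc_auc(ranking), enrichment_factor(ranking, 0.2), bedroc(ranking))
```

In this sketch a perfect ranking yields a BEDROC of one and a fully inverted ranking yields zero (up to floating-point error), which mirrors the normalization of the ROC AUC while weighting early ranks more strongly.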

LBVS can be further differentiated based on the required amount of reference or training

data. Similarity searching is already applicable with only one known reference. Similarity

fusion approaches also take a single reference compound, but combine results measured

when different similarity metrics [16,47,54] or alternative molecular descriptors are

applied.

Taking into account two or more reference compounds is also referred to as multi-

reference LBVS. Several studies have shown an improved virtual screening performance

of multi-reference campaigns over single-reference variants [16,49], while the

computational effort is not limited to descriptor calculations. Considering more than one

ligand raises the question whether all of them are binding to the same binding site. But

also the overlay of the structures for two- and three-dimensional descriptors is not

straightforward, and alignment methods will be discussed in chapter 1.3. Multi-reference

methods are not always simple average values of the descriptors over all references. Any


logical combination can be applied, e.g. looking for a common side-chain or even including

negative (inactive) molecules to avoid undesired properties (Figure 2) [49,55].

Descriptor | Value (mol A) | Value (mol B)

# Rings | 2 | 0

C=O group | yes | no

Molecular weight | 260 g/mol | 214 g/mol

Figure 2. Multi-reference LBVS. Several descriptors are calculated and can be combined into a query. These

two molecules do not bind to the same binding site [331].

Built on the multi-reference idea, machine-learning classification methods working on a

set of reference molecules can be applied. Several machine-learning algorithms are

known to address virtual screening needs, including nearest neighbor analysis, Support

Vector Machine (SVM), Gaussian Process (GP), naïve Bayes classifier, ensemble learning

and Artificial Neural Network (ANN) (Figure 3) [20,47,56]. Compared to multi-reference

approaches, those methods are capable of finding nonlinear relationships between

descriptor variables and the activity annotation in the training set. One of the main

properties of machine-learning algorithms is their potential to generalize and become

robust against noisy data [47]. Therefore, molecules within the training data that are

expected to increase the complexity of the model describing the already learned general trends

are discarded.
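The nearest neighbor analysis named above (see also Figure 3B) can be sketched as a simple majority vote. The following Python example is a minimal illustration under stated assumptions: the two-dimensional descriptor vectors, the labels and the function name are hypothetical, and the Euclidean distance is used as the distance measure.

```python
import math
from collections import Counter


def knn_classify(query, training_set, k):
    """Return the majority label among the k training molecules closest to the query.

    training_set is a list of (descriptor_vector, label) pairs.
    """
    nearest = sorted(training_set,
                     key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2D descriptor vectors with activity annotation.
training = [
    ((0.5, 0.0), "inactive"),
    ((0.0, 1.0), "active"),
    ((1.1, 0.0), "active"),
    ((0.0, -1.2), "active"),
    ((3.0, 3.0), "inactive"),
    ((-3.0, 2.0), "inactive"),
]
query = (0.0, 0.0)

for k in (1, 3, 4):
    print(k, knn_classify(query, training, k))
```

As in Figure 3B, the predicted class of the query depends on k: the single closest neighbor is inactive, whereas the majority of the three or four closest neighbors is active. Even values of k can produce tied votes, which this sketch does not resolve in any principled way.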


Figure 3. Classification with support vector machine (A) and k-Nearest Neighbors (B). Black dots are

referring to inactive, white dots to active molecules. A) Linear separation in input space is impossible.

Transformation to a feature space allows for linear separation in feature space. The hyperplane (blue line)

indicates the chosen linear separator, which maximizes the margin (dashed blue line). The dots (molecules)

defining the hyperplane are located on the blue dashed line and are called "support vectors". Linear

classification in feature space translates into a nonlinear separation curve in input space (adapted from Ref.

47). B) A k-Nearest Neighbor analysis for k = 1, 3, 4 leads to the classification of the red encircled dot as

"inactive" for k = 1, and as "active" when considering up to four neighbors.

1.1.2 Receptor-based virtual screening

RBVS is an umbrella term for various approaches following the same basic idea. They

share the idea of using structural target information to identify bioactive molecules.

Traditionally, receptor-based methods have not been as popular as ligand-based

techniques because of the low quality of three-dimensional structure data in terms of

resolution [57]. The interest in RBVS evolved with the increasing number of solved 3D

structures (Figure 1), brought about by projects like the Structural Genomics Consortium

(www.thesgc.org) and the Protein Structure Initiative [12]. The combination of

computational tools like sequence alignments followed up with homology modeling is one

of the alternatives, whenever NMR or X-ray crystallography data is unavailable. A broad


community has developed that conducts modeling competitions like the Critical Assessment of

protein Structure Prediction (CASP) on unpublished X-ray structures, to assess the

predictive power of modeling tools [58]. Generally, receptor-based methods can be divided

into two main directions:

1. Receptor-receptor comparison, where the similarity principle is applied on

different levels of abstraction (primary-, secondary-, tertiary- and quaternary-

structure).

2. Receptor-ligand comparison, where complementarity between a receptor and its

ligands is the main working hypothesis for the prediction of ligand binding.

1.1.2.1 Receptor-receptor comparison

Comparing the primary structure (amino acid sequence) by alignments is one of the key

elements in structural bioinformatics and has been applied to identify structural or

evolutionary similarity [59]. Also in biology the phrase “form follows function” holds true,

as for proteins the structural similarity is more conserved than the sequence identity [60].

Translated to CADD and using the assumption that similar binding motifs in proteins are

more likely to bind similar ligands, receptor-receptor comparison on binding site level

can help to rationalize side effects and aid in drug repurposing [61]. Local structural

similarity comparisons based on shape, volume, pharmacophores, and local roughness

are popular in structure-structure methods and are also partially applied in receptor-

ligand based methodologies described in the following chapters. Waldmann and

coworkers focus on the conservation of secondary structure elements among different

proteins. The number of potential secondary structure elements is limited and a binding

event observed in a structure is transferred to domains with similar arrangements [360].

A further notable research field in this area is the comparison, prediction, analysis, and

evaluation of protein-protein interactions. Those methods, working on the quaternary-

structure of proteins, elaborate on the differences between protein-ligand and protein-

protein binding by adjusting force field parameters for docking studies or predicting

protein-protein interface binding sites on proteins, which differ widely from small-molecule

binding sites (Figure 4) [62,63].


A   Seq A   D S E D K F M P P -
            | * |   | | |   |
    Seq B   D E E - K F M - P A

Figure 4. Examples of structure-structure comparisons. A) Primary structure comparison of two protein

sequences by sequence alignment. Matching amino acids are indicated by “|”, gaps are shown as “-”. Stars

mark amino acid pair positions with known point mutations. B) Protein-protein comparison on

binding site level. Similar pockets are supposed to bind similar ligands, in this case ATP (center). ATP

binding sites of PDB:4pla [341] (left) and PDB:4rrv [342] (right) were extracted with PocketPicker [138]. C)

Protein-protein interface formed between PDZK1 and SR-BI (PDB:3ngh) [343]. The protein-protein

interface is built by two beta-sheets (left) and forms a narrow pocket (right).
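The primary structure comparison of Figure 4A can be illustrated with a small global alignment score calculation. The sketch below uses the Needleman-Wunsch dynamic programming recursion, which is a technique chosen here for illustration and is not named in the text; the scoring values (match +1, mismatch -1, gap -1) and the function name are likewise assumptions. Only the optimal score is computed, not the alignment itself.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two sequences (Needleman-Wunsch recursion)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align both residues
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a
    return score[-1][-1]


# Ungapped sequences taken from Figure 4A.
print(global_alignment_score("DSEDKFMPP", "DEEKFMPA"))
```

Practical sequence comparisons typically replace the uniform match and mismatch values with an amino acid substitution matrix and use separate penalties for opening and extending gaps.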


1.1.2.2 Receptor-ligand comparison

Receptor-ligand complexes are involved in controlling virtually all biochemical processes

in living organisms. Numerous receptor classes with a variety of addressed functions are

described in literature, while the majority of complexes are protein-ligand interactions

[5]. Out of the approximately 30,000 genes in the human genome expressing proteins,

only a small fraction is considered to be involved in modifying diseases, whereas another

equally sized subset, around 10 – 14%, is known as the “druggable genome” [6]. The

intersection of these subsets is estimated to contain 600 to 1,500 drug targets for

pharmaceutical research (Figure 5).

Figure 5. Human drug targets. The number of potential drug targets (~ 600 – 1,500) can be estimated by the intersection

of the “druggable” genome subset (~ 3,000 genes) and the number of disease-linked genes (~ 3,000) within the human genome (~ 30,000 genes). (Adapted from Ref. 6)

Compared to the portion of 15% of the predicted druggable genome, G Protein-coupled

Receptors (GPCRs) are represented poorly in the PDB with only 144 structures so far. It is

estimated that a comprehensive coverage of UniProt targets by the PDB can be expected

in around 12 years [64,65]. This circumstance is attributable to the nature of GPCRs. GPCRs

are seven-transmembrane domain receptors, reacting to stimuli like hormones outside

of the cell by activating an intracellular signal transduction cascade. Structure

determination methods like X-ray crystallography struggle with membrane proteins: separating

the protein from the membrane without changing the conformation while preserving the


bioactivity is very challenging. Coupled with the high structural similarity of GPCR binding

sites, the risk of failing projects is rather high [66]. The stabilised receptor (StaR) [348]

technology by Heptares Therapeutics, a state-of-the-art method in protein engineering,

increases the thermostability of GPCRs with a small number of point mutations without

changing their biological activity. The modified proteins can be purified more easily than the

wildtype versions and enable the usage of structure-based drug design concepts.

A second class of proteins essential for life as known today are enzymes. Enzymes are

biomacromolecules catalyzing chemical reactions in biological systems by providing a

favorable reaction environment [5]. Supported by co-factors or co-enzymes, the reactants

are brought together in specific orientations, high-energy conformations are stabilized

or steric barriers are broken to speed up the reactions by a factor of 10^5 to 10^7. The

simplified representations of 1:1 binding stoichiometry between enzyme and substrate,

as well as enzyme and inhibitor, are chosen to demonstrate different inhibition modes. A

prominent example exceeding those representations can be found in hemoglobin [349].

\[ E + S \rightleftharpoons ES \rightleftharpoons EP \rightleftharpoons E + P, \tag{5} \]

where E is the enzyme, S the substrate and P the product. Enzyme and substrate can build

the enzyme-substrate complex. When the reaction takes place, the enzyme-product

complex is formed and in a final step the product is released from the enzyme, which stays

unchanged and is ready for a new cycle. Note that the enzyme is not changing the reaction

equilibrium, but is only changing the activation energy. Therefore the enzyme can also

invert the reaction and catalyze the reaction from the former product to the former

substrate. In biological systems, the product is often consumed or delocalized to direct

the protein activity.

Malfunctions in enzymatic pathways are often related to observed diseases. Modulating

the enzymatic activity therefore is an active research field in the pharmaceutical sciences.

Small-molecule binding to a protein (enzyme) proves the “ligandability” of the chosen

target, whereas a therapeutic effect is needed to confirm the “druggability” (Figure 6)

[67,68]. Molecules modulating the activity of an enzyme can either cause a loss of catalytic

activity, in which case they are referred to as inhibitors, or increase the catalytic activity

even further, in which case they are called activators (Figure 6).


Protein-ligand binding takes place at specific sites on the protein surface. The so-called

binding sites are specialized patches on the protein where the physical and chemical

properties of the ligand and receptor are complementary to each other. In 1894, Emil

Fischer introduced the lock and key model [69] proposing that ligand and binding site can

be compared with a lock and its fitting key. While having in mind the structural flexibility

of the receptor, Koshland [70] presented his idea of the “induced fit”, where the protein

changes its conformation when the ligand is brought more in line with the binding site.

Nowadays, the protein is seen as highly flexible and present in many conformations. The

ligand does not induce a conformational change, but selects the conformation for

binding out of the receptor conformational ensemble, which allows for the lowest energy

complex. Binding a ligand can be seen as a shift in the conformational equilibrium towards

the “ligandable” conformation [71]. A variety of computational methods to predict ligand

binding sites have been developed ever since and will be discussed in detail in chapter

1.2.

Apart from ligand binding site prediction and definition, the positioning of the protein

pockets in relation to the catalytic site is important. While the substrate has a binding site

on the protein, the activity modulating molecules can be separated into two classes.

Modulators binding the same pockets as the natural substrate are called orthosteric

binders and compete with the substrate for binding the receptor. Reviewing Equation 5 and

including the inhibitor I translates into Equation 6 and is called competitive inhibition:

\[ EI + S \rightleftharpoons \mathbf{E + S + I} \rightleftharpoons ES + I \rightleftharpoons EP + I \rightleftharpoons E + P + I, \tag{6} \]

where we start with free enzyme, substrate and inhibitor (bold). The inhibitor can form

an inhibitor-enzyme complex, so that the enzyme is blocked and the turnover number for

the enzyme decreases. As the inhibitor cannot bind the ES complex, increasing the

substrate concentration will reduce the inhibition effect. For covalent binding inhibitors

the maximal turnover rate is decreased over time, as the EI complex is stable.

In non-competitive inhibition the protein is able to bind the substrate and the inhibitor at

the same time and form the ESI-complex. This complex is less active but can release the

substrate or the inhibitor to form the EI- or ES-complex respectively. In non-competitive

inhibition, increasing the substrate concentration will not displace the inhibitor, which


indicates different binding positions for both molecules. This is possible due to different

binding modes in the orthosteric pocket, where both molecules bind simultaneously, but

more often a second, allosteric binding site is present. The third variant is called

uncompetitive inhibition; here, the inhibitor binding site is only present when the ES-

complex is already formed.
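The qualitative differences between the three inhibition modes can be made tangible with textbook steady-state rate laws. The Michaelis-Menten expressions used in the following Python sketch are not part of the text above and are introduced here as an assumption; the parameter values (vmax, km, ki) and the function name are hypothetical and chosen only for illustration.

```python
def reaction_rate(substrate, inhibitor=0.0, mode="none",
                  vmax=1.0, km=1.0, ki=1.0):
    """Steady-state reaction rate under textbook Michaelis-Menten rate laws.

    Assumed rate laws (not taken from the text):
      none:            vmax * S / (km + S)
      competitive:     vmax * S / (km * f + S)
      non-competitive: vmax * S / (f * (km + S))
      uncompetitive:   vmax * S / (km + f * S)
    with f = 1 + I / ki.
    """
    f = 1.0 + inhibitor / ki
    if mode == "none":
        return vmax * substrate / (km + substrate)
    if mode == "competitive":
        return vmax * substrate / (km * f + substrate)
    if mode == "non-competitive":
        return vmax * substrate / (f * (km + substrate))
    if mode == "uncompetitive":
        return vmax * substrate / (km + f * substrate)
    raise ValueError(f"unknown inhibition mode: {mode}")


# Rates at low and very high substrate concentration, inhibitor at I = ki.
for mode in ("none", "competitive", "non-competitive", "uncompetitive"):
    low = reaction_rate(1.0, inhibitor=1.0, mode=mode)
    high = reaction_rate(1e6, inhibitor=1.0, mode=mode)
    print(f"{mode:16s} low S: {low:.3f}  high S: {high:.3f}")
```

Under these assumed rate laws, a large excess of substrate restores almost the full rate for a competitive inhibitor, whereas the maximal rate stays reduced for non-competitive and uncompetitive inhibition, in line with the description above.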

Figure 6. Enzyme activity regulation mechanisms. The enzyme is shown in gray, the substrate in black, the

competitive inhibitor in blue, the orthosteric activator in green and the allosteric inhibitor in red. A) The

enzyme in apo-structure, defining the orthosteric and allosteric binding site. B) Competitive inhibition of

the enzyme. The substrate (black) and inhibitor (blue) are competing to bind the enzyme. C) The orthosteric

activator (green) allows the substrate to better fit into the binding site. D) Binding of an allosteric inhibitor

changes the active site conformation and the substrate is unable to bind.

Allosteric modulation was discovered when studying metabolic and transcriptional

pathways, where products of one reaction are blocking alternative routes or work as

feedback loops, avoiding the extensive usage of a resource by a single enzyme [72].

Allosteric modulation allows targeting proteins with highly conserved orthosteric

binding sites but is not necessarily coupled to binding site restructuring [73,355]. The


effect of the binding can be communicated over larger distances within the protein as

shown by Tsai and coworkers in 2009 [74]. The effect of allosteric modulation is

saturable, as the effect is maximized when all binding sites are occupied [355]. Local or

tissue dependent concentration differences of the substrate can be ignored.

For receptor-ligand comparison, medicinal chemists, pharmacists, bioinformaticians, and

physicists are pooling their knowledge to describe and understand the process of binding.

Energetically speaking, forming the receptor-ligand complex (Eq. 5, ES-complex) occurs

spontaneously when the Gibbs free energy G of the complex is lower than the energy of the

unbound components. Several ideas are implemented, tested and further refined to

discover ligands forming low energy complexes [3, p. 36]. The prediction of ligand binding

is difficult as the contributions of the individual properties to binding differ from target

to target [75]. This is already true when focusing on enthalpic contributions to the Gibbs

energy (Eq. 7). Enthalpic contributions to ligand binding include the formation and

disruption of [3, p. 38, 76]:

- hydrogen-bridges (also referred to as hydrogen-bonds)

- ionic and polar interactions

- arene-arene interactions (“aromatic” face to face, edge to face, π – stacking)

- halogen-bonds

- dispersive interactions (e.g. van der Waals interactions)

- metal coordination.

In contrast to enthalpy driven interactions, the entropic contributions to binding are

rarely included in virtual screening approaches. Entropic contributions can be described

as changes in the degrees of freedom of the whole system. Binding to the receptor “traps”

the molecule and the receptor in the bioactive conformation. This loss in entropy is

counter-productive for the aim of creating low energy complexes (ΔG increases when ΔS

is negative, Eq. 7), whereas, e.g., water displacement from hydrophobic surface areas

increases their degrees of freedom, which contributes positively to ligand binding [3, pp. 39-40].

While enthalpic contributions can be physically measured by Isothermal Titration

Calorimetry (ITC), computational approaches have difficulties with the calculation of

these values. Virtual systems tend to leave out solvent molecules to reduce computing

time and proteins are frequently approximated as rigid structures. Therefore, current


state-of-the-art methods are focusing on enthalpy-driven binding events. Entropic

penalties for reducing the flexibility are avoided by minimizing the number of rotatable

bonds (e.g. Lipinski’s Rule-of-Five) while utilizing scoring functions to predict potentially

displaceable water molecules in crystals (crystallographic waters) or Molecular Dynamics

simulations (MD-simulations) is a growing research area [77].

Prominent computational approaches for the calculation of Δ𝐺 are the “Free Energy

Perturbation” (FEP) and the “Thermodynamic Integration” (TI) [3 pp. 41-42]. Introduced

into drug design by Jorgensen and coworkers, the FEP splits up the receptor-ligand

complex formation, which is demonstrated in the BOMB software published in 2006

[350]. The sum over all energy differences calculated on force field potentials during the

simulation estimates the free energy.

Computational methods on receptor-ligand interactions are - although aiming at the same

results - diverse and motivated by different basic ideas. Ignoring the atom typing

completely, shape-based methods are only looking for structural complementarity. A

complementary shape to the binding site is favorable because it increases the number of potential

interaction points for enthalpic interactions while decreasing the chance of trapping

water molecules between receptor and ligand, which would be entropically unfavorable

because it reduces the degrees of freedom. Nevertheless, it is known that ligands occupy only around

one third of the binding site [78].

\[ \Delta G = \Delta H - T \Delta S, \tag{7} \]

where ΔG is the free energy change upon receptor-ligand binding dependent on the

change in enthalpy ΔH, temperature 𝑇 and the change in entropy ΔS.
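A short numerical sketch of Eq. 7 shows how the entropic term can offset or support an enthalpic gain. The values below are hypothetical and not taken from any measurement; units are assumed to be kJ/mol for ΔH, kJ/(mol·K) for ΔS and K for T.

```python
def gibbs_free_energy(delta_h, delta_s, temperature=298.15):
    """Free energy change according to Eq. 7: delta_G = delta_H - T * delta_S.

    Assumed units: delta_h in kJ/mol, delta_s in kJ/(mol*K), temperature in K.
    """
    return delta_h - temperature * delta_s


# Hypothetical binding event with a favorable enthalpy of -40 kJ/mol.
entropy_loss = gibbs_free_energy(-40.0, -0.05, temperature=300.0)
entropy_gain = gibbs_free_energy(-40.0, 0.02, temperature=300.0)

print(entropy_loss)  # about -25 kJ/mol: the entropy loss weakens binding
print(entropy_gain)  # about -46 kJ/mol: the entropy gain supports binding
```

With the same enthalpic contribution, the sign of ΔS alone shifts the resulting ΔG by more than 20 kJ/mol in this example, which illustrates why neglecting entropic contributions can distort affinity estimates.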

Combined approaches are calculating ligand-based shape overlaps and include receptor-

based exclusion spheres [79] to score the ligands. Calculating the molecule surface as a

function of the electrons surrounding the single atoms is computationally demanding [3,

p. 2]. However, simplified models representing the molecules as a set of spheres and

surrounding the atom centers with van-der-Waals-radii (vdW) have been introduced [80].

Derived from those spheres are molecular surface representations like the Lee-Richards

surface [81,82], which is defined by the center of the solvent probe rolling over the van-der-Waals

surface. The Connolly surface, however, is a smoothing of the original vdW

surface obtained by also applying a rolling sphere but taking the sphere surface instead of the

center.

While those simplistic models are performing well in terms of size filtering and even as

virtual screening ranking criteria [83], many tools are developed including calculations of

enthalpic contributions to boost the overall performance. Receptor-derived

pharmacophores, alignment methods and combinational approaches are described in

detail in chapter 1.3.

Molecular docking

Compared to high-throughput crystallography [84], where the HTS idea is adapted for

receptor-ligand complex structure determination, molecular docking is the

computational counterpart. In high-throughput crystallography the crystallographic

conditions for a protein are well known and co-crystallization or soaking experiments are

performed with a set of potential modulators. The idea to confirm the binding to the

protein by crystallography is turned around. Especially for the pharmaceutical industry

at some point a crystal structure is most welcome and therefore the idea came up to start

with this crucial step. Molecular docking is intended to predict the receptor-ligand

complex together with an estimate of the binding free energy [85], starting with free

ligand and protein. Those ambitious aims have not been reached so far, although more

than 30 scoring schemes integrated in over 60 docking suites have been reported until

2008 [86]. In “A Critical Assessment of Docking Programs and Scoring Functions” Warren

and coworkers critically reviewed docking performances. As docking is a computational

combination approach joining both the prediction of the potential binding-mode, also

referred to as ligand pose prediction, together with the prediction of binding potency by

a fitness function, the docking results can be assessed individually [87]. According to

Warren’s test, including 10 docking programs and 37 scoring functions, the correct

positioning of the ligand is detected successfully and mainly fails when the conformation

generator itself is unable to reproduce the active conformation. On the other hand, scoring

functions are unreliable and for the chosen targets there is no correlation to be found

between ligand affinities and docking scores [88,89]. Docking programs need to work


with numerous approximations to become computationally feasible, which decreases the

reliability, so that active input of information by the medicinal chemist is needed. The

assessment showed that docking is still to be improved and can hardly be generalized, as

there is no docking program outstandingly performing on all targets. Rescoring the poses

with additional scoring functions to obtain additional opinions on the binding potency has

been suggested to be beneficial for virtual screening [90,91].

The lowest energy conformation is not necessarily the ligand conformation found in

complexed protein structures [42,43]. A study by Perola and Charifson demonstrates that

over 60% of the tested ligands do not bind in a local minimum conformation. Strain energies

lower than 5 kcal/mol are found in around 60% of the ligands, while strain energies over

9 kcal/mol are found in at least 10% of all cases [42]. For this reason, docking programs

need to sample the conformational space while testing the binding potential using scoring

functions. In rigid docking approaches, ligand conformations are formed based on

rotamer libraries and the fit of shape, Potential Pharmacophore Points (PPPs) and

geometrical features with the potential binding site [92]. The computation times for

rotamer sampling and feature matching are low and allow the tools to be applicable in

virtual screening campaigns, but neglecting the protein's influence on the ligand

conformation prevents the algorithms from adapting to pocket characteristics [47]. Many

docking programs therefore include a ligand-conformation generator and calculate the

conformations on the fly. Anchor-based methods break down the molecules into fragments

to reduce conformational space and reduce the computational complexity. Only

fragments are flexibly docked into their preferable binding position and the molecule is

re-assembled by building up the molecule on the docked position, again applying rotamer

libraries [94].

Whenever exhaustive sampling is impossible, stochastic sampling methods are proposed

to sample the investigated space. One advantage of stochastic search strategies in docking

algorithms is that the scoring result can be fed back to the sampling engine on the fly to

influence succeeding generations of conformations. As the conformation sampling can be

interpreted as an optimization problem, maximizing the docking score while searching

the conformation space, Monte Carlo Sampling [95], Simulated Annealing [96] and genetic

algorithms [97] are prominent and frequently applied algorithms to solve this problem.

In Monte Carlo and Simulated Annealing algorithms, the structures are randomly

modified via bond rotation, translation or the application of mathematical functions to produce

valid conformations and the docking score is evaluated. Changes causing better scores are

accepted immediately, while negative changes are accepted according to specific criteria so that

the algorithm is able to escape local optima. Genetic algorithms are based on natural

selection strategies discovered in genetics (REFs). Molecules are encoded as

chromosomes: each pose is defined by values such as bond angles and translations, which are stored

in the chromosome as genes. Classical gene modifications like mutation, crossover and

migration are applied to the parent poses to create a new generation of conformations. A

top scoring subset is chosen as parents for the next round of modifications [98,99]. Tabu

search (TS) algorithms, as implemented in Pro_Leads_docking [100,102], work with

restrictions stored in a tabu list to avoid focusing on a small, localized search space.
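
The acceptance rule described for Monte Carlo and Simulated Annealing can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any docking program cited here: the one-dimensional variable, the function names, the geometric cooling schedule and the toy score landscape are all assumptions made for the example.

```python
import math
import random


def metropolis_accept(delta_score, temperature, rng):
    """Return True if a move changing the score by delta_score is accepted.

    Scores are maximized here, so delta_score > 0 is an improvement.
    """
    if delta_score >= 0:
        return True
    # Worse moves pass with probability exp(delta / T), which shrinks as T drops.
    return rng.random() < math.exp(delta_score / temperature)


def anneal(score, start, step=0.5, t_start=1.0, t_end=0.01, n_steps=2000, seed=0):
    """Maximize score(x) over a single torsion-like variable x (toy example)."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for k in range(n_steps):
        # Geometric cooling schedule from t_start down to t_end (an assumption).
        temperature = t_start * (t_end / t_start) ** (k / (n_steps - 1))
        candidate = current + rng.uniform(-step, step)
        candidate_score = score(candidate)
        if metropolis_accept(candidate_score - current_score, temperature, rng):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


def toy_score(x):
    # Hypothetical landscape: local optimum near x = -1, global optimum near x = 2.
    return math.exp(-(x + 1.0) ** 2) + 2.0 * math.exp(-(x - 2.0) ** 2)
```

Calling `anneal(toy_score, start=-1.0)` starts the search in the local optimum; accepting some score-lowering moves at high temperature is what gives the search a chance to leave it.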

Scoring functions are used for evaluating the predicted poses and are meant to distinguish

active from inactive molecules by assigning fitness scores. In theory, those functions are

expected to be able to rank the molecules according to their affinities [87] with an

approximation of the free binding energy. The scoring methods differ significantly in

speed and accuracy. While the difference in binding energy between individual ligands

can be calculated very precisely by methods like free energy perturbation [101],

scoring at VS scale needs to be much faster, and the scoring functions applied there

are therefore less accurate. Scoring functions can be classified (Table 2) into three main

classes:

1. Force fields, mathematical functions mimicking and describing the underlying physics.

2. Empirical scoring functions, extrapolating from known affinities.

3. Knowledge-based scoring functions, trained on observed atom pair distances.

Force-field-based Scoring Functions

Force-field-based scoring functions work with the non-bonded interaction

schemes of classical molecular mechanics force fields (e.g. AMBER, CHARMM) [151,263],

as the bonded terms are irrelevant for non-covalent interactions. For protein-ligand

interactions, the van der Waals interactions can be modeled with a Lennard-Jones

potential, while the electrostatic interactions are approximated by the Coulomb energy.

The sum over all considered interactions results in the total energy (Eq. 8) of the binding

event and can be translated into a docking score [102].

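\[
E = \sum_{i=1}^{\mathrm{receptor}} \sum_{j=1}^{\mathrm{ligand}} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + 332\,\frac{q_i q_j}{D\,r_{ij}} \right], \tag{8}
\]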

where 𝐴𝑖𝑗 and 𝐵𝑖𝑗 are the vdW repulsion and attraction parameters with Euclidean

distance 𝑟𝑖𝑗 for atoms 𝑖 and 𝑗. The Coulomb term is calculated by using point charges 𝑞𝑖

and 𝑞𝑗 with 𝐷 being the dielectric function, and the factor 332 is used to convert the

electrostatic energy into kilocalories per mole. Entropic contributions to binding are not

included in the binding energy calculation, which needs to be considered when interpreting

the results.
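
As a rough illustration, the double sum of Eq. 8 can be written out directly. The sketch below is not taken from any of the cited docking programs; the data layout, the parameter table `lj_params` and the constant dielectric default of 4.0 are assumptions chosen for the example, and real force fields supply their own atom types and parameters.

```python
import math


def force_field_energy(receptor, ligand, lj_params, dielectric=4.0):
    """Sum the non-bonded terms of Eq. 8 over all receptor-ligand atom pairs.

    receptor, ligand: lists of (atom_type, (x, y, z), partial_charge)
    lj_params: dict mapping (type_i, type_j) -> (A_ij, B_ij), hypothetical values
    dielectric: constant D (a distance-dependent function is also common)
    """
    energy = 0.0
    for type_i, pos_i, q_i in receptor:
        for type_j, pos_j, q_j in ligand:
            r = math.dist(pos_i, pos_j)  # Euclidean distance r_ij
            a_ij, b_ij = lj_params[(type_i, type_j)]
            lennard_jones = a_ij / r**12 - b_ij / r**6
            coulomb = 332.0 * q_i * q_j / (dielectric * r)
            energy += lennard_jones + coulomb
    return energy
```

For a single atom pair at 2 Å with A = 4096, B = 64 and opposite unit charges, the Lennard-Jones terms cancel and only the Coulomb contribution of 332 · (−1) / (4 · 2) = −41.5 kcal/mol remains.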

Empirical scoring functions

Empirical scoring functions make use of known protein-ligand complexes, where the

binding affinity is also known, to construct a form of master equation capable of predicting

the binding affinity of docked ligands. Regressions on general terms for polar interactions

(e.g. hydrogen bond acceptor/donor, salt bridges), apolar interactions in the form of

aromatic and lipophilic interactions, and entropy changes (change in degrees of freedom,

water displacement) are calculated and employed as weighting variables for the scoring

function. An example of an empirical scoring function is implemented in FlexX (Eq. 9)

[94], based on the function proposed by Böhm in his LUDI [103] de novo design approach

[102, pp. 196-197]:

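\[
\Delta G = \Delta G_{0} + \Delta G_{rot} \cdot N_{rot} + \Delta G_{hb} \sum f(\Delta R, \Delta\alpha) + \Delta G_{io} \sum f(\Delta R, \Delta\alpha) + \Delta G_{aro}\, f(\Delta R, \Delta\alpha) + \Delta G_{lipo}\, f^{*}(\Delta R), \tag{9}
\]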

where the Δ𝐺 coefficients are the variables fitted by linear regression. Δ𝐺𝑟𝑜𝑡 takes into

account the change in degrees of freedom with 𝑁𝑟𝑜𝑡 as number of rotatable bonds. The

penalty function 𝑓 handles deviations in distance ΔR and angle Δα from the ideal

interaction geometries for hydrogen bonds and salt bridges ( Δ𝐺ℎ𝑏 and Δ𝐺𝑖𝑜 ) as well as

aromatic interactions (Δ𝐺𝑎𝑟𝑜). The second function 𝑓∗ is responsible for penalizing close

lipophilic interactions.

All interaction terms shown add to the binding energy, as they are attractive interactions

derived from crystal structures of ligand complexes. To increase the predictive power, repulsive terms

for clashes or charges need to be added [100]. Furthermore, the master equation depends

on the training data set, and the function is therefore expected to perform well only on

proteins similar to the proteins within the training set [104].
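
The structure of Eq. 9 as a weighted sum of geometry-penalized interaction counts can be sketched as follows. This is an illustration only and not the FlexX or LUDI code: the piecewise-linear form of the penalty, its thresholds, the summation over aromatic interactions and the dictionary keys are assumptions, and the coefficients must come from a regression on real training data.

```python
def linear_penalty(deviation, tolerance, cutoff):
    """1.0 inside the tolerance, falling linearly to 0.0 at the cutoff."""
    deviation = abs(deviation)
    if deviation <= tolerance:
        return 1.0
    if deviation >= cutoff:
        return 0.0
    return 1.0 - (deviation - tolerance) / (cutoff - tolerance)


def geometry_factor(delta_r, delta_alpha):
    # f(dR, dalpha) as a product of a distance and an angle penalty.
    # The thresholds (in angstrom and degrees) are illustrative assumptions.
    return linear_penalty(delta_r, 0.2, 0.6) * linear_penalty(delta_alpha, 30.0, 80.0)


def empirical_score(coeff, n_rot, hbonds, ionic, aromatic, lipo_term):
    """Evaluate the weighted sum of Eq. 9.

    coeff: dict with keys "0", "rot", "hb", "io", "aro", "lipo" (fitted weights)
    hbonds, ionic, aromatic: lists of (delta_r, delta_alpha), one per interaction
    lipo_term: precomputed value of the lipophilic contact function f*
    """
    score = coeff["0"] + coeff["rot"] * n_rot
    score += coeff["hb"] * sum(geometry_factor(dr, da) for dr, da in hbonds)
    score += coeff["io"] * sum(geometry_factor(dr, da) for dr, da in ionic)
    score += coeff["aro"] * sum(geometry_factor(dr, da) for dr, da in aromatic)
    score += coeff["lipo"] * lipo_term
    return score
```

Because the score is linear in the coefficients, fitting them to measured affinities is an ordinary linear regression problem, which is also why the result depends so strongly on the composition of the training set.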

Knowledge-based scoring functions

The performance of empirical scoring functions on structures not represented in the training set

cannot be predicted. Moreover, interpreting the physical contributions to binding

on the explicit assumption of additivity [105] is questionable, especially when

considering entropic contributions. One way to avoid these problems is shown in the

development of knowledge-based scoring functions. The simple idea behind those

functions is that atom pairs frequently found at a certain distance should form overall

favorable interactions. The interaction free energy 𝐴(𝑟) of an atom pair with distance r is

dependent on its frequency and can be described applying the inverse Boltzmann relation

[1], which is shown in Equation 10 [102]:

\[
A(r) = -k_{B} T \ln \frac{p_{ij}^{\mathrm{observed}}(r)}{p_{ij}^{\mathrm{expected}}(r)}, \tag{10}
\]

where 𝑘𝐵 is the Boltzmann constant, 𝑇 is the absolute temperature, and the two 𝑝𝑖𝑗 terms are the

observed and expected frequencies for atom pair 𝑖𝑗 at distance r. The expected 𝑝𝑖𝑗 values are

calculated from occurrences in a protein-ligand database. The final score is derived by

summing over all observed interactions within a given cutoff radius.
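
A minimal sketch of the inverse Boltzmann relation and the final summation is given below. It assumes that the frequencies are stored per distance bin for one atom-pair type; the bin width, the cutoff, the temperature default and the neutral treatment of empty bins are illustrative choices rather than the settings of any scoring function listed in Table 2.

```python
import math

K_B = 0.0019872041  # Boltzmann constant per mole (gas constant) in kcal/(mol*K)


def pair_potential(observed, expected, temperature=298.15):
    """A(r) = -k_B * T * ln(p_observed / p_expected) for each distance bin."""
    potential = []
    for p_obs, p_exp in zip(observed, expected):
        if p_obs <= 0.0 or p_exp <= 0.0:
            potential.append(0.0)  # no statistics for this bin: treat as neutral
        else:
            potential.append(-K_B * temperature * math.log(p_obs / p_exp))
    return potential


def knowledge_based_score(distances, potential, bin_width=0.5, cutoff=6.0):
    """Sum the binned potential over all contact distances below the cutoff."""
    total = 0.0
    for r in distances:
        if r < cutoff:
            total += potential[int(r / bin_width)]
    return total
```

Distances that are observed more often than expected yield a negative (favorable) potential, and distances that are observed less often yield a positive one.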

While several successful docking studies have been consistently published over the last

decades [14,120], molecular docking remains a trade-off between accuracy and compute

time. The steady increase in computational power allows docking tools to become

applicable at the scale of virtual screening libraries. High-throughput docking [121] and

fragment-based high-throughput docking [122] have recently been applied to assist the drug discovery

process.

Table 2. Molecular docking scoring functions. (Adapted from Ref. 102)

Name          Scoring function  Year   Reference
AutoDock      Force field       1998   106
Dock          Force field       2001   107
Goldscore     Force field       1997   99
Chemscore     Empirical         1997   108
FlexX         Empirical         1996   94
Fresno        Empirical         1999   109
Glidescore    Empirical         2004   91
Hint          Empirical         2002   110
Ligscore      Empirical         2005   111
Ludi          Empirical         1994   112
PLP           Empirical         1995   113
Screenscore   Empirical         2001   114
X-Score       Empirical         2002   115
Bleep         Knowledge-based   1999   116
Drugscore     Knowledge-based   2000   117
Pmf           Knowledge-based   1999   118
SmoG          Knowledge-based   2002   119

1.2 Pocket detection and comparison

Following the “form follows function” concept, the ligand binding site of the protein can,

in many cases, already reveal some information about its function, as many binding sites

for the same ligand present similar features [123]. Therefore, the three-dimensional

structure of the protein and the active site in particular are of interest for receptor-based

drug discovery [124,125]. The large number of new small molecule leads derived by RBVS

today, together with the increasing number of solved macromolecular structures [126-128],

motivated the refinement and development of concepts for binding site detection

and description on various levels of abstraction [78]. In the early 1990s, pockets were

loosely defined as concavities with a shape fitting the ligand, while having complementary

physicochemical properties. A paradigm shift occurred when binding sites moved into the focus of

drug design projects [78,129,130], differentiating “druggable” from “ligandable” pockets

or focusing on binding site specific chemical subspaces for the comparison of conserved

pockets found in multiple protein classes [131,132]. Following this idea, pocket derived

focused libraries or scoring functions can help predict off-target binding or the function

of unknown binding sites [133]. A study by Weisel et al. in 2010 showed the limited

number of pocket topologies and revealed protein family specific pocket features

interesting for druggability prediction of subpockets [134]. More than 30 computational

approaches on binding site detection are described in the literature (Table 3) and can be

roughly divided into four basic classes:

1. Geometry-based approaches, focusing on the geometry of the molecular surface.

2. Energy-based approaches, where the interaction energies between the receptor and

small chemical probes are calculated to define favorable binding spots.

3. Evolutionary approaches, mostly working with multiple-sequence alignments to

detect conserved amino acids.

4. Template-based methods, comparing the receptor with known binding sites.

1.2.1 Geometry-based

Geometry-based methods make use of the fact that receptor-ligand interactions tend to

occur in concave regions of the protein surface. Those clefts are mathematically described

and ranked afterwards. Remarkably, drugs have been observed to often bind into the

largest surface pocket [135,136], so that the ranking of all detected clefts according to the

volume is sufficient for a successful prediction of the position of the active site in most of

the cases. Laskowski presented a grid-free approach called SURFNET [137] in 1995,

where spheres are placed between the vdW surfaces of all pairs of atoms and

downscaled until all clashes with additional atoms are avoided. Pockets are defined as

sets of spheres with radii greater than 1 Å (Figure 7).
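
The sphere placement for one atom pair can be sketched as follows. This is a simplified reading of the procedure described above, not the SURFNET source code: centering the sphere midway between the two vdW surfaces and the tuple-based atom representation are assumptions of the example.

```python
import math


def gap_sphere(atom_a, atom_b, other_atoms, min_radius=1.0):
    """Place a sphere between two atoms and shrink it until it is clash-free.

    Each atom is ((x, y, z), vdw_radius). Returns (center, radius), or None
    if the vdW surfaces overlap or the final radius is below min_radius.
    """
    (pos_a, rad_a), (pos_b, rad_b) = atom_a, atom_b
    distance = math.dist(pos_a, pos_b)
    gap = distance - rad_a - rad_b
    if gap <= 0.0:
        return None  # vdW surfaces touch or overlap, no room for a sphere
    # Centre the sphere midway between the two vdW surfaces (assumption).
    t = (rad_a + gap / 2.0) / distance
    center = tuple(a + t * (b - a) for a, b in zip(pos_a, pos_b))
    radius = gap / 2.0
    # Shrink the sphere until it no longer penetrates any additional atom.
    for pos, rad in other_atoms:
        radius = min(radius, math.dist(center, pos) - rad)
    return (center, radius) if radius >= min_radius else None
```

Running this function over all atom pairs and collecting the surviving spheres gives the raw material from which pockets are assembled.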

The CAST program [139,140] applies the idea of alpha-shapes for pocket detection. The

space between atoms is divided with a Voronoi diagram [141] and a Delaunay

triangulation [142]. The discrete flow method joins neighboring triangulated segments at

atomic level and describes the potential binding sites. The Putative Active Site with Spheres

(PASS) algorithm by Brady and Stouten [143] combines the description of the surface by

spheres with the necessity of increased buriedness (= number of receptor atoms within 8

Å) of those spheres in cleft regions (Figure 8). Additional layers of spheres are added on

the kept spheres of the initial step to determine active site points.

The second group of geometry-based methods embeds the receptor into a grid and

evaluates geometric functions for each grid point. Initially, the program POCKET [144]

was published. It searches for Protein-Solvent-Protein (PSP) events, clusters of points that

are solvent accessible and surrounded by the protein. The POCKET algorithm

scans only along the main axes of the grid and therefore fails on pockets rotated

45° relative to the coordinate system. Many derivatives of the POCKET algorithm have been

reported (for example LIGSITE, LIGSITEcsc) [145,146] to overcome these downsides and

adapt the PSP idea to Surface-Solvent-Surface events instead.

Figure 7 Geometry-based pocket identification. A) The SURFNET algorithm simplified for three atoms. The

center (blue dot) between two atoms is computed (left), a sphere (red) is built up on this center not

interfering with the van-der-Waals surface (middle) and then reduced in size until all clashes with

additional atoms disappear (right). The algorithm continues for all atom pairs and keeps all spheres

exceeding a cutoff value of 1 Å. B) Visualization of the PASS algorithm. In the first stage (left) virtual spheres

(black and white dots) are distributed over the protein surface. Only dots with enough neighboring protein

atoms (black dots) are kept. In the second stage (right) additional layers of spheres are added to fill up the

potential binding pocket. The central spheres are highlighted as red dots. C) POCKET and LIGSITE approach

for pocket identification. In POCKET, only the main axes were searched for PSP events (blue lines), twisted

pockets on orthogonal axes (red lines) are only detected by the LIGSITE algorithm. D) PocketPicker

depiction. Red points are far from the protein and receive a low buriedness value, blue points lie within the

protein and are also removed. The remaining black points will be clustered and describe the three detected

potential binding sites. (Adapted from Ref 138)

A different way to define the binding site via grid points was suggested by Weisel and his

tool PocketPicker in 2007 [138]. In PocketPicker, 30 approximately equidistant rays are

sent out from each grid point to scan the environment. Each ray has a defined length of 10

Å and a width of 0.9 Å, and the so-called “buriedness value” of the grid point is increased

by one for each ray that encounters at least one protein atom [138]. A buriedness value of 0

indicates that the grid point lies more than 10 Å away from any receptor atom, while a

value of 30 indicates the complete surrounding with protein atoms and therefore a fully

encapsulated binding site. An acceptance range for buriedness values is applied and the

remaining grid points are clustered and represent the receptor pocketome. As

PocketPicker is reported to be one of the best performing geometric algorithms [75], it

served as the standard pocket detection tool for this work.
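
The buriedness calculation can be illustrated with a short sketch. It is not the PocketPicker implementation: the Fibonacci-sphere construction of the 30 directions, the use of atom centres instead of atomic volumes, and the interpretation of the ray width as a cylinder diameter are assumptions made to keep the example small.

```python
import math


def fibonacci_directions(n=30):
    """Return n roughly evenly spread unit vectors (Fibonacci sphere)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    directions = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        ring = math.sqrt(1.0 - z * z)
        phi = golden_angle * k
        directions.append((ring * math.cos(phi), ring * math.sin(phi), z))
    return directions


def buriedness(grid_point, atoms, n_rays=30, ray_length=10.0, ray_width=0.9):
    """Count the rays from grid_point that encounter at least one atom centre."""
    count = 0
    for direction in fibonacci_directions(n_rays):
        for atom in atoms:
            rel = [a - g for a, g in zip(atom, grid_point)]
            along = sum(r * d for r, d in zip(rel, direction))
            if not 0.0 <= along <= ray_length:
                continue  # atom lies behind the origin or beyond the ray tip
            off_axis_sq = sum(r * r for r in rel) - along * along
            if off_axis_sq <= (ray_width / 2.0) ** 2:
                count += 1
                break  # one hit per ray is enough
    return count
```

Grid points would then be kept when their buriedness falls inside the chosen acceptance range and clustered into pockets in a separate step.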

1.2.2 Energy-based

Energy-based pocket detection methods assume that a binding site can be characterized

by energetic properties [147]. For example, the GRID program computes semi-empirical

interaction energies between the receptor and several chemical groups mimicking

chemical fragments of interest for drug design [148]. The output of a GRID calculation can

be translated into a Molecular Interaction Field (MIF) [149] and the information of

multiple MIFs can define a potential binding site together with some information about

its properties. Q-SiteFinder [150], for example, clusters the interaction energies of a methyl

probe (-CH3) to define favorable regions. Additional approaches have been published

based on the same idea but applying state-of-the-art force fields (AMBER, GROMOS)

[151,152] for the interaction energy calculation and more complex cluster algorithms.

AutoLigand [153] is one example that is applying the AutoDock [106] force field.

In the Multiple Solvent Crystal Structures (MSCS) method, the crystal is soaked with different

organic solvents to identify regions that bind specific solvents [154]. The

computational counterpart is reported by Vajda and Guarnieri [155]. An advantage of

these multiple probes lies in the preliminary characterization of the binding site [147].

1.2.3 Sequence-based

Sequence-based methods apply Multiple Sequence Alignments (MSAs) to identify

conserved residues [156]. Functional and catalytic residues should be more conserved, as

the loss of function due to mutations is unfavorable for the organism. While sequence-

based methods are applicable to receptors without known three-dimensional structure,

there are several known drawbacks. The conservation of residues is not always explained

by activity, but can also serve to maintain stability. There is no way to define

volume, shape or potential interaction points using sequence-based methods without

adding explicit structural information through structure determination or homology

modeling and applying template-based methods.
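
One simple way to quantify residue conservation in an MSA is a per-column entropy score, sketched below. The text does not prescribe a particular measure, so the normalized Shannon entropy, the gap handling and the function name are assumptions for illustration; the methods cited in Table 3 use their own, often phylogeny-aware, scores.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Return one conservation score in [0, 1] per alignment column.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    Score = 1 - H / log2(20), with H the Shannon entropy of the residues.
    """
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # gap-only column carries no information
            continue
        counts = Counter(residues)
        entropy = -sum(
            (n / len(residues)) * math.log2(n / len(residues))
            for n in counts.values()
        )
        scores.append(1.0 - entropy / math.log2(20))
    return scores
```

A fully conserved column scores 1.0, and the score drops as more residue types share the column. As noted above, a high score alone does not distinguish functional from structural conservation.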

1.2.4 Template-based

Template-based methods can be used to compare the investigated receptor with known

binding sites. Identifying 3D patterns of residue side chains [157,158] or comparing the

arrangements of residues [160] are two possible approaches for template-based pocket

detection. When the three dimensional structure of the receptor is unknown, methods

like FINDSITE [161,162] and 3DLigandSite [163] can be applied to build comparative

“homology” models. Those models are compared with similar PDB entries afterwards to

predict the protein function [161].

Table 3. Pocket detection algorithms. (Adapted from Refs. 147, 202)

Method type / Name              Year   Reference
Geometric
  CavitySearch                  1990   180
  POCKET                        1992   144
  method by Delaney             1992   181
  method by Del Carpio et al.   1993   182
  method by Xie and Bourne      2007   159
  VOIDOO                        1994   183
  SURFNET                       1995   137
  APROPOS                       1996   164
  LIGSITE                       1997   145
  CAST (CASTp)                  2006   166
  DOCK                          1982   92
  Surface patches               2000   184
  PASS                          2000   143
  LigandFit                     2003   185
  Screen                        2006   176
  TravelDepth                   2006   186
  PocketDepth                   2008   175
  PocketPicker                  2007   138
  VisGrid                       2008   187
  VICE                          2010   188
  Fpocket                       2009   170
  CAVER                         2006   167
  GHECOM                        2009   171
  McVol                         2010   173
  SplitPocket                   2009   179
Energy
  GRID                          1985   148
  method by Ruppert             1997   189
  vdW-FFT                       1998   190
  CS-Map                        2003   191
  DrugSite                      2004   192
  Q-SiteFinder                  2005   150
  PocketFinder                  2005   193
  Binding response              2007   165
  SITEHOUND                     2009   152
  ICM-PocketFinder              2005   172
  AutoLigand                    2008   153
Sequence- and Template-based
  method by Casari et al.       1995   194
  method by Rinaldis et al.     1998   195
  method by Aloy et al.         2001   196
  ConSurf                       2001   197
  Rate4Site                     2002   198
  Evolutionary trace method     1996   169
  PFIND                         2011   174
  TESS                          1997   160
  3DLigandSite                  2010   163
Combined methods
  LigSiteCSC                    2006   146
  SURFNETCSC                    2006   199
  SiteMap                       2007   177, 178
  ConCavity                     2009   168
  MetaPocket                    2009   200
  SiteIdentify                  2009   201
  FINDSITE                      2009   161, 162
  DoGSite                       2010   202

1.2.5 Characterization of a binding site

Loosely speaking, the binding site is simply an area on the protein surface where ligands

can bind. There is no definition of where the binding site starts or ends, so that its exact

delineation remains somewhat subjective [78]. Descriptions of the same binding area can

therefore vary in their amino acid composition, but also in size, depth, volume, and shape.

Protein-protein binding sites tend to be flat and unstructured [136], showing a dual

character according to structural stability [203]. Receptor active sites for ligand binding

are large and deep, while the actual shape differs a lot. Spherical cavities are found in

endonucleases, while the ribonuclease binding site forms an elongated groove [204].

Larger proteins tend to form multiple potential binding sites, while the maximum pocket

volume does not change necessarily [205]. The pocket is rarely completely occupied by

the bound ligand; more often, the binding site volume is about three times the ligand volume [130]. Reviews on

pocket detection tools state that geometrical complementarity is not sufficient to describe

receptor binding, although the predictive capabilities of these tools are proven [78,204].

1.2.6 Binding site comparison

Once the binding site is detected, different approaches can be applied to predict ligand

binding, ligand side effects, pocket druggability and pocket flexibility. Receptor-ligand

interactions are discussed in the receptor-based virtual screening section (1.1.2) as well

as in chapter 1.3 focusing on shape and potential pharmacophore points. The concept of

the “magic bullet” [206], a drug selectively targeting a single target, is weakened by

studies suggesting the presence of six to eleven targets per drug [207,354]. Therefore the

prediction of potential targets is heavily investigated. Ligand-based polypharmacology

methods have recently been applied to predict the side effects of new drug candidates and even

to detect additional targets for the repurposing of known drugs [75]. Structure-based

methods compare binding sites to detect similar pockets with potentially known

ligands to start VS campaigns with pocket derived chemical libraries [78]. Predicting

“druggability” helps prioritize targets in the pharmaceutical industry [75]. As around 60%

of all small molecule drug discovery projects fail because the chosen target is

not druggable [202,209] and the cost of establishing a new target in industry is rather high,

the prediction of “druggability” is an ongoing research area [192,202,210].

Binding site comparison is a difficult task and comparison programs aim to perform well

on several tasks [211]:

• Applicability to a broad spectrum of targets

• Focus on properties important for small ligand binding

• Reliability; few false positives and false negatives

• Computation time; the number of new PDB entries is high

• Inclusion of conformational flexibility

• Human-understandable output

The methods can be classified according to the binding site detection methods but can be

further differentiated by (I) the underlying mathematical model, (II) the algorithm for

model comparison and (III) the differences in protein abstraction levels. For the latter,

while one level of abstraction works on the binding site forming residues as single atoms

or “pseudo centers”, a second approach is less restrictive and projects properties onto the

surface. A third class models the pocket’s volume and calculates potential interaction

points to mimic the properties of the ligands (potential pharmacophore point approaches

are further discussed in chapter 1.3). The underlying mathematical model is either

alignment-based to retain as much three dimensional information as possible or vector-

based to stay computationally feasible for big data sets [131]. The program CavBase [212]

represents the most popular group of algorithms working with graph-theoretical

methods like the NP-complete Maximum Common Subgraph (MCS) or the related

Maximum Clique Detection problem. Approximations and heuristics for computationally

expensive methods are implemented to speed up the comparisons [211,213,214].

Geometric hashing algorithms are prominent alternatives to the graph-based methods

[215] and are heavily used in the receptor-ligand comparison methods described in

Chapter 1.3.
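
The graph-based comparison can be illustrated with a generic clique search on a correspondence graph of two pockets. The sketch below is not the CavBase algorithm: the representation of a pocket as typed pseudo-centers, the distance tolerance of 1 Å and the unoptimized Bron-Kerbosch search are assumptions chosen to show the principle on small inputs.

```python
import math
from itertools import combinations


def correspondence_graph(site_a, site_b, tolerance=1.0):
    """Build the product graph of two sets of typed pseudo-centers.

    site_a, site_b: lists of (feature_type, (x, y, z)).
    Nodes are index pairs (i, j) with matching feature type; two nodes are
    connected when the intra-site distances agree within the tolerance.
    """
    nodes = [
        (i, j)
        for i, (type_a, _) in enumerate(site_a)
        for j, (type_b, _) in enumerate(site_b)
        if type_a == type_b
    ]
    neighbors = {node: set() for node in nodes}
    for (i1, j1), (i2, j2) in combinations(nodes, 2):
        if i1 == i2 or j1 == j2:
            continue  # a pseudo-center may be matched only once
        d_a = math.dist(site_a[i1][1], site_a[i2][1])
        d_b = math.dist(site_b[j1][1], site_b[j2][1])
        if abs(d_a - d_b) <= tolerance:
            neighbors[(i1, j1)].add((i2, j2))
            neighbors[(i2, j2)].add((i1, j1))
    return neighbors


def largest_clique(neighbors):
    """Bron-Kerbosch search (no pivoting) for one maximum clique."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = list(clique)
            return
        for node in list(candidates):
            linked = neighbors[node]
            expand(clique + [node], candidates & linked, excluded & linked)
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand([], set(neighbors), set())
    return best
```

The size of the largest clique serves as a simple similarity measure, and the matched index pairs define a superposition of the two sites. The exponential worst-case running time of this exact search is the reason for the approximations and heuristics mentioned above.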

Alignment free methods employ various mathematical approaches to describe the

binding site. The calculation of more than 400 descriptors followed by Principal

Component Analysis (PCA) to narrow down the resulting vector showed the importance

of volume and shape for differentiating binding sites [216]. Several studies focus on the

detection of key descriptors for pocket comparison and are thoroughly reviewed in

literature [78,210]. Mathematics driven approaches approximate the shape and potential

interactions by sets of individual configurable basis functions. Methods like the “Binding

Balls” [217] based on the theory of spherical harmonics [218] or 3D Zernike descriptor

[219] approaches encode the distance between proteins as differences in the function coefficients

and have been reviewed recently [220].

1.2.7 Binding pocket flexibility

Receptor and ligand flexibility are crucial factors for the prediction of binding site

properties and the understanding of ligand binding. The original “lock and key” concept

holds true for some receptor-ligand complexes, while the later proposed “induced-fit”

theory was further generalized in terms of the conformational selection theory [221].

Most proteins are flexible structures, frequently undergoing side-chain rotations or even

structural rearrangements [204,222,223]. The ligand binds to the protein

conformation that results in the most complementary and energetically favorable protein

ligand complex. The ligand itself might not necessarily be present in the lowest energy

conformation and at the same time, the receptor conformational equilibrium defined by

the conformational energies is shifted towards the binding conformation. These

dependencies imply that the pocket shape and size change for different ligands and the

pocket cannot be analyzed independently from the ligand [224]. While it is valuable to

consider protein flexibility for known binding sites, analyzing the conformational changes

of the complete receptor can result in the detection of “transient” pockets. As many

enzymes are validated drug targets and undergo conformational changes for substrate

binding or catalytic activity [74], transient pockets might be detected during these

motions and allow for the development of allosteric inhibitors. The diverse

conformations can be derived experimentally by X-ray crystallography and NMR studies

or by using computational approaches like MD simulation, Normal Mode Analysis (NMA)

or graph-theoretical based methods [223,225,226]. Capturing the protein flexibility

computationally is discussed in several studies and is mostly done using a combination of

conformation generation by MD simulation tools, followed by a motivated selection of

snapshots according to chosen criteria [227,228]. The program F-DycoBlock by Zhu and

coworkers performs Multi Copy Stochastic Molecular Dynamics Simulations (MCSMD) on

fragments and protein conformations to detect favorable binding sites [229,230]. Pocket-

based methods like the EPOSBP [231] method run the PASS pocket detection method on a

series of snapshots and track the behavior of so-called Pocket-Lining Atoms (PLA). A

second approach is MDpocket [232], running the Fpocket algorithm [179] on a sequence

of pre-aligned conformations. The detected pockets for each snapshot are assigned to grid

points and the frequency of assigned pockets per grid point is translated into a density

grid. This makes it possible to search for grid points lying within pockets in at least x% of

the snapshots.
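
The frequency grid can be sketched as a simple counting step. The example assumes that pocket detection has already been run on each pre-aligned snapshot and that its result is available as a set of integer grid indices; the function names and this data layout are hypothetical and do not reflect the MDpocket file formats or interface.

```python
from collections import Counter


def pocket_frequency(snapshots):
    """Map each grid point to the fraction of snapshots in which it is in a pocket.

    snapshots: list of sets of grid indices (ix, iy, iz) flagged as pocket
    points in one pre-aligned conformation.
    """
    counts = Counter()
    for pocket_points in snapshots:
        counts.update(set(pocket_points))  # count each point once per snapshot
    return {point: n / len(snapshots) for point, n in counts.items()}


def persistent_points(frequency, min_fraction):
    """Grid points lying within a pocket in at least min_fraction of snapshots."""
    return {p for p, f in frequency.items() if f >= min_fraction}
```

A high threshold keeps only pockets that are present throughout the simulation, while a low threshold also reveals transient pockets that open in only a few snapshots.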

In a preliminary study, we demonstrated the capabilities of MD simulations to predict

potential allosteric binding sites when applying pocket cluster algorithms to detect

conserved pockets over a whole MD simulation and performed receptor-based

pharmacophore modeling to search for allosteric modulators of the HIV-1 protease [233].

Here, we extend our approach to longer MD simulations and apply a “ligandability”

prediction tool based on the local roughness of protein surfaces [234]. Instead of single

pocket conformations, an MDpocket analysis was performed to detect transient pockets

and a local pocket alignment computes atom position inaccuracies later included in the

potential pharmacophore point prediction.

1.3 Pharmacophores, Shape and Alignments

The concept of pharmacophores is meant as a generalization of the key interaction points

for receptor-ligand binding. The first definition of the pharmacophore was given by

Ehrlich (1909) as “… a molecular framework that carries (phoros) the essential features

responsible for a drug’s (pharmacon’s) biological activity”. As reported by Van Drie [235],

Kier refined the concept in several papers in the 1960s and 1970s [236-238]. A revised

definition was given by Gund in 1977 [239], describing the pharmacophore as “… a set of

structural features in a molecule that is recognized at a receptor site and is responsible for

that molecule’s biological activity”. Later on the International Union of Pure and Applied

Chemistry (IUPAC) published a glossary of terms for medicinal chemistry including an

entry for pharmacophore or pharmacophore pattern: “A pharmacophore is the ensemble

of steric and electronic features that is necessary to ensure the optimal supramolecular

interactions with a specific biological target structure and to trigger (or to block) its

biological response.” [3, p. 59,240]. Below is a summary of the important concepts [241,

p.3]:

• The pharmacophore is a set of points necessary for an optimal receptor-ligand interaction describing the essential steric and electronic, function-determining regions.

• The pharmacophore is an abstraction of common molecular interaction points important for receptor binding, not a real molecule or substructure.

• Pharmacophores are neither specific functional groups (e.g. ketone) nor parts of molecules (e.g. benzene ring).

Despite this clear definition, the term pharmacophore is consistently misused in

medicinal chemistry. In many cases a pharmacophore is understood synonymously to a

privileged structure, a molecular motif associated with more frequent high biological

activity than other motifs [242]. For computational chemists, a pharmacophore models

the key receptor-ligand interactions and can be applied to VS, scaffold

hopping or combined with additional pharmacophore-based approaches [241, p. 11].

While the early studies worked mainly on manually derived pharmacophores, assisted by

simple molecular graphics software, in the last decades the concept of pharmacophores

attracted several research groups and resulted in various sophisticated computer

programs [236]. Several reviews, book chapters and even complete books on

pharmacophore searches exist not only because of the success, but also because of the

human interpretability of the output, which is important for the interface between

chemists and computer scientists [236].

1.3.1 Molecular Alignments and Shape Comparison

The comparison of pharmacophores of different molecules is mostly done based on an

overlay of the structures. Such overlays or alignments are meant to produce a set of

plausible relative superimpositions, approximating the binding geometry [241, p. 18]. As

the bioactive conformation is not always the lowest energy conformation [42,43],

conformational flexibility is one of the main issues, which has to be solved during an

alignment. Rigid alignment methods ignore ligand flexibility, semi-flexible methods make

use of pre-calculated conformations, and flexible methods calculate the conformations on

the fly but are computationally more expensive. Furthermore, alignment methods can be

classified as point- or property-based [241, p. 19]. Point-based algorithms superimpose

pairs or sets of atoms or potential pharmacophoric features and store the applied

transformation as potential overlay of the structures. Many of the pocket comparison

methods working on MCS as described in chapter 1.2 are point-based methods.

Property-based (also known as field-based) methods are comparable to energy-based

pocket detection methods. In this case, the grid is generated by embedding the ligand and

various descriptors can be calculated for every grid point. The ligand is then represented

by a set of spheres or Gaussian functions representing the chosen descriptors. A set of

randomly or systematically sampled conformations is generated based on the considered

degree of freedom. Local structure optimizations are conducted to maximize the

molecular overlap. Gaussian representations allow for a grid-free and fast computation

due to the nature of Gaussian functions and therefore replace most of the sphere and grid

dependent methods [241, p. 19]. The calculation and consideration of molecular shape

has a long history in pharmaceutical research and has been reviewed recently [243,244].

One state-of-the-art software tool today is called Rapid Overlay of Chemical Structures

(ROCS) [83], based on atom-centered Gaussians to represent the volume. The Gaussian

functions for molecular shape description, introduced by Grant and Pickup in 1995, are

smoother than discrete “inside/outside” descriptions and therefore it becomes easier to

find the global optimum during shape comparison [243,245,246]. Numerous

computational methods have been developed to solve or avoid the alignment problem for

molecular shape comparison. The applied computational concepts are omnipresent in

computer-assisted drug design and will reappear in various chapters of this thesis. Internal

coordinate systems to avoid alignments are shown in programs like the Ultrafast Shape

Recognition (USR), while combinatorial methods are presented in the ROCScolor version,

taking into account atom types for interaction modeling or the ShaEP method, combining

molecular shape and electrostatic potential calculations [247-249]. Storing and

comparing molecular shapes with fingerprints [250] is presented as well as binding site

comparison with Property-Encoded Shape Distributions (PESD) [251] and shape-based

virtual screening campaigns [252]. Similar to docking ideas, divide-and-conquer

approaches break down the molecules in fragments to save computation time [253]. In

2007 Andrew Good published a paper, where he adjusted the DOCK docking program for

molecular shape matching [254].
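The Gaussian shape description discussed in this section can be illustrated with a minimal Python sketch. It is not the ROCS implementation: it assumes that every atom is a spherical Gaussian with one shared amplitude `P` and one shared radius `RADIUS` (both illustrative values), and it keeps only the pairwise, first-order overlap terms. All function names are hypothetical.

```python
import math

# Assumed illustrative parameters: each atom is the Gaussian
# rho(r) = P * exp(-ALPHA * |r - center|^2). ALPHA is chosen so that the
# Gaussian integrates to the volume of a hard sphere with radius RADIUS.
P = 2.7
RADIUS = 1.7  # Angstrom
ALPHA = math.pi * (3.0 * P / (4.0 * math.pi * RADIUS ** 3)) ** (2.0 / 3.0)


def pair_overlap(center_a, center_b, alpha=ALPHA, p=P):
    """Analytic overlap integral of two equal-width spherical Gaussians."""
    d2 = sum((a - b) ** 2 for a, b in zip(center_a, center_b))
    return p * p * math.exp(-0.5 * alpha * d2) * (math.pi / (2.0 * alpha)) ** 1.5


def shape_overlap(mol_a, mol_b):
    """First-order overlap volume: sum over all atom pairs of the two molecules."""
    return sum(pair_overlap(a, b) for a in mol_a for b in mol_b)


def shape_tanimoto(mol_a, mol_b):
    """Shape similarity between 0 and 1 for a fixed relative orientation."""
    v_ab = shape_overlap(mol_a, mol_b)
    return v_ab / (shape_overlap(mol_a, mol_a) + shape_overlap(mol_b, mol_b) - v_ab)


# Toy coordinates in Angstrom (not a real molecule).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]
similarity = shape_tanimoto(reference, shifted)
```

Because the overlap is a smooth function of the atomic coordinates, a local optimizer can follow its gradient with respect to rotation and translation, which is the practical advantage over discrete inside/outside descriptions.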

1.3.2 Pharmacophore Searches

Pharmacophore description, search and comparison have a long history in CADD.

Manifold programs on different levels, data requirements and complexity have been

developed for individual tasks. Potential Pharmacophore Points (PPPs), points or areas

with a given potential for interactions (e.g. hydrogen-bond acceptor/donor), can be

derived from the ligand’s topology in 2D or any three-dimensional representation or from

the receptor’s binding site perspective. Based on the available knowledge, different

methods come into play as described in the virtual screening chapter 1.1. In the following

part the concepts of Ligand-Based Pharmacophore Search (LBPS) and Receptor-Based

Pharmacophore Search (RBPS) will be discussed in detail.

1.3.2.1 Ligand-Based Pharmacophore Searching

In LBPS, a pharmacophore is an ensemble of physicochemical descriptors associated with

a biological target [57]. Pharmacophore points are positions of potential chemical

interactions between ligand and receptor. According to McGregor [255] the most common

descriptors are: hydrogen-bond acceptor (HBA), hydrogen-bond donor (HBD), positive

and negative charge, hydrophobic and aromatic interactions [57]. Additionally, excluded

volumes, electrostatic regions and steric constraints are consistently included. A review by

Wolber from 2008 shows the limitations of pharmacophore descriptions concerning

universality and specificity [256]. He notices the trend of generalizing pharmacophoric

features in current software packages compared to the early methods (e.g. the active

analog approach) [257], where features could contain any fragment or atom type. As the

concept of pharmacophores is accepted and applied broadly, computational efforts and

the feature abstraction level, representation and customization within different modeling

applications are shown to vary a lot. Even when working with the same reference

molecule, the pharmacophoric representation can differ and lead to diverse results in

virtual screening campaigns [256]. In most of the use-cases, a set of known actives is fed

into the software to receive a pharmacophore model that represents the hypothesized

characteristics of receptor-ligand binding. Virtual screening considers a molecule to be

potentially active, when there is a low-energy conformation which matches the

pharmacophore model. According to Wallach [57], LBVS consists of these four

fundamental components:

1. Conformation sampling of active ligands

2. Alignment of the known ligands

3. Matching of functional groups

4. Evaluation of the PPP quality

Several such algorithms have been developed [236,241] and a selection will be presented

in the following section. The Genetic Algorithm Superposition Program (GASP) applies a

genetic algorithm for pharmacophore identification [241,258] in a similar fashion

compared to the docking software GOLD [99] described in chapter 1.1.2. GASP uses

the reference containing the fewest PPPs as the base molecule. All additional reference

conformations are calculated on the fly, based on a single low energy conformation that

is modified using random rotations and translations. The applied fitness function can be

weighted individually by the user, while the core program does not allow any changes in

the pharmacophore description rules. The score is calculated based on mapping

features, the volume overlap and the ligand internal steric energy. Genetic operations like

crossover or mutations are applied on the chromosomes coding the conformational

information and the feature mapping of the references to the base structure, to search for

better alignments [259]. GASP is non-deterministic, so multiple runs and visual inspection

are recommended to find a suitable model [262]. The GALAHAD [261] software implements

a modified genetic algorithm to reduce the bias towards the chosen base molecule, allows

for partial matching and presents multiple results from each run.

Catalyst® [262] was released in 1992 and constitutes a pharmacophore modeling suite

which uses a CHARMM [263] force field-based conformation generator using Monte Carlo

sampling as described in chapter 1.1.2. Two algorithms for the pharmacophore search

were integrated. The first one, HipHop [264], starts with the smallest possible match of

two PPPs and builds the model up incrementally. The second one is HypoGen [265], which

incorporates binding affinities into the calculation, resulting in a model trying to explain

the binding mode and the binding affinities. The predicted affinity for screened ligands is

correlated with the number of satisfied pharmacophore points of the model [57].

Graph-based methods are implemented in the DISCO program [265], while Inbar

introduced geometric hashing on multiple flexible alignments for pharmacophore

detection in 2007 [57].

Another popular pharmacophore generation package called “Phase” was developed by

Schrödinger LLC [267, 268]. Conformations are generated within the LigPrep application

of the Maestro modeling environment [241,269]. A torsion search can be combined with

a Monte Carlo search. A set of minimized structures can be manually selected as the

reference dataset. The pharmacophore features are encoded as SMARTS [272], which

allows for manual modification. The pharmacophore generation works with a tree-based

partitioning algorithm. The number of features can be controlled during preparation as

well as changed afterwards. Furthermore, the user can add exclusion volumes to add

information received, for example from inactive molecules [241, pp. 32-33]. The scoring

is a user-weighted scoring function, considering the alignment RMSD, the penalties for

angle variations and the volume overlap of heavy atoms.

The Pharao program [270] presented by Silicos NV is similar to the ROCS color version

[247]. It applies Gaussians to represent the pharmacophores. The conformation

generation is performed by external programs like CORINA (Molecular Networks GmbH,

Germany) and OMEGA (OpenEye Scientific Software, USA). Pharmacophore features are

represented as Gaussian 3D volumes defined by coordinates µ and spread 𝜎. The features

are mapped and aligned, starting with an overlay of the geometric centers and the

principal axes. Rotation around the axes maximizes the volume overlap.

ALADDIN [271] was one of the first pharmacophore programs describing the features as

substructures in SMARTS [272]. In 1994, Greene and coworkers [273] presented their

work combining Boolean logic and substructures to define pharmacophore points. A

knowledge-based approach is featured in Superstar [274], where the Cambridge

Structural Database (CSD) is analyzed by the Isostar program to define

ligand-based pharmacophores.

Field-based methods represent a second class of pharmacophore descriptors.

Comparative Molecular Field Analysis (CoMFA) is one of the most prominent methods for

3D structure-activity relationship characterization and has been reviewed recently [241,

p. 36]. The idea of Molecular Interaction Fields (MIFs), in which probes are put on a grid to define

favorable interaction points, was introduced by Goodford in 1985 [148]. Descriptors like

GRIND or VolSurf are prominent examples for the implementation of MIFs [275,276].

Working on force-field-derived potential pharmacophore points, 3D descriptors based on

molecular field extrema are presented [277].
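The probe-on-a-grid idea behind MIFs can be sketched in a few lines of Python. This is a simplified illustration and not the GRID force field: the probe is a single sphere with a Lennard-Jones 12-6 term and a Coulomb term, and all parameter values (`epsilon`, `r_min`, `dielectric`) as well as the function names are assumptions chosen for the example.

```python
import math

COULOMB = 332.06  # kcal*Angstrom/(mol*e^2), converts charge products to kcal/mol


def probe_energy(point, atoms, probe_charge=0.0, epsilon=0.2, r_min=3.5,
                 dielectric=4.0):
    """Interaction energy of one probe position with all receptor atoms.

    atoms: list of ((x, y, z), partial_charge). The defaults are illustrative.
    """
    energy = 0.0
    for xyz, charge in atoms:
        r = max(math.dist(point, xyz), 0.5)  # clamp to avoid division by zero
        ratio6 = (r_min / r) ** 6
        energy += epsilon * (ratio6 * ratio6 - 2.0 * ratio6)  # Lennard-Jones
        energy += COULOMB * probe_charge * charge / (dielectric * r)
    return energy


def interaction_field(atoms, lower, upper, spacing=0.5, **probe):
    """Evaluate the probe on a regular cubic grid; returns {(x, y, z): energy}."""
    n = int(round((upper - lower) / spacing)) + 1
    axis = [lower + i * spacing for i in range(n)]
    return {(x, y, z): probe_energy((x, y, z), atoms, **probe)
            for x in axis for y in axis for z in axis}


def favorable_points(field, cutoff):
    """Grid points with energy at or below the cutoff, best first."""
    return sorted((energy, point) for point, energy in field.items()
                  if energy <= cutoff)
```

Grid points returned by `favorable_points` would be candidates for potential pharmacophore points of the probe's type; repeating the calculation with differently parameterized probes gives one field per interaction type.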

Vector-based pharmacophores

As the alignment of the molecules in pharmacophore search is a time consuming and

quality critical step, several alignment-free methods have been developed. In

pharmacophore fingerprints, the information about the presence of certain

pharmacophores is stored in a binary form, for example signaling the absence or presence

of a feature or the absence or presence of a feature pair at a certain distance. The main

focus for the latter is on two-, three- and four-point pharmacophore fingerprints,

describing a line, a triangle or a quadrangle, respectively. The calculation of nine-point

pharmacophores has been described in the literature [241,279], although storing

all possible combinations inflates the dimensionality of the descriptor vector and

thereby increases computation time. For two-point fingerprints, for example, the

distances between all possible combinations are binned and for feature pair {f1f2} the

distance bin {d12} is set to one (Figure 8). For triplets, three distances and for

quadrangles, six distances are required [50]. Additionally, the configuration has to be

considered when different feature types are combined.

Figure 8. Pharmacophore fingerprint. For two-point pharmacophores the distance d12 between two

features f1 and f2 is stored in a vector. For three-point pharmacophores all three distances have to be stored.

The vector enlarges for three-point fingerprints, as each possible combination of feature triplets needs to

be considered (e.g. acceptor, acceptor, acceptor or donor, donor, acceptor).
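A two-point pharmacophore fingerprint of the kind shown in Figure 8 can be sketched as follows. The feature types, the distance bins and the function names are assumptions made for this illustration and do not correspond to a specific published fingerprint.

```python
import math
from itertools import combinations, combinations_with_replacement

FEATURE_TYPES = ("A", "D", "H")               # acceptor, donor, hydrophobic
BIN_EDGES = (0.0, 2.0, 4.0, 6.0, 8.0, 10.0)   # Angstrom; hypothetical binning
PAIR_TYPES = list(combinations_with_replacement(sorted(FEATURE_TYPES), 2))
N_BINS = len(BIN_EDGES) - 1


def distance_bin(distance):
    """Index of the bin containing the distance, or None if out of range."""
    for index in range(N_BINS):
        if BIN_EDGES[index] <= distance < BIN_EDGES[index + 1]:
            return index
    return None


def two_point_fingerprint(features):
    """features: list of (type, (x, y, z)). Returns a list of 0/1 bits."""
    bits = [0] * (len(PAIR_TYPES) * N_BINS)
    for (type_1, xyz_1), (type_2, xyz_2) in combinations(features, 2):
        pair = tuple(sorted((type_1, type_2)))
        index = distance_bin(math.dist(xyz_1, xyz_2))
        if index is not None:
            # One block of N_BINS bits per feature pair type.
            bits[PAIR_TYPES.index(pair) * N_BINS + index] = 1
    return bits


example = [("A", (0.0, 0.0, 0.0)), ("D", (3.0, 0.0, 0.0)), ("H", (0.0, 7.0, 0.0))]
fingerprint = two_point_fingerprint(example)
```

With three feature types and five bins the vector has 6 x 5 = 30 bits. For triplets, every combination of three feature types has to be crossed with every combination of three distance bins, which illustrates why the vector length grows so quickly with the number of points.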

Geometric hashing is one idea to overcome the increasing length of three- or four-point

descriptors and is explained here using triplets as an example [279]. Each feature triplet is

stored in a hash table using hash keys. The hashing function encodes the feature distance

triplets into keys. Hash tables, or look-up tables, are optimized for checking the presence

or absence of queried hash keys. For pharmacophore-based searches, the hash keys of a

molecule can be searched in the hash tables of a database. Matching hash keys indicate a

common subset of pharmacophore points. An alignment of the triplets can be transferred

to the molecules and give a starting point for the molecular alignment (Figure 9). This

approach, combined with a shape overlap calculation, is implemented in the

program SHAFTS [280].

Figure 9. Geometric pharmacophore triplet hashing. In the first step the potential pharmacophore points

of the query molecule are determined. All possible pharmacophore triplets are calculated and then put into

a hash table. The pharmacophore triplet properties serve as hash keys. In the second phase the hash-table

is compared with a pre-calculated database (e.g. known drugs or screening databases). The resulting

molecules share pharmacophoric features with the query and an alignment based on the triplet overlay can

be performed.
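The two phases described in Figure 9 can be sketched with a plain Python dictionary acting as the hash table. The key construction below (sorted edges, each made of a feature type pair and a binned distance) is one simple, order-independent choice made for this illustration; it is not the key definition of SHAFTS or of any other published program, and `BIN_WIDTH` is a hypothetical setting.

```python
import math
from collections import defaultdict
from itertools import combinations

BIN_WIDTH = 1.0  # Angstrom; hypothetical distance resolution


def triplet_key(triplet):
    """Hash key of a feature triplet, independent of feature order and pose."""
    edges = []
    for (type_1, xyz_1), (type_2, xyz_2) in combinations(triplet, 2):
        distance_bin = int(math.dist(xyz_1, xyz_2) // BIN_WIDTH)
        edges.append((min(type_1, type_2), max(type_1, type_2), distance_bin))
    return tuple(sorted(edges))


def build_hash_table(database):
    """database: {name: [(type, xyz), ...]}. Returns {key: set of names}."""
    table = defaultdict(set)
    for name, features in database.items():
        for triplet in combinations(features, 3):
            table[triplet_key(triplet)].add(name)
    return table


def query(table, features):
    """Count, per database molecule, how many query triplets it shares."""
    hits = defaultdict(int)
    for triplet in combinations(features, 3):
        for name in table.get(triplet_key(triplet), ()):
            hits[name] += 1
    return dict(hits)
```

A matching key identifies three corresponding feature points in the query and in the database molecule, and superimposing these two triangles gives the starting transformation for the molecular alignment. Hard distance bins make the key sensitive to distances close to a bin boundary, so practical implementations usually tolerate neighboring bins.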

Real-valued vector approaches are a complementary option for molecular description.

Despite the idea of spectra, Correlation Vectors (CVs) have a long-standing history in

molecular modeling and pharmacophore description and a well-written historical

perspective can be found elsewhere [241, pp. 49-52]. The autocorrelation is a

quantitative measure of the probability to find objects or defined object properties within

a certain distance [281]. Cross-correlation vectors also consider the appearance of

different properties. In the descriptor Chemically Advanced Template Search (CATS)

[41,282] the cross-correlation vector for pharmacophoric features is calculated and

stored. In the 2D version, the topologic distance is applied, while also 3D versions utilizing

spatial atom distances or surface derived descriptions are presented.
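The principle of a topological cross-correlation vector can be sketched as follows. The sketch counts pairs of pharmacophore types per bond distance in the spirit of the 2D descriptor; the five type labels, the distance cutoff, the omission of any scaling, and the function names are assumptions of this illustration rather than the published CATS definition.

```python
from collections import deque
from itertools import combinations_with_replacement

TYPES = ("A", "D", "L", "N", "P")  # acceptor, donor, lipophilic, negative, positive
MAX_DISTANCE = 9                   # bonds; assumed cutoff
PAIRS = list(combinations_with_replacement(TYPES, 2))


def shortest_paths(n_atoms, bonds):
    """Breadth-first search from every atom; returns a list of {atom: distance}."""
    neighbors = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    all_distances = []
    for start in range(n_atoms):
        distance = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in distance:
                    distance[v] = distance[u] + 1
                    queue.append(v)
        all_distances.append(distance)
    return all_distances


def correlation_vector(atom_types, bonds):
    """atom_types: one set of type labels per atom (may be empty)."""
    distances = shortest_paths(len(atom_types), bonds)
    vector = [0] * (len(PAIRS) * (MAX_DISTANCE + 1))
    for i, types_i in enumerate(atom_types):
        for j in range(i, len(atom_types)):
            d = distances[i].get(j)
            if d is None or d > MAX_DISTANCE:
                continue
            for type_i in types_i:
                for type_j in atom_types[j]:
                    if i == j and type_i > type_j:
                        continue  # count each same-atom type pair only once
                    pair = (min(type_i, type_j), max(type_i, type_j))
                    vector[PAIRS.index(pair) * (MAX_DISTANCE + 1) + d] += 1
    return vector
```

Since only graph distances enter the vector, no conformer generation and no alignment is needed; two molecules are compared directly by a distance or similarity measure between their vectors.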

To overcome the limitations of hard spheres to represent pharmacophore points,

Schneider and coworkers presented the concept of “fuzzy” pharmacophores. One

downside of classical hard sphere models is the sensitivity to small conformational

differences. The bin of a feature pair can change upon a small rotation and lead to a

mismatch in following vector comparisons. The resulting software is called LIQUID [46]

(Ligand-based Quantification of Interaction Distributions) and describes the potential

pharmacophore points as a set of trivariate Gaussians encoded as correlation vectors.

Features of the same type occurring within a user-defined radius are clustered and the

trivariate Gaussian is calculated based on the principal components (Figure 10). Detailed

information about the algorithm and the history of the fuzzy pharmacophores can be

found elsewhere [3, pp. 70-73, 283].

Figure 10. LIQUID cross-correlation vector calculation. For a given molecule (C: orange, H: white, N: blue,

O: red, S: yellow) all potential pharmacophore points are calculated (hydroxyl groups can act as HBD and

HBA, so both PPPs are overlapping) and are represented as dots (Aromatic: yellow, HBA: red, HBD: blue).

In the next step potential pharmacophore points are clustered according to the chosen cluster radii (here 1 Å

for HBA and aromatic, 4 Å for HBD) and trivariate Gaussians are calculated. In the last step the

cross-correlation for all pharmacophore pairs is calculated and stored in a distance-binned vector. (A: acceptor,

D: donor, 1-5 and 5-10 represent two distance bins)

1.3.2.2 Receptor-based pharmacophore search

Structure-based pharmacophore modeling, in principle, works with a single

three-dimensional protein structure, experimentally derived or homology modeled. The number

of solved protein structures deposited in the PDB is steadily increasing. It is not only the

count, but also the increasing number of covered gene and protein families that is

beneficial for structure-based design methods [284]. In their review on protein-based

pharmacophore modeling, Sanders and coworkers outline the four main steps of SBPS

[11]:

1. Protein preparation in terms of protonation and conformation

2. Binding site detection (Chapter 1.2)

3. Pharmacophore feature definition and generation

4. Hot spot selection

Protein preparation

Protein structures taken from the PDB require some preparation in order to become

applicable for RBPS. In most of the cases, solved protein structures contain non-protein

groups like crystallographic water or other solvents, cofactors or ions. Most of the

structures are solved by crystallography, which usually does not resolve hydrogen atoms, so the protonation has to be done separately;

corresponding functions are provided in many software packages like MOE [285]. During

the protonation, alternative side-chain orientations as well as the present pH and

predicted pKa values are considered to optimize the hydrogen-bond network [11,286].

Additional problems arise from the historical growth of the databank. Not all entries fulfill

accepted standards; especially old entries show geometric problems and contain missing

bonds or atoms. LigandScout, for example, combines several published procedures to

repair existing errors [241, p. 132,287].

Another question concerning the protein preparation is the consideration of protein

flexibility. There are different ways to incorporate protein flexibility into structure-based

methods. In the first class, protein flexibility is based on a single protein conformation.

The rotation of side chains or rearrangements of the backbone are considered. Flexible

docking methods like FlexE [288] are prominent examples. An alignment of multiple

docking simulations with a flexible receptor to generate pharmacophores has been

reviewed by Sanders and coworkers [11]. The second class of algorithms works on

multiple protein conformations, generated by MD simulations, NMR, or taken from

different crystal structures. Clustering of ligand-based pharmacophores based on

snapshots of a receptor-ligand MD simulation is one way to apply simulations [289].

Performing MD simulations in presence of small molecules requires a parameterization

of the small molecule. Although the simulation is still computationally expensive and

needs to be performed for each small molecule individually, the simulations are often

conducted on compute clusters or supercomputers and the number of publications in this

field is increasing. MD-simulations on apo-structures can function as conformation

generators to describe the protein flexibility by a set of snapshots. Those simulations have

recently been applied in CADD research [233] and allow for the detection of alternative

binding site conformation or transient binding pockets. RBPS tools can either work on

individual structures and add some kind of flexibility by combining the VS results derived

by multiple pharmacophore models, or generate a combined pharmacophore model

trying to include the flexibility already in the model itself.

Pharmacophore feature definition

Pharmacophore feature definition in protein pockets can be achieved in multiple ways.

The definition of geometric rules based on observed protein-ligand interactions was

presented by Böhm for his software LUDI and extended by Klebe in 1996 [94,103].

Reconsideration of these rules as well as the addition of new findings on halogen bonds,

water molecules etc. was done by Stahl in 2010 [76]. As these methods produce feature

points in space similar to the atom-centered pharmacophore descriptions of ligands, the

comparison methods for ligand-based pharmacophore searches can be applied with a

structure-based pharmacophore query. Energy-based methods to generate MIFs are also

transferable to receptor binding sites as shown in several studies [210]. Additional ways

of pharmacophore description take into account both the receptor and the ligand. In

software packages like LigandScout or MOE, a co-crystallized ligand is needed, to define

the pharmacophore essential for the binding event (Figure 11). An overlay of multiple

receptor-ligand complexes can be created to focus on key interactions. The docking of

drug-like molecules or Multi Copy Simultaneous Search (MCSS) as applied in MUSIC [290]

and Schrödinger [268] can also reveal favorable binding spots and be translated into

pharmacophore points.

As the protein pocket describes the complete binding site, knowledge about the

inaccessible regions occupied by the receptor can be included in the pharmacophore

description. Exclusive volumes or exclusion spheres are generated and can be checked

during the pharmacophore matching process [291,292].
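Checking a candidate pose against exclusion spheres is a simple distance test, sketched below. The sphere radius, the selection cutoff around the binding site and the function names are hypothetical choices for this illustration; actual programs differ in how they place and size exclusion volumes.

```python
import math


def exclusion_spheres_from_receptor(receptor_atoms, site_center, cutoff=8.0,
                                    radius=1.5):
    """One exclusion sphere per receptor atom within `cutoff` of the site center."""
    return [(atom, radius) for atom in receptor_atoms
            if math.dist(atom, site_center) <= cutoff]


def violates_exclusion(ligand_atoms, spheres, tolerance=0.0):
    """True if any ligand atom lies inside any exclusion sphere."""
    for atom in ligand_atoms:
        for center, radius in spheres:
            if math.dist(atom, center) < radius - tolerance:
                return True
    return False


# Toy coordinates in Angstrom.
receptor = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
spheres = exclusion_spheres_from_receptor(receptor, site_center=(1.0, 0.0, 0.0))
pose_rejected = violates_exclusion([(1.0, 0.0, 0.0)], spheres)
```

A positive `tolerance` softens the test, which can be useful because the receptor is treated as rigid in this check.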

Figure 11. Interaction potential calculated with MOE for the herbicidal target IspD PDB:2ycm [340]. The

interaction potentials of all receptor atoms near the ligand are calculated with three probes (N: red dash,

OH2: blue dash, DRY: green dash). The receptor is shown as cartoon (blue: loop, red: 𝛼-helix, yellow:

𝛽-sheet) with additional lines for the amino acid side-chains. The stick model in the center represents the

ligand. Atoms are colored according to their atom type (O: red, C: gray, N: blue, Cl: green).

Hot Spot Selection

Small-molecule drugs tend to bind in binding pockets that are three-times their volume

[130]. Viewed from the receptor perspective, too many potential pharmacophore points

are generated filling up the complete pocket. As it is impossible for a ligand to meet all the

requirements presented by a receptor based pharmacophore, the reduction of PPPs to

promising interaction points is called "hot spot" prediction and is one of the main

challenges in structure-based design. Different approaches on interaction energy,

protein-ligand interaction information and amino acid sequence variations were

reviewed by Sanders et al. [11] and program examples will be presented here.

The FLAP (Fingerprints for Ligands And Proteins) software [293,294] applies a

combination of multiple MIFs calculated by GRID and condenses these into discrete hot

spots. Pocket v.2 [295] proceeds similarly. Probes are placed on a grid surrounding the

known ligand and only sets of high scoring features complementary to the ligand atoms

are kept as PPPs. The program HS-Pharm is based on machine learning algorithms trained

with fingerprints of known ligand-binding pockets to predict binding site atoms that are

important for ligand binding. In the Structural Interaction Fingerprint (SIFt) [296,297]

each residue is represented by a seven-bit descriptor [241, chapter 10] combined in a

fingerprint in sequence order. Profiling several SIFt descriptors for a set of known ligand

binding poses can help to detect conserved interactions.
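The profiling of interaction fingerprints over several poses can be sketched as follows. The seven bit names and their order are an assumption of this illustration and may differ from the definitions of a given SIFt implementation; the residue labels and function names are hypothetical as well.

```python
# Assumed meaning and order of the seven bits stored per residue.
BIT_NAMES = ("contact", "backbone", "sidechain", "polar", "hydrophobic",
             "hbond_acceptor", "hbond_donor")


def sift(pose, residues):
    """pose: {residue: set of interaction names}; residues in sequence order."""
    fingerprint = []
    for residue in residues:
        observed = pose.get(residue, set())
        fingerprint.extend(1 if name in observed else 0 for name in BIT_NAMES)
    return fingerprint


def conserved_bits(fingerprints):
    """Bits that are set in every fingerprint of the profile."""
    return [int(all(column)) for column in zip(*fingerprints)]


def tanimoto(a, b):
    """Similarity of two bit vectors of equal length."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


residues = ["LYS33", "LEU83", "ASP86"]
pose_a = {"ASP86": {"contact", "sidechain", "polar", "hbond_acceptor"},
          "LEU83": {"contact", "backbone", "hydrophobic"}}
pose_b = {"ASP86": {"contact", "sidechain", "polar"},
          "LYS33": {"contact", "sidechain", "polar", "hbond_donor"}}
profile = conserved_bits([sift(pose_a, residues), sift(pose_b, residues)])
```

Because all fingerprints share the same residue order, the conserved bits can be mapped back to residues directly by integer division of the bit index by seven.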

The concept of pharmacophores is not restricted to VS campaigns, but also allows for

ligand binding mode prediction or binding site comparison. The combinatorial potential

of binding site detection, pharmacophore search and comparison leads to a steady release

of new SBPS tools. As the number of false positive VS hits can be decreased by a factor of

two to five [11,298] by adding shape information, the combination of shape and

pharmacophore search is presented in Shape4 [299]. To tackle the poor crystallographic

success on GPCRs, Snooker [66] was published in 2012 and presents a SBPS platform

based on homology models of class A GPCRs. A combinatorial approach to speed up

binding site comparison without losing the interpretability of alignment methods is

presented by Desaphy and coworkers [64]. The program KRIPO (Key Representations of

Interactions in Pockets) [50] applies a pharmacophore fingerprint based approach to

detect similarities in protein subpockets. The concept of a permissive three point

pharmacophore fingerprint showed the best results [50]. In 2011, Löwer et al. presented

the concept of fuzzy pharmacophores, adapted from the ligand-based approach LIQUID

[300]. Herein the potential binding site is calculated with PocketPicker and geometry-

based pharmacophore rules are evaluated on each grid point. Each valid evaluation is

translated into a PPP centered on the according pocket grid point and the feature is

complementary to match the receptor offered interactions. The resulting set of PPPs is

called the “Virtual Ligand” and the LIQUID correlation vector can be calculated.

Several publications state that ligand- and receptor-based methods complement each

other, and that it can be beneficial to combine both concepts to have a broader view on

the problem [301-303]. While RBVS retrospectively is often shown to be less accurate

according to enrichments, it also comes with a set of advantages [11,304]:

(I) the identification of novel scaffolds is less biased towards existing chemotypes, so the

“scaffold hopping” potential is increased.

(II) RBPS is capable of elucidating protein-ligand binding mode hypotheses in the

presence of the three-dimensional structure [11,305] which enables structure-based

ligand optimization.

(III) RBPS leads to a better understanding of ligand-binding sites. This can be advantageous

in finding ligands for orphan receptors, predicting cross-pharmacology and repurposing

for existing drugs [11].

1.4 X-ray crystallography

X-ray crystallography is the most commonly used method to determine the 3D structure

of proteins [263]. Around 89% of the 108,263 entries in the PDB (Figure 1) are solved by

X-ray crystallography. The main steps in macromolecular structure determination by X-

ray crystallography are the following:

Purify the protein, grow and screen crystals

Examine the diffraction pattern (space group, resolution)

Measure the intensities and solve the phase problem

Build a model in the electron density and refine it

In spectroscopic methods the wavelength 𝜆 of the irradiated beam equals the resolution

obtainable with the method. Visible light is therefore not suitable for obtaining images at

atomic resolution. X-rays, however, are photons with wavelengths in the needed range of

0.1 Å – 100 Å [362]. In contrast to visible light, X-rays cannot be focused by a lens. X-rays

are electromagnetic waves that are scattered by their interactions with charged particles

in the examined material. The X-rays cause an oscillation of the charged particles,

re-emitting the absorbed radiation as a point source. The diffraction pattern of interfering

re-emitted beams is recorded in the diffraction space. The angles at which those beams are

diffracted from parallel planes in a crystal are calculated by Bragg’s law (Eq. 11) [362]

and the intensity of the reflection depends on the electron density of the examined planes.

$$2\,d_{hkl} \sin\theta = n\lambda, \tag{11}$$

where $n$ is an integer, $d_{hkl}$ the distance between two parallel planes with indices $hkl$,

$\theta$ the angle between the incident beam and the planes, and $\lambda$ the wavelength.

The most prominent sets of planes are determined by the faces of the unit cell and are

numbered by Miller indices $hkl$. Bragg’s model shows where to look for the diffractions,

while the Fourier-sum model describes the atoms in the unit cell and helps to determine

the molecular structure. The computed Fourier transform therefore simulates the lens

and produces an image of the crystallized molecules [362]. The Fourier-sum description

of the reflections can be converted into a Fourier-sum description of the electron

density: a reflection is described by a structure-factor equation with one term per

atom, and the electron density is described by a number of structure factors (Eq. 12, 13)

[362].

$$f_{hkl} = f_j\, e^{2\pi i (h x_j + k y_j + l z_j)}, \tag{12}$$

where $f_j$ is called the scattering factor of atom $j$, a function that treats atoms as

spheres of electron density (Friedel’s law) [362]. The exponential term represents a 3D

periodic function with $x_j$, $y_j$ and $z_j$ as coordinates of atom $j$ in real space, described as

fractions of the unit cell axes, and $h$, $k$ and $l$ as the frequencies along these axes.

Summing this term over all atoms $j$ of the unit cell gives the structure factor $F_{hkl}$.
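Bragg’s law (Eq. 11) and the per-atom term of Eq. 12 can be checked numerically with a short Python sketch. The function names are hypothetical, the scattering factors are treated as constants (ignoring their dependence on the scattering angle), and the two-atom body-centered cell is a toy example, not data from this work.

```python
import cmath
import math


def bragg_angle(d_hkl, wavelength, n=1):
    """Solve 2 d sin(theta) = n lambda for theta in degrees (None if impossible)."""
    sine = n * wavelength / (2.0 * d_hkl)
    if sine > 1.0:
        return None  # plane spacing below n * lambda / 2 gives no reflection
    return math.degrees(math.asin(sine))


def structure_factor(hkl, atoms):
    """Sum of the terms f_j * exp(2 pi i (h x_j + k y_j + l z_j)) over all atoms.

    atoms: list of (f_j, (x_j, y_j, z_j)) with fractional coordinates.
    """
    h, k, l = hkl
    return sum(f * cmath.exp(2j * math.pi * (h * x + k * y + l * z))
               for f, (x, y, z) in atoms)


CU_K_ALPHA = 1.5418  # Angstrom, a common laboratory X-ray wavelength
theta = bragg_angle(2.0, CU_K_ALPHA)  # planes 2 Angstrom apart, about 22.7 degrees

# Two identical atoms at the corner and the center of the unit cell.
body_centered = [(6.0, (0.0, 0.0, 0.0)), (6.0, (0.5, 0.5, 0.5))]
f_100 = structure_factor((1, 0, 0), body_centered)
f_110 = structure_factor((1, 1, 0), body_centered)
```

In the toy cell the two contributions cancel for the reflection 100 and add up for 110, which shows how the positions of the atoms in the unit cell are encoded in the intensities of the reflections.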

$$p(x, y, z) = \frac{1}{V} \sum_h \sum_k \sum_l F_{hkl}\, e^{-2\pi i (hx + ky + lz)}, \tag{13}$$

where $V$ is the volume of the unit cell, $p$ the electron density function and $F_{hkl}$ the structure factor of reflection $hkl$.

The isotropic temperature factor (Eq. 14) represents static disorder in the structure, errors from the structure determination and the displacement of the scattering center over time. For very high resolution data the factor can be treated as anisotropic, and the relative motion of each atom can be calculated in three orthogonal directions [362].

$$B = 8\pi^2 \langle u^2 \rangle, \tag{14}$$

where $B$ is the crystallographic temperature factor in units of Å² and $\langle u^2 \rangle$ is the mean square displacement of the scattering center, averaged over time.

In order to compute the structure factors and thereby solve the structure, all three parameters of the wave function are required. The frequencies $h$, $k$ and $l$ are the Miller indices of the set of parallel planes, and the amplitude is proportional to the square root of the measured intensity, but the phase is unknown. Solving the so-called “phase problem” by state-of-the-art methods such as isomorphous replacement, anomalous dispersion or molecular replacement is described elsewhere [362].
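To give a feeling for the magnitudes involved, Eq. 14 can be rearranged for the root-mean-square displacement. The value $B = 20$ Å² used here is an illustrative assumption and is not taken from a structure discussed in this thesis:

```latex
\sqrt{\langle u^2 \rangle} = \sqrt{\frac{B}{8\pi^2}}
  = \sqrt{\frac{20\ \text{\AA}^2}{78.96}}
  \approx 0.50\ \text{\AA}
```

A temperature factor of this size therefore corresponds to a positional uncertainty of about half an ångström.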


1.5 Goals of this thesis

Structure-based pharmacophore modeling is an attractive field for computer-assisted drug design. The increasing amount of publicly available structural data and current developments in pocket identification as well as pharmacophore description encourage researchers to apply RBPS in drug discovery projects. Nevertheless, there is no “holy grail” method for RBPS that solves all given tasks best. Many published studies and programs were optimized for a single target, ignoring the generalizability necessary for successful application to other targets. A crystal structure itself is already a model of the measured electron density and can vary in the precision of the atom positions. Many computational methods are complex, and end-users only see a black box without any chance to supply additional target information to increase the prospects of success. To overcome some of these known downsides, the goals of this thesis were the following:

• The development of a structured receptor-based pharmacophore search tool that allows interchanging modules for binding pocket detection, pharmacophore definition and descriptor calculation, enabling the user to apply the algorithm combination of his/her choice.

• The incorporation of measures of spatial structure errors or receptor flexibility into pocket and pharmacophore definition to obtain pharmacophore models which address the reality of noisy data.

• The implementation of active refinement of individual interim results to increase usability and attract more users to build high-quality pharmacophore models.

• The development of a structured workflow for transient and allosteric pocket detection and targeting.


2 Materials and methods

2.1 Software

2.1.1 Java

The programming language Java™, developed by James Gosling at Sun Microsystems (acquired by Oracle Corporation in 2010), is a class-based, object-oriented language with high cross-platform compatibility. The following design goals for Java™ were formulated (http://www.oracle.com/technetwork/java/intro-141325.html):

• Simple, object-oriented and familiar

• Robust and secure

• Architecture-neutral and portable

• High performance

• Interpreted, threaded and dynamic

The popularity of programming languages is measured in various ways (TIOBE index, RedMonk index, PYPL), as there is no scientifically established method. Although Java is losing its ascendancy, it remains one of the most popular programming languages. One advantage of Java is its class-based structure and the resulting reusability of code segments across multiple projects. Many code fragments in the CADD lab at ETH are written in Java to conserve the programmed functionality and hand it over to the next generation of PhD students. All Java implementations were run under the Java SE Runtime Environment version 1.6.0_45.


2.1.2 Python

One of the languages that has gained popularity over the last decades is Python, first released by Guido van Rossum in 1991. Python is mostly regarded as a scripting language that is easy to learn, and many libraries are available that allow the use of Python for scientific computations. Prominent examples are NumPy for numerical operations and Matplotlib [306] for the visualization of data. The toolkit developed for this work was written in Python 2.7 and executes jar files for diverse calculations. The RDKit library (Chapter 2.1.3) was applied to calculate the ligand-based pharmacophore features and was coupled to the PyMOL visualization software (Chapter 2.1.3) to process the pharmacophore models and visualize the progress of the RBPS workflow.
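How a Python toolkit can hand a calculation over to a jar file may be sketched as follows. This is a minimal illustration written for current Python 3 rather than the Python 2.7 used in this work; the jar name, the arguments and both function names are hypothetical and do not reflect the actual toolkit code.

```python
import subprocess


def build_java_command(jar_path, args=(), java_executable="java"):
    """Assemble the argument list for running an executable jar file."""
    return [java_executable, "-jar", str(jar_path)] + [str(a) for a in args]


def run_jar(jar_path, args=(), timeout=None):
    """Run the jar and return its standard output.

    Raises subprocess.CalledProcessError if the Java process exits with an error.
    """
    completed = subprocess.run(
        build_java_command(jar_path, args),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical jar name and arguments, for illustration only
    print(build_java_command("pocket_detection.jar", ["protein.pdb", 0.5]))
```

Passing the command as a list instead of a single shell string avoids quoting problems with file names that contain spaces.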

2.1.3 RDKit and PyMOL

The RDKit is an open-source cheminformatics and machine-learning toolkit developed by Greg Landrum (RDKit: Open-source cheminformatics; http://www.rdkit.org). Its core is implemented in C++, and wrappers for Python and Java are provided. The definition of pharmacophore feature points for small molecules is based on the Feature Definition File format (FDeF), in which a set of chemical features is described by SMARTS patterns. A detailed introduction to pharmacophore description can be found elsewhere (http://www.rdkit.org/docs/RDKit_Book.html). The definition of ligand-based pharmacophore points for this thesis has been derived from the CATS definitions [282], and the applied FDeF file is shown in Appendix I.

A second application of the RDKit presented in this work is based on its compatibility with the PyMOL engine [307]. PyMOL is a visualization system for macromolecules such as protein-ligand complexes and is distributed by Schrödinger; in this work the licensed version PyMOL 1.7 was applied. The advantage of PyMOL over other programs is the included Python interpreter, which allows for user-generated scripts (e.g. output files of PocketPicker), and the server mode, in which PyMOL commands can be fed into the GUI via a host-server connection with the RDKit. In this work the PyMOL and RDKit combination was applied to generate a visualized pharmacophore search workflow, where changes to the pocket and the PPPs could be included on the fly. All graphical visualizations and three-dimensional molecule models shown in this thesis were generated with PyMOL.

2.1.4 Pocket detection

Two different pocket detection programs were used. For single structure analysis and the

pocket cluster approach in the initial study [233] the grid-based PocketPicker algorithm

[138] searching for clusters of buried grid points was applied. The algorithm was adjusted

to comply with the thesis goals. A grid spacing of 1 Å, as implemented in the original

PocketPicker, is too coarse-grained to incorporate spatial errors of receptor atom

positions. The number of grid points grows cubically with decreasing grid spacing and the

tradeoff between computation time and resolution in this case was resolved by setting

the grid spacing to 0.5 Å, which translates into an eight-fold increase in grid points. The

possible incorrect positioning of the atoms was estimated by the crystallographic

temperature-factor for each individual atom. As the atom position can lead to the deletion

of grid points hitting their vdW surface in the original algorithm, those points have to be

reconsidered in a second step (Figure 12).
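The adjusted acceptance test for grid points can be sketched in Python as follows. This is not the PocketPicker code: the sketch assumes that the vdW cutoff of each atom is reduced by the root-mean-square displacement obtained from its temperature factor (B = 8π²⟨u²⟩), and all function names as well as the three status labels are hypothetical.

```python
import math


def rms_displacement(b_factor):
    """Root-mean-square displacement (Angstrom) from B = 8 pi^2 <u^2>."""
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


def grid_point_status(point, atoms, relax_by_b_factor=True):
    """Classify a grid point against atoms given as (xyz, vdw_radius, b_factor).

    Returns "rejected" if the point lies inside the cutoff of any atom,
    "reconsidered" if it lies inside a vdW radius but outside every relaxed
    cutoff, and "accepted" otherwise.
    """
    status = "accepted"
    for xyz, vdw_radius, b_factor in atoms:
        distance = math.dist(point, xyz)
        cutoff = vdw_radius
        if relax_by_b_factor:
            # Assumption: shrink the cutoff by the rms displacement, never below zero
            cutoff = max(0.0, vdw_radius - rms_displacement(b_factor))
        if distance < cutoff:
            return "rejected"
        if distance < vdw_radius:
            status = "reconsidered"
    return status


def grid_point_count(box_edge, spacing):
    """Number of grid points in a cubic box with the given edge length."""
    per_axis = int(round(box_edge / spacing)) + 1
    return per_axis ** 3


if __name__ == "__main__":
    carbon = ((0.0, 0.0, 0.0), 1.7, 20.0)  # illustrative vdW radius and B-factor
    print(grid_point_status((1.5, 0.0, 0.0), [carbon]))
    print(grid_point_count(20.0, 1.0), grid_point_count(20.0, 0.5))
```

For a finite box the ratio of grid point counts at 0.5 Å and 1 Å spacing lies slightly below eight, because the boundary layer of points is counted in both grids; it approaches eight for large boxes.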

Figure 12. PocketPicker adjustment. A) Original PocketPicker definition. The distance of a potential grid point (red square) to all atom centers (black dots) is calculated (arrows). The grid point is accepted when all distances are longer than the vdW radius of the respective atom (gray circle). Here the grid point lies too close to one of the atoms and is dismissed. B) The vdW radius as cutoff is relaxed by the atom position inaccuracy (dashed line: vdW radius; gray circle: new cutoff). The green squares are now accepted grid points; the red one again lies too close to the atom center.

The second algorithm applied for pocket detection was MDpocket [232], an open-source tool applicable to molecular dynamics trajectories. The MD snapshots were aligned with PyMOL without any refinement rounds. For the chosen pockets, all snapshots within ten percent variance around the average pocket size were chosen for further calculations. The snapshot containing the pocket closest to the average size was used for computations. The atom position inaccuracies were estimated by a local alignment of pocket-flanking residues over all considered snapshots.
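The snapshot selection described above can be illustrated with a short Python sketch. It assumes that “within ten percent variance” means a pocket size within ±10 % of the mean over all snapshots; this reading, the function name and the example sizes are assumptions made for illustration.

```python
def select_snapshots(pocket_sizes, tolerance=0.10):
    """Return (kept_indices, representative_index) for a list of pocket sizes.

    kept_indices: snapshots whose pocket size lies within +/- tolerance of the mean.
    representative_index: the kept snapshot whose size is closest to the mean,
    or None if no snapshot falls inside the tolerance window.
    """
    if not pocket_sizes:
        raise ValueError("pocket_sizes must not be empty")
    mean_size = sum(pocket_sizes) / len(pocket_sizes)
    kept = [
        index
        for index, size in enumerate(pocket_sizes)
        if abs(size - mean_size) <= tolerance * mean_size
    ]
    if not kept:
        return [], None
    representative = min(kept, key=lambda index: abs(pocket_sizes[index] - mean_size))
    return kept, representative


if __name__ == "__main__":
    # Illustrative pocket sizes (e.g. number of pocket grid points per snapshot)
    print(select_snapshots([100, 105, 95, 150, 50]))
```

The kept snapshots would then enter the local alignment of the pocket-flanking residues, while the representative snapshot provides the pocket used for the pharmacophore calculation.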

2.1.5 Pharmacophore modeling

The ligand-based pharmacophore features were calculated with the RDKit and based on individual definitions as shown in Appendix I. For pharmacophore comparison, the LIQUID vectors including HBA, HBD, lipophilic and aromatic features were calculated. The Euclidean distance between each vector pair (query against database) was calculated and the molecules were ranked in order of ascending distance.
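The ranking step can be sketched in a few lines of Python. The descriptor vectors below are placeholders and not LIQUID descriptors; the function name and the dictionary layout of the database are assumptions for illustration.

```python
import math


def rank_by_euclidean_distance(query, database):
    """Rank database entries by ascending Euclidean distance to the query vector.

    database: mapping of molecule identifier -> descriptor vector.
    Returns a list of (identifier, distance) tuples, most similar first.
    """
    ranked = []
    for identifier, vector in database.items():
        if len(vector) != len(query):
            raise ValueError(f"descriptor length mismatch for {identifier}")
        ranked.append((identifier, math.dist(query, vector)))
    return sorted(ranked, key=lambda item: item[1])


if __name__ == "__main__":
    # Placeholder three-dimensional vectors, for illustration only
    library = {"mol_a": [3.0, 4.0, 0.0], "mol_b": [1.0, 0.0, 0.0], "mol_c": [0.0, 0.0, 0.0]}
    for name, distance in rank_by_euclidean_distance([0.0, 0.0, 0.0], library):
        print(name, distance)
```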

For RBVS the “Virtual Ligand” software by Löwer et al. [300] was modified in multiple

ways to achieve the project goals. The original pharmacophore description was updated

according to the findings of Stahl [76]. The consideration of flexibility was implemented

as well as an idea of hot spot identification. The resulting set of PPPs was fed into LIQUID

descriptor calculation, which enabled pocket-ligand or pocket-pocket comparison.

2.1.6 Molecular dynamics simulation

Capturing protein flexibility, detecting transient pockets and incorporating these steps into RBPS were among the main goals of this thesis. The simulations of HIV-1 protease [233] and IspD were performed with the software package NAMD 2.8b [308]. The CHARMm 27 force field (Eq. 9) was applied together with the Particle Mesh Ewald [309] method for long-range interactions, whereas short-range electrostatic interactions were computed within 12 Å. The vdW interactions were calculated with a 6-12 Lennard-Jones potential (Equation 15). The TIP3P [310] water box with periodic boundary conditions was set up, and the system was subsequently neutralized with Na⁺ and Cl⁻ ions, using the VMD (Visual Molecular Dynamics) software package [311].

$$
\begin{aligned}
U(\vec{R}) ={} & \sum_{\text{bonds}} K_b (b - b_0)^2 + \sum_{\text{unbound}} K_{UB} (D - D_0)^2 + \sum_{\text{angles}} K_\theta (\theta - \theta_0)^2 \\
& + \sum_{\text{dihedrals}} K_\varphi \left(1 + \cos(n\varphi - \delta)\right) + \sum_{\text{impropers}} K_{im} (\phi - \phi_0)^2 \\
& + \sum_{\text{nonbond}} \left\{ \varepsilon \left[ \left(\frac{R_{min_{ij}}}{r_{ij}}\right)^{12} - \left(\frac{R_{min_{ij}}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j}{\varepsilon_1 r_{ij}} \right\},
\end{aligned}
\tag{15}
$$

where $K_b$, $K_{UB}$, $K_\theta$, $K_\varphi$ and $K_{im}$ are force constants derived from experimental data and ab initio results, $b$ is the bond length with $b_0$ as equilibrium bond length, $D$ is the unbound 1-3-distance, $D_0$ is the energetically ideal 1-3-distance, $\theta$ and $\theta_0$ are the angle and its equilibrium value, respectively, $\varphi$ is the dihedral angle with $n$ being the periodicity and $\delta$ the phase shift, $\phi$ is the improper dihedral angle (e.g. peptide bonds) and $\phi_0$ its ideal value. Nonbonded interactions, already described for the docking function, are combined estimates of vdW and electrostatic interactions. The Lennard-Jones well depth is $\varepsilon$, $R_{min_{ij}}$ is the distance at the Lennard-Jones minimum, the atom point charges are $q_i$ and $q_j$, $\varepsilon_1$ is the effective dielectric constant and $r_{ij}$ the distance between atoms $i$ and $j$.

As a result of the successful first-stage project with the HIV-1 protease [233], we started a collaboration with the National Institute of Health (NIH) in Oxford, UK to extend the conformational sampling by MD simulations. The initial 20 ns simulation with NAMD was scaled up to a total simulation time of 3 µs divided into ten times 100 ns simulations with GROMACS 4.6.5 on NVIDIA GPUs using the Verlet cutoff scheme. The applied force field was CHARMm22* [312] with NMR corrections to the protein backbone.


2.1.7 KNIME

The Konstanz Information Miner [313] is an open-source visualization approach for data

mining workflows competing with the commercial Pipeline Pilot by Accelrys from 1999.

The open-source characteristic of KNIME facilitates the usage of multiple software

packages. As an example, molecules can be read in and pre-processed by MOE nodes (MOE

license required), while the main calculation is performed by RDKit functions. The

implementation of own nodes is possible as well as statistical analysis by R nodes or Java

snippets to manipulate the data. An example workflow for ligand-based virtual screening

with LIQUID descriptors is shown in Figure 14.

All calculations were performed with KNIME 2.6.3. The ROC calculations for the retrospective analysis were performed with the “ROC Curve” node provided by KNIME GmbH. Pareto front calculations were conducted with the “Pareto Ranking” node of the open-source CADD nodes for KNIME.
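What a ROC analysis of a distance-ranked screening list measures can be made concrete with a small Python sketch of the area under the ROC curve. It is a generic rank-based formulation and not the implementation of the KNIME “ROC Curve” node; the function name and the convention that lower distances indicate more likely actives are assumptions.

```python
def roc_auc(distances, is_active):
    """Area under the ROC curve for scores where LOWER means more likely active.

    Counts, over all active/decoy pairs, how often the active is ranked ahead
    of the decoy; ties count as half a win.
    """
    actives = [d for d, active in zip(distances, is_active) if active]
    decoys = [d for d, active in zip(distances, is_active) if not active]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for active_distance in actives:
        for decoy_distance in decoys:
            if active_distance < decoy_distance:
                wins += 1.0
            elif active_distance == decoy_distance:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


if __name__ == "__main__":
    # Illustrative distances for two actives followed by two decoys
    print(roc_auc([0.1, 0.4, 0.3, 0.9], [True, True, False, False]))
```

A value of 1.0 corresponds to a perfect separation of actives and decoys, and 0.5 to a random ranking.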

Figure 14. KNIME example workflow for ligand-based pharmacophore similarity screening. The SDF Reader reads the input structure and the LIQUID descriptor is calculated. The database descriptors are pre-calculated and unique entries are ensured. The Euclidean distance to each database entry is calculated and the list is sorted to find the most similar vectors. The top fraction is taken and the database structures are joined with their distances. The joined table is written as SDF containing structural data and the calculated LIQUID distance.


2.1.8 R statistical software

The software package R is an open-source programming language for statistical computing and plotting [357]. Histograms, pie charts and plots presented in this work were created with R 3.1.2.

The dendrograms presented in the retrospective analysis were generated with the “ape” package [356]. The hierarchical clustering was performed with the “hclust” function of R on the distance matrix, applying the “complete linkage” method to join clusters based on the maximum distance between all cluster members.
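The complete linkage criterion can be illustrated with a naive agglomerative clustering in Python. This sketch only demonstrates the merging rule and is not a reimplementation of the R “hclust” function; the function name and the stopping criterion (a fixed number of clusters instead of a full dendrogram) are assumptions.

```python
def complete_linkage_clustering(distance_matrix, n_clusters):
    """Naive agglomerative clustering with complete linkage.

    distance_matrix: symmetric list of lists of pairwise distances.
    n_clusters: number of clusters at which merging stops.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    if not 1 <= n_clusters <= len(distance_matrix):
        raise ValueError("n_clusters must be between 1 and the number of items")
    clusters = [[index] for index in range(len(distance_matrix))]

    def linkage(first, second):
        # Complete linkage: the largest pairwise distance between the members
        return max(distance_matrix[i][j] for i in first for j in second)

    while len(clusters) > n_clusters:
        pairs = (
            (p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        # Merge the two clusters with the smallest complete-linkage distance
        p, q = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        merged = sorted(clusters[p] + clusters[q])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return sorted(clusters)


if __name__ == "__main__":
    positions = [0.0, 1.0, 5.0, 6.0]  # illustrative one-dimensional points
    matrix = [[abs(a - b) for b in positions] for a in positions]
    print(complete_linkage_clustering(matrix, 2))
```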

2.2 Data basis

2.2.1 A Database of Useful Decoys: Enhanced (DUD-E)

The DUD-E [315] is a data set intended for benchmarking molecular docking. Compared to the original Directory of Useful Decoys (DUD) [316], the enhanced version includes more diverse targets and an optimized decoy generation workflow. The data set contains 102 targets (Figure 14) with 22,886 active ligands taken from the ChEMBL database [317]. On average, 224 active ligands are annotated per target, and for each active 50 decoys with similar physicochemical properties but dissimilar molecular constitutions were drawn from the ZINC [318] database. The number of chosen active compounds is controlled by Bemis-Murcko scaffold clustering to reduce the bias towards overrepresented ligand scaffolds. The highest-affinity ligand is taken from each cluster until at least 100 ligands are chosen. Decoys in the original DUD are too similar to the active molecules, which often results in false negative classifications. The revised procedure in the DUD-E includes property matching of decoys to each ligand and focuses more on the removal of false negative decoys (e.g. avoiding “wargroups” that more or less guarantee activity).

The DUD-E provides a reference PDB structure, the extracted crystal ligand, a set of active molecules (according to ChEMBL) and the set of decoys generated automatically from the active references. The DUD-E does not correct errors in PDB entries, and the automated ligand extraction shows inaccuracies in bond order conservation. In order to work with high-quality data for the comparison of the ligand-based pharmacophore search (crystal ligand as a query to score the actives and the decoys) and the structure-based approach (extract the pocket around the crystal structure and apply the structure-based pharmacophore descriptions to generate the query), all extracted crystal ligands were checked against the respective PDB entry and publication. Errors were manually corrected and the corrected structures were used in this study.

Figure 14. The DUD-E target classification (adapted from Ref. 315)

2.2.2 Screening library

The screening library for VS consisted of 5,353,844 unique molecules provided by twelve different vendors (Table 4). A single 3D conformation for each molecule was generated with CORINA 3.46 (Molecular Networks GmbH, Erlangen, Germany; http://www.molecular-networks.com). Additionally, all compounds were preprocessed with the “wash” function (deprotonation of strong acids, protonation of strong bases) provided in the Molecular Operating Environment (MOE) [285].


Table 4. Commercial compound collections.

| Vendor | Number of compounds | Homepage |
|---|---|---|
| ASINEX | 474,661 | http://www.asinex.com/ |
| CHEMICAL BLOCK | 125,424 | http://www.chemblock.com/ |
| ChemBridge | 1,022,400 | http://www.chembridge.com/ |
| ChemDiv | 916,461 | http://eu.chemdiv.com/ |
| Enamine | 1,884,364 | http://www.enamine.net/ |
| InterBioScreen | 514,677 | http://www.ibscreen.com/ |
| LIFE CHEMICALS | 377,958 | http://www.lifechemicals.com/ |
| MAYBRIDGE | 54,318 | http://www.maybridge.com/ |
| Princeton Bio | 970,808 | http://www.princetonbio.com/ |
| Specs | 212,620 | http://www.specs.net/ |
| TimTec | 127,395 | http://www.echemstore.com/ |
| VITAS-M | 1,319,079 | http://www.vitasmlab.com/ |

2.3 Protein targets for prospective studies

Two prospective examples were investigated to assess the applicability of the software:

• Human Immunodeficiency Virus 1 protease (HIV-1 protease) is one of the main targets for the treatment of AIDS. The channel-like active site is formed by the two monomers. The high mutation rate of the virus leads to manifold resistant strains and hampers the successful treatment of the disease [320,328]. Targeting more conserved allosteric regions might help overcome this problem.

• 4-Diphosphocytidyl-2C-methyl-D-erythritol synthase (IspD) is a key enzyme in the non-mevalonate pathway, which is not present in mammals, and a potential anti-infective drug target. The catalyzed condensation involves CTP, and the phosphate binding sites are polar. The need for potential inhibitors to cross the cell membrane conflicts with the polarity required for binding to this pocket. Therefore, the detection and validation of allosteric pockets might help target this protein.


2.3.1 Human Immunodeficiency Virus 1 protease (HIV-1 protease)

The Human Immunodeficiency Virus (HIV) is a major global public health issue. According to the World Health Organization (WHO), more than 39 million people have died from the consequences of an HIV infection so far, about 1.5 million of them in 2013 [350]. Around 35 million people are currently infected, with approximately 2.1 million new infections in 2013 [350]. HIV is predicted to become the most lethal infectious disease within the next 15 years [319]. HIV is a lentivirus and belongs to the class of retroviruses. Lentiviruses are characterized by their long incubation period and their ability to infect non-dividing cells while integrating their genetic information into the host cell DNA. HIV targets the immune system, and the body becomes vulnerable to cancer and to other diseases that are otherwise considered harmless. Infection is not synonymous with the onset of AIDS, which can take from 2 to 15 years depending on the host and the viral subtype [350]. Two types of the virus have been reported: HIV-1 is more virulent, shows higher infectivity and has spread globally, while HIV-2 is mainly found in West Africa. HIV-1 can be further separated into groups: M for major, O for outlier, N for non-M and non-O and, since 2009, P for pending. The M group is further split into subtypes labeled by the letters A to K [351].

The nature of retroviruses hinders the cure of HIV infection, because host cells carrying integrated viral genetic material can barely be detected and removed. Antiretroviral therapy (ART) can minimize the spread of the virus and thereby successfully treat the infection. Treated HIV infections today resemble a chronic disease. The therapy of HIV infections has a long history and mainly focuses on three key enzymes of the virus:

• The reverse transcriptase, which transcribes the viral RNA into DNA

• The integrase, which integrates the new pieces of DNA into the host genome

• The protease, which cleaves the polyproteins produced by the host cells from the newly integrated DNA

In the first years of ART the research focus was on nucleoside analogs that block the reverse transcriptase activity to stop the viral life cycle. The missing proofreading function of the reverse transcriptase [320] causes problems for HIV-1 treatment in the form of drug resistance. In the mid-1990s a study by Hammer and coworkers showed that the simultaneous intake of two HIV-1 medicines improves the therapeutic outcome [352]. Highly active antiretroviral therapy (HAART) combines drugs for different viral targets and has become state of the art for HIV-1 therapy [351]. Besides nucleoside analogs and non-nucleoside reverse transcriptase inhibitors, protease and integrase inhibitors have been introduced to disrupt the HIV-1 life cycle in the cells [321]. The agent Enfuvirtide®, introduced by Hoffmann-La Roche in 2003, belongs to the class of fusion inhibitors and instead inhibits the fusion of the virus with its target cell.

HIV-1 protease

HIV-1 protease is a C2-symmetric, homodimeric aspartate protease. Each monomer consists of 99 amino acids (Figure 15) [322]. All essential HIV proteins are translated from a single spliced transcript. The regular translation product is the p55 polyprotein containing the structural proteins, but with a five percent chance of a -1 shift in the reading frame, the p160 Gag-Pol polyprotein is expressed, which contains the HIV-1 enzymes [323]. The HIV-1 protease cleaves the polyprotein to obtain the individual structural proteins and enzymes needed to rebuild the virus and infect additional cells after release [323]. The cleavage takes place in the active site of the protease, a channel formed upon dimerization. The core residues are D25, T26 and G27 of both subunits, and the required catalytic water molecule is present in many crystal structures [324, p. 16] (Figure 16). The channel is flanked by glycine-rich flexible regions (also referred to as flaps) that allow opening and closing of the active site for substrate binding and product release [325].


Figure 15. HIV-1 protease overview in cartoon representation. The main frame shows the HIV-1 protease dimer (PDB:3ixo) [344]; the monomers are colored gray and blue. The active site lies in the channel formed by the monomers. The top frame shows two different flap conformations (closed: light gray, PDB:3ixo; open: dark gray, PDB:1hhp [345]). The lower frame depicts the active site with the core amino acids (D25, T26, G27) in stick representation (C: yellow, O: red, N: blue) and the catalytic water as spheres.


The first crystal structure of the HIV-1 protease was solved in 1989 and was the starting point for structure-based drug design [324,326]. It has been shown that mutation of the core residues as well as chemical inhibition of the active site leads to the production of non-infectious HIV-1 particles [324, p. 13]. The HIV-1 protease is mainly targeted by peptide analogs of the natural substrate that cannot be cleaved by the protease. Many pitfalls have been reported for protease inhibitors, starting with the poor gastrointestinal absorption of the drugs and the risky boosting by simultaneous inhibition of human cytochrome P450 enzymes. On the one hand, the half-life of the drug can be increased; on the other hand, the inhibition of cytochrome P450 enzymes affects other medications as well and therefore increases the risk of overdose due to a longer metabolic half-life [327]. The minimization of side effects is a main goal for the further development of protease inhibitors [323]. A second goal is to overcome the reported growing drug resistance and cross-resistance of HIV strains [328]. The high mutation rate of the protease, which is due to the missing proofreading function of the reverse transcriptase, cannot be changed. As the virus stays in the human body, it is important to fully inhibit its replication to prevent mutations or crossover events between different strains and thereby avoid further diversification of the virus [321]. The investigation of allosteric inhibitors can lead to molecules with better absorption or to ligands that bind to more conserved regions, reducing the risk of resistance [329].

2.3.2 HIV-1 protease assay

The HIV-1 protease activity determination and IC50 measurements were performed by Reaction Biology Corp. (Malvern, U.S.A.) on a fee-for-service basis. The assay conditions can be found in the AnaSpec SensoLyte® HIV-1 Protease Assay (Catalogue: 71127). A FRET-labeled [358] substrate is added to the mixture of protein and test substance. Product formation is followed over time by fluorescence readouts, and the activity is compared to a non-inhibitor control and to the reference inhibitor pepstatin A.


2.3.3 4-Diphosphocytidyl-2C-methyl-D-erythritol synthase (IspD)

The 4-diphosphocytidyl-2C-methyl-D-erythritol (CDE-ME) synthase (IspD) is a key enzyme of the non-mevalonate pathway for the biosynthesis of isoprenoids, which was discovered in 1990 [330,331]. The enzyme catalyzes the formation of CDE-ME from cytidine triphosphate (CTP) and 2C-methyl-D-erythritol-4-phosphate (MEP) (Scheme 1). The mevalonate-dependent pathway for the biosynthesis of steroids, terpenoids and vitamins was discovered by Lynen and Bloch in the 1960s [330]. An advantage of targeting the non-mevalonate pathway (also referred to as the DXP pathway) is its absence in mammals. Targeting enzymes without human homologs reduces the risk of side effects; it is an active research field for infectious diseases such as malaria and tuberculosis and is also of interest for the development of new herbicides [331,332,340].

Scheme 1. Reaction catalyzed by IspD.

The IspD enzyme is a member of the cytidylyltransferase family and forms an active homodimer with two distinct catalytic sites. The catalyzed reaction (Scheme 1) proceeds in the presence of a divalent metal ion such as Mn²⁺, Co²⁺ or Mg²⁺, which forms coordinative bonds with phosphate oxygens of CTP. Examples of AtIspD bound to CMP and to an allosteric inhibitor are depicted in Figure 16. The available crystal structures of AtIspD (PDB: 1w77, 2ycm) contain unresolved regions in the CTP binding site (residues 88 – 94) and in the arm that is involved in dimerization (residues 224 – 229) [340, 353]. In order to perform MD simulations, the missing residues were added with the program Modeller [333]. Multiple models were generated and further manually refined with Moloc [334].

Figure 16. Arabidopsis thaliana IspD bound to CMP (PDB: 1w77, left) and bound to an allosteric inhibitor (PDB: 2ycm, right). The protein monomer is shown as cartoon and transparent vdW surface. Molecules are presented as sticks (C: green, N: blue, O: red, P: orange). In the lower panel, the ligand interaction plot provided by MOE is shown. Green arrows indicate side-chain hydrogen bonds; blue arrows indicate backbone hydrogen bonds.


3 Results

The result of this thesis was a study on the influence of protein atom inaccuracy measurements on receptor-based pharmacophore searches. Existing algorithms were adapted, and a new workflow was established based on retrospective analyses and prospectively tested in two projects. The study included the implementation of a modular, interactive receptor-based pharmacophore search program, which allowed for an easy exchange of pocket detection, pharmacophore definition and pharmacophore descriptor calculation modules. The user was asked to add specific information during the process. Protein flexibility could be included as inaccuracies of the receptor atom positions. For single X-ray structures the temperature factor may be applied to adjust the detected pocket as well as the pharmacophore feature points. For multiple input structures an alignment (global or of local pocket residues) could be performed to determine the atom flexibility. The results chapter is structured in three main parts:

Introduction of the RBVS tool based on a showcase,

Retrospective analysis of the DUD-E database,

Prospective studies on HIV-1 protease and IspD.

3.1 Introduction of the RBVS tool based on a showcase

A flow chart describing the developed RBVS tool is shown in Figure 17. In the following

section each flow chart step will be discussed in detail, including computational details,

and a showcase is presented to visualize the implementations. For the showcase

PocketPicker was applied to extract the potential binding pockets and the VirtualLigand

descriptor was calculated for a chosen pocket.

Figure 17. Flow chart for the receptor-based pharmacophore search workflow presented in this work.
Explanations for the individual parts and a showcase are presented in chapter 3.1. The black arrows indicate
the flow from the start to the endpoint. The rat-tail files indicate external process calls and the dotted arrow
indicates a user input during an external Java call.

[Flow chart not reproduced. Lanes: Python, User, PyMOL. External calls: MOE, Java. Box labels: Start,
Parameter file, Initiation, Protonate 3D, Choose and adjust pocket, Display pocket, Pocket processed,
Save user changes, Alignment, Pocket refinement, PPP feature definition, PPP calculation, Delete undesired
PPPs, Display PPPs, Lipophilic cut-off, PPPs processed, Save user changes, Descriptor calculation,
Display descriptor, Save project, End.]

3.1.1 Parameter file

First, the user generates the parameter file, which declares all variables important for
the RBPS project. The input file locations are stored, parameters for the individual steps
are defined, and output names as well as locations are set. Comment lines start with a
leading ‘#’, while parameter name and value are tab-separated. The parameter file prepared
for the showcase is presented in Appendix II. The meaning and influence of the parameters
are discussed in the corresponding sections below.
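The parameter file layout described above (comment lines with a leading ‘#’, tab-separated name and value) can be read with a few lines of Python. The following is only a minimal sketch under that description, not the parser used in the thesis; the function names `parse_parameter_file` and `as_bool` are hypothetical, and the example values are taken from the showcase printout in Figure 18.

```python
from typing import Dict


def parse_parameter_file(text: str) -> Dict[str, str]:
    """Parse 'name<TAB>value' lines; lines starting with '#' are comments."""
    params: Dict[str, str] = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, separator, value = line.partition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: expected 'name<TAB>value'")
        params[name.strip()] = value.strip()
    return params


def as_bool(value: str) -> bool:
    """Interpret the 'true'/'false' strings used for switches such as p3d."""
    return value.strip().lower() == "true"


# Example input mirroring a few showcase parameters (values from Figure 18).
example = "# showcase\ninPDB\t../Showcase/3ixo_noh2o.pdb\np3d\ttrue\nsteps\t3\n"
params = parse_parameter_file(example)
```

All values are kept as strings on purpose; converting "steps" to an integer or "p3d" to a Boolean is left to the step that consumes the parameter.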

3.1.2 Initiation

In the initial phase of the program the parameter file is parsed in Python and the
parameters are extracted for the configuration of the program (Figure 18). To allow
communication between the Python RDKit package and PyMOL, a PyMOL instance has to be
started in server mode by adding the -R option to the PyMOL call in the terminal
(Figure 18).

After the parameter parsing the “p3d” parameter is checked and, if set, an external MOE
script is automatically written and executed to run the MOE Protonate 3D function on the
defined protein (“inPDB”). The protonation of the protein is important for the definition
of potential hydrogen-bridge donor points provided by the receptor. The (protonated)
protein is then loaded into the PyMOL window together with the pre-calculated pockets
(“inPP” parameter) and the user is asked to choose a pocket and to delete undesired parts
of it on the fly (Figure 19). According to the “useTempFac” parameter the crystallographic
temperature factor is translated into radial atom position inaccuracy values by Equation X
and applied during the workflow.
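Equation X itself is not reproduced in this section. As an illustration only, the sketch below uses the standard crystallographic relation for an isotropic temperature factor, B = 8π²⟨u²⟩, and returns the root-mean-square displacement √⟨u²⟩. Whether the thesis applies exactly this form is an assumption. The function names are hypothetical; the column positions (61 to 66) follow the PDB format for ATOM/HETATM records.

```python
import math


def b_factor_to_rms_displacement(b_factor: float) -> float:
    """Return sqrt(<u^2>) in Å for an isotropic B-factor in Å^2.

    Assumes the standard relation B = 8 * pi^2 * <u^2>; the thesis's own
    Equation X may differ.
    """
    if b_factor < 0:
        raise ValueError("B-factor must be non-negative")
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


def b_factor_from_pdb_line(line: str) -> float:
    """Read the temperature factor (columns 61-66) of an ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return float(line[60:66])
```

A B-factor of 8π² Å² (about 79 Å²) corresponds to a displacement of 1 Å under this relation, so typical values of well-ordered atoms translate into inaccuracies well below 1 Å.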

Figure 18. Two shell windows for the initiation of the program. In the upper shell, PyMOL is started in

server mode by adding the (-R) suffix to the program call (first red line). The second red line is printed by

PyMOL and confirms the running server mode. In the lower shell calling Python with the parameter file is

shown and the workflow is started. The parsed parameters are printed out to double-check the values for

the user.

modlab$ MacPyMOL -R
PyMOL(TM) Incentive Product - PyMOL Executable Build
Copyright (C) Schrodinger, LLC
...
...
xml-rpc server running on host localhost, port 9123

modlab$ python Interactive_VL.py configFile_Showcase.txt
############# RUN PARAMETERS ##############
outPath <-- ../Showcase
prot_name <-- HIV_showcase
inPDB <-- ../Showcase/3ixo_noh2o.pdb
p3d <-- true
inPPTXT <-- ../Showcase/Data/GridData18.txt
inPP <-- ../Showcase/Data/Grid.py
useMD <-- false
useTempFac <-- true
changeGrid <-- true
steps <-- 3
refine <-- false
writeGrid <-- true
VLCoreParam <-- 1.4 2.2 2.6 1.6 0 0 20 1 0
align_radius <-- -
MDSerialfile <-- -
align_mode <-- -
keepTmp <-- true
#############################################

Figure 19. Workflow initiation visualized in PyMOL. A) The protein (HIV-1 protease, PDB: 3ixo, for the
showcase) was loaded into the PyMOL window (manual coloring of the two monomers in blue and gray). B)
The Python script visualizing the pocketome is loaded (PocketPicker for the showcase). C) The chosen
pocket (black frame in B), located in the elbow region of the HIV-1 protease. D) The user can delete undesired
parts of the pocket on the fly, as shown for the lower part of the pocket in this showcase.

3.1.3 Pocket adjustment

In principle the program is not limited to PocketPicker files. Every pocket description
following the format presented in Figure 26 can be applied. Python waits for the user
input declaring which pocket is chosen and processed. Once the input is given, the Python
script takes over to prepare and perform the potential pharmacophore point calculation.
First, the data file containing the pocket grid point positions is rewritten, including
the user’s changes performed in PyMOL. The next step focuses on protein flexibility.
Multiple protein conformations derived from MD simulations, NMR or X-ray crystallography
can be aligned to the processed protein structure (“useMD” parameter). The alignment is
either global or restricted to the local pocket residues (“align_mode” set to “global” or
“local”). The “local” mode requires the “align_radius” parameter, which describes the
maximal allowed distance between a protein atom and any pocket grid point for the atom to
be considered in the alignment. The alignment is performed by a Java implementation of the
Kabsch algorithm [338]. The flexibility of each atom is set to the maximal distance
between the atom position in the reference protein structure (“inPDB”) and the
corresponding atom positions in all aligned structures. The alignment takes a serial file
(“MDSerialfile”) containing one file path of a structure in PDB format per line, starting
with the reference structure in line one. The output file contains a unique atom
identifier and the corresponding calculated flexibility value.
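The flexibility definition above (maximal distance between the reference atom position and the corresponding positions in all aligned structures) can be sketched as follows. This is an illustration in Python, not the Java implementation referred to in the text. It assumes the structures are already superposed and that atoms are matched by a unique identifier string, whose format here is hypothetical.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def atom_flexibility(reference: Dict[str, Coord],
                     aligned: List[Dict[str, Coord]]) -> Dict[str, float]:
    """Per-atom flexibility: largest displacement from the reference position.

    `reference` and every entry of `aligned` map an atom identifier to its
    coordinates after superposition. Atoms missing in a structure are skipped.
    """
    flexibility: Dict[str, float] = {}
    for atom_id, ref_pos in reference.items():
        distances = [math.dist(ref_pos, structure[atom_id])
                     for structure in aligned if atom_id in structure]
        flexibility[atom_id] = max(distances, default=0.0)
    return flexibility
```

Using the maximum rather than a mean makes the value sensitive to a single outlying conformation, which matches the definition given in the text.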

In contrast to the protein flexibility derived from the alignment, the crystallographic

temperature-factor can also be applied to approximate the atom position inaccuracies.

The calculated values can be transferred to the pharmacophore feature calculation

algorithm and inflexible regions will produce well-defined pharmacophore feature points,

while flexible regions will result in "fuzzified" feature points. The basis for this concept is

already incorporated in the pocket description, which can be adapted to consider the possible

movements of the protein (Figure 20).

Figure 20. Automated pocket adjustments. Added grid points are displayed in blue, while already known
ones are shown in gray. A) The 1 Å spaced pocket grid processed by the user. B) Visualization of the
“changeGrid” function: grid spacing was reduced to 0.5 Å and all inner grid points were calculated. C)
Additional outer grid points not clashing with the vdW surface of the protein were calculated. D) New pocket
points after two additional calculation steps. Clashes were now calculated with the MD / temperature-factor
corrected receptor atom positions.

In order to increase the sensitivity for notable changes in the protein, the grid spacing can
be reduced to 0.5 Å (0.25 Å, 0.125 Å, …), with a tradeoff between computing time and
resolution found at 0.5 Å. In the initial step all inner grid points for the main and the new
hub axles are calculated (“changeGrid” parameter). Next, the pocket can be extended by
an incremental procedure (“steps” parameter). To this end, all possible grid points
surrounding the current pocket at a distance of 0.5 Å (0.25 Å, …) are calculated in each step.
In the first iteration the algorithm works with the original vdW surface as cut-off for
clashes, while for each additional iteration the MD or temperature-factor corrected
cut-off values are considered (Figure 20). New points located far away from any receptor
atom are dismissed (solvent-exposed grid points), as are points closer to any receptor atom
than the current cut-off value. The newly added points accepted in one step also serve as
starting points for additional potential grid points in the following steps. Additionally,
for each grid point a radius is assigned describing the shortest distance to a clash with

either the protein or a second grid point. In the showcase the “steps” parameter was set
to 3. The two rounds of searching for additional grid points with temperature-factor
corrected atom positions were sufficient, as the maximal measured atom position
inaccuracy was lower than 1 Å.
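The incremental pocket extension can be illustrated with a small sketch on an integer lattice. Several details are assumptions made for illustration and are not taken from the thesis: candidates are the six axis neighbors of every pocket point, the corrected clash cut-off is the vdW radius reduced by the atom position inaccuracy, and `max_atom_distance` (the solvent-exposure limit) is a placeholder value.

```python
import math
from typing import Iterable, List, Set, Tuple

GridIndex = Tuple[int, int, int]
# (x, y, z, vdW radius, position inaccuracy), all in Å
Atom = Tuple[float, float, float, float, float]

OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def is_acceptable(pos: Tuple[float, ...], atoms: List[Atom],
                  corrected: bool, max_atom_distance: float) -> bool:
    """Reject points that clash with an atom or lie far from every atom."""
    nearest = math.inf
    for x, y, z, radius, inaccuracy in atoms:
        # Assumption: corrected cut-off = vdW radius minus inaccuracy.
        cutoff = max(radius - inaccuracy, 0.0) if corrected else radius
        distance = math.dist(pos, (x, y, z))
        if distance < cutoff:
            return False  # clash with the receptor
        nearest = min(nearest, distance)
    return nearest <= max_atom_distance  # otherwise solvent exposed


def extend_pocket(pocket: Iterable[GridIndex], atoms: List[Atom],
                  spacing: float = 0.5, steps: int = 3,
                  max_atom_distance: float = 4.0) -> Set[GridIndex]:
    """Grow the pocket step by step; step 0 uses vdW radii, later steps the corrected ones."""
    points = set(pocket)
    for step in range(steps):
        corrected = step > 0
        accepted: Set[GridIndex] = set()
        for i, j, k in points:
            for di, dj, dk in OFFSETS:
                candidate = (i + di, j + dj, k + dk)
                if candidate in points or candidate in accepted:
                    continue
                position = tuple(c * spacing for c in candidate)
                if is_acceptable(position, atoms, corrected, max_atom_distance):
                    accepted.add(candidate)
        points |= accepted  # accepted points seed the next step
    return points
```

With "steps" set to 3, this sketch performs one round against the vdW surface and two rounds against the corrected cut-offs, which mirrors the showcase setting.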

Figure 21. The “refine” parameter and pocket shape. In the upper picture spheres represent the refined

pocket grid. The sphere radii approximate the available free space without clashing with the protein or any

other pocket grid point. Gray spheres represent the grid before applying the “refine” parameter; blue

spheres were added, when the “refine” parameter was set to “true”. In the bottom left the pocket surface

was calculated in PyMOL based on the refined grid. The bottom right picture presents the pocket (gray

surface) fitting the potential binding site (blue surface) of the protein.

The “refine” parameter produces a grid with irregular grid spacing at positions close to the

protein. After the user-defined number of steps to find new grid points, the search radius

for new points is halved once more (Figure 21). One additional round of new points is

searched based on the reduced radius to increase the pocket resolution close to the

receptor.

The “writeGrid” parameter creates a file containing all coordinates of the new grid.

In addition to the changed pocket, an exclusion sphere file is generated containing

receptor atom centered spheres with vdW or MD / temperature-factor adjusted volumes

(Figure 22).

Figure 22. Exclusion spheres. The pocket grid points are displayed as gray spheres and the atom centered

exclusion spheres are shown as blue spheres. Exclusion sphere coordinates are saved together with the

sphere radii.

3.1.4 Potential pharmacophore point definition

In the next step geometry based interaction rules described by Bissantz and coworkers

[76] are applied to define the PPPs in the pocket (Appendix IV). The pocket forming

residues are specified and for each pair of pocket points and pocket forming atoms the set

of pharmacophore rules is evaluated.
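Evaluating a geometric rule for one pair of pocket point and receptor atom comes down to a distance and an angle test. The sketch below checks a hydrogen-bond donor of the receptor against a grid point. The thresholds `max_distance` and `min_angle` are placeholders chosen for illustration and are not the values from Bissantz and coworkers or from Appendix IV; the function names are hypothetical as well.

```python
import math
from typing import Tuple

Coord = Tuple[float, float, float]


def angle_deg(a: Coord, b: Coord, c: Coord) -> float:
    """Angle a-b-c in degrees, measured at point b."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.hypot(*ba) * math.hypot(*bc)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding errors
    return math.degrees(math.acos(cosine))


def passes_donor_rule(donor: Coord, hydrogen: Coord, grid_point: Coord,
                      max_distance: float = 3.5, min_angle: float = 120.0) -> bool:
    """True if the grid point could host an acceptor for this receptor donor.

    Both thresholds are illustrative placeholders, not published rule values.
    """
    distance = math.dist(donor, grid_point)
    angle = angle_deg(donor, hydrogen, grid_point)
    return distance <= max_distance and angle >= min_angle
```

A grid point passing the test would receive the complementary feature, here an acceptor PPP.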

Figure 23. Atom position inaccuracies for potential pharmacophore point description. The pocket grid is
shown as black squares, the protein surface is sketched by the orange line. The blue square symbolizes an HBD
function present in the protein cavity, while red squares are acceptor features in the binding site.
The gray circles indicate the position inaccuracy. A) For each grid point the angle and distance to the donor
function of the protein are calculated. B) Whenever the values pass the corresponding pharmacophore rule, a
pharmacophore acceptor feature is generated at the grid point position and the radius is copied as well. C)
All grid points within the distance of this radius become HBA points themselves, with a radius equal to
the difference between the initial point’s radius and the distance between the two points. The radius is
further capped by the grid point radius parameter.

For each positive evaluation a potential pharmacophore point complementary to the
feature offered by the receptor is placed at the corresponding pocket point coordinates.
When including the atom position inaccuracies obtained from the alignment or from the
crystallographic temperature-factor, the inaccuracy value of the atom position is
copied to the potential pharmacophore point. The feature is then spread among the
neighboring pocket points as shown in Figure 23. Each grid point closer than the
inaccuracy value becomes a potential pharmacophore point of the same type, but the
transferred inaccuracy value is reduced by the distance to the original PPP and cannot
exceed the radius assigned to the grid point during the grid adjustment steps. The pocket
shape therefore restricts the shape of the produced PPP.
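The spreading rule can be written down directly: a neighboring grid point inherits the feature if it lies closer than the inaccuracy value, and its radius is the inaccuracy minus the distance, capped by the radius assigned to that grid point. The sketch below follows this description; the data layout (a dictionary from grid point coordinates to grid point radius) and the function name are assumptions for illustration.

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float, float]


def spread_feature(origin: Coord, inaccuracy: float,
                   grid_radii: Dict[Coord, float]) -> Dict[Coord, float]:
    """Spread one PPP placed at `origin` to the surrounding grid points.

    `grid_radii` maps each pocket grid point to the radius assigned during
    the grid adjustment. Returns the radius transferred to every grid point
    that receives the feature.
    """
    spread: Dict[Coord, float] = {}
    for point, grid_radius in grid_radii.items():
        distance = math.dist(origin, point)
        if distance < inaccuracy:
            # Reduce by the distance travelled, then cap by the local free space.
            spread[point] = min(inaccuracy - distance, grid_radius)
    return spread
```

The cap by the grid point radius is what lets the pocket shape restrict the shape of the fuzzified feature.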

3.1.5 Potential pharmacophore point hotspots

During the PPP definition step a grid point can be assigned as PPP of the same type by a
single receptor atom multiple times, but only the solution with the highest assigned
inaccuracy value (Equation X) is kept. As different receptor atoms can assign the same
pharmacophoric feature to a grid point, the frequency at which a grid point is found is
stored too. The identification of pharmacophore hotspots is difficult. Defining an
automated procedure that works consistently well seemed impossible, as the differences in
pocket shape and especially the ranking of individual contributions to binding in unknown
protein pockets were hard to predict [75]. In the presented program the user is asked to
dismiss PPPs in two steps. The rules for lipophilic interactions are unspecific, as they
are only distance-based and do not require any interaction-specific angles. Therefore
lipophilic PPPs dominate the other features in terms of numbers. In our showcase around
94 % of the pocket presents lipophilic PPPs. A histogram is prepared for the user and a
lipophilic cutoff value defining the minimal required PPP frequency is

requested. The differences in protein shape make it difficult to define the cutoff value
automatically, as the number of neighboring protein atoms can differ considerably. The
histogram for the showcase is shown in Figure 24.
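The bookkeeping described in this section (keep the highest inaccuracy value, count how many receptor atoms assign a feature to a grid point, then apply a user-defined lipophilic cutoff) can be sketched as below. The tuple layout, the function names and the choice to take the maximum across all atoms are assumptions for illustration; the feature letter "L" for lipophilic follows the PPP file format of Figure 26.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Point = Tuple[float, float, float]
# (receptor atom id, grid point, feature type, inaccuracy value)
Assignment = Tuple[str, Point, str, float]
Key = Tuple[Point, str]


def collect_ppps(assignments: Iterable[Assignment]) -> Dict[Key, Tuple[float, int]]:
    """Map (grid point, feature type) to (highest value, number of distinct atoms)."""
    best: Dict[Key, float] = {}
    atoms: Dict[Key, Set[str]] = defaultdict(set)
    for atom_id, point, feature, value in assignments:
        key = (point, feature)
        best[key] = max(value, best.get(key, 0.0))
        atoms[key].add(atom_id)
    return {key: (best[key], len(atoms[key])) for key in best}


def lipophilic_histogram(ppps: Dict[Key, Tuple[float, int]]) -> Counter:
    """Count lipophilic PPPs per frequency, as shown to the user before the cutoff."""
    return Counter(freq for (_, feature), (_, freq) in ppps.items() if feature == "L")


def apply_lipophilic_cutoff(ppps: Dict[Key, Tuple[float, int]],
                            cutoff: int) -> Dict[Key, Tuple[float, int]]:
    """Keep all non-lipophilic PPPs and lipophilic ones with frequency >= cutoff."""
    return {key: value for key, value in ppps.items()
            if key[1] != "L" or value[1] >= cutoff}
```

In the showcase the cutoff was set to 14, so only grid points with at least 14 potential lipophilic interaction partners would keep their lipophilic PPP under this sketch.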

Figure 24. Histogram of the number of potential lipophilic interaction partners by the protein. Lipophilic

potential pharmacophore points dominate the receptor-derived pharmacophore in terms of numbers. The

user of the program is asked to define a cutoff value depending on the number of potential lipophilic

interaction partners a grid point has. For the showcase the cutoff was set to 14.

Afterwards all remaining PPPs are saved and visualized in the PyMOL window for the
second selection step. The PPPs of each feature are grouped according to their originating
amino acid of the receptor (e.g. all acceptor PPPs found by Asp60). The feature types are

indicated by colors (HBA: red, HBD: blue, Aromatic: yellow, Lipophilic: green) and the
frequency is coded by brightness (darker coloring: lower frequency, brighter coloring:
higher frequency). The user is asked to delete undesired PPP groups before the
pharmacophore descriptor calculation is started (Figure 25).

Figure 25. PPP depiction and processing in PyMOL. All PPPs were loaded to PyMOL and were grouped

according to their responsible amino acid (right panel). Pharmacophore features were presented by colored

meshes (HBA: red, HBD: blue, Lipophilic: green, Aromatic: yellow). In the lower left three HBD groups were

presented within the binding site (C: white, N: blue, O: red). Two different representations were

demonstrated (mesh and surface). In the lower right one out of the three groups was deleted and was not

considered for the descriptor calculation. For the showcase all PPPs attributable to Asp60 (black circle)

were removed in PyMOL.

In order to allow alternative pharmacophore feature definition algorithms or
pharmacophore descriptors, the file format for PPP description is presented in Figure 26.
Once the user agrees to the currently displayed PPPs, Python prepares the files for the final

pocket derived pharmacophore descriptor calculation. The finalized pharmacophore

points presented by PyMOL are automatically saved in the introduced PPP format (Figure

26).

Figure 26. File formats for pocket and PPP description. On the left side the pocket data file format is
described. Pockets are stored as a list of grid points with the same ID. Each line defines one grid point with
its ID and the grid point coordinates. On the right side the PPP file format is presented. It is similar to the
SDF format and can store multiple pharmacophore models in one file. The name of the molecule is given first.
In the following lines all PPPs are listed with feature type and the XYZ coordinates (feature types: D: donor,
A: acceptor, L: lipophilic, R: aromatic). Four dollar symbols “$$$$” mark the end of the model and the next
one may start in the next line.

MyPockets:
0 1.1 1.4 2.2
0 1.1 1.4 3.2
1 4.6 1.9 0.5
1 5.1 2.9 0.5

MyPPPs:
Molecule1:
D 1.1 1.4 2.2
A 1.1 1.4 3.2
$$$$
Molecule2:
L 4.6 1.9 0.5
A 5.1 2.9 0.5
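Both file formats described in the caption of Figure 26 are simple enough to read with a few lines of Python. The sketch below is an illustration based on that caption only. It assumes whitespace-separated fields, treats the first line of each PPP record as the model name (a trailing colon is stripped) and does not expect the “MyPockets:” and “MyPPPs:” titles inside the files. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]
PPP = Tuple[str, float, float, float]


def parse_pockets(text: str) -> Dict[int, List[Coord]]:
    """Read 'ID x y z' lines and group the grid points by pocket ID."""
    pockets: Dict[int, List[Coord]] = {}
    for raw in text.splitlines():
        if not raw.strip():
            continue
        pocket_id, x, y, z = raw.split()
        pockets.setdefault(int(pocket_id), []).append((float(x), float(y), float(z)))
    return pockets


def parse_ppp_models(text: str) -> Dict[str, List[PPP]]:
    """Read SDF-like PPP records: name line, 'type x y z' lines, '$$$$' terminator."""
    models: Dict[str, List[PPP]] = {}
    name = None
    points: List[PPP] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "$$$$":  # end of the current model
            if name is not None:
                models[name] = points
            name, points = None, []
        elif name is None:  # first line of a record is the model name
            name = line.rstrip(":")
        else:
            feature, x, y, z = line.split()
            points.append((feature, float(x), float(y), float(z)))
    if name is not None:  # tolerate a missing final terminator
        models[name] = points
    return models
```

Keeping the parsers this small reflects the stated purpose of the format: any alternative pocket detection or feature definition tool only has to write these plain text lines to plug into the workflow.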

3.1.6 Pharmacophore descriptor calculation

In the present version receptor derived LIQUID descriptors, as described in the original

study by Löwer et al. [300], are calculated applying the “VLCoreParam” parameter set.

The automatically generated Python script for the visualization of the descriptor is loaded

into the PyMOL window (Figure 27). As a final step the user can save the complete PyMOL

session, while the “keepTmp” parameter decides whether the individual interim scripts

and files should be deleted and only the descriptor file is kept. The calculated descriptor

vector can now be applied as query for a ligand database pharmacophore search to

perform virtual screening, or the distance to other processed pocket pharmacophore

vectors can be calculated to compute an estimate of the pharmacophoric pocket-pocket

similarity.
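Using the descriptor vector as a query amounts to ranking database vectors by their distance to it. The sketch below uses the Euclidean distance and plain Python tuples as descriptor vectors; the function name and the toy vectors are hypothetical, and real LIQUID descriptors are much longer than the two-dimensional examples used here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def rank_by_distance(query: Sequence[float],
                     library: Dict[str, Sequence[float]]) -> List[Tuple[float, str]]:
    """Return (distance, name) pairs sorted from most to least similar."""
    return sorted((math.dist(query, vector), name)
                  for name, vector in library.items())


# Toy example: the query could be a pocket-derived vector, the library
# either ligand vectors (virtual screening) or other pocket vectors.
ranking = rank_by_distance((0.0, 0.0), {"ligand_a": (3.0, 4.0), "ligand_b": (1.0, 0.0)})
```

The same function covers both uses mentioned in the text, since pocket-pocket similarity only changes what the library contains.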

Figure 27. Visualization of the receptor-derived pharmacophore model. The “VLCoreParam” parameters

are used to generate this pharmacophore model based on the reduced PPP set prepared in PyMOL. The

trivariate Gaussians are colored according to the four feature types (HBA: red, HBD: blue, Lipophilic: green,

Aromatic: yellow).

3.2 Retrospective analysis for the DUD-E database

The retrospective analysis for the DUD-E database was performed with three different
settings. In the first setting the crystallized ligands of all 102 targets served as
reference molecules for a ligand-based virtual screening of the actives and decoys provided
by the DUD-E dataset. Based on the Euclidean distance (Eq. 2) to the LIQUID vector of the
crystallized ligand, the DUD-E decoys and active molecules were ranked and the area under
the ROC curve was calculated. In the second and third settings the reference vector was
derived from the receptor: the second followed the method described by Löwer et al. [300],
and the third included atom position inaccuracies based on the crystallographic
temperature-factor as described in chapter 3.1. One part of the evaluation was the
parameter optimization of the cluster radii applied for the feature clustering during the
descriptor calculation. Starting from a reference vector containing a reference value for
each of the four cluster radii (Lipophilic: 1.4 Å, HBD: 2.2 Å, HBA: 2.6 Å, Aromatic: 1.6 Å),
only one radius was changed at a time. Forty values were tested for each radius (0.1 Å –
4.0 Å in 0.1 Å intervals) and the best performing value (highest ROC AUC value) was
determined. In the final step a vector containing all individual best performing radii was
created and evaluated.
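The evaluation protocol combines two simple pieces: a ROC AUC computed from the distances of actives and decoys to the reference vector, and a one-radius-at-a-time scan over forty values. The sketch below illustrates both. The AUC is computed in its rank form (the probability that a random active is closer to the query than a random decoy), and `evaluate` is a hypothetical user-supplied function that would recompute the descriptor for a given set of radii and return the AUC; it is not part of the thesis software.

```python
from typing import Callable, Dict, List, Sequence


def roc_auc(active_distances: Sequence[float], decoy_distances: Sequence[float]) -> float:
    """AUC as the fraction of (active, decoy) pairs ranked correctly; ties count half."""
    if not active_distances or not decoy_distances:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for active in active_distances:
        for decoy in decoy_distances:
            if active < decoy:
                wins += 1.0
            elif active == decoy:
                wins += 0.5
    return wins / (len(active_distances) * len(decoy_distances))


# Forty candidate radii: 0.1 Å to 4.0 Å in 0.1 Å steps.
RADIUS_VALUES: List[float] = [round(0.1 * i, 1) for i in range(1, 41)]

# Reference radii from the text (L: lipophilic, D: HBD, A: HBA, R: aromatic).
REFERENCE_RADII: Dict[str, float] = {"L": 1.4, "D": 2.2, "A": 2.6, "R": 1.6}


def scan_radii(evaluate: Callable[[Dict[str, float]], float],
               reference: Dict[str, float] = REFERENCE_RADII,
               values: Sequence[float] = RADIUS_VALUES) -> Dict[str, float]:
    """Vary one radius at a time and return the combination of individual optima."""
    combined = dict(reference)
    for name in reference:
        scores = {}
        for value in values:
            trial = dict(reference)  # all other radii stay at the reference
            trial[name] = value
            scores[value] = evaluate(trial)
        combined[name] = max(scores, key=scores.get)
    return combined
```

Because each radius is optimized while the others stay at their reference values, the combined vector is not guaranteed to outperform the individual optima, which is consistent with "Combi" values in Table 5 that fall outside the Min to Max range of the individual scans.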

Table 5. Retrospective VS performance on the DUD-E dataset. For each of the 102 targets the LIQUID cluster

radii were optimized. Several ROC AUC values are presented (Combi: Value for the combinations of best

performing individual optimized radii; Min: Lowest AUC value found during the optimization; Max: Highest

value found during the individual optimization).

DUD-E ID | Family | Crystal ligand (Combi, Min, Max) | Virtual ligand (Combi, Min, Max) | Interactive virtual ligand (Combi, Min, Max)

aa2ar GPCR 0.59 0.43 0.54 0.61 0.37 0.53 0.69 0.36 0.55

abl1 Kinase 0.40 0.26 0.37 0.41 0.24 0.37 0.42 0.30 0.37

ace Protease 0.61 0.34 0.51 0.68 0.59 0.64 0.57 0.61 0.65

aces Other Enzymes 0.50 0.35 0.50 0.70 0.50 0.69 0.50 0.34 0.44

ada Other Enzymes 0.30 0.15 0.27 0.48 0.19 0.45 0.66 0.40 0.58

ada17 Protease 0.54 0.38 0.48 0.63 0.45 0.53 0.50 0.42 0.51

adrb1 GPCR 0.46 0.23 0.38 0.60 0.40 0.53 0.43 0.32 0.43

adrb2 GPCR 0.63 0.31 0.50 0.60 0.37 0.52 0.42 0.34 0.41

akt1 Kinase 0.55 0.19 0.45 0.31 0.15 0.22 0.62 0.21 0.39

akt2 Kinase 0.40 0.17 0.33 0.29 0.13 0.23 0.48 0.14 0.31

aldr Other Enzymes 0.53 0.31 0.50 0.64 0.46 0.59 0.60 0.60 0.68

ampc Other Enzymes 0.54 0.49 0.56 0.73 0.55 0.69 0.80 0.51 0.71

andr Nuclear Receptor 0.36 0.22 0.34 0.64 0.38 0.61 0.51 0.29 0.44

aofb Other Enzymes 0.57 0.51 0.57 0.64 0.40 0.52 0.68 0.56 0.67

bace1 Protease 0.41 0.31 0.41 0.62 0.47 0.60 0.59 0.58 0.65

braf Kinase 0.27 0.17 0.24 0.31 0.12 0.31 0.52 0.34 0.45

cah2 Other Enzymes 0.53 0.41 0.50 0.75 0.58 0.71 0.73 0.50 0.64

casp3 Protease 0.48 0.19 0.35 0.57 0.32 0.48 0.50 0.26 0.47

cdk2 Kinase 0.38 0.27 0.36 0.46 0.31 0.38 0.45 0.32 0.41

comt Other Enzymes 0.38 0.11 0.34 0.72 0.17 0.75 0.87 0.31 0.72

cp2c9 Cytochrome P450 0.53 0.37 0.44 0.47 0.33 0.44 0.47 0.37 0.48

cp3a4 Cytochrome P450 0.50 0.39 0.47 0.45 0.36 0.41 0.48 0.38 0.45

csf1r Kinase 0.32 0.21 0.27 0.43 0.29 0.39 0.54 0.39 0.45

cxcr4 GPCR 0.63 0.45 0.61 0.68 0.40 0.61 0.77 0.45 0.60

def Other Enzymes 0.39 0.27 0.35 0.43 0.30 0.42 0.42 0.37 0.48

dhi1 Other Enzymes 0.45 0.41 0.45 0.51 0.41 0.48 0.47 0.37 0.43

dpp4 Protease 0.63 0.48 0.57 0.64 0.46 0.56 0.47 0.39 0.48

drd3 GPCR 0.70 0.56 0.70 0.62 0.45 0.59 0.71 0.29 0.56

dyr Other Enzymes 0.40 0.22 0.36 0.63 0.31 0.47 0.52 0.43 0.51

egfr Kinase 0.39 0.24 0.31 0.48 0.31 0.41 0.39 0.35 0.38

esr1 Nuclear Receptor 0.36 0.12 0.35 0.39 0.27 0.36 0.66 0.36 0.59

esr2 Nuclear Receptor 0.14 0.06 0.13 0.71 0.41 0.63 0.77 0.45 0.66

fa10 Protease 0.45 0.26 0.38 0.42 0.21 0.35 0.42 0.34 0.45

fa7 Protease 0.36 0.17 0.28 0.43 0.19 0.34 0.56 0.35 0.50

fabp4 Miscellaneous 0.46 0.33 0.43 0.64 0.46 0.58 0.67 0.65 0.72

fak1 Kinase 0.26 0.10 0.26 0.57 0.37 0.53 0.64 0.52 0.61

fgfr1 Kinase 0.42 0.17 0.36 0.49 0.19 0.38 0.63 0.36 0.47

fkb1a Other Enzymes 0.69 0.53 0.69 0.82 0.68 0.81 0.34 0.41 0.55

fnta Other Enzymes 0.49 0.43 0.48 0.41 0.25 0.34 0.57 0.25 0.47

fpps Other Enzymes 0.59 0.23 0.38 0.81 0.63 0.78 0.63 0.43 0.62

gcr Nuclear Receptor 0.54 0.27 0.50 0.45 0.33 0.43 0.58 0.39 0.54

glcm Other Enzymes 0.29 0.17 0.23 0.78 0.44 0.72 0.45 0.29 0.47

gria2 Ion Channel 0.52 0.30 0.47 0.57 0.33 0.50 0.63 0.53 0.59

grik1 Ion Channel 0.62 0.28 0.50 0.69 0.46 0.65 0.75 0.59 0.71

hdac2 Other Enzymes 0.61 0.48 0.63 0.68 0.44 0.53 0.69 0.46 0.72

hdac8 Other Enzymes 0.62 0.28 0.58 0.70 0.37 0.57 0.63 0.39 0.67

hivint Other Enzymes 0.64 0.40 0.56 0.60 0.49 0.60 0.57 0.48 0.57

hivpr Protease 0.59 0.40 0.55 0.58 0.43 0.53 0.61 0.49 0.56

hivrt Other Enzymes 0.54 0.37 0.47 0.60 0.49 0.56 0.61 0.48 0.58

hmdh Other Enzymes 0.58 0.26 0.45 0.57 0.25 0.42 0.55 0.35 0.46

hs90a Miscellaneous 0.45 0.23 0.36 0.55 0.30 0.50 0.64 0.41 0.61

hxk4 Other Enzymes 0.46 0.29 0.39 0.53 0.39 0.51 0.60 0.49 0.59

igf1r Kinase 0.36 0.22 0.31 0.39 0.33 0.39 0.50 0.39 0.47

inha Other Enzymes 0.50 0.42 0.53 0.43 0.34 0.42 0.54 0.40 0.50

ital Miscellaneous 0.58 0.43 0.57 0.69 0.49 0.70 0.54 0.39 0.52

jak2 Kinase 0.44 0.38 0.44 0.53 0.33 0.52 0.61 0.44 0.58

kif11 Miscellaneous 0.49 0.38 0.49 0.55 0.46 0.54 0.65 0.58 0.62

kit Kinase 0.47 0.35 0.44 0.45 0.28 0.38 0.47 0.24 0.41

kith Other Enzymes 0.26 0.10 0.23 0.55 0.35 0.51 0.68 0.39 0.57

kpcb Kinase 0.45 0.20 0.27 0.53 0.36 0.52 0.59 0.42 0.55

lck Kinase 0.39 0.30 0.39 0.39 0.29 0.39 0.45 0.36 0.41

lkha4 Protease 0.50 0.38 0.46 0.44 0.31 0.42 0.56 0.30 0.46

mapk2 Kinase 0.45 0.20 0.40 0.55 0.39 0.47 0.51 0.32 0.51

mcr Nuclear Receptor 0.61 0.40 0.51 0.61 0.45 0.54 0.60 0.52 0.59

met Kinase 0.33 0.22 0.27 0.27 0.20 0.28 0.34 0.26 0.30

mk01 Kinase 0.27 0.12 0.17 0.45 0.15 0.35 0.47 0.27 0.42

mk10 Kinase 0.39 0.33 0.39 0.47 0.41 0.47 0.48 0.40 0.47

mk14 Kinase 0.42 0.38 0.42 0.44 0.38 0.46 0.51 0.46 0.50

mmp13 Protease 0.42 0.29 0.39 0.54 0.45 0.53 0.53 0.51 0.57

mp2k1 Kinase 0.42 0.19 0.36 0.53 0.31 0.49 0.56 0.38 0.51

nos1 Other Enzymes 0.60 0.53 0.61 0.55 0.43 0.53 0.60 0.45 0.52

nram Other Enzymes 0.64 0.47 0.60 0.78 0.57 0.72 0.51 0.36 0.52

pa2ga Other Enzymes 0.55 0.25 0.48 0.61 0.53 0.62 0.67 0.49 0.57

parp1 Other Enzymes 0.42 0.24 0.36 0.49 0.33 0.47 0.51 0.36 0.45

pde5a Other Enzymes 0.45 0.26 0.35 0.56 0.37 0.52 0.55 0.50 0.57

pgh1 Other Enzymes 0.64 0.41 0.62 0.50 0.38 0.50 0.67 0.53 0.66

pgh2 Other Enzymes 0.48 0.26 0.43 0.67 0.53 0.62 0.63 0.49 0.63

plk1 Kinase 0.53 0.34 0.49 0.48 0.23 0.42 0.41 0.23 0.37

pnph Other Enzymes 0.73 0.24 0.56 0.76 0.36 0.59 0.82 0.64 0.83

ppara Nuclear Receptor 0.32 0.19 0.29 0.39 0.20 0.36 0.22 0.23 0.27

ppard Nuclear Receptor 0.63 0.60 0.63 0.48 0.21 0.41 0.27 0.26 0.34

pparg Nuclear Receptor 0.42 0.29 0.41 0.41 0.20 0.31 0.35 0.30 0.37

prgr Nuclear Receptor 0.60 0.37 0.59 0.49 0.27 0.41 0.61 0.47 0.62

ptn1 Other Enzymes 0.66 0.44 0.54 0.69 0.27 0.47 0.48 0.42 0.60

pur2 Other Enzymes 0.03 0.00 0.01 0.56 0.06 0.37 0.39 0.13 0.29

pygm Other Enzymes 0.50 0.35 0.50 0.49 0.24 0.41 0.58 0.49 0.58

pyrd Other Enzymes 0.32 0.22 0.31 0.54 0.32 0.41 0.43 0.41 0.51

reni Protease 0.48 0.10 0.32 0.59 0.22 0.47 0.52 0.35 0.52

rock1 Kinase 0.59 0.43 0.55 0.53 0.31 0.45 0.56 0.42 0.52

rxra Nuclear Receptor 0.55 0.37 0.53 0.67 0.30 0.65 0.68 0.46 0.66

sahh Other Enzymes 0.19 0.08 0.16 0.67 0.22 0.50 0.90 0.46 0.80

src Kinase 0.45 0.26 0.39 0.43 0.30 0.44 0.46 0.34 0.41

tgfr1 Kinase 0.56 0.34 0.48 0.53 0.41 0.51 0.60 0.52 0.58

thb Nuclear Receptor 0.22 0.08 0.19 0.37 0.22 0.32 0.44 0.35 0.45

thrb Protease 0.59 0.41 0.55 0.51 0.25 0.45 0.47 0.30 0.37

try1 Protease 0.46 0.32 0.38 0.58 0.42 0.56 0.54 0.40 0.57

tryb1 Protease 0.48 0.36 0.46 0.73 0.61 0.69 0.43 0.33 0.49

tysy Other Enzymes 0.35 0.15 0.29 0.27 0.18 0.24 0.38 0.20 0.28

urok Protease 0.65 0.32 0.56 0.54 0.33 0.50 0.65 0.46 0.58

vgfr2 Kinase 0.53 0.36 0.45 0.39 0.21 0.34 0.43 0.34 0.38

wee1 Kinase 0.17 0.02 0.13 0.35 0.11 0.31 0.66 0.19 0.57

xiap Miscellaneous 0.53 0.21 0.43 0.76 0.49 0.72 0.66 0.34 0.71

Average AUC 0.47 0.29 0.42 0.55 0.35 0.49 0.55 0.39 0.52

Max AUC 0.73 0.60 0.70 0.82 0.68 0.81 0.90 0.65 0.83

The values for the worst and best performing vectors as well as the combined vectors
were calculated and are summarized in Table 5. The average values were close to the random
performance of 0.5, while the individual performances were spread over a broad range (0.2 –
0.9 for the “Interactive Virtual Ligand”).

The average area under the ROC curve increased by around 8 percentage points (average
combined AUC 0.47 versus 0.55; Table 5) when changing from ligand-based to receptor-based
pharmacophores. The peak values for the receptor-based methods were higher, and the new
method showed the overall best virtual screening results. The differences between the new
method and the originally presented one were rather small, as the automated calculations
added no additional information. The chosen cut-off values to reduce the amount of
lipophilic interactions are listed in Appendix III.

Figure 28. Boxplot analysis for the best performing individual cluster radii values for HBD, HBA, lipophilic
and aromatic potential pharmacophore points. [Boxplot not reproduced; y-axis: Cluster radius (Å).]

Further evaluations of the cluster radii applied for the LIQUID descriptor calculation were
conducted for the implementation integrating the atom position inaccuracies. The boxplots
in Figure 28 show that no radius value was preferred across all 102 targets. Both extreme
values (0.1 Å and 4.0 Å) were found for all feature types, while the medians were
2.2±0.1 Å. The radii for all individual best performing evaluations of each target are
shown in Table 6.

Table 6. Combination of all individual best performing cluster radii (Å) for the new interactive virtual

ligand version and all DUD-E targets. (L: lipophilic, D: donor, A: acceptor, R: aromatic)

Target Family L D A R Target Family L D A R

aa2ar GPCR 2.8 2.7 4.0 3.9 hxk4 Other Enzymes 3.0 3.7 0.2 2.1

abl1 Kinase 2.8 3.5 0.1 0.7 igf1r Kinase 0.5 2.1 0.2 0.4

ace Protease 1.6 3.6 0.4 2.1 inha Other Enzymes 1.6 0.3 2.8 2.0

aces Other Enzymes 2.6 2.3 0.1 2.2 ital Miscellaneous 3.0 3.0 2.5 4.0

ada Other Enzymes 1.6 0.1 0.2 0.6 jak2 Kinase 1.5 2.7 0.4 2.2

ada17 Protease 1.6 1.3 0.8 3.1 kif11 Miscellaneous 1.5 3.7 0.2 0.3

adrb1 GPCR 0.4 2.9 3.7 2.1 kit Kinase 1.5 2.8 3.1 2.8

adrb2 GPCR 1.7 2.9 3.7 2.1 kith Other Enzymes 0.8 2.3 3.6 4.0

akt1 Kinase 4.0 2.0 3.7 3.5 kpcb Kinase 3.0 3.2 0.5 1.3

akt2 Kinase 3.7 3.5 0.3 3.5 lck Kinase 1.3 0.5 3.1 0.1

aldr Other Enzymes 4.0 2.3 0.1 2.1 lkha4 Protease 1.6 1.9 3.0 4.0

ampc Other Enzymes 2.5 0.4 0.3 2.2 mapk2 Kinase 3.8 3.3 0.1 2.2

andr Nuclear Receptor 3.6 3.6 3.9 2.1 mcr Nuclear Receptor 2.6 3.3 0.2 3.8

aofb Other Enzymes 2.6 3.3 0.5 4.0 met Kinase 1.5 0.1 2.9 4.0

bace1 Protease 0.3 2.4 2.5 4.0 mk01 Kinase 2.6 3.4 3.6 2.1

braf Kinase 3.2 3.3 4.0 2.4 mk10 Kinase 0.3 2.3 2.3 2.1

cah2 Other Enzymes 1.7 0.3 0.4 3.9 mk14 Kinase 1.5 3.1 0.1 2.1

casp3 Protease 1.3 3.7 4.0 0.6 mmp13 Protease 1.7 1.7 2.3 3.1

cdk2 Kinase 2.8 3.3 0.5 0.9 mp2k1 Kinase 3.8 2.9 0.1 2.1

comt Other Enzymes 3.3 2.8 0.3 0.5 nos1 Other Enzymes 1.6 0.5 0.3 2.3

cp2c9 Cytochrome P450 4.0 3.3 0.2 4.0 nram Other Enzymes 0.4 2.3 0.2 4.0

cp3a4 Cytochrome P450 2.5 3.3 3.9 4.0 pa2ga Other Enzymes 1.6 0.1 2.8 1.4

csf1r Kinase 3.2 3.0 1.5 0.8 parp1 Other Enzymes 3.3 2.7 2.5 2.0

cxcr4 GPCR 0.5 2.3 0.4 2.8 pde5a Other Enzymes 1.6 1.3 0.6 0.8

def Other Enzymes 0.6 2.1 3.6 1.7 pgh1 Other Enzymes 3.7 0.4 0.5 4.0

dhi1 Other Enzymes 1.6 1.9 2.4 0.3 pgh2 Other Enzymes 4.0 2.1 0.1 1.2

dpp4 Protease 2.5 0.5 3.7 2.4 plk1 Kinase 1.3 0.1 3.0 2.2

drd3 GPCR 0.2 2.8 0.2 0.5 pnph Other Enzymes 0.4 4.0 2.5 0.7

dyr Other Enzymes 3.8 2.0 2.5 4.0 ppara Nuclear Receptor 1.7 0.9 4.0 4.0

egfr Kinase 2.7 0.7 0.5 2.0 ppard Nuclear Receptor 3.2 0.2 3.7 4.0

esr1 Nuclear Receptor 4.0 0.5 3.0 2.1 pparg Nuclear Receptor 0.1 0.2 3.3 4.0

esr2 Nuclear Receptor 4.0 3.0 2.9 1.8 prgr Nuclear Receptor 3.6 3.0 4.0 2.2

fa10 Protease 1.8 1.3 1.5 2.1 ptn1 Other Enzymes 2.6 4.0 3.7 2.1

fa7 Protease 1.7 2.4 1.5 3.6 pur2 Other Enzymes 2.8 2.3 2.3 4.0

fabp4 Miscellaneous 1.3 0.3 2.7 2.0 pygm Other Enzymes 1.3 3.5 0.4 2.4

fak1 Kinase 0.2 2.1 0.2 2.1 pyrd Other Enzymes 2.7 2.4 0.4 2.1

fgfr1 Kinase 1.0 0.7 3.1 2.1 reni Protease 0.7 3.7 3.8 2.2

fkb1a Other Enzymes 1.7 0.1 0.4 3.7 rock1 Kinase 0.2 2.9 4.0 2.4

fnta Other Enzymes 4.0 3.7 0.1 1.3 rxra Nuclear Receptor 3.7 0.5 0.2 4.0


fpps Other Enzymes 2.9 4.0 2.5 4.0 sahh Other Enzymes 0.2 0.5 4.0 2.0

gcr Nuclear Receptor 3.6 0.7 0.1 2.2 src Kinase 3.4 0.3 3.4 1.9

glcm Other Enzymes 1.6 1.6 2.7 2.3 tgfr1 Kinase 3.8 1.1 2.0 2.1

gria2 Ion Channel 1.2 0.5 0.5 2.2 thb Nuclear Receptor 1.7 3.3 3.8 4.0

grik1 Ion Channel 3.0 0.5 0.2 2.2 thrb Protease 1.9 0.6 4.0 3.5

hdac2 Other Enzymes 4.0 2.8 0.7 2.2 try1 Protease 2.1 2.8 1.1 2.2

hdac8 Other Enzymes 1.7 0.4 0.2 4.0 tryb1 Protease 1.7 0.4 0.3 2.2

hivint Other Enzymes 3.6 2.6 2.8 4.0 tysy Other Enzymes 3.1 3.0 4.0 2.2

hivpr Protease 4.0 3.7 0.1 0.8 urok Protease 2.6 0.4 0.4 2.2

hivrt Other Enzymes 0.2 2.4 3.3 2.1 vgfr2 Kinase 1.5 2.5 0.4 2.1

hmdh Other Enzymes 4.0 3.6 4.0 2.2 wee1 Kinase 2.9 0.2 0.1 2.1

hs90a Miscellaneous 2.7 0.7 2.5 4.0 xiap Miscellaneous 2.5 0.5 3.5 1.3

Additional analyses were performed on the protein family level to reveal potential

coherencies between protein family and the set of cluster radii or the protein family and

the pocket pharmacophore descriptor vector. In Figure 29 trees for the crystal ligand (A

and B) and the new method (C and D) are presented. For A and C the Manhattan distances

between the cluster radii vectors were calculated, while for B and D the Euclidean distance

between the resulting pharmacophore correlation vectors for each target was

determined. Small clusters with members of the same protein family, e.g. kinases, can be

found, while the classification of complete individual families was unfeasible. The cluster

algorithm minimized intra cluster distances, so that co-clustered entries share common

features with the complete cluster rather than resemble a single cluster member.
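
The distance calculation underlying the trees can be sketched as follows. The example computes Manhattan and Euclidean distances between cluster radii vectors (L, D, A, R) for four targets copied from Table 6; the helper names are hypothetical, and the hierarchical clustering step itself is not reproduced here because the linkage settings are not stated in this section.

```python
import math
from itertools import combinations

# Cluster radii vectors (L, D, A, R) in Å, copied from Table 6.
radii = {
    "aa2ar": (2.8, 2.7, 4.0, 3.9),
    "abl1": (2.8, 3.5, 0.1, 0.7),
    "adrb1": (0.4, 2.9, 3.7, 2.1),
    "adrb2": (1.7, 2.9, 3.7, 2.1),
}


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(vectors, metric):
    """Return {(name1, name2): distance} for all unordered pairs."""
    return {
        (p, q): metric(vectors[p], vectors[q])
        for p, q in combinations(sorted(vectors), 2)
    }


manhattan_distances = pairwise(radii, manhattan)
closest_pair = min(manhattan_distances, key=manhattan_distances.get)
print(closest_pair, round(manhattan_distances[closest_pair], 2))
```

In this small subset the two adrenergic receptors end up closest to each other, but, as stated above, such local agreement does not extend to complete protein families.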

The retrospective analysis showed diverse results. The applied virtual screenings based

on single reference structures with the DUD-E provided ligand conformations were in

combination with the workflow insufficient to obtain high AUC values for all targets. It

was shown that the cluster radii had a measurable influence on the screening results

(Table 5), while no set of radii was found to work best on all tested targets. The clustering

of the target families based on the binding site and the known crystal ligand showed

pharmacophoric diversity of the individual families according to the chosen descriptor.


Figure 29. Cluster dendrograms of all 102 DUD-E targets, leaves are labeled according to their protein

family (Protease, Kinase, GPCR: G protein-coupled receptor, OE: Other enzyme, IC: Ion channel, M:

Miscellaneous, P450: Cytochrome P450). The trees are based on different data: A) Manhattan distance

matrix of the cluster radii vectors for the crystal ligands. B) Euclidean distance matrix of the crystal ligand

correlation vectors. C) Manhattan distance matrix of the cluster radii vectors for the new receptor-based

pharmacophores. D) Euclidean distance matrix of the receptor derived correlation vectors.


3.3 Prospective studies

3.3.1 HIV-1 protease

The work on the HIV-1 protease had been started during my diploma thesis in 2011 [324]

and was followed up in this work. Initial findings of non-competitive inhibitors were

published in 2014 [233] and represented the starting point for the presented work. The

motivation of potential binding sites was performed on different levels. Fragment-based

crystallography revealed the binding of small ring systems on a cavity located on top of

the flap regions [324]. In our initial study we focused on the elbow region of the protease,

which undergoes conformational changes during the flap movements. A VS based on a

receptor-based pharmacophore model resulted in the finding of small-molecule HIV-1 protease

inhibitors, and a non-competitive binding mode was shown experimentally.

The basis for the previous and the present work was a model of the HIV-1 protease

subtype B in apo-form with closed flaps generated for the initial study [233]. MD-

simulations were conducted to detect changes in the pockets during the opening of the

flaps and the presented workflow was applied to search for allosteric modulators. The MD

simulations were intended to give insights into protein flexibility and sample as much of

the motivated conformational space as possible. The continuous simulations were cut into

pieces, so-called snapshots of the current state of the simulation.

MDpocket was applied to detect binding pockets at the beforehand motivated locations,

while a ligandability prediction tool [234] was used to predict potential ligand binding

regions based on the MD-simulation snapshots to further motivate the pockets.

Three pockets per monomer were selected and for the six pockets the pharmacophore

models were generated applying the presented workflow (Figure 30). In order to reduce

the number of lipophilic PPPs the cut-off values (Table 7) were chosen to remove

histogram bins with more than 100 members.


Figure 30. HIV-1 protease pocket overview. A) A HIV-1 protease MD snapshot is shown in cartoon

representation and colored by the monomers (blue and gray). The chosen pockets for virtual screening are

shown as line models (All six chosen median sized pockets are shown in the same snapshot and may show

slight clashes with the protein). The green ones represent the flap pockets (B, C), the red ones the pockets

close to the active center (D, E) and the brown ones the elbow pockets (F, G). For reference the main

pocket is colored in gray. B-G) Six pharmacophore models with the original cluster radii values are

presented. The two models on the left show the flap pockets, the middle ones the pockets close to the center and

the right ones the elbow pockets. The cartoons are colored corresponding to the top frame (A); the surfaces

are colored according to the atom types (C: white, N: blue, O: red). The pharmacophore models are colored

according to the feature types (HBD: blue, HBA: red, Lipophilic: green, Aromatic: yellow).

Table 7. List of chosen lipophilic cut-off values.

Pocket Lipophilic cut-off

Elbow 11

Elbow2 11

Flap 11

Flap2 10

Near 12

Near2 12

As the influence of the descriptor cluster radii parameter was shown during the

retrospective analysis in chapter 3.2, two different sets of cluster radii were chosen for

the virtual screenings. The first set has been originally proposed by Löwer et al. [300] and

has also been successfully applied in the initial study. The HIV-1 protease was part of the

DUD-E and therefore a motivated second set of radii has been retrospectively evaluated

and prospectively applied in this study. In total twelve virtual screenings were performed

and the top 1000 compounds fitting the according pharmacophore model best were kept

for each model.

For each VS run a two-dimensional Pareto front optimization was performed with KNIME

to find ligands that fit into the pocket and have a favorable pharmacophoric distance to

the receptor-based query. One of these optimizations is presented in Figure 31. A

combined list containing all Pareto front molecules was created and visually inspected, and a

subset of cherry-picked molecules was chosen for testing (Table 8).

The Pareto Front calculations were applied to reduce the number of screening hits in a

motivated manner. The high number of conducted virtual screenings resulted in ranked

lists of the VS database. The best fitting molecules according to the descriptor should be

tested, while the chosen descriptor was size independent and promising compounds may

not fit into the pocket. Furthermore, the detection of fragment-like molecules would be

beneficial for hit-to-lead optimization processes and therefore the second optimization

criterion was chosen. The Pareto optimization led to a set of compounds with favorable

values for the chosen properties.
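
A two-objective Pareto front selection of this kind can be sketched in a few lines. The sketch assumes that both objectives, the pharmacophoric (LIQUID) distance and the ligand volume, are to be minimized; the function names and the toy values are hypothetical, and the actual selection in this work was carried out with KNIME rather than with this code.

```python
def dominates(p, q):
    """True if p is at least as good as q in every objective
    and strictly better in at least one (all objectives minimized)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(points):
    """Return all points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Toy data (invented): (LIQUID distance, ligand volume in Å3) per compound.
compounds = [
    (1.0, 400.0),
    (1.2, 300.0),
    (1.5, 320.0),  # dominated by (1.2, 300.0)
    (1.8, 200.0),
    (2.5, 210.0),  # dominated by (1.8, 200.0)
    (2.9, 150.0),
]

front = pareto_front(compounds)
print(front)
```

The quadratic pairwise comparison is sufficient for ranked lists of about 1000 compounds per screening; the front then serves as the shortlist for visual inspection.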

Figure 31. Pareto front for VS screening results. The top 1000 compounds for one VS are plotted (black

circles). The ligand volume is presented on the y-axis against the LIQUID distance on the x-axis. The Pareto

optimization searches for non-dominated solutions, i.e. compounds for which no other compound is better in both properties.

The chosen molecules are colored in red. The blue line indicates the respective binding pocket volume.

[Figure 31 plot: "Pareto front HIV-1 protease"; x-axis: LIQUID distance; y-axis: Ligand volume (Å3)]


Table 8. Ordered virtual screening compounds for the HIV-1 protease. Additional information on the

cluster radii set, pharmacophore distance to the query and the ligand volume are provided.

Molecule Pocket Cluster radii set PPP distance Volume (Å3)

Flap DUD-E 1.47 442

Flap DUD-E 1.51 387

Near DUD-E 0.86 185

Flap2 Normal 2.07 220

Near DUD-E 1.31 171

Near2 Normal 1.05 289

Flap DUD-E 1.77 299

Flap Normal 1.34 361


Flap Normal 1.40 318

Flap Normal 1.46 354

Near DUD-E 0.78 299

Near DUD-E 0.81 308

Near Normal 0.92 353

Flap2 DUD-E 1.61 257

Elbow DUD-E 1.58 172

Flap DUD-E 1.82 264

Near DUD-E 1.14 178

Near2 Normal 1.38 257

Flap DUD-E 1.62 294


Near2 DUD-E 1.08 250

Flap2 DUD-E 2.04 219

Elbow DUD-E 1.28 301

Flap DUD-E 1.59 345

Flap2 Normal 1.52 302

Flap Normal 1.47 332

Elbow DUD-E 1.43 245

Near Normal 1.02 292

Elbow2 DUD-E 0.80 294

Near Normal 1.65 156


In the first round of assays all compounds were tested in a single dose duplicate at 100

µM against the HIV-1 protease. Out of the 29 compounds four showed auto-fluorescence,

and this interference with the assay led to readings outside the measurement range of the

assay. Two compounds showed a reduction of the protein activity by 31% and 61%,

respectively (Figure 32).

Figure 32. Two compounds inhibiting the HIV-1 protease activity at 100 µM concentration

For compound 2 an IC50 of 80 µM was measured in duplicate and the curves are

presented in Figure 33. The calculated ligand efficiency [359] was 0.29.
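
How a ligand efficiency value relates to an IC50 can be sketched as follows. This is an illustration under stated assumptions, not the exact calculation used here: it applies the common definition LE = -RT ln(IC50) / N with IC50 used in place of a dissociation constant, a temperature of 298.15 K, and N as the number of heavy atoms. The definition in [359] may differ in detail, and the heavy atom count of 20 is a hypothetical value because the count for compound 2 is not stated in this section.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def ligand_efficiency(ic50_molar, heavy_atoms, temperature=298.15):
    """Approximate ligand efficiency in kcal/mol per heavy atom.

    Assumes IC50 can stand in for the dissociation constant, so that
    the binding free energy is estimated as RT * ln(IC50).
    """
    if ic50_molar <= 0 or heavy_atoms <= 0:
        raise ValueError("IC50 and heavy atom count must be positive")
    delta_g = R_KCAL_PER_MOL_K * temperature * math.log(ic50_molar)
    return -delta_g / heavy_atoms


# IC50 of 80 µM from the text; 20 heavy atoms is a hypothetical count.
le = ligand_efficiency(80e-6, heavy_atoms=20)
print(f"LE = {le:.2f} kcal/mol per heavy atom")
```

Because the value scales inversely with the number of heavy atoms, small fragment-like hits can reach reasonable efficiencies despite weak absolute potency.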

Figure 33. IC50 measurements for compound 2 and the reference compound pepstatin A.

[Figure 33 plot: x-axis: pConcentration; y-axis: Percentage inhibition [%]; curves: compound 2, pepstatin A]

[Figure 32 labels: (1) 31% inhibition; (2) 61% inhibition]


3.3.2 A. thaliana IspD

The IspD study presented in this thesis was conducted as a member of a consortium of

research groups and industrial collaborators working on the different aspects of the non-

mevalonate pathway. The discovery of an allosteric binding site on AtIspD as a potential

target for herbicides, together with additional HTS data, was pursued.

The available structural data on AtIspD in the PDB was very limited. Only eight crystal

structures were retrieved, seven of which were solved by the consortium we

joined during this work. There was no complexed crystal structure available showing a

main site inhibitor. In order to search for new AtIspD inhibitors two different approaches

were conducted. The first one was based on available X-ray data. Receptor-based

pharmacophore models were generated for the known allosteric binding pocket. One

model was built taking into account the atom position inaccuracies given by the

crystallographically determined temperature factor, while the other one did not (Figure 34).

[Figure 34 labels: overlap top 1000: 378 molecules; overlap Pareto front: 1 molecule]

Figure 34. Impact of the crystallographic temperature-factor on the receptor-based pharmacophore

screening. The left pharmacophore model was derived without considering any atom position inaccuracies,

while the right one did include the crystallographic temperature-factor. The overlap in retrieved molecules

within the 1000 most similar screening molecules was 37.8%. One molecule occurred in both Pareto Front

optimizations, but was dismissed during the cherry picking process.


The second approach was based on MD-simulations. Two 25 ns MD-simulations with

different minimized conformations of the missing residues were calculated and 5000

equidistant (in time) snapshots were taken. Ligandable protein patches were defined by

averaging the Local Roughness Index (LoRI) [234] over all snapshots. LoRI was built on

the observation that surface roughness correlates with biological function. The local

roughness or irregularities on the protein surface were calculated applying the concept

of fractal dimensions (FD). These values were averaged over all snapshots and the protein

depicted in Figure 35 was colored accordingly (warm colors: high FD, colder colors: low

FD) [234]. Besides the substrate binding site and the known allosteric site a third patch at

the backside of the orthosteric binding pocket was revealed (Figure 35, Figure 36).

Figure 35. LoRI calculations averaged over all IspD MD-simulation snapshots. The IspD dimer is presented

from both sides and the surface is colored according to the LoRI calculations. Warmer colors indicate

increased probability of ligandability, while colder colors indicate lower probability. In the left picture the

substrate binding sites, as well as outer parts of the allosteric binding pockets were detected by the

algorithm and highlighted by the black frame. The red frame indicates an additional warm colored region,

the backside pocket. The picture on the right hand side shows the prediction for the opposite side, where the

active as well as the backside pocket were not present in this conformation.


Figure 36. The backside pocket of AtIspD. Pocket detection with MDpocket is shown on the right (IspD

represented as gray cartoon and the pocket as blue dots) and the calculated pharmacophore model on the

left (HBA: red, HBD: blue, Lipophilic: green).

The MDpocket algorithm [232] was applied to detect potential transient binding pockets

around the backside residues and to show the different conformations of the known

allosteric pocket. The algorithm was able to detect a backside pocket in 70% (67% for the

second monomer) of the MD snapshots and the pocket sizes are depicted in Figure 37.

The snapshot closest to the median sized pocket was chosen for the pharmacophore

modeling for both backside and the allosteric pockets. Local pocket residue alignments of

all snapshots within a ten percent difference in pocket size to the chosen snapshots were

performed to estimate the atom position movements later on considered in the workflow.
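
The snapshot selection described above can be sketched as follows, assuming a list of pocket volumes with one entry per snapshot in which the pocket was detected. The function name `select_snapshots`, the tolerance argument and the toy volumes are hypothetical; the pocket volumes in this work were obtained with MDpocket.

```python
import statistics


def select_snapshots(pocket_sizes, tolerance=0.10):
    """Pick the snapshot closest to the median pocket size and all other
    snapshots whose pocket size differs by at most `tolerance` (as a
    fraction) from that of the chosen snapshot.

    Returns (reference_index, list_of_neighbour_indices).
    """
    median = statistics.median(pocket_sizes)
    reference = min(
        range(len(pocket_sizes)), key=lambda i: abs(pocket_sizes[i] - median)
    )
    ref_size = pocket_sizes[reference]
    neighbours = [
        i
        for i, size in enumerate(pocket_sizes)
        if i != reference and abs(size - ref_size) <= tolerance * ref_size
    ]
    return reference, neighbours


# Toy pocket volumes in Å3 (invented), one value per snapshot.
sizes = [120.0, 310.0, 295.0, 280.0, 450.0, 335.0, 300.0]
reference, neighbours = select_snapshots(sizes)
print(reference, neighbours)
```

The neighbouring snapshots would then be locally aligned onto the reference snapshot to estimate the atom position movements.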

For all five systems (two crystal-based, three MD-based) a VL screening was performed

with the introduced workflow. The grid spacing was changed for all runs, and the “step”

parameter was set to three for the four runs including atom position inaccuracies. For the

local alignments a 5 Å align radius around the pocket was chosen and the lipophilic cut-

off values are listed in Table 9.


Figure 37. Boxplots of the pocket sizes for the newly identified AtIspD backside pockets.

Table 9. Lipophilic cut-off values for the IspD virtual screenings.

Pocket Lipophilic cut-off

Crystal allosteric 11

Crystal allosteric (+temp-factor) 11

MD-Allosteric 13

MD-Back1 11

MD-Back2 11

The procedure described in chapter 3.3.1 was repeated to prioritize the VS hits based on a

Pareto front optimization focusing on small molecules with low pharmacophoric distance

to the pocket. The cherry-picked compounds were tested experimentally (Table 10).

[Figure 37 plot: x-axis: Pocket (backside_1, backside_2); y-axis: Pocket size (Å3)]


Table 10. Ordered VS hits for AtIspD.

Molecule Pocket PPP distance Volume Å3

C-nT 1.55 354

C-T 1.62 357

C-T 1.75 331

C-T 1.36 558

MD-Allo 1.73 239

C-T 1.64 338


C-T 1.57 376

C-nT 1.73 304

C-T 1.81 323

C-T 1.83 320

C-T 1.85 316

C-T 1.78 328

MD-Back1 1.87 224

MD-Back1 1.67 254

MD-Back2 1.37 213

MD-Back1 1.75 230


MD-Back2 0.99 269

MD-Back2 1.37 220

MD-Allo 1.65 281

MD-Allo 1.68 279

MD-Allo 1.80 224

MD-Allo 1.53 317

MD-Allo 1.55 298

C-nT 1.72 316

C-T 1.80 324

C-T 1.92 141


MD-Allo 1.64 296

MD-Allo 1.44 350

MD-Back2 1.35 223

C-T stands for crystal structure allosteric pocket with temperature factor; C-nT for no temperature factor


4 Discussion

Studying the methods in more detail revealed the same applied principles in most of the

receptor-based pharmacophore search tools. The aim was not to build up an additional

combination of algorithms based on the same data, but check whether it is possible to

influence the basic steps. The precision suggested by pharmacophore tools, contrasted with the

fact that the underlying protein models contain errors themselves, motivated

the development of the presented workflow. The workflow was based on the idea that we

cannot increase the general performance of receptor-based pharmacophore searches

without considering the model errors. For example, increasing the retrospective

performance of a tool by trying to detect more hydrogen bonds in optimal geometric

positions was questionable when the receptor model error was higher than the tolerance

values of the interactions. The conducted work tried to incorporate those errors or the

simulated flexibility of the atoms into pharmacophore research.

4.1 Receptor-based pharmacophore workflow

The presented workflow consists of individual modules, which will be discussed one by

one in this chapter. The user has to decide which algorithms to choose and consider how

they influence each other.

4.1.1 Protein preparation

The protein preparation for the workflow can be separated into multiple steps. The

completeness and correctness of the structure is not checked automatically, but is the task of

the user. The program is able to work with incomplete structures; missing residues will be

ignored during the process and no PPPs will be created for missing atoms / residues.

Important to note is the fact that missing atoms will reduce the number of exclusion

spheres and that the grid based pocket detection methods as well as the optional pocket

adjustments are able to generate pocket grid points at the free spaces of missing atoms.

The protonation state of the protein can be optionally calculated with the commercial

MOE software toolkit. The protonation is performed with the standard parameters

provided by the MOE software. The workflow is meant to be applicable to a broad range


of protein families and therefore family specific information has to be included by the user

(e.g. the pH-value for protonation).

4.1.2 The potential binding pocket

Pocket detection methods differ in their description of the pockets, as there is no unerring

description of a binding pocket [78]. The size and therefore the volume and shape

definitions result in various legitimate descriptions of the same protein cavity. How to handle

shallow regions in between two cavities is one of the questions the algorithms answer

differently. The main binding site is often the largest cavity on the protein [138], but

structure-based design is not restricted to the main site, not even to confirmed binding

sites. The motivation of the chosen pocket for the workflow lies outside the scope of this

work and there is no prediction on ligandability conducted on the fly. For the

retrospective analysis the pockets have been defined by the positions of the according

crystal ligands. A radius of 6 Å around the ligand has been applied to identify as many

receptor atoms involved in the ligand binding as possible, while minimizing the noise in

form of PPPs created by uninvolved atoms. For the prospective studies on HIV-1

protease and IspD the pockets have been motivated before feeding them into the

workflow. A mixture of known allosteric and uncharacterized potential binding sites has

been targeted in this work to present a variety of possible applications.
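
The ligand-centered pocket definition used for the retrospective analysis can be sketched as a simple distance filter. The sketch assumes that receptor and ligand atoms are given as Cartesian coordinates in Å and that a receptor atom belongs to the pocket if it lies within the cut-off of at least one ligand atom; the function name and the toy coordinates are hypothetical.

```python
import math


def atoms_near_ligand(receptor_atoms, ligand_atoms, cutoff=6.0):
    """Return the indices of receptor atoms within `cutoff` Å of any ligand atom."""
    return [
        i
        for i, r in enumerate(receptor_atoms)
        if any(math.dist(r, l) <= cutoff for l in ligand_atoms)
    ]


# Toy coordinates in Å (invented).
receptor = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 7.0, 0.0)]
ligand = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

pocket_atoms = atoms_near_ligand(receptor, ligand)
print(pocket_atoms)
```

A larger cut-off captures more interacting atoms but also adds atoms that only contribute noise PPPs, which is the trade-off described above. For full-size proteins a spatial index would be preferable to this all-pairs loop.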

In order to implement the newly described features, some changes on the pocket grid are

made during the workflow. In the initial step, the user is asked to shrink the investigated

pocket. This option can be beneficial in numerous situations. The targeted binding site

may be located within a channel and the pocket detection method is unable to focus on

the important part. In some cases the project focuses on a small subpocket, in other

projects subpockets are filled with co-factors and should not be considered for the

pharmacophore search. For the retrospective analysis the pocket was well defined by the

position of the crystal ligand and for the prospective studies the snapshots have been

chosen to have a pocket size most similar to the MD simulation median pocket size and

changing the pocket afterwards would have been contradictory to the selection criteria.

The pocket grid spacing can be changed in an automated way and without the

implementation of atom position flexibilities this is primarily a question of computation

costs. For structure-based VS campaigns a computing time of up to ten minutes is feasible,


as only very few receptor-based models will be generated. As soon as the pockets are not

only used as query, but also as database (e.g. all versus all pocket comparison of two proteins),

single calculations need to be faster while handling databases in PDB scale. When

introducing the atom position inaccuracy, the user relies on a low grid spacing in order

to be able to observe the influences of the modeled errors. The crystallographic

temperature-factor can be transformed into a radial measurement of the atom position

inaccuracies and can be applied for the grid changes as described in this work. Transferring this

concept to atom flexibility as observed during an MD-simulation is a simplification of the problem.

With additional computational effort the spatial coordinate distribution of each atom

could be calculated and probabilistic models could be created describing the atom

position. Potential pharmacophore points created by this atom would copy the

probabilistic model instead of the single radius value in the current approach. At the same

time the atom centered exclusion volume definitions could be replaced with simulation

derived models as well. A downside of this approach would be the difficult coupling to

geometric pharmacophore point definitions. Assuming that the center of the position

cloud would be chosen as new reference coordinate for the atom, the bond lengths and

angles between two atoms could be changed in a way that geometric rules for PPP

definition would fail to identify valid pharmacophore points. Additional problems are

caused by the values describing the inaccuracy themselves. The crystallographic temperature

factor is a measurement of the displacement of a scattering center in X-ray experiments,

but the atom position uncertainty cannot be completely described by Equation 14. The

resolution, completeness of the data and the model are some of the additional factors and

computationally complex models would be needed to incorporate all these contributions

[339]. Therefore only Equation 14 is applied during this work to estimate the receptor

atom uncertainties.
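
The transformation of a temperature factor into a radial displacement can be illustrated with a short sketch. Equation 14 is not reproduced in this section, so the sketch assumes the standard isotropic relation B = 8π²⟨u²⟩, where ⟨u²⟩ is the mean square displacement of the atom; the form actually used in Equation 14 may differ (for example by a constant factor). The function name is hypothetical.

```python
import math


def bfactor_to_displacement(b_factor):
    """Root mean square displacement in Å from an isotropic B-factor in Å2.

    Assumes B = 8 * pi^2 * <u^2>, i.e. u_rms = sqrt(B / (8 * pi^2)).
    """
    if b_factor < 0:
        raise ValueError("B-factor must be non-negative")
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


# Example B-factors in Å2 (invented) and the resulting displacement radii.
for b in (10.0, 20.0, 40.0):
    print(f"B = {b:5.1f} Å2 -> u_rms = {bfactor_to_displacement(b):.2f} Å")
```

Because the radius grows only with the square root of B, doubling the temperature factor enlarges the displacement by about 41 %. As noted above, this estimate ignores resolution, data completeness and model quality.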

The MD-simulations are performed to generate a conformational ensemble and therefore

reveal insights into the protein flexibility. The simulations do not necessarily show

biologically relevant or experimentally observable conformations [361]. Furthermore the

complete sampling of the conformational space for protein sized molecules is

computationally expensive [361]. As we were interested in motivated conformations

around the reference structure, multiple MD-simulations were performed starting from

the reference structure. The procedure allows running of multiple simulations in parallel

and also increased the chance to sample the conformational space around the reference


conformation in more detail. In this work we omitted the conformations at the extreme

values according to pocket size and focused on a ten percent breathing of the pocket

around the median observed size. This was done to avoid sparsely populated regions in

the conformational space that may be populated due to computational artifacts.

The inhomogeneous grid, with a higher grid point density close to the receptor, cut down the computational costs. With an increased number of grid points at certain positions, however, the probability of finding PPPs there increases, which could distort the feature density calculations applied in many algorithms.

4.1.3 Potential pharmacophore point calculation

In this work PPPs were calculated for receptors as well as for small molecules. Even though the potential ligands considered in this work were small molecules, their chemical diversity is less restricted than that of the receptor-based pharmacophores, which rely on the natural amino acids. Many nuances have been discussed and dozens of exceptions have been postulated to describe the pharmacophore of a small molecule [35]. In this work we focused on easily understandable rules. The FDef format provided by the RDKit, however, allows the user to add or remove PPP definition rules to adapt the program to his or her needs.

For receptor-based PPPs, the reimplementation of the LUDI rules by Löwer et al. was updated to match the values proposed by Stahl. In the current version HBD, HBA, lipophilic and aromatic features are implemented. Further interaction types could be integrated in future work or imported from other feature definition tools. Examples of missing interactions are planar C–X⋯A halogen bond configurations (C: carbon, X: halogen, A: H-bond acceptor) for halogen atoms (σ-hole), salt bridges based on charges, or water-mediated H-bond bridges (water can act as HBD and HBA) [78]. It is not necessary to re-implement all the different tools; it would be sufficient to unify the feature definition description. Therefore a simple feature representation format was used in the workflow, in which merging of point-based definitions is straightforward. The definition of interaction hot spots remains one of the most difficult tasks in receptor-based pharmacophore searches, as the importance of the individual contributions differs from target to target. For well-known targets it is often possible to extract a set of key pharmacophores, whereas one goal of this work was to target transient and allosteric


pockets without any ligand information. In automated approaches the PPPs are ranked according to their size, to energy values calculated by MIFs or to any other criterion, and only the top-ranking PPPs are kept for the final model [335]. In the presented work the users are asked to define their criteria themselves. The first reduction step for lipophilic interactions is optional, as entering 0 as cut-off value keeps all of the lipophilic PPPs. Leaving all lipophilic points in place resulted in a lipophilic pharmacophore cluster including nearly all available pocket grid points. The lipophilic pharmacophore in those cases was more or less an estimate of the pocket shape. A hint at the number of final PPP clusters for each feature type can be found in the statistics file calculated for the ligand database pharmacophores. It lists the average number of PPPs as well as the average and maximum number for each interaction type per ligand. For the removal of PPP groups within PyMOL, the size of a group is not necessarily correlated with the number of positively evaluated geometric rules, but is also influenced by the inaccuracy value. The user has to decide whether highly conserved atom positions resulting in small groups are more important for ligand binding than the attempt to reduce the receptor movements by targeting flexible regions of the binding site. The color shades, which encode the frequency of occurrence, are intended to help identify regions with multiple potential interaction partners of the same feature type.
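
The optional reduction of lipophilic PPPs by a user-defined cut-off can be illustrated as follows. This is a minimal sketch, not the implementation of this work: the `PPP` record, its non-negative `score` field and the rule "keep lipophilic points with score >= cut-off" are assumptions chosen so that a cut-off of 0 keeps every lipophilic point, as described above.

```python
from typing import List, NamedTuple


class PPP(NamedTuple):
    """Hypothetical point record: coordinates, feature type and a per-point score."""
    x: float
    y: float
    z: float
    feature: str   # e.g. "HBD", "HBA", "lipophilic", "aromatic"
    score: float   # assumed non-negative value that the cut-off is applied to


def reduce_lipophilic(points: List[PPP], cutoff: float) -> List[PPP]:
    """Keep all non-lipophilic PPPs and those lipophilic PPPs with score >= cutoff."""
    return [p for p in points if p.feature != "lipophilic" or p.score >= cutoff]


if __name__ == "__main__":
    pocket = [
        PPP(0.0, 0.0, 0.0, "HBD", 0.0),
        PPP(1.0, 0.0, 0.0, "lipophilic", 1.0),
        PPP(2.0, 0.0, 0.0, "lipophilic", 4.0),
    ]
    print(len(reduce_lipophilic(pocket, 0.0)))  # cut-off 0: nothing is removed
    print(len(reduce_lipophilic(pocket, 2.0)))  # only the higher-scoring lipophilic point remains
```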

Figure 38. Undetected PPPs. The protein is sketched as a gray line; the pocket grid is represented by black squares. The HBD function of the protein is shown as a blue square and the inaccuracy value as a gray circle. Assume that the orange square is the only grid point fulfilling the geometric rule to generate acceptor features in the pocket, but that this grid point is invalid, so the rule is never evaluated at this position. As a result, all red squares, which would become HBA functions according to the transferred value, are never defined as HBA PPPs.


The presented solution of copying the atom position inaccuracy as a radius to all successfully validated pocket grid points allowed the use of the same rules as for normal pharmacophore modeling, but may have missed some of the theoretically acceptable PPPs (Figure 38).
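
The transfer of the inaccuracy value to validated grid points, and the reason why points can be missed, can be sketched as follows. This is a simplified illustration under stated assumptions: the geometric rule is reduced to a plain distance window with made-up limits (not the LUDI or Stahl values), and all names are hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float, float]


class DonorAtom(NamedTuple):
    position: Point
    inaccuracy: float  # radial position uncertainty in Å


def hba_points(grid: List[Point], valid: List[bool], donor: DonorAtom,
               d_min: float = 2.5, d_max: float = 3.5):
    """Return (grid_point, radius) pairs for valid grid points that pass the distance rule.

    Assumption: a distance window stands in for the full geometric rule set.
    """
    result = []
    for point, is_valid in zip(grid, valid):
        if not is_valid:
            continue  # invalid grid points are never evaluated, even if the rule would hold
        if d_min <= math.dist(point, donor.position) <= d_max:
            result.append((point, donor.inaccuracy))  # copy the atom inaccuracy as radius
    return result


if __name__ == "__main__":
    donor = DonorAtom((0.0, 0.0, 0.0), 0.5)
    grid = [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (6.0, 0.0, 0.0)]
    valid = [True, False, True]
    print(hba_points(grid, valid, donor))
```

The second grid point fulfils the distance rule but is skipped because it is invalid, so no radius is ever transferred from it.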

4.1.4 The pharmacophore descriptor

The descriptor selection was crucial for the pharmacophore search results. The workflow was constructed to provide as much data and freedom as possible to the user. For some purposes the exclusion volumes were valuable, while for others the inclusion of atom position inaccuracies could be beneficial. The presented concept generates a fuzzy representation of the receptor-based pharmacophores, motivated by the measurement errors of X-ray crystallography or generated artificially by MD-simulations. The LIQUID descriptor applied in the current version was built to describe ligand-based pharmacophores in a fuzzy manner as well. An open question is whether the abstraction level for ligands should also be increased, as shown for the receptors. In LIQUID the pharmacophores are fuzzy, and in the new workflow the PPP-generating atoms are fuzzy as well. For the ligands, single conformations were chosen to generate the pharmacophore model. As an alternative to generating individual pharmacophores for each ligand conformation, the fuzziness could already be included in a consensus pharmacophore describing the ligand conformation ensemble.

The embedding of alternative pharmacophore descriptors was kept as simple as possible. All output files of the PPP definition section and of the exclusion sphere calculation contain three-dimensional points annotated with the corresponding feature type. The ligand database pharmacophore models are represented in the same way, so that ligand- and receptor-based pharmacophore descriptors can be calculated in exactly the same fashion.
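
Because both receptor- and ligand-based models are lists of annotated three-dimensional points, any descriptor that only consumes such points can be applied to both. The following sketch uses a generic binned pair-distance count as a stand-in; it is explicitly not the LIQUID descriptor, and the tuple layout `(x, y, z, feature)`, the bin width and the number of bins are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pair_distance_descriptor(points, bin_width=1.0, n_bins=10):
    """Count feature-type pairs per distance bin for points given as (x, y, z, feature)."""
    counts = Counter()
    for (x1, y1, z1, f1), (x2, y2, z2, f2) in combinations(points, 2):
        distance = math.dist((x1, y1, z1), (x2, y2, z2))
        bin_index = int(distance // bin_width)
        if bin_index < n_bins:  # pairs beyond the last bin are ignored
            pair = tuple(sorted((f1, f2)))  # order-independent feature pair
            counts[(pair, bin_index)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical point set; the same call works for a receptor or a ligand model.
    model = [
        (0.0, 0.0, 0.0, "HBD"),
        (3.2, 0.0, 0.0, "HBA"),
        (0.0, 4.5, 0.0, "lipophilic"),
    ]
    for key, value in sorted(pair_distance_descriptor(model).items()):
        print(key, value)
```

Swapping in another descriptor then only requires a function with the same input, which is the design choice the point-based file format is meant to support.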


4.2 Retrospective analysis

One of the first steps in every project is the data collection. To validate the predictive power of a new method in pharmaceutical research, a retrospective test set is often applied: the program solves a task for which we already know the “correct” answer, and its performance is checked. In general the idea and its implementation sound rather easy, but the realization of cheminformatic benchmark sets shows several difficulties. One of the tasks is to define the “correct” answer. Discussions on VS rankings and early enrichment have already been presented in chapter 1.1.1. Secondly, the number of published negative, i.e. biologically inactive, tested molecules is rather small. The generation of meaningful test sets therefore becomes a very difficult task. To test the VS ranking capabilities of a method there is a need for negative examples (also referred to as decoys). The generation of decoy sets has recently been discussed [315] and raises the following questions:

Are these molecules really inactive?

Is the test set diverse enough?

Which property is important for binding?

Generating a high-quality test set for a receptor requires a good understanding of the underlying receptor-ligand interactions. In other words, to produce a perfect test set we would already need the perfect classifier for active and inactive compounds, which is exactly what we are ultimately looking for. In reality, approximations are made to define a decoy set, taking the risk of mixing in false negatives (the molecule is active but is annotated as inactive) or false positives (the molecule is inactive even though it has been reported as active before). The more difficult a test set is meant to be (variations between actives and decoys are small and affect only few molecular properties), the more likely it is to contain errors. This also means that the “correct” answer cannot be known beforehand, and therefore checking high-ranking decoys for activity might be worthwhile instead of penalizing the tested algorithm.

The DUD-E database was generated as a benchmark database for three-dimensional docking tools. Among the papers citing the DUD-E paper, only a few presented results on the complete dataset; more often, results for specific targets were presented [336].


In many studies the protein was disregarded and a ligand-based virtual screening evaluation was performed based on the decoy and active molecule data. The three-dimensional query structure of the crystal ligand was present in the bioactive conformation. Even though a single three-dimensional structure was calculated for every active and decoy molecule, it cannot be assured that the bioactive conformation was found. In contrast to the presented retrospective analysis of the DUD-E database, other publications calculated additional conformations and detected influences on the performance [337]. In the publication of USRCAT [337], the lowest energy conformation of each molecule was taken out of up to twenty generated conformations. The ligand-based virtual screening was performed with each active molecule as query once, and the performance was averaged over all actives for the same target. The authors neither presented the performance with only the crystal ligands as queries nor showed the influence of taking the lowest energy conformation instead of the one provided by the DUD-E database. Therefore the results presented in that paper are not comparable to the retrospective analysis shown in this work, but they demonstrate the difficulties of current methods working on three-dimensional structures. Several 1D and 2D descriptors were tested for active ligand retrieval on the DUD-E dataset and showed significantly better results than the 3D LIQUID descriptor [337,346]. The database, however, was created based on 2D representations of the actives, and it is uncertain whether the third dimension is beneficial for distinguishing the two classes.

For many DUD-E targets the performance of the crystal ligand pharmacophore and of the receptor-based approach differed only slightly. For those cases it would be interesting to calculate additional conformations for the actives and decoys and to check whether the performance was mainly influenced by the conformations. While the ligand-based approach rarely outperformed the structure-based pharmacophore searches, the additional pocket information often increased the performance (Table 5).

The influence of the cluster radii on the AUC value can be seen in Table 6; the difference between the potentially best and the worst performing radii sets was often larger than the difference between the ligand- and structure-based approaches. The distribution of the radii values showed that there was no preferred set of values, and therefore no single solution could be justified as applicable to all targets. Additional evaluations with alternative ligand conformations and different pharmacophore descriptors are needed


to clarify whether the algorithm itself fails to distinguish active from decoy molecules or whether DUD-E evaluations require a conformation generator (the benchmark set was originally built for docking tools, where conformation generators are part of the software). Taken together, the DUD-E may not be the best choice for pure 3D pharmacophore searches, but the question of how to build a better test set revealed the main problems. First, assuming that only observed three-dimensional conformations were taken to build the set of active molecules, how would a meaningful set of decoys be created (e.g. lowest energy conformation or known conformation from a different receptor-ligand complex)? Secondly, the number of potential test set members would decrease considerably, as significantly less structural data than activity measurements are available.
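
The AUC values used to compare radii sets and approaches can be computed directly from the scores of actives and decoys. The following is a minimal sketch of the rank-based (Mann-Whitney) formulation; it assumes that a higher score means a better rank, and the function name and example scores are illustrative rather than taken from this work.

```python
def roc_auc(active_scores, decoy_scores):
    """Probability that a random active scores higher than a random decoy (ties count 0.5)."""
    if not active_scores or not decoy_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


if __name__ == "__main__":
    # Hypothetical similarity scores; for distance-based rankings, negate the distances first.
    print(roc_auc([0.9, 0.2], [0.5, 0.1]))
```

The pairwise loop is quadratic in the number of molecules, which is acceptable for a sketch but would be replaced by a sort-based computation for large decoy sets.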

4.3 Prospective analysis

4.3.1 HIV-1 protease

Looking at the complete HIV-1 protease project, several successes can be reported. The initial study demonstrated the applicability of state-of-the-art 3D pharmacophore methods to reveal non-competitive inhibitors of the HIV-1 protease. The concept of ligandability prediction was successfully applied and a ligandable binding site was confirmed by X-ray crystallography [234]. In the second stage a prolonged MD-simulation was performed in combination with a pocket detection algorithm specialized for this kind of simulation. Potential binding pockets were targeted in the confirmed potential binding region as well as at additional spots suggested by the pocket size and by known cavities described in the literature. Receptor-based pharmacophore models were generated and the receptor flexibility was considered as well. The focus was on receptor conformations around the median-sized pocket conformation. Selecting conformations by pocket size was only one possible solution and may be exchanged for structural alignments or approaches focusing on the maximal possible pocket size. The chosen way was a conservative approach that tried to avoid artifacts of the simulations or the pocket detection algorithm, with the aim of working within the applicability domain of the methods. The choice of the descriptor cluster radii was based on the successful VS campaigns of the initial study and on the DUD-E evaluation. Even though the


DUD-E covers only active site inhibitors, the nature of a protein may be conserved over the complete structure, so that the cluster radii profile may be beneficial for all HIV-1 protease cavities. The Pareto front calculations were applied to reduce the number of screening hits in a reasoned manner. The high number of virtual screenings conducted in vendor libraries and the available budget in academic research limited the number of possible activity tests. The best fitting molecules according to the descriptor should be tested, but the chosen descriptor was size independent and promising compounds might not fit into the pocket. Furthermore, the detection of fragment-like molecules would be beneficial for hit-to-lead optimization, and therefore the second optimization criterion was chosen.
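
A two-criterion Pareto front of screening hits can be sketched as follows. This is an illustration under assumptions: both criteria are minimized, the first is a descriptor distance to the query, and the second is taken to be a molecular size measure such as the heavy atom count (the second criterion is not named explicitly in this section); all names and values are hypothetical.

```python
def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate.

    Each candidate is (name, distance, size); both values are minimized.
    A candidate is dominated if another one is no worse in both criteria
    and strictly better in at least one.
    """
    front = []
    for name, dist, size in candidates:
        dominated = any(
            (d2 <= dist and s2 <= size) and (d2 < dist or s2 < size)
            for _, d2, s2 in candidates
        )
        if not dominated:
            front.append((name, dist, size))
    return front


if __name__ == "__main__":
    # Hypothetical hits: (identifier, descriptor distance, heavy atom count).
    hits = [("a", 0.1, 30), ("b", 0.2, 20), ("c", 0.3, 25), ("d", 0.1, 35)]
    print(pareto_front(hits))
```

Only the non-dominated hits are kept, which reduces the number of compounds to test without fixing a weighting between descriptor fit and size.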

4.3.2 IspD

The project on IspD was an excellent example to demonstrate the multiple ways of considering receptor flexibility in structure-based pharmacophore searches. Three different pharmacophore models were created for the known allosteric binding site. First, a classical virtual ligand was calculated applying the updated feature point definitions. In the second model the crystal structure temperature factor was used to describe the atom positions, and therefore the PPP positions, with a calculated fuzziness. The third model was computationally the most expensive and involved an MD-simulation to derive the receptor atom flexibilities, which were subsequently included in the workflow. The combination of MD-simulation, prediction of ligandable protein patches by LoRI and transient pocket detection with MDpocket revealed a transient pocket that is not present in the original crystal structure. The cut-off values for lipophilic interactions were chosen to include all lipophilic PPPs within the tail region of the histograms (Figure 24). There was no scientific proof for the correct value, as the pockets are not characterized and key features, especially for the backside pockets, were unknown. In the screening library, lipophilic PPPs did not dominate the ligand pharmacophores, and therefore a reduction of the number of potential lipophilic pharmacophore points was justified. As there was no obvious way to justify specific cluster radii for the descriptor calculation, the standard values were applied. The pocket detection and the VS procedure were already discussed in the HIV-1 protease section.


5 Conclusion and outlook

In conclusion we were able to include the crystallographic temperature-factor or other

measurements of atom position inaccuracies like the observed protein flexibility during

MD-simulations in a workflow for receptor-based pharmacophore searches. Assessed

against the project goals a modular interactive receptor-based pharmacophore toolkit

was developed fulfilling the following tasks:

The individual segments for pocket detection, pharmacophore point definition and pharmacophore descriptor calculation are interchangeable and based on simple file formats.

The potential binding pocket can be manually processed on the fly and optional automated changes have been introduced to adjust the pocket by a radial atom position inaccuracy value.

The geometric PPP definition rules have been updated to reflect the latest insights in geometry based feature definition. The inaccuracy value can be transferred to the positions of PPPs.

The high number of potential lipophilic interaction points can be automatically reduced according to an on the fly user input.

Amino acid residue grouped PPPs for each feature type are visualized and can be excluded from the descriptor calculation. The shape of the pocket and grouped PPPs can be displayed.

Additional PPPs can be added manually to the PPP file format.

Receptor atom centered exclusion volumes are calculated and may be adjusted to the inaccuracy value.


The overall workflow is well structured, and the user is asked to give additional input to optimize the resulting pharmacophore description. The performance of the workflow was retrospectively evaluated and prospective studies were performed. The modular structure allows for further developments of the workflow, which may include additional algorithms for:

The pocket detection, as for the time being only geometry-based algorithms were applied.

The inclusion of atom position inaccuracies, as in the current version only radial values were implemented.

The PPP feature definition, as there are additional feature types not considered so far and alternative approaches like energy-based feature type definitions.

Pharmacophore descriptor calculations, as there are many ways to describe and to compare pharmacophore models.

In the current state the workflow can adjust to the receptor flexibility and with a

combination of MD-simulation, ligandability prediction for surface patches and a pocket

detection algorithm we were able to generate flexible pharmacophores in dynamic

protein models.


6 ACKNOWLEDGMENTS

First and foremost, I would like to thank Prof. Dr. Gisbert Schneider for giving me the opportunity to work on such an interesting topic within his outstanding research group at ETH in Zurich. He guided me throughout every step of my work, always keeping the perfect distance for me to develop my own understanding of research and science, but helping me back on track whenever needed. Second to none, I would like to thank Prof. Dr. Gerd Folkers, who agreed to act as co-referee for my thesis. Further, I want to thank my collaborators, Jan Domanski at the NIH for the HIV-1 follow-up study and the IspX group led by Prof. Dr. François Diederich for the most interesting and interdisciplinary research on unexplored targets.

Many special thanks go to my colleagues in the CADD group led by Prof. Schneider at ETH. I enjoyed working with excellent researchers every day. I would like to thank the senior scientists, postdocs and all the permanent members of the lab, namely Dr. Petra Schneider, Dr. Jan Hiss, Dr. Tiago Rodrigues, Sarah Haller and Chrissula Chatzidis, for the well-organized environment you create for your PhD students. There was no unanswered question, nor any better support I could have asked for. Special thanks go to all the alumni who welcomed me in the group and supported me during fruitful discussions. I would like to mention Katharina Stutz and thank her for her open-minded nature and for many Swiss-German expressions I will never forget. Last but not least I would like to thank Daniel Reker. Thank you, Daniel, for your friendship that developed over the last four years. It is hard to pick out individual parts, but thank you for taking care of my work-life balance, by helping me out with scientific discussions on the one hand and being a fitness coach and friend on the other hand.

Finally, I would like to thank my family and friends for always believing in me and supporting me at all times. Without your support this work would not have been possible. My last words shall express my gratitude to my beloved Eva, who supported me selflessly during the last period of my PhD.

Thank you all


7 References

1. Hartenfeller M, Schneider G (2011) Enabling future drug discovery by de novo design. WIREs Comput. Mol. Sci. 1, 742-759.
2. Böhm H J (2002) Wirkstoffdesign: der Weg zum Arzneimittel. Spektrum Akad. Verl., Heidelberg.
3. Schneider G (2008) Molecular design: concepts and applications. Wiley-VCH, Weinheim.
4. Keseru G M, Makara G M (2006) Hit discovery and hit-to-lead approaches. Drug Discov. Today 11, 741-748.
5. Alberts B (2009) Essential cell biology. Garland Science, New York.
6. Hopkins A L, Groom C R (2002) The druggable genome. Nat. Rev. Drug Discov. 1, 727–730.
7. Sumner J B (1926) The isolation and crystallization of the enzyme urease. J. Biol. Chem. 69, 435.
8. Sanger F, Tuppy H (1951) The amino-acid sequence in the phenylalanyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochem. J. 49, 463.
9. Sanger F, Thompson E O (1953) The amino-acid sequence in the glycyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochem. J. 53, 353.
10. Kendrew J C, Bodo G, Dintzis H M, Parrish R G, Wyckoff H, Phillips D C (1958) A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature 181, 662.
11. Sanders M P A, McGuire R, Roumen L, de Esch I J P, de Vlieg J, Klomp J P G, de Graaf C (2012) From the protein’s perspective: the benefits and challenges of protein structure-based pharmacophore modeling. Med. Chem. Commun. 3, 28-38.
12. Berman H M, Westbrook J, Feng Z, Gilliland G, Bhat T N, Weissig H, Shindyalov I N, Bourne P E (2000) The Protein Data Bank. Nucleic Acids Res. 28, 235.
13. Scannell J W, Blanckley A, Boldon H, Warrington B (2012) Diagnosing the decline in pharmaceutical R&D efficiency. Nat. Rev. Drug Discov. 11, 191-200.
14. Schneider G, Böhm H J (2002) Virtual screening and fast automated docking methods. Drug Discov. Today 7, 64-70.
15. Schneider G (2012) Designing the molecular future. J. Comput. Aid. Mol. Des. 26, 115-120.
16. Willett P (2006) Similarity-based virtual screening using 2D fingerprints. Drug Discov. Today 11, 1046-1053.
17. Klebe G (2000) Virtual screening: An alternative or complement to high throughput screening? Springer, Dordrecht.
18. Persidis A (1998) High-throughput screening. Nat. Biotechnol. 16, 488.
19. Schneider G (2010) Virtual screening: an endless staircase? Nat. Rev. Drug Discov. 9, 273–276.
20. Schneider G, Baringhaus K H (2013) De novo design: From models to molecules. In: De Novo Molecular Design (ed. Schneider G) Wiley-VCH, Weinheim, 1-56.
21. Walters W P, Namchuk M (2003) Designing screens: How to make your hits a hit. Nat. Rev. Drug Discov. 2, 259-266.
22. Roche O, Schneider P, Zuegge J, Guba W, Kansy M, Alanine A, Bleicher K, Danel F, Gutknecht E M, Rogers-Evans M, Neidhart W, Stalder H, Dillon M, Sjögren E, Fotouhi N, Gillespie P, Goodnow R, Harris W, Jones P, Taniguchi M, Tsujii S, von der Saal W, Zimmermann G, Schneider G (2002) Development of a virtual screening method for identification of “frequent hitters” in compound libraries. J. Med. Chem. 45, 137-142.

23. Baell J B, Holloway G A (2010) New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J. Med. Chem. 53, 2719-2740.
24. Lipinski C A, Lombardo F, Dominy B W, Feeney P J (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliver. Rev. 23, 3-25.
25. Congreve M, Carr R, Murray C, Jhoti H (2003) A “rule of three” for fragment-based lead discovery? Drug Discov. Today 8, 876-877.
26. Wunberg T, Hendrix M, Hillisch A, Lobell M, Meier H, Schmeck C, Wild H, Hinzen B (2006) Improving the hit-to-lead process: Data-driven assessment of drug-like and lead-like screening hits. Drug Discov. Today 11, 175-180.
27. Leeson P (2012) Drug discovery: Chemical beauty contest. Nature 481, 455-456.
28. Giménez B G, Santos M S, Ferrarini M, Fernandes J P S M (2010) Evaluation of blockbuster drugs under the rule-of-five. Pharmazie 65, 148–152.
29. Johnson M A, Maggiora G M (1990) Concepts and applications of molecular similarity. Wiley, New York.
30. Martin Y C, Kofron J L, Traphagen L M (2002) Do structurally similar molecules have similar biological activity? J. Med. Chem. 45, 4350-4358.
31. Xue L, Godden J W, Bajorath J (2000) Evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity. J. Chem. Inf. Comp. Sci. 40, 1227-1234.
32. Maggiora G M (2006) On outliers and activity cliffs-why QSAR often disappoints. J. Chem. Inf. Model. 46, 1535.
33. Borchardt R, Kerns R, Lipinski C, Thakker D, Wang B (2005) Pharmaceutical Profiling in Drug Discovery for Lead Selection. Springer, USA.
34. Medina-Franco J L, Martinez-Mayorga K, Giulianotti M A, Houghten R A, Pinilla C (2008) Visualization of the chemical space in drug discovery. Curr Comput-Aid Drug. 4, 322-333.
35. Todeschini R, Consonni V (2008) Handbook of molecular descriptors. Wiley, New York.
36. Rognan D (2007) Chemogenomic approaches to rational drug design. Br. J. Pharmacol. 152, 38-52.
37. Wiener H (1947) Structural determination of paraffin boiling points. J. Am. Chem. Soc. 69, 17-20.
38. Kier L B (1976) Molecular connectivity in chemistry and drug research. Academic Press, New York.
39. Kier L B, Hall L H (1986) Molecular connectivity in structure-activity analysis. Research Studies Press, Letchworth.
40. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742-754.
41. Schneider G, Neidhart W, Giller T, Schmid G (1999) “Scaffold-hopping” by topological pharmacophore search: A contribution to virtual screening. Angew. Chem. Int. Edit. 38, 2894-2896.


42. Perola E, Charifson P S (2004) Conformational Analysis of Drug-Like Molecules Bound to

Proteins: An Extensive Study of Ligand Reorganization upon Binding. J. Med. Chem. 47, 2499–2510

43. Kirchmair J, Wolber G, Laggner C, Langer T (2006) Comparative Performance Assessment of the

Conformational Model Generators Omega and Catalyst:  A Large-Scale Survey on the Retrieval of Protein-Bound Ligand Conformations. J. Chem. Inf. Model. 46, 1848–1861

44. Fechner U, Franke L, Renner S, Schneider P, Schneider G (2003) Comparison of correlation

vector methods for ligand-based similarity searching. J. Comput. Aid. Mol. Des. 17, 687-698. 45. Hristozov D P, Oprea T I, Gasteiger J (2007) Virtual screening applications: A study of ligand-

based methods and different structure representations in four different scenarios. J. Comput. Aid. Mol. Des. 21, 617-640.

46. Tanrikulu Y, Nietert M, Scheffer U, Proschak E, Grabowski K, Schneider P, Weidlich M, Karas M,

Gobel M, Schneider G (2007) Scaffold hopping by “fuzzy” pharmacophores and its application to RNA targets. Chembiochem. 8, 1932-1936

47. Reutlinger M (2013) Adaptive combinatorial de novo design of multi-target modulating

compounds. PhD thesis, Eidgenossische Technische Hochschule Zurich. 48. Leach A R, Gillet V J (2007) Similarity Methods. In: An Introduction To Chemoinformatics,

Springer Netherlands, Dordrecht, 99-117. 49. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A (2004) Comparison of

fingerprint-based methods for virtual screening using multiple bioactive reference structures. J. Chem. Inf. Comp. Sci. 44, 1177-1185.

50. Wood FJ, de Vlieg J, Wagener M, Ritschel T (2012) Pharmacophore Fingerprint-Based Approach

to Binding Site Subpocket Similarity and Its Application to Bioisostere Replacement. J. Chem. Inf. Model. 52, 2031–2043.

51. Hawkins P C, Warren G L, Skillman A G, Nicholls A (2008) How to do an evaluation: Pitfalls and

traps. J. Comput. Aid. Mol. Des. 22, 179-190. 52. Truchon J F, Bayly C I (2007) Evaluating virtual screening methods: Good and bad metrics for

the “early recognition” problem. J. Chem. Inf. Model. 47, 488-508. 53. Hanely J A, McNeil B J (1982) The meaning and use of the area under a Receiver Operating

Characteristic (ROC) curve. Radiology 143, 29-36. 54. Wang R, Wang S (2001) How does consensus scoring work for virtual library screening? An

idealized computer experiment. J. Chem. Inf. Comp. Sci. 41, 1422-1426. 55. Nasr R J, Swamidass S J, Baldi P F (2009) Large scale study of multiple-molecule queries. J.

Cheminform. 1, 1-7. 56. Melville J L, Burke E K, Hirst J D (2009) Machine learning in virtual screening. Comb. Chem. High

T. Scr. 12, 332-343. 57. Wallach I (2011) Pharmacophore inference and its application to computational drug discovery.

Drug Develop. Res. 72, 17–25. 58. Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A (2014) Critical assessment of

methods of protein structure prediction (CASP)—round x. Proteins 82, 1-6.

59. Mount D W (2004) Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.

60. Chothia C, Lesk A M (1986) The relation between the divergence of sequence and structure in

proteins. EMBO J. 5, 823. 61. Klabunde T (2007) Chemogenomic approaches to drug discovery: similar receptors bind similar

ligands. Brit. J. Pharmacol. 152, 5. 62. Geppert T, Hoy B, Wessler S, Schneider G (2011) Context-based identification of protein-protein

interfaces and “hot-spot” residues. Chem. Biol. 18, 344-353. 63. Bonvin A M J J (2006) Flexible protein–protein docking. Curr. Opin. Struc. Biol. 16, 194-200. 64. Desaphy J, Azdimousa K, Kellenberger E, Rognan D (2012) Comparison and Druggability

Prediction of Protein–Ligand Binding Sites from Pharmacophore-Annotated Cavity Shapes. J. Chem. Inf. Model. 52, 2287–2299

65. Nair R, Liu J, Soong T T, Acton T B, Everett J K, Kouranov A, Fiser A, Godzik A, Jaroszewski L,

Orengo C, Montelione G T, Rost B (2009) Structural genomics is the largest contributor of novel structural leverage. J. Struct. Funct. Genomics 10, 181−191.

66. Sanders M P A, Verhoeven S, de Graaf C, Roumen L, Vroling B, Nabuurs S B, de Vlieg J, Klomp J P

G (2011) Snooker: A Structure-Based Pharmacophore Generation Tool Applied to Class A GPCRs. J. Chem. Inf. Model. 51, 2277–2292.

67. Edfeldt F N, Folmer R H, Breeze A L (2011) Fragment screening to predict druggability

(ligandability) and lead discovery success. Drug Discov. Today 16, 284–287. 68. Surade S, Blundell T L (2012) Structural biology and drug discovery of difficult targets: The

limits of ligandability. Chem. Biol. 19, 42–50. 69. Fischer E (1894) Einfluss der Configuration auf die Wirkung der Enzyme. Ber. Dtsch. Chem. Ges.

27, 2985–2993. 70. Koshland D E (1958) Application of a theory of enzyme specificity to protein synthesis. P. Natl.

Acad. Sci. U.S.A. 44, 98-104. 71. Boehr D D, Nussinov R, Wright P E (2009) The role of dynamic conformational ensembles in

biomolecular recognition. Nat. Chem. Biol. 5, 789-796. 72. DeDecker B S (2000) Allosteric drugs: thinking outside the active-site box. Chem. Biol. 7, 103–

107. 73. Tsai C J, del Sol A, Nussinov R (2008) Allostery: absence of a change in shape does not imply that

allostery is not at play. J. Mol. Biol. 378, 1-11. 74. Tsai C J, del Sol A, Nussinov R (2009) Protein allostery, signal transmission and dynamics: a

classification scheme of allosteric mechanisms. Mol. Biosyst. 5, 207-216. 75. Henrich S, Salo-Ahen O M H, Huang B, Rippmann F F, Cruciani G, Wade R C (2010)

Computational approaches to identifying and characterizing protein binding sites for ligand design. J. Mol. Recognit. 23, 209–219.

76. Bissantz C, Kuhn B, Stahl M (2010) A Medicinal Chemist’s Guide to Molecular Interactions. J. Med.

Chem. 53, 5061–5084. 77. Garcia-Sosa A, Mancera R L, Dean P M (2003) WaterScore: a novel method for distinguishing

between bound and displaceable water molecules in the crystal structure of the binding site of protein-ligand complexes. J. Mol. Model. 9, 172–182.

78. Pérot S, Sperandio O, Miteva M A, Camproux A C, Villoutreix B O (2010) Druggable pockets and

binding site centric chemical space: a paradigm shift in drug discovery. Drug Discov. Today 15, 656–667.

79. Greenidge P A, Carlsson B, Bladh L G, Gillner M (1998) Pharmacophores incorporating

numerous excluded volumes defined by X-ray crystallographic structure in three-dimensional database searching: Application to the thyroid hormone receptor. J. Med. Chem. 41, 2503-2512.

80. Bondi A (1964) van der Waals Volumes and Radii. J. Phys. Chem. 68, 441–451. 81. Lee B, Richards F M (1971) The interpretation of protein structures: estimation of static

accessibility. J. Mol. Biol. 55, 379–400. 82. Richards F M (1977) Areas, volumes, packing, and protein structure. Annu. Rev. Biophys. Bio. 6,

151–176. 83. ROCS; OpenEye Scientific Software: Santa Fe, NM, 2006. 84. Blundell T L, Jhoti H, Abell C (2002) High-throughput crystallography for lead discovery in drug

design. Nat. Rev. Drug Discov. 1, 45–54. 85. Blaney J M, Dixon J S (1993) A good ligand is hard to find: Automated docking methods. Perspect.

Drug Discov. 1, 301-319. 86. Moitessier N, Englebienne P, Lee D, Lawandi J, Corbeil CR (2008) Towards the development of

universal, fast and highly accurate docking/scoring methods: A long way to go. Brit. J. Pharmacol. 153, 7-26.

87. Warren G L, Andrews C W, Capelli A M, Clarke B, LaLonde J, Lambert M H, Lindvall M, Nevins N,

Semus S F, Senger S, Tedesco G, Wall I D, Woolven J M, Peishoff C E, Head M S (2006) A critical assessment of docking programs and scoring functions. J. Med. Chem. 49, 5912-5931.

88. Waszkowycz B, Clark D E, Gancia E (2011) Outstanding challenges in protein–ligand docking

and structure-based virtual screening. WIREs Comput. Mol. Sci. 1, 229-259. 89. Plewczynski D, Łaźniewski M, Augustyniak R, Ginalski K (2011) Can we trust docking results?

Evaluation of seven commonly used programs on PDBbind database. J. Comput. Chem. 32, 742-755.

90. Bissantz C, Folkers G, Rognan D (2000) Protein-based virtual screening of chemical databases. 1.

Evaluation of different docking/scoring combinations. J. Med. Chem. 43, 4759-4767. 91. Friesner R A, Banks J L, Murphy R B, Halgren T A, Klicic J J, Mainz D T, Repasky M P, Knoll E H,

Shelley M, Perry J K (2004) Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47, 1739-1749.

92. Kuntz I D, Blaney J M, Oatley S J, Langridge R, Ferrin T E (1982) A geometric approach to

macromolecule-ligand interactions. J. Mol. Biol. 161, 269-288. 93. Horvath D (2011) Pharmacophore-based virtual screening. In: Chemoinformatics and

Computational Chemical Biology, Springer, New York, 261-298. 94. Rarey M, Kramer B, Lengauer T, Klebe G (1996) A fast flexible docking method using an

incremental construction algorithm. J. Mol. Biol. 261, 470-489. 95. Metropolis N, Ulam S (1949) The Monte Carlo method. J. Am. Stat. Assoc. 44, 335-341. 96. Kirkpatrick S (1984) Optimization by Simulated Annealing: Quantitative Studies. J. Stat. Phys. 34,

975-986.

97. Goldberg D E (1989) Genetic algorithms in search, optimization, and machine learning. Addison-

Wesley, Boston. 98. Oshiro C M, Kuntz I D, Dixon J S (1995) Flexible ligand docking using a genetic algorithm. J.

Comput. Aid. Mol. Des. 9, 113-130. 99. Jones G, Willett P, Glen R C, Leach A R, Taylor R (1997) Development and validation of a

genetic algorithm for flexible docking. J. Mol. Biol. 267, 727-748.

100. Baxter C A, Murray C W, Clark D E, Westhead D R, Eldridge M D (1998) Flexible docking using Tabu search and an empirical estimate of binding affinity. Proteins 33, 367-382

101. Miyamoto S, Kollman P A (1993) Absolute and relative binding free energy calculation of the

interaction of biotin and its analogs with streptavidin using molecular dynamics/free energy perturbation approaches. Proteins 16, 226-245

102. Höltje H D, Sippl W, Rognan D, Folkers G (2008) Molecular Modeling - Basic Principles and

Applications, Wiley-VCH, Weinheim. 103. Böhm H J (1994) On the use of LUDI to search the fine chemicals directory for ligands of

proteins of known three-dimensional structure. J. Comput. Aid. Mol. Des. 8, 623-632.

104. Ferrara P, Gohlke H, Price D J, Klebe G, Brooks C L (2004) Assessing scoring functions for protein-ligand interactions. J. Med. Chem. 47, 3032-3047.

105. Dill K A (1997) Additivity principles in biochemistry. J. Biol. Chem. 272, 701-704. 106. Morris G M, Goodsell D S, Halliday R S, Huey R, Hart W E, Belew R K, Olson A J (1998)

Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function. J. Comput. Chem. 19, 1639-1662.

107. Ewing T J, Makino S, Skillman A G, Kuntz I D (2001) DOCK 4.0: search strategies for automated

molecular docking of flexible molecule databases. J. Comput. Aid. Mol. Des. 15, 411-428. 108. Eldridge M D, Murray C W, Auton T R, Paolini G V, Mee R P (1997) Empirical scoring functions: I.

The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. J. Comput. Aid. Mol. Des. 11, 425-445.

109. Rognan D, Lauemøller S L, Holm A, Buus S, Tschinke V (1999) Predicting Binding Affinities of Protein Ligands from Three-Dimensional Models: Application to Peptide Binding to Class I Major Histocompatibility Proteins. J. Med. Chem. 42, 4650–4658.

110. Cozzini P, Fornabaio M, Marabotti A, Abraham D J, Kellogg G E, Mozzarelli A (2002) Simple,

Intuitive Calculations of Free Energy of Binding for Protein−Ligand Complexes. 1. Models without Explicit Constrained Water. J. Med. Chem. 45, 2469–2483.

111. Krammer A, Kirchhoff P D, Jiang X, Venkatachalam C M, Waldman M (2005) LigScore: a novel

scoring function for predicting binding affinities. J. Mol. Graph. Model. 23, 395–407. 112. Böhm H-J (1994) The development of a simple empirical scoring function to estimate the

binding constant for a protein-ligand complex of known three-dimensional structure. J. Comput. Aid. Mol. Des. 8, 243-256.

113. Gehlhaar D K, Verkhivker G M, Rejto P A, Sherman C J, Fogel D R, Fogel L J, Freer S T (1995)

Molecular recognition of the inhibitor AG-1343 by HIV-1 protease: conformationally flexible docking by evolutionary programming. Chem. Biol. 2, 317–324.

114. Stahl M, Rarey M (2001) Detailed analysis of scoring functions for virtual screening. J. Med.

Chem. 44, 1035-1042.

115. Wang R, Lai L, Wang S (2002) Further development and validation of empirical scoring

functions for structure-based binding affinity prediction. J. Comput. Aid. Mol. Des. 16, 11-26. 116. Mitchell J B O, Laskowski R A, Alex A, Forster M J, Thornton J M (1999) BLEEP—potential of

mean force describing protein–ligand interactions: II. Calculation of binding energies and comparison with experimental data. J. Comput. Chem. 20, 1177–1185.

117. Gohlke H, Hendlich M, Klebe G (2000) Knowledge-based scoring function to predict protein-

ligand interactions. J. Mol. Biol. 295, 337-356. 118. Muegge I, Martin Y C (1999) A general and fast scoring function for protein-ligand interactions:

A simplified potential approach. J. Med. Chem. 42, 791-804. 119. Ishchenko A V, Shakhnovich E I (2002) Small molecule growth 2001 (SMoG2001): an improved

knowledge-based scoring function for protein-ligand interactions. J. Med. Chem. 45, 3032-3047. 120. Klebe G (2006) Virtual ligand screening: Strategies, perspectives and limitations. Drug Discov

Today. 11, 580-594. 121. Ellingson S R, Dakshanamurthy S, Brown M, Smith J C, Baudry J (2014) Accelerating virtual high-

throughput ligand docking: current technology and case study on a petascale supercomputer. Concurr. Comp-Pract. E. 26, 1268–1277.

122. Zhao H, Gartenmann L, Dong J, Spiliotopoulos D, Caflisch A (2014) Discovery of BRD4

bromodomain inhibitors by fragment-based high-throughput docking. Bioorg. Med. Chem. Lett. 24, 2493–2496.

123. Shulman-Peleg A, Nussinov R, Wolfson H J (2004) Recognition of functional sites in protein

structures. J. Mol. Biol. 339, 607-633. 124. Colman P M (1994) Structure-based drug design. Curr. Opin. Struct. Biol. 4, 868-874. 125. Klebe G (2000) Recent developments in structure-based drug design. J. Mol. Med. 78, 269-281. 126. Weigelt J, McBroom-Cerajewski L D B, Schapira M, Zhao Y, Arrowsmith C H (2008) Structural

genomics and drug discovery: all in the family. Curr. Opin. Chem. Biol. 12, 32–39 127. Fedorov O, Sundström M, Marsden B, Knapp S (2007) Insights for the development of specific

kinase inhibitors by targeted structural genomics. Drug Discov. Today 12, 365–372 128. Kolb P, Ferreira R S, Irwin J J, Shoichet B K (2009) Docking and chemoinformatic screens for

new ligands and targets. Curr. Opin. Biotech. 20, 429–436. 129. Chalk A J, Worth C L, Overington J P, Chan A W E (2004) PDBLIG: classification of small

molecular protein binding in the Protein Data Bank. J. Med. Chem. 47, 3807–3816. 130. Abagyan R, Kufareva I (2009) The flexible pocketome engine for structural chemogenomics.

Methods Mol. Biol. 575, 249–279. 131. Kellenberger E, Schalon C, Rognan D (2008) How to measure the similarity between protein

ligand-binding sites? Curr. Comput-Aid. Drug 4, 209–220.

132. Andersson C D, Chen B Y, Linusson A (2009) Mapping of ligand-binding cavities in proteins. Proteins 78, 1408–1422

133. Klabunde T (2007) Chemogenomic approaches to drug discovery: similar receptors bind

similar ligands. Brit. J. Pharmacol. 152, 5.

134. Weisel M, Kriegl J M, Schneider G (2010) Architectural repertoire of ligand-binding pockets on protein surfaces. ChemBioChem 11, 556–563.

135. Nayal M, Honig B (2006) On the nature of cavities on protein surfaces: application to the

identification of drug-binding sites. Proteins 63, 892–906. 136. Laskowski R A, Luscombe N M, Swindells M B, Thornton J M (1996) Protein clefts in molecular

recognition and function. Protein Sci. 5, 2438–2452. 137. Laskowski R A (1995) SURFNET: a program for visualizing molecular surfaces, cavities, and

intermolecular interactions. J. Mol. Graph. 13, 323-330. 138. Weisel M, Proschak E, Schneider G (2007) PocketPicker: analysis of ligand binding-sites with

shape descriptors. Chem. Cent. J. 1, 1. 139. Liang J, Edelsbrunner H, Woodward C (1998) Anatomy of protein pockets and cavities:

Measurements of binding site geometry and implications for ligand design. Protein Sci. 7, 1884-1897.

140. Binkowski T, Naghibzadeh S, Liang J (2003) CASTp: Computed Atlas of Surface Topography of

proteins. Nucleic Acids Res. 31, 3352-3355. 141. Aurenhammer F (1991) Voronoi diagrams – a survey of a fundamental geometric data structure.

ACM Comput. Surv. (CSUR) 23, 345-405. 142. Lee D T, Schachter B J (1980) Two Algorithms for constructing a Delaunay triangulation. Int. J.

Comput. Inf. Sci. 9, 219-242 143. Brady G P, Stouten P F W (2000) Fast prediction and visualization of protein binding pockets

with PASS. J. Comput. Aid. Mol. Des. 14, 383-401. 144. Levitt D G, Banaszak L J (1992) POCKET: A computer graphics method for identifying and

displaying protein cavities and their surrounding amino acids. J. Mol. Graph. 10, 229-234 145. Hendlich M, Rippmann F, Barnickel G (1997) LIGSITE: Automatic and efficient detection of

potential small molecule-binding sites in proteins. J. Mol. Graph. Model. 15, 359-363. 146. Huang B, Schröder M (2006) LIGSITEcsc: predicting ligand binding sites using the Connolly

surface and degree of conservation. BMC Struct. Biol. 6, 19-29. 147. Ghersi D, Sanchez R (2011) Beyond structural genomics: computational approaches for the

identification of ligand binding sites in protein structures. J. Struct. Funct. Genomics. 12, 109-117.

148. Goodford P J (1985) A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J. Med. Chem. 28, 849–857.

149. Cruciani G (2006) Molecular interaction fields - Applications in drug discovery and ADME

prediction. Wiley, Weinheim. 150. Laurie A T, Jackson R M (2005) Q-SiteFinder: an energy-based method for the prediction of

protein-ligand binding sites. Bioinformatics 21, 1908–1916. 151. Morita M, Nakamura S, Shimizu K (2008) Highly accurate method for ligand-binding site

prediction in unbound state (apo) protein structures. Proteins 73, 468–479. 152. Ghersi D, Sanchez R (2009) EasyMIFS and SiteHound: a toolkit for the identification of ligand-

binding sites in protein structures. Bioinformatics 25, 3185–3186.

153. Harris R, Olson A J, Goodsell D S (2008) Automated prediction of ligand-binding sites in proteins. Proteins 70, 1506–1517.

154. Mattos C, Ringe D (1996) Locating and characterizing binding sites on proteins. Nat. Biotechnol.

14, 595–599. 155. Vajda S, Guarnieri F (2006) Characterization of protein-ligand interaction sites using

experimental and computational methods. Curr. Opin. Drug Di. De. 9, 354–362. 156. Capra J A, Singh M (2007) Predicting functionally important residues from sequence

conservation. Bioinformatics 23, 1875–1882. 157. Artymiuk P J, Poirrette A R, Grindley H M, Rice D W, Willett P (1994) A graph-theoretic approach

to the identification of three-dimensional patterns of amino acid side-chains in protein structures. J. Mol. Biol. 243, 327–344.

158. Zhang Z, Grigorov M G (2006) Similarity networks of protein binding sites. Proteins 62, 470–

478. 159. Xie L, Bourne P E (2007) A robust and efficient algorithm for the shape description of protein

structures and its application in predicting ligand binding sites. Bioinformatics 8, 9. 160. Wallace A C, Borkakoti N, Thornton J M (1997) TESS: a geometric hashing algorithm for deriving

3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci. 6, 2308–2323.

161. Brylinski M, Skolnick J (2009) FINDSITE: a threading-based approach to ligand homology

modeling. PLoS. Comput. Biol. 5, e1000405. 162. Skolnick J, Brylinski M (2009) FINDSITE: a combined evolution/structure-based approach to

protein function prediction. Brief. Bioinform. 10, 378–391. 163. Wass M N, Kelley L A, Sternberg M J (2010) 3DLigandSite: predicting ligand-binding sites using

similar structures. Nucleic Acids Res. 38, W469–W473. 164. Peters K P, Fauck J, Frommel C (1996) The automatic search for ligand binding sites in proteins

of known three-dimensional structure using only geometric criteria. J. Mol. Biol. 256, 201–213. 165. Zhong S, Mackerell A D Jr (2007) Binding response: a descriptor for selecting ligand binding site

on protein surfaces. J. Chem. Inf. Model. 47, 2303–2315. 166. Dundas J, Ouyang Z, Tseng J, Binkowski A, Turpaz Y, Liang J (2006) CASTp: computed atlas of

surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucleic Acids Res. 34, W116–W118.

167. Petrek M, Otyepka M, Banas P, Kosinova P, Koca J, Damborsky J (2006) CAVER: a new tool to

explore routes from protein clefts, pockets and cavities. BMC Bioinformatics 7, 316. 168. Capra J A, Laskowski R A, Thornton J M, Singh M, Funkhouser T A (2009) Predicting protein

ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput. Biol. 5, e1000585.

169. Lichtarge O, Bourne H R, Cohen F E (1996) An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 257, 342–358.

170. Le Guilloux V, Schmidtke P, Tuffery P (2009) Fpocket: an open source platform for ligand pocket

detection. BMC Bioinformatics 10, 168.

171. Brenke R, Kozakov D, Chuang G Y, Beglov D, Hall D, Landon M R, Mattos C, Vajda S (2009) Fragment-based identification of druggable “hot spots” of proteins using Fourier domain correlation techniques. Bioinformatics 25, 621–627.

172. An J, Totrov M, Abagyan R (2005) Pocketome via comprehensive identification and classification

of ligand binding envelopes. Mol. Cell. Proteomics 4, 752–761. 173. Till M S, Ullmann G M (2010) McVol - a program for calculating protein volumes and identifying

cavities by a Monte Carlo algorithm. J. Mol. Model. 16, 419–429. 174. Parca L, Gherardini P F, Helmer-Citterich M, Ausiello G (2011) Phosphate binding sites

identification in protein structures. Nucleic Acids Res. 39, 1231–1242. 175. Kalidas Y, Chandra N (2008) PocketDepth: a new depth based algorithm for identification of

ligand binding sites in proteins. J. Struct. Biol. 161, 31–42. 176. Nayal M, Honig B (2006) On the nature of cavities on protein surfaces: application to the

identification of drug-binding sites. Proteins 63, 892–906. 177. Halgren T (2007) New method for fast and accurate binding-site identification and analysis.

Chem. Biol. Drug. Des. 69, 146–148. 178. Halgren T A (2009) Identifying and characterizing binding sites and assessing druggability. J.

Chem. Inf. Model. 49, 377–389. 179. Tseng Y Y, Dupree C, Chen Z J, Li W H (2009) SplitPocket: identification of protein functional

surfaces and characterization of their spatial patterns. Nucleic Acids Res. 37, W384–W389. 180. Ho C, Marshall G (1990) Cavity search: An algorithm for the isolation and display of cavity-like

binding regions. J. Comput. Aid. Mol. Des. 4, 337–354. 181. Delaney J (1992) Finding and filling protein cavities using cellular logic operations. J. Mol. Graph.

10, 174–177. 182. Del Carpio C, Takahashi Y, Sasaki S (1993) A new approach to the automatic identification of

candidates for ligand receptor sites in proteins: (I). Search for pocket regions. J. Mol. Graph. 11, 23–29.

183. Kleywegt G, Jones T (1994) Detection, delineation, measurement and display of cavities in

macromolecular structures. Acta Crystallogr. D 50, 178–185. 184. Stahl M, Taroni C, Schneider G (2000) Mapping of protein surface cavities and prediction of

enzyme class by a self-organizing neural network. Protein Eng. 13, 83–88. 185. Venkatachalam C, Jiang X, Oldfield T, Waldman M (2003) LigandFit: A novel method for the

shape-directed rapid docking of ligands to protein active sites. J. Mol. Graph. Model. 21, 289–307.

186. Coleman R, Sharp K (2006) Travel depth, a new shape descriptor for macromolecules:

Application to ligand binding. J. Mol. Biol. 362, 441–458. 187. Li B, Turuvekere S, Agrawal M, La D, Ramani K, Kihara D (2008) Characterization of local

geometry of protein surfaces with the visibility criterion. Proteins 71, 670–683. 188. Tripathi A, Kellogg G (2010) A novel and efficient tool for locating and characterizing protein

cavities and binding sites. Proteins 78, 825–842. 189. Ruppert J, Welch W, Jain A (1997) Automatic identification and representation of protein

binding sites for molecular docking. Protein Sci. 6, 524–533.

190. Bliznyuk A, Gready J (1998) Identification and energetic ranking of possible docking sites for pterin on dihydrofolate reductase. J. Comput. Aid. Mol. Des. 12, 325–333.

191. Kortvelyesi T, Silberstein M, Dennis S, Vajda S (2003) Improved mapping of protein binding

sites. J. Comput. Aid. Mol. Des. 17, 173–186. 192. An J, Totrov M, Abagyan R (2004) Comprehensive identification of “druggable” protein ligand

binding sites. Genome Inform. Ser. 15, 31–41. 193. An J, Totrov M, Abagyan R (2005) Pocketome via comprehensive identification and classification

of ligand binding envelopes. Mol. Cell. Proteomics 4, 752–761. 194. Casari G, Sander C, Valencia A (1995) A method to predict functional residues in proteins. Nat.

Struct. Biol. 2, 171–178. 195. de Rinaldis M, Ausiello G, Cesareni G, Helmer-Citterich M (1998) Three-dimensional profiles: A

new tool to identify protein surface similarities. J. Mol. Biol. 284, 1211–1221. 196. Aloy P, Querol E, Aviles F, Sternberg M (2001) Automated structure-based prediction of

functional sites in proteins: Applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J. Mol. Biol. 311, 395–408.

197. Armon A, Graur D, Ben-Tal N (2001) ConSurf: An algorithmic tool for the identification of

functional regions in proteins by surface mapping of phylogenetic information. J. Mol. Biol. 307, 447–463.

198. Pupko T, Bell R, Mayrose I, Glaser F, Ben-Tal N (2002) Rate4Site: An algorithmic tool for the

identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 18, 71–7.

199. Huang B, Schroeder M (2006) LIGSITEcsc: Predicting ligand binding sites using the Connolly

surface and degree of conservation. BMC Struct. Biol. 6, 19. 200. Huang B (2009) MetaPocket: A meta approach to improve protein ligand binding site prediction.

Omics 13, 325–330. 201. Bray T, Chan P, Bougouffa S, Greaves R, Doig A, Warwicker J (2009) SitesIdentify: A protein

functional site prediction tool. BMC Bioinf. 10, 379. 202. Volkamer A, Griewel A, Grombacher T, Rarey M (2010) Analyzing the Topology of Active Sites:

On the Prediction of Pockets and Subpockets. J. Chem. Inf. Model. 50, 2041–2052. 203. Luque I, Freire E (2000) Structural stability of binding sites: consequences for binding affinity

and allosteric effects. Proteins 41, 63–71. 204. Nisius B, Sha F, Gohlke H (2012) Structure-based computational analysis of protein binding sites

for function and druggability prediction. J. Biotechnol. 159, 123–134. 205. Liang J, Woodward C, Edelsbrunner H (1998) Anatomy of protein pockets and cavities:

measurement of binding site geometry and implications for ligand design. Protein Sci. 7, 1884–1897.

206. Morphy R, Kay C, Rankovic Z (2004) From magic bullets to designed multiple ligands. Drug

Discov. Today 9, 641-651. 207. Xie L, Xie L, Kinnings S L, Bourne P E (2012) Novel computational approaches to

polypharmacology as a means to define responses to individual drugs. Annu. Rev. Pharmacol. 52, 361-379.

208. Weisel M, Proschak E, Kriegl J M, Schneider G (2009) Form follows function: Shape analysis of protein cavities for receptor-based drug design. Proteomics 9, 451.

209. Brown D, Superti-Furga G (2003) Rediscovering the sweet spot in drug discovery. Drug Discov.

Today 8, 1067–1077. 210. Volkamer A, Kuhn D, Grombacher T, Rippmann F, Rarey M (2012) Combining Global and Local

Measures for Structure-Based Druggability Predictions. J. Chem. Inf. Model. 52, 360–372. 211. Reisen F (2012) Modeling Structure-Function Relationships Of Macromolecular Ligand Binding

Sites By Fuzzy Graphs. PhD thesis, Eidgenössische Technische Hochschule Zürich. 212. Schmitt S, Kuhn D, Klebe G (2002) A new method to detect related function among proteins

independent of sequence and fold homology. J. Mol. Biol. 323, 387. 213. Jambon M, Imberty A, Deléage G, Geourjon C (2003) A new bioinformatic approach to detect

common 3D sites in protein structures. Proteins Struct. Funct. Bioinf. 52, 137. 214. Milik M, Szalma S, Olszewski K A (2003) Common structural cliques: a tool for protein structure

and function analysis. Protein Eng. 16, 543. 215. Gold N D, Jackson R M (2006) SitesBase: a database for structure-based protein–ligand binding

site comparisons. Nucleic Acids Res. 34, D231. 216. Andersson C D, Chen B Y, Linusson A (2010) Mapping of ligand-binding cavities in proteins.

Proteins 78, 1408. 217. Comin M, Guerra V, Dellaert F (2009) Binding Balls: Fast Detection of Binding Sites Using a

Property of Spherical Fourier Transform. J. Comput. Biol. 16, 1577-1591. 218. Morris R J, Najmanovich R J, Kahraman A, Thornton J M (2005) Real spherical harmonic

expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons. Bioinformatics 21, 2347.

219. Sael L, La D, Li B, Rustamov R, Kihara D (2008) Rapid comparison of properties on protein

surface. Proteins 73, 1. 220. Venkatraman V, Sael L, Kihara D (2009) Potential for Protein Surface Shape Analysis Using

Spherical Harmonics and 3D Zernike Descriptors. Cell. Biochem. Biophys. 54, 23–32. 221. Wlodarski T, Zagrovic B (2009) Conformational selection and induced fit mechanism underlie

specificity in noncovalent interactions with ubiquitin. P. Natl. Acad. Sci. U.S.A. 106, 19346–19351.

222. Ahmed A, Kazemi S, Gohlke H (2007) Protein flexibility and mobility in structure-based drug

design. Front. Drug Des. Discov. 3, 455–476. 223. Cozzini P, Kellogg G E, Spyrakis F, Abraham D J, Costantino G, Emerson A, Fanelli F, Gohlke H,

Kuhn L A, Morris G M, Orozco M, Pertinhez T A, Rizzi M, Sotriffer C (2008) Target flexibility: an emerging consideration in drug discovery and design. J. Med. Chem. 51, 6237–6255.

224. Ma B, Shatsky M, Wolfson H J M, Nussinov R (2002) Multiple diverse ligands binding at a single

protein site: a matter of pre-existing populations. Protein Sci. 11, 184–197. 225. Zheng X, Gan L, Wang E, Wang J (2013) Pocket-Based Drug Design: Exploring Pocket Space. The

AAPS Journal 15, 228-241. 226. McCammon J A (2005) Target flexibility in molecular recognition. Biochim. Biophys. Acta 1754,

221–224.

227. Rueda M, Bottegoni G, Abagyan R (2010) Recipes for the selection of experimental protein conformations for virtual screening. J. Chem. Inf. Model. 50, 186–93.

228. Sperandio O, Mouawad L, Pinto E, Bruno OV, Perahia D, Miteva M A (2010) How to choose

relevant multiple receptor conformations for virtual screening: a test case of Cdk2 and normal mode analysis. Eur. Biophys. J. 39, 1365–72.

229. Zhu J, Fan H, Liu H, Shi Y (2001) Structure-based ligand design for flexible proteins: Application

of new F-DycoBlock. J. Comput. Aid. Mol. Des. 15, 979–996. 230. Liu H Y, Duan Z H, Luo Q M, Shi Y Y (1999) Structure-based ligand design by dynamically

assembling molecular building blocks at binding site. Proteins 36, 462. 231. Eyrisch S, Helms V (2007) Transient pockets on protein surfaces involved in protein–protein

interaction. J. Med. Chem. 50, 3457–3464. 232. Schmidtke P, Bidon-Chanal A, Luque F J, Barril X (2011) MDpocket: open-source cavity detection

and characterization on molecular dynamics trajectories. Bioinformatics 27, 3276-3285. 233. Kunze J, Todoroff N, Schneider P, Rodrigues T, Geppert T, Reisen F, Schreuder H, Saas J, Hessler

G, Baringhaus K H, Schneider G (2014) Targeting dynamic pockets of HIV-1 protease by structure-based computational screening for allosteric inhibitors. J. Chem. Inf. Model. 54, 987–991.

234. Todoroff N, Kunze J, Schreuder H, Hessler G, Baringhaus KH, Schneider G (2014) Fractal

dimensions of macromolecular structures. Mol. Inf. 33, 588-596.

235. Van Drie J H (2007) Monty Kier and the Origin of the Pharmacophore Concept. Internet Electron.

J. Mol. Des. 6, 271–279.

236. Langer T (2010) Pharmacophores in Drug Research. Mol. Inf.29, 470–475.

237. Kier L B (1967) Molecular orbital calculation of preferred conformations of acetylcholine,

muscarine, and muscarone. Mol. Pharmacol. 3, 487–494.

238. Kier L B (1971) MO Theory in Drug Research, Academic Press, New York, 164–169.

239. Gund P (1977) Three-dimensional pharmacophore pattern searching. Prog. Mol. Subcell. Biol. 5, 117-143.

240. Wermuth C G, Ganellin C R, Lindberg P, Mitscher L A (1998) Glossary of Terms Used in Medicinal

Chemistry (IUPAC Recommendations 1997). Annu. Rep. Med. Chem. 33, 385-395.

241. Langer T, Hoffmann R D (2006) Pharmacophores and Pharmacophore Searches, Volume 3. Wiley-VCH, Weinheim

242. Evans B E, Rittle K E, Bock M G, Di-Pardo R M, Freidinger R M, Whitter W L, Lundell G F, Veber D

F, Anderson P S, Chang R S, Lotti V J, Cerno D J, Chen T B, Kling P J, Kunkel K A, Springer J P, Hirshfield J (1988) Methods for drug discovery: development of potent, selective, orally effective cholecystokinin antagonists. J. Med. Chem. 31, 2235–2246.

243. Nicholls A (2010) Molecular Shape and Medicinal Chemistry: A Perspective .J. Med. Chem. 53,

3862–3886.

244. Lemmen C, Lengauer T (2000) Computational methods for the structural alignment of molecules. J.Comput. Aid. Mol. Des., 14: 215–232.

245. Grant J A, Pickup B T (1995) A Gaussian Description of Molecular Shape. J. Phys. Chem. 99, 3503–

3510.


246. Grant J A, Gallardo M A, Pickup B T (1996) A fast method of molecular shape comparison: A simple application of a Gaussian description of molecular shape. J. Comput. Chem. 17, 1653.

247. ROCS-CFF; OpenEye Scientific Software: Santa Fe, NM, 2006.

248. Vainio M, Puranen S P, Johnson M S (2009) ShaEP: Molecular Overlay Based on Shape and Electrostatic Potential. J. Chem. Inf. Model. 49, 492–502.

249. Ballester P J, Richards W G (2007) Ultrafast shape recognition to search compound databases for similar molecular shapes. J. Comput. Chem. 28, 1711–1723.

250. Haigh J A, Pickup B T, Grant J A, Nicholls A (2005) Small Molecule Shape-Fingerprints. J. Chem. Inf. Model. 45, 673–684.

251. Das S, Kokardekar A, Breneman C M (2009) Rapid Comparison of Protein Binding Site Surfaces with Property Encoded Shape Distributions. J. Chem. Inf. Model. 49, 2863–2872.

252. Sastry G M, Dixon S L, Sherman W (2011) Rapid Shape-Based Ligand Alignment and Virtual Screening Method Based on Atom/Feature-Pair Similarities and Volume Overlap Scoring. J. Chem. Inf. Model. 51, 2455–2466.

253. Pitman M C, Huber W K, Horn H, Krämer A, Rice J E, Swope W C (2001) FLASHFLOOD: a 3D field-based similarity search and alignment method for flexible molecules. J. Comput.-Aided Mol. Des. 15, 587–612.

254. Good A C (2007) Novel DOCK clique driven 3D similarity database search tools for molecule shape matching and beyond: Adding flexibility to the search for ligand kin. J. Mol. Graph. Model. 26, 656–666.

255. McGregor M (2007) A pharmacophore map of small molecule protein kinase inhibitors. J. Chem. Inf. Model. 47, 2374–2382.

256. Wolber G, Seidel T, Bendix F, Langer T (2008) Molecule-pharmacophore superpositioning and pattern matching in computational drug design. Drug. Dev. Res. 13, 23–29.

257. Marshall G R, Barry C D, Bosshard H E, Dammkoehler R A, Dunn D A (1979) The conformational parameter in drug design: the active analog approach. Computer-Assisted Drug Design 112, 205–225.

258. Jones G, Willett P, Glen R C (1995) A genetic algorithm for flexible molecular overlay and pharmacophore elucidation. J. Comput. Aid. Mol. Des. 9, 532–549.

259. Nissink J W M, Verdonk M L, Kroon J, Mietzner T, Klebe G (1997) Superposition of molecules: Electron density fitting by application of Fourier transforms. J. Comput. Chem. 18, 638–645.

260. Patel Y, Gillet V J, Bravi G, Leach A R (2002) A comparison of the pharmacophore identification programs: Catalyst, DISCO and GASP. J. Comput. Aid. Mol. Des. 16, 653–681.

261. GALAHAD. Tripos, St. Louis, MO; http://www.tripos.com/.

262. Catalyst. Accelrys Software, San Diego, CA; http://www.accelrys.com.

263. Brooks B R, Bruccoleri R E, Olafson B D, States D J, Swaminathan S (1983) CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4, 187–217.

264. Barnum D, Greene J, Smellie A, Sprague P (1996) Identification of common functional configurations among molecules. J. Chem. Inf. Comput. Sci. 36, 563–571.


265. Li H, Sutter J, Remy H (2000) HypoGen: an automated system for generating 3D predictive pharmacophore models. In: Pharmacophore perception, development, and use in drug design. International University Line, La Jolla, CA, 49–68.

266. Martin Y C (2000) DISCO: what we did right and what we missed. In: Pharmacophore perception, development, and use in drug design. International University Line, La Jolla, CA, 49–68.

267. Phase. Schrödinger, Portland, OR; http://www.schrodinger.com/.

268. Salam N K, Nuti R, Sherman W (2009) Novel Method for Generating Structure-Based Pharmacophores Using Energetic Analysis. J. Chem. Inf. Model. 49, 2356–2368.

269. LigPrep. Schrödinger, Portland, OR; http://www.schroedinger.com/.

270. Taminau J, Thijs G, De Winter H (2008) Pharao: Pharmacophore alignment and optimization. J. Mol. Graph. Model. 27, 161–169.

271. van Drie J H, Weininger D, Martin Y C (1989) ALADDIN: An integrated tool for computer-assisted molecular design and pharmacophore recognition from geometric, steric, and substructure searching of three-dimensional molecular structures. J. Comput. Aid. Mol. Des. 3, 225–251.

272. Daylight Chemical Information Systems. Smiles ARbitrary Target Specification (SMARTS). http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html.

273. Greene J, Kahn S, Savoj H, Sprague P, Teig S (1994) Chemical Function Queries for 3D Database Search. J. Chem. Inf. Comput. Sci. 34, 1297–1308.

274. Verdonk M L, Cole J C, Watson P, Gillet V, Willett P (2001) SuperStar: improved knowledge-based interaction fields for protein binding sites. J. Mol. Biol. 307, 841–859.

275. Pastor M, Cruciani G, McLay I, Pickett S, Clementi S (2000) GRid-INdependent Descriptors (GRIND): A Novel Class of Alignment-Independent Three-Dimensional Molecular Descriptors. J. Med. Chem. 43, 3233–3243.

276. Cruciani G, Crivori P, Carrupt P A, Testa B (2000) Molecular Fields in Quantitative Structure-Permeation Relationships: the VolSurf Approach. J. Mol. Struct. 503, 17–30.

277. Cheeseright T, Mackey M, Rose S, Vinter A (2006) Molecular Field Extrema as Descriptors of Biological Activity: Definition and Validation. J. Chem. Inf. Model. 46, 665–676.

278. Martin E J, Hoeffel T J (2000) Oriented substituent pharmacophore PRopErtY space (OSPPREYS): a substituent-based calculation that describes combinatorial library products better than the corresponding product-based calculation. J. Mol. Graph. Model. 18, 383–403.

279. Wolfson H J, Rigoutsos I (1997) Geometric hashing: An overview. J. Comput. Sci. Eng. 4, 10.

280. Liu X, Jiang H, Li H (2011) SHAFTS: A Hybrid Approach for 3D Molecular Similarity Calculation. 1. Method and Assessment of Virtual Screening. J. Chem. Inf. Model. 51, 2372–2385.

281. Wagener M, Sadowski J, Gasteiger J (1995) Autocorrelation of molecular surface properties for modeling corticosteroid binding globulin and cytosolic Ah receptor activity by neural networks. J. Am. Chem. Soc. 117, 7769.

282. Reutlinger M, Koch C P, Reker D, Todoroff N, Schneider P, Rodrigues T, Schneider G (2013) Chemically advanced template search (CATS) for scaffold-hopping and prospective target prediction for “orphan” molecules. Mol. Inf. 32, 133–138.

283. Klenner A, Hartenfeller M, Schneider P, Schneider G (2010) “Fuzziness” in pharmacophore-based virtual screening and de novo design. Drug Discov. Today Technol. 7, e237–e244.


284. Leach A R, Gillet V J, Lewis R A, Taylor R (2010) Three-Dimensional Pharmacophore Methods in Drug Discovery. J. Med. Chem. 53, 539–558.

285. Molecular Operating Environment (MOE), 2013.08; Chemical Computing Group Inc., 1010 Sherbrooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7, 2015.

286. Hooft R W, Sander C, Vriend G (1996) Positioning hydrogen atoms by optimizing hydrogen-bond networks in protein structures. Proteins 26, 363–376.

287. Wolber G, Langer T (2005) LigandScout: 3-D pharmacophores derived from protein-bound ligands and their use as virtual screening filters. J. Chem. Inf. Model. 45, 160–169.

288. Claußen H, Buning C, Rarey M, Lengauer T (2001) FlexE: efficient molecular docking considering protein structure variations. J. Mol. Biol. 308, 377–395.

289. Meagher K L, Carlson H A (2004) Incorporating protein flexibility in structure-based drug discovery: using HIV-1 protease as a test case. J. Am. Chem. Soc. 126, 13276–13281.

290. Carlson H A, Masukawa K M, Rubins K, Bushman F D, Jorgensen W L, Lins R D, Briggs J M, McCammon J A (2000) Developing a dynamic pharmacophore model for HIV-1 integrase. J. Med. Chem. 43, 2100–2114.

291. Greenidge P A, Carlsson B, Bladh L G, Gillner M (1998) Pharmacophores incorporating numerous excluded volumes defined by X-ray crystallographic structure in three-dimensional database searching: application to the thyroid hormone receptor. J. Med. Chem. 41, 2503–2512.

292. Tintori C, Corradi V, Magnani M, Manetti F, Botta M (2008) Targets looking for drugs: a multistep computational protocol for the development of structure-based pharmacophores and their applications for hit discovery. J. Chem. Inf. Model. 48, 2166–2179.

293. Baroni M, Cruciani G, Sciabola S, Perruccio F, Mason J S (2007) A common reference framework for analyzing/comparing proteins and ligands. Fingerprints for Ligands and Proteins (FLAP): theory and application. J. Chem. Inf. Model. 47, 279–294.

294. Cross S, Baroni M, Carosati E, Benedetti P, Clementi S (2010) FLAP: GRID molecular interaction fields in virtual screening. Validation using the DUD data set. J. Chem. Inf. Model. 50, 1442–1450.

295. Chen J, Lai L (2006) Pocket v.2: further developments on receptor-based pharmacophore modeling. J. Chem. Inf. Model. 46, 2684–2691.

296. Deng Z D, Chuaqui C, Singh J (2004) Structural interaction fingerprint (SIFt): a novel method for analyzing three-dimensional protein-ligand binding interactions. J. Med. Chem. 47, 337–344.

297. Brewerton S C (2008) The use of protein-ligand interaction fingerprints in docking. Curr. Opin. Drug Discovery Dev. 11, 356–364.

298. Greenidge P A, Carlsson B, Bladh L G, Gillner M (1998) Pharmacophores incorporating numerous excluded volumes defined by X-ray crystallographic structure in three-dimensional database searching: application to the thyroid hormone receptor. J. Med. Chem. 41, 2503–2512.

299. Ebalunode J O, Ouyang Z, Liang J, Zheng W (2008) Novel Approach to Structure-Based Pharmacophore Search Using Computational Geometry and Shape Matching Techniques. J. Chem. Inf. Model. 48, 889–901.

300. Löwer M, Geppert T, Schneider P, Hoy B, Wessler S, Schneider G (2011) Inhibitors of Helicobacter pylori protease HtrA found by “virtual ligand” screening combat bacterial invasion of epithelia. PLoS ONE 6, e17986.


301. Kruger D M, Evers A (2010) Comparison of Structure- and Ligand-Based Virtual Screening Protocols Considering Hit List Complementarity and Enrichment Factors. ChemMedChem 5, 148–158.

302. Evers A, Hessler G, Matter H, Klabunde T (2005) Virtual Screening of Biogenic Amine-Binding G-Protein Coupled Receptors: Comparative Evaluation of Protein- and Ligand-Based Virtual Screening Protocols. J. Med. Chem. 48, 5448–5465.

303. Kumar B V, Kotla R, Buddiga R, Roy J, Singh S S, Gundla R, Ravikumar M, Sarma J A (2011) Ligand-based and structure-based approaches in identifying ideal pharmacophore against c-Jun N-terminal kinase-3. J. Mol. Model. 17, 151–163.

304. von Korff M, Freyss J, Sander T (2009) Comparison of ligand- and structure-based virtual screening on the DUD data set. J. Chem. Inf. Model. 49, 209–231.

305. de Graaf C, Rognan D (2008) Selective structure-based virtual screening for full and partial agonists of the beta2 adrenergic receptor. J. Med. Chem. 51, 4978–4985.

306. Hunter J D (2007) Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 9, 90.

307. DeLano (2002) The PyMOL Molecular Graphics System, Version 1.5.0.4 Schrödinger, LLC.

308. Phillips J C, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C, Skeel R D, Kalé L, Schulten K (2005) Scalable molecular dynamics with NAMD. J. Comput. Chem. 26, 1781–1802.

309. Ewald P P (1921) Ewald summation. Annalen der Physik 369, 253.

310. Jorgensen W L, Chandrasekhar J, Madura J D, Impey R W, Klein M L (1983) Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926.

311. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J. Mol. Graph. 14, 33–38.

312. Piana S, Lindorff-Larsen K, Shaw DE (2011) How robust are protein folding simulations with respect to force field parameterization? Biophys. J. 100, 47–49.

313. Berthold M R, Cebron N, Dill F, Gabriel T R, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2007) KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization. Springer.

314. Bauer M R, Ibrahim T M, Vogel S M, Boeckler F M (2013) Evaluation and Optimization of Virtual Screening Workflows with DEKOIS 2.0 – A Public Library of Challenging Docking Benchmark Sets. J. Chem. Inf. Model. 53, 1447–1462.

315. Mysinger M M, Carchia M, Irwin J J, Shoichet BK (2012) Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem. 55, 6582–6594.

316. Huang N, Shoichet B K, Irwin JJ (2006) Benchmarking Sets for Molecular Docking. J. Med. Chem. 49, 6789–6801.

317. Gaulton A, Bellis L J, Bento A P, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington J P (2012) ChEMBL: A large-scale bioactivity database for drug discovery. Nucl. Acids. Res. 40, D1100–D1107.

318. Irwin J J, Shoichet B K (2005) ZINC - a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182.

319. Mathers C D, Loncar D (2006) Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med. 3, e442.


320. Roberts J D, Bebenek K, Kunkel T A (1988) The accuracy of reverse transcriptase from HIV-1. Science 242, 1171–1173.

321. Pokorná J, Machala L, Řezáčová P, Konvalinka J (2009) Current and novel inhibitors of HIV protease. Viruses 1, 1209–1239.

322. Schneider J, Kent S B H (1988) Enzymatic activity of a synthetic 99 residue protein corresponding to the putative HIV-1 protease. Cell 54, 363–368.

323. Gulnik S V, Afonina E, Eissenstat M (2009) HIV‐1 Protease Inhibitors as Antiretroviral Agents. Enzyme Inhibition in Drug Discovery and Development: The Good and the Bad (eds C. Lu and A. P. Li), John Wiley & Sons, 749–810.

324. Kunze J (2011) Virtuelles screening nach allosterischen inhibitoren von proteasen. Diploma thesis, Goethe University, Frankfurt am Main.

325. Hornak V, Simmerling C (2007) Targeting structural flexibility in HIV-1 protease inhibitor binding. Drug Discov. Today 12, 132–138.

326. Lapatto R, Blundell T, Hemmings A, Overington J, Wilderspin A, Wood S, Merson J R, Whittle P J, Danley D E, Geoghegan K F, Hawrylik S J, Lee S W, Scheld K G, Hobart P M (1989) X-ray analysis of HIV-1 proteinase at 2.7 Å resolution confirms structural homology among retroviral enzymes. Nature 342, 299–302.

327. Winston A, Boffito M (2005) The management of HIV-1 protease inhibitor pharmacokinetic interactions. J. Antimicrob. Chemother. 56, 1–5.

328. Johnson V A, Calvez V, Günthard H F, Paredes R, Pillay D, Shafer R, Wensing A M, Richman DD (2011) 2011 update of the drug resistance mutations in HIV-1. Top. Antivir. Med. 19, 156–64.

329. Perryman A L, Zhang Q, Soutter H H, Rosenfeld R, McRee D E, Olson AJ, Elder J E, David Stout C (2010) Fragment-Based Screen against HIV Protease. Chem. Biol. Drug. Des. 75, 257–268.

330. Bloch K (1992) Sterol molecule: structure, biosynthesis, and function. Steroids 57, 378–383.

331. Seet M (2013) New antimalarials and herbicides inhibitors of the enzymes 4-diphosphocytidyl-2-C-methyl-D-erythritol synthase (IspD) and serine hydroxymethyltransferase (SHMT). PhD thesis, Eidgenössische Technische Hochschule Zürich.

332. Jomaa H, Wiesner J, Sanderbrand S, Altincicek B, Weidemeyer C, Hintz M, Türbachova I, Eberl M, Zeidler J, Lichtenthaler H K, Soldati D, Beck E (1999) Inhibitors of the nonmevalonate pathway of isoprenoid biosynthesis as antimalarial drugs. Science 285, 1573–1576.

333. Webb B, Sali A (2014) Comparative Protein Structure Modeling Using MODELLER. Curr. Protoc. Bioinform. 47, 5.6:5.6.1–5.6.32.

334. Gerber P R, Müller K (1995) MAB, a generally applicable molecular force field for structure modelling in medicinal chemistry. J. Comput.-Aided Mol. Des. 9, 251–268.

335. Schellhammer I, Rarey M (2004) FlexX-Scan: Fast, structure-based virtual screening. Proteins 57, 504–517.

336. Radifar M, Yuniarti N, Istyastono E P (2013) PyPLIF-ASSISTED REDOCKING INDOMETHACIN-(R)-ALPHA-ETHYL-ETHANOLAMIDE INTO CYCLOOXYGENASE-1. Indo. J. Chem. 13, 283–286.

337. Schreyer A M, Blundell T (2012) USRCAT: real-time ultrafast shape recognition with pharmacophoric constraints. Journal of Cheminformatics 4, 27.


338. Kabsch W (1976) A solution for the best rotation to relate two sets of vectors. Acta Cryst. A 32, 922.

339. Murshudov G N, Dodson E J (1997) Simplified error estimation a la Cruickshank in macromolecular crystallography. CCP4 Newsletter on Protein Crystallography 33, 31.

340. Witschel M C, Höffken H W, Seet M, Parra L, Mietzner T, Thater F, Niggeweg R, Röhl F, Illarionov B, Rohdich F, Kaiser J, Fischer M, Bacher A, Diederich F (2011) Inhibitors of the Herbicidal Target IspD: Allosteric Site Binding. Angew. Chem. Int. Ed. 50, 7931–7935.

341. Baumlova A, Chalupska D, Róźycki B, Jovic M, Wisniewski E, Klima M, Dubankova A, Kloer D P, Nencka R, Balla T, Boura E (2014) The crystal structure of the phosphatidylinositol 4-kinase IIα. EMBO Rep. 15, 1085–1092.

342. Rettenmaier T J, Sadowsky J D, Thomsen N D, Chen S C, Doak A K, Arkin M R, Wells J A (2014) A small-molecule mimic of a peptide docking motif inhibits the protein kinase PDK1. Proc. Natl. Acad. Sci. USA 111, 18590–18595.

343. Kocher O, Birrane G, Tsukamoto K, Fenske S, Yesilaltay A, Pal R, Daniels K, Ladias J A, Krieger M (2010) In vitro and in vivo analysis of the binding of the C terminus of the HDL receptor scavenger receptor class B, type I (SR-BI), to the PDZ1 domain of its adaptor protein PDZK1. J. Biol. Chem. 285, 34999–35010.

344. Robbins A H, Coman R M, Bracho-Sanchez E, Fernandez M A, Gilliland C T, Li M, Agbandje-McKenna M, Wlodawer A, Dunn B M, McKenna R (2010) Structure of the unbound form of HIV-1 subtype A protease: comparison with unbound forms of proteases from other HIV subtypes. Acta Crystallogr. D Biol. Crystallogr. 66, 233–42.

345. Spinelli S, Liu Q Z, Alzari P M, Hirel P H, Poljak R J (1991) The three-dimensional structure of the aspartyl protease from the HIV-1 isolate BRU. Biochimie 73, 1391–1396.

346. Schwartz J, Awale M, Reymond J L (2013) SMIfp (SMILES fingerprint) Chemical Space for Virtual Screening and Visualization of Large Databases of Organic Molecules. J. Chem. Inf. Model. 53, 1979–1989.

347. Schneider G, Fechner U (2005) Computer-based de novo design of drug-like molecules. Nat. Rev. Drug Discov. 4, 649–663.

348. StaR. Heptares therapeutics, Hertfordshire, UK; http://www.heptares.com/.

349. Benesch R, Benesch R E, Yu C I (1968) Reciprocal binding of oxygen and diphosphoglycerate by human hemoglobin. Proc. Natl. Acad. Sci. USA 59, 526–532.

350. World Health Organization, http://www.who.int/entity/hiv/data/en/

351. Sharp P M, Hahn B H (2011) Origins of HIV and the AIDS pandemic. Cold Spring Harb. Perspect. Med. 1, a006841.

352. Hammer S M, Katzenstein D A, Hughes M D, Gundacker H, Schooley R T, Haubrich R H, Henry W K, Lederman M M, Phair J P, Niu M, Hirsch M S, Merigan T C (1996) A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. N. Engl. J. Med. 335, 1081–1090.

353. Gabrielsen M, Kaiser J, Rohdich F, Eisenreich W, Laupitz R, Bacher A, Bond C S, Hunter W N (2006) The crystal structure of a plant 2C-methyl-D-erythritol 4-phosphate cytidylyltransferase exhibits a distinct quaternary structure compared to bacterial homologues and a possible role in feedback regulation for cytidine monophosphate. FEBS J. 273, 1065–73.


354. Reker D, Rodrigues T, Schneider P, Schneider G (2014) Identifying the macromolecular targets of de novo-designed chemical entities through self-organizing map consensus. Proc. Natl. Acad. Sci. USA 111, 4067–4072.

355. Christopoulos A (2002) Allosteric binding sites on cell-surface receptors: novel targets for drug discovery. Nat. Rev. Drug Discov. 1, 198–210.

356. Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289–290.

357. R Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

358. Förster T (1948) Zwischenmolekulare Energiewanderung und Fluoreszenz. Ann. Phys. 437, 55–75.

359. Hopkins A L, Groom C R, Alex A (2004) Ligand efficiency: A useful metric for lead selection. Drug. Discov. Today 9, 430–431.

360. Brohm D, Metzger S, Bhargava A, Müller O, Lieb F, Waldmann H (2002) Natural products are biologically validated starting points in structural space for compound library development: solid-phase synthesis of dysidiolide-derived phosphatase inhibitors. Angew. Chem. Int. Ed. Engl. 41, 307–311.

361. Durrant J D, McCammon J A (2011) Molecular dynamics simulations and drug discovery. BMC Biology 9, 71.

362. Rhodes G (2006) Crystallography Made Crystal Clear. 3rd edition. Elsevier Inc.


8 Appendix

Appendix I: Feature definition file for ligand-based pharmacophores

#
# CATS2 and LIQUID potential pharmacophore point definition
# RDKit implementation by Daniel Reker & Jens Kunze
#
# CarbonLipophilic = C adjacent to only C and H
AtomType Carbon_AttachedOther [#6;$([#6]~[#7,#8,#9,#15,#16,#17,#35,#53,#14,#5,#34])]
AtomType CarbonLipophilic [#6;+0;!{Carbon_AttachedOther}]
# ClBrI = Cl or Br or I
AtomType ClBrI [#17,#35,#53]
# SC2 = S with two neighboring carbons
AtomType SC2 [#16;X2]([#6])[#6]
DefineFeature SingleAtomLipophilic [!a;{CarbonLipophilic},{ClBrI},{SC2}]
Family Hydrophobe
Weights 1
EndFeature

# HBD
# hydroxy = O and H1
# nitrogenWithOneOrTwoHydrogen = N and H1 H2 or H3
AtomType Hydroxylgroup [O;H1;+0]
AtomType NH_NH2_NH3 [#7;H1,H2,H3;+0]
DefineFeature SingleAtomDonor [{Hydroxylgroup},{NH_NH2_NH3}]


Family Donor
Weights 1
EndFeature

# HBA
# oxygenAtom = O
# nitrogenNoHydrogen = N and H0
# fluorChlorine = F or Cl
AtomType OxygenAtom [#8]
AtomType NH0 [#7;H0;+0]
AtomType FlCl [#9,#17]
DefineFeature SingleAtomAcceptor [{OxygenAtom},{NH0},{FlCl}]
Family Acceptor
Weights 1
EndFeature

# P = positiveCharge
# posCharge = charge>0
# NH2 = N and H2
AtomType PosCharge [+,++,+++,++++,++++]
AtomType NH2 [#7;H2]
DefineFeature SingleAtomPositive [{PosCharge},{NH2}]
Family PosIonizable
Weights 1
EndFeature

# N = negativeCharge
# negCharge = charge<0
# CSPOOH = C/S/P(=O)OH1
AtomType NegCharge [-,--,---,----]
AtomType CSPOOH [C,S,P](=O)-[O;H1]


DefineFeature SingleAtomNegative [{NegCharge},{CSPOOH}]
Family NegIonizable
Weights 1
EndFeature

# Aromatic
# AtomType Aromatic [a]
# DefineFeature SingleAtomAromatic [{Aromatic}]
# Family Aromatic
# Weights 1
# EndFeature

AtomType AromR4 [a]
DefineFeature Arom4 [{AromR4}]1[{AromR4}][{AromR4}][{AromR4}]1
Family Aromatic
Weights 1.0,1.0,1.0,1.0
EndFeature

AtomType AromR5 [a]
DefineFeature Arom5 [{AromR5}]1[{AromR5}][{AromR5}][{AromR5}][{AromR5}]1
Family Aromatic
Weights 1.0,1.0,1.0,1.0,1.0
EndFeature

AtomType AromR6 [a]
DefineFeature Arom6 [{AromR6}]1[{AromR6}][{AromR6}][{AromR6}][{AromR6}][{AromR6}]1
Family Aromatic
Weights 1.0,1.0,1.0,1.0,1.0,1.0
EndFeature

AtomType AromR7 [a]
DefineFeature Arom7 [{AromR7}]1[{AromR7}][{AromR7}][{AromR7}][{AromR7}][{AromR7}][{AromR7}]1
Family Aromatic
Weights 1.0,1.0,1.0,1.0,1.0,1.0,1.0
EndFeature


AtomType AromR8 [a]
DefineFeature Arom8 [{AromR8}]1[{AromR8}][{AromR8}][{AromR8}][{AromR8}][{AromR8}][{AromR8}][{AromR8}]1
Family Aromatic
Weights 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
EndFeature
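The block structure of the definition file above (AtomType lines, then DefineFeature ... Family ... Weights ... EndFeature) can be checked with a few lines of Python. The following is only a minimal sketch using the standard library; the function name `summarize_fdef` and the shortened sample text are illustrative assumptions and not part of the thesis software. In practice such a file would typically be handed to RDKit, for example via `ChemicalFeatures.BuildFeatureFactory`, rather than parsed by hand.

```python
from collections import defaultdict


def summarize_fdef(text):
    """Map each Family name to the DefineFeature names assigned to it.

    Hypothetical helper for a quick sanity check of a feature
    definition file; it does not interpret the SMARTS patterns.
    """
    families = defaultdict(list)
    current = None  # name of the feature block currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (incl. commented-out features)
        tokens = line.split()
        if tokens[0] == "DefineFeature":
            current = tokens[1]
        elif tokens[0] == "Family" and current is not None:
            families[tokens[1]].append(current)
        elif tokens[0] == "EndFeature":
            current = None
    return dict(families)


# Shortened excerpt of the definitions listed in Appendix I.
sample = """\
AtomType Hydroxylgroup [O;H1;+0]
AtomType NH_NH2_NH3 [#7;H1,H2,H3;+0]
DefineFeature SingleAtomDonor [{Hydroxylgroup},{NH_NH2_NH3}]
Family Donor
Weights 1
EndFeature
# DefineFeature SingleAtomAromatic [{Aromatic}]
# Family Aromatic
AtomType AromR6 [a]
DefineFeature Arom6 [{AromR6}]1[{AromR6}][{AromR6}][{AromR6}][{AromR6}][{AromR6}]1
Family Aromatic
Weights 1.0,1.0,1.0,1.0,1.0,1.0
EndFeature
"""

summary = summarize_fdef(sample)
print(summary)
```

Applied to the full file, such a summary should list one feature each for the Hydrophobe, Donor, Acceptor, PosIonizable and NegIonizable families, and the five ring-size features Arom4 to Arom8 for the Aromatic family.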

Appendix II: Parameter file for the showcase of the pharmacophore search workflow

################# Input / Output #######################################
# PDB file
inPDB /Volumes/Platte2/Showcase/3ixo_noh2o_p3d.pdb
# Protein name
prot_name HIV_test
# Should the protein be protonated with MOE?
p3d true
# Pocket Python script
inPP /Volumes/Platte2/Showcase/Data/GridBuriedness18ConnectedAlternable_fail.py
# Pocket grid data file *.txt
inPPTXT /Volumes/Platte2/Showcase/Data/GridData18Connected.txt
# Working and output directory
outPath /Volumes/Platte2/Showcase/refine

############Parameters for Virtual Ligand###############################
# VL Obligatory Parameters. (Lipo, Don, Acc, Aro, Neg, Pos, #Bins, stepWidth, Scalingmode)
VLCoreParam 1.4 2.2 2.6 1.6 0 0 20 1 0
#####Optional Parameters: ChangeGrid to 0.5Å, useBfac, #Steps to do so, refineGrid, writeGrid
changeGrid true
useBfac true
steps 3


refine false
writeGrid true
useMD false

##################### MD parameters #################################
# Give a serialFile with all MD snapshots (full path)
MDSerialfile /Users/modlab/Desktop/serilatest.txt
# align_mode can be “local” or “global”
align_mode local
# When local, give radius around the pocket you want to align locally
align_radius 3.0
# Do you want to keep the temporary files
keepTmp true
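The parameter file is a plain list of "key value" lines with "#" comments. As a rough illustration of how such a file could be read, the sketch below parses the lines into a dictionary and splits the nine VLCoreParam numbers into named fields, following the order given in the file's own comment (Lipo, Don, Acc, Aro, Neg, Pos, #Bins, stepWidth, Scalingmode). The function names, the field names and the parsing rules are assumptions made for this example; the actual reader used by the workflow is not shown in the thesis.

```python
# Field order taken from the comment line above VLCoreParam;
# the Python identifiers themselves are hypothetical.
VL_FIELDS = ("lipo", "don", "acc", "aro", "neg", "pos",
             "bins", "step_width", "scaling_mode")


def parse_params(text):
    """Read 'key value' lines into a dict of strings; '#' lines are comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split key from the rest of the line
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


def parse_vl_core(value):
    """Split the VLCoreParam value into its nine named numbers."""
    numbers = [float(x) for x in value.split()]
    if len(numbers) != len(VL_FIELDS):
        raise ValueError(f"expected {len(VL_FIELDS)} values, got {len(numbers)}")
    return dict(zip(VL_FIELDS, numbers))


def as_bool(value):
    """Interpret the 'true'/'false' switches used in the file."""
    return value.strip().lower() == "true"


# Excerpt of the parameter file in Appendix II.
sample = """\
# Protein name
prot_name HIV_test
VLCoreParam 1.4 2.2 2.6 1.6 0 0 20 1 0
useBfac true
useMD false
align_mode local
align_radius 3.0
"""

params = parse_params(sample)
vl = parse_vl_core(params["VLCoreParam"])
print(params["prot_name"], vl["lipo"], as_bool(params["useMD"]))
```

Keeping all values as strings until they are needed avoids guessing the type of each key; only the entries whose meaning is documented in the file's comments are converted here.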

Appendix III: Lipophilic cut-off values for the DUD-E evaluations

Target  Cut-off    Target  Cut-off    Target  Cut-off
aa2ar   16         fabp4   8          mmp13   8
abl1    7          fak1    13         mp2k1   20
ace     6          fgfr1   10         nos1    12
aces    9          fkb1a   5          nram    7
ada     5          fnta    15         pa2ga   8
ada17   16         fpps    11         parp1   15
adrb1   10         gcr     14         pde5a   16
adrb2   11         glcm    8          pgh1    11
akt1    17         gria2   7          pgh2    18
akt2    14         grik1   10         plk1    14
aldr    8          hdac2   18         pnph    13
ampc    18         hdac8   10         ppara   9
andr    9          hivint  11         ppard   11
aofb    10         hivpr   14         pparg   11
bace1   9          hivrt   15         prgr    11
braf    14         hmdh    13         ptn1    8
cah2    6          hs90a   15         pur2    11
casp3   11         hxk4    8          pygm    8
cdk2    17         igf1r   12         pyrd    11
comt    15         inha    9          reni    12
cp2c9   18         ital    9          rock1   14
cp3a4   14         jak2    17         rxra    9
csf1r   11         kif11   13         sahh    7
cxcr4   21         kit     14         src     13


def     11         kith    9          tgfr1   9
dhi1    17         kpcb    19         thb     13
dpp4    10         lck     8          thrb    9
drd3    19         lkha4   16         try1    8
dyr     7          mapk2   13         tryb1   11
egfr    13         mcr     11         tysy    12
esr1    12         met     10         urok    8
esr2    12         mk01    12         vgfr2   9
fa7     10         mk10    12         wee1    11
fa10    9          mk14    13         xiap    13

Appendix IV: Geometric interaction rules for PPP description

Interaction type            Atom property    Distance (Å) and angle
Donor (in the pocket)       N aromatic       2.7-3.2, in plane ±30 degree
                            O                2.6-3.1, angle 104-180
Acceptor (in the pocket)    O                2.6-3.0, angle is above 150
                            N                2.7-3.2, angle is above 150
Lipophilic                  *                3.3-4.4
Aromatic                    C,N              inPlane 3.3-4.4
                                             abovePlane 3.4-4.2
                                             parallelDisplaced 3.4-3.6
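To make the rules in the table concrete, they can be written as a small lookup of distance windows and angle conditions. The sketch below is one possible encoding and not the thesis implementation: the dictionary keys, the function name `satisfies`, the inclusive treatment of the distance limits, and the reading of angles as degrees are all assumptions. The table does not state between which atoms the angles are measured, so the angle is simply taken as a precomputed input.

```python
# (interaction type, atom or geometry label) -> (min distance, max distance, angle test)
# Distances in Å as listed in Appendix IV; angle tests assume degrees.
RULES = {
    ("donor", "N_aromatic"): (2.7, 3.2, lambda a: abs(a) <= 30.0),  # in-plane deviation
    ("donor", "O"): (2.6, 3.1, lambda a: 104.0 <= a <= 180.0),
    ("acceptor", "O"): (2.6, 3.0, lambda a: a > 150.0),
    ("acceptor", "N"): (2.7, 3.2, lambda a: a > 150.0),
    ("lipophilic", "*"): (3.3, 4.4, None),
    ("aromatic", "inPlane"): (3.3, 4.4, None),
    ("aromatic", "abovePlane"): (3.4, 4.2, None),
    ("aromatic", "parallelDisplaced"): (3.4, 3.6, None),
}


def satisfies(interaction, label, distance, angle=None):
    """Return True if a contact fulfils the tabulated distance and angle rule."""
    d_min, d_max, angle_ok = RULES[(interaction, label)]
    if not d_min <= distance <= d_max:
        return False
    if angle_ok is None:
        return True  # rule has no angle condition
    # A rule with an angle condition cannot be confirmed without an angle.
    return angle is not None and angle_ok(angle)


print(satisfies("acceptor", "O", 2.8, 160.0))
print(satisfies("lipophilic", "*", 4.0))
```

A table-driven encoding like this keeps the numerical limits in one place, so a changed cut-off only requires editing the corresponding entry.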