QSPR-QSAR Studies on Desired Properties for Drug Design


Editor

Eduardo A. Castro

Professor of Physical Chemistry and Theoretical Chemistry, La Plata National University INIFTA (UNLP-CONICET), Suc.4, C.C. 16, La Plata 1900, Buenos Aires, Argentina

Research Signpost, T.C. 37/661 (2), Fort P.O., Trivandrum-695 023 Kerala, India

Published by Research Signpost
2010; Rights Reserved
Research Signpost, T.C. 37/661 (2), Fort P.O., Trivandrum-695 023, Kerala, India
Editor: Eduardo A. Castro
Managing Editor: S.G. Pandalai
Publication Manager: A. Gayathri
Research Signpost and the Editor assume no responsibility for the opinions and statements advanced by contributors.
ISBN: 978-81-308-0404-0

Preface

The application of chemoinformatics in the drug discovery process is today a wide field of theoretical and applied research. The development of a new drug is both time-consuming and costly. The long periods that still elapse before a new compound reaches the market are coupled with a high financial effort. Despite the introduction of combinatorial chemistry and the establishment of high-throughput screening techniques, the average number of new chemical drugs introduced annually to the world market is relatively low. The major reasons for attrition of new drugs are lack of clinical efficacy, inappropriate pharmacokinetics, animal toxicity, adverse reactions in humans, commercial reasons, and formulation issues. As a consequence, the prediction of pharmacological properties, in addition to lead finding, is a central task of chemoinformatics in drug development. QSAR/QSPR theories seek quantitative structure-activity and structure-property relationships, respectively, in order to make sound inferences about which structure might have a desired biological activity and/or a given physicochemical property. The approaches used to propose useful QSAR/QSPR relationships have grown markedly in recent years, and the related predictions have become accurate enough that their place in drug design is widely recognized. The purpose of this book is to present some valuable and up-to-date contributions from frontier QSAR/QSPR studies on desired properties for drug design, and I hope that readers will find it a relevant document and a useful set of alternative methods and valuable results.

Eduardo A. Castro

Contents

Chapter 1. Quantitative structure-activity relationship (QSAR) studies on bioactive cyclopeptides
Alicia B. Pomilio, Stella M. Battista and Arturo A. Vitale

Chapter 2. QSPR studies on amino acids: Application to proteins
Francisco Torrens and Gloria Castellano

Chapter 3. Molecular topology in QSAR and drug design studies
J. Gálvez and R. García-Domenech

Chapter 4. QSAR models constructed by optimal descriptors and by multiple regression analysis for the prediction of carbonic anhydrase II inhibitory activity of substituted aromatic sulfonamides
Georgia Melagraki, Antreas Afantitis, Andrey A. Toropov, Haralambos Sarimveis and Olga Igglessi-Markopoulou

Chapter 5. From molecular structure to molecular design through the Molecular Descriptors Family Methodology
Sorana D. Bolboacă and Lorentz Jäntschi

Chapter 6. Selection of an optimal set of descriptors: use of the Enhanced Replacement Method
Andrew G. Mercader

Chapter 7. Orthogonalization methods in QSPR-QSAR Studies
Pablo R. Duchowicz, Francisco M. Fernández and Eduardo A. Castro

Chapter 8. Modeling some chemical reactions as a spatio-temporal discrete dynamical systems
Juan Luis García Guirao


QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 1-34 ISBN: 978-81-308-0404-0 Editor: Eduardo A. Castro

1. Quantitative structure-activity relationship (QSAR) studies on bioactive cyclopeptides

Alicia B. Pomilio1,*, Stella M. Battista1 and Arturo A. Vitale2,*

1PRALIB (UBA-CONICET), Facultad de Farmacia y Bioquímica, Universidad de Buenos Aires, Junín 956, C1113AAD Buenos Aires, Argentina; 2Research Institute INIFTA (UNLP-CCT La Plata-CONICET), Universidad Nacional de La Plata, Diag. 113 y 64, Suc. 4, CC 16, B1900 La Plata, Argentina

Abstract. Cyclopeptides show a variety of structures, from cyclodipeptides through cyclohepta- and octapeptides to cyclotides with 30 amino acids, which can be found in bacteria, insects, higher plants, fungi, animals, and humans. Peptides and in particular cyclopeptides are therapeutic targets which play an important role in medicinal chemistry, including drug design and quantitative structure-activity relationships (QSARs). However, few QSAR studies have been carried out on cyclopeptides in comparison with small organic molecules. Some structural features, substructures, and functional groups contribute to enhance the bioactivity. Configurational and conformational studies of cyclopeptides from plant and fungal origin were analysed, and related to antitumour activity and toxicity. Some physico-chemical properties, e.g., partition coefficient (log P) and several molecular parameters have been found to be relevant for activity. QSAR models, including a variety of descriptors, have been applied to synthetic and natural

* Research Members of the National Research Council of Argentina (CONICET). These authors contributed equally to this work.

Correspondence/Reprint request: Prof. Dr. Alicia B. Pomilio, PRALIB (UBA-CONICET), Facultad de Farmacia y Bioquímica, Universidad de Buenos Aires, Junín 956, C1113AAD Buenos Aires, Argentina E-mail: [email protected]


cyclopeptides as well as to their analogues. The balance between antimicrobial and haemolytic properties for the design of antimicrobial cyclic peptides is discussed. Considerations about anticancer and cytotoxic activity of cyclopeptides are also included. Studies of this kind also help in the design of selective drugs.

Abbreviations

QSARs, quantitative structure-activity relationships; SARs, structure-activity relationships; NMR, nuclear magnetic resonance; MS, mass spectrometry; ESI-MS, electrospray ionization mass spectrometry; LPS, lipopolysaccharide; LA, lipid A; MIC, minimum inhibitory concentration; IC50, concentration producing 50% inhibition; 2D-NMR, two-dimensional NMR; 3D-QSAR, three-dimensional QSAR; Abu, L-α-aminobutyric acid; Ala, alanine; Arg, arginine; Asn, asparagine; Cys, cysteine; β-Phe, β-phenylalanine; Phe, phenylalanine; Gly, glycine; Leu, leucine; Lys, lysine; Pro(Cl2), β,γ-dichlorinated proline; HOPro, hydroxyproline; D-Pro, D-proline; Pro, proline; Ser, serine; alloThr, allothreonine; Thr, threonine; Trp, tryptophan; Tyr, tyrosine; Val, valine.
A, alanine; C, cysteine; D, aspartic acid; E, glutamic acid; F, phenylalanine; G, glycine; H, histidine; I, isoleucine; K, lysine; L, leucine; M, methionine; N, asparagine; P, proline; Q, glutamine; R, arginine; S, serine; T, threonine; V, valine; W, tryptophan; Y, tyrosine.

Introduction

Proteins and peptides play an important role in biological systems and affect most physiological processes of plants, animals and microorganisms, and in particular of humans, usually acting as agonists and/or antagonists at specific receptors. Bioactive cyclopeptides or cyclic peptides have been found in animals, plants, fungi, bacteria, insects and humans. Cyclopeptides have shown cytotoxic, cytostatic, antifungal, antiviral, antibacterial, plant-stimulating, insecticidal, antimalarial, estrogenic, sedative, nematicidal, immunosuppressive, and enzyme-inhibitory activities.
Recently, the occurrence, structures and bioactivity of cyclopeptides in higher plants and higher fungi have been reported [1]. In fact, many biologically active proteins and peptides, including cyclopeptides, are involved in disease processes, and the abnormal expression of these peptides has also been associated with human disease [2]. The experimental demonstration of the occurrence of antimicrobial peptides in several unexplained human inflammatory disorders can provide novel therapeutic approaches to the treatment of disease. Therefore, these compounds can be therapeutic targets, which can be used in the development of future drugs [3].


A variety of antimicrobial peptides are naturally produced by different organisms through either ribosomal (defensins and small bacteriocins) or non-ribosomal synthesis (peptaibols, cyclopeptides and pseudopeptides). Some of these natural antimicrobial peptides are used for the design of new synthetic analogues, and have also been expressed in transgenic plants to confer disease protection. These compounds are also secreted by microorganisms, and are active ingredients of commercial biopesticides [4,5]. However, most native peptides have limited applicability as drug candidates due to: (a) rapid metabolism by proteolysis, (b) poor absorption from the gastrointestinal tract and poor transport across the blood-brain barrier, (c) rapid excretion by the liver and kidneys, and (d) lack of receptor specificity due to conformational flexibility [6]. Therefore, the structure and key pharmacophore groups of native peptides are usually converted into new nonpeptidic drug candidates [7], the so-called peptidomimetics. The development of peptidomimetic protease inhibitors has been made feasible in recent years by three-dimensional (3D) structural information on proteases from X-ray diffraction and NMR spectroscopy [8,9]. Until the 3D structures of the important group of seven-transmembrane G-protein-coupled receptors (GPCRs) were resolved [10], it was necessary to rely on homology modelling of the receptors, based on the structure of bovine rhodopsin, and on site-specific mutagenesis to provide important clues in the development of new drugs for these receptors [11].

Structural features of the peptides

Peptides have a large degree of conformational freedom due to their freely rotating bonds, thus giving rise to a large number of conformations. However, in solution and in the absence of receptors or enzymes the biologically active conformation may be poorly populated.
Conformationally constrained analogues can significantly contribute to the identification of these conformations. Hence, peptidomimetics provide valuable information on SARs of both peptides and complex proteins [12]. Once the primary structure of the biologically active peptide has been identified, the first design step is to remove the amino acids from the amino and carboxy termini, one at a time, in order to obtain the smallest biologically active peptide fragment. Subsequently, side chain requirements, e.g., pharmacophore groups, can be determined by systematically replacing each residue in the peptide with a specific amino acid and evaluating the biological activity. The most often used amino acid is Ala, but occasionally Gly is used [13].
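
The truncation and residue-replacement steps described above can be sketched in code. The following Python fragment is only an illustration and is not taken from the cited work: the function names, the `min_length` cut-off and the example sequence (Leu-enkephalin, YGGFL, in one-letter code) are assumptions chosen for the example. It enumerates the candidate peptides that would then have to be synthesized and assayed; it does not predict activity.

```python
def terminal_truncations(sequence, min_length=2):
    """List fragments obtained by removing residues one at a time.

    Returns two lists: N-terminal truncations and C-terminal truncations,
    each stopping once the fragment would be shorter than min_length
    (a hypothetical cut-off chosen by the user).
    """
    steps = range(1, len(sequence) - min_length + 1)
    n_truncated = [sequence[i:] for i in steps]
    c_truncated = [sequence[:-i] for i in steps]
    return n_truncated, c_truncated


def alanine_scan(sequence, replacement="A"):
    """List (position, variant) pairs with one residue replaced at a time.

    Positions are 1-based. Residues that already equal the replacement
    (Ala by default, Gly if replacement="G") are skipped.
    """
    variants = []
    for index, residue in enumerate(sequence):
        if residue == replacement:
            continue
        variant = sequence[:index] + replacement + sequence[index + 1:]
        variants.append((index + 1, variant))
    return variants


if __name__ == "__main__":
    peptide = "YGGFL"  # Leu-enkephalin, used here only as an example
    print(terminal_truncations(peptide, min_length=3))
    for position, variant in alanine_scan(peptide):
        print(position, variant)
```

Each variant would be tested experimentally, and a marked loss of activity on replacing a given residue would point to a side chain that belongs to the pharmacophore.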


Constrained structures mimicking the bioactive conformation will be less exposed to proteolytic cleavage, may give more selective ligands, providing an entropy advantage in receptor binding compared to the more flexible linear peptides [12,14]. After the SAR of each amino acid in the peptide has been determined the next step is to try to establish the bioactive conformation(s). The conformational freedom of the highly flexible peptide can be reduced by the introduction of local and/or global constraints. Methods of obtaining local constraints include incorporation of modified amino acids, e.g., D-amino acids and N-methyl amino acids, introduction of modified amide bonds or short-range cyclizations forming a link between two backbone termini, between one of the termini and one side chain, between two side chains, or between backbone atoms other than the termini. Secondary structure mimetics can also be used as constraints. When incorporated into a peptide, a secondary structure mimetic enforces a particular conformation. Incorporation of such a moiety provides additional information about requirements for receptor binding and/or activation [14]. SARs of the constrained analogues together with information obtained from NMR spectroscopy and molecular modelling can, in an iterative process, give a 3D pharmacophore model of the bioactive conformation. The last step is to introduce the pharmacophore groups onto a nonpeptidic scaffold in the correct spatial arrangement, in agreement with the obtained model of the bioactive conformation [12-14]. The receptor-based screening of natural and synthetic compound collections has proved to be a useful method for identifying peptidomimetics [11]. Furthermore, design, combinatorial chemistry and classical medicinal chemistry play important roles. Hirschmann et al. [15] reported an example in which pharmacophore modelling was used to obtain a peptidomimetic agonist of the cyclopeptide hormone somatostatin. 
Furthermore, a systematic exploration of the conformational space for a series of analogues of FC131, a cyclopentapeptide CXCR4 antagonist, has been recently performed, thus leading to a minimalistic 3D pharmacophore model for binding of these cyclopentapeptides [16]. The chemokine receptor CXCR4 is involved in HIV entry, and therefore, is an attractive target for antiretroviral drugs. The most essential conformational components of peptides and proteins are the secondary structure elements: α-helices, β-sheets, reverse turns and loops. These elements are often located on the protein surface and seem to act as molecular recognition sites in biological processes. Even small peptide fragments can fold into turn conformations in which the amino acid side chains are displayed on the surface of a compact backbone core [17]. There is


a lot of evidence suggesting that the side chain groups in the peptide are the most important recognition elements in peptide-receptor interaction [11]. Reverse turns can be divided into β-turns and γ-turns. One of the most common structural motifs in proteins is the β-turn [18]. To be considered a β-turn, the tetrapeptide sequence should not be part of an α-helical region, and the distance between the Cα atoms of the first and fourth amino acid residues should be ≤ 7 Å. A number of slightly different turn types have been reported [19]. A hydrogen bond between the carbonyl group of the first residue and the NH group of the fourth residue is often present and stabilizes the turn in a pseudo-ten-membered ring structure. A γ-turn, which is a rarer reverse turn, consists of three amino acid residues. It is defined by a hydrogen bond between the carbonyl group of the first residue and the NH group of the third residue, forming a pseudo-seven-membered ring. γ-Turns are divided into two classes, inverse and classic. The second side chain is oriented in an equatorial position in the more common inverse γ-turn, while the rare classic γ-turn has an axial second side chain. γ-Turns are rare in proteins but are frequently found in small peptides, especially in small cyclic peptides [20].

Cyclotides

Cyclotides are small plant proteins, i.e., cyclic disulfide-rich plant peptides of about 30 amino acids, found in plants of the Rubiaceae, Violaceae and Cucurbitaceae families [21], and are believed to be part of the host defence system [22] on the basis of their high expression levels in plants, and their toxic and growth-retardant activity in feeding trials against Helicoverpa spp. insect pests [23]. These compounds have a macrocyclic peptide backbone and a cystine-knotted arrangement made up of six conserved cysteine residues (three conserved disulfide bonds), which makes them very stable [24].
This unique structure, together with a variety of biological activities, makes them of great interest as possible leads in drug development or as carriers of grafted peptide sequences [25,26]. A database of cyclic protein sequences and structures, with applications in protein discovery and engineering is known [27] as well as chemical and biomimetic total syntheses of natural and engineered MCoTI cyclotides [28]. Molecular dynamics simulations and MM-PBSA free energy calculations have been carried out to elucidate structure and folding of these disulfide-rich miniproteins [29]. NMR spatial structure of ternary complex Kalata B7/Mn2+/DPC micelle has been recently reported [30].


The insecticidal activity of cyclotides and the comparison with structurally similar cystine knot proteins from peas (Pisum sativum) and an amaranth crop plant (Amaranthus hypochondriacus) have been recently reviewed [23]. In addition to their insecticidal effect, cyclotides have also been shown to be cytotoxic, anti-HIV [31], anthelmintic [32,33], molluscicidal [34], antimicrobial and haemolytic agents (see Anticancer and cytotoxic activities section).

Configurational and conformational studies on bioactive cyclopeptides

Conformational studies of the natural cyclopeptides and their derivatives were related to the stereochemical requirements for a bioactive compound. Antitumour cyclic pentapeptides, astins A-H, have been isolated from Aster tataricus (Asteraceae), which is known as a Chinese medicine and as an ornamental higher plant [35]. Astins contain a 16-membered ring system with a unique mono- or dichlorinated proline and/or allothreonine residues. The main active principle, astin B, contains a β,γ-dichlorinated Pro, an alloThr, a Ser, a β-Phe and an α-aminobutyric acid, and was identified as cyclo(Pro(Cl2)-alloThr-Ser-β-Phe-Abu). Astin C was identified as cyclo(Pro(Cl2)-Abu-Ser-β-Phe-Abu) [36]. Thus, astins A-C (Fig. 1) proved to be similar to cyclochlorotine, cyclo(Pro(Cl2)-Abu-Ser-β-Phe-Ser), which has been isolated from Penicillium islandicum Sopp. (Fig. 2). In order to understand the mechanisms involved in the action of cyclic peptides, it was necessary to assess their conformational characteristics. The conformational analysis of the antitumour astin B was performed by 2D NMR techniques [9], temperature effects on NH protons, rate of hydrogen-deuterium exchange, vicinal NH-CαH coupling constants, and NOE experiments [37].

[Structure diagram not reproduced; substituents R1 and R2 as follows]
Astin A: R1 = H; R2 = OH
Astin B: R1 = OH; R2 = H
Astin C: R1 = H; R2 = H

Figure 1. Structures of Astin A, B and C from the higher plant Aster tataricus (Asteraceae).



Figure 2. Structure of cyclochlorotine isolated from Penicillium islandicum.

The combination of 2D NMR analysis with molecular dynamics and mechanics calculations allowed the energetically favourable conformation of astin B in solution to be determined. The methods of molecular mechanics and restrained molecular dynamics calculations were applied to understand the energetic preferences of various conformations of astin B. A conformational difference was observed between astin B, showing a cis configuration in a Pro amide bond, and cyclochlorotine from Penicillium islandicum, showing all-trans amide configurations [38]. A detailed knowledge of the conformation of astin B in a polar solvent such as DMSO-d6 was considered the basis for SARs, allowing the design of new analogues with higher activity [39]. The assignments of the 1H and 13C NMR signals of astin B were made by a combination of 1H-1H COSY, HMQC and HMBC spectra [35,36]. The HMBC spectrum, which provided 1H-13C long-range couplings, proved to be extremely valuable for the assignments. The conformational determination of astin B in solution was made on the basis of the results of the following experiments: hydrogen bonding, vicinal NH-CαH coupling, NOE enhancements, and quenched molecular dynamics [36,38]. Computational procedures using NMR data were applied to the elucidation of the solution conformation of astin B and further to the disclosure of the difference between the conformations in the solid and solution states. Molecular dynamics techniques were applied to astins. Three distance constraints involved in the hydrogen bonds, as also found in the crystal state, and four distance constraints derived from the NOE experiments were used to show that this solution structure of astin B was consistent with experimental data [38].

Conformations, dipole moments and toxicity

The conformation of the cyclopeptides isolated from the higher fungus Amanita phalloides (Vaill. ex Fr.) Secr. has been related to their toxicity.
Hence, the electronic structures and conformations of the cyclopeptides, O-methyl-α-amanitin, phalloidin, and antamanide, were obtained from molecular parameters on the basis of semiempirical and ab initio methods [40]. The electronic structures and conformational analysis of the toxic cyclopeptides, α-amanitin, O-methyl-α-amanitin, S-deoxo-α-amanitin, α-amanitin-(S)-sulfoxide and α-amanitin-sulfone (Fig. 3) were obtained from molecular parameters on the basis of AM1 and ab initio methods [41]. Accordingly, the planar indole moiety of α-amanitin was found to project ahead of the rest of the bean-shaped bicyclic structure (Fig. 4). Therefore, the upper and lower sides of the π-heterocycle were available for interacting with any π-compounds, forming stable π-complexes. This region was also as lipophilic as required for transport through membranes to enter cells [40,41]. Total and point-charge dipole moments and sp hybrid were calculated for the five cyclic peptides of Fig. 3. The negative charge was then located on the sulphur atom of α-amanitin, O-methyl-α-amanitin, S-deoxo-α-amanitin, α-amanitin-(S)-sulfoxide and α-amanitin-sulfone towards the inner cavity. This cavity was therefore negative, nucleophilic, and thus adequate for scavenging cations in order to form complexes with probably a high stability constant. Inclusion of molecules was also possible depending on the size of the inner cavity of each compound, which ranged from nearly 10 Å to 6-8 Å [41]. Smaller dipole moment values were associated with greater toxicity of the molecule examined. The lowest point-charge dipole moment was that of α-amanitin-sulfone, similar to that of the thioether S-deoxo-α-amanitin, while the highest point-charge dipole moment was achieved for α-amanitin-(S)-sulfoxide, followed by the (R)-isomer, α-amanitin, due to the distinct

[Structure diagram not reproduced; substituents R1 to R5 as follows]
α-amanitin: R1 = CH2OH; R2 = OH; R3 = OH; R4 = NH2; R5 = OH
β-amanitin: R1 = CH2OH; R2 = OH; R3 = OH; R4 = OH; R5 = OH
γ-amanitin: R1 = CH3; R2 = OH; R3 = OH; R4 = NH2; R5 = OH
ε-amanitin: R1 = CH3; R2 = OH; R3 = OH; R4 = OH; R5 = OH
amanin: R1 = CH2OH; R2 = OH; R3 = H; R4 = OH; R5 = OH
amanin amide: R1 = CH2OH; R2 = OH; R3 = H; R4 = NH2; R5 = OH
amanullin: R1 = CH3; R2 = H; R3 = OH; R4 = NH2; R5 = OH
amanullinic acid: R1 = CH3; R2 = H; R3 = OH; R4 = OH; R5 = OH
proamanullin: R1 = CH3; R2 = H; R3 = OH; R4 = NH2; R5 = H

Figure 3. Structures of α-amanitin, O-methyl-α-amanitin, S-deoxo-α-amanitin, α-amanitin-(S)-sulfoxide and α-amanitin-sulfone.


Figure 4. Spatial structure of α-amanitin.

orientation of the oxygen atom in relation to the inner cavity, thus being able to modify its negative charge [41]. The α-amanitin-(R)-sulfoxide decreased the inner negative charge, while the (S)-compound increased it. The important difference was the direction of the dipole moment vector of both cyclic compounds. The direction of the dipole moment was nearly the same for α-amanitin, O-methyl-α-amanitin, S-deoxo-α-amanitin, and α-amanitin-sulfone; only α-amanitin-(S)-sulfoxide showed quite a distinct orientation. In the case of α-amanitin-(S)-sulfoxide, HOPro (amino acid 2), Asn (amino acid 1) and Cys (amino acid 8) were situated on the axis of the dipole moment. The dipole moment pointed away from the sulphur atom towards this positively charged portion of the molecule. Therefore, in the (S)-sulfoxide, the distribution of charges was disturbed owing to the marked polar character of one side, which usually should take part in the binding to macromolecules. Furthermore, this feature made its passage through membranes difficult. Hence, on the basis of dipole moment calculations it was possible to explain the decrease in toxicity and binding of this molecule, and the results were in agreement with the inhibitory constants Ki of RNA-polymerase II and lethal doses in mice [1,41].

Hydrophobicity of cyclopeptides

The physical properties of peptides have not been studied as much as those of small organic molecules. In fact, only a few QSAR studies on peptides and proteins have been carried out [42]. Physico-chemical descriptors, such as the partition coefficient, are useful for selecting compounds for screening and development of predictive QSAR models. In fact, experimental partition coefficients are main descriptors of


lipophilicity or hydrophobicity, and of many other ADMET (absorption, distribution, metabolism, excretion and toxicity) properties [43-47]. Hydrophobicity governs a variety of biological processes, such as transport, distribution and metabolism of biological molecules, molecular recognition, and protein folding. Therefore, such a parameter is essential to predict the transport and activity of drugs and potential pharmaceuticals. As is well known, the partition coefficient (P) is the ratio between the molar concentration of a chemical compound in an organic nonpolar layer, e.g., n-hexane, and that in an aqueous layer, e.g., water. This partition coefficient is expressed as log P. Elution times from RP-HPLC have been used as a measure of the relative hydrophobicity of peptides and peptide analogues [48]. Unfortunately, the availability of measured log P values for peptides is limited [42]. During the past three decades, many methods for the prediction of log P have been reported [49-51]. Various physico-chemical parameters were used in these models, including structural effects, β-turn formation corrections, N- and C-terminal effects, etc. Akamatsu’s results were incorporated into the PLogP program [52]. At present, the most widely used method is an additive approach, where a molecule is divided into fragments, and the log P value is obtained by summing the contributions of each fragment. Addition of the log P values of each atom within the compound is also used, e.g., in the XLogP program [53]. Other approaches are based upon the use of topological indices and quantum mechanics. Thompson et al. [42] have recently reported on the accuracy of available programs for the prediction of log P values for peptides as effective measures of hydrophobicity for use in peptide QSAR studies [54]. Eight log P prediction programs were tested, of which seven were fragment-based methods.
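
The definition of log P and the additive fragment scheme mentioned above can be made concrete with a short Python sketch. It is an illustration under stated assumptions and does not reproduce any of the programs named in the text: the function names are hypothetical, and the fragment contributions in `ILLUSTRATIVE_CONTRIBUTIONS` are placeholder numbers, not published fragment constants.

```python
import math


def log_p(conc_organic, conc_aqueous):
    """log P = log10 of the ratio of molar concentrations (organic / aqueous)."""
    if conc_organic <= 0 or conc_aqueous <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_organic / conc_aqueous)


def additive_log_p(fragment_counts, contributions):
    """Estimate log P by summing fragment contributions times their counts."""
    return sum(contributions[name] * count
               for name, count in fragment_counts.items())


# Placeholder values for illustration only; real programs use fitted
# fragment constants and correction terms that are not reproduced here.
ILLUSTRATIVE_CONTRIBUTIONS = {"CH3": 0.5, "CH2": 0.5, "OH": -1.0}

if __name__ == "__main__":
    # A compound 100 times more concentrated in the organic layer.
    print(log_p(1.0, 0.01))
    # A hypothetical molecule made of one CH3, two CH2 and one OH fragment.
    print(additive_log_p({"CH3": 1, "CH2": 2, "OH": 1},
                         ILLUSTRATIVE_CONTRIBUTIONS))
```

A positive log P indicates preferential partitioning into the organic layer. The peptide-specific corrections mentioned in the text (β-turn formation, N- and C-terminal effects) would enter such a scheme as additional terms in the sum.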
Owing to the different input requirements of each program, various representations of the structures were used: amino acid sequences for use with PLogP [52]; SMILES strings [55] for ALogP, LogKow and Interactive Analysis’s LogP (IALogP); 2D SYBYL ‘mol2’ files for XLogP [53]; 3D structures from Corina [56] for MLogP and one whole-molecule approach (QikProp). The dataset consisted of 340 peptides, varying from 2 to 16 amino acids in length, and included 141 blocked peptides, 158 unblocked peptides, and 41 cyclic peptides [42]. The predictive accuracy of the programs was assessed using r2 values, with ALogP being the most effective program, and MLogP the least. Blocked, unblocked, and cyclic peptide structures were studied. All programs gave better predictions for blocked peptides, while, in general, log P values for cyclic peptides were under-predicted and those of unblocked peptides were over-predicted. The performance of the programs (from best to worst) for cyclopeptides was as follows: LogKow, ALogP, XLogP, MLogP, QikProp, ACDLogP, IALogP [42]. Hattotuwagama and Flower [57] developed a new approach to the prediction of log P values for both blocked and unblocked peptides based on an empirical relationship between global molecular properties and measured physical properties. The final model consisted of five physico-chemical descriptors: molecular weight, number of single bonds, two-dimensional van der Waals (2D-VDW) volume, and two-dimensional hydrophobic and polar van der Waals surface areas (2D-VSA). The approach was peptide specific and its predictive accuracy was high, but it was not applied to cyclopeptides. It is worth mentioning that many authors have suggested that measuring the partition into other organic phases, such as phospholipid bilayers or micelles, might be more adequate than n-hexane for seeking biologically relevant measures of peptide hydrophobicity.

Antimicrobial activity of peptides

The extensive clinical use of the classical antibiotics has led to resistant bacterial strains, in particular those responsible for infectious diseases [58,59]. Hence, new effective antibiotics are required [60]. Cationic antimicrobial peptides can represent such a class of antibiotics [61,62]. The endotoxin of the Gram-negative bacteria is a lipopolysaccharide (LPS), which is a component of the outer membrane of these bacteria, and the endotoxic membrane anchor moiety is a lipid A (LA) [63]. LPS is released during Gram-negative bacterial infection and antimicrobial therapy and/or bacterial lysis, and may result in a lethal endotoxemia [64]. Therefore, the target of any novel class of antimicrobial peptides is the neutralization of LPS and/or LA [65].
Studies of endotoxin-binding host defence proteins showed that an LPS- and LA-binding substructure was formed by amphipathic sequences, i.e., with hydrophilic and hydrophobic moieties on opposite faces of the molecule [66,67], rich in cationic residues and with a β-sheet conformation [68,69]. The amphipathicity of antimicrobial peptides was necessary for their mechanism of action, because the positively charged polar face helps the molecules reach the biomembrane through electrostatic interaction with the negatively charged head groups of phospholipids, and the nonpolar face of the peptides then allows insertion into the membrane through hydrophobic interactions, causing increased permeability and loss of barrier function of target cells [68,70]. Amphipathic cationic peptides were therefore proposed as antimicrobials against Gram-negative bacteria by targeted disruption of LPS [71].
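
The two sequence features emphasized above, a surplus of cationic residues and a substantial hydrophobic content, can be estimated directly from a one-letter sequence. The Python sketch below is a simplified illustration and not a method from the cited studies. Its assumptions are: only side chains are counted (as is appropriate for head-to-tail cyclic peptides, which have no free termini), Lys and Arg count as +1 and Asp and Glu as -1 near neutral pH, His is treated as neutral, and the set of residues called hydrophobic is one common but arbitrary choice. The example sequence is hypothetical.

```python
CATIONIC = set("KR")            # Lys, Arg: +1 each (assumed fully protonated)
ANIONIC = set("DE")             # Asp, Glu: -1 each (assumed fully ionized)
HYDROPHOBIC = set("AVLIMFW")    # assumed classification; conventions vary


def net_charge(sequence):
    """Approximate side-chain net charge near neutral pH (His ignored)."""
    positive = sum(1 for residue in sequence if residue in CATIONIC)
    negative = sum(1 for residue in sequence if residue in ANIONIC)
    return positive - negative


def hydrophobic_fraction(sequence):
    """Fraction of residues belonging to the assumed hydrophobic set."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    count = sum(1 for residue in sequence if residue in HYDROPHOBIC)
    return count / len(sequence)


if __name__ == "__main__":
    peptide = "KKLLKKLLKKLL"  # hypothetical sequence for illustration
    print(net_charge(peptide), hydrophobic_fraction(peptide))
```

These composition measures say nothing about whether the polar and nonpolar residues actually segregate onto opposite faces; that depends on the secondary structure and would require a conformational analysis.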


Antimicrobial peptides take part in the innate immune response by providing a rapid first-line defence against infection [72]. Examples of antimicrobial peptides are magainins, cecropins, defensins, lactoferricins, tachyplesins, protegrins, thanatin, and others [73,74]. Antimicrobial peptide databases have been recently reported [75,76]. These compounds have been classified into three classes on the basis of secondary structures: a) linear peptides with propensity for amphiphilic α-helical structure [77,78], which mainly occur as disordered structures in aqueous media and become amphipathic helices upon interaction with the hydrophobic membranes [79,80], e.g., cecropins, magainins, and melittins; b) peptides with β or αβ structure stabilized by different number of disulfide bridges. The β-sheet class consists of cyclic peptides constrained in this conformation either by intramolecular disulfide bonds, e.g., defensins [81] and protegrins [82], or by an N-terminal to C-terminal covalent bond, e.g., gramicidin S and tyrocidins [83]; and c) peptides with over-representation of certain amino acids or unusual structures. The third group has been recently reviewed [84]; it includes aromatic amino acid-rich peptides, (Pro-Arg)-rich peptides, unusual defensins and defensin-like molecules, unusual antimicrobial peptides from amphibians, bacteriocins with unusual structure and anionic antimicrobial peptides [84]. Design of new molecules has been achieved using combinatorial-chemistry procedures coupled to high-throughput screening systems and data processing with design-of-experiments (DOE) methodology to obtain QSAR equation models and optimized compounds. Upon selection of best candidates with low cytotoxicity and moderate stability to protease digestion, anti-infective activity has been also evaluated in plant-pathogen model systems [85]. 
Large-scale production can be achieved by solution organic or chemoenzymatic procedures in the case of very small peptides, but, in many cases, production can be performed by biotechnological methods using genetically modified microorganisms (fermentation) or transgenic crops (plant biofactories) [85]. A variety of human proteins and peptides has antimicrobial activity and plays important roles in innate immunity. There are three important groups of human antimicrobial peptides, defensins, histatins, and cathelicidins [86]. Defensins are cationic non-glycosylated peptides containing six cysteine residues that form three intramolecular disulfide bridges, resulting in a triple-stranded β-sheet structure, e.g., α-defensins and β-defensins in humans. The second group is the family of histatins, which are small, cationic, histidine-rich peptides present in human saliva. Histatins adopt a random coil


conformation in aqueous solvents and form α-helices in non-aqueous solvents. The third group comprises only one antimicrobial peptide, the cathelicidin LL-37. This peptide is derived proteolytically from the C-terminal end of the human CAP18 protein. Just like the histatins, it adopts a largely random coil conformation in a hydrophilic environment, and forms an α-helical structure in a hydrophobic environment [86]. Cathelicidin and defensin gene families are multifunctional natural antibiotic peptides and signalling molecules that activate host cell processes involved in immune defence and repair [87,88]. In mammals, defensins have evolved to have a central function in the host defence properties of granulocytic leukocytes, mucosal surfaces, skin and other epithelia. Three structural subgroups of mammalian defensins are involved as effectors of antimicrobial innate immunity [89]. Furthermore, over 80 different α-defensin or β-defensin peptides are expressed by the leukocytes and epithelial cells of birds and mammals. Although these compounds may be candidates for therapeutic development due to the broad-spectrum antimicrobial properties, there are technical limitations related to their size (30-45 residues) and complex structure. Therefore, minidefensins have been developed, which are antimicrobial peptides with 16-18 residues, approximately half the number found in α-defensins [90]. The θ-defensins are evolutionarily related to α- and β-defensins, but other minidefensins probably arose independently. Like α- or β-defensins, minidefensin molecules have a net positive charge and a largely β-sheet structure that is stabilized by intramolecular disulfide bonds. Whereas α-defensins are found only in mammals and θ-defensins only in nonhuman primates, the other minidefensins come from widely divergent species, including horseshoe crabs, spiders, and pigs. 
Several α-defensins and minidefensins are effective inhibitors of HIV-1 infection in vitro, and recent evidence implicates α-defensins in resistance to HIV-1 progression in vivo [90]. Cyclic peptides have long been, and still are, under study as potential antimicrobial therapeutic agents. These peptides usually exhibit broad-spectrum activity against Gram-positive and Gram-negative bacteria, yeasts, fungi and enveloped viruses [86]. Combinatorial synthesis of cyclic peptides, together with screening for antimicrobial and other bioactivities, can support new lead identification and the construction of QSARs. Redman et al. [91] reported a new sequencing protocol for rapid identification of the members of a cyclic peptide library based on automated computer analysis of mass spectra. The utility of the new MS-sequencing approach was demonstrated using sonic spray ionization ion trap MS and MS/MS spectrometry on a single compound per bead cyclic peptide


library and validated with individually synthesised pure cyclic D,L-α-peptides [91]. Also, complex libraries of glycosidated cyclic peptides, which are an important class of drug-like compounds, have been developed by incorporating glycosidated amino acids into linear peptides via solid-phase peptide synthesis followed by thioesterase-mediated peptide cyclization [92]. Then, the two major classes of cationic amphipathic antimicrobial peptides are α-helical and β-sheet peptides [93,94]. In fact, potent cyclic antimicrobial peptides selective for Gram-negative bacteria have been successfully developed on the basis of the β-stranded framework mimicking the putative LPS-binding sites of the LPS-binding protein family [95]. The disadvantage of antimicrobial peptides for clinical use as antibiotics is their toxicity or ability to lyse eukaryotic cells [61]. Then, it would be necessary to dissociate anti-eukaryotic activity from antimicrobial activity in order to use them as broad-spectrum antibiotics. SAR studies indicated that changes in the amphipathicity of these antibacterial peptides could be used to dissociate the antimicrobial activity from the haemolytic activity [96,97]. Furthermore, peptide cyclization increased the selectivity for bacteria because of substantially reducing the haemolytic activity [98]. Antimicrobial and haemolytic activities of de novo designed cyclic β-sheet gramicidin S analogues have been successfully dissociated by systematic alterations in amphipathicity/hydrophobicity through D-amino acid substitutions [99,100]. Chen et al. [48,101] also demonstrated that in linear peptides the helix-destabilizing properties of D-amino acids offered a systematic approach to the controlled alteration of the hydrophobicity, amphipathicity, and helicity of amphipathic α-helical model peptides. Frecer et al. 
[102] reported the de novo design of a series of synthetic cyclic amphipathic cationic peptides for which a high binding affinity to LA was predicted from molecular modelling. These V peptides were composed of two identical, symmetric, amphipathic LPS- and LA-binding motifs containing cationic residues, such as HBHPHBH and HBHBHBH (where B is a cationic residue, H is a hydrophobic residue, and P is a polar residue). The two motifs formed the strands of a β-hairpin joined by a G9S10G11 turn on one side and by a Cys1-Cys19 disulfide bridge linking the N- and C-terminal residues on the other side (Fig. 5). Each peptide was thus cyclized via a disulfide bridge and showed a β-sheet conformation, which would bind to the bisphosphorylated glucosamine disaccharide head group of LA, primarily by ion-pair formation between the anionic phosphates of LA and the cationic side chains [68].
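The assembly rule described above can be illustrated with a minimal sketch. It assumes the template C-[motif]-GSG-[motif]-C stated in the text and uses the motif VKVKVKV, which is the V1 sequence shown in Fig. 5; the function names are hypothetical and are not taken from [102].

```python
# Sketch: assemble a V-peptide sequence from one 7-residue binding motif.
# Function names are hypothetical; the template C-[motif]-GSG-[motif]-C and
# the V1 motif VKVKVKV come from the text and from Figure 5.

def build_v_peptide(motif: str) -> str:
    """Return the 19-residue one-letter sequence C-motif-GSG-motif-C."""
    if len(motif) != 7:
        raise ValueError("binding motif must have seven residues")
    return "C" + motif + "GSG" + motif + "C"


def hairpin_pairs(sequence: str) -> list[tuple[str, str]]:
    """Pair residue i with residue n+1-i, excluding the 3-residue turn."""
    n = len(sequence)
    n_pairs = (n - 3) // 2
    return [
        (f"{sequence[i]}{i + 1}", f"{sequence[n - 1 - i]}{n - i}")
        for i in range(n_pairs)
    ]


v1 = build_v_peptide("VKVKVKV")
pairs = hairpin_pairs(v1)
print("Ac-" + v1 + "-NH2")
print(pairs)
```

The pairing list reproduces the cross-strand partners drawn in Fig. 5 (C1/C19 at the disulfide end, V8/V12 next to the turn). Because the two strands carry the same motif, like residues face each other across the hairpin.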


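[Figure 5 scheme: β-hairpin of peptide V1, Ac-C1-V2-K3-V4-K5-V6-K7-V8-G9-S10-G11-V12-K13-V14-K15-V16-K17-V18-C19-NH2. Residues C1/C19, V2/V18, K3/K17, V4/V16, K5/K15, V6/V14, K7/K13 and V8/V12 face each other on the two strands, with the G9-S10-G11 turn at one end.]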

Figure 5. Chemical structure of the cyclic cationic peptide V1.

Each binding motif of the V peptides contained seven alternating H and B (or P) residues, giving the general sequence Ac-C-HBHB(P)HBH-GSG-HBHB(P)HBH-C-NH2, where Ac was an acetyl group. The two LPS- and LA-binding sites showed structural similarity to cyclic β-sheet defence peptides, such as protegrin 1, thanatin, and androctonin [73]. Peptides were further characterized by electrospray ionization mass spectrometry (ESI-MS) and amino acid analysis. MD simulations showed that the backbone conformations of free V peptides evolved from the initial β-hairpin with defined patterns of secondary structure into flexible random conformations. The patterns of the molecular shape fluctuations and torsional flexibility indicated high degrees of flexibility of the free V peptides in solution [102].

Antibacterial activity, haemolytic activity, and cytotoxicity assays were carried out. The therapeutic index, which is a widely used parameter to represent the specificity of antimicrobial reagents, was calculated as the ratio of minimal haemolytic concentration (MHC) (haemolytic activity) to minimal inhibitory concentration (MIC) (antimicrobial activity); thus, larger values of the therapeutic index indicated higher antimicrobial specificity [59,103]. The therapeutic index could be increased either by increasing antimicrobial activity or by decreasing haemolytic activity while maintaining antimicrobial activity. High peptide hydrophobicity and amphipathicity also led to a higher peptide self-association in solution. When the self-association of a peptide in aqueous media was too strong, it would decrease the ability of the peptide to dissociate, penetrate into the biomembrane and kill target cells.
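The therapeutic index defined above is a simple ratio, and the two routes to raising it can be illustrated with a short sketch. The concentrations below are hypothetical placeholders in arbitrary but identical units; they are not measurements from [59,103].

```python
# Sketch: therapeutic index = MHC / MIC, as defined in the text.
# All concentration values are hypothetical and share the same units.

def therapeutic_index(mhc: float, mic: float) -> float:
    """Ratio of minimal haemolytic to minimal inhibitory concentration."""
    if mhc <= 0 or mic <= 0:
        raise ValueError("concentrations must be positive")
    return mhc / mic


parent = therapeutic_index(mhc=50.0, mic=10.0)
# Route 1: lower haemolytic activity (higher MHC), same MIC.
less_haemolytic = therapeutic_index(mhc=200.0, mic=10.0)
# Route 2: higher antimicrobial activity (lower MIC), same MHC.
more_potent = therapeutic_index(mhc=50.0, mic=2.5)

print(parent, less_haemolytic, more_potent)
```

Both modifications give the same fourfold gain in this toy case, which is why the index alone does not reveal whether a peptide became more potent or merely less haemolytic.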
Temperature profiling in RP-HPLC from 5 to 80 °C was used to measure self-association of small amphipathic molecules, including cyclic β-sheet peptides [104], accounting for dimerization of the peptides at 5 °C and monomerization of the peptides at 80 °C because of dissociation of the dimers [102]. A higher ability to self-associate in solution was correlated with weaker antimicrobial activity and stronger haemolytic activity of the peptides. In addition, self-associating ability was correlated with the secondary structure of peptides, i.e. disrupting the secondary structure by replacing the L-amino


acid with its D-amino acid counterpart decreased the peptide association parameter (PA) values [102]. Therefore, the D-amino acid substituted peptides possessed an enhanced average antimicrobial activity compared with L-diastereomers.

QSAR models for antimicrobial cyclopeptides

QSARs were obtained by relating the experimental biological potencies to physico-chemical molecular properties derived from the peptide sequences [102]. A rational strategy was used to design cationic antimicrobial peptides via repeated sequences of alternating cationic and nonpolar residues. In line with the considerations above, to achieve a high level of antimicrobial activity and selectivity toward bacteria rather than eukaryotic cells, systematic modifications of molecular properties were made by varying the amino acid residues of the amphipathic LPS- and LA-binding motifs [68] while preserving the size, symmetry, and amphipathic character of the peptides. In peptides V1 to V7, the molecular charges, amphipathicities, and lipophilicities were modulated by varying the cationic (polar) amino acid residues in the center of the binding motifs, where B(P) was Lys or Arg (Ser or Gln), and the hydrophobic residues, where H was Ala, Val, Phe, or Trp; the alternation of polar and nonpolar residues was kept throughout. Lysine residues were previously shown to contribute most to the high affinity to LA when placed at the flanking basic residue position of the HBHB(P)HBH motif with a β-sheet conformation [68,102]. QSAR analysis of peptide sequences and their antimicrobial, cytotoxic, and haemolytic activities revealed that site-directed substitutions of residues in the hydrophobic face of the amphipathic peptides with less lipophilic residues selectively decreased the haemolytic effect without significantly affecting the antimicrobial or cytotoxic activity.
On the other hand, the antimicrobial effect was enhanced by substitutions in the polar face with more polar residues, which increased the amphipathicity of the peptide [105]. The combination of three molecular properties (charge, amphipathicity, and lipophilicity) was found to correlate with the observed antimicrobial, haemolytic, and cytotoxic activities of the V peptides. Single-variate QSAR correlations of these properties to the biological effects could not be established, suggesting that the membrane disruption involved a concerted process [43,106,107]. The V peptides exhibited strong effects against five Gram-negative bacteria (Escherichia coli, Klebsiella pneumoniae, and Pseudomonas aeruginosa), with MICs in the nanomolar range, and low cytotoxic and


haemolytic activities at concentrations significantly exceeding their MICs. Simple properties derived from the peptide sequences, such as the molecular charge (QM), amphipathicity index (AI), and lipophilicity index (Πo/w), were then correlated with the mean antimicrobial effect against Gram-negative bacteria by multivariate linear regression, as shown in eq. 1 [102].

For antimicrobial effect: ln(MIC) = 9.49·QM + 10.17·AI − 0.05·Πo/w − 22.16 (equation 1)

The t test of the multivariate correlation equation revealed that the antimicrobial effect on bacteria was mainly determined by the V-peptide charge (QM) and amphipathicity (AI), i.e., by the number of cationic and polar residues forming the polar face of the V peptides and their distribution throughout the two symmetric amphipathic LPS- and LA-binding motifs [102]. In fact, a higher affinity to the outer bacterial membrane seemed to be a favourable prerequisite for the antimicrobial effects, since the V peptides displayed low micromolar Kd values and antimicrobial activities at concentrations in the nanomolar range, where Kd is the dissociation constant of the peptide-LA complex. The haemolytic activity of the peptides against human erythrocytes was determined as a main measure of peptide toxicity towards higher eukaryotic cells. Since both antimicrobial and haemolytic activities of the cationic peptides involved cell membrane lysis and depended on the same physico-chemical properties [99,106], a similar correlation equation (eq. 2) was obtained for the haemolytic activities of the V peptides [102].

For haemolysis: ln(EC50) = −5.34·QM − 4.94·AI − 0.23·Πo/w + 31.87 (equation 2)

In this case the correlation parameters and the t statistics showed that the haemolytic activity against eukaryotic cells was mainly influenced by the molecular lipophilicity, i.e., the sum of the lipophilicities of all residues, with the major contribution coming from the H residues. These residues formed the nonpolar face of the V peptides, which were predicted to acquire a β-hairpin-like structure in the peptide-LA complexes [102]. The EC50 values of the V peptides for cytotoxicity ranged from 40 µM to 5.7 mM, which exceeded their MICs by up to 3 orders of magnitude. For the cytotoxic effects of the V peptides, the correlation in eq. 3 was obtained [102].

For cytotoxicity: ln(EC50) = 8.98·QM + 11.74·AI − 0.04·Πo/w − 8.70 (equation 3)
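Equations 1 to 3 share one linear form, so they can be evaluated side by side. The sketch below is a minimal illustration: the coefficients are copied from the equations above, whereas the descriptor values (QM = 4, AI = −1.5, Πo/w = 10) are hypothetical placeholders, since the descriptor tables of [102] are not reproduced in this text. The class and variable names are likewise illustrative.

```python
# Sketch: evaluate the three linear correlations (eqs. 1-3) for one peptide.
# Coefficients are copied from the text; descriptor values are HYPOTHETICAL.
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearQsar:
    c_charge: float         # coefficient of QM
    c_amphipathicity: float  # coefficient of AI
    c_lipophilicity: float   # coefficient of Pi_o/w
    intercept: float

    def predict_ln(self, q_m: float, ai: float, pi_ow: float) -> float:
        """Return the natural log of the modelled concentration."""
        return (self.c_charge * q_m
                + self.c_amphipathicity * ai
                + self.c_lipophilicity * pi_ow
                + self.intercept)


EQ1_LN_MIC = LinearQsar(9.49, 10.17, -0.05, -22.16)
EQ2_LN_EC50_HAEMOLYSIS = LinearQsar(-5.34, -4.94, -0.23, 31.87)
EQ3_LN_EC50_CYTOTOXICITY = LinearQsar(8.98, 11.74, -0.04, -8.70)

# Hypothetical descriptor values, chosen only to exercise the arithmetic.
descriptors = {"q_m": 4.0, "ai": -1.5, "pi_ow": 10.0}

ln_mic = EQ1_LN_MIC.predict_ln(**descriptors)
ln_ec50_haem = EQ2_LN_EC50_HAEMOLYSIS.predict_ln(**descriptors)
ln_ec50_cyto = EQ3_LN_EC50_CYTOTOXICITY.predict_ln(**descriptors)

# Shift in each prediction when Pi_o/w alone is raised by 5 units.
raised = dict(descriptors, pi_ow=descriptors["pi_ow"] + 5.0)
shift_mic = EQ1_LN_MIC.predict_ln(**raised) - ln_mic
shift_haem = EQ2_LN_EC50_HAEMOLYSIS.predict_ln(**raised) - ln_ec50_haem

print(ln_mic, ln_ec50_haem, ln_ec50_cyto, shift_mic, shift_haem)
```

In this sketch a change in Πo/w moves the haemolysis prediction more than four times as far as the MIC prediction (coefficients −0.23 versus −0.05). The comparison concerns raw coefficients only and does not replace the t statistics reported in [102].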


Therefore, the correlation parameters and t statistics indicated that the cytotoxic activity was determined mainly by the peptide charge and the amphipathicity. Amphipathicity of the L-amino acid substituted peptides was determined by the calculation of the hydrophobic moment [108] using the software package Jemboss [109], modified to include the determined hydrophobicity scale. Hydrophobicity coefficients were determined by RP-HPLC at pH 7 (phosphate buffer) [102]. Since the antimicrobial activities of the V peptides strongly increased with the increasing amphipathicity of the molecules at constant QM and Πo/w, it was suggested that aggregates of V peptides, rather than individual molecules, would behave as strong antimicrobials [102]. The ability of cationic peptides to form aggregates has been related to their antimicrobial potencies, as previously reported for dermaseptin S4 [110,111], protegrin-1 [112], and human defensins [59,113]. The validity of the QSAR model for the antimicrobial potencies of the V peptides against Gram-negative bacteria was verified with the set of cyclic cationic amphipathic peptides designed by Muhle and Tam [95], which were similar to the V peptides, e.g., cyclo(PACRCRAG-PARCRCAG) sequences constrained by two cross-linking disulfide bonds. These peptides displayed potent activities against Gram-negative bacteria (Escherichia coli and Pseudomonas aeruginosa). The MICs of these peptides were 20 nM for E. coli. Frecer’s correlation equation for the antimicrobial activity (eq. 1) was able to reproduce the qualitative rank order of antimicrobial potencies at low salt concentrations for eight of the nine peptides [95]. On the basis of these QSARs, new analogues that had strong antimicrobial effects but lacked haemolytic activity have been proposed [102]. QSAR eq. 1 predicted rapid increases in the antimicrobial activity with an increase in the molecular charge QM over 4 e (in units of electron charge) when amphipathicity and hydrophobicity were kept constant at the levels of the most promising peptide, V4. Model analogues of V4 that shared the polar HKHQHKH motif and that differed only in the H residues (which retained the amphipathicity index of V4) were predicted to possess decreasing haemolytic activities with decreasing lipophilicities, while their predicted antimicrobial and cytotoxic activities remained unchanged. In other analogues of V4, the replacement of the two central Gln residues by more polar Asn residues was predicted to lead to significantly increased antimicrobial potencies due to the increased amphipathicity, independent of the H residues (predicted MICs were lower than that of V4) [102].
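The amphipathicity discussed above was derived from a hydrophobic moment calculation. As a rough illustration, the sketch below implements the standard Eisenberg-type hydrophobic moment, i.e. the magnitude of the vector sum of residue hydrophobicities placed at successive angular steps δ. It is not the modified Jemboss procedure or the AI definition of [102]: the two-value hydrophobicity scale (+1 for Val, −1 for Lys) is a toy assumption, and δ = 180° is an idealised value for a flat β-strand.

```python
# Sketch: Eisenberg-type hydrophobic moment of a residue segment.
# The scale below is a TOY assumption (+1 hydrophobic, -1 cationic); the
# source used an RP-HPLC-derived scale that is not reproduced here.
import math


def hydrophobic_moment(hydrophobicities: list[float], delta_deg: float) -> float:
    """Magnitude of sum_n H_n * exp(i * n * delta) over the segment."""
    delta = math.radians(delta_deg)
    sin_sum = sum(h * math.sin(n * delta) for n, h in enumerate(hydrophobicities))
    cos_sum = sum(h * math.cos(n * delta) for n, h in enumerate(hydrophobicities))
    return math.hypot(sin_sum, cos_sum)


TOY_SCALE = {"V": 1.0, "K": -1.0}
strand = [TOY_SCALE[residue] for residue in "VKVKVKV"]

# 180 degrees per residue: idealised beta-strand (assumption).
mu_beta = hydrophobic_moment(strand, 180.0)
# 100 degrees per residue: the usual alpha-helix periodicity.
mu_helix = hydrophobic_moment(strand, 100.0)

print(round(mu_beta, 3), round(mu_helix, 3))
```

With strictly alternating residues, every hydrophobicity vector adds constructively at the β-strand periodicity, so the moment reaches its maximum for this toy scale. The same sequence evaluated at the helical periodicity gives a much smaller value. This is consistent with the design choice of alternating H and B residues in a β-hairpin framework.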


Thus, variations in the H residues forming the hydrophobic face of the analogues of V4 mainly affected the haemolytic activity, which was shown to depend strongly on Πo/w, but did not affect the predicted antimicrobial activity of the analogues. Therefore, replacement of the H residues with less hydrophobic residues in the nonpolar face of the amphipathic analogues was appropriate for decreasing the haemolytic activities of the V peptides. On the other hand, directed substitutions of the B and P residues in the polar faces of the V peptides with more polar residues, which increased the amphipathic character (more negative AI values) of the peptide while keeping the net charge, the symmetry of the binding motifs, and the composition of the hydrophobic face, were predicted to bring about a significant increase in antimicrobial potencies [59,102]. The positively charged antimicrobial peptide cyclo[VKLdKVdYPLKVKL dYP] (GS14dK4), which is a diastereomeric lysine ring-size analogue of the naturally occurring antimicrobial peptide gramicidin S (Fig. 6), exhibited enhanced antimicrobial and markedly reduced haemolytic activities compared with gramicidin S itself [114]. The binding of GS14dK4 to various lipid bilayer model membranes has been recently studied using isothermal titration calorimetry [114]. Dynamic light scattering results indicated the absence of any peptide-induced major alteration in vesicle size or vesicle fusion under the experimental conditions. The binding of GS14dK4 was significantly influenced by the surface charge density of the phospholipid bilayer and by the presence of cholesterol. The presence of cholesterol markedly reduced the affinity of a peptide for phospholipid bilayers. The binding isotherms could be described quantitatively by a one-site binding model. The measured endothermic binding enthalpy (ΔH) varied strongly (+6.3 to +26.5 kcal/mol) and appeared to be inversely related to the order of the phospholipid bilayer system. 
However, the negative free energy (ΔG) of binding remained relatively constant (-8.5 to -11.5 kcal/mol) for all lipid membranes examined. The


Figure 6. Structure of gramicidin S.


relatively small variation of negative free energy of peptide binding together with a pronounced variation of positive enthalpy produced an equally strong variation of TΔS (+16.2 to +35.0 kcal/mol), indicating that GS14dK4 binding to phospholipid bilayers was primarily entropy driven [114].

The properties and SAR studies of a macrocyclic analogue of porcine protegrin-1 have been recently reported [115]. Protegrin-1 (PG-1) is an 18-residue β-hairpin peptide containing two disulfide bridges (Cys6–15 and Cys8–13) that belongs to the cathelicidin class of antimicrobial peptides (Fig. 7). These disulfides constrained the peptide backbone into a β-hairpin conformation, with a β-turn formed by residues 9–12, as detected by NMR. SAR studies [116] led to the discovery of analogue IB367 with Cys5–14 and Cys7–12 disulfide bridges, which has been clinically tested to treat ulcerative oral mucositis, ventilator-associated pneumonia, and respiratory infections associated with cystic fibrosis. An approach to PG-1 peptidomimetics based on the use of β-hairpin-stabilizing organic templates has been reported earlier [117,118]. The template D-Pro-Pro was chosen for its ability to promote a β-hairpin loop structure. This design was used to prepare β-hairpin peptidomimetics [119-122]. The lead compound, containing the sequence cyclo(Leu-Arg-Leu-Lys-Lys-Arg-Arg-Trp-Lys-Tyr-Arg-Val-D-Pro-Pro), showed antimicrobial activity against Gram-positive and Gram-negative bacteria, but a much lower haemolytic activity than PG-1. SAR studies were carried out on over 100 single-site substituted synthetic analogues, and the biological profiles were assessed. Some analogues showed slightly improved antimicrobial activities (2–4-fold lowering of MICs), whereas other substitutions caused large increases in haemolytic activity [115].
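For the GS14dK4 calorimetry data quoted earlier in this passage, the entropy term follows from the Gibbs relation ΔG = ΔH − TΔS. The short sketch below applies it to the extreme ΔH and ΔG values given in the text. Pairing the extremes is an assumption made only for illustration, because the text does not state which enthalpy belongs to which lipid system.

```python
# Sketch: entropy contribution from the Gibbs relation dG = dH - T*dS.
# Inputs are the range limits quoted in the text (kcal/mol); pairing the
# limits with each other is an ASSUMPTION, not data from [114].

def entropy_term(delta_h: float, delta_g: float) -> float:
    """Return T*dS (kcal/mol) given binding enthalpy and free energy."""
    return delta_h - delta_g


t_ds_low = entropy_term(delta_h=6.3, delta_g=-8.5)
t_ds_high = entropy_term(delta_h=26.5, delta_g=-11.5)

# Binding is entropy driven when dH > 0 but dG < 0, so T*dS must exceed dH.
entropy_driven = t_ds_low > 6.3 and t_ds_high > 26.5

print(round(t_ds_low, 1), round(t_ds_high, 1), entropy_driven)
```

The reported TΔS range (+16.2 to +35.0 kcal/mol) lies inside the bounds obtained this way. Because the enthalpy is unfavourable (positive) in every case, the negative ΔG can only arise from the entropy term.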
Frecer [123] quantitatively analysed the antimicrobial and haemolytic activities of the protegrin-1 mimetics, i.e., the cyclic cationic peptides with a β-hairpin fold synthesised by Robinson et al. [115] (Fig. 7).

[Figure 7 residue labels. PG-1: Arg1, Gly2, Gly3, Arg4, Leu5, Cys6, Cys8, Arg9, Arg10, Arg11, Phe12, Cys13, Val14, Cys15, Val16, Gly17, Arg18-NH2. Cyclic analogue: Leu1, Arg2, Leu3, Lys4, Lys5, Arg6, Arg7, Trp8, Lys9, Tyr10, Arg11, Val12, D-Pro13, L-Pro14.]

Figure 7. Structure of protegrin-1 (PG-1) and of the cyclic β-hairpin peptide analogues of PG-1.


The polar face of the cyclic lead peptide R1 [115] was formed by the side chains of cationic residues 2, 4, 6, 7, 9, 11 and D-Pro13 and Pro14, while the nonpolar face consisted of residues 1, 3, 5, 8, 10 and 12 with predominant lipophilic/aromatic character. For the QSAR models, the selected properties, which characterized peptide’s charge, lipophilicity, amphipathicity, size, shape and flexibility were used, including the following descriptors: charge (Q), overall lipophilicity (L), lipophilicity of polar and nonpolar faces (P and N), surface areas of polar and nonpolar faces (SP and SN), molecular mass of the polar and nonpolar faces (MwP and MwN), count of small lipophilic, highly lipophilic and aromatic residues forming the nonpolar face (CSL, CHL and CAR), total number of hydrogen bond donor and acceptor centres (HBdon and HBacc), total number of rotatable bonds (RotBon) and various amphipathicity descriptors (P/L, P/N, L/N, Q/L, Q/N, SP/SN, MwP/MwN, Q/CSL, Q/CHL and Q/CAR) [123]. These simple additive molecular descriptors were easily derived from peptide sequences and tabulated amino acid properties [124]. Frecer [123] assumed that the analogues adopted an amphipathic β-hairpin secondary structure, which was in agreement with the fact that protegrins and synthetic analogues with a constrained β-hairpin conformation displayed higher antimicrobial potencies than linear or nonconstrained counterparts [69,125]. The best models obtained by application of genetic function approximation algorithm correlated antimicrobial potencies (log MICa) to peptide's charge and amphipathicity index, while haemolytic effect (log %Hem) correlated well with the lipophilicity of residues forming the nonpolar face of the β-hairpin [123]. 
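Two of the sequence-derived descriptors can be reproduced directly from the lead sequence quoted earlier, cyclo(Leu-Arg-Leu-Lys-Lys-Arg-Arg-Trp-Lys-Tyr-Arg-Val-D-Pro-Pro), and the face assignment given above. The following sketch counts cationic and aromatic residues and then evaluates the two best-fit models that this section reports further on as eqs. 4 and 5 [123]. The residue class sets and the function names are illustrative assumptions. The value N = 6.84 for R1 is taken from the text, not computed, since the lipophilicity table of [124] is not reproduced here.

```python
# Sketch: simple additive descriptors for lead peptide R1, then eqs. 4 and 5.
# Residue class sets and helper names are ASSUMPTIONS for illustration;
# coefficients and N(R1) = 6.84 are quoted from the text.

LEAD_R1 = ["Leu", "Arg", "Leu", "Lys", "Lys", "Arg", "Arg",
           "Trp", "Lys", "Tyr", "Arg", "Val", "D-Pro", "Pro"]
POLAR_FACE = (2, 4, 6, 7, 9, 11, 13, 14)   # 1-based positions from the text
NONPOLAR_FACE = (1, 3, 5, 8, 10, 12)
CATIONIC = {"Lys", "Arg"}
AROMATIC = {"Phe", "Trp", "Tyr"}


def residues_at(sequence: list[str], positions: tuple[int, ...]) -> list[str]:
    """Select residues by 1-based position."""
    return [sequence[p - 1] for p in positions]


# The backbone is cyclic, so only side chains contribute to the charge.
charge_q = sum(res in CATIONIC for res in LEAD_R1)
cationic_polar = sum(res in CATIONIC for res in residues_at(LEAD_R1, POLAR_FACE))
aromatic_nonpolar = sum(res in AROMATIC for res in residues_at(LEAD_R1, NONPOLAR_FACE))


def log_mic(q: float, n: float) -> float:
    """Eq. 4: log MICa = 1.291 - 0.180*Q + 1.438*(Q/N)."""
    return 1.291 - 0.180 * q + 1.438 * (q / n)


def log_hem(n: float) -> float:
    """Eq. 5: log %Hem = -2.551 + 0.431*N."""
    return -2.551 + 0.431 * n


N_R1 = 6.84
print(charge_q, cationic_polar, aromatic_nonpolar)
print(round(log_mic(charge_q, N_R1), 3), round(log_hem(N_R1), 3))
```

The count gives Q = 7, matching the charge the text assigns to R1; six of the cationic residues sit on the polar face and Lys5 on the nonpolar face. At fixed Q, eq. 4 falls and eq. 5 rises as N increases, so the two models pull the choice of nonpolar-face lipophilicity in opposite directions.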
The lipophilicity of the nonpolar face N and the amphipathicity parameter Q/N showed some relation to the antimicrobial activity, while N and the intercorrelated count of highly lipophilic residues in the nonpolar face (CHL) appeared to be related to the haemolytic activity. This finding was consistent with the above-mentioned QSAR studies (eqs. 1-3), which suggested that charge and amphipathicity correlated with MIC of cyclic cationic peptides and overall lipophilicity was very important for the haemolytic activity [102]. A large set of QSAR models combining up to five descriptors in each correlation equation was prepared by the genetic function approximation (GFA) algorithm [126] of the Cerius2 package. The fitness of each generated model was evaluated by using the lack-of-fit score [127]. The best performing QSAR model of the antimicrobial effect of R1 analogues accounted for two descriptors, molecular charge Q and amphipathicity parameter Q/N, which are


determined by the cationic residues forming mainly the polar face and the lipophilicity of the nonpolar face, N (eq. 4) [123].

log MICa = 1.291 − 0.180·Q + 1.438·(Q/N) (equation 4)

A relationship between the antimicrobial effect of cationic peptides and their molecular charge and amphipathicity has been previously reported [102,128]. Based on this QSAR model, the best variants of the R1 lead should have polar faces formed only by charged residues and nonpolar faces formed by highly lipophilic residues in order to display strong antimicrobial activity. Thus, any analogue with the same charge as the lead peptide R1 (Q = 7 e) should show more potent antimicrobial activity than R1 when the lipophilicity of its nonpolar face N ≥ 6.84 (value of N for R1) [123]. The model was validated against the MICa values of the 97 peptides of Robinson et al. [115] using their Q and N descriptors [123]. The best QSAR model of the haemolytic effect for the R1 analogues related the lysis of human erythrocytes to the lipophilicity of the nonpolar face, N (eq. 5) [123].

log %Hem = −2.551 + 0.431·N (equation 5)

Therefore, the haemolytic potency of R1 analogues depended mainly on the lipophilicity of the nonpolar face, and was almost independent of the charge and composition of the polar face. Based on this QSAR model, any analogue with N ≤ 6.84 should show lower levels of haemolytic activity than the lead peptide R1 [123]. The %Hem values of the 97 peptides of Robinson et al. [115] fitted this model. The combination of the QSAR models with the cyclic backbone of the protegrin analogues (constant turns, residues 6, 7 and 13, 14), sequence amphipathicity (regular alternation of cationic and nonpolar residues) and peptide symmetry provided a sufficient strategy for peptide design. The secondary structure of the backbone of the peptides NR7–NR9 was stabilized by a Cys3-Cys10 disulfide bridge in the form of a cyclic β-hairpin [123]. Tam et al.
[128] showed by circular dichroism experiments that cyclic protegrins containing one to three cysteine bonds displayed some degree of β-strand structure in solution. The occurrence of the β-hairpin fold was essential for membrane permeation/disruption by PG-1 analogues as previously reported [128-130]. Recently, Bhonsle et al. [131] used 3D-QSAR for identification of descriptors defining bioactivity of antimicrobial peptides. The resulting 3D-physico-chemical properties were controlled by the placement of amino acids with well-defined properties (hydrophobicity, charge density, electrostatic


potential, and those mentioned above) at specific locations along the peptide backbone. These peptides exhibited different in vitro activity against Staphylococcus aureus and Mycobacterium ranae. The differences in the biological activity seem to be due to different physico-chemical interactions that occur between the peptides and the cell membranes of the bacteria. 3D-QSAR analyses showed that specific physico-chemical properties were responsible for antibacterial activity and selectivity. There were five physico-chemical properties specific to the S. aureus QSAR model, while five properties were specific to the M. ranae QSAR model. Accordingly, for any particular antimicrobial peptide, organism selectivity and potency are controlled by the chemical composition of the target cell membrane [131]. As described above, cationic peptide antibiotics possess amphiphilic structure, thereby displaying lytic activity against bacterial cell membranes. Naturally occurring antimicrobial peptides contain a large number of amino acid residues, which limit their clinical applicability. Recent studies indicated that it is possible to decrease the chain-length of these peptides without loss of activity, and suggested that a minimum of two positive ionizable (hydrophilic) and two bulky groups (hydrophobic) are required for antimicrobial activity. By employing the HipHop module of the software package CATALYST, these experimental findings have been translated into 3D pharmacophore models by finding common features among active peptides [132]. Positively ionizable and hydrophobic features were the important characteristics of compounds used for pharmacophore model development. Based on the highest score and the presence of amphiphilic structure, two separate hypotheses, Ec-2 and Sa-6 for Escherichia coli and Staphylococcus aureus, respectively, were selected for mapping analysis of active and inactive peptides against these organisms. 
The resulting models not only provided information on the minimum requirement of positively ionizable and hydrophobic features but also indicated the importance of their relative arrangement in space. The minimum requirement for positively ionizable features was two in both cases, but the number of hydrophobic features required in the case of E. coli was four, while for S. aureus it was found to be three [132]. Hypotheses were further validated using cationic steroid antibiotics, a different class of facial amphiphiles with the same mechanism of antimicrobial action as that of cationic peptide antibiotics. The results showed that cationic steroid antibiotics also require similar minimum features to be active against both E. coli and S. aureus.
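The minimum feature counts reported for the two hypotheses can be written down as a small lookup and used as a coarse first filter. The sketch below is only a counting check under that assumption: it ignores the relative spatial arrangement of the features, which the pharmacophore models of [132] also require, and it does not use CATALYST or HipHop. The dictionary layout, function name and candidate compound are hypothetical.

```python
# Sketch: coarse feature-count filter based on the minima quoted in the text.
# This does NOT reproduce the 3D pharmacophore mapping of [132]; it only
# checks how many features of each kind a compound offers.

MIN_FEATURES = {
    "E. coli": {"positive_ionizable": 2, "hydrophobic": 4},    # hypothesis Ec-2
    "S. aureus": {"positive_ionizable": 2, "hydrophobic": 3},  # hypothesis Sa-6
}


def meets_minimum(feature_counts: dict[str, int], organism: str) -> bool:
    """True if every required feature count is reached for the organism."""
    required = MIN_FEATURES[organism]
    return all(feature_counts.get(name, 0) >= minimum
               for name, minimum in required.items())


# Hypothetical short peptide with two cationic and three bulky groups.
candidate = {"positive_ionizable": 2, "hydrophobic": 3}
results = {organism: meets_minimum(candidate, organism)
           for organism in MIN_FEATURES}

print(results)
```

A compound that passes this count may still fail the full model if its features cannot adopt the required arrangement in space, so the check is useful only for discarding candidates that are too small.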


Interaction of antimicrobial peptides in biomembranes

The cytoplasmic membrane is the main target of some antimicrobial peptides [133]. In fact, all cationic amphipathic peptides interact with membranes [134,135]. Cationic peptides first bind to the negatively charged LPS or LA of Gram-negative bacteria [98,136], then permeate the membrane by different mechanisms, finally leading to bacterial death. The development of resistance to membrane-active peptides whose target is the cytoplasmic membrane is not expected, because this would imply severe changes in the lipid composition of the cell membranes of microorganisms. The mechanism of bacterial membrane disruption by cationic amphipathic peptides should involve several molecular properties of the peptides: a net positive charge (attachment to anionic outer membrane constituents), amphipathicity (aggregation on the membrane surface), and lipophilicity (permeation into the membrane), as has been discussed above [137]. Many models have been proposed for the interaction of cationic amphipathic antimicrobials with the cytoplasmic membrane [61,136-139], because lethal action could result either from membrane disruption or from translocation through the membrane to target receptors inside the cell. The two main proposed mechanisms are:

(i) The “barrel-stave” mechanism: the peptides may form transmembrane channels/pores, as their hydrophobic surfaces interact with the lipid core of the membrane and their hydrophilic surfaces point inwards, producing an aqueous pore [140];

(ii) The “carpet” mechanism: the peptides lie at the interface, parallel to the membrane, allowing their hydrophobic surface to interact with the hydrophobic component of the lipid while the positively charged residues can still interact with the negatively charged head groups of the phospholipid [141].

An NMR study of the amphipathic cyclic β-sheet antimicrobial peptide of gramicidin S [142] supported the interface model.
However, neither of these mechanisms alone could fully explain the reported data. The mechanism of action depends upon the difference in membrane composition between prokaryotic and eukaryotic cells [143]. If the peptides formed pores/channels in the hydrophobic core of the eukaryotic bilayer, they would cause the haemolysis of human erythrocytes. In contrast, for prokaryotic cells the peptides lysed cells by a detergent-like mechanism, as described in the carpet mechanism. In fact, the extent of interaction between peptide and biomembrane depends on the composition of the lipid bilayer. Liu et al. [144,145] used a polyleucine-based α-helical transmembrane peptide to demonstrate that the peptide reduced the phase transition temperature to a greater extent in phosphatidylethanolamine (PE) bilayers than in phosphatidylcholine (PC) or phosphatidylglycerol bilayers, indicating a higher disruption of PE organization. The zwitterionic PE is the main lipid component in prokaryotic cell membranes, and PC is the main lipid component in eukaryotic cell membranes [146]. According to the results, the carpet mechanism is essential for strong antimicrobial activity, and if the peptide preferred to penetrate into the hydrophobic core of the bilayer, the antimicrobial activity would actually decrease [143].

Recently, membrane interactions of designed cationic antimicrobial peptides were reported [147]. Novel cationic antimicrobial peptides typified by sequences such as KKKKKKAAXAAXAAXAA-NH2, where X = Phe/Trp, displayed high antibacterial activity, but exhibited little or no haemolytic activity towards human red blood cells even at high doses. To clarify the mechanism of their selectivity for bacterial vs. mammalian membranes and to increase the understanding of the relationships between primary sequence and bioactivity, a library of derivatives with increased segmental hydrophobicity was prepared, in which two, three, or four Ala residues were systematically replaced by Leu. Conformationally constrained dimeric and cyclic derivatives were also synthesised. The peptides were examined for activity against pathogenic bacteria (Pseudomonas aeruginosa), haemolytic activity on human red blood cells, and insertion into models of natural bacterial membranes (containing anionic lipids) and mammalian membranes (containing zwitterionic lipids + cholesterol). Results were compared with the corresponding properties of the natural cationic antimicrobial peptides magainin and cecropin. Using circular dichroism and fluorescence spectroscopy, Glukhov et al.
[147] found that peptide conformation and membrane insertion were sequence dependent, both upon the number of Leu residues and upon their positions along the hydrophobic core. Membrane disruption was likely enhanced by the fact that the peptides contained potent dimerization-promoting sequence motifs, as assessed by SDS-PAGE gel analysis. Overall, the results identified distinctions in the mechanisms of action of these cationic antimicrobial peptides in disrupting bacterial vs. mammalian membranes, the latter dependent on surpassing a "second hydrophobicity threshold" for insertion into zwitterionic membranes.

Anticancer and cytotoxic activities

There is a need for novel drugs for the treatment of infectious diseases, autoimmunity and cancer. Cyclic peptides constitute a class of compounds
that have made important contributions to the treatment of certain diseases. Penicillin, vancomycin, cyclosporin, the echinocandins and bleomycin are well-known cyclic peptides [148]. Cyclic peptides, compared to linear peptides, have been considered to have greater potential as therapeutic agents due to their increased chemical and enzymatic stability, receptor selectivity, and improved pharmacodynamic properties. They have been used as synthetic immunogens, transmembrane ion channels, antigens for Herpes Simplex Virus, potential immunotherapeutic vaccines for diabetes and Experimental Autoimmune Encephalomyelitis - an animal model of Multiple Sclerosis, as inhibitors against α-amylase and as protein stabilizers. Cyclic peptides as therapeutic agents in disease have been recently reviewed [148]. Isolation and anti-cancer effects of cyclotides obtained from Violaceae plants have been also reviewed [149]. A fractionation protocol was developed, leading to varv cyclotides from Viola arvensis (Violaceae). Separation methods included adsorption, ion exchange chromatography and solvent-solvent partitioning. Structures were determined on the basis of MS for cyclotide sequencing and mapping of disulfide bonds. Finally, to assess SARs, regarding their anti-cancer and cytotoxic effects, the three dimensional structures of cyclotides were characterized by homology modelling techniques [149]. Cytotoxic cyclotides were obtained from Viola tricolor [150]. Bioguided fractionation was carried out by RP-HPLC and a fluorometric cytotoxicity assay. Cyclotides were assayed against two human cancer cell lines, U-937 GTB (lymphoma) and RPMI-8226/s (myeloma). The most potent compounds isolated, which showed the lowest IC50 values, were: vitri A (IC50 = 0.6 μM and IC50 = 1 μM, respectively), varv A (IC50 = 6 μM and IC50 = 3 μM, respectively), and varv E (IC50 = 4 μM in both cell lines). 
Their sequences, determined by automated Edman degradation, quantitative amino acid analysis, and MS, were cyclo-GESCVWIPCITSAIGCSCKSKVCYRNGIPC (vitri A), cyclo-GETCVGGTCNTPGCSCSWPVCTRNGLPVC (varv A), and cyclo-GETCVGGTCNTPGCSCSWPVCTRNGLPIC (varv E) [150]. Cycloviolacin H4, a hydrophobic cyclotide, was isolated from the Australian native violet Viola hederacea. Its sequence, cyclo-(CAESCVWIPCTVTALLGCSCSNNVCYNGIP), was determined by nanospray MS/MS and quantitative amino acid analysis. This cyclotide was classified into the bracelet subfamily of cyclotides due to the absence of a cis-Pro peptide bond in the circular peptide backbone. Cycloviolacin H4 exhibited the most potent haemolytic activity among cyclotides, and this activity correlated with the size of a surface-exposed hydrophobic patch. These findings provided insight into the factors that modulate the cytotoxic properties of cyclotides [151].


Recently, the sequences of 11 cyclotides, vibi A-K, isolated from the alpine violet Viola biflora, were determined by MS/MS sequencing of proteins and screening of a cDNA library of V. biflora in parallel [152]. To correlate amino acid sequence to cytotoxic potency, vibi D, E, G and H were analysed by a fluorometric microculture cytotoxicity assay using a lymphoma cell line. The IC50-values of the bracelet cyclotides vibi E, G and H ranged between 0.96 and 5.0 μM while the Möbius cyclotide vibi D was not cytotoxic at 30 μM [152]. Rational design, structure, and biological evaluation of cyclic peptides mimicking the vascular endothelial growth factor (VEGF) have been recently reported [153]. Angiogenesis is the development of a novel vascular network from a pre-existing structure. Blocking angiogenesis is an attractive strategy to inhibit tumor growth and metastasis formation. Based on structural and mutagenesis data, novel cyclic peptides were developed, which mimic, simultaneously, two regions of the VEGF crucial for the interaction with the VEGF receptors. The peptides, displaying the best affinity for VEGF receptor 1 on a competition assay, inhibited endothelial cell transduction pathway, migration, and capillary-like tubes formation. The specificity of these peptides for VEGF receptors was demonstrated by microscopy using a fluorescent peptide derivative. The resolution of the structure of some cyclic peptides by NMR and molecular modelling allowed the identification of various factors accounting for their inhibitory activity [153]. The interest in the application of the QSAR paradigm has steadily increased in recent decades and it may be useful in the design and development of DNA-binding molecules as new anticancer agents. Due to the great potential of DNA as a receptor, many classes of synthetic and naturally occurring molecules exert their anticancer activities through DNA-binding [154]. 
In the field of antitumour DNA-binding agents, a number of acridine and anthracycline derivatives are on the market as chemotherapeutic agents. However, the clinical application of such compounds has revealed multi-drug resistance and secondary and/or collateral effects. Therefore, there has been increasing interest in discovering and developing small molecules that are capable of DNA-binding. Recently [154], the DNA-binding properties of different compound series have been discussed using 27 QSAR models. The most important determinants for the activity in these models were Hammett electronic (σ and σ+), hydrophobic, molar refractivity, and Sterimol width parameters.

P-glycoprotein is implicated in the multiple drug resistance exhibited by several types of cancer against a multitude of anticancer chemotherapeutic agents. Therefore, several research groups have searched for effective P-glycoprotein inhibitors. Cyclosporine A, aureobasidin A and related analogues were reported to possess potent inhibitory actions against P-glycoprotein. Recently, receptor surface analysis was used to construct two satisfactory receptor surface models for cyclosporine- and aureobasidin-based P-glycoprotein inhibitors [155]. These pseudoreceptors were combined to achieve a satisfactory 3D-QSAR for 68 different cyclosporine and aureobasidin derivatives. Upon validation against an external set of 16 randomly selected P-glycoprotein inhibitors, the optimal 3D-QSAR was found to be self-consistent and predictive (r²LOO = 0.673, r²PRESS = 0.600). The resulting 3D-QSAR was employed to probe the structural factors that control the inhibitory activities of cyclosporine and aureobasidin analogues against P-glycoprotein [155].

Acknowledgements

Thanks are due to Universidad de Buenos Aires and CONICET (Argentina) for financial support, and to Ministerio de Ciencia, Tecnología e Innovación Productiva (MINCYT, Argentina) for electronic bibliography facilities.

References

1. Pomilio, A.B., Battista, M.E., and Vitale, A.A. 2006, Curr. Org. Chem.,

10, 2075. 2. Gallo, R.L., Murakami, M., Ohtake, T., and Zaiou, M. 2002, J. Allergy Clin.

Immunol., 110, 823. 3. Ahn, J.-M., Boyle, N.A., MacDonald, M.T., and Janda, K.D. 2002, Mini Rev.

Med. Chem., 2, 463. 4. Monroc, S., Badosa, E., Besalú, E., Planas, M., Bardají, E., Montesinos, E., and

Feliu, L. 2006, Peptides, 27, 2575. 5. Montesinos, E. 2007, FEMS Microbiol. Lett., 270, 1. 6. Marr, A.K., Gooderham, W.J., and Hancock, R.E.W. 2006, Curr. Opin.

Pharmacol., 6, 468. 7. Kelso, M.J., and Fairlie, D.P. 2003, Current Approaches to Peptidomimetics. In:

Molecular Pathomechanisms and New Trends in Drug Research. Toth I. and Keri G. eds., Taylor and Francis Ltd., London and New York, Chapter 44, 579.

8. Leung, D., Abbenante, G., and Fairlie, D.P. 2000, J. Med. Chem., 43, 305. 9. Westermann, J.C., and Craik, D.J. 2008, Methods Mol. Biol., 494, 87. 10. Jones, R.M., Boatman, P.D., Semple, G., Shin, Y.-J., and Tamura, S.Y. 2003,

Curr. Opin. Pharmacol., 3, 530. 11. Hruby, V.J. 2002, Nature Rev. Drug Discov., 1, 847. 12. Giannis, A., and Kolter, T. 1993, Angew. Chem. Int. Ed. Engl., 32, 1244. 13. Gante, J. 1994, Angew. Chem. Int. Ed. Engl., 33, 1699.


14. Nakanishi, H., and Kahn, M. 2003, Design of Peptidomimetics. In: The Practice of Medicinal Chemistry. 2nd ed., Wermuth C.G. ed., Academic Press, London.

15. Hirschmann, R., Nicolaou, K.C., Pietranico, S., Leahy, E.M., Salvino, J., Arison, B., Cichy, M.A., Spoors, P.G., Shakespeare, W.C., Sprengeler, P.A., Hamley, P., Smith III, A.B., Reisine, T., Raynor, K., Maechler, L., Donaldson, C., Vale, W., Freidinger, R.M., Cascieri, M.R., and Strader, C.D. 1993, J. Am. Chem. Soc., 115, 12550.

16. Våbeno, J., Nikiforovich, G.V., and Marshall, G.R. 2006, Biopolymers, 84, 459. 17. Ball, J.B., and Alewood, P.F. 1990, J. Mol. Recogn., 3, 55. 18. Ball, J.B., Hughes, R.A., Alewood, P.F., and Andrews, P.R. 1993, Tetrahedron,

49, 3467. 19. Hutchinson, E.G., and Thornton, J.M. 1994, Protein Sci., 3, 2207. 20. Etzkorn, F.A., Travins, J.M., and Hart, S.A. 1999, Rare protein turns: γ-turn,

helix-turn-helix, and cis-proline mimics. In: Advances in Amino Acid Mimetics and Peptidomimetics. Abell A. Ed., Jai Press Inc., Stanford, Vol. 2, 126.

21. Gruber, C.W., Elliott, A.G., Ireland, D.C., Delprete, P.G., Dessein, S., Göransson, U., Trabi, M., Wang, C.K., Kinghorn, A.B., Robbrecht, E., and Craik, D.J. 2008, Plant Cell, 20, 2471.

22. Pelegrini, P.B., Quirino, B.F., and Franco, O.L. 2007, Peptides, 28, 1475. 23. Gruber, C.W., Cemazar, M., Anderson, M.A., and Craik, D.J. 2007, Toxicon,

49, 561. 24. Mach, J. 2008, Plant Cell, 20, 2285. 25. Craik, D.J., Clark, R.J., and Daly, N.L. 2007, Expert Opin. Investig. Drugs,

16, 595. 26. Craik, D.J., Cemazar, M., and Daly, N.L. 2007, Curr. Opin. Drug Discov. Devel.,

10, 176. 27. Wang, C.K., Kaas, Q., Chiche, L., and Craik, D.J. 2008, Nucleic Acids Res., 36,

D206. 28. Thongyoo, P., Roqué-Rosell, N., Leatherbarrow, R.J., and Tate, E.W. 2008, Org.

Biomol. Chem., 6, 1462. 29. Combelles, C., Gracy, J., Heitz, A., Craik, D.J., and Chiche, L. 2008, Proteins,

73, 87. 30. Shenkarev, Z.O., Nadezhdin, K.D., Lyukmanova, E.N., Sobol, V.A., Skjeldal, L.,

and Arseniev, A.S. 2008, J. Inorg. Biochem., 102, 1246. 31. Ireland, D.C., Wang, C.K., Wilson, J.A., Gustafson, K.R., and Craik, D.J. 2008,

Biopolymers, 90, 51. 32. Colgrave, M.L., Kotze, A.C., Ireland, D.C., Wang, C.K., and Craik, D.J. 2008,

Chembiochem, 9, 1939. 33. Colgrave, M.L., Kotze, A.C., Huang, Y.H., O'Grady, J., Simonsen, S.M., and

Craik, D.J. 2008, Biochemistry, 47, 5581. 34. Plan, M.R., Saska, I., Cagauan, A.G., and Craik, D.J. 2008, J. Agric. Food

Chem., 56, 5237. 35. Morita, H., Nagashima, S., Takeya, K., and Itokawa, H. 1993, Chem. Pharm.

Bull., 41, 992.


36. Morita, H., Nagashima, S., Uchiumi, Y., Kuroki, O., Takeda, K., and Itokawa, H. 1996, Chem. Pharm. Bull., 44, 1026.

37. Saviano, G., Benedetti, E., Cozzolino, R., De Capua, A., Laccetti, P., Palladino, P., Zanotti, G., Amodeo, P., Tancredi, T., and Rossi, F. 2004, Biopolymers, 76, 477.

38. Rossi, F., Zanotti, G., Saviano, M., Iacovino, R., Palladino, P., Saviano, G., Amodeo, P., Tancredi, T., Laccetti, P., Corbier, C., and Benedetti, E. 2004, J. Pept. Sci., 10, 92.

39. Cozzolino, R., Palladino, P., Rossi, F., Cali, G., Benedetti, E., and Laccetti, P. 2005, Carcinogenesis, 26, 733.

40. Battista, M.E., Vitale, A.A., and Pomilio, A.B. 2000, Molecules, 5, 489. 41. Pomilio, A.B., Battista, M.E., and Vitale, A.A. 2001, Theochem, 536, 243. 42. Thompson, S.J., Hattotuwagama, C.K., Holliday, J.D., and Flower, D.R. 2006,

Bioinformation, 1, 237. 43. Eros, D., Kövesdi, I., Orfi, L., Takács-Novák, K., Acsády, G., and Kéri, G. 2002,

Curr. Med. Chem., 9, 1819. 44. Modi, S. 2003, Drug Discov. Today, 8, 621. 45. Obrezanova, O., Csanyi, G., Gola, J.M., and Segall, M.D. 2007, J. Chem. Inf.

Model., 47, 1847. 46. Obrezanova, O., Gola, J.M., Champness, E.J., and Segall, M.D. 2008, J. Comput.

Aided Mol. Des., 22, 431. 47. Frecer, V., Berti, F., Benedetti, F., and Miertus, S. 2008, J. Mol. Graph. Model.,

27, 376. 48. Chen, Y., Mant, C.T., and Hodges, R.S. 2002, J. Pept. Res., 59, 18. 49. Akamatsu, M., and Fujita, T. 1992, J. Pharm. Sci., 81, 164. 50. Akamatsu, M., Katayama, T., Kishimoto, D., Kurokawa, Y., Shibata, H., Ueno,

T., and Fujita, T. 1994, J. Pharm. Sci., 83, 1026. 51. Osakai, T., Hirai, T., Wakamiya, T., and Sawada, S. 2006, Phys. Chem. Chem.

Phys., 8, 985. 52. Peng, T., Wang, R., and Lai, L. 1999, J. Mol. Model., 5, 189. 53. Wang, R., Fu, Y., and Lai, L. 1997, J. Chem. Inf. Comp. Sci., 37, 615. 54. Guan, P., Doytchinova, I.A., Walshe, V.A., Borrow, P., and Flower, D.R. 2005,

J. Med. Chem., 48, 7418. 55. Weininger, D. 1988, J. Chem. Inf. Comp. Sci., 28, 31. 56. Sadowski, J., and Gasteiger, J. 1993, Chem. Rev., 93, 2567. 57. Hattotuwagama, C.K., and Flower, D.R. 2006, Bioinformation, 1, 257. 58. Andres, E., and Dimarcq, J.L. 2004, J. Int. Med., 255, 519. 59. Hancock, R.E.W., and Sahl, H.G. 2006, Nature Biotech., 24, 1551. 60. Beisswenger, C., and Bals, R. 2005, Curr. Protein Pept. Sci., 6, 255. 61. Sitaram, N., and Nagaraj, R. 2002, Curr. Pharm. Des., 8, 727. 62. Park, Y., and Hahm, K.S. 2005, J. Biochem. Mol. Biol., 38, 507. 63. Takada, H., and Kotani, S. 1989, Crit. Rev. Microbiol., 16, 477. 64. Parillo, J.E. 1993, N. Engl. J. Med., 328, 1471. 65. Gough, M., Hancock, R.E.W., and Kelly, N.M. 1996, Infect. Immun., 64, 4922.


66. Ahmad, A., Yadav, S.P., Asthana, N., Mitra, K., Srivastava, S.P., and Ghosh, J.K. 2006, J. Biol. Chem., 281, 22029.

67. Brown, K.L., and Hancock, R.E.W. 2006, Curr. Opin. Immunol., 18, 24. 68. Frecer, V., Ho, B., and Ding, J.L. 2000, Eur. J. Biochem., 267, 837. 69. Chen, X., Dings, R.P., Nesmelova, I., Debbert, S., Haseman, J.R., Maxwell, J.,

Hoye, T.R., and Mayo, K.H. 2006, J. Med. Chem., 49, 7754. 70. Hilpert, K., Elliot, M.R., Volkmer-Engert, R., Henklein, P., Donini, O., Zhou, Q.,

Winkler, D.F., and Hancock, R.E.W. 2006, Chem Biol., 13, 1101. 71. Bowdish, D.M.E., Davidson, D.J., Lau, Y.E., Lee, K., Scott, M.G., and Hancock,

R.E.W. 2005, J. Leukoc. Biol., 77, 451. 72. McPhee, J.B., and Hancock, R.E.W. 2005, J. Pept. Sci., 11, 677. 73. Lee, M.K., Cha, L., Lee, S.H., and Hahm, K.S. 2002, J. Biochem. Mol. Biol., 35, 291. 74. Zikou, S., Koukkou, A.I., Mastora, P., Sakarellos-Daitsiotis, M., Sakarellos, C.,

Drainas, C., and Panou-Pomonis, E. 2007, J. Pept. Sci., 13, 481. 75. Hammami, R., Ben Hamida, J., Vergoten, G., and Fliss, I. 2008, Nucleic Acids

Res., Epub ahead of print. PMID: 18836196. 76. Wang, G., Li, X., and Wang, Z. 2008, Nucleic Acids Res., Epub ahead of print.

PMID: 18957441. 77. Dennison, S.R., Morton, L.H., Harris, F., and Phoenix, D.A. 2007, Biophys.

Chem., 129, 279. 78. Dennison, S.R., Morton, L.H., Harris, F., and Phoenix, D.A. 2008, Chem. Phys.

Lipids, 151, 92. 79. Chen, Y., Vasil, A.I., Rehaume, L., Mant, C.T., Burns, J.L., Vasil, M.L.,

Hancock, R.E.W., and Hodges, R.S. 2006, Chem. Biol. Drug Des., 67, 162. 80. Chen, Y., Guarnieri, M.T., Vasil, A.I., Vasil, M.L., Mant, C.T., and Hodges, R.S.

2007, Antimicrob. Agents Chemother., 51, 1398. 81. Ganz, T., and Lehrer, R. I. 1994, Curr. Opin. Immunol., 6, 584. 82. Fattorini, L., Gennaro, R., Zanetti, M., Tan, D., Brunori, L., Giannoni, F.,

Pardini, M., and Orefici, G. 2004, Peptides, 25, 1075. 83. Mootz, H.D., and Marahiel, M.A. 1997, J. Bacteriol., 179, 6843. 84. Sitaram, N. 2006, Curr. Med. Chem., 13, 679. 85. Montesinos, E., and Bardají, E. 2008, Chem. Biodivers., 5, 1225. 86. De Smet, K., and Contreras, R. 2005, Biotechnol. Lett., 27, 1337. 87. Klüver, E., Schulz-Maronde, S., Scheid, S., Meyer, B., Forssmann, W.G.,

and Adermann, K. 2005, Biochemistry, 44, 9804. 88. Mookherjee, N., Rehaume, L., and Hancock, R.E.W. 2007, Expert Opin. Therap.

Targets, 11, 993. 89. Selsted, M.E., and Ouellette, A.J. 2005, Nat. Immunol., 6, 551. 90. Cole, A.M., and Lehrer, R.I. 2003, Curr. Pharm. Des., 9, 1463. 91. Redman, J.E., Wilcoxen, K.M., and Ghadiri, M.R. 2003, J. Comb. Chem., 5, 33. 92. Boddy, C.N. 2004, Chem. Biol., 11, 1599. Comment on: Chem. Biol., 2004, 11,

1635. 93. Devine, D.A., and Hancock, R.E.W. 2002, Curr. Pharm. Des., 8, 703. 94. Jin, Y., Hammer, J., Pate, M., Zhang, Y., Zhu, F., Zmuda, E., and Blazyk, J.

2005, Antimicrob. Agents Chemother., 49, 4957.


95. Muhle, S.A., and Tam, J.P. 2001, Biochemistry, 40, 5777. 96. Ramamoorthy, A., Thennarasu, S., Tan, A., Gottipati, K., Sreekumar, S., Heyl,

D.L., An, F.Y., and Shelburne, C.E. 2006, Biochemistry, 45, 6529. 97. Schmitt, M.A., Weisblum, B., and Gellman, S.H. 2007, J. Am. Chem. Soc., 129,

417. 98. Oren, Z., and Shai, Y. 2000, Biochemistry, 39, 6103. 99. Kondejewski, L.H., Lee, D.L., Jelokhani-Niaraki, M., Farmer, S.W., Hancock,

R.E.W., and Hodges, R.S. 2002, J. Biol. Chem., 277, 67. 100. Lee, D.L., Powers, J.P.S., Pflegerl, K., Vasil, M.L., Hancock, R.E.W., and

Hodges, R.S. 2004, J. Pept. Res., 63, 69. 101. Chen, Y., Guarnieri, M.T., Vasil, A.I., Vasil, M.L., Mant, C.T., and Hodges, R.S.

2007, Antimicrob. Agents Chemother., 51, 1398. 102. Frecer, V., Ho, B., and Ding, J. L. 2004, Antimicrob. Agents Chemother., 48,

3349. 103. Jenssen, H., Hamill, P., and Hancock, R.E.W. 2006, Clin. Microbiol. Rev., 19,

491. 104. Lee, D.L., Mant, C.T., and Hodges, R.S. 2003, J. Biol. Chem., 278, 22918. 105. Brown, K.L., Mookherjee, N., and Hancock, R.E.W. 2007, Antimicrobial, Host

Defence Peptides and Proteins. Encyclopedia of Life Sciences. John Wiley & Sons, Ltd, Chichester.

106. Mei, H., Liao, Z.H., Zhou, Y., and Li, S.Z. 2005, Biopolymers, 80, 775. 107. Tian, F., Zhou, P., Lv, F., Song, R., and Li, Z. 2007, J. Pept. Sci., 13, 549. 108. Eisenberg, D., Weiss, R.M., and Terwilliger, T.C. 1982, Nature, 299, 371. 109. Carver, T., and Bleasby, A. 2003, Bioinformatics, 19, 1837. 110. Feder, R., Dagan, A., and Mor, A. 2000, J. Biol. Chem., 275, 4230. 111. Rotem, S., Radzishevsky, I., and Mor, A. 2006, Antimicrob. Agents Chemother.,

50, 2666. 112. Roumestand, C., Louis, V., Aumelas, A., Grassy, G., Calas, B., and Chavanieu,

A. 1998, FEBS Lett., 421, 263. 113. Skalicky, J.J., Selsted, M.E., and Pardi, A. 1994, Proteins, 20, 52. 114. Abraham, T., Lewis, R.N., Hodges, R.S., and McElhaney, R.N. 2005,

Biochemistry, 44, 2103. 115. Robinson, J.A., Shankaramma, S.C., Jetter, P., Kienzl, U., Schwenderer, R.A.,

Vrijbloed, J. W., and Obrecht, D. 2005, Bioorg. Med. Chem., 13, 2055. 116. Chen, J., Falla, T.J., Liu, H., Hurst, M.A., Fujii, C.A., Mosca, D.A., Embree, J.R.,

Loury, D.J., Radel, P.A., Chang, C.C., Gu, L., and Fiddes, J.C. 2000, Biopolymers, 55, 88.

117. Shankaramma, S.C., Athanassiou, Z., Zerbe, O., Moehle, K., Mouton, C., Bernardini, F., Vrijbloed, J.W., Obrecht, D., and Robinson, J.A. 2002, ChemBioChem, 3, 1126.

118. Shankaramma, S.C., Moehle, K., James, S., Vrijbloed, J.W., Obrecht, D., and Robinson, J.A. 2003, Chem. Commun., 1842.

119. Favre, M., Moehle, K., Jiang, L., Pfeiffer, B., and Robinson, J.A. 1999, J. Am. Chem. Soc., 121, 2679.


120. Descours, A., Moehle, K., Renard, A., and Robinson, J.A. 2002, ChemBioChem, 3, 318.

121. Athanassiou, Z., Dias, R.L.A., Moehle, K., Dobson, N., Varani, G., and Robinson, J.A. 2004, J. Am. Chem. Soc., 126, 6906.

122. Fasan, R., Dias, R.L.A., Moehle, K., Zerbe, O., Vrijbloed, J.W., Obrecht, D., and Robinson, J.A. 2004, Angew. Chem., Int. Ed., 43, 2109.

123. Frecer, V. 2006, Bioorg. Med. Chem., 14, 6065. 124. Dawson, R.M.C., Elliott, D.C., Elliott, W.H., and Jones, K.M. 1986, Data for

Biochemical Research. 3rd Ed., Oxford Science Publications, 1. 125. Jenssen, H., Lejon, T., Hilpert, K., Fjell, C., Cherkasov, A., and Hancock, R.E.W.

2007, Chem. Biol. Drug Design, 70, 134. 126. Rogers, D., and Hopfinger, A.J. 1994, J. Chem. Inf. Comput. Sci., 34, 854. 127. Friedman, J. 1990, Multivariate Adaptive Regression Splines. Revised edition, 1st

ed: 1988, Technical Report 102, Laboratory for Computational Statistics, Department of Statistics, Stanford University: Stanford, CA.

128. Tam, J.P., Wu, C., and Yang, J.-L. 2000, Eur. J. Biochem., 267, 3289. 129. Lai, J.R., Huck, B.R., Weisblum, B., and Gellman, S.H. 2002, Biochemistry, 41,

12835. 130. Mani, R., Waring, A.J., Lehrer, R.I., and Hong, M. 2005, Biochim. Biophys. Acta,

1716, 11. 131. Bhonsle, J.B., Venugopal, D., Huddler, D.P., Magill, A.J., and Hicks, R.P. 2007,

J. Med. Chem., 50, 6545. 132. Sundriyal, S., Sharma, R.K., Jain, R., and Bharatam, P.V. 2008, J. Mol. Model.,

14, 265. 133. Yu, L., Ding, J.L., Ho, B., and Wohland, T. 2005, Biochim. Biophys. Acta, 1716,

29. 134. Wu, M., and Hancock, R.E.W. 1999, J. Biol. Chem., 274, 29. 135. Glukhov, E., Stark, M., Burrows, L.L., and Deber, C.M. 2005, J. Biol. Chem.,

280, 33960. 136. Hancock, R.E.W., and Rozek, A. 2002, FEMS Microbiol. Lett., 206, 143. 137. Blondelle, S.E., Lohner, K., and Aguilar, M. 1999, Biochim. Biophys. Acta, 1462,

89. 138. Sitaram, N., and Nagaraj, R. 1999, Biochim. Biophys. Acta, 1462, 29. 139. Zhang, L., Rozek, A., and Hancock, R.E.W. 2001, J. Biol. Chem., 276, 35714. 140. Ehrenstein, G., and Lecar, H. 1977, Q. Rev. Biophys., 10, 1. 141. Pouny, Y., Rapaport, D., Mor, A., Nicolas, P., and Shai, Y. 1992, Biochemistry,

31, 12416. 142. Salgado, J., Grage, S.L., Kondejewski, L.H., Hodges, R.S., McElhaney, R.N.,

and Ulrich, A.S. 2001, J. Biomol. NMR, 21, 191. 143. Chen, Y., Mant, C.T., Farmer, S.W., Hancock, R.E.W., Vasil, M.L., and Hodges,

R.S. 2005, J. Biol. Chem., 280, 12316. 144. Liu, F., Lewis, R.N., Hodges, R.S., and McElhaney, R.N. 2004, Biochemistry,

43, 3679. 145. Liu, F., Lewis, R.N., Hodges, R.S., and McElhaney, R.N. 2004, Biophys. J.,

87, 2470.


146. Devaux, P.F., and Seigneuret, M. 1985, Biochim. Biophys. Acta, 822, 63. 147. Glukhov, E., Burrows, L.L., and Deber, C.M. 2008, Biopolymers, 89, 360. 148. Katsara, M., Tselios, T., Deraos, S., Deraos, G., Matsoukas, M., Lazoura, E.,

Matsoukas, J., and Apostolopoulos, V. 2006, Curr. Med. Chem., 13, 2221. 149. Göransson, U., Svangård, E., Claeson, P., and Bohlin, L. 2004, Curr. Protein

Pept. Sci., 5, 317. 150. Svangård, E., Göransson, U., Hocaoglu, Z., Gullbo, J., Larsson, R., Claeson, P.,

and Bohlin, L. 2004, J. Nat. Prod., 67, 144. 151. Chen, B., Colgrave, M.L., Wang, C., and Craik, D.J. 2006, J. Nat. Prod., 69, 23. 152. Herrmann, A., Burman, R., Mylne, J.S., Karlsson, G., Gullbo, J., Craik, D.J.,

Clark, R.J., and Göransson, U. 2008, Phytochemistry, 69, 939. 153. Gonçalves, V., Gautier, B., Coric, P., Bouaziz, S., Lenoir, C., Garbay, C., Vidal,

M., and Inguimbert, N. 2007, J. Med. Chem., 50, 5135. 154. Verma, R.P., and Hansch, C. 2008, J. Pharm. Sci., 97, 88. 155. Zalloum, H.M., and Taha, M.O. 2008, J. Mol. Graph. Model., 27, 439.



2. QSPR studies on amino acids: Application to proteins

Francisco Torrens¹,* and Gloria Castellano²

¹Institut Universitari de Ciència Molecular, Universitat de València, Edifici d’Instituts de Paterna, P. O. Box 22085, E 46071 València, Spain
²Departamento de Ciencias Experimentales y Matemáticas, Facultad de Ciencias Experimentales, Universidad Católica de Valencia San Vicente Mártir, Guillem de Castro 94, E 46003 València, Spain

Correspondence/Reprint request: Dr. Francisco Torrens, Institut Universitari de Ciència Molecular, Universitat de València, Edifici d’Instituts de Paterna, P. O. Box 22085, E 46071 València, Spain

Abstract. Valence topological charge-transfer (CT) indices are applied to the calculation of the pH at the isoelectric point, pI. The combination of CT indices allows the estimation of pI. The model is generalized for molecules with heteroatoms. The ability of the indices to describe the molecular charge distribution is established by comparing them with the pI of 21 amino acids. Linear correlation models are obtained. The CT indices improve multivariable regression equations for pI. The variance decreases by 95%. No superposition of the corresponding Gk–Jk and GkV–JkV pairs is observed in most fits, which diminishes the risk of collinearity. The inclusion of heteroatoms in the π-electron system is beneficial for the description of pI, owing to either the role of the additional p orbitals provided by the heteroatom or the role of steric factors in π-electron conjugation. The use of only CT and valence CT indices Gk, Jk, GkV, JkV gives limited results for modelling the pI of amino acids. Furthermore, the inclusion of the numbers of acidic and basic groups improves all models. The effect is especially
noticeable for amino acids with more than two functional groups. The fitting line obtained for the 21 amino acids can be used to estimate the isoelectric point of lysozyme and its fragments, by only replacing (1+Δn/nT) with (M+Δn)/nT. For lysozyme, the results for the smaller fragments estimate that of the whole protein with 1–13% errors.

Introduction

During the simulation of the pH at the isoelectric point, pI, of n = 21 amino acids, Pogliani introduced the indices D and Dv, and the concept of fragmentary molecular connectivity indices [1,2]. He defined the terms XpI = (χ/0χv)(1+Δn/nT), where Δn = nA – nB, nA = number of acidic groups (two for Asp and Glu, one for all others), nB = number of basic groups (two for His and Lys, three for Arg, and one for all others), and nT = nA + nB (total number of functional groups); for nT = 2, Δn = 0 [3–5]. There were eight such terms, according to the type of index χ that enters the numerator. He defined the nomenclature χ = Dv → X ≡ DXv, etc. The best single descriptor for pI was 0Xv with Q = 2.12, F = 267, r = 0.966, s = 0.46, u = (16, 28). He improved statistic Q at the expense of statistics F and u with a linear combination of X terms made up of connectivity indices, which can be derived with the aid of both forward and full combinatorial techniques: DXv, 0X, 0Xv, 1X: Q = 2.53, F = 95, r = 0.980, s = 0.39, u = (3.1, 2.8, 4.7, 2.8, 26). The average <u> dropped to 7.9, the utility of 0Xv dropped dramatically, and only the unitary index maintained a good utility. He used the vector of orthogonalized terms Ω = (1Ω, 2Ω, 3Ω, 4Ω, U0), where 1Ω ≡ 0Xv, 2Ω ← DXv, 3Ω ← 1X, 4Ω ← 0X. The vector showed u = (19, 1.3, 1.0, 2.8, 33). Parameters 1Ω ≡ 0Xv and U0 ≡ Ω0 ≡ 1 were important. He obtained an enhanced utility for 1Ω and U0: 19 and 33. The statistical score of the molar masses for pI was Q = 0.002 and F = 0.14.
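The functional-group factor (1+Δn/nT) that enters the XpI terms can be illustrated with a minimal Python sketch. The group counts are those quoted in the text; the function names `group_factor` and `x_pi` are hypothetical, and the connectivity indices χ and 0χv are assumed to be supplied by the user, since their values are not given here.

```python
# Numbers of acidic (n_A) and basic (n_B) groups quoted in the text;
# every amino acid not listed is taken to have one of each.
ACIDIC_GROUPS = {"Asp": 2, "Glu": 2}
BASIC_GROUPS = {"His": 2, "Lys": 2, "Arg": 3}


def group_factor(residue):
    """Return (1 + delta_n / n_T) for a three-letter amino-acid code."""
    n_a = ACIDIC_GROUPS.get(residue, 1)
    n_b = BASIC_GROUPS.get(residue, 1)
    delta_n = n_a - n_b      # delta_n = n_A - n_B
    n_t = n_a + n_b          # total number of functional groups
    return 1.0 + delta_n / n_t


def x_pi(chi, chi0_v, residue):
    """Hypothetical helper: X_pI = (chi / 0chi^v) * (1 + delta_n / n_T).

    chi and chi0_v are connectivity indices computed elsewhere.
    """
    return (chi / chi0_v) * group_factor(residue)


if __name__ == "__main__":
    for name in ("Gly", "Asp", "Lys", "Arg"):
        print(name, round(group_factor(name), 4))
```

For amino acids with two functional groups the factor reduces to 1, in agreement with the condition Δn = 0 for nT = 2; it exceeds 1 for the acidic residues and falls below 1 for the basic ones.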
He confirmed a small interrelation between the eight terms: <rIM(pI:X)> = 0.560, rw(DX, Xt) = 0.004, and rs(DX, 1X) = 0.975, where rw and rs stand for the weakest and strongest interrelations, respectively. The 0Xv term was trivial, nothing other than (1 + Δn/nT) (cf. [6] for a review). He discovered the term X’pI = [(1χv)^0.5 – 180χtv]/D(0.04χtv + Δn/nT), whose modelling power was remarkable: Q = 3.41, F = 693, r = 0.987, s = 0.29, <u> = 58, u = (26, 90), and the correlation vector C = (77.99429, 5.75382). He wrote the modelling equation pI = 5.75 + 77.99X’pI, which was a highly dominant dead-end term. The 0Xv term was mainly based on valence-type molecular connectivity indices.

The generation and decomposition of amino-acid and peptide radicals are processes of biological importance, due to their connection to the oxidative damage caused by ionizing radiation or oxidizing agents [7,8].


Experimental studies showed that amino-acid and peptide radical cations can be generated by the electrospray technique and peptide cationization using Cu2+ [9]. The mass spectra obtained in these cases are rich and differ considerably from those of protonated systems, which can provide useful information in peptide sequencing. The group of Sodupe performed quantum chemical calculations on nine amino acids and the smallest N-glycylglycine peptide [10,11]. They discussed the influence of intramolecular hydrogen bonds and the amino-acid side chain on the localization of the electron hole upon oxidation and on the subsequent fragmentation process. They showed that for systems involving aromatic amino acids, oxidation is mainly produced at the side chain, whereas for nonaromatic ones oxidation is produced either at the basic NH2 or CO groups, the nature of the electron hole depending on the existing intramolecular hydrogen bonds.

In earlier publications, topological charge-transfer (CT) indices [12] were applied to the calculation of the molecular dipole moment of hydrocarbons [13], valence-isoelectronic series of benzene, styrene [14,15] and cyclopentadiene [16], phenyl alcohols [17], 4-alkylanilines [18] and amino acids [19]. In the present report, the valence CT indices have been applied to the calculation of the pH at the isoelectric point, pI, of 21 amino acids. The next section presents the CT indices and their generalization for heteroatoms. Following that, the fractal and fractal hybrid orbital (HO) analyses of the tertiary structure of the protein molecule are presented. Then, the phylogenesis of avian birds and the 1918 influenza virus are examined. Next, distinct molecular surfaces and the hydrophobicity of amino-acid residues in proteins are reviewed. Following that, volumetric studies of L-valine and L-leucine in aqueous solutions of NaBr are reviewed. Then, the pH-dependent properties of proteins using pKa calculations and accurate, conformation-dependent predictions of solvent effects on protein ionization constants are analyzed. Next, a simple method for protein structural classification is summarized. Following that, the use of amino-acid composition to predict ligand-binding sites is reported. Then, the calculation results are presented and discussed. The last section summarizes the perspectives.

Computational method

The most important matrices that delineate the labelled chemical graph are the adjacency (A) [20] and distance (D) matrices, wherein Dij = lij if i ≠ j, and 0 otherwise; lij is the shortest edge count between vertices i and j [21]. In A, Aij = 1 if vertices i and j are adjacent, and 0 otherwise. The D[-2] matrix is that whose elements are the squares of the reciprocal distances, Dij^-2. The intermediate matrix M is defined as the matrix product of A by D[-2]: M = AD[–2].


The CT matrix C is defined as C = M – MT, where MT is the transpose of M [22]. By agreement Cii = Mii. For i ≠ j, the Cij terms represent a measure of the intramolecular net charge transferred from atom j to i. The topological CT indices Gk are described as the sum of absolute values of the Cij terms defined for the vertices i,j placed at a topological distance Dij equal to k:

$$G_k = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left| C_{ij} \right| \, \delta(k, D_{ij}) \tag{1}$$

where N is the number of vertices in the graph, Dij are the entries of the D matrix and δ is the Kronecker δ function, being δ(k,Dij) = 1 for k = Dij and δ(k,Dij) = 0 for k ≠ Dij. The Gk represent the sum of the absolute values of all the Cij terms, for every pair of vertices i and j at topological distance k. Another topological CT index, Jk, is defined as:

$$J_k = \frac{G_k}{N-1} \tag{2}$$

The index Jk represents the mean value of CT for each edge, since the number of edges for acyclic compounds is N – 1. When heteroatoms are present, some way of discriminating atoms of different kinds needs to be considered [23]. In valence CT-index terms, the presence of each heteroatom is taken into account by introducing its electronegativity in the corresponding entry of the main diagonal of the adjacency matrix A. For each heteroatom X its entry Aii is redefined as:

$$A_{ii}^{V} = 2.2\,(\chi_X - \chi_C) \tag{3}$$

to give the valence adjacency matrix AV, where χX and χC are the electronegativities of heteroatom X and carbon, respectively, in Pauling units. The subtractive term keeps AiiV = 0 for the C atom, and the factor gives AiiV = 2.2 for O, which was taken as standard. From AV instead of A, the quantities MV, CV, GkV and JkV are calculated following the former procedure. The CiiV, GkV and JkV are graph invariants.
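The definitions in Equations (1)–(3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used for the chapter. It makes several assumptions that the text does not state: the graph is the hydrogen-suppressed skeleton of alanine with a hypothetical atom ordering, the carbonyl C=O bond is entered in A with multiplicity 2, and the electronegativities are taken as 2.5 (C), 3.0 (N) and 3.5 (O). Under these assumptions the sketch gives G1–G3 = 2.5000, 2.6667, 0.5000 and G1V–G3V = 4.5000, 3.3222, 1.2333, which agree with the Ala entries of Table 7.

```python
from collections import deque


def topological_distances(adj):
    """Shortest edge counts between all vertex pairs (breadth-first search).

    The graph is assumed to be connected.
    """
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                # Diagonal (valence) entries are not edges, so skip v == u.
                if v != u and adj[u][v] != 0 and v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def ct_indices(adj, kmax=5):
    """Return ({k: G_k}, {k: J_k}) following Equations (1) and (2)."""
    n = len(adj)
    dist = topological_distances(adj)
    # D[-2]: squared reciprocal distances, zero on the diagonal.
    inv_sq = [[0.0 if i == j else 1.0 / dist[i][j] ** 2 for j in range(n)]
              for i in range(n)]
    # M = A D[-2]
    m = [[sum(adj[i][k] * inv_sq[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    g = {k: 0.0 for k in range(1, kmax + 1)}
    for i in range(n - 1):
        for j in range(i + 1, n):
            k = dist[i][j]
            if k in g:
                g[k] += abs(m[i][j] - m[j][i])  # |C_ij| with C = M - M^T
    j_idx = {k: g[k] / (n - 1) for k in g}
    return g, j_idx


# Alanine heavy atoms (assumed ordering): 0 N, 1 C-alpha, 2 C-beta,
# 3 carboxyl C, 4 carbonyl O, 5 hydroxyl O.
ELEMENTS = ["N", "C", "C", "C", "O", "O"]
BONDS = [(0, 1, 1), (1, 2, 1), (1, 3, 1), (3, 4, 2), (3, 5, 1)]  # (i, j, order)
CHI = {"C": 2.5, "N": 3.0, "O": 3.5}  # assumed electronegativities


def build_adjacency(valence=False):
    n = len(ELEMENTS)
    adj = [[0.0] * n for _ in range(n)]
    for i, j, order in BONDS:
        adj[i][j] = adj[j][i] = float(order)  # assumption: bond order as weight
    if valence:
        for i, element in enumerate(ELEMENTS):
            adj[i][i] = 2.2 * (CHI[element] - CHI["C"])  # Equation (3)
    return adj


g, jk = ct_indices(build_adjacency())
gv, jv = ct_indices(build_adjacency(valence=True))
print({k: round(v, 4) for k, v in g.items()})
print({k: round(v, 4) for k, v in gv.items()})
```

The valence indices reuse the same routine; only the diagonal of the adjacency matrix changes, which is the point of Equation (3).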

The enzyme lysozyme (129 amino-acid residues, molecular weight 14307 g·mol–1) has been taken from the Protein Data Bank, code 2LYM (cf. [24] for a review). The charge on lysozyme is +12.0e at pH 4.0, +8.0e at pH 7.0, +4.0e at pH 10.0 and decreases rapidly as the isoelectric point at pH 11.35 is approached [25].


Fractal analysis of tertiary structure of protein molecule

A method for the computation of a dimension index D was implemented in our program TOPO and applied to calculate the solvent-accessible surfaces (SASs) of molecules [26–30]. TOPO distinguished external from internal atoms and used this feature to give two fractal-like dimension indices, viz. D and D' [31]. The D'–D difference was a sensitive indicator of the occurrence of atoms hidden from the solvent. For molecules with buried (solvent-excluded) atoms the difference was greater (e.g., faujasite). The procedure was compared with our version of the code GEPOL, which provided high-quality results [32–41]. The systematic error of TOPO was easily corrected by the simple addition of a small constant value (0.011). Correlation models between indices D and D', globularity G, rugosity G', dipole moment and other properties made clear the existence of a homogeneous molecular structure in each series. Additional applications were the extrapolation of D to infinite polymers, the variation of D with generations of dendrimers and a revision of D for lysozyme. A comparative analysis with the results obtained with our version of program SURMO2, which does not consider the cavity, allowed the cavity to be characterized [42]. The method was applied to the lysozyme secondary structures and cavity-like space (cf. Figure 1). The dipole moments calculated for the helices were greater than for the sheet. For the helices, the main contribution to the water-accessible surface area was the hydrophobic term, while the hydrophilic component dominated in the sheet. The molecular globularity G was the topological index that best differentiated quantitatively between helices and sheet (cf. Table 1).

Figure 1. Ribbon image of lysozyme linking Cα skeleton: A–D) helices, E) β-sheet and S) binding site.


Table 1. Topological indices for lysozyme secondary-structure regions [G’ (Å–1)].

Structure G G ref. G' G' ref. D D ref. D'

Mean of helices A–D 0.472 0.442 1.092 1.161 1.581 1.597 1.749

Antiparallel β-sheet E 0.404 0.381 1.107 1.173 1.629 1.646 1.821

All molecule 0.203 0.188 1.033 1.113 1.908 1.930 2.201

All molecule (cavity not considered) 0.776 0.188 0.242 1.113 1.542 1.930 2.201

Cavity 0.140 – 2.058 – 3.242 – –

The cavity-like space showed the greatest fractal-like index, indicating the maximal sensitivity to solvent size. It was suggested that the catalytic activity of the enzyme was located at the cavity-like space. The tertiary structures of 43 proteins, selected to cover the five structural classes of protein molecule, were analyzed with a geometrical theory called fractal theory, with the intention of devising a new tool for quantitative description of the tertiary structure of protein [43]. A brief introduction to fractal theory was reported. It was demonstrated that the principles dictating the folding of the local backbone structure and the global backbone structure were well characterized in terms of the representation of fractal theory (cf. Figure 2) [44]. Comparison of the fractal character D of protein molecules with that of the ideal Gaussian chain (GC) revealed several characters of the principles (cf. Table 2). The proteins in the structural class of β-type were quantitatively distinguished from other classes with this representation. A curious discovery that several proteins took the fractal dimension greater than two was reported and discussed.
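How a fractal dimension can be read from chain-length measurements may be illustrated with the divider (Richardson) relation, L(ε) ∝ ε^(1–D), where L is the length measured with a step of size ε. Whether references [43,44] used exactly this estimator is an assumption here, and the data below are synthetic, generated only to check the fit. The sketch fits the slope of log L against log ε by least squares and returns D = 1 – slope.

```python
import math


def fractal_dimension(steps, lengths):
    """Estimate D from L(step) ~ step**(1 - D) by a log-log least-squares fit."""
    xs = [math.log(step) for step in steps]
    ys = [math.log(length) for length in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return 1.0 - slope


# Synthetic (hypothetical) measurements that follow the relation with D = 1.34.
steps = [1, 2, 4, 8]
lengths = [100.0 * step ** (1 - 1.34) for step in steps]
d_est = fractal_dimension(steps, lengths)
print(round(d_est, 2))
```

With real backbone coordinates the lengths would come from walking along the Cα trace with increasing step sizes; that part is not shown.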

Figure 2. Measurement of the length of a protein.


Table 2. Fractal dimension of the solvent-accessible surface for some proteins.

Structural class Number of residues D

Only α-helix 136 1.39

Almost exclusively β-sheet 191 1.29

α-Helix and β-sheet tend to be segregated throughout the chain 180 1.34

α-Helix and β-sheet tend to alternate throughout the chain 292 1.34

Neither α-helix nor β-sheet 109 1.40

Mean of the five structural classes 211 1.34

Gaussian chain – 1.50

Fractal hybrid orbital analysis of tertiary structure of protein

The bond angles and fractal dimensions for ideal hybrid orbitals (HOs) were reported (cf. Table 3). The hybridizations of the C atom were reported (cf. Table 4).

Table 3. Bond angles and fractal dimensions for ideal hybrid orbitals [θ: bond angle (º)].

Ideal hybrid orbital n s-ratio Example polymer θ D

sp 1 0.500 –C≡C–C≡ 180.00 1.000

sp2 2 0.333 –CH=CH–CH= 120.00 1.262

sp3 3 0.250 –CH2–CH2–CH2– 109.47 1.413

p ∞ 0.000 – 90.00 2.000

Table 4. Hybridizations of carbon [C–H energy (kJ·mol–1), length (Å), bond angle (º), JC–H (Hz)].

Carbon hybrid s-ratio Example molecule C–H energy C–H length Bond angle pKa JC–H

sp 0.500 CH≡CH 506 1.057 180.00 25 249

sp2 0.333 CH2=CH2 444 1.079 120.00 42 156

sp3 0.250 CH4 423 1.094 109.47 47 125


The concept of fractal was applied to a number of properties of proteins [45]. The structure and shape of the polypeptide chain of proteins were determined by the hybridized states of atomic orbitals (AOs) in the molecular chain. The fractal dimensions, in the range of short distances (1–10 Å) of the tertiary structures of some proteins covering various structural classes of protein molecules, were analyzed and compared with a GC (cf. Table 5). The interpretation was given in terms of steric repulsion. The calculated s-ratios in the spn HOs were computed from the fractal dimensions. The tertiary structures of eight proteins covering four structural classes were analyzed. A mean value of ca. 0.29 predicted sp2.46 HOs, halfway between planar sp2 and tetrahedral sp3 HOs. The proteins in the β-structural class were quantitatively distinguished from other classes with the representation. They showed a higher s-ratio (ca. 0.32), which predicted ca. sp2.1 HOs rather similar to planar sp2 HOs. A comparison of the proteins with a GC was interpreted in terms of steric repulsion.

Table 5. Contribution of s AO in spn HOs from fractal dimension for proteins (Nr: No. of residues).

Secondary structure; Protein; Nr; D; s; n

α-Helix; Haemoglobin (deoxy); 141; 1.40; 0.26; 2.9

α-Helix; Myoglobin (sperm whale, deoxy); 153; 1.42; 0.25; 3.1

β-Sheet; Immunoglobulin; 208; 1.26; 0.33; 2.0

β-Sheet; Trypsin (native, pH 8); 223; 1.30; 0.31; 2.2

α-Helix and β-sheet in separate regions; Lysozyme (hen egg-white); 129; 1.42; 0.25; 3.1

α-Helix and β-sheet in separate regions; Ribonuclease A; 124; 1.33; 0.29; 2.4

α-Helix and β-sheet in alternate regions; Adenylate kinase (porcine muscle); 194; 1.36; 0.28; 2.6

α-Helix and β-sheet in alternate regions; Phosphoglycerate kinase (horse); 408; 1.33; 0.29; 2.4

Mean; –; 198; 1.34; 0.29; 2.5

Gaussian chain; –; –; 1.50; 0.21; 3.8
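The conversion from a fractal dimension D to the s-ratio and the exponent n of an spn HO is not written out in the text. The sketch below assumes the relation D = ln 4 / ln[2(1 – cos θ)] with cos θ = –1/n and s = 1/(1 + n); this relation is an inference, chosen because it reproduces the ideal values of Table 3 (D = 1.000, 1.262, 1.413 and 2.000 for sp, sp2, sp3 and p) and individual rows of Table 5. The function name is hypothetical.

```python
import math


def hybrid_from_fractal_dimension(d):
    """Return (s_ratio, n) of an sp^n hybrid for a fractal dimension d.

    Assumed relation: d = ln 4 / ln[2 (1 + 1/n)], i.e. cos(theta) = -1/n.
    """
    inv_n = 4.0 ** (1.0 / d) / 2.0 - 1.0  # 1/n = -cos(theta)
    if inv_n <= 0.0:
        return 0.0, math.inf  # pure p orbital (d >= 2)
    n = 1.0 / inv_n
    s_ratio = 1.0 / (1.0 + n)
    return s_ratio, n


for label, d in [("Haemoglobin", 1.40), ("Immunoglobulin", 1.26),
                 ("Gaussian chain", 1.50)]:
    s_ratio, n = hybrid_from_fractal_dimension(d)
    print(f"{label}: D = {d:.2f}, s = {s_ratio:.2f}, n = {n:.1f}")
```

For the three inputs shown, the rounded results are s = 0.26, 0.33, 0.21 and n = 2.9, 2.0, 3.8, as in the corresponding rows of Table 5. Mean rows need not follow the relation exactly, because averaging D and averaging n are not equivalent.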


The calculated s-ratios in the spn HOs were also computed for 81 proteins (cf. Table 6) [46]. Iron proteins were compared with the self-avoiding random walk (SAW) model, and two classes were quantitatively distinguished: ferric hemeproteins and iron–sulphur (Fe–S) proteins [47]. A curious discovery that a protein took sp0.5 HOs was discussed [48]. A dependence of fractal HOs on ionic strength was observed [49]. The calculated s-ratios in the spn HOs were also computed for 43 proteins selected to cover the five structural classes of protein molecules [50]. It was demonstrated that the principles dictating the folding of the local and global backbone structures were well characterized in terms of the representation given by fractal theory. Comparison of the fractal character of protein molecules with that of the ideal GC revealed several features of these principles. β-type proteins were distinguished quantitatively from those in other classes. They showed a greater s-ratio (ca. 0.32), corresponding to sp2.20 HOs, rather similar to planar sp2 HOs.

Table 6. Contribution of s AO in the spn HOs from fractal dimension of SAS surface of Fe proteins.

Protein Nr D s-ratio n

Ferric hemeproteins 155 1.56 0.18 4.8

Iron–sulphur proteins (Fe2S2·Cys4) 102 1.30 0.31 2.3

Mean of iron proteins 138 1.47 0.23 4.0

Various proteins 196 1.48 0.22 3.6

Mean of Tables 5 and 6 186 1.46 0.23 4.1

Phylogenesis of avian birds and the 1918 influenza virus

Lysozyme is an enzyme with 129 residues. The amino-acid compositions of certain avian lysozymes were determined. The amino-acid sequence of hen egg-white lysozyme was annotated [51]. Certain discrepancies exist between this sequence and that reported in [52] (at residues 40, 41, 42, 46, 48, 58, 65, 66, 92 and 93). Crystallographic analysis [53] gave results for residues 40, 41, 42, 58, 59, 92 and 93, which are in agreement with the former. The discrepancies at residues 46, 48, 65 and 66 are differences between Asp and Asn; from the electron density maps it could not be determined whether these residues are amide or free acid. The amino-acid sequences of duck, Japanese quail and turkey egg-white lysozymes were determined [54–56]. Amino-acid sequences for human urine and milk lysozymes were also determined.


Comparative studies of sequences for lysozymes of different origins are interesting from the viewpoint of structure–function relationships. Asp-101 of hen lysozyme, which is known to be implicated at the substrate binding site, is replaced by Gly in turkey lysozyme. Trp-62 of hen lysozyme, which also plays an important role in substrate binding, is replaced by Tyr in human lysozyme. The differences between the sequences of the avian species that are compared are expressed as the percentage of different amino acids in lysozyme. The greater the differences, the farther back in time must be the separation between species. Grouping level b can be identified with biological time. The obtained phylogenetic tree is represented by the scheme: (1,…,5) → (1,4,5)(2,3) → (1,5)(2,3)(4) → (1)(2,3)(4)(5) → (1)(2)(3)(4)(5). The scheme is in agreement with data obtained in morphological studies. The optimality criterion SS associated with different proposals for phylogenetic trees allows the equipartition conjecture to be validated or invalidated in phylogenesis. If, in the calculation of the entropy associated with the phylogenetic tree, a species is systematically omitted, the difference between the entropies with and without the species can be considered as a measure of the species entropy. The contributions may be studied with the equipartition conjecture. It is not within the scope of the simulation method to replace biological tests of drugs or field data in palaeontology, but such simulation methods can be useful to assert priorities in detailed experimental research. Available experimental and field data should be examined by different classification algorithms to reveal possible features of real biological significance [57]. The scheme (cf. Figure 3) is in agreement with data obtained in morphological studies, and with the method based on entropy production and the conjecture of equipartition of production of entropy. Each band of parallel edges indicates a split.
The distance between any two taxa x and y corresponds to the sum of weights of all splits that separate x and y.

Figure 3. Dendrogram (binary tree) for distances as different amino acids in lysozyme.

An arsenal of effective medicines, and of others in the development phase, is available. Research in viral genomes expedites progress [58–61]. In the search for keys to the origin of the 1918 virus hemagglutinin (HA), the gene sequences of HA subtype H1 of several strains of influenza virus were analyzed [62–64]. Its phylogenetic tree was built [65,66]. The samples of the 1918 strain are inscribed in the family of influenza viruses adapted to man (cf. Figure 4). The distance between the 1918 gene H1 and the known avian family indicates that it originated in a strain of avian influenza virus, although it evolved in an unidentified host before emerging in 1918.

Figure 4. Familiar dendrogram of influenza.

Biomacromolecular structural data are maintained by the Protein Data Bank (PDB) [67]. Classroom applications were described [68–71]. Three-dimensional structures can be displayed by program RasMol [72]. Program WPDB compresses PDB structure files into a set of indexed files [73,74]. Program BABEL converts molecular modelling file formats [75]. Database system RELIBASE+ analyzes protein–ligand structures in PDB [76]. Classroom applications of WPDB were described [77]. The successor of RasMol and Chime [78] is Jmol [79–86].

Molecular surfaces/hydrophobicity of amino-acid residues in proteins

Hydrophobicity was a useful concept to rationalize the role played by amino-acid residues, in terms of buried or exposed conformations, with regard to the aqueous environment in proteins [87]. The relationship of this concept with distinct approaches to represent the molecular surface was analyzed by computing reliable surface areas for three definitions, viz. the van der Waals, SAS and solvent-excluded molecular surfaces. The surface areas were obtained for all the naturally occurring amino acids by first setting a proper reference standard state and then calculating the values for a database of proteins containing a total of 4 297 residues. Despite the great differences in the molecular surfaces, proper indices were defined for handling the information provided by the surfaces that is of interest to study the hydrophobic behaviour of amino acids.

Volumetric studies of L-valine and L-leucine in aqueous solutions of NaBr

Densities (ρ) of the amino acids L-valine and L-leucine in aqueous solutions of sodium bromide (NaBr, 0.05–1.0 mol·kg–1) were measured at 298.15 K [88]. From the densities, apparent molar volumes (Vφ) and partial molar volumes of the amino acids at infinite dilution (Vφº) were evaluated. The data were combined with the known values of Vφº of glycine and L-alanine in aqueous NaBr solutions at 298.15 K in order to determine the group contributions. The trends of transfer volumes (ΔVφº) were interpreted in terms of solute–cosolute interactions on the basis of a cosphere overlap model. Pair and triplet interaction coefficients were also calculated from transfer parameters.

Analysis/predictions of solvent effects on protein ionization constants

The results of protein pKa calculations were routinely being analyzed to understand the pH-dependence of protein characteristics, e.g., stability and catalysis [89]. Systems of functionally important titratable groups were identified from protein pKa calculations, but the rationalization of the behaviour of the systems was inherently problematic due to a lack of theoretical tools and methods. A number of methods for analyzing the results of protein pKa calculations were reported, which were embedded in a graphical user interface (pKaTool). Methods for assessing the reliability of protein pKa calculations and for analysing the roles of individual residues in determining active-site pKa values and the pH-dependence of protein stability were reported. The published methods were freely available to academic researchers.

Predicting how aqueous solvent modulated the conformational transitions and influenced the pKa values that regulated the biological functions of biomolecules remained an unsolved challenge [90]. To address the problem, FDPB_MF was developed, a rotamer repacking method that exhaustively sampled side-chain conformational space and rigorously calculated multibody protein–solvent interactions. FDPB_MF predicted the effects on pKa values of various solvent exposures, large ionic strength variations, strong energetic couplings, structural reorganizations and sequence mutations.
FDPB_MF achieved high accuracy, with root-mean-square deviations within 0.3 pH unit of the experimental values measured for turkey ovomucoid third domain, hen lysozyme, Bacillus circulans xylanase, as well as human and Escherichia coli thioredoxins. FDPB_MF provided a faithful, quantitative assessment of electrostatic interactions in biological macromolecules.

A simple method for protein structural classification

Since the concept of structural classes of proteins was proposed, the problem of protein classification was tackled by many groups [91]. Most of the classification criteria were based only on the helix/strand contents of proteins. A method for protein structural classification based on the secondary-structure sequences was reported. It was a classification scheme that confirmed existing classifications. A mathematical model was constructed to describe protein secondary-structure sequences, in which each protein secondary-structure sequence corresponded to a transition probability matrix that characterized and differentiated protein structure numerically. Its application to a set of real data indicated that the method classified protein structures correctly. The final classification result was shown schematically, so that the structural classifications could be inspected visually, unlike with traditional methods.

Use of amino-acid composition to predict ligand-binding sites

A novel method for predicting the binding sites for druglike compounds on the surface of proteins was developed on the basis of the specific amino-acid composition observed at the ligand-binding sites of ligand–protein complexes determined by X-ray analysis [92]. A profile, representing the preference of each of the 20 standard amino acids at the binding sites of druglike molecules, was obtained for a small set of high-quality complex structures. An index termed propensity for ligand binding (PLB) was created from the profiles. The PLB index was used to predict the propensity of binding for 804 ligands, at all potential binding sites on the proteins whose structures were determined by X-ray analysis. If the sites with the two highest PLB indices were taken into consideration, 86% of the sites were successfully predicted. The PLB prediction was relatively simple, but the validation study showed that it was both fast and accurate in detecting ligand-binding sites, especially the binding sites of druglike molecules. The PLB index was used to predict the ligand-binding sites of uncharacterized protein structures and also to identify novel drug-binding sites of known drug targets. Many biochemical processes and phenomena involving amino acids and proteins are stereospecific; e.g., L- and D-enantiomers of amino acids have different tastes [93,94].

Computation results and discussion

The molecular CT indices Gk, Jk, GkV and JkV (with k < 6) are reported in Table 7 for 21 amino acids. Hydroxyproline (4-hydroxypyrrolidine-2-carboxylic acid, Hyp) differs from proline (Pro) by the presence of a hydroxyl (–OH) group attached to the Cγ atom. The Gk indices contain both CT and size effects, e.g., Gk(Pro) < Gk(Hyp). The size effect is eliminated in the


Table 7. Values of the Gk and Jk charge-transfer indices up to fifth order for 21 amino acids (AA).

AA N G1 G2 G3 G4 G5

Ala 6 2.5000 2.6667 0.5000 0.0000 0.0000

Arg 12 4.2500 6.2222 1.2500 0.4622 0.2639

Asn 9 4.0000 4.7778 0.9375 0.3689 0.1250

Asp 9 4.0000 4.7778 0.9375 0.3689 0.1250

Cys 7 2.5000 2.8889 0.6875 0.1111 0.0000

Gln 10 4.0000 5.0000 1.0000 0.3422 0.2708

Glu 10 4.0000 5.0000 1.0000 0.3422 0.2708

Gly 5 2.0000 2.3333 0.2500 0.0000 0.0000

His 11 3.7500 7.5000 1.2569 0.5211 0.3472

Hyp 9 3.0000 2.8889 0.8750 0.3422 0.0625

Ile 9 3.0000 3.3333 1.0000 0.3422 0.0625

Leu 9 3.5000 2.8889 0.9375 0.3511 0.1250

Lys 10 2.5000 2.8889 0.8125 0.3111 0.2014

Met 9 2.5000 2.8889 0.8125 0.3111 0.1458

Phe 12 3.2500 9.2222 1.2500 0.6844 0.4306

Pro 8 2.0000 2.6667 0.6528 0.2222 0.0000

Ser 7 2.5000 2.8889 0.6875 0.1111 0.0000

Thr 8 3.0000 3.1111 0.8750 0.2222 0.0000

Trp 15 4.2500 13.3889 2.0278 0.9989 0.6425

Tyr 13 4.5000 10.5556 1.6250 0.9867 0.4861

Val 8 3.0000 3.1111 0.8750 0.2222 0.0000

AA J1 J2 J3 J4 J5

Ala 0.5000 0.5333 0.1000 0.0000 0.0000

Arg 0.3864 0.5657 0.1136 0.0420 0.0240

Asn 0.5000 0.5972 0.1172 0.0461 0.0156

Asp 0.5000 0.5972 0.1172 0.0461 0.0156

Cys 0.4167 0.4815 0.1146 0.0185 0.0000

Gln 0.4444 0.5556 0.1111 0.0380 0.0301

Glu 0.4444 0.5556 0.1111 0.0380 0.0301

Gly 0.5000 0.5833 0.0625 0.0000 0.0000

His 0.3750 0.7500 0.1257 0.0521 0.0347



Hyp 0.3750 0.3611 0.1094 0.0428 0.0078

Ile 0.3750 0.4167 0.1250 0.0428 0.0078

Leu 0.4375 0.3611 0.1172 0.0439 0.0156

Lys 0.2778 0.3210 0.0903 0.0346 0.0224

Met 0.3125 0.3611 0.1016 0.0389 0.0182

Phe 0.2955 0.8384 0.1136 0.0622 0.0391

Pro 0.2857 0.3810 0.0933 0.0317 0.0000

Ser 0.4167 0.4815 0.1146 0.0185 0.0000

Thr 0.4286 0.4444 0.1250 0.0317 0.0000

Trp 0.3036 0.9563 0.1448 0.0713 0.0459

Tyr 0.3750 0.8796 0.1354 0.0822 0.0405

Val 0.4286 0.4444 0.1250 0.0317 0.0000

AA G1V G2V G3V G4V G5V

Ala 4.5000 3.3222 1.2333 0.0000 0.0000

Arg 7.1500 6.9306 1.9722 0.6922 0.3497

Asn 6.8000 6.0889 1.5458 0.5058 0.2130

Asp 7.9000 6.0889 1.6625 0.5408 0.1250

Cys 4.5000 3.3222 1.4181 0.3861 0.0000

Gln 6.8000 5.7611 1.8472 0.5147 0.2062

Glu 7.9000 6.3111 1.9694 0.5835 0.2942

Gly 4.5000 2.9361 0.4944 0.0000 0.0000

His 8.6500 7.6583 1.8625 0.4696 0.3174

Hyp 7.8000 4.3583 1.1556 0.4597 0.0625

Ile 5.0000 3.5444 1.6028 0.8810 0.2385

Leu 5.5000 3.3222 1.4236 0.6036 0.4770

Lys 5.1000 3.3750 1.4153 0.4836 0.2663

Met 4.5000 3.3222 1.4181 0.4949 0.3103

Phe 5.2500 9.6556 1.7361 0.5642 0.3519

Pro 5.6000 3.7028 1.1361 0.5222 0.0000



Ser 6.2000 3.4278 1.2875 0.1111 0.0000

Thr 6.2000 3.9778 1.4722 0.4972 0.0000

Trp 6.9500 13.1028 2.5111 0.8436 0.4239

Tyr 7.2000 10.5444 1.8611 0.7289 0.4399

Val 5.0000 3.3222 1.6028 0.7722 0.0000

AA J1V J2V J3V J4V J5V

Ala 0.9000 0.6644 0.2467 0.0000 0.0000

Arg 0.6500 0.6301 0.1793 0.0629 0.0318

Asn 0.8500 0.7611 0.1932 0.0632 0.0266

Asp 0.9875 0.7611 0.2078 0.0676 0.0156

Cys 0.7500 0.5537 0.2363 0.0644 0.0000

Gln 0.7556 0.6401 0.2052 0.0572 0.0229

Glu 0.8778 0.7012 0.2188 0.0648 0.0327

Gly 1.1250 0.7340 0.1236 0.0000 0.0000

His 0.8650 0.7658 0.1863 0.0470 0.0317

Hyp 0.9750 0.5448 0.1444 0.0575 0.0078

Ile 0.6250 0.4431 0.2003 0.1101 0.0298

Leu 0.6875 0.4153 0.1780 0.0755 0.0596

Lys 0.5667 0.3750 0.1573 0.0537 0.0296

Met 0.5625 0.4153 0.1773 0.0619 0.0388

Phe 0.4773 0.8778 0.1578 0.0513 0.0320

Pro 0.8000 0.5290 0.1623 0.0746 0.0000

Ser 1.0333 0.5713 0.2146 0.0185 0.0000

Thr 0.8857 0.5683 0.2103 0.0710 0.0000

Trp 0.4964 0.9359 0.1794 0.0603 0.0303

Tyr 0.6000 0.8787 0.1551 0.0607 0.0367

Val 0.7143 0.4746 0.2290 0.1103 0.0000


Jk, e.g., J2(Pro) > J2(Hyp). The effect of heteroatoms is included in both GkV and JkV, e.g., G4V(Pro) > G4V(Hyp).

The calculated and experimental isoelectric points pI for the 21 amino acids are listed in Table 8. For the database Gk,Jk the best linear model turns out to be:

$$\mathrm{pI} = 12.0 - 12.7 J_1 - 22.5 J_4 \tag{4}$$

n = 21 r = 0.478 s = 1.598 F = 2.7 MAPE = 17.73% AEV = 0.7718

where the mean absolute percentage error (MAPE) is 17.73% and the approximation error variance (AEV) is 0.7718. The inclusion of N improves the correlation:

$$\mathrm{pI} = 7.13 + 0.751 N - 7.99 J_1 - 15.7 J_3 - 81.7 J_4 \tag{5}$$

n = 21 r = 0.629 s = 1.499 F = 2.6 MAPE = 16.95% AEV = 0.6065

and AEV decreases by 21%. However, the model is limited to small N because N increases with both nA and nB, which makes it inadequate for polypeptides and proteins. The database Gk,Jk,GkV,JkV improves the model:

$$\mathrm{pI} = 16.9 + 0.421 G_2 - 22.2 J_4 - 2.62 J_1^{V} - 9.53 J_2^{V} - 13.9 J_3^{V} - 23.1 J_4^{V} \tag{6}$$

n = 21 r = 0.699 s = 1.475 F = 2.2 MAPE = 14.73% AEV = 0.5465

and AEV decreases by 29%. The inclusion of N improves the correlation:

$$\begin{aligned}\mathrm{pI} = {} & -11.1 + 3.35 N + 5.56 G_3 - 18.2 G_5 + 3.39 J_2 - 66.6 J_4 - 3.31 G_2^{V} + 0.955 G_4^{V} \\ & - 8.12 J_1^{V} + 24.7 J_2^{V} - 22.3 J_3^{V} - 50.7 J_4^{V} - 14.0 J_5^{V}\end{aligned} \tag{7}$$

n = 21 r = 0.958 s = 0.781 F = 7.5 MAPE = 8.25% AEV = 0.1754

and AEV decreases by 77%. However, the model is inadequate for proteins because N, G3, G5, G2V and G4V increase with nA and nB. The use of (1+Δn/nT) = 0.5 for Arg, 4/3 for Asp and Glu, 2/3 for His and Lys, as well as one for all others improves the fit:

$$\mathrm{pI} = 14.8 - 9.01 \left( 1 + \frac{\Delta n}{n_T} \right) \tag{8}$$

n = 21 r = 0.965 s = 0.462 F = 259.8 MAPE = 5.29% AEV = 0.0682


Table 8. Calculated and experimental values of pH at isoelectric point pI for 21 amino acids (AA).

AA pI (Eq. 8) pI (Eq. 12) Experiment

Ala 5.80 5.76 6.00

Arg 10.31 10.33 10.76

Asn 5.80 5.66 5.41

Asp 2.79 2.65 2.77

Cys 5.80 5.79 5.07

Gln 5.80 5.85 5.65

Glu 2.79 2.86 3.22

Gly 5.80 5.90 5.97

His 8.80 8.38 7.59

Hyp 5.80 5.81 5.80

Ile 5.80 5.96 6.02

Leu 5.80 5.99 5.98

Lys 8.80 9.28 9.74

Met 5.80 5.98 5.74

Phe 5.80 5.86 5.48

Pro 5.80 6.07 6.30

Ser 5.80 5.56 5.68

Thr 5.80 5.65 5.60

Trp 5.80 5.57 5.89

Tyr 5.80 5.58 5.66

Val 5.80 5.81 5.96

and AEV decreases by 91%. The correlation coefficient amounts to 96.8% of that of the correlation of the means (n = 4, r = 0.997). The pI values calculated with Equation (8) for the 21 amino acids are also included in Table 8. For Equation (8) the absolute relative error is ca. 5%. The calculated (Equation 8) and experimental pI values for the 21 amino acids are shown in Figure 5. For Equation (8) the two amino acids farthest from the experimental value are His and Lys, with an absolute error of ca. 1.1 units. The variation of pI as a function of (1+Δn/nT) for the 21 amino acids (cf. Figure 6) shows that some amino acids appear superposed.
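Equation (8) is simple enough to check directly against the experimental column of Table 8. The sketch below uses the rounded coefficients printed in Equation (8) and the values of (1+Δn/nT) quoted in the text, so small differences from the tabulated predictions (which presumably come from unrounded coefficients) are expected. The function and variable names are hypothetical.

```python
# Experimental pI values copied from Table 8.
EXPERIMENTAL_PI = {
    "Ala": 6.00, "Arg": 10.76, "Asn": 5.41, "Asp": 2.77, "Cys": 5.07,
    "Gln": 5.65, "Glu": 3.22, "Gly": 5.97, "His": 7.59, "Hyp": 5.80,
    "Ile": 6.02, "Leu": 5.98, "Lys": 9.74, "Met": 5.74, "Phe": 5.48,
    "Pro": 6.30, "Ser": 5.68, "Thr": 5.60, "Trp": 5.89, "Tyr": 5.66,
    "Val": 5.96,
}

# Values of (1 + dn/nT) quoted in the text; every other amino acid takes 1.
FRACTIONAL_INDEX = {"Arg": 0.5, "Asp": 4 / 3, "Glu": 4 / 3,
                    "His": 2 / 3, "Lys": 2 / 3}


def pi_equation_8(x):
    """pI from Equation (8), with x = 1 + dn/nT (rounded coefficients)."""
    return 14.8 - 9.01 * x


predicted = {aa: pi_equation_8(FRACTIONAL_INDEX.get(aa, 1.0))
             for aa in EXPERIMENTAL_PI}
abs_error = {aa: abs(predicted[aa] - EXPERIMENTAL_PI[aa])
             for aa in EXPERIMENTAL_PI}

# Mean absolute percentage error over the 21 amino acids.
mape = 100.0 * sum(abs_error[aa] / EXPERIMENTAL_PI[aa]
                   for aa in EXPERIMENTAL_PI) / len(EXPERIMENTAL_PI)
# The two amino acids with the largest absolute error.
worst_two = sorted(abs_error, key=abs_error.get)[-2:]

print(f"MAPE = {mape:.2f}%")
print("largest errors:", worst_two)
```

The MAPE obtained this way is close to the 5.29% reported with Equation (8), and the two largest deviations are those of His and Lys, as stated above.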


Figure 5. Isoelectric points pI for the 21 amino acids with Equations (8/12). Line represents bisector. [Axes: pI fitted vs. pI experimental; series: Eq. 8 and Eq. 12.]

Figure 6. Variation of pI vs. (1+Δn/nT) for 21 amino acids, lysozyme and its fragments. The fitting line corresponds to the amino acids. [Axes: pI vs. 1+Δn/nT; series: amino acids and lysozyme fragments; labelled points: Arg, Lys, His, Glu, Asp, A, Lysozyme; superposed points: B=D=E=BSC and Ala=Asn=Cys=Gln=Gly=Hyp=Ile=Leu=Met=Phe=Pro=Ser=Thr=Trp=Tyr=Val.]


The fitting line corresponds to the 21 amino acids; the two amino acids farthest from it are His and Lys (nB = 2). The inclusion of Gk,Jk improves the fit:

$$\mathrm{pI} = 15.3 - 8.99 \left( 1 + \frac{\Delta n}{n_T} \right) - 1.00 J_2 \tag{9}$$

n = 21 r = 0.971 s = 0.435 F = 147.9 MAPE = 5.10% AEV = 0.0450

and AEV decreases by 93%. The inclusion of JkV improves the fit:

$$\mathrm{pI} = 16.8 - 8.59 \left( 1 + \frac{\Delta n}{n_T} \right) - 0.958 J_2 - 8.98 J_3 - 1.16 J_1^{V} \tag{10}$$

n = 21 r = 0.977 s = 0.407 F = 85.7 MAPE = 4.70% AEV = 0.0450

AEV decreases by 94% and the model allows studying polypeptides, proteins and protein fragments. The inclusion of GkV improves the fit:

$$\mathrm{pI} = 16.0 - 8.94 \left( 1 + \frac{\Delta n}{n_T} \right) - 0.828 J_2 - 9.77 J_3 + 0.619 G_4^{V} \tag{11}$$

n = 21 r = 0.981 s = 0.378 F = 100.1 MAPE = 4.12% AEV = 0.0425

and AEV decreases by 94%. However, the model is inadequate for proteins because G4V increases with nA and nB. No superposition of the corresponding Gk–Jk or GkV–JkV pairs is observed in Equations (4–6,8–11), which decreases the risk of collinearity in the fits, given the close relationship between each pair Gk–Jk in Equation (2) [95,96]. The simultaneous inclusion of GkV,JkV improves the fit:

$$\mathrm{pI} = 16.6 - 8.71 \left( 1 + \frac{\Delta n}{n_T} \right) - 0.787 J_2 - 9.52 J_3 + 0.485 G_4^{V} - 0.801 J_1^{V} - 2.73 J_4^{V} \tag{12}$$

n = 21 r = 0.989 s = 0.306 F = 103.9 MAPE = 4.20% AEV = 0.0380

and AEV decreases by 95%. However, the model is inadequate for proteins because G4V increases with nA and nB. The pI values calculated with Equation (12) for the 21 amino acids are also included in Table 8. For Equation (12) the absolute relative error decreases to 4%. The calculated (Equation 12) and experimental pI values for the 21 amino acids are displayed in Figure 5. For Equation (12) the error is reduced for most amino acids; in particular for His and Lys the error decreases to 0.6 units. The molecular CT indices are collected in Table 9 for lysozyme, five fragments of its tertiary structure and its binding site. In general, the CT indices


Table 9. Values of Gk and Jk charge-transfer indices up to fifth order for lysozyme and its fragments.

Fragment N G1 G2 G3 G4 G5

α-Helix A 87 28.5000 39.3889 10.9097 6.4083 4.5069

α-Helix B 83 26.2500 45.5000 11.5417 6.5189 4.7258

3₁₀-Helix C 39 13.2500 14.2222 5.2500 3.2222 2.1181

α-Helix D 61 20.2500 24.6667 8.3750 5.1022 3.6111

β-Sheet E 104 39.0000 55.5556 13.8750 8.4000 6.4167

Binding site S 105 30.2500 60.1111 13.5833 5.0556 2.5906

Lysozyme 1001 349.2500 541.7222 137.0833 85.9639 64.8828

Fragment J1 J2 J3 J4 J5

α-Helix A 0.3314 0.4580 0.1269 0.0745 0.0524

α-Helix B 0.3201 0.5549 0.1408 0.0795 0.0576

3₁₀-Helix C 0.3487 0.3743 0.1382 0.0848 0.0557

α-Helix D 0.3375 0.4111 0.1396 0.0850 0.0602

β-Sheet E 0.3786 0.5394 0.1347 0.0816 0.0623

Binding site S 0.2909 0.5780 0.1306 0.0486 0.0249

Lysozyme 0.3493 0.5417 0.1371 0.0860 0.0649

Fragment G1V G2V G3V G4V G5V

α-Helix A 56.5000 50.9778 19.8069 11.3040 7.0642

α-Helix B 51.3500 56.1472 19.0750 10.6837 6.8304

3₁₀-Helix C 27.9500 19.9000 9.2583 5.2535 3.2994

α-Helix D 41.5500 34.1083 14.6944 8.9512 5.1353

β-Sheet E 79.6000 74.1861 22.9611 13.2918 8.6063

Binding site S 59.5500 70.9806 19.1278 7.0843 3.1818

Lysozyme 683.4500 691.8222 230.6111 139.8265 91.6138

Fragment J1V J2V J3V J4V J5V

α-Helix A 0.6570 0.5928 0.2303 0.1314 0.0821

α-Helix B 0.6262 0.6847 0.2326 0.1303 0.0833

3₁₀-Helix C 0.7355 0.5237 0.2436 0.1382 0.0868

α-Helix D 0.6925 0.5685 0.2449 0.1492 0.0856

β-Sheet E 0.7728 0.7203 0.2229 0.1290 0.0836

Binding site S 0.5726 0.6825 0.1839 0.0681 0.0306

Lysozyme 0.6835 0.6918 0.2306 0.1398 0.0916


Table 10. Values of the pH at the isoelectric point, pI for lysozyme fragments not included in the fit.

Fragment Residues pI Experiment

α-Helix A; 5–15; 12.95; –

α-Helix B; 24–34; 10.89; –

3₁₀-Helix C; 80–85; 10.31; –

α-Helix D; 88–96; 11.02; –

Total α-helix; 5–15,24–34,88–96; 11.62; –

Total helix; 5–15,24–34,80–85,88–96; 11.29; –

β-Sheet E; 41–54; 10.87; –

Total helix+sheet; 5–15,24–34,41–54,80–85,88–96; 11.21; –

Binding site S; 34,35,37,44,57,59,62,63,101,107,114; 11.00; –

Total helix+sheet+BS; 5–15,24–35,37,41–54,57,59,62,63,80–85,88–96,101,107,114; 11.17; –

Lysozyme; 1–129; 11.49; 11.35

do not distinguish α-helices, 3₁₀-helix, β-sheet and binding site. In particular, both the Jk and JkV indices for the whole molecule are similar to those for the α-helices and, especially, for α-helix D. The pI values for lysozyme and its fragments, not included in the fit, are calculated by a modification of Equation (8):

$$\mathrm{pI} = 14.8 - 9.01\, \frac{M + \Delta n}{n_T} \tag{13}$$

where M is the number of amino-acid residues in the protein or fragment. The choice seems sensible as pI values are strongly dependent on the type of side-chain functional groups. The calculated and experimental pI values for lysozyme and its fragments, not included in the fit, are reported in Table 10. The calculation result for α-helix A (M = 11 residues) is an estimate for that of the whole lysozyme (M = 129 residues) with a relative error of 13%. Furthermore, the inclusion of the other two α-helices (A+B+D, M = 31 residues) reduces the error to 1%. The variation of pI for lysozyme (experiment) and its fragments (calculation) as a function of (M+Δn)/nT (Figure 6) shows that some fragments appear superposed. Both lysozyme and its fragments lie on the fitting line obtained for the amino acids.
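The relative errors quoted for the fragment estimates can be retraced from Table 10. The text does not say which whole-protein value serves as reference; the sketch below assumes it is the calculated pI of the whole lysozyme (11.49), which reproduces the quoted 13% and 1%, and also prints the errors against the experimental value (11.35) for comparison. The names are hypothetical.

```python
# Calculated pI values copied from Table 10.
PI_HELIX_A = 12.95        # alpha-helix A, residues 5-15
PI_TOTAL_ALPHA = 11.62    # alpha-helices A+B+D
PI_LYSOZYME_CALC = 11.49  # whole protein, calculated
PI_LYSOZYME_EXP = 11.35   # whole protein, experimental


def relative_error_percent(estimate, reference):
    """Absolute relative error of an estimate, in percent."""
    return 100.0 * abs(estimate - reference) / reference


for label, value in [("alpha-helix A", PI_HELIX_A),
                     ("alpha-helices A+B+D", PI_TOTAL_ALPHA)]:
    vs_calc = relative_error_percent(value, PI_LYSOZYME_CALC)
    vs_exp = relative_error_percent(value, PI_LYSOZYME_EXP)
    print(f"{label}: {vs_calc:.1f}% vs. calculated, {vs_exp:.1f}% vs. experiment")
```

Against the calculated reference the errors are about 12.7% and 1.1%; against the experimental value they are somewhat larger, about 14.1% and 2.4%.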


Perspectives
From the present results and discussion the following perspectives can be drawn.

1. The detailed comparison of the sequences (primary structures) of the enzyme lysozyme has allowed the reconstruction of a molecular phylogenetic tree for birds. Single- and complex-linkage clustering yield a binary taxonomy that separates the birds according to the following scheme of successive branchings: (1,…,5) → (1,4,5)(2,3) → (1,5)(2,3)(4) → (1)(2,3)(4)(5) → (1)(2)(3)(4)(5). Genetic analyses have also been applied in judicial investigations [97,98].

2. The inclusion of heteroatoms in the π-electron system was beneficial for the description of the isoelectric point, owing to either the role of the additional p orbitals provided by the heteroatom or the role of steric factors in the π-electron conjugation.

3. The use of only charge-transfer and valence charge-transfer indices $G_k$, $J_k$, $G_k^V$, $J_k^V$ gave limited results for modelling the isoelectric point of amino acids. However, the inclusion of (1+Δn/nT) improved all the models. The effect is especially noticeable for those amino acids with more than two functional groups, viz. Arg, Asp, Glu, and, especially, His and Lys. Moreover, the fractional index casts some light on the importance of the side-chain functional groups in the pI simulations of functional-rich molecules. The satisfactory modelling of the pI of 21 amino acids with the aid of a fractional index, based mainly on the Δn index, shows how to bypass the problem of deriving and working with an extended set of charge-transfer indices (here, m = 20) since, in this case, a good description can be obtained with only one index.

4. The fitting line obtained for the 21 amino acids can be used to estimate the isoelectric point of lysozyme and its fragments, by only replacing (1+Δn/nT) with (M+Δn)/nT.

5. For lysozyme, the results of smaller fragments can estimate that of the whole protein with 1–13% errors. An extension of the present study to other enzymes and proteins would give an insight into a possible generality of these conclusions, because most globular, water-soluble proteins are ionic, e.g., lysozyme (charge +8.0e) and bovine serum albumin (anionic) at pH 7.0. The present study may be also of interest in charge-migration peptide studies.

6. Work is in progress on the further elucidation of the value of Δn in the fractional indices for a better definition of indices, which are highly dependent on side-chain functional groups.


Acknowledgements
The authors acknowledge financial support from the Spanish MEC (Project Nos. CTQ2004-07768-C02-01/BQU and CCT005-07-00365) and EU (Program FEDER).

References
1. Pogliani, L. 1992, J. Pharm. Sci., 81, 334.
2. Pogliani, L. 1992, J. Pharm. Sci., 81, 967.
3. Pogliani, L. 1996, J. Phys. Chem., 100, 18065.
4. Pogliani, L. 1997, Med. Chem. Res., 7, 380.
5. Pogliani, L. 1999, J. Chem. Inf. Comput. Sci., 39, 104.
6. Pogliani, L. 2000, Chem. Rev., 100, 3827.
7. Stadtman, E. R. 1993, Annu. Rev. Biochem., 62, 797.
8. Berlett, B. S., and Stadtman, E. R. 1997, J. Biol. Chem., 272, 20313.
9. Chu, I. K., Rodriquez, C. F., Lau, T.-C., Hopkinson, A. C., and Siu, K. W. M. 2000, J. Phys. Chem. B, 104, 3393.
10. Simon, S., Gil, A., Sodupe, M., and Bertrán, J. 2005, J. Mol. Struct. (Theochem), 727, 191.
11. Gil, A., Bertrán, J., and Sodupe, M. 2006, J. Chem. Phys., 124, 154306.
12. Torrens, F. 2003, Comb. Chem. High Throughput Screen., 6, 801.
13. Torrens, F. 2001, J. Comput.-Aided Mol. Design, 15, 709.
14. Torrens, F. 2003, J. Mol. Struct. (Theochem), 621, 37.
15. Torrens, F. 2004, Mol. Diversity, 8, 365.
16. Torrens, F. 2005, Molecules, 10, 334.
17. Torrens, F. 2003, Molecules, 8, 169.
18. Torrens, F. 2004, Molecules, 9, 1222.
19. Torrens, F., and Castellano, G. 2007, Synthetic Organic Chemistry XI, J. A. Seijas and M. P. Vázquez-Tato (Eds.), MDPI, Basel (Switzerland), 1.
20. Randić, M. 1975, J. Am. Chem. Soc., 97, 6609.
21. Hosoya, H. 1971, Bull. Chem. Soc. Jpn., 44, 2332.
22. Gálvez, J., García, R., Salabert, M. T., and Soler, R. 1984, J. Chem. Inf. Comput. Sci., 34, 520.
23. Kier, L. B., and Hall, L. H. 1976, J. Pharm. Sci., 65, 1806.
24. Torrens, F., and Castellano, G. 2007, Biomedical Data and Applications, A. S. Sidhu, T. S. Dillon, and E. Chang (Eds.), Springer, Berlin, in press.
25. Bergers, J. J., Vingerhoeds, M. H., van Bloois, L., Herron, J. N., Janssen, L. H. M., Fischer, M. J. E., and Crommelin, D. J. A. 1993, Biochemistry, 32, 4641.
26. Torrens, F., Ortí, E., and Sánchez-Marín, J. 1991, High Performance Computing II, M. Durand, and F. El Dabaghy (Eds.), Elsevier, Amsterdam, 549.
27. Torrens, F., Ortí, E., and Sánchez-Marín, J. 1991, J. Chim. Phys. Phys.-Chim. Biol., 88, 2435.


28. Torrens, F., Ortí, E., and Sánchez-Marín, J. 1991, Advances in Biomolecular Simulations, R. Lavery, J.-L. Rivail, and J. Smith (Eds.), AIP Conference Proceedings Series No. 239, American Institute of Physics, New York, 1991, 118.

29. Torrens, F., Sánchez-Marín, J., and Nebot-Gil, I. 1998, J. Mol. Graphics. Model., 16, 57.

30. Torrens, F. 2001, Int. J. Mol. Sci., 2, 72.
31. Torrens, F., Sánchez-Marín, J., and Nebot-Gil, I. 2001, J. Comput. Chem., 22, 477.
32. Pascual-Ahuir, J. L., Silla, E., Tomasi, J., and Bonaccorsi, R. 1987, J. Comput. Chem., 8, 778.
33. Pascual-Ahuir, J. L., and Silla, E. 1990, J. Comput. Chem., 11, 1047.
34. Silla, E., Villar, F., Nilsson, O., Pascual-Ahuir, J. L., and Tapia, O. 1990, J. Mol. Graphics, 8, 168.
35. Silla, E., Tuñón, I., and Pascual-Ahuir, J. L. 1991, J. Comput. Chem., 12, 1077.
36. Tuñón, I., Silla, E., and Pascual-Ahuir, J. L. 1992, Protein Eng., 5, 715.
37. Tuñón, I., Silla, E., and Pascual-Ahuir, J. L. 1993, Chem. Phys. Lett., 203, 289.
38. Pascual-Ahuir, J. L., Silla, E., and Tuñón, I. 1994, J. Comput. Chem., 15, 1127.
39. Tuñón, I., Silla, E., and Pascual-Ahuir, J. L. 1994, J. Phys. Chem., 98, 377.
40. Scharlin, P., Battino, R., Silla, E., Tuñón, I., and Pascual-Ahuir, J. L. 1998, Pure Appl. Chem., 70, 1895.
41. Pascual-Ahuir, J. L., Silla, E., and Tuñón, I. 1998, J. Mol. Struct. (Theochem), 426, 331.
42. Terryn, B., and Barriol, J. 1981, J. Chim. Phys. Phys.-Chim. Biol., 78, 207.
43. Isogai, Y., and Itoh, T. 1984, J. Phys. Soc. Jpn., 53, 2162.
44. Torrens, F. 2000, Encuentros en la Biología, 8(64), 4.
45. Torrens, F., Sánchez-Marín, J., and Nebot-Gil, I. 1998, Information Technology Applications in Biomedicine, S. Laxminarayan (Ed.), IEEE, Washington, 1.
46. Torrens, F. 2000, Zh. Fiz. Khim., 74, 125.
47. Torrens, F. 2000, Russ. J. Phys. Chem. (Engl. Transl.), 74, 115.
48. Torrens, F. 2001, Complexity Int., 8, torren01.
49. Torrens, F. 2001, Synthetic Organic Chemistry V, O. Kappe, P. Merino, A. Marzinzik, H. Wennemers, T. Wirth, J.-J. vanden Eynde and S.-K. Lin (Eds.), MDPI, Basel, 1.
50. Torrens, F. 2002, Molecules, 7, 26.
51. Canfield, R. E. 1963, J. Biol. Chem., 238, 2698.
52. Jollès, J., Hermann, J., Niemann, B., and Jollès, P. 1967, Eur. J. Biochem., 1, 344.
53. Blake, C. C. F., Mair, G. A., North, A. C. T., Phillips, D. C., and Sarma, V. R. 1967, Proc. R. Soc. London, Ser. B, 167, 365.
54. Hermann, J., and Jollès, J. 1970, Biochim. Biophys. Acta, 200, 178.
55. Kaneda, M., Kato, T., Tominaga, N., Chitani, K., and Narita, K. 1969, J. Biochem. (Tokyo), 66, 747.
56. LaRue, J. N., and Speck, J. C., Jr. 1969, Fed. Proc., 28, 662.
57. Torrens, F. 2000, Encuentros en la Biología, 8(60), 3.
58. Jones, P. S. 1998, Antivir. Chem. Chemother., 9, 283.
59. Ellis, R. W. 1999, Vaccine, 17, 1596.
60. Root, M. J., Kay, M. S., and Kim, P. S. 2001, Science, 291, 884.


61. Bernstein, J. M. 2000, Antiviral Chemotherapy: General Overview, Wright State University School of Medicine, Dayton (OH).

62. Crosby, A. W. 2003, America’s Forgotten Pandemic: The Influenza of 1918, Cambridge University, Cambridge.

63. Reid, A. H., and Taubenberger, J. K. 2003, J. Gen. Virol., 84, 2285.
64. Kash, J. C., Basler, C. F., García-Sastre, A., Cartera, V., Billharz, R., Swayne, D. E., Przygodzki, R. M., Taubenberger, J. K., Katze, J. G., and Tumpey, T. M. 2004, J. Virol., 78, 9499.

65. Tumpey, T. M., Basler, C. F., Aguilar, P. V., Zeng, H., Solórzano, A., Swayne, D. E., Cox, N. J., Katz, J. M., Taubenberger, J. K., Palese, P., and García-Sastre, A. 2005, Science, 310, 77.

66. Taubenberger, J. K., Reid, A. H., Lourens, R. M., Wang, R., Jin, G., and Fanning, T. G. 2005, Nature (London), 437, 889.

67. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Jr., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. 1977, J. Mol. Biol., 112, 535.

68. Torrens, F., Sánchez-Marín, J., and Sánchez-Pérez, E. 1989, Actes del II Sympòsium sobre l'Ensenyament de les Ciències Naturals, S. Riera (Ed.), Documents No. 11, Eumo, Vic, 595.

69. Torrens, F., Sánchez-Marín, J., and Sánchez-Pérez, E. 1989, Actes del II Sympòsium sobre l'Ensenyament de les Ciències Naturals, S. Riera (Ed.), Documents No. 11, Eumo, Vic, 669.

70. Torrens, F., Sánchez-Pérez, E., and Sánchez-Marín, J. 1989, Enseñanza de las Ciencias, Extra-III Congreso(1), 267.

71. Torrens, F., Ortí, E., and Sánchez-Marín, J. 1991, Colloquy University Pedagogy, Horsori, Barcelone, 375.

72. Sayle, R. A., and Milner-White, E. J. 1995, Trends Biochem. Sci., 20, 374.
73. Shindyalov, I. N., and Bourne, P. E. 1995, J. Appl. Crystallogr., 28, 847.
74. Shindyalov, I. N., and Bourne, P. E. 1997, CABIOS, 13, 487.
75. Walters, P., and Stahl, M. 1996, Program BABEL, University of Arizona, Tucson (AZ).
76. Hendlich, M. 1998, Acta Crystallogr., Sect. D, 54, 1178.
77. Tsai, C. S. 2001, J. Chem. Educ., 78, 837.
78. MDL 2007, Program Chime, MDL Information Systems, San Leandro (CA).
79. Claros, M. G., Fernández-Fernández, J. M., González-Mañas, J. M., Herráez, Á., Sanz, J. M., and Urdiales, J. L. 2001, BioROM 1.0 and 1.1, Sociedad Española de Bioquímica y Biología Molecular, Malaga.

80. Claros, M. G., Fernández-Fernández, J. M., García-Vallvé, S., González-Mañas, J. M., Herráez, Á., Oliver, J., Pons, G., Pujadas, G., Roca, P., Rodríguez, S., Sanz, J. M., Segués, T., Urdiales, J. L. 2002, BioROM 2002, Sociedad Española de Bioquímica y Biología Molecular, Malaga.

81. Claros, M. G., Alonso, T., Corzo, J., Fernández-Fernández, J. M., García-Vallvé, S., González-Mañas, J. M., Herráez, Á., Oliver, J., Pons, G., Roca, P., Sanz, J. M., Segués, T., Urdiales, J. L., Valle, A., and Villalaín, J. 2003, BioROM 2003, Sociedad Española de Bioquímica y Biología Molecular–Roche Diagnostics, Malaga.


82. Claros, M. G., Alfama, R., Alonso, T., Amthauer, R., Castro, E., Corzo, J., Fernández-Fernández, J. M., Figueroa, M., García-Vallvé, S., González-Mañas, J. M., Herráez, Á., Herrera, R., Moya, A., Oliver, J., Pons, G., Roca, P., Sanz, J. M., Segués, T., Tejedor, M. C., Urdiales, J. L., and Villalaín, J. 2004, BioROM 2005, Sociedad Española de Bioquímica y Biología Molecular–Universidad Miguel Hernández–Universidad del País Vasco, Malaga.

83. Claros, M. G., Alfama, R., Alonso, T., Amthauer, R., Castro, E., Corzo, J., Fernández-Fernández, J. M., Figueroa, M., García-Mondéjar, L., García-Vallvé, S., Garrido, M. B., González-Mañas, J. M., Herráez, Á., Herrera, R., Miró, M. J., Moya, A., Oliver, J., Palacios, E., Pons, G., Roca, P., Sanz, J. M., Segués, T., Tejedor, M. C., Urdiales, J. L., and Villalaín, J. 2005, BioROM 2006, Sociedad Española de Bioquímica y Biología Molecular–Pearson Educación, Malaga.

84. Herráez, Á. 2006, Biochem. Mol. Biol. Educ., 34, 255.
85. Claros, M. G., Alfama, R., Alonso, T., Amthauer, R., Carrero, I., Castro, E., Corzo, J., Fernández-Fernández, J. M., Figueroa, M., García-Vallvé, S., Garrido, M. B., González-Mañas, J. M., Herráez, Á., Herrera, R., Miró, M. J., Moya, A., Oliver, J., Palacios, E., Pons, G., Roca, P., Salgado, J., Sancho, P., Sanz, J. M., Segués, T., Tejedor, M. C., Urdiales, J. L., and Villalaín, J. 2006, BioROM 2007, Sociedad Española de Bioquímica y Biología Molecular–Pearson Educación, Malaga.

86. Miró, M. J., Méndez, M. T., Raposo, R., Herráez, Á., Barrero, B., and Palacios, E. 2007, III Jornada Campus Virtual UCM: Innovación en el Campus Virtual, Metodologías y Herramientas, Complutense, Madrid, 304.

87. Pacios, L. F. 2001, J. Chem. Inf. Comput. Sci., 41, 1427.
88. Pal, A., and Kumar, S. 2005, Indian J. Chem., Sect. A, 44, 1589.
89. Nielsen, J. E. 2007, J. Mol. Graphics Model., 25, 691.
90. Barth, P., Alber, T., and Harbury, P. B. 2007, Proc. Natl. Acad. Sci. USA, 104, 4898.
91. Liu, N., and Wang, T. 2007, J. Mol. Graphics Model., 25, 852.
92. Soga, S., Shirai, H., Kobori, M., and Hirayama, N. 2007, J. Chem. Inf. Model., 47, 400.
93. Solms, J., Vuataz, L., and Egli, R. H. 1965, Experientia, 21, 692.
94. Schiffman, S. S., Clark, T. B., III, and Gagnon, J. 1982, Physiol. Behav., 28, 457.
95. Box, G. E. P., Hunter, W. G., MacGregor, J. F., and Erjavec, J. 1973, Technometrics, 15, 33.
96. Hocking, R. R. 1976, Biometrics, 32, 1.
97. Ou, C.-Y., Ciesielski, C. A., Myers, G., Bandea, C. I., Luo, C.-C., Korber, B. T. M., Mullins, J. I., Schochetman, G., Berkelman, R. L., Economou, A. N., Witte, J. J., Furman, L. J., Satten, G. A., MacInnes, K. A., Curran, J. W., and Jaffe, H. W. 1992, Science, 256, 1165.

98. De Oliveira, T., Pybus, O. G., Rambaut, A., Salemi, M., Cassol, S., Ciccozzi, M., Rezza, G., Gattinara, G. C., d’Arrigo, R., Amicosante, M., Perrin, L., Colizzi, V., Perno, C. F., and Benghazi Study Group 2006, Nature (London), 444, 836.



3. Molecular topology in QSAR and drug design studies

J. Gálvez and R. García-Domenech

Unidad de Diseño de Fármacos y Conectividad Molecular, Dept. Química Física, Facultad de Farmacia, Universitat de Valencia, Avd. V.A. Estellés, s/n 46100 Burjassot-Valencia, Spain

Abstract. The use of graph-theoretical approaches to describe the chemical structure of organic compounds has gained increasing relevance in recent years. From the early days, when this formalism was used to predict simple properties of simple molecules, such as the boiling points of alkanes, to the present design of, for instance, novel lead anticancer drugs, significant progress has been achieved and a long path has been covered. The aim of this review is to depict some of the milestones along that path and to forecast the challenges and expectations for the future of the field.

Correspondence/Reprint request: Dr. J. Gálvez, Unidad de Diseño de Fármacos y Conectividad Molecular, Dept. Química Física, Facultad de Farmacia, Universitat de Valencia, Avd. V.A. Estellés, s/n 46100 Burjassot-Valencia, Spain. E-mail: [email protected] and [email protected]

1. Introduction
Since Corwin Hansch [1] introduced, at the beginning of the 1960s, his famous equation relating some experimental properties of chemical compounds to certain electronic and steric parameters of molecules, thereby founding the field of quantitative structure-activity relationships (QSAR), the development of such methods has been dizzying. The introduction of ever more capable computers made that progress possible, to the point that the so-called in silico approaches are nowadays an essential tool.

The initial QSAR technique assumed that there is a relationship between the properties of a molecule and its structure, this relationship being profiled by a set of physicochemical parameters. QSAR based on this original concept is regarded as the classic approach, as shown in the works of Hansch and Fujita [1]. A variety of modern QSAR methods are founded on classic QSAR. These methods have become powerful tools to predict chemical properties or possible biological functions of unknown compounds, thus accelerating the design of novel compounds and their optimization.

Other approaches, such as that of Free and Wilson [2], built on the same concept and also yielded good results. The Free-Wilson analysis relies on the additive nature of the properties of molecular fragments, an assumption that does not always hold. However, in those cases in which it is applicable it yields very good results.

Another significant step in the field of QSAR was the introduction by Cramer in 1988 of three-dimensional molecular parameters, which led to what was later developed as 3D-QSAR [3]. This initiated a new era in both conceptual evolution and practical applications. Accounting for the effects of different conformers, stereoisomers or enantiomers of chemical compounds in 3D-QSAR models permitted the comparison of molecular structures and thereby the definition of a representative structural group known as the pharmacophore [4]. The first, and currently one of the most widely used, models employing 3D-QSAR to draw pharmacophores was the Comparative Molecular Field Analysis (CoMFA) method proposed by Cramer et al. [3].
Other 3D-QSAR approaches, such as Comparative Molecular Similarity Indices Analysis (CoMSIA) [5] or Self Organizing Molecular Field Analysis (SomFA) [6], have also been developed, some of which incorporate comparisons of different sets of molecular descriptors.

Along with the rapid development of computational science, a pool of new techniques based on formalisms such as molecular mechanics, molecular dynamics, docking, scoring and pharmacophore analysis is now widely used in the area of drug discovery. These computational techniques have been shown to assist in the design of novel, more potent inhibitors because they can visualize the mechanism of ligand-receptor interactions.

Molecular topology (MT), a discipline usually considered within the QSAR methods, has proved to be an excellent tool for a quick and accurate prediction of many physicochemical and biological properties [7-9]. One of the most interesting advantages of MT is the straightforward calculation of the molecular descriptors to work with. Within this mathematical formalism a molecule is assimilated to a graph, where each vertex represents one atom and each edge one bond. Starting from the interconnections between the vertices, an adjacency topological matrix can be built, whose ij elements take the value one or zero, depending on whether the vertex i is connected or unconnected to the vertex j, respectively (see Figure 1). The manipulation of this matrix gives rise to a set of topological indices or topological descriptors, which characterize each graph and allow the development of QSPR [10–12] as well as QSAR [13–18] analyses.

The distinctive features of molecular topology can be summarized in the following items:
a) It is a completely mathematical framework in which molecular structure is profiled.
b) It is a very efficient approach for drug discovery, either by screening large databases of compounds or by designing novel compounds following the inverse process (properties→structure). Furthermore, it is easily computerizable.

Item b) is a direct consequence of item a).

The use of molecular topology in the search for QSAR has grown exponentially over the last years. Altogether, the topological approach accounts for roughly 20% of all papers on QSAR. A recent search carried out with the SciFinder Scholar database disclosed that about 3000 papers out of 15000 dealing with QSAR were devoted to topological descriptors (see Figure 2). In the current work, we focus on the contributions of molecular topology to QSAR studies obtained by our research group in the last years and on their application to drug design.
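As an illustration of the graph formalism described above, the adjacency matrix of isopentane (2-methylbutane) can be built from its list of bonds. The following Python sketch is only indicative: the vertex numbering (0 to 4) and the function name are assumptions made here and need not coincide with the numbering used in Figure 1.

```python
def adjacency_matrix(n_atoms, bonds):
    """Return the symmetric 0/1 adjacency matrix of a hydrogen-depleted graph.

    `bonds` is a list of (i, j) pairs of bonded non-hydrogen atoms.
    """
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Isopentane, hypothetical numbering: C0-C1(-C4)-C2-C3
ISOPENTANE_BONDS = [(0, 1), (1, 2), (2, 3), (1, 4)]
A = adjacency_matrix(5, ISOPENTANE_BONDS)

# The row sums give the degree (number of neighbours) of each vertex.
degrees = [sum(row) for row in A]

for row in A:
    print(row)
print("vertex degrees:", degrees)
```

The row sums of the matrix are the vertex degrees, which are the basic ingredients of the connectivity indices defined later in this chapter.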

Figure 1. The chemical graph and adjacency matrix of isopentane.

Figure 2. Bibliographic search carried out with the SciFinder Scholar database. White bars: papers including the keyword QSAR; black bars: papers including the keywords QSAR + molecular topology.

2. Methods and applications
Molecular topology, MT, can be defined as a part of chemistry consisting of the topological description of molecular structures. Such a description deals basically with the connectivity of the atoms forming the molecule and should yield numerical descriptors which are invariant under deformation of the structure. Note that, although graph theory is usually the main source of the descriptors and concepts feeding MT, MT is a much broader concept and is not actually a part of graph theory. Throughout this section somewhat different molecular connectivity indices will be introduced. In order to outline the particular QSAR techniques used with this methodology, descriptors will be defined before explaining the modeling tools applied with them. Diverse statistical and molecular techniques will be sketched here.

2.1. Descriptors
The following types of indices, which have been mainly used in this research, are described in increasing order of complexity.

Discrete invariants
These are natural numbers calculated from what chemists understand qualitatively as the chemical structure. N is the number of non-hydrogen atoms, i.e., the number of molecular graph vertices [19,20]. Vk, where k is 3 or 4, is the number of vertices of degree k, i.e., the number of atoms having k bonds, σ or π, to non-hydrogen atoms [20]. PRk, for k between 0 and 3, is the number of pairs of ramifications at distance k, i.e., the number of pairs of single branches at distance k in terms of bonds [20]. L is the length, i.e., the maximum distance between non-hydrogen atoms measured in bonds, and is thus the diameter of the molecular graph, defined as max(dij) [20]. W is the Wiener number, i.e., the sum of the distances between any two non-hydrogen atoms measured in bonds [21].

Connectivity indices
Throughout the present section the connectivity indices defined as in Eq. 1 will be used [22,23]. Some of them are slightly different from the previously defined indices. The connectivity index of order k [23] may be derived from the adjacency matrix and is normally written as ${}^{k}\chi_{t}$. The order k is between 0 and 4 and is the number of connected non-hydrogen atoms which appear in a given sub-structure.

$$ {}^{k}\chi_{t} = \sum_{j=1}^{{}^{k}n_{t}} \Bigl( \prod_{i \in S_j} \delta_i \Bigr)^{-1/2} \qquad \text{(Eq. 1a)} $$

In Eq. 1a, $\delta_i$ is the number of simple (i.e., sigma) bonds of the atom i to non-hydrogen atoms, $S_j$ represents the jth sub-structure of order k and type t, and ${}^{k}n_{t}$ is the total number of sub-graphs of order k and type t that can be identified in the molecular structure. The types used are path (p), cluster (c), and path-cluster (pc). A sub-graph of type p is formed by a path, a sub-graph of type c is formed by a star, while the pc sub-graph can be defined as every tree which is neither a path nor a star. Alternatively, a pc sub-graph is any tree containing at least a star and a path. As an example, Table 1 displays all the p, c, and pc sub-graphs found in a simple molecular structure. The use of the valence delta, $\delta^{v}$, instead of $\delta$ enables the encoding of π and lone-pair electrons [22] in the form given in Eq. 1b.

$$ {}^{k}\chi_{t}^{v} = \sum_{j=1}^{{}^{k}n_{t}} \Bigl( \prod_{i \in S_j} \delta_i^{v} \Bigr)^{-1/2} \qquad \text{(Eq. 1b)} $$

Here $\delta^{v}$ is just the degree of a vertex in a pseudograph and in this context the old definition, $\delta^{v} = Z^{v} / (Z - Z^{v} - 1)$, for the $\delta_i^{v}$ of higher-row atoms holds [22]. The values listed in Table 2 and used in this approach were adopted because of their general performance.
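The path-type indices of Eq. 1a can be sketched in a few lines of Python. The sketch below is not the Desmol11 implementation; it rests on two assumptions made here for illustration: a path of order k is taken to contain k bonds (the usual Kier-Hall convention), and the test molecule is isopentane with a hypothetical vertex numbering. Passing a list of valence deltas (for instance, values from Table 2) through the `delta` argument gives the corresponding valence index of Eq. 1b.

```python
from math import prod, sqrt


def adjacency_list(n_atoms, bonds):
    """Neighbour lists of a hydrogen-depleted molecular graph."""
    nbrs = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)
    return nbrs


def paths_of_order(nbrs, k):
    """All simple paths containing k bonds, each counted once."""
    found = set()

    def extend(path):
        if len(path) == k + 1:
            # A path and its reverse are the same sub-graph: keep one copy.
            found.add(min(tuple(path), tuple(reversed(path))))
            return
        for nxt in nbrs[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(nbrs)):
        extend([start])
    return sorted(found)


def chi_path(nbrs, k, delta=None):
    """Path connectivity index of order k (Eq. 1a; Eq. 1b if valence deltas are given)."""
    if delta is None:
        delta = [len(nb) for nb in nbrs]  # simple delta: number of neighbours
    return sum(1.0 / sqrt(prod(delta[i] for i in path))
               for path in paths_of_order(nbrs, k))


# Isopentane, hypothetical numbering: C0-C1(-C4)-C2-C3
isopentane = adjacency_list(5, [(0, 1), (1, 2), (2, 3), (1, 4)])
for order in range(4):
    print(f"order {order}: chi_p = {chi_path(isopentane, order):.4f}")
```

Cluster and path-cluster sub-graphs would need a separate enumeration of stars and branched trees, which is omitted here for brevity.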


Table 1. Types of sub-graphs present in the 2-methylpropanol structure.
[Columns: Type, Order 1 to Order 5; rows: Path, Cluster, Path-Cluster. The sub-graph drawings are not reproduced.]

Table 2. Values of δv for the different heteroatoms present in the listed groups.

Group            δv       Group            δv
NH4+             1        H3O+             3
NH3              2        H2O              4
-NH2             3        -OH              5
-NH-             4        -O-              6
=NH              4        =O               6
-N-              5        O (nitro)        6
=N-              5        O (carboxyl)     6
=N+= (azide)     4        -F               7
=N- (azide)      6        -Cl              0.690
-N= (nitro)      6        -Br              0.254
-S-              1.33     -I               0.15
=S               0.99     =P-              0.560
S (-SO2-)        2.67     P(5)             2.22

Topological Charge Indices (TCI)
The topological charge indices $G_k$ and $J_k$ of order k = 1–5 are defined for a given graph by Eq. 2 [24], in which N is the number of non-hydrogen atoms and $c_{ij} = m_{ij} - m_{ji}$ is the charge term between vertices i and j. Here δ represents the Kronecker delta symbol, i.e., δ(a,b) = 1 if a = b and δ(a,b) = 0 if a ≠ b, and $d_{ij}$ is the topological distance between vertices i and j.


$$ G_k = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \lvert c_{ij} \rvert \, \delta(k, d_{ij}) \qquad \text{and} \qquad J_k = \frac{G_k}{N-1} \qquad \text{(Eq. 2)} $$

The variables mij are the elements of the NxN matrix M obtained as the product of two matrices, i.e., M = A·Q. The elements of M expanded in terms of the elements of A and Q are given in Eq. 3.

$$ m_{ij} = \sum_{h=1}^{N} a_{ih}\, q_{hj} \qquad \text{(Eq. 3)} $$

A is the adjacency matrix, whose elements $a_{ih}$ are: 0 if either i = h or i is not linked to h; 1 if i is linked to h by a single bond; 1.5 if linked to h by an aromatic bond; 2 if linked to h by a double bond; and 3 if linked to h by a triple bond. Q is the inverse squared distance or Coulombian matrix. Its elements, $q_{hj}$, are 0 if h = j and otherwise $q_{hj} = 1/d_{hj}^{2}$, where $d_{hj}$ is the topological distance between vertices h and j. Thus, $G_k$ represents the overall sum of the absolute charge terms $\lvert c_{ij} \rvert$ for every pair of vertices i and j separated by a topological distance k.

The valence topological charge indices $G_k^{v}$ and $J_k^{v}$ are defined in a similar way, but using $A^{v}$, the electronegativity-modified adjacency matrix, instead of A. The elements of A and $A^{v}$ are identical except for the main diagonal, where A has zeroes and $A^{v}$ has, for each heteroatom, the corresponding Pauling electronegativity value, EN, weighted so that EN(Cl) = 2.

To illustrate the calculation of the topological charge indices, let us consider n-butane. Its hydrogen-depleted graph is •–•–•–•. If the vertices of this graph are numbered 1-2-3-4, we can write the A, Q and M matrices, which are used to derive the following G values: $G_1 = |c_{12}| + |c_{23}| + |c_{34}| = 1/4 + 0 + 1/4 = 0.500$, $G_2 = |c_{13}| + |c_{24}| = 1/9 + 1/9 = 0.2222$, and $G_3 = |c_{14}| = 0$.

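$$ A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \qquad Q = \begin{pmatrix} 0 & 1 & 1/4 & 1/9 \\ 1 & 0 & 1 & 1/4 \\ 1/4 & 1 & 0 & 1 \\ 1/9 & 1/4 & 1 & 0 \end{pmatrix} $$

$$ M = A \cdot Q = \begin{pmatrix} 1 & 0 & 1 & 1/4 \\ 1/4 & 2 & 1/4 & 10/9 \\ 10/9 & 1/4 & 2 & 1/4 \\ 1/4 & 1 & 0 & 1 \end{pmatrix} $$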
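The n-butane example can be reproduced numerically. The Python sketch below follows Eqs. 2 and 3 as written above and uses exact fractions so that the results can be compared directly with the worked values; the function names are illustrative assumptions, the graph is assumed to be connected, and the vertices are numbered from 0 rather than from 1.

```python
from collections import deque
from fractions import Fraction


def topological_distances(adjacency):
    """Matrix of shortest-path distances (in bonds) by breadth-first search."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for source in range(n):
        seen = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if u != v and adjacency[u][v] != 0 and v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[source][v] = d
    return dist


def charge_indices(adjacency, k_max):
    """Return (G, J) dictionaries keyed by the order k = 1..k_max (Eq. 2)."""
    n = len(adjacency)
    d = topological_distances(adjacency)
    # Coulombian matrix Q: q_hj = 1/d_hj^2 off the diagonal, 0 on it.
    q = [[Fraction(0) if h == j else Fraction(1, d[h][j] ** 2)
          for j in range(n)] for h in range(n)]
    # M = A . Q (Eq. 3).
    m = [[sum((Fraction(adjacency[i][h]) * q[h][j] for h in range(n)), Fraction(0))
          for j in range(n)] for i in range(n)]
    g = {}
    for k in range(1, k_max + 1):
        g[k] = sum((abs(m[i][j] - m[j][i])
                    for i in range(n - 1) for j in range(i + 1, n)
                    if d[i][j] == k), Fraction(0))
    j_index = {k: g[k] / (n - 1) for k in g}
    return g, j_index


# n-Butane, vertices 0-1-2-3 (1-2-3-4 in the text).
BUTANE = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
G, J = charge_indices(BUTANE, 3)
for k in (1, 2, 3):
    print(f"G{k} = {G[k]}  J{k} = {J[k]}")
```

For the valence indices, the same function could in principle be applied to a matrix whose diagonal carries the electronegativity terms, since the distance search above ignores diagonal entries.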

Differences and quotients of connectivity indices
The differences of connectivity indices, ${}^{k}D_{t}$, with k = 0-4 and t = p, c, pc, are defined in the following way [25]:

$$ {}^{k}D_{t} = {}^{k}\chi_{t} - {}^{k}\chi_{t}^{v} \qquad \text{(Eq. 4)} $$

The quotients of connectivity indices, ${}^{k}C_{t}$, with k = 0-4 and t = p, c, pc, are defined in the following way [20]:

$$ {}^{k}C_{t} = \frac{{}^{k}\chi_{t}}{{}^{k}\chi_{t}^{v}} \qquad \text{(Eq. 5)} $$

All descriptors used in this work were obtained with the aid of the Desmol11 program, developed by us and available on request by e-mail.

2.2. QSAR algorithm

2.2.1. Multilinear regression analysis, MLR
Several multilinear descriptive functions (Eq. 6) have been obtained by the linear correlation of biological properties with the aforementioned descriptors.

$$ P = A_{0} + \sum_{i} A_{i} X_{i} \qquad \text{(Eq. 6)} $$

where P is a property, $X_i$ are the topological indices, and $A_0$ and $A_i$ are the regression coefficients of the equation obtained.

The Furnival-Wilson algorithm [26] is used to obtain subsets of descriptors and equations with the least Mallows parameter, Cp [27]. This algorithm combines two methods of computing the residual sums of squares for all possible regressions to form a simple leaps-and-bounds technique for finding the best subsets without examining all possible ones. The result is a reduction by several orders of magnitude in the number of operations required to find the best subsets.

The predictive ability of the selected mathematical models was evaluated through cross-validation, using the leave-one-out method [28]. To do this, one compound in the set was removed and the model was recalculated using the remaining N-1 compounds as the training set. The property was then predicted for the removed compound. This process was repeated for all the compounds in the set to obtain a prediction for each. A plot of the residuals versus the cross-validation residuals (cv) allowed the detection of outliers.

In order to detect possible fortuitous regressions, a randomization test is adopted in this work. The values of the property of each compound are randomly permuted and linearly correlated with the aforementioned descriptors. This process is repeated as many times as needed. The usual way to represent the results of a randomization test is to plot the correlation coefficient, r2, against the prediction coefficient, Q2.
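The fitting of Eq. 6 and the leave-one-out procedure can be sketched as follows. This is a minimal pure-Python illustration, not the software used in this work: the coefficients are obtained from the normal equations, Q2 is assumed to be defined as 1 − PRESS/SS_tot (a common convention that is not stated explicitly in the text), and the descriptor values and property values are invented toy data.

```python
def solve_linear_system(a, b):
    """Solve a.x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_mlr(x_rows, y):
    """Least-squares coefficients [A0, A1, ...] of Eq. 6 via the normal equations."""
    design = [[1.0] + list(row) for row in x_rows]
    p = len(design[0])
    xtx = [[sum(r[a] * r[b] for r in design) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(design, y)) for a in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Evaluate Eq. 6 for one compound."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row))


def r2_and_q2(x_rows, y):
    """Return (r2, Q2), with Q2 from leave-one-out cross-validation."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    coeffs = fit_mlr(x_rows, y)
    ss_res = sum((yi - predict(coeffs, row)) ** 2 for row, yi in zip(x_rows, y))
    press = 0.0
    for i in range(len(y)):
        # Refit without compound i, then predict the left-out compound.
        loo = fit_mlr(x_rows[:i] + x_rows[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(loo, x_rows[i])) ** 2
    return 1.0 - ss_res / ss_tot, 1.0 - press / ss_tot


# Invented toy data: two descriptors per compound and a noisy property.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]]
Y = [-2.9, 2.2, -5.1, 0.1, -6.8, -2.1]
r2, q2 = r2_and_q2(X, Y)
print(f"r2 = {r2:.3f}, Q2 = {q2:.3f}")
```

A randomization test could reuse `r2_and_q2` after shuffling `Y`, for example with `random.shuffle`, and collecting the resulting pairs of r2 and Q2 values.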


The predictive ability of the equations obtained can be better assessed by an external test. A random sub-set of molecules is initially chosen (the "external test set") and the modeling study is carried out with the remaining molecules (the "training set"). The predictive performance of the model is assessed by the results obtained when it is applied to the external test set.

MLR applications

Prediction of plasma protein binding for a group of antineoplastics
A set of 41 highly heterogeneous antineoplastic drugs was used in this study. The multilinear regression analysis was performed by means of the BMDP software [29]. The dependent variable, namely plasma protein binding (PPB), was correlated against the topological descriptors. The selected equation was:

$$ \mathrm{PPB}(\%) = 194.9 + 7.2\,{}^{4}\chi_{pc} + 3.6\,G_{1} - 10.6\,G_{5}^{v} - 124.3\,J_{1}^{v} - 704.0\,J_{4}^{v} - 8.5\,{}^{2}D_{p} \qquad \text{(Eq. 7)} $$

N = 41, r2 = 0.805, Q2 = 0.732, SEE = 15.5, F = 23.3, p < 0.0001

The J and G labels correspond to the topological charge indices, which take into account the charge transfers inside the molecule, whereas ${}^{4}\chi_{pc}$ and ${}^{2}D_{p}$ stand for the connectivity and difference of connectivity indices, respectively. The former encodes information about the topological assembly of the molecule and the latter about its electronic properties.

Table 3 and Figure 3 show the outcome of the prediction of PPB for each drug. Overall the predictive performance is fairly good, inasmuch as all the drugs showing a large PPB rate (23 compounds with PPB above 80%) are predicted above 70%, except raltitrexed, which is a clear outlier. On the other hand, all drugs with values below 20% (namely six compounds) are predicted to lie under a 35% threshold. As with the top values, there is a clear outlier here: cyclophosphamide.

Table 3. Results of prediction of plasma protein binding obtained by multilinear regression analysis for a group of antineoplastics.

Compound              PPBexp(%)   PPBcalc(%)*   PPBcalc(cv)(%)
Dacarbazine           5.0         15.4          17.0
Gemcitabine           10.0        33.6          37.8
Cytarabine            13.0        32.1          33.9
Cyclophosphamide      15.0        48.7          51.2
Temozolomide          15.0        -2.1          -5.8
Mercaptopurine        19.0        18.6          18.5
Aminoglutethimide     25.0        53.1          57.2
Cladribine            25.0        9.9           7.3
Busulfan              30.0        53.0          62.4
Nimustine             34.0        26.6          26.0
Topotecan             35.0        37.7          38.0
Anastrozole           45.0        31.8          29.1
Methotrexate          50.0        71.0          74.7
Capecitabine          54.0        50.0          49.8
Letrozol              60.0        55.3          54.6
Irinotecan            65.0        87.6          90.3
Vinblastine           75.0        74.4          74.0
Vincristine           75.0        78.3          80.3
Mitoxantrone          76.0        58.9          56.8
Doxorubicin           80.0        75.1          74.7
Melphalan             80.0        84.7          85.1
Hydroxycarbamide      80.0        72.2          69.0
Epirubicin            85.0        75.1          74.1
Exemestane            90.0        90.1          90.1
Formestane            93.0        93.0          93.1
Raltitrexed           93.0        68.7          65.5
Medroxyprogesterone   94.0        92.7          92.4
Cyproterone           95.0        94.3          94.1
Docetaxel             95.0        95.5          95.7
Etoposide             95.0        88.4          86.5
Flutamide             95.0        80.6          77.5
Imatinib              95.0        103.7         107.0
Paclitaxel            95.0        98.9          99.9
Idarubicin            96.0        76.0          73.9
Amsacrine             97.0        99.8          100.2
Bicalutamide          98.0        88.7          86.9
Chlorambucil          99.0        72.4          70.0
Estramustine          99.0        90.7          89.9
Tamoxifen             99.0        102.5         103.2
Thiotepa              99.0        89.3          80.2
Toremifen             99.0        110.4         112.8

* Values obtained from Eq. 7

Figure 3. Observed versus calculated values of the PPB.

Figure 4. Cross-validated residuals versus residuals for the PPB model.


The uniform distribution of the points on both sides of the trend line (see Figure 3) is indicative of the well-balanced selection of variables in the model. The predictive ability of the selected mathematical model was evaluated through leave-one-out cross-validation. Table 3 and Figure 4 show the results obtained. The value Q2 = 0.732 is regarded as satisfactory.

Prediction of phenoloxidase inhibition by a group of benzaldehyde thiosemicarbazones
A topological-mathematical model has been built to search for new derivatives of benzaldehyde thiosemicarbazone and related compounds acting as phenoloxidase inhibitors. Phenoloxidase, PO, also known as tyrosinase, is a key enzyme in different metabolic processes of microorganisms, animals and plants [30,31]. In insects, PO is related to three important biochemical functions, namely cuticle sclerotization, defensive encapsulation and the melanization of foreign organisms [32]. By using multilinear regression analysis, a function with two descriptors, ${}^{1}\chi^{v}$ and ${}^{4}\chi_{p}^{v}$, and r2 = 0.940 was able to predict adequately the IC50 for each compound. The best linear regression equation obtained, including its statistical parameters, was:

$$ \mathrm{pIC}_{50} = -3.132 + 5.716\,{}^{1}\chi^{v} - 16.581\,{}^{4}\chi_{p}^{v} \qquad \text{(Eq. 8)} $$

N = 44, r2 = 0.940, Q2 = 0.931, SEE = 0.500, F = 321.1, p < 0.00001

The presence of the ${}^{1}\chi^{v}$ and ${}^{4}\chi_{p}^{v}$ indices in the equation reflects the influence of branching and of the position of the substituent in the aromatic ring, respectively. Figure 5 shows the predicted results obtained for each of the compounds in the training and test sets. The results of the randomization tests (Figure 6) suggest a high stability of the model (all regressions were rather poor except for the selected equation). For more details about this study, see reference [33].

Other results of prediction of biological properties obtained by our research group
The most interesting results obtained recently with the methodology described in the preceding paragraphs are shown in Table 4. Additional details are available in the references cited.


Figure 5. Relationship of pIC50exp with pIC50calc from prediction function obtained using multilinear regression analysis, Eq. 8. Open circles represent predictions for the training set; solid circles represent predictions for the test set.

Figure 6. Validation of the mathematical model obtained for the pIC50. Correlation coefficients, r2, versus prediction coefficients, Q2, obtained by randomization test.
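The two validation steps used for the regression models (leave-one-out cross-validation and the randomization test of Figure 6) can be sketched as follows. This is only an illustrative sketch on synthetic data: the descriptor matrix, the noise level and all function names (`fit_ols`, `q2_loo`, `y_randomization`) are assumptions made for the example and are not the authors' software or data.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def r2(X, y):
    """Squared correlation coefficient of the fitted model."""
    resid = y - predict(fit_ols(X, y), X)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def q2_loo(X, y):
    """Leave-one-out prediction coefficient, Q2 = 1 - PRESS / SS_total."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i              # drop compound i
        coef = fit_ols(X[keep], y[keep])      # refit without it
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_trials=20, seed=0):
    """Refit the model on randomly permuted activities; returns (r2, Q2) pairs."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n_trials):
        y_perm = rng.permutation(y)
        pairs.append((r2(X, y_perm), q2_loo(X, y_perm)))
    return pairs

# Synthetic stand-in for 44 compounds described by two indices (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(44, 2))
y = -3.1 + 5.7 * X[:, 0] - 16.6 * X[:, 1] + rng.normal(scale=0.5, size=44)

r2_true, q2_true = r2(X, y), q2_loo(X, y)
scrambled = y_randomization(X, y)
print(f"selected model: r2 = {r2_true:.3f}, Q2 = {q2_true:.3f}")
print(f"best scrambled r2 = {max(r for r, _ in scrambled):.3f}")
```

A stable model is expected to sit far from the cloud of scrambled regressions in the r2 versus Q2 plane, which is the pattern described for Figure 6.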


Table 4. QSAR topological models to predict pharmacological properties.

Property | Predictive equation | Ref.

IC50 (μM), L. donovani | $\log \mathrm{IC_{50}} = 13.32 + 0.814\,{}^{4}\chi_{p} + 1.381\,G_{3}^{v} - 32.16\,J_{3}^{v} + 0.0018\,W - 0.717\,N + 0.332\,PR1 - 0.263\,PR2$; N = 48, r2 = 0.806, SEE = 0.366, F = 10.6, p < 0.0001 | [34]

IC50 (μM), K1 strain of P. falciparum | $\log \mathrm{IC_{50}} = 3.154 - 0.338\,{}^{1}\chi^{v} + 0.381\,{}^{4}\chi_{pc}$; N = 54, r2 = 0.842, Q2 = 0.825, SEE = 0.308, F = 136.4, p < 0.0001 | [35]

Log LD50 (oral in rat) | $\log \mathrm{LD_{50}} = 15.13 + 0.79\,G_{1} - 1.83\,{}^{4}C_{p} - 1.10\,{}^{1}\chi^{v} - 0.28\,G_{2} - 8.89\,{}^{0}C + 1.43\,G_{3}^{v} - 20.04\,J_{3}^{v} - 0.13\,{}^{4}\chi_{pc}^{v}$; N = 39, r2 = 0.821, Q2 = 0.701, SEE = 0.40, F = 17.2, p < 0.0001 | [36]

Log LD50 (i.p. in rat) | $\log \mathrm{LD_{50}} = 8.40 + 9.35\,J_{1} - 2.03\,{}^{4}C_{p} - 10.11\,{}^{0}C + 1.20\,{}^{1}D + 0.83\,G_{5}^{v} - 0.39\,{}^{4}\chi_{pc}^{v}$; N = 39, r2 = 0.721, Q2 = 0.613, SEE = 0.54, F = 13.8, p < 0.0001 | [36]

Toxicity to Chlorella vulgaris | $pC = -4.494 + 0.568\,{}^{0}\chi^{v} - 0.113\,G_{1}^{v} - 1.161\,G_{5}^{v} + 10.071\,J_{4} + 0.188\,V3$; N = 70, r2 = 0.928, Q2 = 0.918, SEE = 0.405, F = 180, p < 0.0001 | [37]

2.2.2. Linear discriminant analysis, LDA

The objective of linear discriminant analysis (LDA) is to find a linear combination of variables that allows discrimination between two or more categories of objects. Generally, two sets of compounds are considered in the analysis: first, a set of compounds with proven pharmacological activity and, second, a set of compounds known to be inactive. Introductory accounts of LDA are given by Kachigan [38] and by McFarland and Gans [39]. The selection of the descriptors is based on the Fisher-Snedecor parameter, and the classification criterion is the shortest Mahalanobis distance (i.e., each case is assigned to the group whose mean lies at the shortest Mahalanobis distance from it). Variables used in computing the linear classification functions are chosen stepwise. At each step, either the variable that adds the most to the separation of the groups is entered into the discriminant function, or the variable that adds the least to the separation of the groups is removed from it. The quality of the discriminant function is evaluated by Wilks' λ, a multivariate analysis of variance parameter that tests the equality of group means for the variable(s) in the discriminant function. Minimization of Wilks' λ guides the selection of the predictors to be entered into or deleted from the discriminant function [40]. The technique is described in detail by Tabachnick and Fidell [41]. The discriminant ability of the selected function is assessed by:


- The Classification matrix, in which each case is classified into a group according to the classification function. The number of cases classified into each group and the percent of correct classifications are shown.

- The Jack-knifed classification matrix, in which each case is classified into a group according to the classification functions computed from all the data except the case being classified.

- Cross validation with random sub-samples and classifying new cases. Here, the cases in each group are randomly subdivided into two separate sets, the first of which is then used to estimate the classification function, and the second of which is classified according to the function. By observing the proportion of correct classifications produced for the second set, one obtains an empirical measure for the success of the discrimination.

- Use of an External set test, which entails the use of an external compound set to check the validity of the selected discriminant functions.
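The first two checks in the list above (the classification matrix and its jack-knifed version), together with Wilks' λ, can be sketched for the two-group case as follows. This is a minimal sketch on synthetic data, assuming equal prior probabilities and a pooled covariance matrix; the function names and the data are hypothetical and do not reproduce the statistical package used by the authors.

```python
import numpy as np

def fit_lda(X, labels):
    """Two-group linear discriminant: DF(x) = x @ w + c, DF > 0 -> group 1.

    With equal priors this is equivalent to assigning each case to the
    group whose mean is closest in Mahalanobis distance.
    """
    X1, X0 = X[labels == 1], X[labels == 0]
    m1, m0 = X1.mean(axis=0), X0.mean(axis=0)
    # Pooled within-group covariance matrix.
    S = ((X1 - m1).T @ (X1 - m1) + (X0 - m0).T @ (X0 - m0)) / (len(X) - 2)
    w = np.linalg.solve(S, m1 - m0)
    c = -0.5 * w @ (m1 + m0)
    return w, c

def classification_matrix(labels, predicted):
    """Rows: true group (0, 1); columns: assigned group (0, 1)."""
    m = np.zeros((2, 2), dtype=int)
    for t, p in zip(labels, predicted):
        m[t, p] += 1
    return m

def jackknife_predictions(X, labels):
    """Classify each case with a function fitted on all the other cases."""
    n = len(labels)
    pred = np.empty(n, dtype=int)
    for i in range(n):
        keep = np.arange(n) != i
        w, c = fit_lda(X[keep], labels[keep])
        pred[i] = int(X[i] @ w + c > 0)
    return pred

def wilks_lambda(X, labels):
    """Wilks' lambda = det(within-group SSCP) / det(total SSCP)."""
    centered = X - X.mean(axis=0)
    total = centered.T @ centered
    within = np.zeros_like(total)
    for g in (0, 1):
        d = X[labels == g] - X[labels == g].mean(axis=0)
        within += d.T @ d
    return np.linalg.det(within) / np.linalg.det(total)

# Synthetic "active" (1) and "inactive" (0) groups with two descriptors each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=2.0, size=(20, 2)),
               rng.normal(loc=-2.0, size=(20, 2))])
labels = np.array([1] * 20 + [0] * 20)

w, c = fit_lda(X, labels)
resub = classification_matrix(labels, (X @ w + c > 0).astype(int))
jack = classification_matrix(labels, jackknife_predictions(X, labels))
lam = wilks_lambda(X, labels)
print(resub, jack, round(lam, 3), sep="\n")
```

The jack-knifed matrix is usually slightly less favourable than the plain classification matrix, because each case is classified by a function that never saw it.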

LDA applications

Inhibition of Trypanosoma cruzi hexokinase by bisphosphonates

American trypanosomiasis, or Chagas disease, is a pathology whose causative agent is the protozoan parasite Trypanosoma cruzi. The infection is transmitted by blood-feeding hemipteran insect vectors of the subfamily Triatominae and affects humans as well as domestic and wild animals. An estimated 18 million people are infected, and one hundred million are at high risk of infection, in fifteen Latin American countries from Mexico to Argentina [42]. Current therapy is based on two drugs, nifurtimox and benznidazole, which have a hazardous level of toxicity and are effective only in the acute stage of the disease. New drugs against this disease are therefore needed. Hexokinase is the first enzyme involved in glycolysis in most organisms, including the etiological agents of Chagas disease (Trypanosoma cruzi) and African sleeping sickness (Trypanosoma brucei). Recent studies have shown that bisphosphonate analogues are potent inhibitors of T. cruzi hexokinase, which may represent a novel target for finding new compounds active against T. cruzi [43]. In this work, the inhibition of T. cruzi hexokinase (TcHK) by a group of bisphosphonate derivatives was investigated to obtain a predictive QSAR model using molecular topology and linear discriminant analysis. A group of 42 bisphosphonate derivatives, inhibitors of TcHK, was selected.

Tables 5 and 6 show the structures and the inhibitory activity, expressed as IC50 (μM) values obtained in the in vitro assays reported for each compound [43, 44]. To obtain the discriminant function, LDA was applied to a training group of 35 compounds; a test group of 7 compounds was used to validate it. The training series comprises two subgroups: an active group (compounds with IC50 < 20 μM) and an inactive group (compounds with IC50 > 20 μM). The test series consists of compounds with IC50 > 20 μM, randomly selected from the inactive group.

Table 5. Structures of bisphosphonates studied, in order of decreasing potency in TcHK inhibition (compounds 4 to 45; structure drawings not reproduced).

Table 6. Results of prediction obtained by linear discriminant analysis with IC50 for TcHK and the bisphosphonate derivatives analysed.

Compound IC50(μM) 1χ G4v J2v J5v DF Class.

Active group training (IC50<20μM)

5 0.81 9.847 1.869 0.252 0.055 1.19 A
6 0.95 6.661 0.782 0.187 0.029 2.25 A
7 1.45 9.721 1.278 0.193 0.045 3.28 A
8 1.82 8.500 2.024 0.383 0.045 -1.85 I
9 1.95 8.321 0.959 0.197 0.038 2.23 A
10 2.29 10.761 2.222 0.321 0.045 3.07 A
11 2.29 9.933 1.884 0.326 0.046 0.88 A
12 2.4 7.604 0.776 0.171 0.035 2.41 A
13 2.75 8.561 1.126 0.191 0.040 2.81 A
14 2.75 9.149 1.286 0.206 0.040 3.31 A
15 3.47 8.821 0.967 0.254 0.038 1.04 A
16 3.63 8.738 1.267 0.212 0.039 2.91 A
17 4.07 7.721 1.248 0.230 0.037 1.67 A
18 12.6 6.618 0.813 0.338 0.036 -4.08 I
19 14.8 7.221 0.976 0.230 0.031 1.56 A

Inactive group training (IC50>20μM)

4 300 6.618 0.804 0.324 0.028 -1.87 I
22 300 7.972 1.906 0.360 0.058 -4.59 I
23 300 6.689 0.808 0.208 0.039 -0.45 I
24 300 8.209 1.149 0.313 0.043 -2.13 I
26 300 8.321 1.149 0.295 0.040 -0.82 I
27 300 9.100 1.495 0.334 0.042 -0.52 I
28 300 8.732 1.365 0.265 0.044 0.55 A
29 300 9.557 2.080 0.403 0.041 -0.43 I
30 300 9.442 1.718 0.172 0.086 -3.47 I
31 300 8.689 1.858 0.338 0.043 -0.34 I
32 300 6.618 1.222 0.368 0.031 -2.83 I
33 300 7.833 1.343 0.321 0.038 -1.14 I
35 300 7.911 1.590 0.224 0.061 -1.93 I
36 300 7.077 0.833 0.218 0.039 -0.30 I
37 300 7.121 2.162 0.255 0.058 -1.26 I
38 300 9.284 1.276 0.268 0.050 -0.55 I
39 300 7.221 1.071 0.258 0.034 0.29 A
40 300 9.512 1.594 0.261 0.068 -3.00 I
42 300 7.118 0.677 0.290 0.030 -1.11 I
43 300 5.693 0.990 0.234 0.053 -4.55 I

Test group

20 34.7 8.232 0.957 0.233 0.030 2.69 A
21 45.7 6.667 0.692 0.349 0.067 -11.10 I
25 300 8.735 2.027 0.392 0.059 -4.83 I
34 300 8.581 2.115 0.305 0.059 -1.83 I
41 300 7.141 2.180 0.241 0.081 -5.51 I
44 300 5.710 0.913 0.233 0.047 -3.53 I
45 300 8.078 1.551 0.180 0.080 -4.33 I

The discriminant function selected was:

$$DF = 5.261 + 1.013\,{}^{1}\chi + 3.011\,G_{4}^{v} - 32.49\,J_{2}^{v} - 207.4\,J_{5}^{v} \qquad \text{(Eq. 9)}$$

N = 35, F = 5.90, λ (Wilks' lambda) = 0.556

A given compound is selected as a potential TcHK inhibitor if DF > 0; otherwise it is classified as "inactive". The classification matrix is highly significant for the training set: 86.7% correct prediction for the active group (13 out of 15 correctly classified) and 90.0% for the inactive group (18 out of 20; see Table 6). A simple way to evaluate the quality of the selected discriminant function is to apply it to an external test group. Here we used 7 compounds that were not included in the discriminant analysis, all of them with IC50 > 20 μM. Table 6 shows the results obtained for each compound. All the compounds except number 20 are correctly classified as inactive (DF < 0). For more details see ref. [45].
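As a worked check, Eq. 9 can be applied directly to the descriptor values listed in Table 6. The sketch below does this for a handful of compounds and recomputes the reported classification rates; the function and variable names are illustrative choices, and small differences from the tabulated DF values are to be expected because the descriptors in Table 6 are rounded to three decimals.

```python
def df_tchk(chi1, g4v, j2v, j5v):
    """Discriminant function of Eq. 9 (coefficients taken from the text)."""
    return 5.261 + 1.013 * chi1 + 3.011 * g4v - 32.49 * j2v - 207.4 * j5v

def classify(df):
    """DF > 0 -> predicted active ("A"); otherwise inactive ("I")."""
    return "A" if df > 0 else "I"

# compound: (1chi, G4v, J2v, J5v, DF reported in Table 6, class reported)
TABLE6_ROWS = {
    5: (9.847, 1.869, 0.252, 0.055, 1.19, "A"),
    6: (6.661, 0.782, 0.187, 0.029, 2.25, "A"),
    8: (8.500, 2.024, 0.383, 0.045, -1.85, "I"),
    18: (6.618, 0.813, 0.338, 0.036, -4.08, "I"),
    4: (6.618, 0.804, 0.324, 0.028, -1.87, "I"),
    28: (8.732, 1.365, 0.265, 0.044, 0.55, "A"),
    20: (8.232, 0.957, 0.233, 0.030, 2.69, "A"),
}

results = {}
for compound, (chi1, g4v, j2v, j5v, df_table, cls_table) in TABLE6_ROWS.items():
    df = df_tchk(chi1, g4v, j2v, j5v)
    results[compound] = (df, classify(df))
    print(f"compound {compound:>2}: DF = {df:6.2f} (table {df_table:6.2f}), "
          f"class {classify(df)} (table {cls_table})")

def percent_correct(correct, total):
    return 100.0 * correct / total

# Counts quoted in the text for the training set.
active_rate = percent_correct(13, 15)
inactive_rate = percent_correct(18, 20)
print(f"active: {active_rate:.1f}%  inactive: {inactive_rate:.1f}%")
```

Compounds 8 and 18 (active but DF < 0) and compounds 28 and 39 (inactive but DF > 0) are the four training-set misclassifications behind the 13/15 and 18/20 figures; compound 20 is the single misclassified test compound.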


Antiparasitic activity against Leishmania donovani

Leishmaniasis is a parasitic disease caused by protozoa of the genus Leishmania. It is transmitted by the bite of a sandfly belonging to the Phlebotominae subfamily. The main reservoirs are wild or domestic mammals, including humans for some species such as L. donovani and L. tropica [46]. The disease is endemic in 88 countries on four continents. It is currently a real public health problem affecting 12 million people, with an incidence of 1.5-2 million new cases per year [47]. Leishmaniasis is considered by the WHO as one of the orphan diseases to be included in the TDR programme (Special Programme for Research and Training in Tropical Diseases) [48]. Here we focus on some dinitrobenzene sulphonamides, all of them derivatives of the herbicide oryzalin. These compounds exhibit proven anti-Leishmania activity: they inhibit the polymerization of tubulin in purified parasites, thereby hindering the growth of the parasite in the G2/M phases of the cell cycle [49]. The study used discriminant analysis to obtain a topological model capable of classifying correctly the activity or inactivity of a given compound against Leishmania donovani. Once built, the model could be applied to the search for new potentially active compounds. A set of 57 compounds was used: 37 were 3,5-dinitrobenzene derivatives and 20 were selected from the Maybridge Organics Compounds database [50], following the results of Werbovetz and co-workers [49]. Table 7 shows both the chemical structure and the antiparasitic activity of the selected compounds, expressed as IC50 (μM). To obtain the discriminant function, LDA was applied to a training set comprising all the active compounds (IC50 < 20 μM) plus 80% of the inactive compounds (IC50 > 20 μM). The remaining 20% of the molecules were used as an external validation set to check the performance of the chosen discriminant function, DF, which was:

$$DF = -7.74 - 5.32\,{}^{4}\chi_{pc} - 0.542\,G_{1}^{v} + 5.11\,G_{3} + 5.16\,G_{4}^{v} + 18.52\,J_{2}^{v} - 141.58\,J_{3}^{v} + 2.51\,{}^{3}D_{c} \qquad \text{(Eq. 10)}$$

N = 48, F = 8.97, λ (Wilks' lambda) = 0.389

According to this equation, a compound is classified as active if DF > 0; otherwise it is classified as inactive. The resulting classification matrix is very significant: 100% of the active compounds are correctly classified, and 90.6% of the inactive ones (29 out of 32 molecules) are also correctly placed. Altogether, the rate of correct classification is 93.8%. To check the ability of the equation to identify novel structures, an external validation test is necessary. For this purpose, a set of 10 compounds not included in the training set, randomly selected and all showing IC50 values above 20 μM, was tested with the model. As observed in Table 7, column 5, all compounds except no. 31 are correctly placed by the model as inactive, which implies a success rate of 90%. For more details see ref. [34].

Table 7. Chemical structure of the studied compounds and results of the classification of their activity against L. donovani (A = active; I = inactive). General structure of the dinitrobenzene derivatives (drawing not reproduced): substituents NRR, two NO2 groups and X. Entries marked "drawn structure" and atom labels in square brackets come from structure drawings that are not reproduced.

Compound | R | X | IC50 exp (μM) a | DF b | Class.

Active group training (IC50<20μM)

46 | drawn structure | - | 0.5 | 0.87 | A
48 | drawn structure | - | 2.3 | 5.23 | A
33 | n-propyl | SO2HN [F] | 2.5 | 3.09 | A
11 | n-butyl | SO2HN | 2.6 | 3.66 | A
02 | n-propyl | SO2HN [Cl, Cl] | 3.7 | 5.72 | A
01 | n-propyl | SO2HN | 5 | 1.42 | A
32 | n-propyl | SO2HN [Cl, Cl] | 5 | 4.55 | A
17 | n-propyl | SO2HN [Cl] | 5.5 | 3.53 | A
12 | n-butyl | SO2HN [F] | 5.6 | 4.69 | A
23 | n-butyl | SO2HN [F] | 5.7 | 5.29 | A
30 | n-propyl | SO2HN [OCH3] | 8.1 | 2.69 | A
36 | n-pentyl | SO2NH2 | 9 | 2.9 | A
35 | n-ethyl | SO2HN | 11 | 1.32 | A
37 | n-hexyl | SO2NH2 | 12 | 3.83 | A
16 | n-propyl | SO2HN [Cl] | 13 | 3.34 | A
26 | n-butyl | SO2NH2 | 20 | 1.82 | A

Inactive group training (IC50>20μM)

04 | n-propyl | SO2N(Me) | 21 | -6.38 | I
44 | drawn structure | - | 25 | -3.34 | I
05 | n-propyl | SO2N(CH2CH3)2 | 27 | -7.46 | I
15 | n-propyl | SO2HN [CH3] | 32 | 2.88 | I
29 | n-propyl | SO2HN [OCH3] | 32 | 1.83 | I
45 | drawn structure | - | 32 | -5.88 | I
54 | drawn structure | - | 37 | -5.93 | I
55 | drawn structure | - | 39 | -4.74 | I
09 | n-propyl | SO2HN | 43 | -3.97 | I
19 | n-propyl | SO2NH(CH2)4CH3 | 43 | -1.21 | I
22 | drawn structure | - | 43 | -6.71 | I
10 | n-propyl | SO2HN | 50 | -1.22 | I
06 | n-propyl | SO2NH(CH2)2CH3 | 54 | -3.25 | I
07 | n-propyl | SO2N(CH2CH2CH3)2 | 55 | -6.14 | I
03 | n-propyl | SO2HN [N] | 60 | 1.35 | I
Oryzalin | n-propyl | SO2NH2 | 65 | -0.7 | I
43 | drawn structure | - | 66 | -4.05 | I
24 | H, n-propyl | SO2NH2 | 67 | -1.28 | I
25 | n-ethyl | SO2NH2 | 69 | -0.48 | I
53 | drawn structure | - | 80 | -3.34 | I
27 | drawn structure | - | 90 | -9.36 | I
57 | drawn structure | - | 98 | -3.17 | I
13 | n-propyl | CN | >100 | -4.55 | I
14 | n-propyl | [N, C, OH, NH2] | >100 | -1.45 | I
34 | n-propyl | SO2N | >100 | -0.25 | I
38 | drawn structure | - | >100 | -7.15 | I
39 | drawn structure | - | >100 | -1.28 | I
40 | drawn structure | - | >100 | -4.54 | I
49 | drawn structure | - | >100 | -3.31 | I
50 | drawn structure | - | >100 | -3.05 | I
51 | drawn structure | - | >100 | -8.57 | I

Test group

42 | drawn structure | - | 21 | -9.01 | I
31 | n-propyl | SO2HN [CH3] | 23 | 3.68 | A
21 | n-propyl | SO2N | 47 | -8.72 | I
18 | n-propyl | SO2N((CH2)3CH3)2 | 48 | -3.39 | I
56 | drawn structure | - | 60 | -9.48 | I
41 | drawn structure | - | 72 | -14.41 | I
28 | n-propyl | CONH2 | 76 | -3.23 | I
47 | drawn structure | - | >100 | -9.33 | I
52 | drawn structure | - | >100 | -8.90 | I

a Values of IC50 (μM) from reference [49].
b Values of the discriminant function, Eq. 10.

2.3. Molecular topology and drug design

The use of molecular topology as a tool for the design and selection of new drugs has been the main objective of our group for over twenty years [51-54]. The outcome has been the discovery of many new active compounds in different therapeutic areas, as illustrated in Table 8. Details on assays and protocols can be found in the references therein. Some of these compounds can be considered new leads that may serve as the starting point for the design of new and more effective drugs.

Table 8. New biological activities discovered through virtual screening. For details see the references in the last column.

Activity found | Selected drugs | Ref.

Cytostatic | 6-azuridine, quinine | [55]

Antibacterial | 1-Chloro-2,4-dinitrobenzene, 3-Chloro-5-nitroindazole, 1-Phenyl-3-methyl-2-pyrazolin-5-one, neohesperidin, amaranth, mordant brown 24, hesperidin, morine, niflumic acid, silymarine, fraxine | [56,57]

Antifungal | Neotetrazolium chloride, benzotropine mesilate, 3-(2-Bromoethyl)-indole, 1-Chloro-2,4-dinitrobenzene | [58]

Hypoglycaemic | 3-Hydroxybutyl acetate, 4-(3-Methyl-5-oxo-2-pyrazolin-1-yl) benzoic acid, 1-(Mesitylene-2-sulfonyl) 1H-1,2,3-triazole | [59]

Antivirals (anti-Herpes) | 3,5-dimethyl-4-nitroisoxazole, nitrofurantoin, (pyrrolidinocarbonylmethyl)piperazine, nebularine, cordycepin, adipic acid, thymidine, α-thymidine, inosine, 2,4-diamino-6-(hydroxymethyl)pteridine, 7-(carboxymethoxy)-4-methylcoumarin, 5-methylcytidine | [60]

Antineoplastic | Carminic acid, tetracycline, piromidic acid, doxycycline | [61]

Antimalarial | Monensin, nigericin, vinblastine, vincristine, vindesine, ethylhydrocupreine, quinacrine, salinomycin | [62]

Antitoxoplasma | Cefamandole nafate, prazosin, andrographolide, dibenzothiophene sulfone, 2-Acetamido-4-methyl-5-thiazolesulfonyl chloride | [63]

Antihistaminic | Benzydamine, 4-(1-Butylpentyl)pyridine, N-(3-Bromopropyl)phthalimide, N-(3-Chloropropyl)phthalimide, N-(3-Chloropropyl)piperidine hydrochloride, 5-Bromoindole | [64]

Bronchodilator | Griseofulvin, anthrarobin, 9,10-Dihydro-2-methyl-4H-benzo[5,6]cyclohept[1,2-d]oxazol-4-ol, 2-Aminothiazole, maltol, esculetin, fisetin, hesperetin, 4-methylumbelliferyl 4-guanidinobenzoate | [65]

Analgesics | 2-(1-propenyl)phenol, 2',4'-dimethylacetophenone, p-chlorobenzohydrazide, 1-(p-chlorophenyl)propanol, 4-benzoyl-3-methyl-1-phenyl-2-pyrazolin-5-one | [66,67]


The overall process for topologically driven drug design can be summarized in the following steps:

- Step I: Selection of the therapeutic group and the most representative drugs therein (training set).

- Step II: Search for the physicochemical, biological and pharmacological information on every drug in the training set.

- Step III: Calculation of the adequate topological descriptors.

- Step IV: QSAR analysis by different statistical techniques such as multilinear regression (MLR), linear discriminant analysis (LDA), neural networks (NN), etc.

- Step V: Selection of the best topological model. Usually it is formed by both predictive equations and discriminant functions.

- Step VI: Application of the model to the search for new drugs, through database screening, chemical libraries from combinatorial chemistry or de novo design if necessary.

- Step VII: Finally, the selected compounds should be tested in the laboratory to confirm their predicted activity. Typically the outcome is used for further refinement of the molecular candidates until the goals are fulfilled.

Just as an example, the most significant results recently achieved in antineoplastics and antimalarials are shown.

Antineoplastic agents

Of particular relevance in this field has been the design of a novel lead compound named MT477 (see Figure 7), which has shown very potent activity in vivo against human cell carcinoma [68]. MT477 is a novel thiopyrano[2,3-c]quinoline that has been identified using molecular topology screening as a potential anticancer drug with high activity against protein kinase C (PKC) isoforms. The objective of this study was to determine the mechanism of action of MT477 and its activity against human cancer cell lines. MT477 interfered with PKC activity as well as with phosphorylation of Ras and ERK1/2 in H226 human lung carcinoma cells. It also induced poly-caspase-dependent apoptosis.

Figure 7. Chemical structure of MT477. The chemical name of MT477 is dimethyl 5,6-dihydro-7-methoxy-5,5-dimethyl-6-(2-(2,5-dioxopyrrolidin-1-yl)acetyl)-1H-1-(4,5-dimethoxycarbonyl-1,3-dithiolo-2-spiro)thiopyrano[2,3-c]quinoline-2,3-dicarboxylate.

Antimalarial agents against liver stages of Plasmodium

Each year, the malaria parasite Plasmodium falciparum infects 300 to 660 million persons worldwide and causes several million deaths [69]. New antimalarial drugs are urgently needed, especially considering the increasing prevalence of drug-resistant P. falciparum strains and the lack of effective vaccines and vector control measures. The Plasmodium liver stage is an interesting drug target, as it precedes the emergence of the blood stages that cause the symptoms and complications of malaria. Drugs that inhibit parasite maturation within hepatocytes could be used for short-term prophylaxis in areas of endemicity (refugees, travelers, etc.). We conducted a quantitative structure-activity relationship (QSAR) study based on a database of 127 compounds previously tested against the liver stage of Plasmodium yoelii in order to develop a model capable of predicting the in vitro antimalarial activities of new compounds. Topological indices were used as structural descriptors, and their relation to antimalarial activity was determined by linear discriminant analysis. A topological model consisting of two discriminant functions, DF1 and DF2, was created:

$$DF_{1} = 3.17 + 1.00\,{}^{0}\chi^{v} + 1.28\,{}^{1}\chi^{v} - 14.04\,J_{1}^{v} - 22.94\,J_{3}^{v} + 96.23\,J_{4}^{v} - 65.98\,J_{5} + 1.88\,{}^{0}D - 23.53\,{}^{4}D_{c} + 0.29\,{}^{4}C_{c} - 0.51\,PR3 + 0.38\,V3 \qquad \text{(Eq. 11)}$$

N = 76, F = 7.95, λ = 0.42

$$DF_{2} = 81.90 - 3.45\,{}^{4}\chi_{p}^{v} + 7.41\,G_{5}^{v} - 70.21\,{}^{1}C - 1.44\,PR1 - 1.21\,V3 + 2.44\,V4 \qquad \text{(Eq. 12)}$$

N = 28, F = 12.4, λ = 0.21


The first function, DF1, discriminated between active and inactive compounds, and the second, DF2, identified the most active among the active compounds. The model was then applied sequentially to a large database of compounds with unknown activity against liver stages of Plasmodium. Seventeen drugs that were predicted to be active or inactive were selected for testing against the hepatic stage of P. yoelii in vitro (see Table 9). Antiretroviral, antifungal, and cardiotonic drugs were found to be highly active (nanomolar 50% inhibitory concentration values), and two ionophores completely inhibited parasite development. The 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) assay was performed on hepatocyte cultures for all compounds, and none of these compounds were toxic in vitro. For both ionophores, the same in vitro assay as that used for P. yoelii confirmed their activity against Plasmodium falciparum. For more details see ref. [70].
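The sequential use of the two functions can be sketched as a two-stage filter. The sketch below is an assumption-laden illustration, not the authors' screening code: it takes zero as the cut-off for both DF1 and DF2 (by analogy with Eqs. 9 and 10) and ignores the "non classified" (NC) range used in Table 9, whose limits are not stated in the text. The DF values are copied from Table 9; the function name `classify_liver_stage` is hypothetical.

```python
from typing import Optional

def classify_liver_stage(df1: float, df2: Optional[float]) -> str:
    """Two-stage classification: DF1 separates active from inactive,
    DF2 then separates highly active from active.

    Assumed cut-offs: DF1 > 0 -> active, DF2 > 0 -> highly active.
    The NC (non classified) window of Table 9 is deliberately not modelled.
    """
    if df1 <= 0:
        return "I"           # inactive: DF2 is not evaluated
    if df2 is None:
        return "A"           # active, no second-stage value available
    return "HA" if df2 > 0 else "A"

# (DF1, DF2) values as listed in Table 9; None where DF2 is not reported.
CANDIDATES = {
    "Mibefradil": (7.59, 6.66),
    "Miconazole": (1.46, 1.77),
    "Ritonavir": (8.80, -5.38),
    "Saquinavir": (9.14, -6.76),
    "Fenbendazol": (-3.33, None),
    "Quinacrine": (-0.93, None),
}

predicted = {name: classify_liver_stage(df1, df2)
             for name, (df1, df2) in CANDIDATES.items()}

for name, label in predicted.items():
    print(f"{name:<12} {label}")
```

Applying the cheaper first-stage filter to the whole database and the second function only to the compounds that pass keeps the ranking step focused on plausible actives.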

Table 9. Predicted drug activity on liver stage of Plasmodium yoelii yoelii.

Drug (therapeutic category) | DF1 (class) a | DF2 (class) a | IC50 exp (nM) c

Active drugs

Monensin (Antibacterial/Ionophore) | 4.22 (A) | -21.62 (NC) | < 10^-3
Nigericin (Ionophore) | 3.36 (A) | -24.22 (NC) | < 10^-3
Delavirdine (Antiviral) | -0.21 (NC) | 6.75 (HA) | 0.846
Mibefradil (Antihypertensive) | 7.59 (A) | 6.66 (HA) | 0.873
Licochalcone A (Estrogenic flavonoid) | 8.26 (A) | 7.99 (HA) | 0.927
Miconazole (Antifungal) | 1.46 (A) | 1.77 (HA) | 2.03
Dobutamine (Cardiotonic) | 5.74 (A) | 1.52 (HA) | 3.7
Ritonavir (Antiviral) | 8.80 (A) | -5.38 (A) | 34.2
Saquinavir (Antiviral) | 9.14 (A) | -6.76 (A) | 35.2
Epoxomicin (Antineoplastic) | 8.17 (A) | 7.92 (HA) | 3.95 × 10^3
Indinavir (Antiviral) | 8.89 (A) | -10.55 (A) | 5 × 10^3
Vinblastine (Antineoplastic) | 1.21 (A) | 38.71 (NC) | 7.95 × 10^3
Nordihydroguaiaretic acid (Antineoplastic) | 9.03 (A) | 2.88 (HA) | 3 × 10^4

Inactive drugs

Fenbendazol (Anthelminthic) | -3.33 (I) | / | 3 × 10^4
Quinacrine (Anthelminthic/Antimalarial) | -0.93 (I) | / | 3 × 10^4
Rimantadine (Antiviral) | -2.69 (I) | / | 35.6
Thiophanate (Anthelminthic) | -4.04 (I) | / | 3 × 10^4

Reference drugs

Atovaquone (ref. drug) | 9.37 (A) | 3.63 (HA) | 57
Primaquine (ref. drug) | 0.91 (A) | 1.79 (HA) | 75.7

a A = active; I = inactive; HA = highly active; NC = non classified.

3. Conclusions

The results outlined here clearly demonstrate that molecular topology (MT) based QSAR has become a powerful tool for the prediction of properties and for the design and selection of novel drugs. Furthermore, the fact that MT consists of a purely mathematical description of molecular structure, beyond any geometrical or physical profile, is also an important asset of the approach. The reasons why MT works with such a level of efficacy remain an open question and probably constitute a good challenge to be addressed in the near future.

4. Acknowledgements

The authors acknowledge financial support from the Fondo de Investigación Sanitaria, Ministerio de Sanidad, Spain (project: SAF2005-PI052128). We also thank Prof. Eduardo Castro for his help in our research and his contributions to the field.

5. References

1. Hansch, C., Fujita, T., 1964, J. Amer. Chem. Soc., 86, 1616.
2. Free, S.M., Wilson, J.W., 1964, J. Med. Chem., 7, 395.
3. Cramer, R.D., Patterson, D.E., Bunce, J.D., 1988, J. Am. Chem. Soc., 110, 5959.
4. Güner, O.F., 2002, Curr. Top. Med. Chem., 2, 1321.
5. Klebe, G., Abraham, U., Mietzner, T., 1994, J. Med. Chem., 37, 4130.
6. Robinson, D.D., Winn, P.J., Lyne, P.D., Richards, W.G., 1999, J. Med. Chem., 42, 573.
7. Kier, L.B., Hall, L.H., 1976, Molecular Connectivity in Chemistry and Drug Research. Academic Press, London.
8. Devillers, J., 2000, Current Opinion in Drug Discovery and Development, 3(3), 275.
9. Diudea, M.V., Florescu, M.S., Khadikar, P.V., 2006, Molecular Topology and its Applications, EfiCon Press, Bucharest.
10. Pogliani, L., 2000, Chem. Rev., 100, 3827.
11. Ivanciuc, O., Balaban, A.T., 1998, Tetrahedron, 54, 9129.
12. Hosoya, H., Gotoh, M., Murakami, M., Ikeda, S., 1999, J. Chem. Inf. Comput. Sci., 39, 192.
13. Garcia-Domenech, R., Galvez, J., de Julian-Ortiz, J.V., Pogliani, L., 2008, Chem. Rev., 108(3), 1127.
14. Basak, S.C., Mills, D., Gute, B.D., Natarajan, R., 2006, Topics in Heterocyclic Chemistry, 3, 39.
15. Balaban, A.T., Motoc, I., Bonchev, D., Mekenyan, O., 1983, Topics in Current Chemistry, 114, 21.
16. Estrada, E., Uriarte, E., 2001, Current Medicinal Chemistry, 8(13), 1573.
17. Marrero-Ponce, Y., 2004, Bioorganic & Medicinal Chemistry, 12(24), 6351.
18. Dudek, A.Z., Arodz, T., Galvez, J., 2006, Combinatorial Chemistry: High Throughput Screening, 9(3), 213.


19. Galvez, J., Garcia-Domenech, R., 1994, Farmaindustria, 357.
20. Gálvez, J., García-Domenech, R., Julián-Ortiz, J.V. de, Soler, R., 1995, J. Chem. Inf. Comp. Sci., 35, 272.
21. Wiener, H., 1947, J. Am. Chem. Soc., 69, 17.
22. Kier, L.B., Hall, L.H., 1986, Molecular Connectivity in Structure-Activity Analysis, Wiley, New York.
23. Randić, M., 1975, J. Am. Chem. Soc., 97, 6609.
24. Gálvez, J., García-Domenech, R., Salabert, M.T., Soler, R., 1994, J. Chem. Inf. Comp. Sci., 34, 520.
25. Kier, L.B., Hall, L.M., 1989, Pharm. Res., 6, 497.
26. Furnival, G.M., Wilson, R.W., 1974, Technometrics, 16, 499.
27. Hocking, R.R., 1972, Technometrics, 14, 967.
28. Allen, D.M., 1974, Technometrics, 16, 125.
29. Dixon, W.J., Brown, M.B., Engelman, L., Jennrich, R.I., 1990, BMDP Statistical Software Manual, Vol. I, University of California Press, Berkeley, 3390-4358.
30. Sanchez-Ferrer, A., Rodriguez-Lopez, J.N., García-Cánovas, F., García-Carmona, F., 1995, Biochim. Biophys. Acta, 1247, 1.
31. Chase, M.R., Raina, K., Bruno, J., Sugumaran, M., 2000, Insect Biochem. Molec., 30, 953.
32. Ashida, M., Brey, P.T., 1995, Proc. Natl. Acad. Sci. U.S.A., 92, 10698.
33. García-Domenech, R., Calvo-Chamorro, M.L., Cuervo-Arias, A.Y., Gómez-Sucerquia, L.J., Ortega-Chávez, V., Pérez-Torrado, E., Gálvez, J., 2008, Afinidad, 538, 430.
34. García-Domenech, R., Domingo-Puig, C., Esteve-Martinez, M.A., Schmitt, J., Vera-Martinez, J., Chindemi, A.L., Galvez, J., 2008, Anales de la Real Academia de Farmacia, 74, 345.
35. Garcia-Domenech, R., Lopez-Peña, W., Sanchez-Perdomo, Y., Sanders, J.R., Sierra-Araujo, M.M., Zapata, C., Galvez, J., 2008, International Journal of Pharmaceutics, 363, 78.
36. Garcia-Domenech, R., Alarcon-Elbal, P., Bolas, G., Bueno-Mari, R., Chorda-Olmos, F.A., Delacour, S.A., Mourino, M.C., Vidal, A., Galvez, J., 2007, SAR and QSAR in Environmental Research, 18, 745.
37. García-Domenech, R., Villanueva, A., Gálvez, J., 2008, Organic Chemistry: An Indian Journal,
38. Kachigan, S.L., 1991, Multivariate Statistical Analysis, Radius Press, New York.
39. McFarland, J.W., Gans, D.J., 1986, Journal of Medicinal Chemistry, 30, 46.
40. Wold, S., Eriksson, L., 1995, Statistical validation of QSAR results. In: Van de Waterbeemd, H. (ed), Chemometric Methods in Molecular Design. VCH, New York.
41. Tabachnick, B.G., Fidell, L.S., 1990, Using Multivariate Statistics, Harper Collins, New York.
42. World Health Organization. Burdens and Trends in Chagas disease. Available from: www.who.int/ctd/chagas/burdens


43. Hudock, M.P., Sanz-Rodriguez, C.E., Song, Y., Chan, J.M., Zhang, Y., Odeh, S., et al. 2006, J. Med. Chem., 49(1), 215.

44. Racagni, G.E., Machado de Doménech, E.E., 1983, Mol. Biochem. Parasitol., 9(2), 181.

45. García-Domenech, R., Espinoza, N., Galarza, R.F., Moreno-Padilla, M.J., Rojas-Ruiz, B., Roldan-Arroyo, LL., Sanchez-Lavado, M.I., Gálvez, J., 2008, Ars Pharmaceutica, 49(3), .

46. Cheng, C., 1986, General Parasitology, Academic Press College Division. Orlando.

47. Weekly epidemiological record. WHO 2002, (44), 77, 365-372.
48. Weekly epidemiological record. WHO 2002, (25), 77, 205-212.
49. Delfín, D.A., Bhattacharjee, A.K., Yakovich, A.J., Werbovetz, K.A., 2006, J. Med. Chem., 49, 4196.
50. Data base: Maybridge Organics Compounds, http://www.maybridge.com
51. Arviza, M.P., 1985, Predicción e interpretación de algunas propiedades fisicoquímicas y biológicas de un grupo de barbitúricos y sulfonamidas por el método de conectividad molecular. Doctoral Thesis, Universidad de Valencia, Spain.

52. Bernal, J., 1988, Desarrollo de un nuevo método de diseño molecular asistido por ordenador. Su aplicación a fármacos betabloqueantes y benzodiazepinas. Doctoral Thesis, Universidad de Valencia, Spain.

53. Gálvez, J., García-Domenech, R., Bernal, J. M., García-March, F. J., 1991, An. Real Acad. Farm., 57, 533.

54. Garcia-Domenech,R., Gálvez, J., Garcia-March, F.J., Moliner, R., 1991, Drug. Invest., 3(5), 344.

55. Gálvez, J., García-Domenech, R., Gómez-Lechón, M.J., Castell, J.V., 2000, J. Mol. Struc. (Theochem), 504, 241.

56. de Gregorio-Alapont, C., García-Domenech, R., Gálvez, J., Ros, M.J., Wolski, S., García, M.D., 2000, Bioorg. Med. Chem. Lett., 10, 2033.

57. Gálvez, J., García-Domenech, R., Gregorio Alapont, C.de, Julián-Ortiz, J. V. de, Salabert-Salvador, M. T., Soler-Roca, R., 1996, In Advances in Molecular Similarity. Vol. I , Carbó-Dorca, R., Mezey, P. G., JAI Press Inc : London.

58. Pastor L., García-Domenech, R., de Gregorio Alapont, C., Gálvez, J., 1998, Bioorg. Med. Chem. Lett., 8, 2577.

59. Antón-Fos, G. M., García-Domenech, R., Pérez-Giménez F., Peris-Ribera, J. E., García-March, F., Salabert-Salvador, M. T., 1994, Arzneim-Forsch/Drug Res., 44, 821.

60. De Julian-Ortiz, J.V., Galvez, J., Muñoz-Collado, C., Garcia-Domenech, R., Jimeno-Cardona, C., 1999, J. Med. Chem., 42, 3308.

61. Gálvez, J., Gómez-Lechón, M. J., García-Domenech, R., Castell, J. V., 1996, Bioorg. & Med. Chem. Lett., 6, 2301.

62. Mahmoudi, N., de Julian-Ortiz, J.V., Ciceron, L., Galvez, J., Mazier, D., Danis, D., Derouin, F, Garcia-Domenech, R., 2006, J. Antimic. Chemother., 57, 489.

63. Gozalbes, R., Gálvez, J., García-Domenech, R., Derouin, F., 1999, SAR QSAR Environ. Res., 10, 47.


64. Casabán-Ros, E., Antón-Fos, G. M., Gálvez, J., Duart, M. J., García-Domenech, R., 1999, Quant. Struct.-Act. Relat., 18, 35.

65. 65a: Rios-Santamarina, I., García-Domenech, R., Cortijo, J., Santamaria, P., Morcillo, E.J., Gálvez, J., 2002, Internet Electron. J. Mol. Des., 1, 70. 65b: Ríos-Santamarina, I., García-Domenech, R., Gálvez, J., Santamaría, P., Cortijo, J., Morcillo, E.J., 1998, Bioorg. Med. Chem. Lett., 8, 477.

66. García-Domenech, R., García-March, F.J., Soler, R.M., Gálvez, J., Antón-Fos, G.M., de Julián-Ortiz, J.V., 1996, Quant. Struct.-Act. Relat., 15, 201.

67. Gálvez, J., García-Domenech, R., de Julián-Ortiz, J.V., Soler, R., 1994, J. Chem. Inf. Comput. Sci., 34, 1198.

68. Jasinski, P., Welsh, B., Galvez, J., Land, D., Zwolak, P., Ghandi, L., Terai, K., Dudek, A. Z., 2008, Investigational New Drugs., 26, 223.

69. Snow, R. W., Guerra, C. A., Noor, A. M., Myint, H. Y., Hay, S. I., 2005, Nature, 434, 214.

70. Mahmoudi, N., Garcia-Domenech, R., Galvez, J., Farhati, K., Franetich, J.F., Sauerwein, R., Hannoun, L., Derouin, F., Danis, M., Mazier, D., 2008, Antimicrobial Agents and Chemotherapy, 52(4), 1215.


QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 95-116 ISBN: 978-81-308-0404-0

Editor: Eduardo A. Castro

4. QSAR models constructed by optimal descriptors and by multiple regression analysis for the prediction of carbonic anhydrase II inhibitory activity of substituted aromatic sulfonamides

Georgia Melagraki1, Antreas Afantitis1,2, Andrey A. Toropov3, Haralambos Sarimveis1 and Olga Igglessi-Markopoulou1

1School of Chemical Engineering, National Technical University of Athens
2Department of Chemoinformatics, NovaMechanics Ltd, Nicosia, Cyprus
3Uzbek Academy of Science Institute of Geology and Geophysics, Tashkent, Uzbekistan

Abstract. In this work two different approaches are presented for the modeling and prediction of the carbonic anhydrase II (CA II) inhibitory activity of substituted aromatic sulfonamides. The predictive models were obtained by means of optimal descriptors calculated with the simplified molecular input line entry system (SMILES) and by multiple linear regression analysis (MLR). Both models describe and predict the CA II inhibitory activity with statistical significance. The statistical results obtained from the two approaches are the following. SMILES based optimal descriptors: n=30, $R^2$=0.76, s=0.252, F=89 (training set); n=17, $R^2_{pred}$=0.81, s=0.210 (test set). MLR: n=30, $R^2$=0.77, $Q^2$=0.66, s=0.268, F=20.8 (training set); n=17, $R^2_{pred}$=0.76, s=0.292 (test set).

Correspondence/Reprint request: Dr. Andrey A. Toropov, Uzbek Academy of Science Institute of Geology and Geophysics, Tashkent, Uzbekistan. E-mail: [email protected]


1. Introduction

The carbonic anhydrases (CAs) are ubiquitous zinc enzymes, present in prokaryotes and eukaryotes. These enzymes catalyze a very simple physiological reaction, the interconversion between carbon dioxide and the bicarbonate anion, and are thus involved in crucial physiological processes associated with pH control, ion transport and fluid secretion [1]. The CA enzymes found in mammals are divided into four broad subgroups which, in turn, consist of several isoforms. Carbonic anhydrase II is a cytosolic isoform and one of the fourteen forms of human α-carbonic anhydrase.

Sulfonamides are an important class of biologically active compounds. Among them, those that inhibit the zinc enzyme carbonic anhydrase have many applications as diuretic, antiglaucoma, antiepileptic or antithyroid drugs. Acetazolamide, methazolamide, ethoxzolamide and dichlorophenamide are among the sulfonamides with carbonic anhydrase (CA, EC 4.2.1.1) inhibitory properties that have been used for therapeutic purposes. More recently, indisulam has entered Phase II clinical trials as an anticancer agent to treat solid tumors [2].

A quantitative structure-activity relationship (QSAR) is a tool for numerically estimating biochemical endpoints of interest for substances for which experimental data are missing. The carbonic anhydrase II inhibitory activity (Ki) of substituted aromatic sulfonamides is an important endpoint from the biochemical and medicinal points of view. Several attempts to model and predict CA II inhibitory activity quantitatively have been reported in the literature. Among other scientists, Supuran and his group have presented a great amount of work in the field [3,4]. A review on the subject [5] indicates the number and importance of these attempts and could be a useful guide to the interested reader.

A summary of the QSAR studies recently conducted in the field is given next. A QSAR study of two sets of carbonic anhydrase inhibitors was presented by A.T. Balaban et al. [6], using a variety of molecular descriptors including topological indices. The first set consisted of 29 benzenesulfonamides and the second set included 35 sulfanilamide Schiff bases; statistically significant models were obtained for both sets. P.V. Khadikar et al. [7] presented QSAR studies on 29 benzenesulfonamide carbonic anhydrase inhibitors using distance-based topological indices, and investigated the need to include the hydrophobic parameter (logP) in the topological modeling of CA II inhibition. P.V. Khadikar et al. [8] used topological and quantum-theoretical descriptors for the prediction of CA II inhibition on a large database of 95 compounds.


A dataset of 47 para-substituted aromatic sulfonamides has been investigated using information topological indices [9], and multi-parameter models with significant accuracy were proposed. More recently, Khadikar and co-workers published a study on the para-substituted compounds using information, distance-based and connectivity indices and combinations of these three categories [10]. Their work resulted in a nine-parameter and an eight-parameter model with R2 values of 0.8375 and 0.8343, respectively [10].

In the present study two different techniques are employed to develop a quantitative relationship between CA II inhibitory activity and sulfonamide structure: optimal descriptors calculated with the simplified molecular input line entry system (SMILES) and multiple linear regression analysis (MLR). The predictive ability of both approaches is tested using various statistical techniques. The database contains the structures of the substituted aromatic sulfonamides shown in Table 1.

Table 1. Structures and numbers of para-substituted aromatic sulfonamides under consideration.

[Structure drawings of compounds 1-47 are not reproduced here; the SMILES notation of each compound is listed in Table 6.]

2. Method

The two models were developed with the methodologies described below.

2.1. SMILES based optimal descriptors

The optimal descriptors calculated with SMILES [11-13] and used in the present study are defined as

$$DCW = \prod_{k=1}^{N} CW(SA_k) \qquad (1)$$

where $SA_k$ is an attribute of the SMILES, $CW(SA_k)$ is its correlation weight, and the product runs over the N attributes of the SMILES. The SMILES attributes follow this hierarchy: (i) global attributes; (ii) attributes that contain two characters (for instance, Cl, Br, etc.); and (iii) all others, which contain one SMILES character. The global SMILES attribute is the number of brackets (an indicator of branching in the molecular structure); this kind of $SA_k$ is denoted by "xxx" (where x is a digit). With the Monte Carlo method one can calculate numerical values of $CW(SA_k)$ that produce the largest possible correlation coefficient between the endpoint of interest and the optimal descriptor DCW over a training set. Based on the $CW(SA_k)$ values, one can calculate a single-variable model of the following form:

$$K = C_0 + C_1 \cdot DCW \qquad (2)$$
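As an illustration of Eq. 1, the following minimal Python sketch computes DCW for a SMILES string from the probe 1 correlation weights reported in Table 4 and then applies the fitted line of Eq. 7. The function names (`smiles_attributes`, `dcw`, `log_ki_eq7`) are hypothetical and not part of the original software. Two conventions are inferred from the worked example in Table 5 rather than stated in the text: a closing bracket is scored with the same attribute as an opening bracket, and the global attribute is the number of bracket pairs written with three digits.

```python
# Correlation weights CW(SA_k) from probe 1 (Table 4); the underscore
# padding used in the table is dropped from the attribute names.
CW_PROBE1 = {
    "(003": 0.9946079, "(004": 1.0156395, "(005": 1.0091059,
    "(006": 1.0136803, "(007": 1.0258007, "(008": 1.0010987,
    "(": 0.9989675, "1": 1.1835983, "2": 1.0020425, "3": 1.0074892,
    "=": 0.9994141, "C": 0.9990570, "Br": 0.9751599, "Cl": 0.9899030,
    "F": 0.9963299, "I": 1.0033072, "N": 0.9946235, "O": 0.9913011,
    "S": 0.9975581, "c": 0.9963897,
}


def smiles_attributes(smiles):
    """Split a SMILES string into local attributes plus one global attribute."""
    attrs = []
    i = 0
    while i < len(smiles):
        two = smiles[i:i + 2]
        if two in ("Cl", "Br"):          # two-character attributes come first
            attrs.append(two)
            i += 2
        else:
            ch = smiles[i]
            # Assumption taken from Table 5: ")" is scored as "(".
            attrs.append("(" if ch == ")" else ch)
            i += 1
    # Assumption taken from Table 5: global attribute = number of bracket pairs.
    attrs.append("(%03d" % smiles.count("("))
    return attrs


def dcw(smiles, weights=CW_PROBE1):
    """Eq. 1: product of the correlation weights of all SMILES attributes."""
    value = 1.0
    for attr in smiles_attributes(smiles):
        value *= weights[attr]           # KeyError if the attribute has no weight
    return value


def log_ki_eq7(dcw_value):
    """Eq. 7: one-variable model obtained in the first probe."""
    return -19.444 + 16.449 * dcw_value


d = dcw("O=S(N)(=O)c1ccc(cc1)C(O)=O")    # compound 1 of the training set
print(round(d, 5), round(log_ki_eq7(d), 2))
```

For compound 1 this should give a DCW close to the value 1.3194129 listed in Table 5. Attributes without a tabulated weight raise an error here; how the original software treats them is not stated in the text.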


One can estimate the predictive potential of the model calculated with Eq. 2 using an external test set. The separation into training and test sets can be random but must comply with one limitation: the ranges of the KI values for the training and test sets should be similar. The SMILES used in the present study were obtained with the ACD/ChemSketch software [14].

2.2. Multiple linear regression analysis

Multiple linear regression analysis was performed on exactly the same compounds that constituted the training and test sets in the SMILES modeling approach. A total of 69 physicochemical constants, topological and structural descriptors (Table 2) were considered as possible input candidates for building a multiple linear regression (MLR) model. Before the calculation of the descriptors, all structures were fully optimized using Cambridge Software Mechanics, specifically the MM2 force field and the truncated Newton-Raphson optimizer, which provide a balance between speed and accuracy (Chemoffice Manual). Before calculating the HOMO and LUMO energies (eV), all structures were additionally fully optimized with the AM1 semi-empirical method. All descriptors were calculated using Chem3D and Topix [15,16].

Table 2. Calculated descriptors.

ID  Description                              Notation
1   Molar Refractivity                       MR
2   Diameter                                 Diam
3   Partition Coefficient (Octanol Water)    ClogP
4   Molecular Topological Index              TIndx
5   Principal Moment of Inertia Z            PMIZ
6   Number of Rotatable Bonds                NRBo
7   Principal Moment of Inertia Y            PMIY
8   Polar Surface Area                       PSAr
9   Principal Moment of Inertia X            PMIX
10  Radius                                   Rad
11  LUMO Energy                              LUMO
12  Shape attribute                          ShpA
13  HOMO Energy                              HOMO
14  Shape coefficient                        ShpC
15  Balaban Index                            BIndx
16  Sum of Valence Degrees                   SVDe
17  Cluster Count                            ClsC
18  Total Connectivity                       TCon
19  Wiener Index                             WIndx
20  Total Valence Connectivity               TVCon
21  DistEqTotal                              DistEqTotal
22  Randic 0                                 Chi0
23  Randic 1                                 Chi1
24  Randic 2                                 Chi2
25  Randic 3                                 Chi3
26  Randic 4                                 Chi4
27  Randic Information 0                     ChiInf0
28  Randic Information 1                     ChiInf1
29  Randic Information 2                     ChiInf2
30  Randic Information 3                     ChiInf3
31  Randic Information 4                     ChiInf4
32  Molecular Weight                         MW
33  Randic Mod                               ChiMod
34  Xu1                                      Xu1
35  Xu2                                      Xu2
36  Xu3                                      Xu3
37  Balaban Topological                      TopoJ
38  Number of Branches                       NBranch
39  Number of Rings                          NRings
40  Wiener Dim                               Wiener Dim
41  Bertz                                    Bertz
42  AtomCompMean                             AtomCompMean
43  AtomCompTot                              AtomCompTot
44  Zagreb1                                  Zagreb1
45  Zagreb2                                  Zagreb2
46  Kappa1                                   Kappa1
47  Kappa2                                   Kappa2
48  Kappa3                                   Kappa3
49  Wiener Distance                          WienerDistCode
50  Polarity                                 Polarity
51  DistEqMean                               DistEqMean
52  Quadratic                                Quadr
53  InfMagnitDistTot                         InfMagnitDistTot
54  ScHultz                                  ScHultz
55  Gordon                                   Gordon
56  Kier-Hall 0                              Ki0
57  Kier-Hall 1                              Ki1
58  Kier-Hall 2                              Ki2
59  Kier-Hall 3                              Ki3
60  Kier-Hall 4                              Ki4
61  Kier-Hall Information 0                  KiInf0
62  Kier-Hall Information 1                  KiInf1
63  Kier-Hall Information 2                  KiInf2
64  Kier-Hall Information 3                  KiInf3
65  Kier-Hall Information 4                  KiInf4
66  Randic Cluster 3                         ChiCl3
67  Randic Cluster 4                         ChiCl4
68  Wiener Information                       InfWiener
69  Wiener Index                             WIndx


Our first objective was to determine the variables that produce the most significant linear QSAR models linking the structure of the compounds with their inhibitory activity. The Elimination Selection-Stepwise Regression (ES-SWR) algorithm was applied to the training data set to select the most appropriate descriptors. ES-SWR is a popular stepwise technique [17] that combines Forward Selection (FS-SWR) and Backward Elimination (BE-SWR). The accuracy of the produced MLR model was assessed with two evaluation techniques: leave-one-out cross-validation and validation on an external test set. According to Tropsha's group [19-21], a QSAR model is considered predictive if the following conditions are satisfied:

$$R^2_{pred} > 0.6 \qquad (3)$$

$$\frac{R^2 - R_0^2}{R^2} < 0.1 \quad \text{or} \quad \frac{R^2 - R_0'^2}{R^2} < 0.1 \qquad (4)$$

$$0.85 \le k \le 1.15 \quad \text{or} \quad 0.85 \le k' \le 1.15 \qquad (5)$$

In Eqs. (4) and (5), $R^2$ is the correlation coefficient between the experimental values and the model predictions on the test set. The mathematical definitions of $R_0^2$, $R_0'^2$, $k$ and $k'$ are based on the regression of the observed activities against the predicted activities and the opposite (regression of the predicted activities against the observed activities). The definitions are presented clearly in [21] and are not repeated here for brevity.

Moreover, for a QSAR model to be used for screening new compounds, its domain of application must be defined, and only predictions for compounds that fall into this domain may be considered reliable. The extent of extrapolation is one simple approach to defining the applicability domain. It is based on the calculation of the leverage $h_i$ [21-22] for each chemical whose activity is predicted with the QSAR model:

$$h_i = x_i^T (X^T X)^{-1} x_i \qquad (6)$$

In Eq. (6), $x_i$ is the vector containing the k model parameters of the query compound and $X$ is the n × k matrix containing the k model parameters for each of the n training compounds. A leverage value greater than 3k/n is considered large: it means that the predicted response results from a substantial extrapolation of the model and may not be reliable.
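The leverage check of Eq. 6 can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the descriptor matrix below is random placeholder data (the descriptor values of the compounds are not tabulated in this chapter), the function names `leverages` and `warning_leverage` are hypothetical, and no intercept column is added because the text defines X with the k model parameters only.

```python
import numpy as np


def leverages(X_train, X_query):
    """Eq. 6: h_i = x_i^T (X^T X)^-1 x_i for every row x_i of X_query."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.atleast_2d(np.asarray(X_query, dtype=float))
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)   # pseudo-inverse for stability
    # Row-wise quadratic form x_i^T (X^T X)^-1 x_i.
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)


def warning_leverage(n_train, k):
    """Warning limit 3k/n used in the text."""
    return 3.0 * k / n_train


# Placeholder data: 30 training compounds, 4 descriptors (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

limit = warning_leverage(n_train=30, k=4)           # 3 * 4 / 30 = 0.40
h_train = leverages(X, X)
inside_domain = leverages(X, X[:5]) <= limit        # True where prediction is an interpolation
print(limit, inside_domain)
```

With n = 30 training compounds and k = 4 descriptors, the limit evaluates to 0.40, which is the warning leverage quoted for the MLR model.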


3. Results and discussion

In general, the models produced by the two approaches exhibited very good statistical performance for both the training and the test set. Detailed results for each method are presented below.

3.1. SMILES based optimal descriptors

The statistical characteristics of the models for K obtained in three probes of the Monte Carlo optimization are given in Table 3. The statistical quality of these models is quite good for both the training and test sets. Table 4 contains the numerical data on $CW(SA_k)$. An example of the DCW calculation is given in Table 5. The one-variable model of logKI obtained in the first probe of the optimization is the following:

$$\log K_I = -19.444\,(\pm 0.3418) + 16.449\,(\pm 0.2672) \cdot DCW \qquad (7)$$

n=30, R2=0.7612, s=0.252, F=89 (training set)
n=17, R2=0.8536, s=0.210 (test set)

Table 6 shows the separation of the compounds into training and validation sets, together with the experimental logKI values and the respective predicted values calculated with Eq. 7. The model is shown graphically for the training and test sets in Figs. 1 and 2, respectively.

Table 3. Statistical characteristics of QSAR based on optimal descriptors obtained in three probes of the Monte Carlo optimization.

        Training set, n=30        Test set, n=17
Probe   R2      S      F          R2      R2pred*  S      F
1       0.7612  0.252  89         0.8536  0.8148   0.210  87
2       0.7604  0.252  89         0.8913  0.8645   0.196  123
3       0.7606  0.252  89         0.8768  0.8447   0.205  107

*) $R^2_{pred} = 1 - \sum_{k=1}^{n} (Ey_k - Cy_k)^2 \Big/ \sum_{k=1}^{n} (Ey_k - Ay)^2$

where $Ey_k$ and $Cy_k$ are the experimental and calculated values of the inhibitory activity, respectively; $Ay$ is the average value of the inhibitory activity on the given set (i.e., training or test set); and n is the number of compounds in the set.
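The footnote formula can be checked against the tabulated data. The sketch below is a plain Python illustration, not the authors' code; the function name `r2_pred` is hypothetical, and the two lists are the experimental and calculated logKI values of the 17 test-set compounds transcribed from Table 6.

```python
def r2_pred(experimental, calculated):
    """R2pred = 1 - sum((Ey_k - Cy_k)^2) / sum((Ey_k - Ay)^2), Ay = mean of Ey."""
    n = len(experimental)
    mean_exp = sum(experimental) / n
    ss_res = sum((e - c) ** 2 for e, c in zip(experimental, calculated))
    ss_tot = sum((e - mean_exp) ** 2 for e in experimental)
    return 1.0 - ss_res / ss_tot


# Test-set compounds 9, 12, 14, 16, 19, 21, 22, 23, 28, 31, 33, 37, 41, 42,
# 43, 46 and 47 (values transcribed from Table 6, rounded to three decimals).
exp_test = [1.176, 0.991, 0.959, 1.881, 2.365, 2.412, 2.330, 2.362, 2.021,
            1.690, 1.447, 2.505, 2.041, 1.602, 1.845, 2.097, 2.041]
calc_test = [1.220, 1.171, 0.982, 1.814, 2.292, 2.067, 2.251, 2.067, 1.817,
             1.543, 1.503, 2.030, 1.929, 1.610, 2.218, 2.054, 2.034]

print(round(r2_pred(exp_test, calc_test), 3))
```

The result should fall close to the value of 0.8148 reported for probe 1; small deviations are to be expected because the tabulated values are rounded.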


Table 4. Numerical data for correlation weights of the SAk obtained in three probes of the Monte Carlo optimization.

SMILES attribute, SAk   CW(SAk) in probe 1   CW(SAk) in probe 2   CW(SAk) in probe 3
(003________            0.9946079            0.9961606            0.9735016
(004________            1.0156395            1.0198094            1.0201966
(005________            1.0091059            1.0038481            1.0160863
(006________            1.0136803            1.0028883            1.0291028
(007________            1.0258007            1.0135517            1.0612002
(008________            1.0010987            0.9715683            1.0188980
(___________            0.9989675            1.0022593            0.9949041
1___________            1.1835983            1.2239750            1.2992088
2___________            1.0020425            0.9990315            0.9966276
3___________            1.0074892            1.0069452            1.0070966
=___________            0.9994141            1.0018132            1.0053143
C___________            0.9990570            0.9983833            0.9967137
Br__________            0.9751599            0.9621277            0.9488951
Cl__________            0.9899030            0.9845061            0.9791878
F___________            0.9963299            0.9940541            0.9922458
I___________            1.0033072            0.9969325            1.0000845
N___________            0.9946235            0.9921148            0.9892841
O___________            0.9913011            0.9847984            0.9797559
S___________            0.9975581            0.9953914            0.9907505
c___________            0.9963897            0.9959790            0.9952362

The SMILES based descriptors can be calculated with the correlation weights (Table 4) for an arbitrary substance (more exactly, for an arbitrary SMILES). Unfortunately, criteria for a rational definition of the applicability domain of such models do not exist. Several pieces of software can generate SMILES, and the choice can influence the statistical quality of the SMILES-based models. It is important that all SMILES are generated by the same software; using SMILES prepared by different pieces of software is not correct. Given the growing number of SMILES-oriented databases on the Internet, the SMILES-based descriptor approach is a promising one.

Table 5. Example of calculating the DCW with the CW(SAk) obtained in the first probe of the Monte Carlo optimization for the first SMILES of the training set: SMILES="O=S(N)(=O)c1ccc(cc1)C(O)=O"; DCW = 1.3194129.

SAk            CW(SAk)
O___________   0.9913011
=___________   0.9994141
S___________   0.9975581
(___________   0.9989675
N___________   0.9946235
(___________   0.9989675
(___________   0.9989675
=___________   0.9994141
O___________   0.9913011
(___________   0.9989675
c___________   0.9963897
1___________   1.1835983
c___________   0.9963897
c___________   0.9963897
c___________   0.9963897
(___________   0.9989675
c___________   0.9963897
c___________   0.9963897
1___________   1.1835983
(___________   0.9989675
C___________   0.9990570
(___________   0.9989675
O___________   0.9913011
(___________   0.9989675
=___________   0.9994141
O___________   0.9913011
(004________   1.0156395

3.2. Multiple linear regression analysis

The ES-SWR algorithm was applied to the training data, considering all 69 available descriptors as potential input parameters to the model. The separation of the dataset into training and validation sets was performed according to the popular Kennard and Stone algorithm [18]. The advantages of this algorithm are that the calibration samples map the measured region of the input variable space completely with respect to the induced metric and that the test samples all fall inside the measured region. The validation data were not involved in any way in selecting the most appropriate descriptors or in developing the QSAR model. They were treated as a completely unknown external set of data, used only to test the accuracy of the produced model.

The most significant descriptors according to the ES-SWR algorithm are lipophilicity (ClogP), KiInf2, ChiInf1 and the Bertz index. Table 7 presents the correlation matrix, which shows that the four selected descriptors are not highly correlated. Lipophilicity is known to be important for absorption, permeability and in vivo distribution of organic compounds [17] and has been used as a physicochemical descriptor in QSARs with great success.
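The Kennard and Stone selection mentioned above can be sketched as follows. This is a generic textbook version written with NumPy, not the implementation used by the authors: the function name `kennard_stone` is hypothetical, Euclidean distance in descriptor space is assumed as the metric, and the input data in the usage lines are toy values.

```python
import numpy as np


def kennard_stone(X, n_select):
    """Return (selected, remaining) row indices of X.

    Start from the two most distant samples, then repeatedly add the sample
    whose distance to its nearest already-selected sample is largest.
    """
    X = np.asarray(X, dtype=float)                   # shape (n_samples, n_descriptors)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [idx for idx in range(len(X)) if idx not in selected]
    while len(selected) < n_select and remaining:
        # Distance of every remaining sample to its closest selected sample.
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining


# Toy one-descriptor example: four samples, three go to the training set.
toy = [[0.0], [1.0], [2.0], [10.0]]
train_idx, test_idx = kennard_stone(toy, n_select=3)
print(train_idx, test_idx)
```

Because the selected samples span the extremes of the descriptor space, the samples left for testing lie inside the region covered by the training set, which is the property highlighted in the text.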


Table 6. Experimental and calculated values (with Eq. 7) of the carbonic anhydrase II inhibitory activity (logKI) of para-substituted aromatic sulfonamides using the SMILES approach.

No  DCW        Expr   Calc   Expr-Calc  Structure (SMILES)

Training set
1   1.3194129  2.412  2.260   0.152     O=S(N)(=O)c1ccc(cc1)C(O)=O
2   1.3167175  2.093  2.216  -0.123     O=S(N)(=O)c1ccc(cc1)C(=O)NN
3   1.2404880  1.114  0.962   0.152     Clc2ccc(NC(=O)NNC(=O)c1ccc(cc1)S(N)(=O)=O)cc2Cl
4   1.2641648  1.176  1.351  -0.175     O=C(Nc1ccc(cc1)C(C)=O)NNC(=O)c2ccc(cc2)S(N)(=O)=O
5   1.2519862  0.954  1.151  -0.197     O=C(Nc1ccc(cc1)C(=O)OCC)NNC(=O)c2ccc(cc2)S(N)(=O)=O
6   1.2344774  0.863  0.863   0.000     Brc2ccc(NC(=O)NNC(=O)c1ccc(cc1)S(N)(=O)=O)cc2
7   1.2573710  1.041  1.240  -0.199     O=C(Nc1ccc(cc1)c2ccccc2)NNC(=O)c3ccc(cc3)S(N)(=O)=O
8   1.2464332  1.255  1.060   0.195     NS(=O)(=O)c1ccc(cc1)C(=O)NNC(=O)Nc3ccc(Oc2ccccc2)cc3
10  1.2880059  1.826  1.743   0.083     O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=CC=C2)=O)=O)=O)C=C1)(N)=O
11  1.2867913  1.732  1.723   0.009     O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=CC=C2C)=O)=O)=O)C=C1)(N)=O
13  1.2497919  0.978  1.115  -0.137     O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=C(F)C=C2)=O)=O)=O)C=C1)(N)=O
15  1.3006607  1.708  1.952  -0.244     Fc1cc(ccc1NN)S(N)(=O)=O
17  1.3225867  2.391  2.312   0.079     O=S(N)(=O)c1ccc(NC(C)=O)cc1
18  1.3001713  2.124  1.944   0.180     O=S(N)(=O)c1ccc(NC(=O)C(F)(F)F)cc1
20  1.3200935  2.356  2.271   0.085     O=S(N)(=O)c1ccc(NC(=O)CCC)cc1
24  1.3176050  1.799  2.230  -0.431     O=S(N)(=O)c1ccc(NC(=O)CCCCC)cc1
25  1.3007132  1.568  1.952  -0.384     O=S(N)(=O)c2ccc(NC(=O)c1ccccc1)cc2
26  1.2483770  1.230  1.092   0.138     Fc1c(c(F)c(F)c(F)c1F)C(=O)Nc2ccc(cc2)S(N)(=O)=O
27  1.2937200  2.380  1.837   0.543     O=C(Nc1ccccc1)Nc2ccc(cc2)S(N)(=O)=O
29  1.2912812  1.875  1.797   0.078     O=C(Nc1ccccc1)NCCc2ccc(cc2)S(N)(=O)=O
30  1.2546021  1.114  1.194  -0.080     Clc2ccc(NC(=O)NCCc1ccc(cc1)S(N)(=O)=O)cc2Cl
32  1.2745907  1.602  1.523   0.079     NS(=O)(=O)c2ccc(CNS(=O)(=O)c1ccccc1)cc2
34  1.2742383  0.954  1.517  -0.563     O=S(=O)(Nc1ccc(cc1)S(N)(=O)=O)c2ccc(F)cc2
35  1.2678946  1.875  1.413   0.462     CC(=O)Nc1ccc(cc1)S(=O)(=O)NCCc2ccc(cc2)S(N)(=O)=O
36  1.3125085  2.477  2.147   0.330     O=S(N)(=O)c1ccc(N)cc1
38  1.3112708  2.230  2.126   0.104     NCc1ccc(cc1)S(N)(=O)=O
39  1.3100342  2.204  2.106   0.098     NCCc1ccc(cc1)S(N)(=O)=O
40  1.3076914  1.778  2.067  -0.289     Fc1cc(ccc1N)S(N)(=O)=O
44  1.2811755  1.447  1.631  -0.184     Clc1c(cc(c(N)c1Cl)S(N)(=O)=O)S(N)(=O)=O
45  1.2815962  1.875  1.638   0.237     O=S(N)(=O)c1cc(c(N)cc1Cl)S(N)(=O)=O

Test set
9   1.2561853  1.176  1.220  -0.044     O=C(Nc1ccc(cc1)Cc2ccccc2)NNC(=O)c3ccc(cc3)S(N)(=O)=O
12  1.2532128  0.991  1.171  -0.180     O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=C(C)C=C2)=O)=O)=O)C=C1)(N)=O
14  1.2417301  0.959  0.982  -0.023     O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=C(Cl)C=C2)=O)=O)=O)C=C1)(N)=O
16  1.2922707  1.881  1.814   0.067     NNc1ccc(cc1Cl)S(N)(=O)=O
19  1.3213395  2.365  2.292   0.073     O=S(N)(=O)c1ccc(NC(=O)CC)cc1
21  1.3076600  2.412  2.067   0.345     O=S(N)(=O)c1ccc(NC(=O)CC(C)C)cc1
22  1.3188486  2.330  2.251   0.079     O=S(N)(=O)c1ccc(NC(=O)CCCC)cc1
23  1.3076600  2.362  2.067   0.295     O=S(N)(=O)c1ccc(NC(=O)CC(C)C)cc1
28  1.2925000  2.021  1.817   0.204     O=C(Nc1ccccc1)NCc2ccc(cc2)S(N)(=O)=O
31  1.2757938  1.690  1.543   0.147     O=S(=O)(Nc1ccc(cc1)S(N)(=O)=O)c2ccccc2
33  1.2733888  1.447  1.503  -0.056     NS(=O)(=O)c2ccc(CCNS(=O)(=O)c1ccccc1)cc2
37  1.3054518  2.505  2.030   0.475     NNc1ccc(cc1)S(N)(=O)=O
41  1.2992561  2.041  1.929   0.112     Nc1ccc(cc1Cl)S(N)(=O)=O
42  1.2799056  1.602  1.610  -0.008     Nc1ccc(cc1Br)S(N)(=O)=O
43  1.3168492  1.845  2.218  -0.373     Ic1cc(ccc1N)S(N)(=O)=O
46  1.3068906  2.097  2.054   0.043     OCc1ccc(cc1)S(N)(=O)=O
47  1.3056582  2.041  2.034   0.007     OCCc1ccc(cc1)S(N)(=O)=O

[Figure 1: scatter plot for the training set, n=30; axes logK(calc.) and logK(expr.).]

Figure 1. Plot of experimental versus calculated (with Eq. 7) values of logKI for the training set.

[Figure 2: scatter plot for the test set, n=17; axes logK(calc.) and logK(expr.).]

Figure 2. Plot of experimental versus calculated (with Eq. 7) values of logKI for the test set.

Table 7. Correlation matrix for the four selected descriptors.

          KiInf2   ClogP    ChiInf1  Bertz
KiInf2    1
ClogP     0.301    1
ChiInf1   0.208   -0.439    1
Bertz     0.511    0.273    0.239    1

Information indices such as KiInf2 and ChiInf1 are graph-theoretical invariants that view the molecular graph as a source of different probability distributions to which information theory definitions can be applied. They can be considered a quantitative measure of the lack of structural homogeneity or of the diversity of a graph, and are in this way related to the symmetry associated with the structure [17]. Bertz's complexity index [17], the most popular complexity index, takes into account both the variety of kinds of bond connectivities and the atom types of the H-depleted molecular graph.

The produced MLR QSAR model is the following:

$$\log K_I = 2.25 + 0.0792\,\mathrm{KiInf2} - 0.189\,\mathrm{ClogP} - 0.0654\,\mathrm{ChiInf1} - 0.00457\,\mathrm{Bertz} \qquad (8)$$

n = 30; R2 = 0.77; F = 20.8; RMS = 0.27; Q2 = 0.66

Eq. 8 was used to predict the binding affinity for the validation examples. The results are presented in Table 8 and correspond to $R^2_{pred}$ = 0.76. These results illustrate that the linear MLR technique combined with a successful variable selection procedure is adequate to generate an efficient QSAR model for predicting the binding affinity of different compounds. The proposed model (Eq. 8) passed all the tests for predictive ability (Eqs. 3-5):

$$R^2_{pred} = 0.76 > 0.6$$

$$\frac{R^2 - R_0^2}{R^2} = -0.2938 < 0.1, \qquad \frac{R^2 - R_0'^2}{R^2} = -0.3148 < 0.1$$

$$k = 0.98 \quad \text{and} \quad k' = 1.000$$

The MLR QSAR model is shown graphically for the training and test sets in Figs. 3 and 4, respectively.
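The leave-one-out statistic Q2 quoted with Eq. 8 can be illustrated with the short NumPy sketch below. It uses the common definition Q2 = 1 - PRESS/SS, which is an assumption here because the chapter does not spell the formula out; the function name `loo_q2` is hypothetical and the descriptor matrix is random placeholder data, since the descriptor values of the compounds are not tabulated.

```python
import numpy as np


def loo_q2(X, y):
    """Leave-one-out Q2 for an ordinary least-squares model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])        # intercept + descriptors
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                # drop compound i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2           # squared prediction error
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


# Placeholder data shaped like the training set: 30 compounds, 4 descriptors.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 4))
y_noise_free = 2.25 + X_demo @ np.array([0.0792, -0.189, -0.0654, -0.00457])
y_noisy = y_noise_free + rng.normal(scale=0.1, size=30)

print(loo_q2(X_demo, y_noise_free), loo_q2(X_demo, y_noisy))
```

For noise-free linear data Q2 is essentially 1; added noise lowers it, and Q2 is normally smaller than the fitted R2, as is the case for Eq. 8 (0.66 versus 0.77).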


Table 8. Experimental and calculated values (with Eq. 8) of the carbonic anhydrase II inhibitory activity (logKI) of para-substituted aromatic sulfonamides using MLR.

No.  Exp logKI  Calc logKI  Calc-Exp  Leverage (limit 0.40)

Training set
1    2.4116     2.1097      -0.3019   0.2037
2    2.0934     2.1725       0.0791   0.1677
3    1.1139     0.9494      -0.1645   0.1352
4    1.1761     1.4226       0.2465   0.0678
5    0.9542     1.0316       0.0774   0.1275
6    0.8633     1.1960       0.3327   0.0879
7    1.0414     1.1208       0.0794   0.1442
8    1.2553     1.1675      -0.0878   0.1492
10   1.8261     1.7162      -0.1099   0.2664
11   1.7324     1.4906      -0.2418   0.1898
13   0.9777     1.3180       0.3403   0.2918
15   1.7076     1.8862       0.1786   0.2000
17   2.3909     2.3700      -0.0209   0.1188
18   2.2124     1.9471      -0.2653   0.1642
20   2.3560     2.1641      -0.1919   0.1310
24   1.7993     1.8912       0.0919   0.1490
25   1.5682     1.9853       0.4171   0.0846
26   1.2304     0.8678      -0.3626   0.2203
27   2.3802     1.9564      -0.4238   0.1534
29   1.8751     2.0653       0.1902   0.2247
30   1.1139     1.1035      -0.0104   0.2281
32   1.6021     1.8683       0.2662   0.1185
34   0.9542     1.2699       0.3157   0.1066
35   1.8751     1.4038      -0.4713   0.0817
36   2.4771     2.1284      -0.3487   0.3687
38   2.2304     2.2944       0.0640   0.2394
39   2.2041     2.2505       0.0464   0.2058
40   1.7782     2.0849       0.3067   0.1388
44   1.4472     1.5128       0.0656   0.1154
45   1.8751     1.7782      -0.0969   0.1196

Test set
9    1.1761     1.2650       0.0889   0.1348
12   0.9912     1.3982       0.4070   0.1556
14   0.9590     1.1897       0.2307   0.2178
16   1.8808     1.7579      -0.1229   0.2479
19   2.3655     2.2019      -0.1636   0.0926
21   2.4116     2.0581      -0.3535   0.1609
22   2.3304     2.0288      -0.3016   0.1313
23   2.3617     2.1901      -0.1716   0.1329
28   2.0212     2.1660       0.1448   0.2466
31   1.6902     1.8147       0.1245   0.0939
33   1.4472     1.6994       0.2522   0.1310
37   2.5051     2.2284      -0.2767   0.1462
41   2.0414     1.9386      -0.1028   0.1648
42   1.6021     1.8936       0.2915   0.1803
43   1.8451     1.8463       0.0012   0.2004
46   2.0969     2.4382       0.3413   0.1847
47   2.0414     2.3964       0.3550   0.1736

[Figure 3: scatter plot for the training set, n=30; axes logK(calc.) and logK(expr.).]

Figure 3. Plot of experimental versus calculated (with Eq. 8) values of logKI for the training set.

[Figure 4: scatter plot for the test set, n=17; axes logK(calc.) and logK(expr.).]

Figure 4. Plot of experimental versus calculated (with Eq. 8) values of logKI for the test set.

The extrapolation method was applied to the compounds that constitute the test set. The leverages for all 17 compounds were computed (Table 8). All 17 compounds in the test set fall inside the domain of the model (warning leverage limit 0.40). The MLR model provides qualitative and quantitative information on the dependency between the descriptors and the CA II inhibitory activity. The model is simple, requires the calculation of only four descriptors and is easily interpretable in terms of the impact that each descriptor has on the activity. A medicinal chemist can derive useful conclusions about the structure-activity relationship from the sign (positive or negative) and the magnitude of each contribution. The accuracy of the model demonstrates the efficiency of the produced QSAR model, which can be used for obtaining accurate predictions and for drawing safe conclusions.

4. Conclusions

From the above discussion, it can be seen that both approaches are statistically meaningful and that the predicted values are in agreement with the experimental ones. The SMILES approach gives $R^2_{pred}$ = 0.81 and a standard error of 0.210 for the test set, slightly better than the MLR model, whose corresponding values are 0.76 and 0.292. The differences between the experimental and calculated inhibitory activities are shown in the figures provided in this work for both the SMILES and MLR models. These figures show that the residuals produced by the SMILES model are in general lower than those produced by the MLR model, so the predictions of the SMILES model are better than those of the MLR model.


The MLR model requires the calculation of only four descriptors, which can be easily computed and interpreted in terms of their influence on the CA II inhibitory activity. Based on the MLR equation, which indicates the dependence and the extent of influence of the descriptors on the inhibitory activity, various structural modifications can be proposed for the design of novel structures with desired characteristics.

References

1. Supuran, C.T. and Scozzafava, A. Bioorganic & Medicinal Chemistry, 2007, 15, 4336-4350.

2. Supuran, C.T. Nature Reviews Drug Discovery, 2008, 7, 168-181.

3. Innocenti, A., Vullo, D., Scozzafava, A. and Supuran, C. T. Bioorganic & Medicinal Chemistry Letters, 2008, 18, 1583-1587.

4. Sączewski, F., Innocenti, A., Sławiński, J., Kornicka, A., Brzozowski, Z., Pomarnacka, E., Scozzafava, A., Temperini, C. and Supuran, C.T. Bioorganic & Medicinal Chemistry, 2008, doi:10.1016/j.bmcl.2006.06.064.

5. Carbonic Anhydrase: Its Inhibitors and Activators. CRC Enzyme Inhibitors Series, Volume 1. Edited by Claudiu T. Supuran, Andrea Scozzafava (Universita degli Studi, Firenze), and Janet Conway (Pfizer Inc., New York). CRC Press LLC: Boca Raton, FL. 2004. ISBN 0-415-30673-6.

6. Balaban A.T., Basak S.C., Beteringhe A., Mills D., Supuran C.T. Mol Divers. 2004, 8, 401-412.

7. Khadikar P.V., Sharma V., Karmarkar S., Supuran C.T. Bioorg Med Chem Lett. 2005, 15, 923-930.

8. Khadikar, P. V., Singh, J., Singh, S., Mishra, R., Supuran, C.T., Clare, B.W., Lakhwani, M., Medicinal Chemistry, 2008, 4, 30-66.

9. Melagraki, G., Afantitis, A., Sarimveis, H., Igglessi-Markopoulou, O., Supuran C.T. Bioorg. Med. Chem., 2006, 14, 1108-1114.

10. Singh, J., Shaik, B., Singh, S., Agrawal, V. K., Khadikar, P.V., Deeb, O., Supuran, C. T. Chemical Biology & Drug Design, 2008, 71, 244-259.

11. Toropov, A.A., Toropova, A.P., Mukhamedzhanova, D.V., Gutman I. Indian J. Chem. 2005, 44A, 1545-1552.

12. Toropov, A., Nesmerak, K., Raska I., Jr., Waisser, K., Palat K. Computat. Bio. Chem. 2006, 30, 434-437.

13. Toropov, A., Benfenati, E. Computat. Bio. Chem. 2007, 31, 57-60.

14. http://www/acdlab.com/

15. CambridgeSoft Corporation www.cambridgesoft.com

16. www.lohninger.com/topix.html

17. Todeschini, R., Consonni, V., Mannhold, R. (Series Editor), Kubinyi, H. (Series Editor), Timmerman, H. (Series Editor) 2000, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim.

18. Kennard, R.W. and Stone, L.A. Technometrics, 1969, 11, 137-148.

19. Shen, M., Beguin, C., Golbraikh, A., Stables, J. Kohn, H. and Tropsha, A. J. Med. Chem. 2004, 47, 2356-2364.

20. Tropsha, A., Gramatica, P., Gombar, V.K. QSAR & Comb. Science. 2003, 22, 69-77.

21. Golbraikh, A. and Tropsha, A. J. Mol. Graph. Mod. 2002, 20, 269-276.

22. Atkinson, A. 1985, Plots, transformations and regression. Clarendon Press, Oxford (UK).

23. Walters, W.P.A. and Murcko, M.A. 1999, Curr. Opin. Chem. Biol. 3, 384-387.




5. From molecular structure to molecular design through the Molecular Descriptors Family Methodology

Sorana D. Bolboacă1 and Lorentz Jäntschi2

1“Iuliu Hatieganu” University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, 400349 Cluj-Napoca, Romania; 2Technical University of Cluj-Napoca, 103-105 Muncii Bvd, 400641 Cluj-Napoca, Romania

Abstract. The Molecular Descriptors Family on the Structure-Property/Activity Relationships (MDF SPR/SAR) is an area of computational research able to generate a family of molecular descriptors and to build models in order to estimate and predict the property/activity of chemical compounds. This review aims to briefly present the MDF SPR/SAR methodology and to discuss its abilities to estimate and predict different properties and activities.

Correspondence/Reprint request: Dr. Sorana D. Bolboacă, “Iuliu Hatieganu” University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, 400349 Cluj-Napoca, Romania. E-mail: [email protected]

Introduction

Structure-Activity Relationships (SARs), Structure-Property Relationships (SPRs) and Property-Activity Relationships (PARs) were first introduced by Louis Plack Hammett in 1937 [1]. A more recent review summarizes the most important applications of Hammett’s equation [2]. The idea of linking the structure of a compound with its activity or property, however, was published before the introduction of the SAR concept. In 1868,
Crum-Brown and Fraser noted that the activity of a compound is a function of its chemical composition and structure [3]. In 1893, Richet demonstrated that cytotoxicity was inversely related to water solubility on a set of organic compounds [4]. In 1899, Meyer demonstrated that the narcotic action of a sample of organic compounds was related to their solubility in olive oil [5]. Later, Hammett [6] and Taft [7] could be said to have laid the basis of QSAR/QSPR development. Quantitative relationships (QSAR, QSPR, QPAR), which are mathematical approaches that link the structure and the property/activity of chemical compounds in a quantitative manner [8], are applied when the property and/or activity are quantitative. Note that not all the properties and activities of chemical compounds can be classified as quantitative. An example is the sweetness of sugar (one of the five basic tastes, almost universally perceived as a pleasurable experience), which can be appreciated only through comparison (relative scale) since there is no single reference scale (such as the boiling and freezing points of water for the Celsius temperature scale). Properties that are expressed quantitatively may have several reference scales. Consequently, the use of terms such as QSAR, QSPR, and QPAR is currently avoided, the terms (Q)SAR, (Q)SPR, and (Q)PAR, or simply SAR, SPR, and PAR being preferred. As far as the structure is concerned, things are relatively simpler. An atom or a bond in a molecule either exists (as highlighted through electronic transitions and/or molecular vibrations and/or rotations) or does not exist (1 or 0). Molecular geometry (especially in the liquid or gas phase) is more complicated. Heisenberg's uncertainty principle (Werner Heisenberg, 1901-1976, one of the founders of quantum mechanics and a Nobel Prize winner) sets the rules of uncertainty at the micro level (molecular and atomic level).
Moreover, molecular geometry depends on the environment in which a molecule is placed (the vicinity of the molecule), the temperature, the pressure, etc. Thus, dealing with molecular geometry is a matter of relativity if not a matter of uncertainty. In conclusion, in the field of Structure-Property-Activity Relationships (SPARs) there are certainties (e.g. molecular topology), uncertainties (e.g. molecular geometry), relativities (e.g. biological activities), and evidence (e.g. physical-chemical properties). Mathematical Chemistry [9,10], Quantum Chemistry [11], and Medicinal Chemistry [12] make increasingly significant contributions to the design and improvement of drugs. The dynamics of pharmaceuticals is high; new drugs appear on the market daily, even though the development of each one is a long process. Drug design has recently emerged as a new field [13,14]. The usage of computing power was a major breakthrough. Grassy et al. [15] reported a search for


peptides possessing immunosuppressive activity by using 27 descriptors derived from the structure (topology and shape). Twenty-six peptides with high predicted activity were selected from a combinatorial library of about 280000 compounds. Five of them were actually synthesized and tested experimentally. The most potent compounds showed an immunosuppressive activity approximately 100 times higher than that of the lead compound. Combinatorial Chemistry is now a new field [16,17]. Almost 40 years have passed [18] since the QSAR (Quantitative Structure-Activity Relationship) paradigm proved its utility in agriculture, pharmacy, toxicology, and other fields. Many methods were introduced and proved to be good estimators and predictors: CoMFA [19] (Comparative Molecular Field Analysis) and its variant CoMSIA (MSIA: Molecular Similarity Indices Analysis); WHIM [20] (Weighted Holistic Invariant Molecular) and its variant MS-WHIM (Molecular Surface WHIM); MTD [21] (Minimal Topological Distance) and its variant MSD (S: Steric); FPIF [22] (Fragmental Property Index Family); and MDF [23] (Molecular Descriptors Family). Other emergent results such as S(Q)SAR (Spectral (Quantitative) Structure Activity Relationship) [24] are again challenging the classical approaches. The results in cellular genetics, the mechanic dynamics of cells, genetic mutations, and the sequencing and coding of macromolecular topology information have led to the recording and storing of molecular geometry in databases for a large number of biologically active chemical compounds [25]. Information stored in databases, integrated with proper combinatorial methods, may lead to the identification of active compounds with high potency. Statistical methods for internal and external validation [26], correlated correlation analysis [27], and principal component analysis [28] have a role in identifying (Q)SA/PRs (see [29,30,31] for the meaning of the parenthesized Q).
These methods allow the selection of compounds with the best biological activity. The aim of the present review is to present the Molecular Descriptors Family (MDF) methodology and the results obtained by applying this methodology on over 50 physical-chemical properties or biologically active compounds.

MDF SPR/SAR methodology

The MDF SAR method integrates the complex information obtained from the structure of the compounds into models in order to explain the activity/property of interest. The input data for a given set of molecules are represented by the molecular and/or structural formulas and the experimental values for the activity and/or property of interest. By applying the methodology, the molecular descriptors and the uni- and/or multivariate models are obtained. The following six steps were used in the modelling process [32].

• Step 1: drawing the topological model (2D) of each molecule from the set of interest by using HyperChem [33].
• Step 2: building the geometrical model (3D) of each compound from the set of interest by using HyperChem [33].
• Step 3: applying a semiempirical model (for calculating the partial charge distribution on atoms) and (sometimes) a quantum mechanics model (up to the most advanced ones such as ab initio and Time-Dependent Density Functional Theory) using specific modules of HyperChem [33] (examples: HyperNewton, HyperGauss, HyperNDO) in order to obtain an optimized geometrical model in vitro or in vivo.
• Step 4: generating the molecular descriptors family by using the MDF software. The MDF software is described below.
• Step 5: applying the bias procedure in order to eliminate identical descriptors from the generated family.
• Step 6: obtaining simple and/or multivariate linear regression relationships between the members of the descriptors family and the given property/activity. The following criteria are used [34,35]: (1) the goodness-of-fit of the model [36,37] (the correlation coefficient and the squared correlation coefficient; values of the correlation coefficient close to ±1 indicate a good model); (2) the co-linearity between pairs of descriptors (values lower than 0.5 indicate the absence of co-linearity between descriptors); and (3) the significance of the regression model (at a significance level of 5%). The internal validation of the MDF SAR models is analyzed by leave-one-out cross-validation [38]. A correlated correlation analysis is applied whenever appropriate by using Steiger’s Z test at a significance level of 5% [27].
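The criteria of Step 6 can be illustrated with a minimal sketch. It is not the MDF implementation: the data are hypothetical, the function names are chosen here for illustration, and the co-linearity check assumes that the 0.5 threshold applies to the absolute correlation coefficient between two descriptors, which the text does not state explicitly.

```python
import numpy as np

def correlation(x, y):
    """Pearson correlation coefficient r between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def loo_predictions(x, y):
    """Leave-one-out: refit y = a*x + b without molecule i, then predict molecule i."""
    n = len(y)
    predicted = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        a, b = np.polyfit(x[keep], y[keep], 1)
        predicted[i] = a * x[i] + b
    return predicted

def is_collinear(x1, x2, threshold=0.5):
    """Assumption: co-linearity is judged on |r| between the two descriptors."""
    return abs(correlation(x1, x2)) >= threshold

# Hypothetical data: one descriptor value and one measured property per molecule.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8])
x_other = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

r = correlation(x, y)                              # criterion (1): goodness-of-fit
r_cv_loo = correlation(y, loo_predictions(x, y))   # internal validation
print(f"r = {r:.4f}, r^2 = {r * r:.4f}, r_cv-loo = {r_cv_loo:.4f}")
print("descriptor pair co-linear:", is_collinear(x, x_other))  # criterion (2)
```

A large gap between r and r_cv-loo would suggest that the model depends heavily on individual molecules of the set.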

MDF SAR physical model

The MDF SAR approach has a mathematical model that comprises seven pieces, every piece having a list of possibilities, which come from the physical approach. Every piece provides a letter in the descriptor’s name:

• The linearizing operator (the first letter) makes the link between micro, nano, and macro levels. Example: pH = -log[H+] is a macro property (measure, effect) of a phenomenon in the micro environment (cause): the presence and the number of H+ in a given solution. It takes six values: I (identity), i (inverse), A (absolute), a (inverse of absolute), L (logarithm of absolute), l (logarithm).
• The molecular level superposing operator (the second letter) is laid upon the fragmental contributions. Its existence is sustained by the variety of the causality of the molecular property/activity, from specificity, regioselectivity, and selectivity (which most biological activities have) to independence from the structural formula (such as relative mass, which is the same for all isomers of a molecular formula). It takes nineteen values: sized group (m = smallest; M = largest; n = smallest absolute; N = largest absolute); averaged group (S = sum; A = average over all values; a = sum divided by the number of all fragments; B = average first by atom group and then by the whole molecule; b = adjusted B by bonds); geometric group (P = multiplication; G = geometric mean; g = adjusted G by fragments; F = geometric mean first by atom group and then by the whole molecule; f = adjusted F by bonds); harmonic group (s = harmonic sum; H = harmonic mean; h = adjusted H by fragments; I = harmonic mean first by atom group and then by the whole molecule; i = adjusted I by bonds).
• The pair-based fragmentation criterion (the third letter) implements different fragmentation criteria. Some parts of a molecule are more active than others and account for most of the activity/property of the molecule (the substituent’s role), as was observed in the first SAR studies carried out by Hammett. It takes four values: m (minimal), M (maximal), D (Szeged, distance based), and P (Cluj, shortest paths based [22,39]).
• The interaction model (the fourth letter) implements different levels of approximation (scalar and vectorial) for superposing the descriptors of interaction at the fragment level. It is well known that a series of field-type interactions (such as gravitational and electrostatic) are treated vectorially at short range and scalarly at a distance. It takes six values: scalar (R = rare model and resultant relative to fragment’s head, r = rare model and resultant relative to conventional origin, M = medium model and resultant relative to fragment’s head, m = medium model and resultant relative to conventional origin), vectorial (D = dense model and resultant relative to fragment’s head, d = dense model and resultant relative to conventional origin).
• The interaction descriptor (the fifth letter) implements a series of interaction descriptors for physical entities (such as force, field, energy, potential), as they occur in magnetism, electrostatics, gravity and quantum mechanics. Different physical entities have different formulas. It takes twenty-four values: D (distance), d (inverted distance), O (first atom’s property), o (inverted first atom’s property), P (product of atomic property), p (inverted P), Q (squared P), q (inverted Q), J (first atom’s property multiplied by distance), j (inverted J), K (product of atomic property and distance), k (inverted K), L (product of distance and squared atomic property), l (inverted L), V (first atom’s property potential), E (first atom’s property field), W (first atom’s property work), w (property’s work), F (first atom’s property’s force), f (property’s force), S (first atom’s property’s weak nuclear force), s (property’s weak nuclear force), T (first atom’s property’s strong nuclear force), t (property’s strong nuclear force).
• The atomic property (the sixth letter) discriminates atoms through elemental properties. Every atom has a series of characteristics and/or properties making it similar and/or dissimilar to another. It takes six values: M (mass), Q (charge), C (cardinality), E (electronegativity), G (group electronegativity), and H (number of attached hydrogen atoms).
• The distance operator (the seventh letter) implements both 2D and 3D approaches (topology and geometry). It takes two values: g (geometry), t (topology).
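The seven-letter naming scheme above can be sketched as a small lookup. The letter sets are copied from the list; the constant and function names are illustrative and are not taken from the MDF software.

```python
from math import prod

# One string of allowed letters per position, in descriptor-name order (letters 1 to 7).
POSITIONS = [
    ("linearizing operator", "IiAaLl"),
    ("superposing operator", "mMnNSAaBbPGgFfsHhIi"),
    ("fragmentation criterion", "mMDP"),
    ("interaction model", "RrMmDd"),
    ("interaction descriptor", "DdOoPpQqJjKkLlVEWwFfSsTt"),
    ("atomic property", "MQCEGH"),
    ("distance operator", "gt"),
]

ATOMIC_PROPERTY = {
    "M": "mass", "Q": "charge", "C": "cardinality", "E": "electronegativity",
    "G": "group electronegativity", "H": "number of attached hydrogen atoms",
}
DISTANCE = {"g": "geometry", "t": "topology"}

def family_size():
    """Number of formally possible descriptor names."""
    return prod(len(letters) for _, letters in POSITIONS)

def is_valid_name(name):
    """True when every letter is allowed at its position."""
    return len(name) == len(POSITIONS) and all(
        letter in letters for letter, (_, letters) in zip(name, POSITIONS)
    )

def decode(name):
    """Map each operator to its letter; the last two letters are also spelled out."""
    if not is_valid_name(name):
        raise ValueError(f"not an MDF descriptor name: {name!r}")
    parts = {label: letter for letter, (label, _) in zip(name, POSITIONS)}
    parts["atomic property"] = ATOMIC_PROPERTY[name[5]]
    parts["distance operator"] = DISTANCE[name[6]]
    return parts

print(family_size())
print(decode("ISDRTHg"))
```

The product of the seven list lengths gives the size of the formally possible family before any descriptor is discarded.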

The application of every piece of the mathematical model is a physical model, each model being able to take more than one value. The molecular descriptors family is generated following the calculation of 787968 (6 × 19 × 4 × 6 × 24 × 6 × 2) possibilities. Not all these possibilities have physical meaning (e.g. the logarithm of a negative number). Moreover, not all of them produce finite numbers (e.g. division by zero). For a given set of molecules, a descriptor can be degenerate relative to the set (having the same value for all the molecules in the set) or relative to another descriptor (two descriptors with different calculation formulas produce the same result for all molecules in the set). A bias procedure filters these descriptors out of the family of the set. Depending on the set, the number of MDF members is around 100000.

MDF software

The MDF software was created by using the triad: PHP (PHP: Hypertext Preprocessor, [40]), MySQL database [41], and FreeBSD server [42]. A set of programs completes the task of generating the molecular descriptors family. Two databases (one temporary, `MDFSARtmp`, for the sets being processed and one permanent, `MDFSARs`, for finalized sets) are stored on a FreeBSD server from IntraNet [IP 172.27.211.5] by using a MySQL database server. On December 25, 2007, `MDFSARtmp` had 174 tables (1.8 GB), and `MDFSARs` more than 300 tables (> 3.5 GB). Four tables were generated for each investigated set:

• `"NameSet"_tmpx` (787968/6 = 131328 records; fields: molecules; records: descriptors)
• `"NameSet"_data` (field: property/activity; records: molecules; the number of records is equal to the number of molecules)
• `"NameSet"_valx` (fields: molecules; records: descriptors; after bias it has about 100000 records)
• `"NameSet"_valy` (records: same number as in the `"NameSet"_valx` table; fields: M(X), M(X*X), M(X*Y), r2(X,Y), Name(X), where M = average operator, r2 = determination coefficient, Name = name of X, Y = property/activity, X = MDF member).

Note that the numeric fields of the `"NameSet"_valy` table are computed for multivariate regression purposes (significantly decreasing the execution time). The `0_MDFSARRes` table (one per database) contains all obtained MDF SARs (fields: name (of the set), eq (MDF SAR), r2 (determination coefficient), m (number of molecules), n (number of MDF members in the MDF SAR model)).
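One plausible reading of why averages such as M(X), M(X*X) and M(X*Y) speed up the multivariate search is that the determination coefficient of a two-descriptor model can be computed from averages alone, without refitting on the raw values. The sketch below illustrates this under stated assumptions: the data are hypothetical, the cross average M(X1*X2) is assumed to be computed for each candidate pair, and the dictionary keys are illustrative rather than actual field names.

```python
import numpy as np

def mean(values):
    """M = average operator over the molecules of the set."""
    return float(np.mean(values))

def r2_from_moments(m):
    """Determination coefficient of y = b1*x1 + b2*x2 + b0 from averages only."""
    s11 = m["x1x1"] - m["x1"] ** 2           # variance of x1
    s22 = m["x2x2"] - m["x2"] ** 2           # variance of x2
    s12 = m["x1x2"] - m["x1"] * m["x2"]      # covariance of x1 and x2
    s1y = m["x1y"] - m["x1"] * m["y"]        # covariance of x1 and y
    s2y = m["x2y"] - m["x2"] * m["y"]        # covariance of x2 and y
    syy = m["yy"] - m["y"] ** 2              # variance of y
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det       # normal equations, Cramer's rule
    b2 = (s2y * s11 - s1y * s12) / det
    return (b1 * s1y + b2 * s2y) / syy

# Hypothetical values of two MDF members (x1, x2) and a property (y) for six molecules.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 8.0])
y = np.array([1.2, 1.9, 3.4, 3.8, 5.1, 6.6])

moments = {
    "x1": mean(x1), "x2": mean(x2), "y": mean(y),
    "x1x1": mean(x1 * x1), "x2x2": mean(x2 * x2), "yy": mean(y * y),
    "x1y": mean(x1 * y), "x2y": mean(x2 * y), "x1x2": mean(x1 * x2),
}
r2_fast = r2_from_moments(moments)

# Cross-check against an ordinary least-squares fit on the raw values.
A = np.column_stack([x1, x2, np.ones_like(x1)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residual = y - A @ coef
r2_direct = 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)
print(r2_fast, r2_direct)
```

With the per-descriptor averages stored once, each candidate pair would need only one additional average, which is the kind of saving the note above refers to.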

A set of five PHP applications, running on a FreeBSD server from IntraNet [IP 172.27.211.4], generates the MDF:

• 0_mdf_prepare.php: creates the structure for the set tables using the name of the directory (for the set name) and the names of the files (for the molecule names).
• 1_mdf_generate.php: generates MDF for the set of interest, filling 131328 records in the `"NameSet"_tmpx` table for every molecule. It is a multitasking application, one task being executed for every molecule at the same time.
• 2_mdf_linearize.php: applies the linearizing operator (131328 × 6 = 787968) and fills the valid records (meaningful and finite) into the `"NameSet"_xval` and the `"NameSet"_yval` tables.
• 3_mdf_bias.php: sorts the records in memory by r2 and deletes the degenerations from both the `"NameSet"_xval` and the `"NameSet"_yval` tables simultaneously.
• 4_mdf_order.php: sorts the records in memory by r2 again, creates two temporary tables containing the records from `"NameSet"_xval` and `"NameSet"_yval` ordered by r2, deletes the old tables, and renames the new ones.
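The linearize and bias stages listed above can be sketched as follows. This is an illustration in Python rather than the PHP code of the MDF software: the function names are hypothetical, and the degeneration test assumes that two descriptors are treated as duplicates when their determination coefficients agree to nine decimal places.

```python
import numpy as np

def linearize(values):
    """Apply the six linearizing operators; keep only results finite for every molecule."""
    x = np.asarray(values, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        candidates = {
            "I": x,                    # identity
            "i": 1.0 / x,              # inverse
            "A": np.abs(x),            # absolute
            "a": 1.0 / np.abs(x),      # inverse of absolute
            "L": np.log(np.abs(x)),    # logarithm of absolute
            "l": np.log(x),            # logarithm
        }
    return {k: v for k, v in candidates.items() if np.all(np.isfinite(v))}

def bias(descriptors, y, digits=9):
    """Drop constant descriptors and descriptors that duplicate an already kept one."""
    kept, seen = {}, set()
    for name, v in descriptors.items():
        if v.max() == v.min():         # same value for all molecules in the set
            continue
        r2 = float(np.corrcoef(v, y)[0, 1] ** 2)
        key = round(r2, digits)
        if key in seen:                # same r2 as an already kept descriptor
            continue
        seen.add(key)
        kept[name] = r2
    return kept

# Hypothetical raw descriptor (one value per molecule) and measured property.
raw = [1.0, 2.0, 3.0, 4.0]
y = np.array([1.1, 2.0, 2.9, 4.2])
members = linearize(raw)
members["const"] = np.array([5.0, 5.0, 5.0, 5.0])
print(sorted(linearize([1.0, -2.0, 3.0, 4.0])))   # the plain logarithm is rejected
print(bias(members, y))
```

For a descriptor with only positive values, the identity and absolute variants coincide, so only one of them survives the bias step; this is the kind of degeneration the procedure is meant to remove.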

Our previous experience in working with a great number of molecular descriptors [22] indicated that the best found descriptor (the one which correlates the best with the measured property) is never found among the descriptors of the best found structure-activity relationship with two descriptors. Thus, MDF uses pairs of descriptors when searching for SAR models with more than one descriptor (natural selection). MDF uses a genetic algorithm for QSPR/QSAR modelling (genetic algorithms are a particular class of evolutionary algorithms, being categorized as global search heuristics [43]). The peculiarities of the genetic algorithm used are:

• Step 1 (inheritance and mutation). The linearization procedure described above is applied to the solution domain (2 × 6 × 24 × 6 × 4 × 19 MDF members having a genetic representation with six letters), so that every descendant is obtained from a parent (inheritance) through a transformation (mutation). The number of descendants obtained is six times higher than that of the parents. In this step, the fitness function is defined as having real and distinct values. Over half of the descendants die due to mutation (around 300000 descendants remain, which now have genetic representations with seven letters).
• Step 2 (selection). A bias procedure (selection) is applied to the solution domain (MDF descendants from Step 1). In this step, the fitness function is defined as having distinct first nine digits of the determination coefficient. Only around 100000 members pass the selection process. Another selection is made from the solution domain obtained: the best descriptor (the one that best correlates with the measured property).
• Step 3 (crossover). Pairs of MDF members are crossed over in order to obtain models with two descriptors. Two fitness functions are used here: “to have the best determination coefficient” and “to have the best cross-validation leave-one-out score”.
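The crossover step can be sketched as an exhaustive search over descriptor pairs scored by the two fitness functions. This is a simplified illustration, not the Delphi search programs of the MDF software: the data are hypothetical, and the leave-one-out score is assumed here to be 1 - PRESS/SS, whereas the chapter reports a cross-validated correlation coefficient.

```python
from itertools import combinations
import numpy as np

def _design(columns, n):
    """Design matrix: the descriptor columns plus an intercept column."""
    return np.column_stack(list(columns) + [np.ones(n)])

def r2_fit(columns, y):
    """Determination coefficient of the least-squares model."""
    A = _design(columns, len(y))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)

def q2_loo(columns, y):
    """Leave-one-out score: each molecule is predicted by a model fitted without it."""
    n = len(y)
    A = _design(columns, n)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def best_pairs(descriptors, y):
    """Return the best pair by r2 and the best pair by leave-one-out score."""
    scored = []
    for pair in combinations(descriptors, 2):
        columns = [descriptors[name] for name in pair]
        scored.append((pair, r2_fit(columns, y), q2_loo(columns, y)))
    return max(scored, key=lambda t: t[1]), max(scored, key=lambda t: t[2])

# Hypothetical family of three descriptors; the property depends on "a" and "b".
descriptors = {
    "a": np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
    "b": np.array([1.0, 0.0, 2.0, 1.0, 3.0, 0.0]),
    "c": np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0]),
}
noise = np.array([0.01, -0.02, 0.01, 0.02, -0.01, -0.01])
y = 2.0 * descriptors["a"] - 3.0 * descriptors["b"] + noise

by_r2, by_q2 = best_pairs(descriptors, y)
print("best by r2:", by_r2[0], "best by leave-one-out:", by_q2[0])
```

For a family of about 100000 members the number of pairs is on the order of 5 × 10^9, so a practical search would rely on precomputed averages or heuristics rather than refitting every pair as done here.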

Searching procedures for the uni- and multivariate models were created using Delphi client-server programs [44].

MDF structure-property/activity relationships discussed

More than one hundred and forty models are presented. Seventy properties or activities on different classes of compounds (thirty-one) have been investigated. The results are structured in two sections (properties - MDF SPR models, and activities - MDF SAR models) based on the investigated property/activity and the class of compounds. The univariate regression model is presented for each investigated class of compounds. The model with two descriptors is also presented whenever possible [45].

The statistics associated with the MDF model are expressed as the correlation coefficient (r), the standard error of the estimate (s), and the Fisher parameter of the model with its associated type I error (F(p)) at a significance level of 5%. The statistics obtained in the cross-validation leave-one-out analysis are expressed as the correlation coefficient (rcv-loo), the standard error of prediction (scv-loo), and the Fisher parameter of the predictive model with its associated type I error (Fcv-loo(p)) [34] at a significance level of 5%.

MDF SPR models

In order to justify the introduction of the molecular descriptors family on the structure-property relationships, a series of twenty properties on nine classes of compounds have been investigated. The results obtained are presented and discussed below.

Partition & activity coefficients

Volatile organic compounds - n-octanol/water partition coefficient

The analysis of the results reveals that the model with two descriptors performed better in terms of goodness-of-fit and cross-validation [34]. According to the model with two descriptors, the octanol/water partition coefficient of the investigated volatile organic compounds is of geometrical and topological nature and depends on the partial charge and the number of directly bonded hydrogens.

| | Model with one descriptor | Model with two descriptors |
|---|---|---|
| Sample size [reference] | 24 [46] | 24 [46] |
| MDF SPR Equation | ŷ=-0.004·x+2.09 | ŷ=0.65·x1-0.13·x2+3.99 |
| SPR Determination (%) | 53 | 81 |
| MDF Descriptor(s): x1 & x2 | ISDRTHg | LsPrDQt & IADRSHg |
| Dominant Atomic Property | Hydrogen (H) | Charge (Q) & Hydrogen (H) |
| Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry) |
| Interaction Model | H²·d⁻⁴ | d & H²·d⁻³ |
| Structure on Property Scale | Identity | Logarithmic & Identity |
| Model Statistics | r=0.7309; s=0.43; F(p)=25 (4.98·10⁻⁵) | r=0.9006; s=0.28; F(p)=45 (2.51·10⁻⁸) |
| Cross-Validation Leave-One-Out | rcv-loo=0.6792; scv-loo=0.47; Fcv-loo(p)=18 (3.00·10⁻⁴) | rcv-loo=0.8815; scv-loo=0.31; Fcv-loo(p)=36.41 (1.49·10⁻⁷) |
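The model statistics reported for the volatile organic compounds can be cross-checked with the standard relation between the correlation coefficient and the Fisher parameter of a linear regression, F = (r²/k) / ((1 - r²)/(n - k - 1)), where n is the sample size and k the number of descriptors. The sketch below assumes that this textbook relation is the one behind the reported F values; the function name is illustrative.

```python
def f_statistic(r, n, k):
    """Fisher parameter of a linear model with k descriptors fitted on n molecules."""
    r2 = r * r
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

n = 24  # sample size of the volatile organic compounds set

f_one = f_statistic(0.7309, n, 1)   # model with one descriptor
f_two = f_statistic(0.9006, n, 2)   # model with two descriptors
det_one = round(100 * 0.7309 ** 2)  # SPR determination (%) from r
det_two = round(100 * 0.9006 ** 2)

print(f"F(one descriptor) = {f_one:.1f}, determination = {det_one}%")
print(f"F(two descriptors) = {f_two:.1f}, determination = {det_two}%")
```

Both F values and both determinations agree with the reported statistics after rounding, which supports reading the SPR determination as the squared correlation coefficient expressed in percent.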

Para-substituted phenols - n-octanol/water partition coefficient

The octanol/water partition coefficient of para-substituted phenols depended directly on the geometry of the molecules and was related to partial charge and group electronegativity (see the model with two descriptors).

| | Model with one descriptor | Model with two descriptors |
|---|---|---|
| Sample size [reference] | 30 [47] | 30 [47] |
| MDF SPR Equation [reference] | ŷ=-923.42·x+4.25 | ŷ=0.003·x1-0.40·x2+1.07 [48] |
| SPR Determination (%) | 71 | 89 |
| MDF Descriptor(s): x1 & x2 | IsPdOQg | isDDkGg & IMmrKQg |
| Dominant Atomic Property | Charge (Q) | Group Electronegativity (G) & Charge (Q) |
| Interaction Via | Space (geometry) | Space (geometry) & Space (geometry) |
| Interaction Model | Q | G⁻²·d⁻¹ & Q⁻²·d⁻¹ |
| Structure on Property Scale | Identity | Inversed & Identity |
| Model Statistics | r=0.8412; s=0.60; F(p)=68 (5.83·10⁻⁹) | r=0.9457; s=0.37; F(p)=114 (6.65·10⁻¹⁴) |
| Cross-Validation Leave-One-Out | rcv-loo=0.8115; scv-loo=0.65; Fcv-loo(p)=53 (5.82·10⁻⁸) | |


Polychlorinated biphenyls - n-octanol/water partition coefficient

According to the model with two descriptors, the octanol/water partition coefficient of the investigated polychlorinated biphenyls is of geometrical nature and depends on the atomic and group electronegativity.

| | Model with one descriptor | Model with two descriptors |
|---|---|---|
| Sample size [reference] | 206 [49] | 206 [49] |
| MDF SPR Equation | ŷ=2552.7·x-14.22 | ŷ=-0.44·x1+0.04·x2+3.12 |
| SPR Determination (%) | 87 | 89 |
| MDF Descriptor(s): x1 & x2 | iBMmwHg | IIDDKGg & IHDRKEg |
| Dominant Atomic Property | Hydrogen (H) | Group Electronegativity (G) & Electronegativity (E) |
| Interaction Via | Space (geometry) | Space (geometry) & Space (geometry) |
| Interaction Model | H²·d⁻¹ | G²·d & E²·d |
| Structure on Property Scale | Inversed | Identity & Identity |
| Model Statistics | r=0.9347; s=0.30; F(p)=1410 (7.75·10⁻⁹⁴) | r=0.9433; s=0.28; F(p)=819 (1.71·10⁻⁹⁸) |

Organic pollutants - Soil/water partition coefficient normalized to organic carbon

The soil/water partition coefficient of the studied organic pollutants proved to be a topological property related to the number of directly bonded hydrogen atoms and the atomic electronegativity (see the equation with two descriptors).

| | Model with one descriptor | Model with two descriptors |
|---|---|---|
| Sample size [reference] | 8 [50] | 8 [50] |
| MDF SPR Equation | ŷ=-17.45·x+8.12 | ŷ=-0.22·x1-0.68·x2+16.62 |
| SPR Determination (%) | 90 | 98 |
| MDF Descriptor(s): x1 & x2 | IbPMtMt | lfDMWHt & IbmrTEt |
| Dominant Atomic Property | Mass (M) | Hydrogen (H) & Electronegativity (E) |
| Interaction Via | Bonds (topology) | Bonds (topology) & Bonds (topology) |
| Interaction Model | M²·d⁻⁴ | H²·d⁻¹ & E²·d⁻⁴ |
| Structure on Property Scale | Identity | Logarithmic & Identity |
| Model Statistics | r=0.9483; s=0.21; F(p)=54 (3.33·10⁻⁴) | r=0.9839; s=0.22; F(p)=26 (2.27·10⁻³) |
| Cross-Validation Leave-One-Out | | |



Fifteen standard amino acids - Partition (1st & 2nd columns) & activity coefficients (3rd column)

Fifteen standard amino acids were investigated: alanine (Ala), asparagine (Asn), aspartate (Asp), cysteine (Cys), glutamine (Gln), glutamate (Glu), glycine (Gly), isoleucine (Ile), leucine (Leu), lysine (Lys), methionine (Met), phenylalanine (Phe), serine (Ser), threonine (Thr), and valine (Val).

The partition coefficients of the 15 amino acids were studied on two different scales. In both situations (columns 1 and 2) the property was identified as being of geometrical nature and also dependent on the partial charge. The activity coefficient is of topological nature but it also depends on the partial charge.

Chromatographic parameters

Polychlorinated biphenyls - Relative retention time

The relative retention time of polychlorinated biphenyls proved to be of both topological and geometrical nature, and was related to the number of directly bonded hydrogens (see the model with two descriptors).

| | Model with one descriptor | Model with two descriptors |
|---|---|---|
| Sample size [reference] | 206 [49] | 206 [49] |
| MDF SPR Equation | ŷ=0.09·x-0.17 | ŷ=0.02·x1-1.02·x2-5.99 |
| SPR Determination (%) | 98 | 99.7 |
| MDF Descriptor(s): x1 & x2 | iIDRwHg | ISDmsHt & lADrtHg |
| Dominant Atomic Property | Hydrogen (H) | Hydrogen (H) & Hydrogen (H) |
| Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry) |
| Interaction Model | H²·d⁻¹ | H²·d⁻³ & H²·d⁻⁴ |
| Structure on Property Scale | Inversed | Identity & Logarithmic |
| Model Statistics | r=0.9921; s=0.02; F(p)=13013 (1.64·10⁻¹⁸⁹) | r=0.9986; s=0.01; F(p)=36600 (1.10·10⁻²⁶⁵) |




Polychlorinated biphenyls - Relative response factor

The MDF SPR abilities to estimate and predict the relative response factor are not strong, the SPR determination being lower than 70%. According to the model with two descriptors, the relative response factor of the polychlorinated biphenyls is of both topological and geometrical nature and strongly depends on the number of directly bonded hydrogens.

| | Model with one descriptor | Model with two descriptors |
|---|---|---|
| Sample size [reference] | 209 [49] | 209 [49] |
| MDF SPR Equation | ŷ=0.53·x-0.51 | ŷ=-357.3·x1+2.16·x2+5.08 |
| SPR Determination (%) | 63 | 69 |
| MDF Descriptor(s): x1 & x2 | iHMdTHg | imMrFHt & iHDdFHg |
| Dominant Atomic Property | Hydrogen (H) | Hydrogen (H) & Hydrogen (H) |
| Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry) |
| Interaction Model | H²·d⁻⁴ | H²·d⁻² & H²·d⁻² |
| Structure on Property Scale | Inversed | Inversed & Inversed |
| Model Statistics | r=0.7929; s=0.22; F(p)=351 (1.67·10⁻⁴⁶) | r=0.8324; s=0.20; F(p)=232 (9.59·10⁻⁵⁴) |




Organophosphorus herbicides - Retention chromatography index

The retention chromatography index of organophosphorus herbicides could be estimated and predicted by using the MDF approach. The SPR determination was higher than 90%. According to the model with two descriptors, the retention chromatography index is of topological and geometrical nature and depends on relative atomic mass and partial charge.

| | Model with one descriptor | Model with two descriptors |
|---|---|---|
| Sample size [reference] | 10 [53] | 10 [53] |
| MDF SPR Equation [reference] | ŷ=0.32·x-3.37 | ŷ=6.37·x1+0.06·x2-62.36 [23] |
| SPR Determination (%) | 94 | 99.92 |
| MDF Descriptor(s): x1 & x2 | IBPdqHg | lSDmwMt & iHPDEQg |
| Dominant Atomic Property | Hydrogen (H) | Mass (M) & Charge (Q) |
| Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry) |
| Interaction Model | 1/H√H | M²·d⁻¹ & M·d⁻² |
| Structure on Property Scale | Logarithmic | Logarithmic & Inversed |
| Model Statistics | r=0.9708; s=0.78; F(p)=131 (3.08·10⁻⁶) | r=0.9996; s=0.10; F(p)=4348 (1.47·10⁻¹¹) |




Molar refraction

Cyclic organophosphorus (1st & 2nd column) & fifteen standard amino acids (3rd column)

The MDF SPR approach performed very well: the determination coefficient is higher than or equal to 98%. The molar refraction proved to be of topological nature and related to the relative atomic mass and the atomic electronegativity (see the model with two descriptors obtained for the cyclic organophosphorus compounds). With an SPR determination of 98%, the molar refraction proved to be of geometrical nature and linearly related to the partial charge in the sample of standard amino acids.

| | Cyclic organophosphorus, one descriptor | Cyclic organophosphorus, two descriptors | Amino acids, one descriptor |
|---|---|---|---|
| Sample size [reference] | 10 [53] | 10 [53] | 15 [51] |
| MDF SPR Equation [reference] | ŷ=16.37·x-0.28 | ŷ=28.25·x1-83.97·x2+17.39 [54] | ŷ=-0.89·x+6.7 [52] |
| SPR Determination (%) | 99 | 100 | 98 |
| MDF Descriptor(s): x1 & x2 | iIMdsGg | lGDmSMt & lAmrfEt | lFMMwQg |
| Dominant Atomic Property | Group electronegativity (G) | Mass (M) & Electronegativity (E) | Charge (Q) |
| Interaction Via | Space (geometry) | Bonds (topology) & Bonds (topology) | Space (geometry) |
| Interaction Model | G²·d⁻³ | M²·d⁻³ & E²·d⁻² | Q²·d⁻¹ |
| Structure on Property Scale | Inversed | Logarithmic & Logarithmic | Logarithmic |
| Model Statistics | | r=0.9999; s=0.07; F(p)=83206 (4.44·10⁻¹⁶) | r=0.9892; s=1.13; F(p)=590 (3.22·10⁻¹²) |
| Cross-Validation Leave-One-Out | | | rcv-loo=0.9845; scv-loo=1.35; Fcv-loo(p)=409 (3.29·10⁻¹¹) |

Boiling point

Alkanes

Good determinations are obtained in the estimation and prediction of the boiling point of the investigated alkanes. The best performing model, the one with two descriptors, revealed that the boiling point of the investigated compounds is of topological nature and directly related to group electronegativity and the number of directly bonded hydrogens.

| | Model with one descriptor | Model with two descriptors |
|---|---|---|
| Sample size [reference] | 73 [55] | 73 [55] |
| MDF SPR Equation | ŷ=188.40·x-507.95 | ŷ=-67.45·x1+4.89·x2-129.20 |
| SPR Determination (%) | 99 | 99.8 |
| MDF Descriptor(s): x1 & x2 | lbMdsHg | lGDrtGt & IbDrfHt |
| Dominant Atomic Property | Hydrogen (H) | Group Electronegativity (G) & Hydrogen (H) |
| Interaction Via | Space (geometry) | Bonds (topology) & Bonds (topology) |
| Interaction Model | H²·d⁻³ | G²·d⁻⁴ & H²·d⁻² |
| Structure on Property Scale | Logarithmic | Inversed & Identity |
| Model Statistics | r=0.9956; s=3.81; F(p)=8050 (1.23·10⁻⁷⁵) | r=0.9991; s=1.75; F(p)=19361 (4.66·10⁻⁹⁹) |

Other properties

Fifteen standard amino acids - Magnetic susceptibility (1st column) & dipole moment (2nd column) & solubility (3rd column)

The magnetic susceptibility of the investigated standard amino acids proved to be of geometrical nature and linearly dependent on partial charge. The dipole moment and the solubility of the studied standard amino acids are of topological nature and depend on partial charge. For the last two properties (dipole moment and solubility) the model had the same molecular descriptor, but the performance was better for solubility than for the dipole moment.

| | Magnetic susceptibility | Dipole moment | Solubility |
|---|---|---|---|
| Sample size [reference] | 15 [51] | 15 [51] | 15 [51] |
| MDF SPR Equation [reference] | ŷ=-92.99·x+84.21 [52] | ŷ=-8.70·x+0.19 [52] | ŷ=-25.28·x+4.06 [52] |
| SPR Determination (%) | 91 | 79 | 87 |
| MDF Descriptor(s): x1 & x2 | iHMRqQg | IiDRLQt | IiDRLQt |
| Dominant Atomic Property | Charge (Q) | Charge (Q) | Charge (Q) |
| Interaction Via | Space (geometry) | Bonds (topology) | Bonds (topology) |
| Interaction Model | Q⁻¹(√Q)⁻¹ | d·Q | Q·d |
| Structure on Property Scale | Inversed | Identity | Identity |
| Model Statistics | | r=0.8885; s=0.50; F(p)=49 (9.61·10⁻⁶) | r=0.9338; s=1.08; F(p)=89 (3.63·10⁻⁷) |
| Cross-Validation Leave-One-Out | | | rcv-loo=0.9070; scv-loo=1.27; Fcv-loo(p)=60 (3.15·10⁻⁶) |

Fifteen standard amino acids - Hückel energy (1st column) & hydration energy (2nd column)

The Hückel energy of the investigated amino acids proved to be a geometrical property that depends on atomic electronegativity. Hydration energy proved to be of topological nature and related to partial charge. Note that the determination coefficient is higher than 90% in both models.

| | Hückel energy | Hydration energy |
|---|---|---|
| Sample size [reference] | 15 [33] | 15 [33] |
| MDF SPR Equation [reference] | ŷ=868.02·x-1417.5 [52] | ŷ=17.59·x+19.46 [52] |
| SPR Determination (%) | 99.7 | 93 |
| MDF Descriptor(s): x1 & x2 | lfPdkEg | iGPmLQt |
| Dominant Atomic Property | Electronegativity (E) | Charge (Q) |
| Interaction Via | Space (geometry) | Bonds (topology) |
| Interaction Model | E⁻²·d⁻¹ | Q·d |
| Structure on Property Scale | Logarithmic | Inversed |
| Model Statistics | r=0.9983; s=235; F(p)=3915 (1.53·10⁻¹⁸) | r=0.9646; s=0.86; F(p)=174 (6.74·10⁻⁹) |
| Cross-Validation Leave-One-Out | rcv-loo=0.9979; scv-loo=264; Fcv-loo(p)=3089 (1.11·10⁻¹⁶) | |


Fifteen standard amino acids - Polarizability (1st column) & refractivity (2nd column)

The polarizability and refractivity of the investigated standard amino acids were estimated by the same molecular descriptor. Thus, both properties are of geometrical nature and directly related to atomic electronegativity.

Sample size [reference] | 15 [33] | 15 [33]
MDF SPR Equation [reference] | ŷ=36.97·x-4.84 [52] | ŷ=93.72·x-13.09 [52]
SPR Determination (%) | 98 | 97
MDF Descriptor(s) | iIMdWEg | iIMdWEg
Dominant Atomic Property | Electronegativity (E) | Electronegativity (E)
Interaction Via | Space (geometry) | Space (geometry)
Interaction Model | E^2·d^-1 | E^2·d^-1
Structure on Property Scale | Inversed | Inversed
Model Statistics | r=0.9883; s=0.46; F (p)=546 (5.32·10^-12) | r=0.9862; s=1.27; F (p)=462 (1.53·10^-11)




MDF-SAR models

The abilities of the molecular descriptors family approach have been investigated on twenty-one samples of biologically active compounds. A total of seventy-three activities were investigated on different classes of compounds.

Water activated carbon adsorption

Organic compounds

The water activated carbon adsorption on the investigated organic compounds proved to be of topological and geometrical nature and related to the number of directly bonded hydrogens and partial charge (see the model with two descriptors). The power of determination of the activity is good, close to optimum (98%).

Sample size [reference] | 16 [56]
MDF SAR Equation [reference] | ŷ=-57.99·x+1.99 [57] | ŷ=0.85·x1+0.003·x2+2.58 [57]
SAR Determination (%) | 86 | 98
MDF Descriptor(s): x1 & x2 | iSDrDQt | IiMMWHt & lPMDVQg
Dominant Atomic Property | Charge (Q) | Hydrogen (H) & Charge (Q)
Interaction Via | Bonds (topology) | Bonds (topology) & Space (geometry)
Interaction Model | d | H^2·d^-1 & Q·d^-1
Structure on Activity Scale | Inversed | Identity & Logarithmic
Model Statistics | r=0.9270; s=0.13; F (p)=86 (2.43·10^-7) | r=0.9905; s=0.05; F (p)=337 (6.30·10^-12)

Hydrophobic vs. hydrophilic character

The hydrophobic or hydrophilic character, which is an important property in protein structure and protein-protein interactions, is one of the most studied properties of the amino acids. Many hydrophobicity scales have already been reported (see below). The differences between scales are significant: Janin (1979) and Kyte & Doolittle (1982) classified cysteine as the most hydrophobic while Wolfenden et al. [58] and Rose et al. [59] did not. These differences could be explained by the fundamentally different methods used for constructing the scales.

Fifteen standard amino acids – Hydrophobicity

The hydrophobicity on the Bumble, Hessa et al. and Kyte & Doolittle scales proved to be of geometrical nature and directly related to partial charge on the sample of fifteen standard amino acids (see the table below). Except for the Bumble scale, the MDF models obtained had a determination coefficient higher than 90%.


Sample size [reference] | 15 [51] | 15 [60] | 15 [61]
MDF SAR Equation [reference] | ŷ=-160·x-0.07 [52] | ŷ=8.5·x-0.58 [52] | ŷ=-21·x+12 [52]
SAR Determination (%) | 65 | 90.5 | 95
MDF Descriptor(s) | AbmrEQg | iMDRoQg | IGDROQg
Dominant Atomic Property | Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model | Q·d^2 | Q^-1 | Q
Structure on Activity Scale | Proportional | Inversed | Proportional
Model Statistics | | r=0.9514; s=0.44; F (p)=124 (5.05·10^-8) | r=0.9759; s=0.71; F (p)=260 (5.66·10^-10)





Twenty standard amino acids – Hydrophobicity

The sample of twenty standard amino acids used for the following MDF SAR models contains: alanine (Ala), arginine (Arg), asparagine (Asn), aspartate (Asp), cysteine (Cys), glutamine (Gln), glutamate (Glu), glycine (Gly), histidine (His), isoleucine (Ile), leucine (Leu), lysine (Lys), methionine (Met), phenylalanine (Phe), proline (Pro), serine (Ser), threonine (Thr), tryptophan (Trp), tyrosine (Tyr), and valine (Val).

Sample size [reference] | 20 [62] | 20 [63] | 20 [64]
MDF SAR Equation [reference] | ŷ=0.39·x-1.23 [65] | ŷ=-27.79·x+6.55 [65] | ŷ=-1.73·x-2.88 [65]
SAR Determination (%) | 44 | 66 | 69
MDF Descriptor(s) | amMRLQt | immRoQg | LmDROQg
Dominant Atomic Property | Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry) | Space (geometry)
Interaction Model | Q·d | Q^-1 | Q
Structure on Activity Scale | Inversed | Inversed | Logarithmic
Model Statistics | | r=0.8163; s=2.19; F (p)=6 (1.14·10^-5) | r=0.8309; s=1.70; F (p)=40 (5.70·10^-6)

rcv-loo=0.7936; scv-loo=1.87; Fcv-loo (p)=30 (3.34·10^-5)



Two of the three models presented above showed that the activity was of geometrical nature and depended on the partial charge. None of these models was strong; the coefficient of determination was lower than 70% in each case.

Twenty standard amino acids – Hydrophobicity

The MDF SAR model for hydrophobicity on the Wimley & White scale is of topological nature and depends on partial charge. The hydrophobicity models on the Hopp & Woods and Cowan & Whittaker scales proved to be of geometrical nature and depended on partial charge. None of these models was strong, although all the coefficients of determination were higher than 50%.

Sample size [reference] | 20 [66] | 20 [67] | 20 [68]
MDF SAR Equation [reference] | ŷ=7.35·x-3.37 [65] | ŷ=10.63·x-1.99 [65] | ŷ=-6.57·x+1.47 [65]
SAR Determination (%) | 71 | 74 | 75
MDF Descriptor | iBmrWQt | iMPRoQg | AmDROQg
Dominant Atomic Property | Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry) | Space (geometry)
Interaction Model | Q^2/d | Q^-1 | Q
Structure on Activity Scale | Inversed | Inversed | Absolute
Model Statistics | r=0.8434; s=0.48; F (p)=44 (3.00·10^-6) | r=0.8608; s=1.01; F (p)=52 (1.11·10^-6) | r=0.8661; s=0.6; F (p)=54 (1.15·10^-8)

rcv-loo=0.8288; scv-loo=1.11; Fcv-loo (p)=39 (6.49·10^-6)


Twenty standard amino acids – Hydrophobicity

Hydrophobicity on the Manavalan & Ponnuswamy, the Fauchere et al., and the Rao & Argos scales proved to be of geometrical nature and linearly dependent on partial charge. All three MDF SAR models had only moderate coefficients of determination (values slightly above 75%). At this level there is still an important linear relationship between the molecular descriptor and the hydrophobic or hydrophilic character.

Sample size [reference] | 20 [69] | 20 [70] | 20 [71]
MDF SAR Equation [reference] | ŷ=23.43·x+14.55 [65] | ŷ=5.94·x-4.36 [65] | ŷ=-2.73·x+1.43 [65]
SAR Determination (%) | 78 | 78 | 79
MDF Descriptor | inMrpQg | ibDRPQg | AmDROQg
Dominant Atomic Property | Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model | Q^-2 | Q^2 | Q
Structure on Activity Scale | Inversed | Inversed | Proportional

r=0.8832; s=0.50; F (p)=65 (2.50·10^-7)
r=0.8901; s=0.24; F (p)=69 (1.48·10^-7)

Twenty standard amino acids – Hydrophobicity

All three models showed that the hydrophobic or hydrophilic character was of geometrical nature and depended on partial charge. The determination was fairly good in these models (greater than 80%).

Sample size [reference] | 20 [72] | 20 [73] | 20 [59]
MDF SAR Equation [reference] | ŷ=1.74·x+0.86 [65] | ŷ=-3.78·x+5.30 [65] | ŷ=1.75·x+0.86 [65]
SAR Determination (%) | 81 | 85 | 90
MDF Descriptor | inMrpQg | IAmrLQg | inMrpQg
Dominant Atomic Property | Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model | Q^-2 | Q·d | Q^2·d^-3
Structure on Activity Scale | Inversed | Identity | Inversed

r=0.9208; s=0.80; F (p)=100 (8.69·10^-9)
r=0.8974; s=0.05; F (p)=74 (8.21·10^-8)







Twenty standard amino acids – Hydrophobicity

The hydrophobicity on the Urry scale proved to be of topological nature and linearly dependent on the atomic electronegativity of the standard amino acids studied. The hydrophobicity on the Engelman et al. and on the Eisenberg et al. scales proved to be of geometrical nature and directly dependent on the partial charge. The determination was rather good (82-83%).

Sample size [reference] | 20 [74] | 20 [75] | 20 [76]
MDF SAR Equation [reference] | ŷ=-11.96·x-29.73 [65] | ŷ=-753.09·x+1.85 [65] | ŷ=-0.92·x+1.68 [65]
SAR Determination (%) | 82 | 83 | 83
MDF Descriptor | iBDMkEt | INPrWQg | IAMdKQg
Dominant Atomic Property | Electronegativity (E) | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry) | Space (geometry)
Interaction Model | Q^-2·d^-1 | Q^2·d^-1 | Q^2·d
Structure on Activity Scale | Inversed | Logarithmic | Logarithmic
Model Statistics | | r=0.9116; s=2.07; F (p)=89 (2.26·10^-8) | r=0.9128; s=0.42; F (p)=90 (2.02·10^-8)

rcv-loo=0.8731; scv-loo=2.56; Fcv-loo (p)=51 (1.13·10^-6)


Twenty standard amino acids – Hydrophobicity

The hydrophobicity on the Cowan et al. scale is of geometrical nature and depends on charge. The same is valid for the model with fifteen amino acids. This observation is also true for the hydrophobicity on the Hessa et al. scale, the determination being lower than in the model obtained on the sample of fifteen amino acids. All models showed that the hydrophobic or hydrophilic character was of geometrical nature and depended on the partial charge.

Sample size [reference] | 20 [68] | 20 [77] | 20 [60]
MDF SAR Equation [reference] | ŷ=-2.16·x+4.64 [65] | ŷ=817.95·x+81.72 [65] | ŷ=7.18·x-0.41 [65]
SAR Determination (%) | 84 | 85 | 85
MDF Descriptor(s) | lbmdKQg | inMrpQg | AmDROQg
Dominant Atomic Property | Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model | Q^2·d | Q^-2 | Q
Structure on Activity Scale | Logarithmic | Inversed | Proportional
Model Statistics | | r=0.9232; s=20.73; F (p)=104 (6.69·10^-9) | r=0.9238; s=0.32; F (p)=105 (6.24·10^-9)

Twenty standard amino acids – Hydrophobicity

Hydrophobicity is of geometrical nature and depends on charge. In these models, the determination is slightly better than in the models above for the sample of twenty standard amino acids.


Sample size [reference] | 20 [78] | 20 [79] | 20 [61]
MDF SAR Equation [reference] | ŷ=-0.20·x+1.36 [65] | ŷ=1.85·x+11.05 [65] | ŷ=19.17·x-7.60 [65]
SAR Determination (%) | 85 | 86 | 87
MDF Descriptor(s) | iIPmLQt | lfPROQg | IiDRLQt
Dominant Atomic Property | Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry) | Bonds (topology)
Interaction Model | Q·d | Q | d·Q
Structure on Activity/Property Scale | Inversed | Logarithmic | Identity
Model Statistics | r=0.9252; s=0.36; F (p)=107 (5.30·10^-9) | r=0.9259; s=2.46; F (p)=108 (4.88·10^-9) | r=0.9328; s=1.11; F (p)=120 (2.10·10^-9)







Twenty standard amino acids – Hydrophobicity

The best determination on the sample of twenty amino acids was obtained for the Black et al. scale. On the Monera et al. scale the determination was better, but the sample contained nineteen amino acids (proline was excluded from the generation of molecular descriptors). As an overall conclusion, the hydrophobicity of amino acids is of geometrical nature and depends on partial charge.

Sample size [reference] | 20 [80] | 19 [81] (- Proline)
MDF SAR Equation [reference] | ŷ=-0.96·x+0.86 [65] | ŷ=843.88·x+86.05 [65]
SAR Determination (%) | 88 | 90
MDF Descriptor(s) | lAmrLQg | inMrpQg
Dominant Atomic Property | Charge (Q) | Charge (Q)
Interaction Via | Space (geometry) | Space (geometry)
Interaction Model | d·√Q | Q^-2
Structure on Activity Scale | Proportional | Inversed
Model Statistics | | r=0.9504; s=16.49; F (p)=159 (4.77·10^-10)




Toxicity

Polychlorinated organic compounds

The toxicity of the investigated polychlorinated organic compounds proved to be of geometrical nature and related to partial charge, both in the univariate model and in the MDF model with two descriptors.

Sample size [reference] | 31 [82]
MDF SAR Equation | ŷ=-9.06·x+4.00 | ŷ=-8.33·x1+0.28·x2+0.83
SAR Determination (%) | 72 | 87
MDF Descriptor(s): x1 & x2 | IsMRKQg | IsMRKQg & AHPROQg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | Q^-2·d^-1 | Q^-2·d^-1 & Q
Structure on Activity Scale | Identity | Identity & Absolute
Model Statistics | r=0.8514; s=0.30; F (p)=76 (1.27·10^-9) | r=0.9318; s=0.21; F (p)=92 (4.79·10^-13)
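For models with two descriptors, the significance level attached to F can be reproduced without statistical tables, because the upper tail of the Fisher distribution with 2 numerator degrees of freedom has a closed form: p = (1 + 2F/ν)^(−ν/2), with ν = n − 3 residual degrees of freedom. The sketch below is a minimal check under the assumption that the reported p is this upper-tail probability of a least-squares model with intercept; the function name is illustrative.

```python
def p_value_two_descriptors(r: float, n: int) -> float:
    """Upper-tail p of F(2, n-3) for a two-descriptor least-squares model (assumed)."""
    r2 = r * r
    dof = n - 3  # residual degrees of freedom: n samples, 2 slopes, 1 intercept
    f = (r2 / 2.0) / ((1.0 - r2) / dof)
    return (1.0 + 2.0 * f / dof) ** (-dof / 2.0)


if __name__ == "__main__":
    # Two-descriptor model in the table above: r = 0.9318, n = 31 (reported p: 4.79e-13).
    print(f"{p_value_two_descriptors(0.9318, 31):.2e}")
```

Since r is only given to four decimals, the result is expected to agree with the tabulated p in order of magnitude and leading digit rather than exactly.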




Mono-substituted nitrobenzenes

The toxicity of the investigated mono-substituted nitrobenzenes is of both geometrical and topological nature. It also depends on group electronegativity and partial charge. Ninety-six percent of variation in toxicity could be explained by the linear relationship with two molecular descriptors.

Sample size [reference] | 39 [83]
MDF SAR Equation | ŷ=-91.15·x+6.27 | ŷ=-92.37·x1-7.28·x2+6.37
SAR Determination (%) | 60 | 60
MDF Descriptor(s): x1 & x2 | IBMrkGg | IBMrkGg & IsPmVQt
Dominant Atomic Property | Group Electronegativity (G) | Group Electronegativity (G) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model | G^-2·d^-1 | E^-2·d^-1 & Q·d^-1
Structure on Activity Scale | Identity | Identity & Identity
Model Statistics | | r=0.7739; s=0.35; F (p)=27 (7.33·10^-8)




Benzene derivatives

The toxicity of the investigated benzene derivatives proved to be of topological and geometrical nature. It also depended on partial charge and on the number of directly bonded hydrogens.

Sample size [reference] | 69 [84]
MDF SAR/SPR Equation | ŷ=-0.91·x+2.92 | ŷ=-9.66·x1+1.00·x2+3.25
SAR/SPR Determination (%) | 68 | 87
MDF Descriptor(s): x1 & x2 | lFPdoGg | ABmrsQg & iGPrfHt
Dominant Atomic Property | Group Electronegativity (G) | Charge (Q) & Hydrogen (H)
Interaction Via | Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model | G^-1 | Q^2·d^-3 & H^2·d^-2
Structure on Activity Scale | Logarithmic | Absolute & Inversed
Model Statistics | | r=0.9331; s=0.28; F (p)=222 (1.48·10^-30)




Alkyl metal compounds

The toxicity of alkyl metal compounds proved to be of geometrical nature and depended on the relative atomic mass as well as on the partial charge. The MDF model was very good, its determination coefficient being close to 1.

Sample size [reference] | 10 [85]
MDF SAR Equation [reference] | ŷ=0.25·x+0.33 | ŷ=28.06·x1+0.08·x2+2.80 [86]
SAR Determination (%) | 97 | 99.7
MDF Descriptor(s): x1 & x2 | iFDmdCg | IbMmpMg & LPPROQg
Dominant Atomic Property | Cardinality (C) | Mass (M) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | d^-1 | M^-2 & Q
Structure on Activity Scale | Inversed | Identity & Logarithmic
Model Statistics | | r=0.9988; s=0.06; F (p)=1473 (6.49·10^-10)




Para-substituted phenols – Toxicity

A good model was obtained for the toxicity of the para-substituted phenols. This model showed that the toxicity was of topological and geometrical nature and was related to partial charge.

Sample size [reference] | 30 [47]
MDF SAR Equation | ŷ=-603.71·x+1.74 | ŷ=0.04·x1-0.22·x2-2.26
SAR Determination (%) | 71 | 90
MDF Descriptor(s): x1 & x2 | IsPdOQg | ASMmVQt & lfDdOQg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | Q | Q·d^-1 & Q
Structure on Activity Scale | Identity | Absolute & Logarithmic
Model Statistics | r=0.8458; s=0.38; F (p)=70 (4.01·10^-9) | r=0.9472; s=0.23; F (p)=118 (4.56·10^-14)




Para-substituted phenols - Relative toxicity

The relative toxicity of the investigated para-substituted phenols proved to be of geometrical nature and linearly dependent on the partial charge and the number of directly bonded hydrogens. The model with two descriptors could be considered good, its determination coefficient being eighty-five percent.

Sample size [reference] | 30 [47]
MDF SAR Equation | ŷ=4.50·x+6.59 | ŷ=-1.76·x1-1.42·x2+12.14
SAR Determination (%) | 68 | 85
MDF Descriptor(s): x1 & x2 | InDDoQg | AHMMVQg & inDmwHg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Hydrogen (H)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | Q^-1 | Q·d^-1 & Q^2·d^-1
Structure on Activity Scale | Identity | Absolute & Inversed
Model Statistics | r=0.8275; s=0.54; F (p)=61 (1.71·10^-8) | r=0.9225; s=0.38; F (p)=77 (6.87·10^-12)




Quinoline – Cytotoxicity

The model with two descriptors obtained in the investigation of cytotoxicity in the quinoline sample showed that the activity was of topological nature and depended on relative atomic mass and partial charge. The model was able to predict the activity successfully (the determination coefficient was close to 1). The prediction power was supported by the leave-one-out cross-validation score, which had a value of 0.9805 (see the model with two descriptors).

Sample size [reference] | 15 [87]
MDF SAR Equation [reference] | ŷ=-6.58·x+4.63 | ŷ=8.35·x1+1.96·x2-4.49 [88]
SAR Determination (%) | 65 | 98
MDF Descriptor(s): x1 & x2 | iHPMdCg | INDRLQt & lHPmTMt
Dominant Atomic Property | Cardinality (C) | Charge (Q) & Mass (M)
Interaction Via | Space (geometry) | Bonds (topology) & Bonds (topology)
Interaction Model | d^-1 | Q·d & M^2·d^-4
Structure on Activity Scale | Inversed | Identity & Inversed
Model Statistics | | r=0.9882; s=0.17; F (p)=250 (1.65·10^-10)
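The leave-one-out score quoted above can be illustrated with a short sketch. It assumes that the score is the correlation between the measured activities and the values predicted for each compound by a model refitted without that compound; the source does not spell out this definition. For brevity the sketch uses a single descriptor, and the descriptor and activity values are hypothetical toy data, not taken from the quinoline set.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares slope and intercept of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def pearson_r(us, vs):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    return suv / sqrt(suu * svv)


def loo_predictions(xs, ys):
    """Predict each point from a line fitted to all the other points."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a * xs[i] + b)
    return preds


def r_cv_loo(xs, ys):
    """Correlation between observed values and leave-one-out predictions (assumed definition)."""
    return pearson_r(ys, loo_predictions(xs, ys))


if __name__ == "__main__":
    # Hypothetical descriptor values and activities, for illustration only.
    x = [0.1, 0.4, 0.5, 0.9, 1.3, 1.6]
    y = [1.2, 1.9, 2.3, 2.8, 3.9, 4.1]
    print(f"r = {pearson_r(x, y):.4f}, r_cv-loo = {r_cv_loo(x, y):.4f}")
```

A cross-validated score close to the fitted r, as reported for the quinoline model, suggests that no single compound dominates the regression.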




Quinoline – Mutagenicity

The model with two descriptors demonstrated that mutagenicity was of geometrical nature and strongly dependent on partial charge. The estimation power of the model was very close to 1.

MDF SAR Equation [reference] | ŷ=0.008·x-4.14 | ŷ=0.21·x1+0.09·x2-1.57 [88]
SAR Determination (%) | 72 | 96
MDF Descriptor(s): x1 & x2 | aAmrKQt | lNMrSQg & ASPrVQg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry) & Space (geometry)
Interaction Model | Q^2·d | Q^2·d^-3 & Q·d^-1
Structure on Activity Scale | Inversed | Logarithmic & Proportional
Model Statistics | | r=0.9782; s=0.18; F (p)=122 (3.12·10^-8)






Insecticidal & herbicidal activities

Neonicotinoids - Insecticidal Activity

The model with two descriptors had a very good estimation power. According to this model the activity was of both geometrical and topological nature. It also depended directly on atomic electronegativity and partial charge.

MDF SAR Equation [reference] | ŷ=-77.49·x+19.15 | ŷ=-2.21·x1+3.74·x2+43.34 [90]
SAR Determination (%) | 88 | 99.9
MDF Descriptor(s): x1 & x2 | iIDrSMg | ImMdsEg & lIMMFQt
Dominant Atomic Property | Mass (M) | Electronegativity (E) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model | M^2·d^-3 | E^2·d^-3 & Q^2·d^-2
Structure on Activity Scale | Inversed | Identity & Inversed
Model Statistics | r=0.9370; s=0.48; F (p)=43 (5.95·10^-4) | r=0.9996; s=0.04; F (p)=2865 (2.24·10^-8)




Substituted triazines (Triazines) - Herbicidal activity

A very good MDF model for estimation and prediction was obtained for the herbicidal activity of substituted triazines. The herbicidal activity was of both geometrical and topological nature and depended on the number of directly bonded hydrogens and partial charge.

Sample size [reference] | 30
MDF SAR Equation [reference] | ŷ=-4284.7·x+7.47 | ŷ=-8112.2·x1+194.35·x2+5.52
SAR Determination (%) | 95 | 98
MDF Descriptor(s): x1 & x2 | iSDRFHg | iSMMWHg & iSMmEQt
Dominant Atomic Property | Hydrogen (H) | Hydrogen (H) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model | H^2·d^2 |
Structure on Activity Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.9754; s=0.16; F (p)=549 (2.18·10^-20) | r=0.9876; s=0.13; F (p)=533 (1.37·10^-23)






Therapeutic activities

3-Indolyl derivatives - Antioxidant efficacy

The antioxidant efficacy of the investigated 3-indolyl derivatives proved to be of geometrical nature and depended on partial charge (see the model with one descriptor). As expected, the model with two descriptors gave better results (determination coefficient of 99.9%), but this model is close to the limit according to the Hawkins criteria [45].

Sample size [reference] | 8 [93]
MDF SAR Equation [reference] | ŷ=-1.34·10^-5·x-3.76 | ŷ=-1.10·x1-33.24·x2+7.18 [94]
SAR Determination (%) | 90 | 99.9
MDF Descriptor(s): x1 & x2 | asMmtQg | lbPMkHg & iAPrVGt
Dominant Atomic Property | Charge (Q) | Hydrogen (H) & Group Electronegativity (G)
Interaction Via | Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model | Q^2·d^-4 | H^-2·d^-1 & G·d^-1
Structure on Activity Scale | Inversed | Logarithmic & Inversed
Model Statistics | r=0.9508; s=0.21; F (p)=56 (2.87·10^-4) | r=0.9999; s=0.01; F (p)=12591 (5.55·10^-10)




Substituted N 4-methoxyphenyl benzamides - Antiallergic activity

The model with two descriptors indicated that the antiallergic activity of the investigated substituted N 4-methoxyphenyl benzamides was of both geometrical and topological nature and depended on group electronegativity and relative atomic mass.

Sample size [reference] | 23 [95]
MDF SAR/SPR Equation | ŷ=0.03·x+0.21 | ŷ=-6.8·10^-5·x1-1.3·10^-6·x2+0.18
SAR/SPR Determination (%) | 92 | 98
MDF Descriptor(s): x1 & x2 | InPrdQg | IFDDpGg & ISDrFMt
Dominant Atomic Property | Charge (Q) | Group Electronegativity (G) & Mass (M)
Interaction Via | Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model | d^-1 | G^2 & M^2·d^-2
Structure on Activity Scale | Identity | Identity & Identity
Model Statistics | | r=0.9920; s=0.15; F (p)=616 (4.87·10^-20)






Polyhydroxyxanthones - Antituberculotic activity

The antituberculotic activity of the investigated polyhydroxyxanthones was well estimated and predicted by the model with two descriptors, whose estimation power was 99.7%. The activity was of geometrical nature and depended on partial charge and group electronegativity.

MDF SAR Equation [reference] | ŷ=-0.07·x+9.74 | ŷ=2.32·x1+19.34·x2-19.11 [97]
SAR Determination (%) | 82 | 99.7
MDF Descriptor(s): x1 & x2 | isDDoHg | lHPDOQg & IsMRKGg
Dominant Atomic Property | Hydrogen (H) | Charge (Q) & Group Electronegativity (G)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | H^-1 | Q & G^-2·d^-1
Structure on Activity Scale | Inversed | Logarithmic & Identity
Model Statistics | r=0.9082; s=0.23; F (p)=38 (2.78·10^-4) | r=0.9987; s=0.03; F (p)=1327 (9.33·10^-10)




Taxoids - Growth inhibition activity

A model with an estimation power of 92% was obtained during the investigation of the taxoids. According to the model with two descriptors, the growth inhibition activity of the investigated taxoids was of geometrical nature and was related to the number of directly bonded hydrogens.

MDF SAR Equation [reference] | ŷ=0.89·x-8.23 | ŷ=0.002·x1+77.22·x2-17.7 [99]
SAR Determination (%) | 83 | 92
MDF Descriptor(s): x1 & x2 | IHDrFHt | isMdTHg & IiDrQHg
Dominant Atomic Property | Hydrogen (H) | Hydrogen (H) & Hydrogen (H)
Interaction Via | Bonds (topology) | Space (geometry) & Space (geometry)
Interaction Model | H^2·d^-2 | H^2·d^-4 & H·√H
Structure on Activity Scale | Identity | Inversed & Identity
Model Statistics | r=0.9108; s=0.51; F (p)=156 (7.75·10^-14) | r=0.9583; s=0.36; F (p)=174 (2.86·10^-18)






HEPTA and TIBO derivatives - anti-HIV-1 potencies

The anti-HIV-1 potencies of the investigated HEPTA and TIBO derivatives proved to be of geometrical nature and related to atomic and group electronegativity. The estimation power of the model was modest, but increasing the number of molecular descriptors to five provided a significantly better model [101].

Sample size [reference] | 57 [100]
MDF SAR Equation | ŷ=-3776.8·x+8.29 | ŷ=-19.43·x1+11.07·x2-4.30
SAR Determination (%) | 61 | 78
MDF Descriptor(s): x1 & x2 | imMDPMg | lIDrFEg & iMMsGg
Dominant Atomic Property | Mass (M) | Electronegativity (E) & Group Electronegativity (G)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | M^2 | E^2·d^-2 & G^2·d^-3
Structure on Activity Scale | Inversed | Logarithmic & Logarithmic
Model Statistics | r=0.7812; s=0.95; F (p)=86 (7.58·10^-13) | r=0.8849; s=0.71; F (p)=97 (5.77·10^-19)




Substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides (40846_1) - Inhibition activity on carbonic anhydrase I

The inhibition activity on carbonic anhydrase I of the substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides proved to be of geometrical nature and depended on partial charge and relative atomic mass according to the model with two descriptors. When the number of descriptors was increased, a model whose estimation power was 10% higher than that of the model with two descriptors was obtained [103].

Sample size [reference] | 40 [102]
MDF SAR Equation | ŷ=-0.008·x+0.66 | ŷ=0.11·x1+3.10·10^-3·x2+1.74
SAR Determination (%) | 63 | 81
MDF Descriptor(s): x1 & x2 | isMRdQt | inPRlQg & lPDMqMg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Mass (M)
Interaction Via | Bonds (topology) | Space (geometry) & Space (geometry)
Interaction Model | d^-1 | Q^-2·d^-1 & M·√M
Structure on Activity Scale | Inversed | Inversed & Logarithmic
Model Statistics | r=0.7927; s=0.33; F (p)=64 (1.09·10^-9) | r=0.8975; s=0.24; F (p)=77 (6.95·10^-14)






Substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides (40846_2) - Inhibition activity on carbonic anhydrase II

The inhibition activity on carbonic anhydrase II of the substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides was of geometrical nature and depended on cardinality and partial charge (see the model with two descriptors). When the investigation continued, a more powerful model was obtained (a model with four descriptors whose estimation power was 11% higher than that of the model with two descriptors [104]).

Sample size [reference] | 40 [102]
MDF SAR Equation | ŷ=-2.49·10^-3·x+1.43 | ŷ=2.44·x1+0.09·x2-4.45
SAR Determination (%) | 55 | 79
MDF Descriptor(s): x1 & x2 | lPmrSMg | imDdSCg & iiMrqQg
Dominant Atomic Property | Mass (M) | Cardinality (C) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | M^2·d^-3 | C^2·d^-3 & Q·√Q
Structure on Activity Scale | Logarithmic | Inversed & Inversed
Model Statistics | r=0.7422; s=0.35; F (p)=47 (4.21·10^-8) | r=0.8862; s=0.25; F (p)=68 (4.36·10^-13)




Substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides (40846_4) - Inhibition activity on carbonic anhydrase IV

The analysis of the model with two descriptors showed that the inhibition activity on carbonic anhydrase IV of the substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides was of geometrical and topological nature and depended on partial charge. The estimation power of the model was moderate, but increasing the number of descriptors to four raised the estimation power by 16% [105].

Sample size [reference] | 40 [102]
MDF SAR Equation | ŷ=-0.006·x+0.40 | ŷ=0.11·x1+9.98·10^-9·x2+0.80
SAR Determination (%) | 56 | 75
MDF Descriptor(s): x1 & x2 | iAPmsQt | inPRlQg & iHMMTQt
Dominant Atomic Property | Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry) & Bonds (topology)
Interaction Model | Q^2·d^-3 | Q^-1·d^-1 & Q^2·d^-4
Structure on Activity Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.7455; s=0.36; F (p)=48 (3.41·10^-8) | r=0.8672; s=0.27; F (p)=56 (6.21·10^-12)
Leave-one-out cross-validation | | rcv-loo=0.8536; scv-loo=0.29; Fcv-loo (p)=0.49 (3.54·10^-11)

Dipeptides - Inhibition activity

A model with moderate estimation power was obtained when the inhibition activity of dipeptides was investigated. The model with two descriptors showed that the activity was of both topological and geometrical nature and depended on relative atomic mass and the number of directly bonded hydrogens.

Sample size [reference] | 58 [106]
MDF SAR Equation | ŷ=2.26·10^-4·x+1.94 | ŷ=-1.47·x1+0.12·x2+2.57
SAR Determination (%) | 75 | 85
MDF Descriptor(s): x1 & x2 | ISMrSGg | ibDMFHt & ISPdlMg
Dominant Atomic Property | Group Electronegativity (G) | Hydrogen (H) & Mass (M)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | G^2·d^-3 | H^2·d^-2 & M^-1·d^-1
Structure on Activity Scale | Identity | Inversed & Identity
Model Statistics | r=0.8656; s=0.51; F (p)=167 (1.32·10^-18) | r=0.9208; s=0.40; F (p)=153 (1.15·10^-23)




2,4-diamino-5-(substituted-benzyl)-pyrimidines - Inhibition activity

The inhibition activity of the investigated 2,4-diamino-5-(substituted-benzyl)-pyrimidines proved to be of both geometrical and topological nature. It was also related to the number of directly bonded hydrogens (see the model with two descriptors).

Sample size [reference] | 67 [107]
MDF SAR Equation | ŷ=-707.32·x+10.83 | ŷ=-4.90·x1+2.31·x2+3.26
SAR Determination (%) | 73 | 86
MDF Descriptor(s): x1 & x2 | ibMrEMt | lImrKHt & lIMDWHg
Dominant Atomic Property | Mass (M) | Hydrogen (H) & Hydrogen (H)
Interaction Via | Bonds (topology) | Bonds (topology) & Space (geometry)
Interaction Model | M·d^-2 | H^-2·d^-1 & H^2·d^-1
Structure on Activity Scale | Inversed | Logarithmic & Logarithmic
Model Statistics | r=0.8551; s=0.32; F (p)=177 (2.42·10^-20) | r=0.9268; s=0.23; F (p)=195 (2.06·10^-28)






Peptide analogues - Inhibition activity

The model with two descriptors obtained for the inhibition activity of the peptide analogues showed that the activity was of geometrical nature and strongly related to the relative atomic mass.

Sample size [reference] | 47 [108]
MDF SAR Equation | ŷ=3202.8·x-16.51 | ŷ=240.54·x1-0.10·x2+0.94
SAR Determination (%) | 81 | 88
MDF Descriptor(s): x1 & x2 | IHmRpMg | IHMdpMg & IHMdOMg
Dominant Atomic Property | Mass (M) | Mass (M) & Mass (M)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | M^2·d^-3 | M^2·d^-3 & M
Structure on Activity Scale | Identity | Identity & Identity
Model Statistics | r=0.8999; s=0.28; F (p)=192 (5.16·10^-18) | r=0.9400; s=0.22; F (p)=167 (8.04·10^-22)




Lethal/effective concentration

Ordnance compounds - Fertilization of Sea Urchin

The lethal/effective concentration of the investigated ordnance compounds in the fertilization of the Sea Urchin proved to be of topological nature. It also depended on group electronegativity (see the model with one descriptor). The estimation power did not increase with the number of descriptors used by the models, which supports the ability of the model with one descriptor to estimate and predict.

MDF SAR Equation [reference] | ŷ=75.98·x+3937.6 | ŷ=-4291.5·x1-24751·x2+82488 [110]
SAR Determination (%) | 99.9 | 99.9
MDF Descriptor(s): x1 & x2 | ISPRwGt | iSDmtQg & lAMrFEt
Dominant Atomic Property | Group electronegativity (E) | Charge (Q) & Electronegativity (E)
Interaction Via | Bonds (topology) | Space (geometry) & Bonds (topology)
Interaction Model | E^2·d^-1 | Q^2·d^-4 & E^2·d^-2
Structure on Activity Scale | Identity | Inversed & Logarithmic
Model Statistics | r=0.9996; s=192; F (p)=7754 (1.44·10^-10) | r=0.9999; s=13; F (p)=823154 (1.61·10^-14)






Ordnance Compounds - Embryological development (1st column) & germination (2nd column) of Sea Urchin

The lethal/effective concentration in the embryological development of the Sea Urchin proved to be of topological nature and depended on partial charge. This model had a very good estimation power. The lethal/effective concentration in the germination of the Sea Urchin was also of topological nature and related to partial charge. In this model the power of estimation was good, but not as high as for the previous model.

Sample size [reference] | 5 [109] | 7 [109]
MDF SAR Equation | ŷ=-0.37·x-0.16 | ŷ=-1.09·x-7.09
SAR Determination (%) | 99.9 | 93
MDF Descriptor | LIMmwQt | lNPmfQt
Dominant Atomic Property | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Bonds (topology)
Interaction Model | Q^2·d^-1 | Q^2·d^-2
Structure on Activity Scale | Logarithmic | Logarithmic
Model Statistics | r=0.9997; s=0.02; F (p)=5677 (5.15·10^-6) | r=0.9650; s=0.35; F (p)=68 (4.32·10^-4)




Ordnance compounds - Zoospore germination of Green Macroalgae

The lethal/effective concentration of ordnance compounds in the zoospore germination of the Green Macroalgae seemed to be of geometrical nature and depended on relative atomic mass and partial charge (see the model with two descriptors).

MDF SAR Equation [reference] | ŷ=0.06·x-1.50 | ŷ=-0.004·x1+11.11·x2+21.58 [110]
SAR Determination (%) | 89 | 99.9
MDF Descriptor(s): x1 & x2 | aIDmjQg | iHDRkMg & inMrPQg
Dominant Atomic Property | Charge (Q) | Mass (M) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | Q^-1·d^-1 | M^-2·d^-1 & Q^2
Structure on Activity Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.9435; s=0.39; F (p)=49 (4.32·10^-4) | r=0.9996; s=0.04; F (p)=2942 (2.10·10^-8)






Ordnance compounds - Germling length of Green Macroalgae

The lethal/effective concentration of ordnance compounds in the germling length of the Green Macroalgae proved to be of geometrical nature and related to the number of directly bonded hydrogens and partial charge according to the model with two descriptors. The estimation and prediction power of this model was almost perfect.

Sample size [reference] | 8 [109]
MDF SAR Equation [reference] | ŷ=-1.88·x-6.13 | ŷ=-10.09·x1-1.39·x2+6.95 [110]
SAR Determination (%) | 89 | 99.9
MDF Descriptor(s): x1 & x2 | LIDmjQg | iGDREHg & lnDDVQg
Dominant Atomic Property | Charge (Q) | Hydrogen (H) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | Q^-1·d^-1 | H·d^-2 & Q·d^-1
Structure on Activity Scale | Logarithmic | Inversed & Logarithmic
Model Statistics | r=0.9445; s=0.35; F (p)=50 (4.09·10^-4) | r=0.9999; s=0.02; F (p)=11350 (7.20·10^-10)




Ordnance compounds - Germling cell number of Green Macroalgae Two models were obtained for the lethal/effective concentration of ordnance compounds in the germling cell number of the Green Macroalgae. According to the model with two descriptors, the activity was of geometrical nature and strongly depended on partial charge. The estimated power of the model with two descriptors was very high.


Characteristic | One-descriptor model | Two-descriptor model
MDF SAR Equation [reference] | ŷ=-1.87·x-6.02 | ŷ=382.96·x1-5.15·x2+5.97 [110]
SAR Determination (%) | 88 | 99.9
MDF Descriptor(s): x1 & x2 | LIDmjQg | AHDmtQg & inMDqQg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | Q^-1·d^-1 | Q^2·d^-2 & Q^-1·(√Q)^-1
Structure on Activity Scale | Logarithmic | Absolute & Inversed
Model Statistics | r=0.9359; s=0.38; F (p)=42 (6.28·10^-4) | r=0.9996; s=0.03; F (p)=3132 (1.80·10^-8)






Ordnance compounds - Survival and reproductive success of Polychaete (1st column) & Juveniles survival of Opossum Shrimp (2nd column)

The lethal/effective concentration of ordnance compounds expressed as the successful reproduction and survival of Polychaete juveniles (1st column) proved to be of topological nature and related to partial charge. The lethal/effective concentration of ordnance compounds expressed as the survival of the Opossum Shrimp (2nd column) proved to be of geometrical nature and also depended on partial charge.

Characteristic | Polychaete (1st column) | Opossum Shrimp (2nd column)
Sample size [reference] | 7 [109] | 7 [109]
MDF SAR Equation | ŷ=-102.72·x-0.79 | ŷ=-1.31·x+0.28
SAR Determination (%) | 97 | 91
MDF Descriptor(s) | IAPmtQt | LHDmjQg
Dominant Atomic Property | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry)
Interaction Model | Q^2·d^-4 | Q^-1·d^-1
Structure on Activity Scale | Identity | Logarithmic
Model Statistics | r=0.9835; s=0.22; F (p)=148 (6.65·10^-5) | r=0.9531; s=0.25; F (p)=50 (8.93·10^-4)




Ordnance compounds - Redfish larvae survival

The lethal/effective concentration of ordnance compounds expressed as redfish larvae survival proved to be of topological nature and related to cardinality and partial charge.


Characteristic | One-descriptor model | Two-descriptor model
MDF SAR Equation [reference] | ŷ=16.91·x-1.73 | ŷ=-14.72·x1-0.11·x2+17 [110]
SAR Determination (%) | 93 | 99.9
MDF Descriptor(s): x1 & x2 | anDRJQt | iAMrECt & aAPmfQt
Dominant Atomic Property | Charge (Q) | Cardinality (C) & Charge (Q)
Interaction Via | Bonds (topology) | Bonds (topology) & Bonds (topology)
Interaction Model | Q·d | C·d^-2 & Q^2·d^-2
Structure on Activity Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.9655; s=0.32; F (p)=82 (9.99·10^-5) | r=0.9998; s=0.03; F (p)=6295 (3.14·10^-9)






No observed effect concentration (NOEC)

Ordnance compounds - Fertilization (1st column) & embryological development (2nd column) & germination of Sea Urchin (3rd column)

The NOEC of the investigated ordnance compounds in the fertilization of the Sea Urchin proved to be of geometrical nature and related to the compounds’ cardinality. The NOEC in the embryological development of the Sea Urchin proved to be of geometrical nature and related to partial charge. The NOEC in the germination of the Sea Urchin proved to be of topological nature and related to partial charge. All of these models had good estimation and prediction power, the coefficient of determination being higher than or equal to 95%.

Characteristic | Fertilization (1st column) | Embryological development (2nd column) | Germination (3rd column)
Sample size [reference] | 7 [109] | 7 [109] | 6 [109]
MDF SAR Equation | ŷ=-0.80·x+3.93 | ŷ=0.17·x+1.42 | ŷ=0.001·x-1.27
SAR Determination (%) | 96 | 95 | 97
MDF Descriptor | imMrtCg | ASPmwQg | asmrfQt
Dominant Atomic Property | Cardinality (C) | Charge (Q) | Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) | Bonds (topology)
Interaction Model | C^2·d^-4 | Q^2·d^-1 | Q^2·d^-2
Structure on Activity Scale | Inversed | Logarithmic | Inversed
Model Statistics (2nd and 3rd columns) | r=0.9738; s=0.08; F (p)=92 (2.09·10^-4) | r=0.9859; s=0.27; F (p)=139 (2.97·10^-4)





Ordnance compounds - Germling length and cell number of Green Macroalgae

The NOEC of the investigated ordnance compounds in the germling length and cell number of the Green Macroalgae proved to be of both topological and geometrical nature. It also depended on group electronegativity and relative atomic mass (model with two descriptors). The model with two descriptors proved to have excellent estimation and prediction power.



Characteristic | One-descriptor model | Two-descriptor model
Sample size [reference] | 8 [109]
MDF SAR Equation | ŷ=0.06·x-1.74 | ŷ=24.40·x1-28.28·x2+252.94
SAR Determination (%) | 87 | 99.9
MDF Descriptor(s): x1 & x2 | aIDmjQg | lMMRSGt & lsPmpMg
Dominant Atomic Property | Charge (Q) | Group Electronegativity (G) & Mass (M)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | Q^-1·d^-1 | G^2·d^-3 & M^-2
Structure on Activity Scale | Inversed | Logarithmic & Logarithmic
Model Statistics | r=0.9355; s=0.40; F (p)=42 (6.38·10^-4) | r=0.9995; s=0.04; F (p)=2499 (3.16·10^-8)




Ordnance compounds - Survival and reproductive success of Green Macroalgae

The NOEC of the investigated ordnance compounds in the survival and reproductive success of the Green Macroalgae proved to be of both geometrical and topological nature and related to the compounds’ atomic electronegativity and partial charge (model with two descriptors). The model with two descriptors proved to have excellent estimation and prediction power.

Characteristic | One-descriptor model | Two-descriptor model
Sample size [reference] | 8 [109]
MDF SAR Equation | ŷ=-1.28·x+3.71 | ŷ=-0.89·x1-13455·x2+28.03
SAR Determination (%) | 92 | 99.9
MDF Descriptor(s): x1 & x2 | LnDRJQt | IHMRFEg & INPmsQt
Dominant Atomic Property | Charge (Q) | Electronegativity (E) & Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry) & Bonds (topology)
Interaction Model | Q·d | E^2·d^-2 & Q^2·d^-3
Structure on Activity Scale | Logarithmic | Identity & Identity
Model Statistics | r=0.9578; s=0.36; F (p)=65 (1.83·10^-4) | r=0.9998; s=0.03; F (p)=5938 (3.63·10^-9)




Ordnance compounds - Redfish larvae survival (1st column) & survival and reproductive success of Polychaete (2nd column)

The NOEC of the investigated ordnance compounds in redfish larvae survival proved to be of both geometrical and topological nature and related to partial charge and the number of directly bonded hydrogens. The NOEC of the investigated ordnance compounds in the survival and reproductive success of Polychaete was of geometrical nature and related to partial charge.


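Characteristic | Redfish larvae, one-descriptor model | Redfish larvae, two-descriptor model | Polychaete
Sample size [reference] | 8 [109] (both Redfish models) | 6 [109]
MDF SAR Equation | ŷ=-1.37·x+0.09 | ŷ=-0.003·x1+2.25·x2+4.59 | ŷ=-1.42·x-10.25
SAR Determination (%) | 91 | 99.9 | 95
MDF Descriptor(s) | LHDmjQg | asDmkQg & IGMmTHt | LsmrfQg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Hydrogen (H) | Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Bonds (topology) | Space (geometry)
Interaction Model | Q^-1·d^-1 | Q^-2·d^-1 & H^2·d^-4 | Q^2·d^-2
Structure on Activity Scale | Logarithmic | Inversed & Identity | Logarithmic
Model Statistics | r=0.9542; s=0.24; F (p)=61 (2.33·10^-4) | r=0.9995; s=0.03; F (p)=2373 (3.59·10^-8) | r=0.9754; s=0.32; F (p)=78 (8.98·10^-4)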





Ordnance compounds - Juveniles survival of Opossum Shrimp

The NOEC of the investigated ordnance compounds in the juvenile survival of the Opossum Shrimp was of geometrical nature and related to partial charge and cardinality (model with two descriptors). The determination coefficient of the model with one descriptor was moderately good, while that of the model with two descriptors was very good.

Characteristic | One-descriptor model | Two-descriptor model
Sample size [reference] | 8 [109]
MDF SAR Equation | ŷ=668.36·x+19.24 | ŷ=1.46·x1-0.008·x2+0.28
SAR Determination (%) | 82 | 99.9
MDF Descriptor(s): x1 & x2 | iBPMwEt | iIPdqQg & iImrSCg
Dominant Atomic Property | Electronegativity (E) | Charge (Q) & Cardinality (C)
Interaction Via | Bonds (topology) | Space (geometry) & Space (geometry)
Interaction Model | E^2·d^-1 | Q^-1·(√Q)^-1 & C^2·d^-3
Structure on Activity Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.9048; s=0.28; F (p)=27 (2.01·10^-3) | r=0.9998; s=0.01; F (p)=5435 (4.53·10^-9)






Lowest observed effect concentration (LOEC)

Ordnance compounds - Fertilization (1st column) & embryological development (2nd column) of Sea Urchin

The LOEC of the investigated ordnance compounds in the fertilization and embryological development of the Sea Urchin was of topological nature and depended on partial charge. The estimation power of the models was good, with determination coefficients higher than or equal to 93%.

Characteristic | Fertilization (1st column) | Embryological development (2nd column)
Sample size [reference] | 6 [109] | 7 [109]
MDF SAR Equation | ŷ=-47.56·x+0.57 | ŷ=-1.14·x-7.62
SAR Determination (%) | 99.8 | 93
MDF Descriptor(s) | IAPmfQt | lNPmfQt
Dominant Atomic Property | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Bonds (topology)
Interaction Model | Q^2·d^-2 | Q^2·d^-2
Structure on Activity Scale | Identity | Logarithmic
Model Statistics | r=0.9993; s=0.04; F (p)=2781 (7.74·10^-7) | r=0.9653; s=0.36; F (p)=68 (4.22·10^-4)




Ordnance compounds - Germination of Sea Urchin

The LOEC of the investigated ordnance compounds in the germination of the Sea Urchin was of both topological and geometrical nature. It was also related to group electronegativity and relative atomic mass. The estimation and prediction power of the model with two descriptors was close to the optimum value.

Characteristic | One-descriptor model | Two-descriptor model
Sample size [reference] | 8 [109]
MDF SAR Equation | ŷ=0.06·x-1.43 | ŷ=11.77·x1+14.55·x2-76.46
SAR Determination (%) | 88 | 99.9
MDF Descriptor(s): x1 & x2 | aIDmjQg | lGPrfGt & iGPMqMg
Dominant Atomic Property | Charge (Q) | Group Electronegativity (G) & Mass (M)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | Q^-1·d^-1 | E^2·d^-2 & M^-1·(√M)^-1
Structure on Activity Scale | Inversed | Logarithmic & Inversed
Model Statistics | r=0.9357; s=0.40; F (p)=42 (6.33·10^-4) | r=0.9996; s=0.04; F (p)=3002 (1.99·10^-8)






Ordnance compounds - Germling length and cell number of Green Macroalgae

The LOEC of the investigated ordnance compounds in the germling length and cell number of the Green Macroalgae proved to be of topological nature and strongly depended on atomic electronegativity (model with two descriptors).

Characteristic | One-descriptor model | Two-descriptor model
Sample size [reference] | 8 [109]
MDF SAR Equation | ŷ=0.06·x-2.02 | ŷ=0.66·x1-688.62·x2-0.47
SAR Determination (%) | 90 | 99.9
MDF Descriptor(s): x1 & x2 | aIDmjQg | ISPRfEt & imDrwEt
Dominant Atomic Property | Charge (Q) | Electronegativity (E) & Electronegativity (E)
Interaction Via | Space (geometry) | Bonds (topology) & Bonds (topology)
Interaction Model | Q^-1·d^-1 | E^2·d^-2 & E^2·d^-1
Structure on Activity Scale | Inversed | Identity & Inversed
Model Statistics | r=0.9504; s=0.35; F (p)=56 (2.94·10^-4) | r=0.9995; s=0.04; F (p)=2539 (3.03·10^-8)




Ordnance compounds - Survival and reproductive success of Polychaete

The LOEC of the investigated ordnance compounds in the survival and reproductive success of the Polychaete proved to be of both geometrical and topological nature and related to atomic and group electronegativity. Both models had good estimation and prediction power.

Characteristic | One-descriptor model | Two-descriptor model
Sample size [reference] | 8 [109]
MDF SAR Equation | ŷ=16.60·x-1.69 | ŷ=-20.97·x1+51.18·x2-267.22
SAR Determination (%) | 92 | 99.9
MDF Descriptor(s): x1 & x2 | anDRJQt | lsPmkEg & lSmRFGt
Dominant Atomic Property | Charge (Q) | Electronegativity (E) & Group Electronegativity (G)
Interaction Via | Bonds (topology) | Space (geometry) & Bonds (topology)
Interaction Model | Q·d | E^-2·d^-1 & G^2·d^-2
Structure on Activity Scale | Inversed | Logarithmic & Logarithmic
Model Statistics | r=0.9612; s=0.34; F (p)=73 (1.42·10^-4) | r=0.9999; s=0.02; F (p)=13109 (5.02·10^-10)






Ordnance compounds - Survival and reproductive success of Green Macroalgae (1st column) & Redfish larvae survival (2nd column) & Juveniles survival of Opossum Shrimp (3rd column)

The LOEC of the investigated ordnance compounds in the survival and reproductive success of the Green Macroalgae proved to be of geometrical nature and depended on partial charge. The LOEC in the Redfish larvae survival also proved to be of geometrical nature and depended on partial charge. The LOEC in the juvenile survival of the Opossum Shrimp proved to be of geometrical nature and depended on cardinality. All three models had good estimation and prediction abilities, their power being higher than 88%.


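Characteristic | Green Macroalgae (1st column) | Redfish larvae (2nd column) | Opossum Shrimp (3rd column)
Sample size [reference] | 7 [109] | 7 [109] | 7 [109]
MDF SAR Equation | ŷ=0.11·x-3.69 | ŷ=-1.30·x+0.39 | ŷ=-0.83·x+4.22
SAR Determination (%) | 95 | 94 | 98
MDF Descriptor(s) | iIDdPQg | LHDmjQg | imMrtCg
Dominant Atomic Property | Charge (Q) | Charge (Q) | Cardinality (C)
Interaction Via | Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model | Q^2 | Q^-1·d^-1 | C^2·d^-4
Structure on Activity Scale | Inversed | Logarithmic | Inversed
Model Statistics (2nd and 3rd columns) | r=0.9694; s=0.20; F (p)=78 (3.09·10^-4) | r=0.9897; s=0.07; F (p)=239 (2.06·10^-5)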





Discussions

Researchers from many chemistry-related fields are interested in quantitative structure-activity/property relationship approaches because of their advantages. A quick search in the SCOPUS database (http://www.scopus.com) with the query “TITLE-ABS-KEY(qsar)” retrieved 6278 abstracts, of which almost 10% were published last year. The same query retrieved 4920 items in the PubMed database (http://www.ncbi.nlm.nih.gov); of these, almost 13% were published last year and almost 26% in the last two years. The amount of research published on QSAR shows the importance of these studies. The three main advantages offered by the approaches are as follows:
• Time and money efficiency [111]
• Effective alternatives (with comparable or better accuracy) to experimental counterparts [111]
• Virtual screening environments (particularly receptor-based virtual screening), which offer reliable and inexpensive methods for identifying leads [112].

A new approach termed MDF SPR/SAR was developed. It uses information extracted from the 2D and 3D structure of the compounds to generate and calculate molecular descriptors able to estimate and predict the property/activity of interest. The approach was tested on different properties and activities in a number of classes of chemically and/or biologically active compounds. The analysis of the MDF abilities in the estimation and prediction of properties led to the following:
• The estimation power, expressed as correlation coefficient, varied from 0.8324 to 1.0000.
• The prediction power, expressed as leave-one-out correlation coefficient, varied from 0.8258 to 0.9999.
• The stability of the model, expressed as the difference between estimation and prediction power, varied from 0.01% to 5.46%. The most stable models (with a difference between estimation and prediction power of 0.0001) were obtained in the following classes of compounds: polychlorinated biphenyls (relative retention time), organophosphorus herbicides (retention chromatography index), cyclic organophosphorus (molar refraction), alkanes (boiling point), and standard amino acids – 15aa (Hückel energy).
• The estimation power of the best performing model (when two models were reported) was higher than or equal to 0.9 in 90% of cases (eighteen of the twenty investigated sets).
• The prediction power of the best performing model (when two models were reported) was higher than or equal to 0.9 in 85% of cases (seventeen of the twenty investigated sets).

The analysis of the MDF abilities in the estimation and prediction of activities led to the following:



• The estimation power, expressed as correlation coefficient, varied from 0.6649 to 0.9999.
• The prediction power, expressed as leave-one-out correlation coefficient, varied from 0.5961 to 0.9999.
• The stability of the model, expressed as the difference between estimation and prediction power, varied from 0.01% to 7.92%. The most stable model (with a difference between estimation and prediction power of 0.0001) was obtained when the fertilization of the Sea Urchin was investigated. The sample size of this set was small (8 compounds). A similarly small difference between estimation and prediction power (0.0002) was obtained in the investigation of the antioxidant efficacy of 3-indolyl derivatives. The worst result was obtained in the toxicity investigation of the mono-substituted nitrobenzenes, where a difference of 0.0792 was obtained.
• The estimation power of the best performing model (when two models were reported) was higher than or equal to 0.9 in almost 77% of cases (fifty-six of the seventy-three investigated sets).
• The prediction power of the best performing model (when two models were reported) was higher than or equal to 0.9 in 70% of cases (fifty-one of the seventy-three investigated sets).
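The chapter does not spell out how the leave-one-out correlation coefficient was computed. A minimal sketch follows, assuming an ordinary least-squares model and taking the estimation and prediction powers as Pearson correlations between the observed values and, respectively, the fitted and the leave-one-out predicted values. The function names and the data are invented for illustration only.

```python
import numpy as np


def fit_predict(X_train, y_train, X_test):
    """Fit a least-squares linear model with intercept; predict for X_test."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([X_test, np.ones(len(X_test))]) @ coef


def estimation_and_loo_r(X, y):
    """Return (r_estimation, r_leave_one_out) as Pearson correlations."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    fitted = fit_predict(X, y, X)
    loo = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # leave compound i out
        loo[i] = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
    r_est = np.corrcoef(y, fitted)[0, 1]
    r_loo = np.corrcoef(y, loo)[0, 1]
    return r_est, r_loo


# Made-up data: 8 compounds, one descriptor value per compound.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 5.8, 7.2, 7.9])
r_est, r_loo = estimation_and_loo_r(x, y)
stability = r_est - r_loo  # smaller difference = more stable model
print(f"r = {r_est:.4f}, r_loo = {r_loo:.4f}, difference = {stability:.4f}")
```

With small samples such as the 6 to 8 compounds used for the ordnance sets, each left-out compound removes a large share of the data, so the difference between the two correlations is a useful but coarse indicator of stability.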

The MDF SPR/SAR models proved to have good correlation coefficients and predictive power compared with the previously reported models [23,29,32,34,35,47,54,56,59,67,88,90,92,94,96,99,101,103,105,106,107]. The MDF methodology was successful in extracting information from the 2D and 3D structure of compounds. This is useful for identifying the link between the compounds’ structure and the property or activity of interest. The use of structural information to characterize chemical compounds allows good correlations between the compounds’ structure and chemical and/or biological activity. Thus, the MDF methodology is a powerful approach for investigating the activities and properties of chemically active compounds, since it includes the information extracted from the compounds’ structure in the SPR/SAR models. Furthermore, the approach could be useful in the process of drug design.

References

1. Hammett, L.P. 1937, J. Am. Chem. Soc., 59, 96.
2. Hansch, C., Leo, A., and Taft, R.W. 1991, Chem. Rev., 91, 165.
3. Crum-Brown, A., and Fraser, T.R. 1868, Philos. Trans. R. Soc. Lond., 25, 151.
4. Richet, C.R. 1893, C. R. Seances Soc. Biol. Fil., 9, 775.



5. Meyer, H. 1899, Naunyn-Schmiedeberg's Arch. Pharmacol., 42, 109.
6. Hammett, L.P. 1935, Chem. Rev., 17, 125.
7. Taft, R.W. 1952, J. Am. Chem. Soc., 74, 3120.
8. Hansch, C. 1969, Acc. Chem. Res., 2, 232.
9. Mezey, P.G., Journal of Mathematical Chemistry, Springer Netherlands, since 1987 (periodical).
10. Gutman, I., MATCH - Communications in Mathematical and in Computer Chemistry, Faculty of Science Kragujevac and University of Kragujevac Serbia, from 1975 (periodical).
11. Brändas, E., and Öhrn, Y., International Journal of Quantum Chemistry, John Wiley & Sons NY USA, since 1967 (periodical).
12. Portoghese, P.S., Journal of Medicinal Chemistry, American Chemical Society OH USA, since 1959 (periodical).
13. Atta-ur-Rahman, Letters in Drug Design & Discovery, Bentham Science Publishers, since 2004 (periodical).
14. Sawyer, T.K., Chemical Biology & Drug Design, Blackwell Publishing, since 2006 (periodical, formerly known as Journal of Peptide Research until 2005).
15. Grassy, G., Calas, B., Yasri, A., Lahana, R., Woo, J., Iyer, S., Kaczorek, M., Floc'h, R., and Buelow, R. 1998, Nat. Biotechnol., 16, 748.
16. van Breemen, R.B., Combinatorial Chemistry & High Throughput Screening, Bentham Science Publishers, from 1998 (periodical).
17. Czarnik, A.W., Journal of Combinatorial Chemistry, American Chemical Society OH USA, from 1999 (periodical).
18. Hansch, C., and Leo, A. 1979, Substituent Constants for Correlation Analysis in Chemistry and Biology, John Wiley & Sons, Inc., New York.
19. Cramer, R.D. III, Patterson, D.E., and Bunce, J.D. 1988, J. Am. Chem. Soc., 110, 5959.
20. Todeschini, R., Lasagni, M., and Marengo, E. 1994, J. Chemometrics, 8, 263.
21. Simon, Z., Chiriac, A., Holban, S., Ciubotariu, D., and Mihalas, G.I. 1984, Minimum Steric Difference. The MTD Method for QSAR Studies, Research Studies Press, 1.
22. Jäntschi, L., Katona, G., and Diudea, M.V. 2000, MATCH Commun. Math. Comput. Chem., 41, 151.
23. Jäntschi, L. 2004, Leonardo Journal of Sciences, 3, 68.
24. Putz, M.V., and Lacrămă, A.-M. 2007, Int. J. Mol. Sci., 8, 363.
25. Cambridge Structural Database (CSD), available from: http://www.ccdc.cam.ac.uk/products/csd/; Protein Data Bank (PDB), available from: http://www.pdb.org/; Visual Molecular Dynamics (VMD), available from: http://www.ks.uiuc.edu/Research/vmd/; Molecular Modeling DataBase (MMDB), available from: http://130.14.29.110/Structure/MMDB/mmdb.shtml; PubChem Compounds (PCC), available from: http://ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pccompound; PubChem Substance (PCS), available from: http://ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pcsubstance.
26. Eriksson, L., Jaworska, J., Worth, A.P., Cronin, M.T.D., McDowell, R.M., and Gramatica, P. 2003, Environ. Health Perspect., 111, 1361.



27. Steiger, J.H. 1980, Psychol. Bull., 87, 245.
28. Thurstone, L.L. 1931, Psychol. Rev., 38, 406.
29. Bolboacă, S.D., and Jäntschi, L. 2006, Leonardo Journal of Sciences, 5, 179.
30. Tarko, L. 2006, Revista de Chimie, 57, 1014.
31. Jäntschi, L., and Bolboacă, S.D. 2007, Leonardo Electronic Journal of Practices and Technologies, 6, 169.
32. Jäntschi, L. 2005, Leonardo Electronic Journal of Practices and Technologies, 6, 76.
33. HyperChem, Molecular Modelling System [online]; ©2003, Hypercube [cited 2007 Feb]. Available from: URL: http://hyper.com/products/.
34. Bolboacă, S.D., and Jäntschi, L. 2008, Environ. Chem. Lett., 6, 175.
35. Jäntschi, L., and Bolboacă, S.D. 2007, Studii şi Cercetări Ştiinţifice Universitatea Bacău Seria Biologie, 12, 57.
36. Lu, A., Zhang, J., Yin, X., Luo, X., and Jiang, H. 2007, Bioorg. Med. Chem. Lett., 17, 243.
37. Ma, W., Luan, F., Zhao, C., Zhang, X., Liu, M., Hu, Z., and Fan, B. 2006, QSAR Comb. Sci., 25, 895.
38. Tropsha, A., Gramatica, P., and Gombar, V.K. 2003, QSAR Comb. Sci., 22, 69.
39. Diudea, M., Gutman, I., and Jäntschi, L. 2002, Molecular Topology, 2nd Edition, Nova Science, Huntington, New York.
40. The PHP Group [online]; ©2001-2007, The PHP Group [cited 2007 June]. Available from: URL: http://php.net.
41. MySQL AB [online]; ©1995-2007 MySQL AB [cited 2007 June]. Available from: URL: http://mysql.com.
42. The FreeBSD Project [online]; ©1995-2007 The FreeBSD Project [cited 2007 June]. Available from: URL: http://freebsd.org.
43. Chambers, D.L. 2001, The Practical Handbook of Genetic Algorithms, Chapman & Hall, Boca Raton.
44. Borland Software Corporation [online]; ©1994-2007 Borland Software Corporation [cited 2007 June]. Available from: URL: http://borland.com.
45. Hawkins, D.M. 2004, J. Chem. Inf. Comput. Sci., 44, 1.
46. Abraham, M.H., Kumarsingh, R., Cometto-Muniz, J.E., and Cain, W.S. 1998, Toxicol. In Vitro, 12, 201.
47. Ivanciuc, O. 1998, Revue Roumaine de Chimie, 43, 255.
48. Jäntschi, L., and Bolboacă, S.D. 2007, Int. J. Quantum Chem., 107, 1736.
49. Eisler, R., and Belisle, A.A. 1996, Biological Report 31 and Contaminant Hazard Reviews Report, 31, 75.
50. Baker, J.R., Mihelcic, J.R., and Sabljić, A. 2001, Chemosphere, 45, 213.
51. Bumble, S. 1999, Computer Generated Physical Properties, CRC Press LLC, Boca Raton.
52. Bolboacă, S.D., and Jäntschi, L. 2008, Chem. Biol. Drug Des., 71(2), 173.
53. Jäntschi, L., Mureşan, S., and Diudea, M. 2000, Studia Universitatis Babes-Bolyai, Chemia, XLV, 313.
54. Jäntschi, L., and Bolboacă, S.D. 2005, Leonardo Electronic Journal of Practices and Technologies, 7, 55.
55. Toropov, A., Toropova, A., Ismailov, T., and Bonchev, D. 1998, Theochem, 424, 237.



56. Brasquet, C., and Le Cloirec, P. 1999, Water Res., 33, 3603.
57. Jäntschi, L. 2004, Leonardo Journal of Sciences, 5, 63.
58. Wolfenden, R., Andersson, L., Cullis, P., and Southgate, C. 1981, Biochemistry, 20, 849.
59. Rose, G.D., Geselowitz, A.R., Lesser, G.J., Lee, R.H., and Zehfus, M.H. 1985, Science, 229, 834.
60. Hessa, T., Kim, H., Bihlmaier, K., Lundin, C., Boekel, J., Andersson, H., Nilsson, I., White, S.H., and von Heijne, G. 2005, Nature, 433, 377.
61. Kyte, J., and Doolittle, R.F. 1982, J. Mol. Biol., 157, 105.
62. Welling, G.W., Weijer, W.J., Van der Zee, R., and Welling-Wester, S. 1985, FEBS Lett., 188, 215.
63. Wilson, K.J., Honegger, A., Stotzel, R.P., and Hughes, G.J. 1981, Biochem. J., 199, 31.
64. Cornette, J.L., Cease, K.B., Margalit, H., Spouge, J.L., Berzofsky, J.A., and DeLisi, C. 1987, J. Mol. Biol., 195, 659.
65. Bolboacă, S.D., and Jäntschi, L. 2007, Recent Advances in Synthesis & Chemical Biology VI, Centre for Synthesis & Chemical Biology, University of Dublin, December 14, Dublin, Ireland, P2.
66. Wimley, W.C., and White, S.H. 1996, Nature Struct. Biol., 3, 842.
67. Hopp, T.P., and Woods, K.R. 1981, Proc. Natl. Acad. Sci. USA, 78, 3824.
68. Cowan, R. 1990, Pept. Res., 3, 75.
69. Manavalan, P., and Ponnuswamy, P.K. 1978, Nature, 275, 673.
70. Fauchere, J.-L., and Pliska, V.E. 1983, Eur. J. Med. Chem., 18, 369.
71. Rao, M.J.K., and Argos, P. 1986, Biochim. Biophys. Acta, 869, 197.
72. Janin, J. 1979, Nature, 277, 491.
73. Roseman, M.A. 1988, J. Mol. Biol., 200, 513.
74. Urry, D.W. 2004, Chem. Phys. Lett., 399, 177.
75. Engelman, D.M., Steitz, T.A., and Goldman, A. 1986, Annu. Rev. Biophys. Chem., 15, 321.
76. Eisenberg, D., Schwarz, E., Komaromy, M., and Wall, R. 1984, J. Mol. Biol., 179, 125.
77. Sereda, T.J., Mant, C.T., Sonnichsen, F.D., and Hodges, R.S. 1994, J. Chromatogr. A, 676, 139.
78. Bull, H.B., and Breese, K. 1974, Arch. Biochem. Biophys., 161, 665.
79. Parker, J.M.R., Guo, D., and Hodges, R.S. 1986, Biochemistry, 25, 5425.
80. Black, S.D., and Mould, D.R. 1991, Anal. Biochem., 193, 72.
81. Monera, O.D., Sereda, T.J., Zhou, N.E., Kay, C.M., and Hodges, R.S. 1995, J. Pept. Sci., 1, 319.
82. Wei, D., Zhang, A., Wu, C., Han, S., and Wang, L. 2001, Chemosphere, 44, 1421.
83. Agrawal, V.K., and Khadikar, P.V. 2001, Bioorg. Med. Chem., 9, 3035.
84. Toropov, A.A., and Toropova, A.P. 2002, Theochem, 578, 129.
85. Ade, T., Zaucke, F., and Krug, H.F. 1996, Fresenius J. Anal. Chem., 354, 609.
86. Bolboacă, S.D., and Jäntschi, L. 2006, Terapeutics, Pharmacology and Clinical Toxicology, X(1), 110.
87. Smith, C.J., Hansch, C., and Morton, M.J. 1997, Mutat. Res., 379, 167.



88. Jäntschi, L., and Bolboacă, S. 2005, The 10th Electronic Computational Chemistry Conference, April 2005.
89. Hasegawa, K., Arakawa, M., and Funatsu, K. 1999, Chemometr. Intell. Lab., 47, 33.
90. Bolboacă, S., and Jäntschi, L. 2005, Leonardo Journal of Sciences, 6, 78.
91. Diudea, M., Jäntschi, L., and Pejov, L. 2002, Leonardo Electronic Journal of Practices and Technologies, 1, 1.
92. Bolboacă, S., and Jäntschi, L. 2006, Bulletin of University of Agricultural Sciences and Veterinary Medicine – Agriculture, 62, 35.
93. Shertzer, G.H., Tabor, M.W., Hogan, I.T.D., Brown, J.S., and Sainsbury, M. 1996, Arch. Toxicol., 70, 830.
94. Bolboacă, S., Filip, C., Ţigan, Ş., and Jäntschi, L. 2006, Clujul Medical, LXXIX, 204.
95. Zhou, Y.X., Xu, L., Wu, Y.P., and Liu, B.L. 1999, Chemometr. Intell. Lab., 45, 95.
96. Frahm, A.W., and Chaudhuri, R. 1979, Tetrahedron, 35, 2035.
97. Bolboacă, S.D., and Jäntschi, L. 2005, Leonardo Journal of Sciences, 7, 58.
98. Morita, H., Gonda, A., Wei, L., Takeya, K., and Itokawa, H. 1997, Bioorg. Med. Chem. Lett., 7, 2387.
99. Bolboacă, S.D., and Jäntschi, L. 2008, Arch. Med. Sci., 4(1), 7.
100. Castro, E.A., Torrens, F., Toropov, A.A., Nesterov, I.V., and Nabiev, O.M. 2004, Mol. Simulat., 30, 691.
101. Bolboacă, S., Ţigan, Ş., and Jäntschi, L. 2006, Proceedings of the European Federation for Medical Informatics Special Topic Conference, April 6-8, 222.
102. Supuran, C.T., and Clare, B.W. 1999, Eur. J. Med. Chem., 34, 41.
103. Bolboacă, S.D., and Jäntschi, L. 2007, Integration of Structure Information, Computer-Aided Chemical Engineering, Elsevier Netherlands & UK, 24, 965.
104. Jäntschi, L., Ungureşan, M.L., and Bolboacă, S.D. 2005, Applied Medical Informatics, 17, 12.
105. Jäntschi, L., and Bolboacă, S. 2006, Electron. J. Biomed., 2, 22.
106. Opris, D., and Diudea, M.V. 2001, SAR QSAR Environ. Res., 12, 1599.
107. Selassie, C.D., Li, R.L., Poe, M., and Hansch, C. 1991, J. Med. Chem., 34, 46.
108. Hellberg, S., Eriksson, L., Jonsson, J., Lindgren, F., Sjostrom, M., Skagerberg, B., Wold, S., and Andrews, P. 1991, Int. J. Pept. Protein Res., 37, 414.
109. Carr, R.S., and Nipper, M. 2000, Toxicity of marine sediments and porewaters spiked with ordnance compounds: Report prepared for Naval Facilities Engineering Commands, NFESC Contract Report CR 01-001-ENV, Washington, DC, USA. Available from: http://enviro.nfesc.navy.mil/erb/erb_a/restoration/fcs_area/con_sed/MarineSed2000.pdf.
110. Stoenoiu, C.E., Bolboacă, S.D., and Jäntschi, L. 2007, MetEcoMat - Ecomaterials and Processes: Characterization and Metrology, April 19-21, St. Kirik, Plovdiv, Bulgaria, 54.
111. Prathipati, P., Dixit, A., and Saxena, A.K. 2007, Current Computer-Aided Drug Design, 3, 133.
112. Lyne, P.D. 2002, Drug Discov. Today, 7, 1047.




6. Selection of an optimal set of descriptors: use of the Enhanced Replacement Method

Andrew G. Mercader

Instituto de Investigaciones Fisicoquímicas Teóricas y Aplicadas (INIFTA, UNLP, CCT La Plata-CONICET), Diag. 113 y 64, Sucursal 4, C.C. 16, 1900 La Plata, Argentina

Abstract. The Enhanced Replacement Method (ERM) is a search algorithm that selects an optimal group of descriptors from a much greater set in the construction of linear QSAR/QSPR models. This algorithm has shown many superior attributes that will be presented in this chapter. In addition, successful applications will be displayed to give a better general picture. Finally an innovative methodology to determine the optimal number of parameters using the ERM will be presented.

Correspondence/Reprint request: Dr. Andrew G. Mercader, Instituto de Investigaciones Fisicoquímicas Teóricas y Aplicadas (INIFTA, UNLP, CCT La Plata-CONICET), Diag. 113 y 64, Sucursal 4, C.C. 16, 1900 La Plata, Argentina. E-mail: [email protected]

1. Introduction

Quantitative Structure-Property/Activity Relationships (QSPR/QSAR) are generally applied to overcome the lack of experimental data in complex chemical phenomena.[1] There is therefore a permanently renewed interest in the development of such predictive techniques.[2-5] The ultimate role of the QSPR/QSAR theory is to suggest mathematical models capable of estimating relevant properties of interest. Such studies rely on the basic assumption that the structure of a compound entirely determines its properties, which can therefore be translated into so-called molecular descriptors. These parameters are calculated through mathematical formulae obtained from several theories, such as Chemical Graph Theory, Information Theory, and Quantum Mechanics.[6, 7]

To date, thousands of descriptors encoding different aspects of the molecular structure have been developed and are available in the literature.[8] As in any QSAR/QSPR study, we have to determine how to select those that characterize the property/activity under consideration in the most efficient way. First, the optimal number of parameters to include in the model has to be determined. Afterwards, the mathematical problem of selecting a subset d = {X1, X2, …, Xd} of d descriptors from a much larger set D of D descriptors (D ≫ d) has to be addressed. The search for the optimal set of descriptors may be monitored by the minimization or maximization of a chosen statistical parameter; we normally prefer to search for the model that makes the Standard Deviation (S) as small as possible. In other words, we look for the global minimum of S(d), where d is a point in a space of D!/[d!(D-d)!] points. A full search (FS) for the optimal variables is impractical, since it requires D!/[d!(D-d)!] linear regressions and D is normally higher than one thousand. For this reason, we have recently proposed the Enhanced Replacement Method (ERM)[9] as an adaptation of the Replacement Method (RM).[10-12] This new algorithm presents several advantages and has been shown to outperform RM and the more elaborate Genetic Algorithms (GA).[13]

The main differences between ERM and RM will be discussed in the next sections, together with the resemblance of the new algorithm to Simulated Annealing (SA). SA is an adaptation of the Metropolis-Hastings algorithm, a Monte Carlo method[14] for generating sample states of a thermodynamic system; its name and inspiration come from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects.
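To make the size of the full-search space D!/[d!(D-d)!] concrete, the short sketch below counts the linear regressions a full search would need. The pool size D = 1000 and the range of d are illustrative choices consistent with the statement that D is normally higher than one thousand; the function name is hypothetical.

```python
import math


def n_subsets(D: int, d: int) -> int:
    """Number of distinct d-descriptor subsets of a pool of D descriptors."""
    return math.comb(D, d)  # D! / [d! (D - d)!]


# Illustrative pool of D = 1000 descriptors, models with 1 to 7 descriptors.
for d in range(1, 8):
    print(f"d = {d}: {n_subsets(1000, d):.3e} linear regressions")
```

Already for d = 5 the count exceeds 8·10^12 regressions, which is why a guided search over the subsets is needed in place of exhaustive enumeration.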
The heat causes the atoms to become unstuck from their initial positions (a local minimum of the internal energy) and wander randomly through states of higher energy; the cooling gives them more chances of finding configurations with lower internal energy than the initial one.[15] Successful practical applications that give a better understanding of the algorithm and the way it can be used are also presented, together with a new method for selecting the optimal number of parameters to include in a model using ERM.

1.1. Motivation to develop the ERM

Even though the RM is a rapidly convergent iterative algorithm that produces linear regression models with small S in remarkably little computer time,[12, 16, 17] in some cases it can get trapped in a local minimum of S, leaving some room for improvement, which led to the development of ERM. It should be mentioned that such local minima provided acceptable models, as shown in all earlier applications of the RM.[12, 16, 17] In summary, RM is a technique that approaches the minimum of S by judiciously taking into account the relative errors of the coefficients of the least-squares model given by a set of d descriptors d = {X1, X2, …, Xd}. It has been shown[16] that the RM gives models with better statistical parameters than the Forward Stepwise Regression (FSW) procedure[18] and similar to those of the GA. We believed that RM was preferable to the GA[12, 19] because the former takes into account the error in the regression coefficient and, as a result, the replacement of the descriptor is not at random as in the GA. GA are more complicated, and their practical application requires the tuning of some parameters such as population size, generation gap, crossover rate and mutation rate. These parameters typically interact among themselves nonlinearly and cannot be optimized one at a time. There is considerable discussion about parameter settings and approaches to parameter adaptation in the evolutionary computation literature; however, there do not seem to be conclusive results on which may be the best.

As a brief summary, GA are search techniques based on natural evolution where variables play the role of genes (in this case a set of descriptors) in an individual of the species. An initial group of random individuals (population) evolves according to a fitness function (in this case the standard deviation) that determines the survival of the individuals. The algorithm searches for those individuals that lead to better values of the fitness function through selection, mutation and crossover genetic operations. The selection operators guarantee the propagation of individuals with better fitness in future populations.
The GAs explore the solution space by combining genes from two individuals (parents) using the crossover operator to form two new individuals (children), and also by randomly mutating individuals using the mutation operator. The GAs offer a combination of hill-climbing ability (natural selection) and a stochastic method (crossover and mutation), and explore many solutions in parallel, processing information in a very efficient manner.[20]

1.2. Development of ERM

The first algorithm that was built was a slight modification of RM and was called the Modified Replacement Method (MRM).[9] The behaviour of the RM and MRM is compared in Figure 1 and Figure 2, respectively.


Figure 1. Standard deviation vs. number of steps of the RM.

Figure 2. Standard deviation vs. number of steps of the MRM.

This comparison suggested that we implement a further optimization routine that integrates the two algorithms, associating the MRM with a thermal agitation of the RM. We tried the following combinations: MRM-RM, RM-MRM and RM-MRM-RM. When RM was applied after MRM (MRM-RM), the starting solution for RM consisted of the best set of variables obtained from the previous application of the MRM algorithm. It should be mentioned that after several runs we decided to discard the MRM-RM-MRM sequence simply because it increased the number of linear regressions significantly without achieving appreciable improvements in the statistical results. The sequence RM-MRM-RM emerged as the best algorithm, and this is in fact the ERM (Figure 3). This is in line with the idea that the ERM is the only algorithm from the above list that goes through a complete simulated annealing cycle.[15]

Figure 3. Standard deviation vs. number of steps of the ERM.

1.3. Description of the algorithms: Differences between ERM and RM

Having a large set $\mathbf{D} = \{X_1, X_2, \ldots, X_D\}$ of D descriptors, we need to choose an optimal subset $\mathbf{d}_m = \{X_{m1}, X_{m2}, \ldots, X_{md}\}$ of d ≪ D descriptors with minimum standard deviation S:

$$ S = \frac{1}{N - d - 1} \sum_{i=1}^{N} res_i^2 \qquad (1) $$

where N is the number of molecules in the training set and $res_i$ is the residual for molecule i (the difference between the experimental and predicted property). Notice that $S(\mathbf{d}_n)$ is a distribution on a discrete space of D!/[d!(D−d)!] disordered points $\mathbf{d}_n$. The full search (FS), which consists of calculating $S(\mathbf{d}_n)$ on all those points, always enables us to arrive at the global minimum; however, it is computationally prohibitive if D is sufficiently large. The RM consists of the following steps:


• We choose an initial set of descriptors $\mathbf{d}_k$ at random, replace one of the descriptors, say $X_{ki}$, with all the remaining D − d descriptors, one by one, and keep the set with the smallest value of S. That is what we define as a ‘step’.

• From the resulting set we choose the descriptor with the greatest standard deviation in its coefficient (not considering the one changed previously) and substitute all the remaining D − d descriptors, one by one, in its position. The procedure is repeated until the set remains unmodified. In each cycle the descriptor optimized in the previous one is not modified. Thus, we obtain the candidate $\mathbf{d}_m^{(i)}$ that comes from the so-constructed path i.

• It should be noticed that if the replacement of the descriptor with the largest error by those in the pool does not decrease the value of S, then the descriptor is not changed.

• The process is carried out for all the possible paths i = 1, 2, …, d, and we keep the point $\mathbf{d}_m$ with the smallest standard deviation: $\min_i S(\mathbf{d}_m^{(i)})$.
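The steps above can be sketched in Python. This is only a minimal illustration, not the authors' Matlab implementation: the function names (`fit_subset`, `best_replacement`, `replacement_path`, `replacement_method`), the use of the normal equations for the least-squares fit, the reading of "until the set remains unmodified" as "until every position has been tried without improvement", and the small synthetic data set are all assumptions made for this example. S is computed as written in Eq. (1); taking its square root would not change which subset is selected.

```python
import random

def invert(a):
    """Gauss-Jordan inverse of a small square matrix given as a list of lists."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]

def fit_subset(X, y, cols):
    """Least-squares fit with intercept; returns (S of Eq. 1, std. errors of the d slopes)."""
    A = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(cols) + 1
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    Aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    try:
        inv = invert(AtA)
    except ZeroDivisionError:  # exactly collinear descriptors: reject this subset
        return float("inf"), [float("inf")] * len(cols)
    beta = [sum(inv[i][j] * Aty[j] for j in range(p)) for i in range(p)]
    rss = sum((yi - sum(b * a for b, a in zip(beta, r))) ** 2 for r, yi in zip(A, y))
    S = rss / (len(y) - len(cols) - 1)
    return S, [(S * max(inv[i][i], 0.0)) ** 0.5 for i in range(1, p)]

def best_replacement(X, y, cols, pos):
    """One 'step': try every unused descriptor at position pos, keep the smallest S."""
    best_s, best = fit_subset(X, y, cols)[0], list(cols)
    for j in range(len(X[0])):
        if j not in cols:
            trial = list(cols)
            trial[pos] = j
            s = fit_subset(X, y, trial)[0]
            if s < best_s:  # the descriptor is changed only if S decreases
                best_s, best = s, trial
    return best_s, best

def replacement_path(X, y, start, first_pos):
    """Path i: replace position first_pos, then keep replacing the largest-error descriptor."""
    s, cols = best_replacement(X, y, start, first_pos)
    tried = {first_pos}  # positions tried since the last modification of the set
    while len(tried) < len(cols):
        errors = fit_subset(X, y, cols)[1]
        pos = max((p for p in range(len(cols)) if p not in tried), key=lambda p: errors[p])
        new_s, new_cols = best_replacement(X, y, cols, pos)
        tried = {pos} if new_cols != cols else tried | {pos}
        s, cols = new_s, new_cols
    return s, cols

def replacement_method(X, y, d, rng):
    """Run all d paths from one random initial set and keep the smallest S."""
    start = rng.sample(range(len(X[0])), d)
    return min(replacement_path(X, y, start, i) for i in range(d))

# Hypothetical data: 30 "molecules", 12 descriptors, property driven by descriptors 3 and 7.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(12)] for _ in range(30)]
y = [2.0 * r[3] - 1.5 * r[7] + 0.01 * rng.gauss(0, 1) for r in X]
s, cols = replacement_method(X, y, 2, rng)
print(sorted(cols), s)
```

On this toy problem the search only needs a few dozen regressions, whereas a full search over 12 descriptors taken 2 at a time would need 66; the gap widens rapidly as D and d grow.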

MRM follows the same strategy, except that in each step the descriptor with the largest error is replaced even if that substitution is not accompanied by a smaller value of S (the next smallest value of S is chosen). MRM converges to different solutions and commonly bounces from one point to another, occasionally repeating some of them; in such a case we find that a plausible solution is the first one that appears four times. If convergence is too slow, the process is stopped after 350 steps, a number that achieves a sufficiently small S. ERM, as mentioned before, is the combination of MRM and RM in the sequence RM-MRM-RM.

1.4. ERM, MRM vs RM: Test results

The different algorithms were tested and compared using four experimental data sets: a fluorophilicity data set (FLUOR), consisting of 116 organic compounds characterized by 1269 theoretical descriptors;[16] a growth inhibition data set (GI), with growth inhibition values of the ciliated protozoan Tetrahymena pyriformis by 200 mechanistically diverse phenolic compounds and 1338 structural descriptors;[17] a GABA receptor data set (GABA), containing 78 inhibition data for flavone derivatives and 1187 molecular descriptors;[21] and a set of 100 ED50 values (MES, mice, ip) for enaminones (MES) with 1306 descriptors.[22-24] Additionally, a data set of 209 Polychlorinated Biphenyls (PCB) with measured Relative Response Factor, containing 63912 molecular descriptors,[25] was used to test whether the application of the improved algorithm to an extremely large data set was possible.

In all cases the structures of the compounds were first pre-optimized with the Molecular Mechanics Force Field (MM+) procedure included in Hyperchem version 6.03,[26] and the resulting geometries were further refined by means of the semiempirical method PM3 (Parametric Method-3) using the Polak-Ribiere algorithm and a gradient norm limit of 0.01 kcal/Å. More than a thousand molecular descriptors were calculated using the software Dragon 5.0,[27] including parameters of all types such as constitutional, topological, geometrical, quantum mechanical, etc. In the last data set, 62873 descriptors calculated by the molecular descriptors family methodology[25] were added to the pool. All the algorithms were programmed in Matlab 5.0.[28]

All the numerical tests were carried out for d=7, as an example of a computationally demanding search with a reasonable number of descriptors for a potential model in common QSPR/QSAR studies. It should be mentioned that in the practical use of the algorithm, correlated descriptors can easily be avoided by taking out of the pool those descriptors that have a correlation higher than a set limit. A comparison against the exact solution is not possible because, even for the smallest database (GABA, D = 1187), the number of linear regressions for a FS with d = 7 amounts to $6.47 \times 10^{17}$, which would take about $3.4 \times 10^{6}$ years on a PC with an AMD Athlon 64 2800+ processor. Even on a much more powerful computer the solution would not be reached in a reasonable time.

Table 1 shows the average percentage improvement of S taking RM as reference, using all databases and three random initial sets of seven descriptors. The last row of the table gives the ratio of the number of linear regressions for the new algorithms with respect to the RM.

Table 1. Average improvement on the standard deviation taking RM as reference, number of linear regressions and computational time (between parentheses) for MRM and ERM, for the four full data sets with three different initial seven-descriptor sets.

Algorithm                               RM               MRM               ERM
Average improvement                     0%               4.76%             5.73%
Number of linear regressions
(time in minutes*), average             283878 (0.78)    1629938 (4.51)    1926775 (5.33)
Ratio                                   1                5.74              6.79

* Using an AMD Athlon 64 2800+ processor DDR 1GB
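As a rough plausibility check of the full-search figures quoted for the GABA data set, the number of regressions and the corresponding time can be recomputed. The sketch below assumes that the time estimate follows from the RM throughput in Table 1 (283878 regressions in 0.78 minutes); the text does not state how the estimate was obtained, so the variable `regressions_per_minute` is only an assumption of this example.

```python
import math

def full_search_regressions(D, d):
    """Number of d-descriptor subsets in a pool of D descriptors: D!/[d!(D-d)!]."""
    return math.comb(D, d)

n_regressions = full_search_regressions(1187, 7)

# Assumed throughput, taken from the RM column of Table 1.
regressions_per_minute = 283878 / 0.78
years = n_regressions / regressions_per_minute / (60 * 24 * 365)

print(f"{n_regressions:.3e} regressions, roughly {years:.2e} years")
```

Under that assumption the result agrees with the order of magnitude given in the text, which illustrates why a full search is out of reach for pools of this size.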


It follows from Table 1 that the MRM outperforms RM and that ERM gives even better results. The ERM computational demand is comparable to that of MRM and is almost seven times greater than that of RM. This is the price we have to pay for obtaining QSAR models with better statistical parameters than the ones obtained previously.[16]

As a final test of the ERM on a much more demanding problem, we evaluated it on the PCB database containing 63912 descriptors.[25] The ERM converged in a reasonable time; Table 2 shows the results. As expected, ERM gave smaller values of S for the same three random initial sets chosen in the preceding tests.

It arises from this section that ERM outperforms RM, which as mentioned gives results similar to those of GA; it is therefore expected that ERM will outperform GA as well. ERM increases the computational demand with respect to RM (by a factor of roughly five to seven in these tests); nevertheless, it was successfully used in a problem with an extremely large number of descriptors (63912) calculated by Dragon and the molecular descriptors family methodology.

Table 2. Average improvement on the standard deviation taking RM as reference, number of linear regressions and computational time (between parentheses) for ERM, for the PCB data set with three different initial seven-descriptor sets.

Algorithm                                  RM                  ERM
Average improvement                        0%                  2%
Number of linear regressions
(computation time in minutes*), average    1.42E+07 (39.25)    7.42E+07 (205.18)
Ratio                                      1                   5.23

* Using an AMD Athlon 64 2800+ processor DDR 1GB

2. Programming language

As previously mentioned, we use Matlab 5.0[28] as the programming platform for all the search algorithm calculations performed in our research group, because this platform presents many advantages that are detailed below. As a reference point, in the past we used Derive[29] as programming language; it worked well, and all the algorithms achieved satisfactory results. Nevertheless, the computational time was very high; sometimes the programs had to be left running for several days. For that reason we decided to translate the algorithms to a more efficient programming language. We evaluated two alternatives, ANSI C[30] and Matlab. After reading the opinion of several experts in programming from around the globe, it was decided to proceed with Matlab. In summary, the experts said that ANSI C could potentially be faster than Matlab; nevertheless, a basic knowledge of the language was not enough to program efficient C code, and several years of experience were required. The estimated ratio of the time it would take a person with basic knowledge to program in C vs. Matlab was as high as 10 to 1. Moreover, it was estimated that although the new program in Matlab might not be optimized, the Matlab language has been optimized for matrix handling; hence the management of computational resources would be optimal in that sense.

After translating all the codes from Derive to Matlab, the reduction in computational time exceeded our expectations. As can be seen in Table 3, the necessary time in Matlab is from 140 to 700 times lower than in Derive. The time savings grow with d due to the fact that Derive’s computational time increases exponentially whereas Matlab’s increases polynomially. This reduction in computational time is one of the reasons that an extremely demanding problem, such as the one presented in the previous section on descriptors calculated by the molecular descriptors family methodology, can be solved using ERM.

Table 3. Calculation time of Derive vs. Matlab for different d. D=1269, N=116.

d    Number of regressions    Derive’s time (s)    Matlab’s time (s)    Derive/Matlab
1    1269                     41                   0.125                330.4
2    20290                    270                  2.0                  137.6
3    53175                    1541                 5.5                  280.2
4    80964                    3850                 9.8                  394.8
5    126405                   6382                 16.8                 378.9
6    189456                   15170                28.1                 540.3
7    247359                   26647                37.5                 710.6
8    322824                   38345                54.8                 699.7

* Using an AMD Athlon 64 2800+ processor DDR 1GB


We cannot compare the results against ANSI C because we do not have the C code; nevertheless, we can say that programming in Matlab was very straightforward and that the interface is very friendly, making the input of the data very simple. To test the ERM code, please refer to the Mathworks[44] community file exchange server, where a QSAR/QSPR Search Algorithms Toolbox containing the ERM code along with a tutorial has been uploaded.

3. Determining the optimal number of parameters to include in a model using ERM

A fundamental step in the construction of any QSAR model is determining the optimal number of parameters (dopt) to include in the model. As the number of parameters included in the model is increased, the statistical parameters of the training set (molecules used to build the model) always improve. The statistical parameters of the test set (molecules left aside to test the predictive power of the model), however, first improve, as more relevant information from the structure is used, and after a certain number of parameters that depends on the case start to deteriorate due to over-fitting of the model to the training set molecules. A scheme that summarizes the problem of over-fitting is included in Figure 4.[31]

Figure 4. Typical behaviour of the standard deviation of a model in the Training and Test Set as d is increased.

One may think that a good way of determining the optimal number of parameters is to develop models with an increasing number of parameters, try them in the test set and keep the one with the best statistical parameters in the test set. This may seem appropriate, but it cannot be done because the chosen model would have been determined using the test set; consequently, the test set would no longer be a true external set of molecules. Thus we need a way of determining dopt that only uses the training set. To do that we use the Kubinyi function (FIT),[32, 33] a statistical parameter that is closely related to the Fisher ratio (F) but avoids the main disadvantage of the latter, which is too sensitive to changes in small d values and poorly sensitive to changes in large d values. The FIT(d) criterion has a low sensitivity to changes in small d values and a substantially increasing sensitivity for large d values. The greater the FIT value, the better the linear equation. It is given by the following expression:

$$ FIT = \frac{R^2 (N - d - 1)}{(N + d^2)(1 - R^2)} \qquad (2) $$

Commonly, a plot of FIT vs. d presents a maximum from which it is possible to determine the optimal number of molecular descriptors (dopt) to be included in the linear regression model. On many occasions, however, the maximum is not reached after adding a reasonable number of descriptors to the model. For this reason we recently proposed a variable FIT equation, or VFIT, which depends on a semi-empirical constant k that gives more weight to d in the FIT equation.[34] It reads:

$$ VFIT = \frac{R^2 (N - kd - 1)}{(N + d^2)(1 - R^2)} \qquad (3) $$
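The two criteria can be evaluated with a few lines of Python. The sketch below assumes the reading FIT = R²(N − d − 1)/[(N + d²)(1 − R²)] for Eq. (2), with d replaced by kd in the numerator for Eq. (3); the function name `vfit` is hypothetical. As a consistency check it is applied to statistics reported later in this chapter (R = 0.9523, N = 55, d = 6 with FIT = 5.14 for the flavonoid model, and R = 0.9746, N = 23, d = 4, k = 3 with VFIT = 4.856 for the solvent model).

```python
def vfit(R, N, d, k=1.0):
    """VFIT of Eq. (3); with k = 1 it reduces to the Kubinyi FIT of Eq. (2)."""
    r2 = R * R
    return r2 * (N - k * d - 1) / ((N + d * d) * (1.0 - r2))

fit_flavonoids = vfit(0.9523, 55, 6)          # FIT, i.e. k = 1
vfit_solvents = vfit(0.9746, 23, 4, k=3.0)    # VFIT with k = 3

print(round(fit_flavonoids, 2), round(vfit_solvents, 3))
```

Because k multiplies d in the numerator, larger values of k penalize additional descriptors more strongly, which is what moves the maximum of the curve towards smaller d.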

By means of this equation we obtain dopt as the number of descriptors that yields the maximum value of VFIT (dmax) in the plot of VFIT vs. d. The semi-empirical constant k is determined by taking incremental values of 0.5 until the maximum (dmax) remains unchanged for two increments and complies with the rule of thumb that at least 5 data points should be present for each fitting parameter.[35] In this way we obtain dopt without ever using the data from the test set. An example is presented in section 4.2.

4. Applications of ERM

The algorithm can be applied wherever any other QSAR methodology can. The potential applications are enormous, and in this section only a couple of recent applications are presented as examples. The main goal is to show how the algorithm is used, the type of predictive models that can be achieved, how these models are successfully used to predict experimental values, and to compare the results of ERM and GA.

4.1. QSAR Prediction of inhibition of aldose reductase for flavonoids[36]

In this study we performed a predictive analysis based on Quantitative Structure-Activity Relationships (QSAR) of an important property of flavonoids, the inhibition (IC50) of Aldose Reductase (AR). The importance of AR inhibition is that it prevents cataract formation in diabetic patients. The best linear model constructed from 55 molecular structures incorporated six molecular descriptors, selected from more than a thousand geometrical, topological, quantum-mechanical and electronic types of descriptors.

In the present study we chose a training set of 55 flavonoid derivatives whose activities were reported in the literature by Štefanič-Petek et al.[37] We first tried to use all 75 molecules from that paper, but a further revision of the references containing the experimental data, made to clarify some doubts on the structures,[38, 39] revealed some important errors in the representation of some of them. For this reason the number of flavones was reduced to just 55 reliable structures. More precisely, that reduction was due to the fact that some molecules either did not exhibit the desired flavone backbone or their exact structure was not found in the references. In addition, a set of 4 flavones with the desired backbone[40] was chosen to test the prediction ability of the new model.

The structures of the compounds were first pre-optimized with the Molecular Mechanics Force Field (MM+) procedure included in the Hyperchem 6.03 package,[26] and the resulting geometries were further refined by means of the semiempirical method PM3 (Parametric Method-3) using the Polak-Ribiere algorithm and a gradient norm limit of 0.01 kcal/Å. We computed the molecular descriptors using the software Dragon 5.0,[27] including parameters of all types: Constitutional, Topological, Geometrical, Charge, GETAWAY (Geometry, Topology and Atoms-Weighted AssemblY), WHIM (Weighted Holistic Invariant Molecular descriptors), 3D-MoRSE (3D-Molecular Representation of Structure based on Electron diffraction), Molecular Walk Counts, BCUT descriptors, 2D-Autocorrelations, Aromaticity Indices, Randic Molecular Profiles, Radial Distribution Functions, Functional Groups, Atom-Centred Fragments, Empirical and Properties.[8] We enlarged that pool by the addition of 18 constitutional and 4 quantum-chemical descriptors (molecular dipole moments, total energies, HOMO-LUMO energies) not provided by the program Dragon. The resulting total pool thus consisted of D=1233 descriptors. It should be mentioned that dopt=6 was not determined in this case using the above-mentioned methodology, because this improved methodology had not been developed at that time. The best model found using ERM was the following:

$$ -\log IC_{50} = -85.5375(\pm 10) - 10.882(\pm 1.4)\,E1u - 15.398(\pm 0.9)\,MATS4e + 55.920(\pm 5.35)\,BELm2 + 7.7606(\pm 0.9)\,HATS6e - 2.6755(\pm 0.5)\,DISPe - 18.253(\pm 1.1)\,R4p \qquad (4) $$

$$ N = 55,\ R = 0.9523,\ S = 0.3789,\ FIT = 5.14,\ p < 10^{-5},\ R_{loo} = 0.934,\ S_{loo} = 0.447,\ R_{l-30\%-o} = 0.803,\ S_{l-30\%-o} = 0.886,\ RMSE_{Test\ Set} = 2.9127 $$

The theoretical validation of the models was done using the well-known Leave-One-Out (loo) and Leave-More-Out Cross-Validation (l-n%-o) procedures,[41] where n% represents the number of molecules removed from the training set. We generated 5,000,000 cases of random data removal for l-n%-o, where n%=30% (16 flavonoids).

As a benchmark we used RM and a GA to select an optimal model with dopt = 6 descriptors. To this end we optimized the GA parameters for this particular problem by means of several trials and thus arrived at the following convenient settings: Number of individuals = 250; Generation Gap = 0.9; Single Point Crossover Probability = 0.6; Mutation Probability = 0.7/d. We decided to stop the evolution process when one individual occupied more than 90% of the population or when the number of generations reached 2500. The results are summarized in Table 4, and the details of the descriptors involved in all models are presented in Table 5. By examining the statistical parameters calculated from the training and test sets, we conclude that the ERM produces better results than both the GA and RM when exploring large sets of descriptors.

Table 4. Comparison between the best models found using ERM, RM and GA with d=6.

Model    Descriptors used                             R        S        RMSE (Test Set)
ERM      E1u, MATS4e, BELm2, HATS6e, DISPe, R4p       0.952    0.379    2.9127
RM       BELp4, GGI8, MATS4e, Mor22e, E1p, R4v        0.936    0.436    3.9059
GA       TIC0, MATS4e, H7e, E1u, BELe4, R3v           0.937    0.433    4.1607
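The leave-one-out and leave-n%-out procedures mentioned above can be sketched as follows. This is a pure-Python illustration under stated assumptions: the helper names (`solve`, `ols`, `predict_held_out`, `r_loo`, `leave_n_out_case`) are hypothetical, R_loo is taken here to be the correlation between experimental and leave-one-out predicted values, and the synthetic data stand in for a real training set. The text does not state how the statistics of the 5,000,000 random removals were aggregated, so the sketch only shows a single removal case.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bd] from the normal equations."""
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    Aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve(AtA, Aty)

def predict_held_out(X, y, held_out):
    """Refit the model without the held-out molecules and predict their property."""
    keep = [i for i in range(len(y)) if i not in held_out]
    beta = ols([X[i] for i in keep], [y[i] for i in keep])
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], X[i])) for i in held_out]

def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

def r_loo(X, y):
    """Correlation between experimental values and leave-one-out predictions."""
    preds = [predict_held_out(X, y, [i])[0] for i in range(len(y))]
    return pearson_r(y, preds)

def leave_n_out_case(X, y, fraction, rng):
    """One random leave-n%-out case: held-out indices and their prediction residuals."""
    held = rng.sample(range(len(y)), int(fraction * len(y)))
    preds = predict_held_out(X, y, held)
    return held, [y[i] - p for i, p in zip(held, preds)]

# Hypothetical training set: 20 molecules, 2 already-selected descriptors.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
y = [1.0 + 2.0 * a - b + 0.05 * rng.gauss(0, 1) for a, b in X]
loo = r_loo(X, y)
held, residuals = leave_n_out_case(X, y, 0.30, rng)
print(loo, len(held))
```

Both procedures use only the training set, so they can be run before the external test set is ever consulted.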


Table 5. Symbols for molecular descriptors involved in the models.

Molecular descriptor    Type                  Detail
E1u                     WHIM                  1st component accessibility directional WHIM index / unweighted.
MATS4e                  2D Autocorrelations   Moran autocorrelation - lag 4 / weighted by atomic Sanderson electronegativities.
BELm2                   BCUT                  Lowest eigenvalue n. 2 of Burden matrix / weighted by atomic masses.
HATS6e                  GETAWAY               Leverage-weighted autocorrelation of lag 6 / weighted by atomic Sanderson electronegativities.
DISPe                   Geometrical           d COMMA2 value / weighted by atomic Sanderson electronegativities.
R4p                     GETAWAY               R autocorrelation of lag 4 / weighted by atomic polarizabilities.
BELp4                   BCUT                  Lowest eigenvalue n. 4 of Burden matrix / weighted by atomic polarizabilities.
GGI8                    Topological           Topological charge index of order 8.
Mor22e                  3D-MoRSE              3D-MoRSE - signal 22 / weighted by atomic Sanderson electronegativities.
E1p                     WHIM                  1st component accessibility directional WHIM index / weighted by atomic polarizabilities.
R4v                     GETAWAY               R autocorrelation of lag 4 / weighted by atomic van der Waals volumes.
TIC0                    Topological           Total information content index (neighborhood symmetry of 0-order).
H7e                     GETAWAY               H autocorrelation of lag 7 / weighted by atomic Sanderson electronegativities.
BELe4                   BCUT                  Lowest eigenvalue n. 4 of Burden matrix / weighted by atomic Sanderson electronegativities.
R3v                     GETAWAY               R autocorrelation of lag 3 / weighted by atomic van der Waals volumes.

The plot of predicted vs. experimental –logIC50 obtained with the best ERM model (Figure 5) suggests that the 55 flavone derivatives follow a straight line.

Figure 5. Predicted (Eq. 4) versus experimental –logIC50 for the training (rhombus) and test (triangles) sets.

With the purpose of demonstrating that Eq. (4) does not result from happenstance, we resorted to a widely used approach to establish the model robustness: the so-called y-randomization.[42] It consists of scrambling the experimental property p in such a way that activities do not correspond to the respective compounds. After analyzing 1,000,000 cases of y-randomization, the smallest value obtained from this process, S = 0.8254, turned out to be considerably greater than the one corresponding to the true calibration, S = 0.3789. This result indicates that the model is robust, that the calibration is not a fortuitous correlation, and that we have derived a reliable structure-activity relationship.

4.2. Predictive Study of solvent quenching of the $^5D_0 \rightarrow {}^7F_2$ emission of Eu(6,6,7,7,8,8,8-heptafluoro-2,2-dimethyl-3,5-octanedionate)3[34]

In this study we performed a predictive analysis of the luminescence lifetime (τ) of Eu(fod)3 in different solvents. The Eu(fod)3 luminescent properties have been helpful as replacements for radioactivity, alternatives to standard fluorescent dyes and donors in energy transfer experiments, and are finding expanding applications in a wide variety of bioanalytical assays. Our best linear model constructed from 23 molecular structures combines four molecular descriptors, selected from more than a thousand geometrical, topological, quantum-mechanical and electronic types of descriptors.

We chose a training set of 23 solvents and a test set of 2 solvents for which the Eu(fod)3 luminescence lifetime (τ) was already measured.[41] The size of the training set was selected aiming to have as much structural information as possible in the calculated models. It is interesting to notice that in this example there was only one molecule involved and the solvent was changed, a different approach from that of the majority of QSAR studies, where different molecules are modeled in the same medium. This shows the extent of possible applications, and that the range can be further increased by taking different approaches.

Again, the structures of the compounds were first pre-optimized with the Molecular Mechanics Force Field (MM+) procedure included in the Hyperchem 6.03 package,[26] and the resulting geometries were further refined by means of the semiempirical method PM3 (Parametric Method-3) using the Polak-Ribiere algorithm and a gradient norm limit of 0.01 kcal/Å. We computed the molecular descriptors by means of the software Dragon 5.0,[27] including parameters of all types.[8] Additionally, 18 constitutional descriptors and 4 quantum-chemical ones (molecular dipole moments, total energies, HOMO-LUMO energies), which were not provided by the program Dragon, were added to the pool. We thus had a pool of D=1057 available descriptors.

The optimal number of descriptors was determined using the new methodology. Different predictive relationships linking the molecular structure of the solvents with the luminescence lifetime (τ) of Eu(fod)3 were calculated by means of linear regression models with 1 to 14 parameters (d), selected by ERM from the pool of D=1057 descriptors.
As k in VFIT is increased (Table 6), a first maximum appears at d=9 (k=2), a second one at d=7 (k=2.5) and finally a third one at d=4 (k=3), which remains unchanged for an additional increment and complies with the above-mentioned rule of thumb, which in this case indicated a maximum of 5 parameters. The resulting VFIT with k=3 increases with d up to a maximum at d=dmax=4 (Figure 6); this is the optimal number of descriptors in the model. Figure 6 also shows that FIT does not present a maximum in the interval of d between 1 and 14. Table 6 shows that dmax=4 remains for two more k increments, supporting the fact that this is actually the optimal number of model parameters.


Table 6. Incremental values of k and the resulting number of descriptors (d) that present a maximum in VFIT.

k      d (max. in VFIT)
1      -
1.5    -
2      9
2.5    7
3      4
3.5    4
4      4
4.5    3
5      2
5.5    2
6      1
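The selection of k from Table 6 can be written as a short procedure. The sketch below is an interpretation rather than the authors' code: the function `select_k` and its parameters are hypothetical, "unchanged for two increments" is read as the same dmax at k, k + 0.5 and k + 1.0, and the rule-of-thumb limit is passed in as `d_limit` (5 in this example, as stated in the text).

```python
def select_k(dmax_by_k, d_limit, step=0.5, repeats=2):
    """Return the first (k, dmax) whose dmax is within d_limit and stays
    unchanged for the next `repeats` increments of k; None if there is none.

    dmax_by_k maps k to the d at which VFIT peaks (None when no maximum appears).
    """
    for k in sorted(dmax_by_k):
        d = dmax_by_k[k]
        if d is None or d > d_limit:
            continue
        following = [dmax_by_k.get(k + step * i) for i in range(1, repeats + 1)]
        if all(v == d for v in following):
            return k, d
    return None

# Values transcribed from Table 6 ("-" entries become None).
table6 = {1.0: None, 1.5: None, 2.0: 9, 2.5: 7, 3.0: 4, 3.5: 4,
          4.0: 4, 4.5: 3, 5.0: 2, 5.5: 2, 6.0: 1}

result = select_k(table6, d_limit=5)
print(result)
```

With the values of Table 6 this reading selects k = 3 and dopt = 4, in agreement with the choice made in the text.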

Figure 6. VFIT (squares) and FIT (circles, on the secondary axis) parameters as a function of the number of descriptors for the training set.

The best model found using ERM and dopt=4 is:

$$ \tau = 0.6903(\pm 0.04) + 0.1575(\pm 0.03) \cdot GATS5v - 0.1731(\pm 0.01) \cdot Sp + 0.2010(\pm 0.03) \cdot Jhetv + 0.2120(\pm 0.01) \cdot RDF015e \qquad (5) $$

$$ N = 23,\ R = 0.9746,\ S = 0.0375,\ FIT = 8.7242,\ p < 10^{-4},\ R_{loo} = 0.9597,\ S_{loo} = 0.0470,\ R_{l-22\%-o} = 0.9031,\ S_{l-22\%-o} = 0.0735,\ RMSE_{Test\ Set} = 0.03808 $$


Once more, the theoretical validation of the models was done using the well-known Leave-One-Out (loo) and Leave-More-Out Cross-Validation (l-n%-o) procedures,[41] where n% represents the number of molecules removed from the training set. We generated 5,000,000 cases of random data removal for l-n%-o, where n%=22% (5 solvents). We also tried the GA on the same problem, which required several runs to optimize its parameters. They were: Number of individuals = 20; Generation Gap = 0.9; Single Point Crossover Probability = 0.2; Mutation Probability = 0.7/d. The algorithm was stopped when one individual occupied more than 90% of the population or when the number of generations reached 2000. The results are summarized in Table 7, and definitions of the descriptors involved in all models are presented in Table 8. We can appreciate that ERM presents better statistical parameters than GA. The plot of predicted vs. experimental τ shown in Figure 7 suggests that the 23 training and 2 test set solvents approximately follow a straight line. A y-randomization procedure was also performed by analyzing 5,000,000 cases; the smallest S value obtained, 0.06739, is considerably larger than the one found in the true calibration (S=0.0375).

Table 7. Linear models for the training set with N=23.

Model    Descriptors used                  R        S         VFIT     RMSE (Test Set)
ERM      GATS5v, Sp, Jhetv, RDF015e        0.975    0.0375    4.856    0.0381
GA       Mor30e, piPC05, BEHm5, X1Av       0.968    0.0420    3.815    0.1246

Table 8. Symbols for molecular descriptors appearing in different models.

Molecular descriptor    Type                           Description
GATS5v                  2D Autocorrelations            Geary autocorrelation - lag 5 / weighted by atomic van der Waals volumes.
Sp                      Constitutional                 Sum of atomic polarizabilities (scaled on Carbon atom).
Jhetv                   Topological                    Balaban-type index from van der Waals weighted distance matrix.
RDF015e                 Radial Distribution Function   Radial Distribution Function - 1.5 / weighted by atomic Sanderson electronegativities.
Mor30e                  3D-MoRSE                       3D-MoRSE - signal 30 / weighted by atomic Sanderson electronegativities.
piPC05                  Topological                    Molecular multiple path count of order 05.
BEHm5                   BCUT                           Highest eigenvalue n. 5 of Burden matrix / weighted by atomic masses.
X1Av                    Topological                    Average valence connectivity index chi-1.
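The y-randomization check used in both applications can be sketched as follows. The sketch makes several assumptions that are not taken from the text: it refits a fixed set of descriptors to each scrambled property vector (whether the descriptor selection was repeated for each scrambled case is not stated), it computes S as the residual sum of squares divided by N − d − 1, and it uses a small synthetic data set with far fewer cases than the millions analyzed in the chapter. The helper names (`solve`, `ols`, `S_value`, `y_randomization`) are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bd] from the normal equations."""
    A = [[1.0] + list(row) for row in X]
    p = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    Aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve(AtA, Aty)

def S_value(X, y):
    """Residual sum of squares divided by N - d - 1 for the fitted linear model."""
    beta = ols(X, y)
    rss = sum((yi - beta[0] - sum(b * v for b, v in zip(beta[1:], row))) ** 2
              for row, yi in zip(X, y))
    return rss / (len(y) - len(X[0]) - 1)

def y_randomization(X, y, n_cases, rng):
    """Smallest S found after refitting the model to n_cases scrambled property vectors."""
    smallest = float("inf")
    for _ in range(n_cases):
        shuffled = list(y)
        rng.shuffle(shuffled)  # activities no longer correspond to their compounds
        smallest = min(smallest, S_value(X, shuffled))
    return smallest

# Hypothetical calibration set: 20 molecules, 2 descriptors, a genuine linear relationship.
rng = random.Random(2)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
y = [1.0 + 2.0 * a - b + 0.05 * rng.gauss(0, 1) for a, b in X]
s_true = S_value(X, y)
s_random = y_randomization(X, y, 200, rng)
print(s_true, s_random)
```

A true structure-property relationship should give a calibration S well below the smallest S obtained from scrambled data, which is the comparison reported for Eqs. (4) and (5).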


Figure 7. Experimental lifetime τ versus that predicted by Eq. (5) for the training set (rhombus) and test set (triangles).

5. Conclusions

The Enhanced Replacement Method has been shown to be a superior algorithm for the search of an optimal set of descriptors from a much larger group. The algorithm outperforms RM, presenting better statistical parameters at the cost of an acceptable additional computational effort. The algorithm was successfully applied to an extremely demanding problem, searching a pool that included 62873 descriptors calculated by the molecular descriptors family methodology. The applications presented above show that it is easy to use and confirm its superiority to GA. The linear nature of the obtained models makes the interpretation of the structural parameters possible, in contrast to nonlinear methodologies such as Artificial Neural Networks.[43] Using the algorithm and the methodology presented above, it is possible to determine the optimal number of parameters to include in a model using only information from the training set. The algorithms are available at the Mathworks[44] community file exchange site.


References
1. Hansch, C. and Leo, A. 1995, Exploring QSAR: Fundamentals and Applications in Chemistry and Biology, Am. Chem. Soc., Washington, D.C.
2. Sexton, W.A. 1950, Chemical Constitution and Biological Activity, D. Van Nostrand, New York.
3. Hansch, C. and Fujita, T. 1964, J. Am. Chem. Soc., 86, 1616-1626.
4. Hansch, C. 1969, Acc. Chem. Res., 2, 232-239.
5. King, R.B., Ed. 1983, Chemical Applications of Topology and Graph Theory; Studies in Physical and Theoretical Chemistry, Elsevier, Amsterdam.
6. Katritzky, A.R., Lobanov, V.S. and Karelson, M. 1995, Chem. Soc. Rev., 24, 279-287.
7. Trinajstic, N. 1992, Chemical Graph Theory, CRC Press, Boca Raton, FL.
8. Todeschini, R. and Consonni, V. 2000, Handbook of Molecular Descriptors, Wiley VCH, Weinheim, Germany.
9. Mercader, A.G., Duchowicz, P.R., Fernandez, F.M. and Castro, E.A. 2008, Chemom. Intell. Lab. Syst., 92, 138-144.
10. Duchowicz, P.R., Castro, E.A. and Fernández, F.M. 2006, MATCH Commun. Math. Comput. Chem., 55, 179-192.
11. Duchowicz, P.R., Fernández, M., Caballero, J., Castro, E.A. and Fernández, F.M. 2006, Bioorg. Med. Chem., 14, 5876-5889.
12. Helguera, A.M., Duchowicz, P.R., Pérez, M.A.C., Castro, E.A., Cordeiro, M.N.D.S. and González, M.P. 2006, Chemometr. Intell. Lab., 81, 180-187.
13. So, S.S. and Karplus, M. 1996, J. Med. Chem., 39, 1521-1530.
14. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. 1953, J. Chem. Phys., 21, 1087-1092.
15. Kirkpatrick, S., Gelatt Jr., C.D. and Vecchi, M.P. 1983, Science, 220, 671-680.
16. Mercader, A.G., Duchowicz, P.R., Sanservino, M.A., Fernandez, F.M. and Castro, E.A. 2007, Journal of Fluorine Chemistry, 128, 484-492.
17. Duchowicz, P.R., Mercader, A.G., Fernández, F.M. and Castro, E.A. 2007, Chemom. Intell. Lab. Syst., 90, 97-107.
18. Draper, N.R. and Smith, H. 1981, Applied Regression Analysis, John Wiley & Sons, New York.
19. Duchowicz, P.R., González, M.P., Helguera, A.M., Cordeiro, M.N.D.S. and Castro, E.A. 2007, Chemom. Intell. Lab. Syst., 88, 197-203.
20. Mitchell, M. 1998, An Introduction to Genetic Algorithms, A Bradford Book, The MIT Press, Cambridge, Massachusetts, pp. 3-9, 130-131.
21. Duchowicz, P.R., Vitale, M.G., Castro, E.A., Autino, J.C., Romanelli, G.P. and Bennardi, D.O. 2007, Eur. J. Med. Chem., 43, 1593-1602.
22. Edafiogho, I.O., Hinko, C.N., Chang, H., Moore, J.A., Mulzac, D., Nicholson, J.M. and Scott, K.R. 1992, J. Med. Chem., 35, 2798-2805.
23. Edafiogho, I.O., V.V, K., Ananthalakshmi and Kombian, S.B. 2006, Bioorganic & Medicinal Chemistry, 14, 5266-5272.
24. Eddington, N.D., Cox, D.S., Khurana, M., Salama, N.N., Stables, J.P., Harrison, S.J., Negussie, A., Taylor, R.S., Tran, U.Q., Moore, J.A., Barrow, J.C. and Scott, K.R. 2003, European Journal of Medicinal Chemistry, 38, 49-64.
25. Jäntschi, L. 2007, Leonardo Electronic Journal of Practices and Technologies, 3, 67-84.
26. HYPERCHEM. 6.03 (Hypercube) http://www.hyper.com.
27. DRAGON. 5.0 Evaluation Version http://www.disat.unimib.it/chm.
28. Matlab. 5.0 The MathWorks Inc. http://www.mathworks.com/.
29. Derive. 5, Texas Instrument Incorporated, http://www.derive-europe.com/specs.asp?d6.
30. C. ANSI, Tutorial, http://cprogramminglanguage.net/c-programming-language-tutorial.aspx.
31. Hawkins, D.M. 2004, Journal of Chemical Information and Computer Sciences, 44, 1-12.
32. Kubinyi, H. 1994, QSAR & Combinatorial Science, 13, 285-294.
33. Kubinyi, H. 1994, QSAR & Combinatorial Science, 13, 393-401.
34. Mercader, A.G., Duchowicz, P.R., Fernández, F.M., Castro, E.A. and Wolcan, E. 2008, Chem. Phys. Lett., 462, 352-357.
35. Hansch, C. 1990, Comprehensive Drug Design, Pergamon Press, New York.
36. Mercader, A.G., Duchowicz, P.R., Fernández, F.M., Castro, E.A., Bennardi, D.O., Autino, J.C. and Romanelli, G.P. 2008, Bioorganic & Medicinal Chemistry, 16, 7470-7476.
37. Štefanič-Petek, A., Krbavčič, A. and Šolmajer, T. 2002, Croat. Chem. Acta, 75, 517-529.
38. Okuda, J., Miwa, I., Inagaki, K., Horie, T. and Nakayama, M. 1982, Biochemical Pharmacology, 31, 3807-3822.
39. Okuda, J., Miwa, I., Inagaki, K., Horie, T. and Nakayama, M. 1984, Chem. Pharm. Bull., 32, 767-772.
40. Sung Lim, S., Hoon Jung, S., Ji, J., Shin, K.H. and Keum, S.R. 2001, Journal of Pharmacy and Pharmacology, 53, 653-668.
41. Hawkins, D.M., Basak, S.C. and Mills, D. 2003, J. Chem. Inf. Model., 43, 579-586.
42. Wold, S. and Eriksson, L. 1995, in Statistical validation of QSAR results, Vol. 2 (Ed. H. v. d. Waterbeemd), VCH, Weinheim, pp. 309-318.
43. Wold, S., Sjostrom, M. and Eriksson, L. 1998, Encyclopedia of Computational Chemistry, Wiley, Chichester, England.
44. Mathworks. http://www.mathworks.com/matlabcentral/fileexchange/19578.




7. Orthogonalization methods in QSPR - QSAR Studies

Pablo R. Duchowicz*, Francisco M. Fernández and Eduardo A. Castro

Instituto de Investigaciones Fisicoquímicas Teóricas y Aplicadas INIFTA (UNLP, CCT La Plata-CONICET), Diag. 113 y 64, C.C. 16, Suc.4, (1900) La Plata, Argentina

Abstract. We discuss some features of the orthogonalization methods commonly applied in QSPR - QSAR studies. We outline the well-known multivariable linear regression analysis in vector form in order to compare mainly the Randic and Gram Schmidt orthogonalization procedures, and also lay the basis for other approaches such as Löwdin's. We expect that the present review may become the starting point for future developments in QSAR - QSPR Theory.

Correspondence/Reprint request: Dr. Pablo R. Duchowicz, Instituto de Investigaciones Fisicoquímicas Teóricas y Aplicadas INIFTA (UNLP, CCT La Plata-CONICET), Diag. 113 y 64, C.C. 16, Suc.4, (1900) La Plata, Argentina. E-mail: [email protected] / [email protected]

1. Introduction
Among the different theoretical methods available in the literature for predicting physicochemical properties or biological activities of substances solely from knowledge of their molecular structure is the Quantitative Structure - Property and Structure - Activity Relationships (QSPR - QSAR) Theory [1, 2]. Since the pioneering work of Hansch in 1964 [3], the basis of QSPR - QSAR has been the assumption that the variation in the behaviour of chemical compounds, as expressed by any experimentally measured property or activity of such compounds, can be correlated with numerical entities related to some aspect of the chemical structure, represented through well-defined molecular descriptors [4, 5]. The hypotheses involved in QSPR - QSAR analysis have proved suitable for a wide spectrum of properties / activities of interest. The molecular descriptors employed for establishing the model are generally used to describe different characteristics / attributes of a certain structure in order to yield information about the property / activity being studied. These numerical variables capture different constitutional, topological, geometrical or electronic aspects of the molecular structure under consideration, and in most cases can be obtained through mathematical formulae derived from several theories, such as Chemical Graph Theory, Information Theory, Quantum Mechanics, etc. [6-8]. Nowadays, there exist more than a thousand descriptors in the literature, which can be readily calculated via different software packages [9-11]. QSPR - QSAR models make it possible to predict the property / activity of substances that have not yet been tested for many different reasons, either because they are unstable or toxic, or simply because their measurement requires too much time or is expensive. In economic terms, these studies allow a rational use of the resources available in the laboratory, avoiding expensive and unnecessary experimental determinations. With respect to ethical aspects, QSPR - QSAR analyses applied to the field of Toxicology have reached great importance in the virtual screening of the toxic potential of compounds before their synthesis [12], and thus represent an effective alternative that reduces animal testing in biological assays. In Medicinal Chemistry, QSPR - QSAR predictions of both ADMET properties [13] and oral bioavailability of compounds [14] have been conveniently addressed.
Finally, from the theoretical point of view, the model can illuminate the mechanisms of physicochemical properties or biological activities of the compounds. Since a single descriptor is unable to carry the complete structural information of a given molecule, one has to search for the best descriptors among the more than a thousand available in the literature, in order to find those that are the most representative / descriptive parameters for the modeled property / activity. There exist various standard statistical methods that constitute common practice for QSPR - QSAR model design, such as the linear methods Multivariable Linear Regression (MLR) [15], Principal Component Analysis (PCA) [16], Genetic Algorithms (GA) [17] and the Replacement Method (RM) [18], and the non-linear methods Artificial Neural Networks (ANN) [19] and Support Vector Machines (SVM) [20].


In past years, the MLR technique has proved to be valuable in establishing predictive QSPR - QSAR models through an exhaustive analysis of a pool containing a great number of structural descriptors. Linear models are more general and can transparently reveal the effect on the modeled property of including / excluding structural variables, thus making it possible to suggest cause / effect relationships by means of parallelisms. The main advantage of linear regression models over non-linear ones is that they suffer less from the over-fitting (over-training) problem [21, 22], as model building does not involve too many optimization parameters, just the regression coefficient of each numerical variable in the model. Earlier attempts to derive the best QSPR - QSAR models based on MLR have led to the use of orthogonal descriptors [23-27], that is to say, descriptors with non-overlapping structural information. It is well known that the use of a set of orthogonal descriptors does not improve the global statistical parameters of the linear regression model, although the standard errors of the regression coefficients appear to be smaller [26]. However, a set of orthogonal descriptors (orthogonal predictor variables in general) offers several advantages. First, several expressions (such as those for the correlation coefficient and the standard deviation) are simpler [28]. Second, orthogonal descriptors have proved more suitable for the search of the best model [28-31]. Third, the orthogonalization algorithms are useful to identify linearly dependent or almost linearly dependent descriptors [32, 33] and make the linear regression more stable [32]. Fourth, the coefficients of orthogonal descriptors do not change when we add more descriptors to the set.
During the last decades, several authors have developed programs that enable one to obtain the best subset of d descriptors out of a large number D of descriptors, best in the sense of the smallest standard deviation S or the largest Fisher parameter F [11, 18, 28, 34]. This analysis requires that we do linear regressions for all

$$\binom{D}{d} = \frac{D!}{(D-d)!\,d!}$$

possible subsets, and this is also facilitated by the use of orthogonal descriptors because they make it easier to determine the contribution of each descriptor to the set.

2. Overview of vector notions of multivariable linear regression analysis
Linear algebra gives us an appropriate language for the discussion of the Least Squares Method and the Multivariable Regression Analysis [15]. It is possible to develop this approach further and derive some useful relationships related to orthogonal concepts that have not been discussed earlier, as reported in a recent work [35]. We consider a vector space $V = \{\mathbf{f}, \mathbf{g}, \ldots\}$ on the field of real numbers, endowed with an inner product $\mathbf{f}\cdot\mathbf{g}$, and define the norm $\|\mathbf{f}\| = \sqrt{\mathbf{f}\cdot\mathbf{f}}$ of a vector $\mathbf{f} \in V$ and the distance between $\mathbf{f}$ and $\mathbf{g}$, $\mathbf{f}, \mathbf{g} \in V$, as $D(\mathbf{f}, \mathbf{g}) = \|\mathbf{f} - \mathbf{g}\|$.

We consider a set of d + 1 linearly independent vectors $B = \{\mathbf{f}_0, \mathbf{f}_1, \ldots, \mathbf{f}_d\}$ that span a subspace $S_B \subseteq V$. It is well known that the set B is linearly independent if and only if the determinant $|\mathbf{M}|$ of the matrix $\mathbf{M}$ with elements $M_{ij} = \mathbf{f}_i \cdot \mathbf{f}_j$, $i, j = 0, 1, \ldots, d$, is nonzero. It is our purpose to find the closest approximation to a given vector $\mathbf{f} \in V$ by means of a linear combination $\hat{\mathbf{f}} \in S_B$ of vectors of B:

$$\hat{\mathbf{f}} = \sum_{j=0}^{d} c_j \mathbf{f}_j \tag{2.1}$$

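Clearly, the problem to solve here is to find the set of coefficients $c_j$ that minimize the distance $D(\mathbf{f}, \hat{\mathbf{f}})$:

$$\frac{\partial}{\partial c_j} D(\mathbf{f}, \hat{\mathbf{f}})^2 = 0, \quad j = 0, 1, \ldots, d \tag{2.2}$$

It is quite easy to prove that this set of $c_j$ makes $\mathbf{f} - \hat{\mathbf{f}}$ orthogonal to the subspace $S_B$. Therefore, it follows from $(\mathbf{f} - \hat{\mathbf{f}}) \cdot \mathbf{f}_j = 0$, $j = 0, 1, \ldots, d$, that the optimal coefficients $c_j$ are the unique solution to the set of linear equations:

$$\sum_{k=0}^{d} M_{jk} c_k = \mathbf{f} \cdot \mathbf{f}_j, \quad j = 0, 1, \ldots, d \tag{2.3}$$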

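Under such conditions it follows that $D(\mathbf{f}, \mathbf{g}) \ge D(\mathbf{f}, \hat{\mathbf{f}})$ for all $\mathbf{g} \in S_B$. One can also prove that $(\mathbf{f} - \mathbf{v}) \cdot (\hat{\mathbf{f}} - \mathbf{v}) = \|\hat{\mathbf{f}} - \mathbf{v}\|^2$ for all $\mathbf{v} \in S_B$, as well as $\|\mathbf{f} - \hat{\mathbf{f}}\|^2 = \|\mathbf{f} - \mathbf{v}\|^2 - \|\hat{\mathbf{f}} - \mathbf{v}\|^2$. It follows from these expressions that $\mathbf{f} \cdot \hat{\mathbf{f}} = \|\hat{\mathbf{f}}\|^2$, and $\|\hat{\mathbf{f}}\|^2 \le \|\mathbf{f}\|^2$.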
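The normal equations of Eq. (2.3) and the orthogonality of the residual can be checked numerically. The following is only a minimal sketch in plain Python: the data are hypothetical, and the helper names `dot`, `solve` and `least_squares` are introduced here for illustration and are not part of the original work.

```python
def dot(u, v):
    """Inner product of two N-tuples of real numbers."""
    return sum(a * b for a, b in zip(u, v))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def least_squares(basis, f):
    """Return the coefficients c_j of Eq. (2.1) and the fitted vector."""
    gram = [[dot(fi, fj) for fj in basis] for fi in basis]  # M_jk = f_j . f_k
    rhs = [dot(f, fj) for fj in basis]                      # f . f_j
    coeffs = solve(gram, rhs)
    fitted = [sum(c * fj[n] for c, fj in zip(coeffs, basis))
              for n in range(len(f))]
    return coeffs, fitted


# Hypothetical data: N = 4 observations, a constant vector and one descriptor.
f0 = [1.0, 1.0, 1.0, 1.0]
f1 = [0.0, 1.0, 2.0, 3.0]
f = [1.0, 3.0, 2.0, 4.0]

coeffs, fitted = least_squares([f0, f1], f)
residual = [a - b for a, b in zip(f, fitted)]
```

For these data the residual is orthogonal to both basis vectors, and the product of `f` with the fitted vector equals the squared norm of the fitted vector, as stated above.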


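If we now define $\mathbf{u}_{\mathbf{w}} = \mathbf{u} - (\mathbf{u} \cdot \mathbf{w})\,\mathbf{w} / \|\mathbf{w}\|^2$ and $\mathbf{v}_{\mathbf{w}} = \mathbf{v} - (\mathbf{v} \cdot \mathbf{w})\,\mathbf{w} / \|\mathbf{w}\|^2$ for $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$, and take into account the Cauchy - Schwarz inequality [15], we conclude that the ratio

$$C(\mathbf{u}, \mathbf{v}, \mathbf{w}) = \frac{\mathbf{u}_{\mathbf{w}} \cdot \mathbf{v}_{\mathbf{w}}}{\|\mathbf{u}_{\mathbf{w}}\|\,\|\mathbf{v}_{\mathbf{w}}\|} \tag{2.4}$$

satisfies $-1 \le C(\mathbf{u}, \mathbf{v}, \mathbf{w}) \le 1$. In the application of these results to MLR the vectors are N-tuples of real numbers, such as $\mathbf{f} = (f(1), f(2), \ldots, f(N))$, and we choose the inner product to be:

$$\mathbf{f} \cdot \mathbf{g} = \sum_{j=1}^{N} f(j)\,g(j) \tag{2.5}$$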

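Typically the components of the vector $\mathbf{f}_0$ are $f_0(j) = 1$, $j = 1, 2, \ldots, N$, so that $\|\mathbf{f}_0\|^2 = N$ and

$$\frac{\mathbf{g} \cdot \mathbf{f}_0}{\|\mathbf{f}_0\|^2} = \frac{1}{N}\sum_{j=1}^{N} g(j) = \bar{g} \tag{2.6}$$

is the expectation value of $\mathbf{g} \in V$. It follows from the definition of Eq. (2.4) and from $\mathbf{f} \cdot \hat{\mathbf{f}} = \|\hat{\mathbf{f}}\|^2$ and $\mathbf{f} \cdot \mathbf{f}_0 = \hat{\mathbf{f}} \cdot \mathbf{f}_0 = b_0 \|\mathbf{f}_0\|^2$, with $b_0 = \mathbf{f} \cdot \mathbf{f}_0 / \|\mathbf{f}_0\|^2$, that:

$$C(\mathbf{f}, \hat{\mathbf{f}}, \mathbf{f}_0) = \frac{\|\hat{\mathbf{f}}\|^2 - b_0^2 \|\mathbf{f}_0\|^2}{\sqrt{\left(\|\mathbf{f}\|^2 - b_0^2 \|\mathbf{f}_0\|^2\right)\left(\|\hat{\mathbf{f}}\|^2 - b_0^2 \|\mathbf{f}_0\|^2\right)}} = \sqrt{\frac{\|\hat{\mathbf{f}}\|^2 - b_0^2 \|\mathbf{f}_0\|^2}{\|\mathbf{f}\|^2 - b_0^2 \|\mathbf{f}_0\|^2}} \tag{2.7}$$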

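Therefore, the coefficient of correlation R between $\mathbf{f}$ and $\hat{\mathbf{f}}$ is simply given by:

$$0 \le R = C(\mathbf{f}, \hat{\mathbf{f}}, \mathbf{f}_0) = \sqrt{\frac{\|\hat{\mathbf{f}}\|^2 - b_0^2 \|\mathbf{f}_0\|^2}{\|\mathbf{f}\|^2 - b_0^2 \|\mathbf{f}_0\|^2}} \le 1 \tag{2.8}$$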


The standard deviation S in terms of the present notation reads [36]:

$$S = \frac{D(\mathbf{f}, \hat{\mathbf{f}})}{\sqrt{N - d - 1}} \tag{2.9}$$

From now on we assume that the vector $\mathbf{f}$ represents a given physicochemical property or biological activity for a set of N molecules, and that the vectors $\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_d$ are predictor variables, or descriptors in the language of the QSPR - QSAR Theory. Accordingly, $\hat{\mathbf{f}}$ is the regression model. Both R and S depend upon the subspace spanned by the vectors $\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_d$, and measure the quality of the fit between $\mathbf{f}$ and $\hat{\mathbf{f}}$. The R parameter is based on a theoretical definition that is usually employed as a non-absolute criterion of linearity for the model. It is only based on random errors and does not take into account the precision of the experimental data. The value of R increases and approaches unity, and $D(\mathbf{f}, \hat{\mathbf{f}})$ decreases and approaches zero, as the number of linearly independent vectors $\mathbf{f}_j$ increases. When d + 1 = N we have an interpolation case with R = 1 and $D(\mathbf{f}, \hat{\mathbf{f}}) = 0$ because $S_B = V$. The S parameter quantifies the precision of the predicted data by taking into account the degrees of freedom of the mathematical problem: N - d - 1. It is to be noted that MLR makes sense only when the value of S is smaller than the experimental standard deviation ($S_{exp}$), which indicates the deviation of the experimental observations from their mean value. A QSPR - QSAR model is successful if it gives acceptable global statistical parameters (S, R, etc.) with a small number of descriptors. Finally, for calculating the standard error of the regression coefficients $c_j$ we employ the following formula [36]:

$$\delta c_j = S\sqrt{(\mathbf{M}^{-1})_{jj}} \tag{2.10}$$
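As an illustration of Eqs. (2.8) and (2.9), the short sketch below computes R and S from an observed vector and its least-squares fit. The numbers are hypothetical (a straight-line fit of four points) and the function name `fit_statistics` is an assumption made here for illustration; the formula for R is only valid when the fitted vector is the least-squares model that includes the constant vector f_0.

```python
import math


def dot(u, v):
    """Inner product of two N-tuples of real numbers."""
    return sum(a * b for a, b in zip(u, v))


def fit_statistics(f, fitted, d):
    """Return (R, S) following Eqs. (2.8) and (2.9).

    f      : observed property values, length N
    fitted : least-squares model values for the same N molecules
    d      : number of descriptors, not counting the constant vector f_0
    """
    n = len(f)
    b0 = sum(f) / n                 # b_0 = f . f_0 / ||f_0||^2, the mean of f
    offset = b0 * b0 * n            # b_0^2 ||f_0||^2
    r = math.sqrt((dot(fitted, fitted) - offset) / (dot(f, f) - offset))
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(f, fitted)))
    s = distance / math.sqrt(n - d - 1)
    return r, s


# Hypothetical example: f fitted by 1.3 + 0.8 x with x = 0, 1, 2, 3.
f = [1.0, 3.0, 2.0, 4.0]
fitted = [1.3, 2.1, 2.9, 3.7]
R, S = fit_statistics(f, fitted, d=1)
```

With these values the squared norm of the fit is 28.2, the offset is 25 and the squared norm of f is 30, so that R = 0.8, while the squared distance is 1.8 with two degrees of freedom.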

3. Outline of some popular orthogonalization techniques
As argued in the Introduction, from both the theoretical and the practical point of view it is convenient to use an orthogonal basis set for the subspace $S_B$. There are many ways of orthogonalizing a finite set of linearly independent vectors; here we compare two well-established methods: the Randic orthogonalization method and the Gram Schmidt procedure. The outcome of both methods depends on the order of the sequential process; in other words, different orthogonalization orders produce different sets of orthogonal vectors that span the same subspace. For the case of d nonorthogonal vectors, there exist d! orders of orthogonalization. In addition, it is very important to note that the values of the global statistical parameters (S, R, etc.) depend on the subspace spanned by the chosen vectors $\mathbf{f}_j$ and are independent of the order of construction of the orthogonal vectors. From a given set of linearly independent vectors $\{\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_d\}$ we want to obtain a set of orthogonal vectors $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_d\}$ by means of linear combinations

$$\mathbf{u}_i = \sum_{j=1}^{d} c_{ji}\,\mathbf{f}_j, \quad i = 1, 2, \ldots, d \tag{3.1}$$

that should satisfy $\mathbf{u}_i \cdot \mathbf{u}_j = \|\mathbf{u}_i\|^2 \delta_{ij}$. Therefore, if $\mathbf{c}$, $\boldsymbol{\Delta}$ and $\boldsymbol{\lambda}$ are the matrices with elements $c_{ij}$, $\Delta_{ij} = \mathbf{f}_i \cdot \mathbf{f}_j$ and $\lambda_{ij} = \mathbf{u}_i \cdot \mathbf{u}_j$, respectively, then we derive a general expression that the coefficients of the linear combinations of Eq. (3.1) should satisfy:

$$\mathbf{c}^{t} \boldsymbol{\Delta}\, \mathbf{c} = \boldsymbol{\lambda} \tag{3.2}$$

If, for example, $\mathbf{c}^{t} = \mathbf{c}^{-1}$ then Eq. (3.2) is the diagonalization of the positive-definite, symmetric matrix $\boldsymbol{\Delta}$. Alternatively, if $\mathbf{c} = \boldsymbol{\Delta}^{-1/2}$ we have Löwdin's orthogonalization procedure and $\boldsymbol{\lambda} = \mathbf{I}$ is the identity matrix (orthonormal vectors). In what follows we compare the most widely applied strategies.

3.1. Randic orthogonalization
This is called “the method of correlations and residuals”. Suppose we consider the order of orthogonalization to be $B = \{\mathbf{f}_1, \ldots, \mathbf{f}_d\}$; then the method proposed by Randic [23] is the following step-by-step process:

1. From a set of d linearly independent vectors $B = \{\mathbf{f}_1, \ldots, \mathbf{f}_d\}$ that span a subspace $S_B \subseteq V$, select a vector which will be considered the first orthogonal descriptor omega, ${}^{1}\Omega$. According to the order in B, ${}^{1}\Omega = \mathbf{f}_1$.

2. Make the remaining d - 1 vectors $\mathbf{f}_2, \mathbf{f}_3, \ldots, \mathbf{f}_d$ orthogonal to $\mathbf{f}_1$ by calculating the linear regression residuals of $\mathbf{f}_2, \mathbf{f}_3, \ldots, \mathbf{f}_d$ against $\mathbf{f}_1$, which will lead to a new set $\Omega_i^{1} = \mathbf{f}_i - \hat{\mathbf{f}}_i$, $i = 2, \ldots, d$, where $\hat{\mathbf{f}}_i$ is the regression model. This is called the first orthogonalization step.

3. According to the selected order in B, from the set $\Omega_i^{1} = \mathbf{f}_i - \hat{\mathbf{f}}_i$, $i = 2, \ldots, d$, we select the second orthogonal vector to be ${}^{2}\Omega = \Omega_2^{1}$.

4. Make the remaining d - 2 vectors $\Omega_i^{1}$, $i = 3, \ldots, d$, orthogonal to ${}^{2}\Omega$ by calculating the linear regression residuals of $\Omega_i^{1}$, $i = 3, \ldots, d$, against ${}^{2}\Omega$, which will lead to a new set $\Omega_i^{2,1} = \Omega_i^{1} - \hat{\mathbf{f}}_i$, $i = 3, \ldots, d$, where $\hat{\mathbf{f}}_i$ now denotes the regression model of $\Omega_i^{1}$ against ${}^{2}\Omega$. The notation $\Omega_i^{2,1}$ means that $\Omega_i^{1}$ has been made orthogonal to $\Omega_2^{1}$, which in turn is $\mathbf{f}_2$ already made orthogonal to $\mathbf{f}_1$. This is the second orthogonalization step.

5. Select the next orthogonal descriptor according to the order of orthogonalization and proceed as in the previous steps. The zth orthogonalization step consists of making d - z vectors orthogonal to a given one, leading to a new set of vectors. Finally, the orthogonal basis derived from the set $B = \{\mathbf{f}_1, \ldots, \mathbf{f}_d\}$ will be $O = \{{}^{1}\Omega, \ldots, {}^{d}\Omega\}$.

The Randic orthogonalization process is illustrated in Fig. 1.

Figure 1. Schematic illustration of the Randic orthogonalization process.
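The step-by-step process of section 3.1 can be sketched in a few lines of plain Python. This is only an illustrative reading of the procedure, under the assumption that each "linear regression residual" comes from a simple regression with an intercept; the function names and the data are hypothetical, and the descriptors are assumed to be non-constant and linearly independent.

```python
def dot(u, v):
    """Inner product of two N-tuples of real numbers."""
    return sum(a * b for a, b in zip(u, v))


def residual(y, x):
    """Residual of the simple linear regression of y on x (with intercept)."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    xc = [a - x_mean for a in x]
    yc = [b - y_mean for b in y]
    slope = dot(xc, yc) / dot(xc, xc)
    return [b - slope * a for a, b in zip(xc, yc)]


def randic_orthogonalize(descriptors):
    """Sequentially orthogonalize the descriptors in the given order."""
    remaining = [list(v) for v in descriptors]
    orthogonal = []
    while remaining:
        selected = remaining.pop(0)   # next descriptor in the chosen order
        orthogonal.append(selected)
        # One orthogonalization step: replace the rest by their residuals.
        remaining = [residual(v, selected) for v in remaining]
    return orthogonal


# Hypothetical descriptors for N = 5 molecules.
f1 = [1.0, 2.0, 3.0, 4.0, 5.0]
f2 = [1.0, 4.0, 9.0, 16.0, 25.0]
f3 = [1.0, 8.0, 27.0, 64.0, 125.0]

omega = randic_orthogonalize([f1, f2, f3])
```

The first vector is returned unchanged and the remaining ones are mutually orthogonal residuals; passing the descriptors in a different order generally yields a different orthogonal set.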

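3.2. Gram Schmidt orthogonalization
The Gram Schmidt method [15, 32, 33] consists of the following procedure. Starting from the basis $B = \{\mathbf{f}_0, \mathbf{f}_1, \ldots, \mathbf{f}_d\}$ we construct a new basis $\{\mathbf{u}_0, \mathbf{u}_1, \ldots, \mathbf{u}_d\}$ hierarchically and with a certain order of orthogonalization according to:

$$\mathbf{u}_0 = \mathbf{f}_0, \quad \mathbf{u}_j = \mathbf{f}_j - \sum_{k=0}^{j-1} \frac{\mathbf{u}_k \cdot \mathbf{f}_j}{\|\mathbf{u}_k\|^2}\,\mathbf{u}_k \tag{3.3}$$

The closest approximation to $\mathbf{f}$ in terms of the set of orthogonal vectors takes the simpler form:

$$\hat{\mathbf{f}} = \sum_{j=0}^{d} b_j \mathbf{u}_j, \quad b_j = \frac{\mathbf{f} \cdot \mathbf{u}_j}{\|\mathbf{u}_j\|^2} \tag{3.4}$$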

and it is seen that the orthogonal coefficients $b_j$ are independent of d.

4. Comparison of Randic and Gram Schmidt methods
We will try to go beyond a previous discussion on the subject [33] and show that some numerical results discussed previously by other authors can be proved rigorously and easily. By showing in detail all the equations used in the present calculation, together with some numerical examples, we hope to help the reader reproduce our results. For the case of orthogonal descriptors, it follows from Eq. (2.8) that

$$R^2 = \sum_{j=1}^{d} R_j^2, \quad R_j^2 = \frac{b_j^2 \|\mathbf{u}_j\|^2}{\|\mathbf{f}\|^2 - b_0^2 \|\mathbf{f}_0\|^2} \tag{4.1}$$

where $R_j$ is the coefficient of correlation between $\mathbf{f}$ and $\hat{\mathbf{f}}$ calculated with a single orthogonal descriptor. As indicated previously in section 3.1, the orthogonalization method proposed by Randic does not modify one of the d vectors, say $\mathbf{f}_1$, during orthogonalization. The Gram Schmidt coefficient $b_0$ can be compared to Randic's $b_0^{R}$ by taking into account that:

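$$\hat{\mathbf{f}} = b_0^{R}\,\mathbf{f}_0 + b_1^{R}\,\mathbf{f}_1 + \sum_{j=2}^{d} b_j \mathbf{u}_j, \quad b_0^{R} = b_0 - b_1 \frac{\mathbf{f}_1 \cdot \mathbf{u}_0}{\|\mathbf{u}_0\|^2} \tag{4.2}$$

Now, note that according to Eq. (3.3) we can write:

$$\mathbf{u}_j = \sum_{k=0}^{j} d_{kj}\,\mathbf{f}_k, \quad d_{jj} = 1 \tag{4.3}$$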

where the explicit expressions for the coefficients $d_{kj}$ are not relevant for the present discussion. If we substitute Eq. (4.3) into Eq. (3.4) and take into account that the least - squares solution is unique, we conclude that $c_k = \sum_{j=k}^{d} d_{kj} b_j$ (see Eq. (2.1)), from which it follows that

$$c_d = b_d \tag{4.4}$$

This result was suggested by numerical experiments conducted by Randic [23] and later discussed by this and other authors [23, 25, 27, 37], but it was not proved rigorously until several years later [35]. Equation (4.4) is a consequence of the particular triangular form of the Gram Schmidt linear combination of Eq. (3.3) and one does not expect it to apply to other orthogonalization procedures. According to the standard formula [36], the error of the orthogonal coefficients $b_j$ is:

$$\delta b_j = \frac{S}{\|\mathbf{u}_j\|} \tag{4.5}$$

It can be demonstrated [35] that in addition to Eq. (4.4) we have

$$\delta c_d = \delta b_d \tag{4.6}$$

We illustrate these results with a previous numerical investigation, where Hosoya's Z index is fitted by means of the connectivity and higher connectivity indices ${}^{j}\chi$ discussed by Randic [23]. In this case, we choose $\mathbf{f} = Z$ and $\mathbf{f}_j = {}^{j}\chi$, j = 1, 2, 3, 4. Consequently, the Gram Schmidt orthogonal descriptors $\mathbf{u}_j$ should be compared with Randic's ${}^{j}\Omega$. The best model, with lowest S, that results from the connectivity indices involves descriptors $\mathbf{f}_0, \mathbf{f}_1, \mathbf{f}_3, \mathbf{f}_4$. In Table 1 we show the coefficients of the nonorthogonal and orthogonal descriptors and their respective standard errors for the best model. The coefficient $b_0$ (the constant in the language of linear regression) should be modified according to Eq. (4.2) in order to have complete agreement. We see that the errors are smaller for the orthogonal descriptors, as argued earlier by Randic [26], and that $c_4 = b_4$ and $\delta c_4 = \delta b_4$ (Eqs. (4.4) and (4.6), respectively). The calculations demonstrate that the order of orthogonalization affects the coefficients $b_j$, their standard errors $\delta b_j$, and the absolute relative errors $|\delta b_j / b_j|$.


Table 1. Coefficients of the nonorthogonal and orthogonal descriptors for the model with lowest S.

descriptor   coefficient   standard error   descriptor   coefficient   standard error
f0           -44.248       1.337            u0           17.000        0.051
f1           18.935        0.474            u1           17.967        0.359
f3           0.781         0.204            u3           1.102         0.146
f4           -0.686        0.304            u4           -0.686        0.304
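The equality of the last nonorthogonal and orthogonal coefficients seen in the last row of Table 1 (Eq. (4.4)), and the independence of the orthogonal coefficients from d, can be reproduced with a small Gram Schmidt sketch. The data below are hypothetical and are not those of Table 1; the helper names are introduced here only for illustration.

```python
def dot(u, v):
    """Inner product of two N-tuples of real numbers."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(basis):
    """Orthogonalize [f_0, ..., f_d] in the given order, following Eq. (3.3)."""
    ortho = []
    for fj in basis:
        uj = list(fj)
        for uk in ortho:
            factor = dot(uk, fj) / dot(uk, uk)
            uj = [a - factor * b for a, b in zip(uj, uk)]
        ortho.append(uj)
    return ortho


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def nonorthogonal_coefficients(basis, f):
    """Coefficients c_j obtained from the normal equations, Eq. (2.3)."""
    gram = [[dot(fi, fj) for fj in basis] for fi in basis]
    return solve(gram, [dot(f, fj) for fj in basis])


def orthogonal_coefficients(basis, f):
    """Coefficients b_j = f . u_j / ||u_j||^2, Eq. (3.4)."""
    return [dot(f, u) / dot(u, u) for u in gram_schmidt(basis)]


# Hypothetical data for N = 5 molecules.
f0 = [1.0] * 5
f1 = [1.0, 2.0, 3.0, 4.0, 5.0]
f2 = [1.0, 4.0, 9.0, 16.0, 25.0]
f = [2.0, 3.0, 7.0, 8.0, 15.0]

c = nonorthogonal_coefficients([f0, f1, f2], f)
b = orthogonal_coefficients([f0, f1, f2], f)
b_short = orthogonal_coefficients([f0, f1], f)  # same order, one descriptor fewer
```

Here `c[-1]` and `b[-1]` coincide, while the first two entries of `b` are unchanged when the last descriptor is dropped; the remaining entries of `c` generally differ from those of `b`.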

What happens when we add a descriptor to, or remove one from, our model? It is possible to demonstrate [35] that

$$\left(S^{[d]}\right)^2 = \left(S^{[d+1]}\right)^2 + \frac{b_j^2 \|\mathbf{u}_j\|^2 - \left(S^{[d+1]}\right)^2}{N - d - 1} \tag{4.7}$$

which clearly shows the effect on S of adding or removing $\mathbf{u}_j$; here $S^{[d+1]}$ and $S^{[d]}$ denote the standard deviations of the models with and without $\mathbf{u}_j$, respectively. Equation (4.7) explains why Lucic et al. [28] could improve the model by removing carefully selected orthogonal descriptors. If $b_j^2 \|\mathbf{u}_j\|^2$ is sufficiently small, then removal of $\mathbf{u}_j$ may decrease the value of S while slightly reducing the value of R (refer to Eq. (4.1)). Therefore, taking into account that $b_j^2 \|\mathbf{u}_j\|^2 = (\mathbf{f} \cdot \mathbf{u}_j)^2 / \|\mathbf{u}_j\|^2$ plays such a relevant role in the values of R and S, we consider it to be a measure of the contribution or weight of the orthogonal descriptor $\mathbf{u}_j$ to the model. We illustrate these concepts by means of the modeling of the boiling points of 18 octanes with a set of connectivity indices, as done by Lucic et al. [28]. In this case we choose $\mathbf{f}_0$ as usual and $\mathbf{f}_{j+1} = {}^{j}\chi$, j = 0, 1, ..., 6. The best model found with five nonorthogonal descriptors that minimizes S is the following:

$$\begin{aligned} \mathbf{bp} = {} & (2583.416 \pm 450.924)\,\mathbf{f}_0 - (182.756 \pm 25.593)\,\mathbf{f}_1 - (310.498 \pm 67.625)\,\mathbf{f}_2 \\ & - (46.150 \pm 12.544)\,\mathbf{f}_3 + (8.304 \pm 1.872)\,\mathbf{f}_4 - (5.666 \pm 1.703)\,\mathbf{f}_6 \end{aligned} \tag{4.8}$$

with R = 0.993 and S = 0.887 °C. In Table 2 we show the weights of the orthogonal descriptors in this subset with smallest S, for four different orders of orthogonalization. The entries of this table suggest that it may be reasonable to remove the orthogonal vector $\mathbf{u}_6$, which has the smallest weight, $9.980 \times 10^{-5}$. If we do that we obtain four new models with four descriptors each that yield R = 0.993 and S = 0.855 °C. This value of S is smaller than the smallest one that we get by removal of nonorthogonal descriptors. This notable advantage of orthogonal predictor variables, which help us to remove insignificant descriptors, was first pointed out by Lucic et al. [28]. Therefore, the possibility of removing orthogonal descriptors from the model depends upon the selected order of orthogonalization.

Table 2. Weights of the descriptors for several orderings of the optimal model.

descriptor indices    descriptor weights
2 3 6 1 4             0.674   0.135   9.980 × 10^-5   0.152   0.024
3 2 6 1 4             0.777   0.032   9.980 × 10^-5   0.152   0.024
2 3 6 4 1             0.674   0.135   9.980 × 10^-5   0.113   0.063
3 2 6 4 1             0.777   0.032   9.980 × 10^-5   0.113   0.063
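The drop of S from 0.887 °C to 0.855 °C on removing the sixth orthogonal descriptor can be checked approximately against Eq. (4.7). The sketch below rests on two assumptions that are not stated in the text: the weights of Table 2 are read as the normalized contributions $R_j^2$ of Eq. (4.1) (each row sums to about $R^2$), and the total sum of squares is back-computed from the rounded values of R and S, so the result is only indicative. The function name is hypothetical.

```python
import math


def s_after_removal(s_full, n, n_desc_full, contribution):
    """Standard deviation after removing one orthogonal descriptor, Eq. (4.7).

    s_full       : S of the model with d + 1 descriptors
    n            : number of molecules N
    n_desc_full  : d + 1, descriptors in the full model (constant excluded)
    contribution : b_j^2 ||u_j||^2 of the removed orthogonal descriptor
    """
    d = n_desc_full - 1
    s2 = s_full ** 2 + (contribution - s_full ** 2) / (n - d - 1)
    return math.sqrt(s2)


# Values quoted in the text for the boiling points of the octanes.
N, R, S_FULL, N_DESC = 18, 0.993, 0.887, 5
WEIGHT_U6 = 9.980e-5          # Table 2 entry, read here as R_j^2 (assumption)

residual_ss = (N - N_DESC - 1) * S_FULL ** 2   # D^2 of the full model, Eq. (2.9)
total_ss = residual_ss / (1.0 - R ** 2)        # ||f||^2 - b_0^2 ||f_0||^2
contribution = WEIGHT_U6 * total_ss            # b_6^2 ||u_6||^2, Eq. (4.1)

s_reduced = s_after_removal(S_FULL, N, N_DESC, contribution)
```

Under these assumptions the contribution of the removed descriptor is much smaller than the squared standard deviation of the full model, so the second term of Eq. (4.7) is negative and S decreases, in line with the value reported in the text.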

5. Selection of an appropriate order of orthogonalization
Once we have the best set of d nonorthogonal descriptors, we look for the best order of orthogonalization out of the d! possible ones, according to a Dominant Component Analysis (DCA) [28-31]. This procedure consists of selecting the order of orthogonalization that leads to the highest value of the coefficient of correlation between $\mathbf{f}$ and each orthogonal descriptor individually ($R_j$, Eq. (4.1)). These descriptors are called dominant descriptors. Thus, one searches for the order of orthogonalization that generates the greatest contributions for the first, second, third, etc. dominant descriptors. A similar scheme was proposed by Randic [23]. Considering again the modeling of the boiling points of several octanes with a set of connectivity indices [28], Table 3 includes different orders of orthogonalization with descriptors ${}^{1}\chi, {}^{4}\chi, {}^{5}\chi$. As can be seen, DCA establishes the order given by Eq. (5.4) as the best one. The appropriate selection of the order of orthogonalization may lead to insignificant orthogonal descriptors in the model, having smaller $R_j$ than the rest of the variables present in it, thus making it possible to remove descriptors as discussed in the previous section (Eq. (4.7)). An order established with the DCA criterion maximizes the contribution of some descriptors of the model and necessarily minimizes the contributions of the remaining variables.

Table 3. Best model for descriptors with the orthogonal basis (R = 0.991, S = 0.922 °C, F = 2.449).

Orthogonalization order: ${}^{1}\chi, {}^{4}\chi, {}^{5}\chi$

$$\mathbf{bp} = 113.713 + 30.328\,\mathbf{u}_1 + 12.270\,\mathbf{u}_4 - 12.773\,\mathbf{u}_5; \quad R_1 = 0.821,\ R_4 = 0.404,\ R_5 = 0.379 \tag{5.1}$$

$$\mathbf{bp} = 113.713 + 30.328\,\mathbf{u}_1 - 13.129\,\mathbf{u}_4 + 8.029\,\mathbf{u}_5; \quad R_1 = 0.821,\ R_4 = 0.495,\ R_5 = 0.249 \tag{5.2}$$

$$\mathbf{bp} = 113.713 - 40.726\,\mathbf{u}_1 + 2.879\,\mathbf{u}_4 - 12.773\,\mathbf{u}_5; \quad R_1 = 0.905,\ R_4 = 0.138,\ R_5 = 0.379 \tag{5.3}$$

$$\mathbf{bp} = 113.713 + 53.416\,\mathbf{u}_1 + 2.879\,\mathbf{u}_4 - 6.369\,\mathbf{u}_5; \quad R_1 = 0.952,\ R_4 = 0.138,\ R_5 = 0.236 \tag{5.4}$$

$$\mathbf{bp} = 113.713 - 36.485\,\mathbf{u}_1 - 13.129\,\mathbf{u}_4 + 6.732\,\mathbf{u}_5; \quad R_1 = 0.820,\ R_4 = 0.495,\ R_5 = 0.251 \tag{5.5}$$

$$\mathbf{bp} = 113.713 + 53.416\,\mathbf{u}_1 - 2.262\,\mathbf{u}_4 + 6.731\,\mathbf{u}_5; \quad R_1 = 0.952,\ R_4 = 0.108,\ R_5 = 0.251 \tag{5.6}$$

6. Final remarks
Among the first data reduction techniques that attempt a description of physicochemical properties or biological activities through a regression on a small number of orthogonal descriptors (components) are Principal Component Analysis (PCA) and the Partial Least Squares technique [38-40]. Furthermore, various publications [41-44] employ the Randic orthogonalization procedure as a step prior to the standardization of the regression coefficients of a linear regression model built with a small number of descriptors.
This represents an alternative way that allows interpretation, assigning more importance to those descriptors exhibiting larger absolute standardized orthogonal coefficients and therefore leading to a ranking of the contributions of the descriptors to the analyzed property. Other studies apply the Gram Schmidt procedure in the recently proposed Spectral-Structure Activity Relationship (S-SAR) method [45-47]. The advantage of this technique is that it enables one to replace the Multivariable Linear Regression analysis by purely algebraic models with some conceptual and computational advantages. However, it has also been proved that they give the same final analytical equation and correlation results as MLR. In conclusion, we recommend the orthogonalization procedure as a technique for elucidating the structural factors, expressed by mathematical descriptors, that contribute most during the model design in the prediction of a given physicochemical property or biological activity. Furthermore, the proposal of effective QSPR - QSAR models requires the employment of orthogonal predictors to avoid structural redundancy in the developed models. Finally, the co-linearity of the molecular descriptors should be as low as possible, because the interrelatedness among the different descriptors can lead to highly unstable regression coefficients, which makes it impossible to know the relative importance of an index and leads one to underestimate the utility of the regression coefficients of the model.

Acknowledgements
The authors thank the Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) and the Universidad Nacional de La Plata for financial support.

References
1. Hansch, C. and Leo, A. 1995, Exploring QSAR. Fundamentals and Applications in Chemistry and Biology, American Chemical Society, Washington, D. C.
2. Katritzky, A.R., Lobanov, V.S. and Karelson, M. 1995, Chem. Soc. Rev., 24, 279.
3. Hansch, C. 1964, J. Am. Chem. Soc., 86, 1616.
4. Karelson, M. 2000, Molecular Descriptors in QSAR / QSPR, Wiley - Interscience, New York.
5. Todeschini, R. and Consonni, V. 2000, Handbook of Molecular Descriptors, Wiley VCH, Weinheim.
6. Akaike, H. 1973, Second International Symposium on Information Theory, B.N. Petrov, F. Csáki (Eds.), Akademiai Kiado, Budapest, 267.
7. Bader, R.F.W. 1990, Atoms in Molecules - A Quantum Theory, Clarendon Press, Oxford.
8. Trinajstic, N. 1992, Chemical Graph Theory, CRC Press, Boca Raton (FL).
9. Recon, Version 5.5. Rensselaer Polytechnic Institute, Troy, New York, USA.
10. CoDESSA, http://ufark12.chem.ufl.edu
11. Dragon, http://michem.disat.unimib.it/chm
12. Worth, A.P., Bassan, A., De Bruijn, J., Saliner, A.G., Netzeva, T., Patlewicz, G., Pavan, M., Tsakovska, I. and Eisenreich, S. 2007, SAR&QSAR Environ. Res., 18, 111.
13. Norinder, U. 2005, SAR&QSAR Environ. Res., 16, 1.
14. Martin, Y.C. 2005, J. Med. Chem., 48, 3164.
15. Apostol, T.M. 1969, Calculus, Blaisdell Publishing Co., Waltham, Massachusetts.
16. Malinowski, E.R. 1991, Factor Analysis in Chemistry, Wiley, New York.
17. Leardi, R. 1996, Genetic Algorithms in Molecular Modeling. Principles of QSAR and Drug Design, J. Devillers (Ed.), Academic Press, London, 67.
18. Duchowicz, P.R., Castro, E.A. and Fernández, F.M. 2006, MATCH Commun. Math. Comput. Chem., 55, 179.
19. Zupan, J. 1998, Encyclopedia of Computational Chemistry, Wiley, Chichester.
20. Vapnik, V. 1995, The nature of statistical learning theory, Springer - Verlag, New York.
21. Livingstone, D.J. and Manallack, D.T. 1993, J. Med. Chem., 36, 1295.
22. Tetko, I.V., Luik, A.I. and Poda, G.I. 1993, J. Med. Chem., 36, 811.
23. Randic, M. 1991, J. Chem. Inf. Model., 31, 311.
24. Randic, M. 1991, Croat. Chem. Acta, 64, 43.
25. Randic, M. 1993, J. Comput. Chem., 14, 363.
26. Randic, M. 1994, Int. J. Quantum Chem., 21S, 215.
27. Randic, M. 2001, J. Chem. Inf. Comput. Sci., 41, 602.
28. Lucic, B., Nikolic, S., Trinajstic, N. and Juretic, D. 1995, J. Chem. Inf. Comput. Sci., 35, 532.
29. Lucic, B. and Trinajstic, N. 1997, SAR&QSAR Environ. Res., 7, 45.
30. Lucic, B. and Trinajstic, N. 1999, J. Chem. Inf. Comput. Sci., 39, 121.
31. Lucic, B., Trinajstic, N., Sild, S., Karelson, M. and Katritzky, A.R. 1999, J. Chem. Inf. Comput. Sci., 39, 610.
32. Draper, N.R. and Smith, H. 1981, Applied Regression Analysis, John Wiley & Sons, New York.
33. Klein, D.J., Randic, M., Babic, D., Lucic, B., Nikolic, S. and Trinajstic, N. 1997, Int. J. Quantum Chem., 63, 215.
34. Katritzky, A.R., Lobanov, V. and Karelson, M. 1996, CODESSA Reference Manual, University of Florida, Gainesville (FL).
35. Fernandez, F.M., Duchowicz, P.R. and Castro, E.A. 2004, MATCH Commun. Math. Comput. Chem., 51, 39.
36. Hildebrand, F.B. 1956, Introduction to numerical analysis, McGraw - Hill Book Company, Inc., New York.
37. Soskic, M. 1996, J. Chem. Inf. Comput. Sci., 36, 829.
38. Cash, G.G. and Breen, J.J. 1992, Chemosphere, 24, 1607.
39. Sutter, J.M., Kalivas, J.H. and Lang, P.K. 1992, J. Chemometrics, 6, 217.
40. Wold, S., Sjöstrom, M. and Eriksson, L. 1998, Encyclopedia of Computational Chemistry, R. von Schleyer, N.L. Allinger, T. Clark, J. Gasteiger, P.A. Kollman, H.F. Schaefer III and P.R. Schreiner (Eds.), Wiley, Chichester, 2006.
41. Duchowicz, P.R., Vitale, M.G., Castro, E.A., Fernandez, M. and Caballero, J. 2007, Bioorg. Med. Chem., 15, 2680.
42. Duchowicz, P.R., Gonzalez, M.P., Helguera, A.M., Cordeiro, M.N.D.S. and Castro, E.A. 2007, Chemom. Intell. Lab. Syst., 88, 197.
43. Duchowicz, P.R., Garro, J.C.M. and Castro, E.A. 2008, Chemom. Intell. Lab. Syst., 91, 133.
44. Duchowicz, P.R. and Ocsachoque, M.A. 2009, QSAR&Combinat. Sci., 28, 281.
45. Lacrama, A.-M. 2007, Int. J. Mol. Sci., 8, 842.
46. Putz, M.V. and Lacrama, A.-M. 2007, Int. J. Mol. Sci., 8, 363.
47. Putz, M.V., Putz, A.-M., Lazea, M., Ienciu, L. and Chiriac, A. 2009, Int. J. Mol. Sci., 10, 1193.




8. Modeling some chemical reactions as spatio-temporal discrete dynamical systems

Juan Luis García Guirao

Departamento de Matemática Aplicada y Estadística. Universidad Politécnica de Cartagena, Hospital de Marina, 30203-Cartagena, Región de Murcia, Spain

Abstract. The aim of the present work is to state some topological dynamics results for a family of lattice dynamical systems (LDS) introduced by K. Kaneko in [Phys. Rev. Lett., 65, 1391-1394, 1990], which is related to the Belousov-Zhabotinskii chemical reaction. We prove that these LDS are chaotic in the sense of Li-Yorke and in the sense of Devaney, and have positive topological entropy, for zero coupling constant. Moreover, we present a definition of distributional chaos on a sequence (DCS) for LDS and we state two different sufficient conditions for having DCS. These results survey three different papers, two of them written jointly with M. Lampart.

Correspondence/Reprint request: Dr. Juan Luis García Guirao, Departamento de Matemática Aplicada y Estadística, Universidad Politécnica de Cartagena, Hospital de Marina, 30203-Cartagena, Región de Murcia, Spain. E-mail: [email protected]

1. Introduction

Classical discrete dynamical systems (DDS), i.e., pairs consisting of a space X (usually compact and metric) and a continuous self-map ψ on X, have been widely studied in the literature (see e.g. [BC] or [D]) because they are good examples of problems coming from the theory of topological dynamics and they model many phenomena from biology, physics, chemistry, engineering and the social sciences (see for example [Da], [KO], [Pu] or [Po]). In most cases, ψ in such models is a C∞, an analytic or a polynomial map.

Motivated by physical and chemical engineering applications, such as digital filtering, imaging and spatial vibrations of the elements which compose a given chemical product, a generalization of DDS has recently become an important subject of investigation: the so-called lattice dynamical systems (LDS), or 1d spatiotemporal discrete systems. The next section provides all the definitions. For the importance of this type of system, see for instance [ChF].

Deciding whether a system of this type has complicated dynamics by observing a single topological dynamics property is an open problem. The aim of this work is to characterize, using different notions of chaos and the concept of topological entropy, the dynamical complexity of a family of coupled lattice dynamical systems which contains the one stated by K. Kaneko in [K] (for more details see the references therein) and which is related to reactions of Belousov-Zhabotinskii type. Concretely, we prove that these LDS are chaotic in the sense of Li-Yorke and in the sense of Devaney, and have positive topological entropy, for zero coupling constant. Moreover, we present a definition of distributional chaos on a sequence (DCS) for LDS and we state two different sufficient conditions for having DCS. We also present some problems for future work related to physical and chemical applications.

2. Definitions and notation

Let us start by introducing two of the best-known notions of chaos for a discrete dynamical system generated by the iteration of a continuous self-map f defined on a compact metric space X with metric d.

Definition 1. A pair of points x, y ∈ X is called a Li-Yorke pair if

(1) $\limsup_{n \to \infty} d(f^n(x), f^n(y)) > 0$,

(2) $\liminf_{n \to \infty} d(f^n(x), f^n(y)) = 0$.

A set S ⊂ X is called an LY-scrambled set for f (Li-Yorke set) if #S ≥ 2 and every pair of different points in S is an LY-pair, where # denotes cardinality.


For continuous self-maps of the interval [0, 1], Li and Yorke [LY] suggested that a map should be called “chaotic” if it admits an uncountable scrambled set. This was subsequently accepted as a formal definition.

Definition 2. We say that a map f is Li-Yorke chaotic if it has an uncountable LY-scrambled set.

One may consider weaker variants of chaos in the sense of Li and Yorke based on the cardinality of scrambled sets (see for instance [GL1]).

On the other hand, a map f is:

1) transitive if for any pair of nonempty open sets $U, V \subset X$ there exists an $n \in \mathbb{N}$ such that $f^n(U) \cap V \neq \emptyset$;

2) locally eventually onto if for every nonempty open set $U \subset X$ there exists an $m \in \mathbb{N}$ such that $f^m(U) = X$. Since this property can be regarded as the topological analog of exactness defined in ergodic theory, it is often called topological exactness. We use the second name here.
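As an illustration of topological exactness, the following minimal sketch iterates a closed subinterval of [0, 1] under the tent map $f(x) = 1 - |2x - 1|$ until its image is the whole interval. The choice of the tent map at this point, the use of closed intervals with rational endpoints, and the function names `image_of_interval` and `steps_until_onto` are assumptions made for the example only. Since every nonempty open set contains such a closed interval, reaching [0, 1] for the closed interval implies the same for the open set.

```python
from fractions import Fraction


def tent(x):
    """Tent map on [0, 1]: slope 2 up to 1/2, slope -2 afterwards."""
    return 2 * x if x <= Fraction(1, 2) else 2 - 2 * x


def image_of_interval(a, b):
    """Image of the closed interval [a, b] (a <= b) under the tent map."""
    # The map is monotone on each side of 1/2, so the minimum is at an endpoint.
    lo = min(tent(a), tent(b))
    # If the interval contains the critical point 1/2, the maximum value 1 is attained.
    hi = Fraction(1) if a <= Fraction(1, 2) <= b else max(tent(a), tent(b))
    return lo, hi


def steps_until_onto(a, b, max_steps=200):
    """Smallest m with f^m([a, b]) = [0, 1], or None if not reached in max_steps."""
    lo, hi = Fraction(a), Fraction(b)
    for m in range(max_steps + 1):
        if lo == 0 and hi == 1:
            return m
        lo, hi = image_of_interval(lo, hi)
    return None
```

For example, `steps_until_onto(Fraction(1, 8), Fraction(3, 16))` returns 4: the images are [1/4, 3/8], [1/2, 3/4], [1/2, 1] and finally [0, 1]. Exact rational arithmetic is used because floating-point iteration of a slope-2 map loses one bit of precision per step.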

Recall that a periodic point of period n of f is a point x such that $f^n(x) = x$ and $f^j(x) \neq x$ for $0 < j < n$.

Definition 3. A map f is called Devaney chaotic if it satisfies the following two properties: (1) f is transitive, (2) the set of periodic points of f is dense in X.

The original definition given by Devaney [D] contained an additional condition on f, which reflects the unpredictability of chaotic systems: sensitive dependence on initial conditions. However, it was proved (see, e.g., [Ba]) that sensitivity is a consequence of transitivity and dense periodicity under the assumption that X is an infinite set.

Let us recall the notion of positive topological entropy, which is also known as topological chaos. An attempt to measure the complexity of a dynamical system is based on a computation of how many points are necessary in order to approximate (in some sense) with their orbits all possible orbits of the system. A formalization of this intuition leads to the notion of topological entropy of the map f, which is due to Adler, Konheim and McAndrew [AKM]. We recall here the equivalent definition formulated by Bowen [B], and independently by Dinaburg [Di]: the topological entropy of a map f is a number $h(f) \in [0, \infty]$ defined by

$$h(f) = \lim_{\varepsilon \to 0} \limsup_{n \to \infty} \frac{1}{n} \log \# E(n, f, \varepsilon),$$

where $E(n, f, \varepsilon)$ is an $(n, f, \varepsilon)$-span with the minimal possible number of points, i.e., a set such that for any $x \in X$ there is $y \in E(n, f, \varepsilon)$ satisfying $d(f^j(x), f^j(y)) < \varepsilon$ for 1 ≤ j ≤ n. A map f is topologically chaotic (briefly, PTE) if its topological entropy h(f) is positive.

Lattice dynamical systems. The state space of an LDS (lattice dynamical system) is the set

$$\mathcal{X} = \{ x : x = \{x_i\},\ x_i \in \mathbb{R}^d,\ i \in \mathbb{Z}^D,\ \|x_i\| < \infty \},$$

where d ≥ 1 is the dimension of the range space of the map of state $x_i$, D ≥ 1 is the dimension of the lattice, and the $l^2$ norm $\|x\|_2 = (\sum_{i \in \mathbb{Z}^D} |x_i|^2)^{1/2}$ is usually taken ($|x_i|$ is the length of the vector $x_i$). We deal with the following family of LDS, which contains the system stated by K. Kaneko in [K] (for more details see the references therein) and which is related to the Belousov-Zhabotinskii reaction (see [KO], and [HGS], [HOY], [HHM] for experimental studies of chemical turbulence by this method):

, (1)

where m is the discrete time index, n is the lattice site index with system size L (i.e. n = 1, 2, . . . , L), ε is the coupling constant and f(x) is a unimodal map on the closed unit interval I = [0, 1], i.e. f(0) = f(1) = 0 and f has a unique critical point c with 0 < c < 1 such that f(c) = 1. For simplicity we will deal with the so-called “tent map”, defined by

$$f(x) = \begin{cases} 2x, & x \in [0, 1/2], \\ 2 - 2x, & x \in (1/2, 1]. \end{cases} \tag{2}$$

In general, one of the following periodic boundary conditions of the system (1) is assumed: (1) , (2) , (3) ,

As a rule, the first of these boundary conditions is used.
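To make the roles of the time index m, the site index n and the coupling constant ε concrete, here is a minimal sketch of one time step of a coupled map lattice. The displayed form of (1) is not legible in this text, so the update rule used below, the diffusive coupling $x_n^{m+1} = (1-\varepsilon) f(x_n^m) + \tfrac{\varepsilon}{2}\,[f(x_{n-1}^m) + f(x_{n+1}^m)]$ commonly associated with Kaneko's lattices, is an assumption, as is the spatially periodic boundary $x_{n+L}^m = x_n^m$ and the function name `lattice_step`. For ε = 0 the coupling term vanishes whatever its exact form, which is the case treated in the results of this chapter.

```python
def tent(x):
    """Tent map on [0, 1]."""
    return 2 * x if x <= 0.5 else 2 - 2 * x


def lattice_step(state, eps):
    """One time step m -> m+1 of the ASSUMED diffusive coupling, with x_{n+L} = x_n.

    state: list of L site values in [0, 1]; eps: coupling constant in [0, 1].
    """
    L = len(state)
    fx = [tent(x) for x in state]
    return [
        # Local dynamics weighted by (1 - eps), plus the average of the two neighbours.
        (1 - eps) * fx[n] + 0.5 * eps * (fx[(n - 1) % L] + fx[(n + 1) % L])
        for n in range(L)
    ]
```

With `eps = 0.0` each site simply follows the tent map, so `lattice_step([0.25, 0.5, 0.75], 0.0)` gives `[0.5, 1.0, 0.5]`. Floating-point values are adequate for a few steps only; long runs of a slope-2 map need exact or higher-precision arithmetic.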


Equation (1) has been studied by many authors, more often experimentally or semi-analytically than analytically. The first paper with analytic results is [ChL], where it was proved that this system is Li-Yorke chaotic; we give an alternative and easier proof of this fact in this chapter. As an example, we consider the 2-element one-way coupled logistic lattice (see [KW]) $H : I^2 \to I^2$ written as

,

(3)

where f is the tent map.

3. Li-Yorke, Devaney and topological chaos

The following two results will be used in the proofs of the main results. The proof of the first one is obvious (or see e.g. [DK]).

Lemma 4. Let f : X → X and g : Y → Y be maps with dense sets of periodic points. Then the Cartesian product f × g : X × Y → X × Y also has a dense set of periodic points.

Proposition 5 ([BC]). Let f be the tent map defined by (2). Put

$$I_{k,l} = \left[ \frac{l-1}{2^k}, \frac{l}{2^k} \right],$$

where $l = 1, 2, 3, \ldots, 2^k$ and $k \in \mathbb{N}$. Then the restriction of $f^k$ to $I_{k,l}$ is a linear homeomorphism onto [0, 1].

Let us note that the Cartesian product of two topologically transitive maps is not necessarily topologically transitive (see e.g. [DK]). Hence, for the proof of Theorem 7 we need to prove:

Lemma 6. The system (1) is topologically exact for ε = 0.

Proof. Let U be a nonempty open subset of $I^L$. Then U contains a product $U_1 \times \cdots \times U_L$, where each $U_m$ is a nonempty open connected subset of I, m = 1, 2, . . . , L. Since each $U_m$ contains an interval $I_{k_m,l}$ for $k_m$ large enough, Proposition 5 gives $f^{k_m}(U_m) = I$. If we put $K = \max\{k_m : m = 1, 2, \ldots, L\}$, then the K-th iterate of U under the system (1) equals $I^L$.

Theorem 7. The system (1) is chaotic in the sense of Devaney for ε = 0.

Proof. The assertion follows from Lemma 4 and Lemma 6.

The following proposition is a very powerful tool of symbolic dynamics¹ for observing nearly all dynamical properties.

Proposition 8. There is a subsystem of (1) which is conjugated² to $(\Sigma_2, \sigma_2)$.

Proof. Since the critical point of the tent map is equal to 1/2, we can divide the interval I into two sets P1 = [0, 1/3) and P2 = (2/3, 1] and get a family

. Then each point $x_0 \in \Lambda_1$ can be represented as an infinite symbol sequence $C_1(x_0) = \alpha = a_1 a_2 a_3 \ldots$, where $\Lambda_1$ is the Cantor ternary set and

Returning to (3) we can divide its range set into four sets

(see the figure below), where the upper index corresponds to the $x_1$ coordinate and the lower index to the $x_2$ coordinate. Then again each point $p \in \Lambda_2$ can be encoded as an infinite symbol sequence $C_2(p) = \alpha = a_1 a_2 a_3 \ldots$, where $\Lambda_2$ is the 2-dimensional Cantor ternary set³ and

__________________________
¹ Here, σ2 is the shift operator on the space Σ2 of all sequences over two symbols.
² We say that two dynamical systems (X, f) and (Y, g) are topologically conjugated if there is a homeomorphism h : X → Y such that h ∘ f = g ∘ h; such a homeomorphism is called a conjugacy.
³ By an n-dimensional Cantor set we mean the Cantor set constructed as a subset of .


Now, we denote by $\sigma_k$ the k-shift operator on the alphabet of k symbols, defined by $\sigma_k : \Sigma_k \to \Sigma_k$ and $\sigma_k(a_1 a_2 a_3 \ldots) = a_2 a_3 \ldots$, where $\Sigma_k = \{\alpha : \alpha = a_1 a_2 a_3 \ldots,\ a_i \in \{1, 2, \ldots, k\}\}$; the effect of this operator is to delete the first symbol of the sequence α. We can observe that $\Lambda_2$ is an invariant⁴ subset of the range space of the system (3) and that each of its points is encoded by exactly one point of $\Sigma_4$, for ε = 0. So, by [F], the shift operator $\sigma_4$ acts on $\Sigma_4$ exactly as (3) acts on $\Lambda_2$, for ε = 0.

Theorem 9. The system (1) is chaotic in the sense of Li-Yorke for ε = 0.

Proof. By Proposition 8 the system (1) has a subsystem conjugated to $(\Sigma_2, \sigma_2)$, which is Li-Yorke chaotic (see e.g. [BGKM]).

Proposition 10 ([W]). If (X, f) and (Y, g) are topologically conjugated systems, then h(f) = h(g).

For the proof of the result concerning topological entropy we use the following well-known result:

Proposition 11 ([W]). Let $\sigma_k$ be the k-shift operator. Then $h(\sigma_k) = \log k$.

Theorem 12. The system (1) has positive topological entropy for ε = 0. Moreover, its entropy equals L log 2.

Proof. By the construction in the proof of Proposition 8, the 2-dimensional system (1) contains a 2-dimensional Cantor set which is conjugated (see e.g. [F]) to the shift space $\Sigma_4$ by the conjugacy map $C_2$, for ε = 0. Then by Proposition 10 the system has topological entropy equal to the entropy of $\sigma_4$. Consequently, by Proposition 11 its entropy is log 4 = 2 log 2.
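The value 2 log 2 can be cross-checked numerically by a different, standard route for piecewise monotone interval maps: counting the monotone pieces (laps) of $f^k$ and taking the exponential growth rate of that count. This is a swapped-in technique, not the span-based definition or the symbolic conjugacy used in the text. The sketch below assumes the dyadic intervals $I_{k,l} = [(l-1)/2^k, l/2^k]$ as the candidate laps; the function names `count_full_laps` and `growth_rate` are hypothetical.

```python
from fractions import Fraction
from math import log


def tent(x):
    """Tent map on [0, 1], evaluated exactly on rationals."""
    return 2 * x if x <= Fraction(1, 2) else 2 - 2 * x


def iterate(x, k):
    """k-th iterate f^k(x)."""
    for _ in range(k):
        x = tent(x)
    return x


def count_full_laps(k):
    """Count dyadic intervals [(l-1)/2^k, l/2^k] whose endpoints f^k sends to {0, 1}
    and whose midpoint f^k sends to 1/2 (consistent with a linear map onto [0, 1])."""
    laps = 0
    for l in range(1, 2 ** k + 1):
        left, right = Fraction(l - 1, 2 ** k), Fraction(l, 2 ** k)
        mid = (left + right) / 2
        ends = {iterate(left, k), iterate(right, k)}
        if ends == {0, 1} and iterate(mid, k) == Fraction(1, 2):
            laps += 1
    return laps


def growth_rate(k, sites=1):
    """(1/k) * log of the number of product boxes for `sites` uncoupled copies."""
    return log(count_full_laps(k) ** sites) / k
```

Under these assumptions `count_full_laps(k)` equals $2^k$, so `growth_rate(k)` is log 2 for a single site and `growth_rate(k, sites=2)` is 2 log 2 for two uncoupled sites, in agreement with the entropy stated above for ε = 0.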

__________________________________________________

⁴ A set M is invariant for the map f if f(M) ⊂ M.


To finish the proof, it suffices to note that the construction above can be generalized to L-dimensional systems. Such a system is conjugated to the $2^L$-shift by the conjugacy $C_L$ and, by the same arguments as in the paragraph above, its entropy equals $\log 2^L = L \log 2$.

Remark 13. There are many other notions of chaos, such as distributional chaos, ω-chaos or the specification property. For zero coupling constant, the system (1) exhibits all of these chaotic behaviors, by the same arguments as in the proof of Theorem 9. But obviously this system is not minimal, where minimal means that there is no proper subset which is invariant, nonempty and closed. The proof of Theorem 12 can be done in an alternative way. For zero coupling constant it is obvious that each lattice site contains a subsystem conjugated to $(\Sigma_2, \sigma_2)$. Then the system (1) contains a subsystem conjugated to the L-fold product of $(\Sigma_2, \sigma_2)$, and the assertion follows (see e.g. [W]).

For non-zero coupling constants the dynamical behavior of the system (1) is more complicated. The first question is what the invariant subsets of the phase space look like. The second is what the properties of the ω-limit sets (i.e., the sets of limit points of the trajectories) are. The answers to these questions will be nontrivial. A similar system was studied in [BGLL], where the method of resultants was used to prove the existence of periodic points of higher order. The same approach as in [BGLL] should be applicable.

4. Distributional chaos on a sequence for LDS

The aim of this section is to characterize the dynamical complexity of the family (1) of coupled lattice systems by introducing the notion of distributional chaos on a sequence (DCS) for LDS. We present two different sufficient conditions for having DCS for this family of LDS. These results complete and generalize the results surveyed in the previous sections from [GL1, GL2], where topological entropy and Li-Yorke chaos are respectively studied. The main results in this direction are the following (see [G]):

Theorem 14. Let f be a continuous self-map defined on a compact interval [a, b]. If f is Li-Yorke chaotic, then the LDS defined by f in the form (1) is distributionally chaotic with respect to a sequence, considering $[a, b]^\infty$ endowed with the metrics $\rho_i$, i = 1, 2, respectively.

Theorem 15. Let f be a continuous self-map defined on a compact interval [a, b]. If f has positive topological entropy, then the LDS defined by f in the form (1) has an uncountable distributionally scrambled set, composed of almost periodic points, with respect to a sequence, considering $[a, b]^\infty$ endowed with the metrics $\rho_i$, i = 1, 2, respectively.

4.1. From LDS to classical DDS

Consider the set of sequences of real numbers

In $\mathbb{R}^\infty$ we consider the following two non-equivalent metrics:

(4)

and

(5)

Note that $(\mathbb{R}^\infty, \rho_i)$, i = 1, 2, is a complete metric space. We consider $[a, b]^\infty$, the subset of $\mathbb{R}^\infty$ composed of sequences with terms in the compact interval [a, b], endowed with the restriction of $\rho_i$. Let and be a continuous self-map. Let x = be a solution of the LDS system (1) with initial condition where for all . Define for all , and consider the self-map $F_f$ defined on $[a, b]^\infty$ in the form

(6)

where and .

Remark 16. From the previous construction, for a given self-map f defined on a compact interval [a, b], the LDS system (1) associated with f is equivalent to the classical dynamical system $([a, b]^\infty, F_f)$, where $F_f$ is defined in (6).

Let us recall the definition of distributional chaos with respect to a sequence in the setting of discrete dynamical systems. Let be an increasing sequence of positive integers, let x, y ∈ [a, b] and . Let

where #(A) denotes the cardinality of a set A. Using this notation, distributional chaos with respect to a sequence is defined as follows:

Definition 17. A pair of points (x, y) ∈ [a, b]^2 is called distributionally chaotic with respect to a sequence for some s > 0 and for all t > 0. A set S containing at least two points is called distributionally scrambled with respect to if any pair of distinct points of S is distributionally chaotic with respect to . A map f is distributionally chaotic with respect to , if it has an uncountable set distributionally scrambled with respect to .

Definition 18. A point x is called almost periodic for f if for any ε > 0 there exists N > 0 such that for any q ≥ 0 there exists r, q < r ≤ q + N, such that

$|f^r(x) - x| < \varepsilon$. By AP(f) we denote the set of all almost periodic points of f.

The following results from Oprocha [1] and Liao et al. [L] will play a key role in the proofs of Theorems 14 and 15.

Lemma 19. Let f be a continuous self-map on [a, b]. The map f is Li-Yorke chaotic iff there exists an increasing sequence

such that f is distributionally chaotic with respect to .

Lemma 20. Let f be a continuous self-map on [a, b]. If f has positive topological entropy, then there exists an increasing sequence such that f has an uncountable distributionally scrambled set T with respect to

. Moreover, the set T is composed of almost periodic points.
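The counting that underlies distributional chaos with respect to a sequence can be sketched as follows. The displayed formula is not legible in this text, so the quantity computed below, the fraction $\frac{1}{n}\,\#\{\, i \le n : |f^{p_i}(x) - f^{p_i}(y)| < t \,\}$ along a finite list of times $p_1 < \dots < p_n$, is the form commonly used in the literature and is an assumption here, as are the tent map as the example f and the name `dist_fraction`. A finite n only hints at the lower and upper limits that the definition requires.

```python
from fractions import Fraction


def tent(x):
    """Tent map on [0, 1], evaluated exactly on rationals."""
    return 2 * x if x <= Fraction(1, 2) else 2 - 2 * x


def iterate(x, p):
    """p-th iterate f^p(x)."""
    for _ in range(p):
        x = tent(x)
    return x


def dist_fraction(x, y, times, t):
    """Fraction of the times p in `times` at which the two orbits are closer than t."""
    hits = sum(1 for p in times if abs(iterate(x, p) - iterate(y, p)) < t)
    return Fraction(hits, len(times))
```

For x = 2/7 (a point of period 3 with orbit 2/7, 4/7, 6/7) and the fixed point y = 0, with t = 1/2, the times [1, 2, 3] give 1/3, the times [3, 6, 9] give 1, and the times [1, 4, 7] give 0. The example only shows how strongly the count depends on the chosen sequence; rational points are eventually periodic, so this particular pair is not distributionally chaotic.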


For details on the definition of topological entropy see [W]. Note that the definition of distributional chaos in a sequence for a continuous self-map f defined on an interval [a, b] is equivalent to the existence of an uncountable subset S ⊂ [a, b] such that for any x, y ∈ S, x ≠ y, • there exists δ > 0 such that

• for every t> 0,

where if x ∈ A and otherwise.

Proof of Theorem 14. Since the map f is Li-Yorke chaotic, by Lemma 19 there exists an increasing sequence such that f is distributionally chaotic with respect to . Let S ⊂ [a, b] be the uncountable set distributionally scrambled with respect to for f. Let E ⊂ [a, b]∞ be the uncountable set each of whose elements is a constant sequence equal to an element of S. Let and be two different elements of E. Then, there exists δ > 0 such that

and for every t > 0 is


In a similar way for the distance ρ2 we have that there exists δ* > 0 such that

and for every t > 0 is held

Thus, F is distributionally chaotic with respect to , using in [a, b]∞ the metrics ρ1 and ρ2 respectively, which ends the proof.

Proof of Theorem 15. Since f has positive topological entropy, by Lemma 20 there exists an increasing sequence such that f is distributionally chaotic with respect to . Let S ⊂ [a, b] be the uncountable set, composed of almost periodic points, that is distributionally scrambled with respect to for f. Let E ⊂ [a, b]∞ be the uncountable set each of whose elements is a constant sequence equal to an element of S. The proof of Theorem 14 shows that E is an uncountable distributionally scrambled set for F with respect to . Now we shall see that E is composed of almost periodic points of F for the metrics ρ1 and ρ2 respectively. Indeed, let where x ∈ AP(f). Then, for any ε > 0 there exists N > 0 such that for any q ≥ 0 there exists r, q < r ≤ q + N, holding . In this setting,


and

proving that E ⊂ AP(F), which ends the proof.

Acknowledgement

This work has been partially supported by MCYT/FEDER grant number MTM2008-03679/MTM, Fundación Séneca de la Región de Murcia, grant number 08667/PI/08 and JCCM (Junta de Comunidades de Castilla-La Mancha), grant number PEII09-0220-0222.

References

[AKM] R.L. Adler, A.G. Konheim and M.H. McAndrew, Topological entropy, Trans. Amer. Math. Soc., 114, 309-319, 1965.
[BGLL] F. Balibrea, J.L. García Guirao, M. Lampart and J. Llibre, Dynamics of a Lotka-Volterra map, Fund. Math., 191(3), 265-279, 2006.
[Ba] J. Banks, J. Brooks, G. Cairns, G. Davis and P. Stacey, On Devaney's definition of chaos, Amer. Math. Monthly, 4, 332-334, 1992.
[BC] L.S. Block and W.A. Coppel, Dynamics in One Dimension, Springer Monographs in Mathematics, Springer-Verlag, 1992.
[BGKM] F. Blanchard, E. Glasner, S. Kolyada and A. Maass, On Li-Yorke pairs, Journal für die reine und angewandte Mathematik (Crelle's Journal), 547, 51-68, 2002.
[BCP] E. Bollt, N.J. Corron and S.D. Pethel, Symbolic dynamics of coupled map lattice, Phys. Rev. Lett., 96, 1-4, 2006.
[B] R. Bowen, Entropy for group endomorphisms and homogeneous spaces, Trans. Amer. Math. Soc., 153, 401-414, 1971.
[ChF] J.R. Chazottes and B. Fernández, Dynamics of Coupled Map Lattices and of Related Spatially Extended Systems, Lecture Notes in Physics, 671, 2005.
[ChL] G. Chen and S.T. Liu, On spatial periodic orbits and spatial chaos, Int. J. of Bifur. Chaos, 13, 935-941, 2003.
[Da] R.A. Dana and L. Montrucchio, Dynamical Complexity in Duopoly Games, J. Econom. Theory, 40, 40-56, 1986.
[D] R.L. Devaney, An Introduction to Chaotic Dynamical Systems, Benjamin/Cummings, Menlo Park, CA, 1986.
[DK] N. Degirmenci and S. Kocak, Chaos in product maps, Turkish J. Math., to appear.
[Di] E.I. Dinaburg, A connection between various entropy characterizations of dynamical systems, Izv. Akad. Nauk SSSR Ser. Mat., 35, 324-366, 1971.
[F] H. Furstenberg, Recurrence in Ergodic Theory and Combinatorial Number Theory, Princeton University Press, Princeton, New Jersey, 1981.
[GL1] J.L. García Guirao and M. Lampart, Positive entropy of a coupled lattice system related with Belusov-Zhabotinskii reaction, Journal of Math. Chem., DOI: 10.1007/s10910-009-9624-3.
[GL2] J.L. García Guirao and M. Lampart, Chaos of a coupled lattice system related with the Belusov-Zhabotinskii reaction, Journal of Math. Chem., DOI: 10.1007/s10910-009-9647-9.
[G] J.L. García Guirao, Distributional chaos of generalized Belusov-Zhabotinskii's reaction models, MATCH Commun. Math. Comput. Chem., 64(2), 335-344, 2010.
[HHM] J.L. Hudson, M. Hart and D. Marinko, An experimental study of multiplex peak periodic and nonperiodic oscillations in the Belusov-Zhabotinskii reaction, J. Chem. Phys., 71, 1601-1606, 1979.
[HOY] K. Hirakawa, Y. Oono and H. Yamazaki, Experimental study on chemical turbulence. II, Jour. Phys. Soc. Jap., 46, 721-728, 1979.
[HGS] J.L. Hudson, K.R. Graziani and R.A. Schmitz, Experimental evidence of chaotic states in the Belusov-Zhabotinskii reaction, J. Chem. Phys., 67, 3040-3044, 1977.
[K] K. Kaneko, Globally Coupled Chaos Violates Law of Large Numbers, Phys. Rev. Lett., 65, 1391-1394, 1990.
[KO] M. Kohmoto and Y. Oono, Discrete model of Chemical Turbulence, Phys. Rev. Lett., 55, 2927-2931, 1985.
[KW] K. Kaneko and H.F. Willeboordse, Bifurcations and spatial chaos in an open flow model, Phys. Rev. Lett., 73, 533-536, 1994.
[LY] T.Y. Li and J.A. Yorke, Period three implies chaos, Amer. Math. Monthly, 82(10), 985-992, 1975.
[L] G.F. Liao and G.F. Huang, Almost periodicity and SS scrambled set, Chinese Annals of Mathematics, 23, 685-692, 2002.
[1] P. Oprocha, A note on distributional chaos with respect to a sequence, Nonlinear Analysis, 71, 5835-5839, 2009.
[Po] B. Van der Pol, Forced oscillations in a circuit with nonlinear resistance, London, Edinburgh and Dublin Phil. Mag., 3, 109-123, 1927.
[Pu] T. Puu, Chaos in Duopoly Pricing, Chaos, Solitons and Fractals, 1, 573-581, 1991.
[W] P. Walters, An Introduction to Ergodic Theory, Springer, New York, 1982.
