Chemometrics in QSAR


4.05 Chemometrics in QSAR

R. Todeschini and V. Consonni, University of Milano–Bicocca, Milan, Italy

P. Gramatica, Insubria University, Varese, Italy

© 2009 Elsevier B.V. All rights reserved.

4.05.1 Introduction

4.05.2 Short History of QSAR and Molecular Descriptors

4.05.3 Chemometrics and QSAR Modeling

4.05.4 Specific QSAR Approaches

4.05.4.1 Hansch Approach

4.05.4.2 Free–Wilson Approach

4.05.4.3 LSER Approach

4.05.4.4 Group Contribution Methods

4.05.4.5 Cluster Significance Analysis

4.05.4.6 Read-Across Approach

4.05.5 Molecular Descriptors

4.05.5.1 Molecular Structure Representations

4.05.5.2 0D Descriptors or Count Descriptors

4.05.5.3 1D Descriptors or Fingerprints

4.05.5.4 2D Descriptors or Topological Descriptors

4.05.5.5 3D Descriptors or Geometrical Descriptors

4.05.5.6 4D Descriptors or Grid-Based Descriptors

4.05.6 Molecular Descriptor Selection

4.05.6.1 Variable Reduction

4.05.6.2 Variable Subset Selection

4.05.6.3 Consensus Modeling

4.05.7 Principles for QSAR Modeling

4.05.7.1 Unambiguous Model Algorithm

4.05.7.2 Applicability Domain

4.05.7.3 Validation

4.05.7.4 Model Descriptor Interpretability

4.05.7.5 Summaries of QSAR Models

4.05.8 Conclusions

References

4.05.1 Introduction

The discovery of relationships among different concepts, in particular concepts provided by different scientific fields, represents the most important way to develop new scientific knowledge and transform isolated information into a deeper theoretical knowledge.

The concepts of molecular structure, its representation by theoretical molecular descriptors, and its relationship with experimental properties of molecules form an interdisciplinary network in which many theories, bodies of knowledge, and methodologies and their interrelationships are present, leading to a new scientific research field with a relevant follow-up in several practical applications.


Molecular descriptors are numerical indices encoding some information related to the molecular structure. They can be both experimental physicochemical properties of molecules and theoretical indices calculated by mathematical formulas or computational algorithms.

Molecular descriptors, tightly connected to the molecular structure, play a fundamental role in scientific research, being the theoretical core of a complex network of knowledge, as shown in Figure 1. Indeed, molecular descriptors are based on several different theories, such as quantum chemistry, information theory, organic chemistry, and graph theory, and are used to model several different properties of chemicals in scientific fields such as toxicology, analytical chemistry, physical chemistry, and medicinal, pharmaceutical, and environmental chemistry.

Moreover, to obtain reliable estimates of molecular properties and to support data elucidation and data mining, molecular descriptors are processed by several methods provided by statistics, chemometrics, and chemoinformatics. In particular, chemometrics has for about 30 years been developing classification and regression methods able to provide reliable models, although not always, both for reproducing the known experimental data and for predicting unknown data. The modeling process usually has not only explanatory purposes but also predictive purposes. Interest in predictive models able to give reliable estimates has grown considerably in the last few years, as they are increasingly considered useful and safe tools for predicting data on chemicals.

Quantitative structure–activity relationships (QSARs) are the final result of the process that starts with a suitable description of molecular structures and ends with some inference, hypothesis, and prediction on the behavior of molecules in the environmental, biological, and physicochemical systems under analysis (Figure 2).

QSARs are based on the assumption that the structure of a molecule (e.g., its geometric, steric, and electronic properties) must contain the features responsible for its physical, chemical, and biological properties and on the ability to capture these features in one or more numerical descriptors. Using QSAR models, the biological activity (or property, reactivity, etc.) of a newly designed or untested chemical can be inferred from the molecular structure of similar compounds whose activities (properties, reactivities, etc.) have already been assessed.

Besides the well-known approach called QSARs, other specific approaches aimed at relating the molecular structure to some experimental (or calculated) properties are quantitative structure–reactivity relationships (QSRRs), quantitative shape–activity relationships (QShARs), the molecular shape being considered as a component of the molecular structure, quantitative structure–chromatographic relationships (QSCRs), quantitative structure–toxicity relationships (QSTRs), quantitative structure–biodegradability relationships (QSBRs), quantitative similarity–activity relationships (QSiARs), quantitative structure–enantioselective retention relationships (QSERRs), and so on. Generally speaking, the quantitative structure–property relationship (QSPR) acronym is used when any property different from biological activity is modeled.

Figure 1 General scheme of the relationships among molecular structure, molecular descriptors, chemometrics, and QSAR/QSPR. Molecular descriptors are derived from graph theory, discrete mathematics, physical chemistry, information theory, quantum chemistry, organic chemistry, differential topology, and algebraic topology; processed by statistics, chemometrics, and chemoinformatics; and applied in QSAR/QSPR, medicinal chemistry, pharmacology, genomics, drug design, toxicology, proteomics, analytical chemistry, environmetrics, virtual screening, and library searching.

Despite the differences among the approaches defined above, in the literature the most common terms referring to all these approaches are QSAR and QSPR, with the single, simple distinction between ‘activity’ and ‘property’. These will also be the main terms used in this chapter, without any further distinctions.

It has been nearly 45 years since QSAR modeling was first introduced into the practice of agrochemistry, drug design, toxicology, and industrial and environmental chemistry. Its growing power in the following years may be mainly attributed to the rapid and extensive development in methodologies and computational techniques that have made it possible to delineate and refine the many variables and approaches used to model molecular properties.1–11 Furthermore, interest in QSAR keeps growing because nowadays these tools are used not only for research purposes but also to produce data on chemicals in the interest of time and cost effectiveness.

Chemometrics is widely applied in QSAR research, from both a methodological and a technical point of view. Indeed, it provides tools and ideas to describe molecular structures and model their properties with continuous attention to the basic chemometric philosophy, based on model validation, information synthesis by new indices, and graphical representation of data information.

4.05.2 Short History of QSAR and Molecular Descriptors

The history of QSAR and molecular descriptors is closely related to the history of what can be considered one of the most important scientific concepts of the last part of the nineteenth century and the whole of the twentieth century, that is, the concept of molecular structure.

The years between 1860 and 1880 were characterized by a strong dispute about the concept of molecular structure, arising from the studies on substances showing optical isomerism and the studies of Kekulé (1861–67) on the structure of benzene. The concept of the molecule thought of as a three-dimensional (3D) body was first proposed by Butlerov (1861–65), Wislicenus (1869–73), Van’t Hoff (1874–75), and Le Bel (1874). The publication in French of the revised edition of La chimie dans l’espace by Van’t Hoff in 1875 is considered a milestone of the 3D conception of chemical structures.

QSAR history started a century earlier than the history of molecular descriptors, being closely related to the development of the molecular structure theories. QSAR modeling was born in the field of toxicology. Attempts to quantify relationships between chemical structure and acute toxic potency have been part of the toxicological literature for more than 100 years. In the defense of his thesis entitled ‘Action de l’alcohol amylique sur l’organisme’ at the Faculty of Medicine, University of Strasbourg, France, on 9 January 1863, Cros noted that a relationship existed between the toxicity of primary aliphatic alcohols and their water solubility. This relationship demonstrated the central axiom of structure–toxicity modeling, that is, the toxicity of substances is governed by their properties, which are determined in turn by their chemical structure. Therefore, there are interrelationships among structure, properties, and toxicity.

Figure 2 General scheme of the QSAR/QSPR philosophy. Experiments on molecules yield physicochemical properties and biological activities; theory yields molecular descriptors; QSPR and QSAR relate the descriptors to the properties and to the activities, respectively.

Crum-Brown and Fraser (1868–69)12–14 proposed the existence of a correlation between biological activity of different alkaloids and their molecular constitution. More specifically, the physiological action of a substance in a certain biological system ($\Phi$) was defined as a function ($f$) of its chemical constitution ($C$):

$$\Phi = f(C) \tag{1}$$

Thus, an alteration in chemical constitution, $\Delta C$, would be reflected by an effect on biological activity, $\Delta\Phi$. This equation can be considered the first general formulation of a QSAR.

A few years later, a hypothesis on the existence of correlations between molecular structure and physicochemical properties was reported in the work of Körner (1874),15 which dealt with the synthesis of disubstituted benzenes and the discovery of ortho, meta, and para derivatives. The different colors of disubstituted benzenes were thought to be related to their differences in molecular structure. Ten years later, Mills (1884)16 published a study ‘On Melting Point and Boiling Point as Related to Composition’ in the Philosophical Magazine.

The quantitative property–activity models commonly regarded as marking the beginning of systematic QSAR/QSPR studies17 emerged from the search for relationships between the potency of local anesthetics and the oil/water partition coefficient,18 between narcosis and chain length,19 and between narcosis and surface tension.20 In particular, the concepts developed by Meyer and Overton are often referred to as the Meyer–Overton theory of narcotic action.18,19

The first theoretical QSAR/QSPR approaches date back to the end of the 1940s and are those that relate biological activities and physicochemical properties to theoretical numerical indices derived from the molecular structure.

The Wiener index21 and the Platt number,22 both proposed in 1947 to model the boiling point of hydrocarbons, were the first theoretical molecular descriptors based on graph theory.
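To illustrate how a graph-theoretical descriptor of this kind is obtained, the following minimal Python sketch computes the Wiener index as the sum of topological distances (numbers of bonds along the shortest path) between all pairs of atoms in the hydrogen-depleted molecular graph. The function name and the adjacency-list encoding of the molecules are assumptions made here for illustration only.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered pairs of atoms."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives topological distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end

# Hydrogen-depleted graphs (carbon atoms only); atom numbering is arbitrary.
n_butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
isobutane = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The branched isomer receives the lower value, which is the kind of structural discrimination that makes such an index usable as a variable in a boiling point model.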

In the early 1960s, new molecular descriptors were proposed, giving a start to systematic studies on molecular descriptors, mainly based on graph theory.23–32

The use of quantum-chemical descriptors in QSAR studies dates back to the early 1970s,29 although quantum-chemical descriptors had been defined and used long before in the framework of quantum chemistry. During 1930–60, the milestones were the works of Pauling33,34 and Coulson35 on the chemical bond, Sanderson36 on electronegativity, and Fukui et al.37 and Mulliken38 on electronic distribution.

Once the concept of molecular structure was definitively consolidated by the successes of quantum chemistry theories and the approaches to the calculation of numerical indices encoding molecular structure information were accepted, all the constitutive elements for the take-off of QSAR strategies were available.

The Hammett equation39,40 gave rise to the ‘σ–ρ’ culture in the delineation of substituent effects on organic reactions, whose aim was the search for linear free energy relationships (LFERs):41 steric, electronic, and hydrophobic constants were defined, becoming a basic tool for modeling properties of molecules.

In the 1950s, the fundamental works of Taft42–44 in physical organic chemistry were the foundation of relationships between physicochemical properties and solute–solvent interaction energies (linear solvation energy relationships, LSERs), based on steric, polar, and resonance parameters for substituent groups in congeneric compounds.

In the mid-1960s, led by the pioneering works of Hansch,23,45,46 the QSAR/QSPR approach began to assume its modern look.

In 1962, Hansch et al.45 published their study on the structure–activity relationships of plant growth regulators and their dependence on Hammett constants and hydrophobicity. Using the octanol/water system, a whole series of partition coefficients were measured, and thus, a new hydrophobic scale was introduced for describing the tendency of molecules to move through environments characterized by different degrees of hydrophilicity such as blood and cellular membranes. The delineation of Hansch models led to explosive development in QSAR analysis and related approaches.3


In the same years, Free and Wilson47 developed a model of additive substituent contributions to biological activities, giving a further push to the development of QSAR strategies.

They proposed to model a biological response on the basis of the presence/absence of substituent groups on a common molecular skeleton.47,48 This approach, called the ‘de novo approach’ when presented in 1964, was based on the assumption that each substituent makes an additive and constant contribution to the biological activity regardless of the other substituents in the rest of the molecule.

At the end of the 1960s, many structure–property relationships were proposed based not only on substituent effects but also on indices describing the whole molecular structure. These theoretical indices were derived from a topological representation of the molecule, mainly applying graph theory concepts, and are therefore usually referred to as 2D descriptors.

The fundamental works of Balaban,49,50 Randić,51,52 and Kier et al.53 led to further significant developments of the QSAR approaches based on topological indices (TIs).

As a natural extension of the topological representation of a molecule, the geometrical aspects of a molecule have been taken into account since the mid-1980s, leading to the development of 3D-QSAR, which exploits information on the molecular geometry. Geometrical descriptors were derived from the 3D spatial coordinates of a molecule, and among them, there were shadow indices,54 charged partial surface area descriptors,55 weighted holistic invariant molecular (WHIM) descriptors,56 gravitational indices,57 EigenVAlue (EVA) descriptors,58 3D-MoRSE descriptors,59 EEVA descriptors,60 and GEometry, Topology, and Atom-Weights AssemblY (GETAWAY) descriptors.61

In the late 1980s, a new strategy for describing molecule characteristics was proposed, based on molecular interaction fields (MIFs), which are composed of interaction energies between a molecule and probes, at specified spatial points in 3D space. Different probes (such as a water molecule, methyl group, and hydrogen) were used for evaluating the interaction energies in thousands of grid points where the molecule was embedded. As the final result of this approach, a scalar field (a lattice) of interaction energy values characterizing the molecule was obtained. The first formulation of a lattice model to compare molecules by aligning them in 3D space and extracting chemical information from MIFs was proposed by Goodford62 in the GRID method and then by Cramer et al.63 in the comparative molecular field analysis (CoMFA).

Still based on MIFs, several other methods were subsequently proposed, among them comparative molecular similarity indices analysis (CoMSIA),64 the Compass method,65 G-WHIM descriptors,66 Voronoi field analysis,67 the VolSurf approach,68 and GRIND descriptors.69

Finally, in recent years the scientific community has shown increasing interest in combinatorial chemistry, high-throughput screening, substructural analysis, and similarity searching, for which several similarity/diversity approaches have been proposed, mainly based on substructure descriptors such as molecular fingerprints.10,11,70

4.05.3 Chemometrics and QSAR Modeling

The development of QSAR/QSPR models is quite a complex process, as outlined in Figure 3. Once the research goal has been clearly defined, which in most cases means defining the property to be modeled, that is, the endpoint, the first decision concerns how general the final model should be. This entails the selection of the set of molecules the modeling procedure is applied to. For a long time, QSAR models were developed on sets of congeneric compounds, that is, molecules with a common parent structure and different substituent groups. Later, the interest in producing tools for quick molecular property estimations shifted attention toward more general QSAR models suitable for diverse molecules belonging to different chemical classes, that is, noncongeneric sets. The final decision in defining the molecule set mainly depends on the foreseen use of the model and the availability of experimental data.

In this phase of the QSAR process, it is of primary concern to gain an exhaustive knowledge about the compounds under analysis with specific regard to the endpoint of interest. This obviously implies the acquisition of reliable experimental data regarding the endpoint and, possibly, of already existing models. Data on the chemicals can be produced experimentally or retrieved from the literature. In both cases, accuracy should be carefully evaluated: the limiting factor in the development of QSAR/QSPR models is the availability of high-quality experimental data, because the accuracy of the property estimated by a model cannot exceed the degree of accuracy of the input data. Moreover, when data are collected from the literature, to avoid introducing additional variability from different sources of information, data should be taken from a single source or from closely comparable sources.

Another important phase of the QSAR process is the definition of a reliable chemical space; in other words, the selection of those structural features thought to be most responsible for the endpoint under analysis. This implies the selection of proper molecular descriptors but, in most cases, there is no a priori knowledge about which molecular descriptors are the best. The tendency is then to use a huge number of descriptors, which hopefully include the candidate variables for modeling, and later apply a variable selection technique. Two basic strategies can be adopted: (1) the use of algorithms to select the optimal subset(s) of descriptors and (2) the use of chemometric methods (e.g., principal component analysis (PCA) or partial least squares (PLS)) able to condense the large amount of available chemical information into a few principal variables.

The next step is the selection of the validation procedure, which, in addition to the fitting performance of the model, allows the evaluation of the model prediction ability. The latter is usually considered the most important characteristic for an acceptable QSAR model. The predictive ability of the models is evaluated by dividing the compounds into the training set, that is, the set by which the model is calculated, and the test set, that is, the set of compounds by which the model predictive ability is evaluated. The partition into training/test sets is performed in different ways, depending on the validation procedure (see Section 4.05.7.3).

Exploratory data analysis is a common preliminary step in all QSAR/QSPR studies. In particular, PCA and clustering methods (both hierarchical and nonhierarchical) are the most commonly used. The clustering approach based on Kohonen maps (or self-organizing maps, SOMs) has gained wide importance in recent years; it is an artificial 2D neural network that provides easily interpretable information about similarity/diversity among objects.71,72

By exploratory analysis, the QSAR expert can evaluate whether the chosen molecular descriptors are suitable for describing the compounds under analysis and whether the chemical space is sufficiently represented. Moreover, the tendency observed nowadays is to build a reference chemical space for large categories of chemicals for which molecular properties are known by using methods such as PCA on molecular fingerprints. This chemical space is then used to analyze similarities among groups of chemicals, showing, for example, groups of different biological activity, and to find which regions in the chemical space need to be explored further by designing new molecular structures.

Figure 3 General scheme of the QSAR/QSPR strategy. A set of molecules is described by molecular descriptors and experimental responses; the model is fitted on the training set, its prediction power is assessed on the test set, and it is then applied to the molecular descriptors of new molecules to give predicted new responses; reversible decoding (inverse QSAR) leads from the model back to molecules.

The majority of the QSAR strategies aimed at building models are based on regression and classification methods, depending on the studied problem. For continuous properties, the typical QSAR/QSPR model is defined as

$$P = f(x_1, x_2, \ldots, x_p) \tag{2}$$

where $P$ is the molecular property/activity, $x_1, \ldots, x_p$ are the $p$ molecular descriptors, and $f$ is a function representing the relationship between response and descriptors. In most cases, the function $f$ is not known a priori and needs to be estimated.

Ordinary least squares (OLS) regression, also called multiple linear regression (MLR), is the most common regression technique used to estimate the quantitative relationship between molecular descriptors and the property. PLS regression is widely applied, especially when the number of molecular descriptors is large with respect to the number of training compounds, as happens for 4D-QSAR methods such as GRID and CoMFA.
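A minimal Python sketch of an OLS fit, taking the function f of Equation (2) to be linear, is given below together with the fitted R² and a leave-one-out cross-validated Q². The normal-equation solver, the function names, and the descriptor values are illustrative assumptions; the Q² convention used here (1 − PRESS/TSS, with TSS computed around the overall response mean) is one common choice among several.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols_fit(X, y):
    """Return [b0, b1, ..., bp] from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(coef, x):
    return coef[0] + sum(b * xi for b, xi in zip(coef[1:], x))

def r2(X, y):
    """Fitting performance: 1 - RSS/TSS on the training compounds."""
    coef = ols_fit(X, y)
    mean_y = sum(y) / len(y)
    rss = sum((yi - predict(coef, xi)) ** 2 for xi, yi in zip(X, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss

def q2_loo(X, y):
    """Leave-one-out Q2: each compound is predicted by a model fitted without it."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = ols_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / tss

# Hypothetical data: six compounds, two descriptors, one measured property.
X_demo = [[0.5, 1.0], [1.0, 0.2], [1.5, 0.8], [2.0, 0.1], [2.5, 0.9], [3.0, 0.4]]
y_demo = [1.1, 2.7, 3.3, 4.8, 5.0, 6.7]
print(ols_fit(X_demo, y_demo), r2(X_demo, y_demo), q2_loo(X_demo, y_demo))
```

Because Q² is computed from predictions of compounds left out of the fit, it does not exceed R² in ordinary circumstances, and a large gap between the two is a warning sign of overfitting.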

Several other methods play a fundamental role in QSARs, such as principal component regression (PCR), k-nearest neighbor (k-NN) regression, and stepwise regression (SWR), the last being the most widely applied method for selecting a descriptor subset from a not too large set of candidate variables.

Regression techniques based on artificial neural networks (ANNs), such as back-propagation (BP) and radial basis function (RBF) networks, are also frequently used,73 as are ensemble approaches such as random forests.74,75

For discrete molecular properties, such as properties defining active/inactive compounds, the typical classification model is defined as

$$C = f(x_1, x_2, \ldots, x_p) \tag{3}$$

where $C$ is the class to which each object is assigned by the model, $x_1, \ldots, x_p$ are the $p$ molecular descriptors, and $f$ is a function representing the relationship between class assignment and descriptors. Note that classification models are also quantitative models; only the response $C$ is a qualitative quantity.

Besides the classical discriminant analysis (DA) and the k-NN methods, other classification methods widely used in QSAR/QSPR studies are SIMCA, learning vector quantization (LVQ), PLS–DA, classification and regression trees (CARTs), and cluster significance analysis (CSA), the last specifically proposed for asymmetric classification in QSARs. Other promising classification techniques have been added to the data analysis toolbox in QSAR discovery, such as support vector machines (SVMs),76 embedded cluster modeling (ECM),77 and Classification and Influence Matrix Analysis (CAIMAN).78
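As a small instance of Equation (3), the sketch below assigns a class to a query compound by majority vote among its k nearest training compounds in descriptor space. The Euclidean metric, the function name, and the descriptor values are assumptions for illustration; in practice descriptors would normally be scaled before distances are computed.

```python
import math
from collections import Counter

def knn_classify(train_X, train_labels, query, k=3):
    """Majority class among the k training compounds closest to `query`."""
    ranked = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-descriptor training set.
train_X = [(1.0, 1.0), (1.5, 0.8), (0.8, 1.3), (4.0, 4.0), (4.2, 3.7), (3.6, 4.4)]
train_labels = ["active", "active", "active", "inactive", "inactive", "inactive"]

print(knn_classify(train_X, train_labels, (1.2, 0.9)))  # active
print(knn_classify(train_X, train_labels, (3.8, 4.1)))  # inactive
```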

In the last few years, ranking methods were also introduced in structure–response correlation studies, with the aim of ranking the chemicals instead of reproducing some quantitative property. They are mainly used to build priority lists of chemicals;79–81 however, they were also proposed for modeling purposes.82,83 Ranking methods are simply aimed at giving a rank to the studied objects, that is, these methods are able to provide a global index allowing the ranking of the samples (total ordering methods).

Ranking methods based, for example, on desirability, utility, and dominance functions allow a total ordering of the chemicals to be reached by evaluating more than one descriptor simultaneously. Moreover, by adding the relationship of incomparability among compounds to the total ordering, a partial ordering can be obtained, resulting in the so-called Hasse diagram,81 as shown in Figure 4.
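The partial ordering behind a Hasse diagram can be sketched in a few lines of Python. In the sketch below, an object dominates another when it is at least as hazardous on every criterion; two objects are incomparable when neither dominates the other. The toxicity values are hypothetical and were chosen only so that the resulting order mirrors the five-object example of Figure 4; the function names are likewise assumptions.

```python
def dominates(u, v):
    """True if u is at least as high as v on every criterion and u differs from v."""
    return all(a >= b for a, b in zip(u, v)) and u != v

def hasse(objects):
    names = list(objects)
    above = {n: {m for m in names if dominates(objects[m], objects[n])} for n in names}
    below = {n: {m for m in names if dominates(objects[n], objects[m])} for n in names}
    maximal = sorted(n for n in names if not above[n])
    minimal = sorted(n for n in names if not below[n])
    # Cover relations (the lines of the diagram): lower < upper with nothing in between.
    covers = sorted(
        (lo, up) for lo in names for up in above[lo] if not (above[lo] & below[up])
    )
    incomparable = sorted(
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if q not in above[p] and q not in below[p]
    )
    return maximal, minimal, covers, incomparable

# Hypothetical toxicity scores on two organisms (higher = more toxic).
toxicity = {"a": (5, 3), "b": (3, 5), "c": (2, 2), "d": (1, 1), "e": (0, 6)}

maximal, minimal, covers, incomparable = hasse(toxicity)
print(maximal)       # ['a', 'b', 'e']
print(minimal)       # ['d', 'e']
print(covers)        # [('c', 'a'), ('c', 'b'), ('d', 'c')]
print(incomparable)  # [('a', 'b'), ('a', 'e'), ('b', 'e'), ('c', 'e'), ('d', 'e')]
```

Note that the isolated object e is formally both maximal and minimal; drawing it at the highest level is a convention of the diagram, not a consequence of the order relation.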

In Figure 4, the five objects (a, b, c, d, and e) are chemicals ranked on the basis of their toxicity, measured on different organisms. The resulting Hasse diagram shows the hazard levels of these compounds. In particular, chemicals a, b, and e (maximal elements, Level 3) appear more dangerous than chemicals c and d, but nothing is known about which of the three is absolutely the most dangerous, because they are mutually incomparable. Moreover, chemical e (isolated element) is not comparable with any of the other chemicals and thus, by convention, is placed at the highest level; two total orderings are obtained, namely (a, c, d) and (b, c, d), meaning, for example for the former, that chemical a is more dangerous than chemical c, which, in turn, is more dangerous than chemical d (minimal element).

Besides the simple ranking lists, totally or partially ordered, ranking models can also be used as a further tool together with regression and classification models. In effect, a ranking model is defined as a relationship between one or more dependent attributes (y), investigated experimentally and usually called criteria, and a set of theoretically defined independent attributes (x), called model attributes, which are usually theoretical calculated variables such as molecular descriptors.82,83 This kind of model can be defined as

$$R_i(y_{i1}, y_{i2}, \ldots, y_{iK}) = f(x_{i1}, x_{i2}, \ldots, x_{ip}), \qquad 1 \le R_i \le n \tag{4}$$

where $R_i$ is the rank of the $i$th object, $f$ is a ranking function, $K$ the number of dependent attributes, and $p$ the number of independent attributes. In other words, the ranking of the $i$th object obtained from the experimental attributes $y$ is reproduced by a set of independent model attributes. Then, the rank of a new chemical with respect to the training set chemicals can be evaluated by describing it with the model descriptors.

Once the model is calculated and properly validated, it can be used to estimate property values for new molecules or to obtain information about the mechanism of action of a group of compounds or, in general, about which structural features are responsible for a specific behavior of molecules. In the first case, more attention is paid to obtaining models with the highest predictive ability, regardless of the model interpretability. Indeed, when the aim is to produce data on chemicals, what matters most is that the model is as reliable as possible, not the reason why some molecular descriptors were selected in the model.

However, even when the predictive ability of a model is high, the estimated property should be treated with caution because a molecule might be ‘far’ from the model chemical space; the response would then be the result of a strong extrapolation and the prediction would be unreliable. To cope with this problem, the concept of the applicability domain (AD) of a model emerged as a relevant aspect of evaluating prediction reliability.
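One very simple way to flag such extrapolations, shown here only as an illustrative assumption and not as the method adopted in this chapter, is a descriptor range check: a new molecule is considered inside the domain only if each of its descriptor values lies within the interval spanned by the training set. The function names and the numbers are hypothetical.

```python
def descriptor_ranges(train_X):
    """Per-descriptor (min, max) intervals observed in the training set."""
    return [(min(col), max(col)) for col in zip(*train_X)]

def in_domain(ranges, x):
    """True if every descriptor of x falls inside the training interval."""
    return all(lo <= xi <= hi for (lo, hi), xi in zip(ranges, x))

# Hypothetical training descriptors (two descriptors per compound).
train_X = [(1.0, 0.2), (2.0, 0.5), (3.0, 0.9)]
ranges = descriptor_ranges(train_X)

print(in_domain(ranges, (2.5, 0.4)))  # True: interpolation
print(in_domain(ranges, (4.0, 0.4)))  # False: first descriptor is out of range
```

A range check ignores correlations among descriptors and empty regions inside the bounding box, so it should be read as a necessary rather than a sufficient condition for a reliable prediction.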

For some applications, the primary concern is the possibility of obtaining information about molecular structure from QSAR/QSPR models. Any procedure capable of reconstructing the molecular structure or fragment starting from molecular descriptor values is called reversible decoding (or inverse QSAR), that is, once molecular descriptors are obtained from a structure representation, reversibility would lead from descriptors back to structures.81,84,85

Reversible decoding is of great importance because, once a QSAR model is established, optimal values of the molecular property can be chosen and the corresponding values of the model molecular descriptors calculated by using the estimated QSAR model; if reversibility is feasible, then molecular descriptors lead to structures. The possible molecular structures corresponding to the optimized descriptor values can be designed (and synthesized). Unfortunately, this last operation is often a troublesome task when the model molecular descriptors are neither simple nor easily interpretable.

Reversibility is a highly desired property of a descriptor, but is not strictly essential for structure–response studies. In effect, if the QSAR model needs to be used for producing reliable data on chemicals, reversibility is not a necessary requirement. On the contrary, if the model is used for drug design, the requirement of reversibility needs to be fulfilled.

Figure 4 Example of Hasse diagram. Objects are arranged by increasing hazard: Level 1, d (minimal); Level 2, c; Level 3, a, b, and e (maximals, incomparable alternatives). Chains: b, c, d with d ≤ c ≤ b, and a, c, d with d ≤ c ≤ a.

Furthermore, although the inverse QSAR requirement is a very useful property, it must be noted that it could be replaced by a surrogate approach based on an inductive analysis of optimal values of the studied property obtained from new molecules generated by some automatic algorithm.

Finally, to summarize the above, the development of a QSAR/QSPR model requires three fundamental components: (1) a data set providing experimental measures of a biological activity or property for a group of chemicals (i.e., the dependent variable of the model); (2) molecular descriptors, which encode information about the molecular structures (i.e., the descriptors or the independent variables of the model); and (3) mathematical methods to find the relationships between a molecule property/activity and the molecular structure.

An example of QSAR modeling is given below.86 The compounds under analysis are 30 monosubstituted phenylacetanilides (Figure 5), whose substituents (X) are given in Table 1. The studied response is the anticonvulsant activity log(1/ED50) of these chemicals.

The chemical structures were generated by dedicated software, and the geometry optimization was per-formed by a semiempirical quantum method. On the basis of the generated chemical structures, five differentclasses of molecular descriptors were calculated by a specific tool for molecular descriptor calculation:constitutional, topological, geometrical, quantum-chemical, and interaction field descriptors. Removing allthe descriptors having a square correlation coefficient with the activity less than 0.063 resulted into a set of 304molecular descriptors, which were successively processed by a heuristic algorithm based on MLR and variableselection.

The following final two QSAR models were obtained:

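$$
\log\left(\frac{1}{\mathrm{ED}_{50}}\right) = 10.1446 - 4.0793\cdot\mathrm{xsc} - 6.1638\cdot\mathrm{avi} - 1.4332\cdot\mathrm{A92} - 0.0094\cdot\mathrm{mas} - 9.2412\cdot\mathrm{M2} + 0.1434\cdot\mathrm{mam} \tag{5}
$$

$$
n = 30 \qquad R^2 = 0.789 \qquad Q^2_{100} = 0.700 \qquad s = 0.14 \qquad F = 15.0
$$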

Figure 5 Monosubstituted phenylacetanilides.

Table 1 Phenylacetanilide substituents, experimental anticonvulsant activity, and calculated values from Equations (5) and (6)

| No | X | Exp. | Equation (5) | Equation (6) | No | X | Exp. | Equation (5) | Equation (6) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | H | 3.77 | 3.57 | 3.64 | 16 | m-COMe | 3.95 | 3.65 | 4.00 |
| 2 | m-Me | 3.75 | 3.65 | 3.71 | 17 | m-OAc | 3.48 | 3.58 | 3.49 |
| 3 | m-Et | 3.67 | 3.58 | 3.60 | 18 | m-OEt | 3.42 | 3.56 | 3.46 |
| 4 | m-F | 3.34 | 3.52 | 3.35 | 19 | m-OSO2Me | 3.77 | 3.83 | 3.74 |
| 5 | m-Cl | 3.40 | 3.37 | 3.41 | 20 | p-Me | 3.26 | 3.67 | 3.42 |
| 6 | m-Br | 3.32 | 3.23 | 3.32 | 21 | p-F | 3.49 | 3.49 | 3.52 |
| 7 | m-I | 2.64 | 2.64 | 2.64 | 22 | p-OH | 3.72 | 3.64 | 3.73 |
| 8 | m-CF3 | 2.84 | 2.83 | 2.85 | 23 | p-OMe | 3.78 | 3.70 | 3.79 |
| 9 | m-OH | 3.58 | 3.64 | 3.54 | 24 | p-COMe | 3.51 | 3.71 | 3.44 |
| 10 | m-NH2 | 3.81 | 3.75 | 3.93 | 25 | o-F | 3.48 | 3.46 | 3.41 |
| 11 | m-NHMe | 4.03 | 3.82 | 4.00 | 26 | o-OH | 3.33 | 3.42 | 3.34 |
| 12 | m-NHEt | 3.91 | 3.83 | 3.85 | 27 | o-NH2 | 3.40 | 3.46 | 3.40 |
| 13 | m-OMe | 3.22 | 3.58 | 3.53 | 28 | o-OMe | 3.43 | 3.41 | 3.44 |
| 14 | m-CN | 3.44 | 3.46 | 3.51 | 29 | o-NO2 | 3.29 | 3.20 | 3.24 |
| 15 | m-NO2 | 3.62 | 3.68 | 3.63 | 30 | o-COMe | 3.41 | 3.34 | 3.44 |


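$$
\begin{aligned}
\log\left(\frac{1}{\mathrm{ED}_{50}}\right) ={}& 0.9172 - 0.174\cdot\mathrm{mam} - 3.2783\cdot\mathrm{A58} - 1.3221\cdot\mathrm{xsc} + 5.1735\cdot\mathrm{F113} - 7.5902\cdot\mathrm{iox} \\
& + 1.7525\cdot\mathrm{P64} - 3.406\cdot\mathrm{M72} - 35.7998\cdot\mathrm{F109} + 1.987\cdot\mathrm{nsf} - 17.7329\cdot\mathrm{F96}
\end{aligned} \tag{6}
$$

$$
n = 30 \qquad R^2 = 0.962 \qquad Q^2_{100} = 0.901 \qquad s = 0.06 \qquad F = 51.2
$$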

The model descriptors are xsc (maximum net charge for a C atom), avi (average free valence of I atoms), mas (molecular mass), mam (molecular mass/number of atoms), iox (mass percent of I × maximum net charge for an I atom), nsf (minimum net charge for an F atom), A92 and A58 (sum of attractive electrostatic forces for grid points 92 and 58, respectively), F96, F109, and F113 (sum of all electrostatic forces for grid points 96, 109, and 113, respectively), M2 and M72 (average parallax for grid points 2 and 72, respectively), and P64 (maximum parallax for grid point 64).

Note that for both models, the fitting ($R^2$) and prediction ($Q^2$) abilities were estimated.
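The distinction between fitting and prediction ability can be illustrated with a minimal Python sketch. It fits an MLR model by ordinary least squares and computes $R^2$ together with a cross-validated $Q^2$. The cross-validation scheme behind the values reported above is not specified in this passage, so leave-one-out is assumed here; the data are synthetic, and the function name `r2_and_q2_loo` is hypothetical.

```python
import numpy as np

def r2_and_q2_loo(X, y):
    """Return (R2, Q2) for an ordinary least-squares model with intercept.

    Q2 is computed by leave-one-out cross-validation (an assumption of this sketch).
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])            # add the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # fit on all compounds
    tss = np.sum((y - y.mean()) ** 2)               # total sum of squares
    rss = np.sum((y - A @ coef) ** 2)               # residual sum of squares
    press = 0.0                                     # predictive residual sum of squares
    for i in range(n):
        keep = np.arange(n) != i                    # leave compound i out
        c, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ c) ** 2
    return 1.0 - rss / tss, 1.0 - press / tss

# Synthetic data: 30 "compounds", 3 descriptors (not the phenylacetanilide data)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 3.5 + X @ np.array([0.4, -0.3, 0.2]) + rng.normal(scale=0.1, size=30)

r2, q2 = r2_and_q2_loo(X, y)
print(f"R2 = {r2:.3f}, Q2(LOO) = {q2:.3f}")
```

For a least-squares model, the leave-one-out $Q^2$ computed this way can never exceed $R^2$, which is why a large gap between the two is commonly read as a sign of overfitting.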

4.05.4 Specific QSAR Approaches

Many QSAR approaches are reported in the literature, often under a large variety of names, which makes it difficult to organize them into a well-defined classification system. An attempt to classify QSAR approaches can be made by considering the objective of the approach, the type of molecular property being modeled, the type of molecular descriptors the model is composed of, and the mathematical method or computational algorithm used to estimate the model parameters. Focusing on the objective of the analysis, it is possible to distinguish, for example, among drug design, high-throughput screening, and molecular similarity analysis. With respect to the property, terms such as ADME (absorption, distribution, metabolism, and elimination properties) analysis, environmental QSAR, LSERs, and binary-QSAR are commonly used. 2D-QSAR, 3D-QSAR, and 4D-QSAR refer to the type of molecular descriptors on which the models are based. Terms such as group contribution method (GCM), structural analysis, and grid-based QSAR technique mainly derive from the specific methods applied. In the following, some QSAR approaches are explained in more detail.

4.05.4.1 Hansch Approach

There is a consensus among current predictive toxicologists that Corwin Hansch is the founder of modern QSAR. In the classic article,45 it was shown that, in general, the biological activity of a group of ‘congeneric’ chemicals can be described by a comprehensive model:

$$
\log\left(\frac{1}{C_{50}}\right) = a\pi + b\varepsilon + cS + d \tag{7}
$$

where C, the toxicant concentration at which an endpoint is manifested (e.g., 50% mortality or effect), is related to a hydrophobicity term, $\pi$, an electronic term, $\varepsilon$ (originally the Hammett substituent constant, $\sigma$), and a steric term, S (typically Taft’s substituent constant, $E_s$); d is a general additional term depending on the kind of property to be modeled.

In particular, the parameter $\pi$, which is the relative hydrophobicity of a substituent, was defined as

$$
\pi = \log P_X - \log P_H \tag{8}
$$

where $P_X$ and $P_H$ represent the partition coefficients of a derivative and the parent molecule, respectively. This is a substituent constant denoting the difference in hydrophobicity between a parent compound and a substituted analog; it is usually replaced by a more general molecular term, the logarithm of the 1-octanol/water partition coefficient, log Kow or log P.

A practical example86 of the Hansch approach is given here for the data shown in Table 1, referring to the 30 phenylacetanilides (Figure 5) for which the anticonvulsant activity is known. The analysis was performed on the following descriptors: log P, the octanol/water partition coefficient; $\sigma$, the Hammett electronic constant of the substituent; $I_p$, an indicator variable that takes the value 1 for p-derivatives and 0 for the other compounds; $E_s$, the Taft steric constant for o-derivatives; and R, the electronic parameter for o-derivatives. The Hansch QSAR models with four, five, and six independent variables are

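$$
\log\left(\frac{1}{\mathrm{ED}_{50}}\right) = 2.280 + 0.264(\log P)^2 + 1.222(\log P) - 0.161\sigma - 0.079 I_p \tag{9}
$$

$$
n = 30 \qquad R^2 = 0.490 \qquad s = 0.228 \qquad F = 5.99
$$

$$
\log\left(\frac{1}{\mathrm{ED}_{50}}\right) = 2.311 + 0.290(\log P)^2 + 1.309(\log P) - 0.135\sigma - 0.057 I_p + 0.404 E_s \tag{10}
$$

$$
n = 30 \qquad R^2 = 0.640 \qquad s = 0.195 \qquad F = 10.05
$$

$$
\log\left(\frac{1}{\mathrm{ED}_{50}}\right) = 2.478 + 0.276(\log P)^2 + 1.229(\log P) - 0.353\sigma - 0.223 I_p + 0.278 E_s + 0.621 R \tag{11}
$$

$$
n = 30 \qquad R^2 = 0.731 \qquad s = 0.172 \qquad F = 7.83
$$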

Note that these models contain a nonlinear term, $(\log P)^2$, and that their predictive ability was not evaluated.

The contributions of Hammett and Taft together laid the basis for the development of the QSAR paradigm by Hansch and Fujita, who combined the hydrophobic constants with Hammett’s electronic constants to yield the linear Hansch equation and its many extended forms.
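As an illustration of how a Hansch-type model is applied, the sketch below evaluates Equation (8) and the six-variable model of Equation (11), using the coefficients reported in the text. The function names and all input values are hypothetical and chosen only to show how the terms enter the model; they are not data from the phenylacetanilide study.

```python
def substituent_pi(log_p_x, log_p_h):
    """Equation (8): hydrophobic constant of a substituent, log P_X - log P_H."""
    return log_p_x - log_p_h

def hansch_eq11(log_p, sigma, i_p=0, e_s=0.0, r=0.0):
    """Equation (11): predicted log(1/ED50) from the six-variable Hansch model."""
    return (2.478
            + 0.276 * log_p ** 2   # nonlinear hydrophobic term
            + 1.229 * log_p        # linear hydrophobic term
            - 0.353 * sigma        # Hammett electronic constant
            - 0.223 * i_p          # indicator variable: 1 for p-derivatives
            + 0.278 * e_s          # Taft steric constant (o-derivatives)
            + 0.621 * r)           # electronic parameter (o-derivatives)

# Hypothetical inputs: same log P and sigma, differing only in the para indicator
non_para = hansch_eq11(log_p=1.0, sigma=0.0)
para = hansch_eq11(log_p=1.0, sigma=0.0, i_p=1)
print(non_para, para)

# Hypothetical partition coefficients of a derivative and its parent
print(substituent_pi(log_p_x=2.5, log_p_h=2.0))
```

With all other descriptors equal, the indicator variable shifts the predicted activity of a p-derivative by its coefficient, which is how such a model encodes a position effect.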

4.05.4.2 Free–Wilson Approach

The Free–Wilson approach47 is based on the assumption that a biological response can be modeled by additive substituent contributions, that is, the substituent effects are considered independent of each other; the congenericity of the compounds is a further basic requirement.

Once a common skeleton for the chemical analogs is defined, regression analysis is performed considering a number S of substitution sites $R_s$ ($s = 1, \ldots, S$) and, for each site, a number $N_s$ of different substituents. Hydrogen atoms are also considered as substituents if present in a substitution site of some compounds. The Free–Wilson descriptors of the ith compound are indicator variables $I_{i,ks}$, where $I_{i,ks} = 1$ if the kth substituent is present in the sth site and $I_{i,ks} = 0$ otherwise.

The Free–Wilson model is defined as

$$
y_i = b_0 + \sum_{s=1}^{S} \sum_{k=1}^{N_s} b_{ks} I_{i,ks} \tag{12}
$$

where $b_0$ is the intercept of the model, corresponding to the average biological response calculated from the data set, and $b_{ks}$ are the regression coefficients. The biological response y is usually expressed in the form log(1/C), where C is the concentration achieving a fixed effect. The regression coefficients $b_{ks}$ of the Free–Wilson model give the importance of each kth substituent in each sth site in increasing or decreasing the response with respect to the mean response, that is, the activity contribution of the substituent.

A simple example of the Free–Wilson approach is given below, considering eight derivatives of toluene with two substitution sites (X and Y) (Figure 6).

In site X, ethyl, fluorine, chlorine, and bromine substituents are allowed ($N_X = 4$), whereas in site Y, only chlorine and bromine substituents are allowed ($N_Y = 2$). The eight possible derivatives are coded in the Free–Wilson approach as shown in Table 2.

Figure 6 Toluene parent molecule.
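The Free–Wilson coding of the eight toluene derivatives can be sketched in Python as follows. The indicator matrix is built from the substituent lists given in the text; the activity values are synthetic (generated from hypothetical additive contributions), because no responses are reported for these compounds.

```python
from itertools import product
import numpy as np

x_subs = ["C2H5", "F", "Cl", "Br"]   # substituents allowed in site X (N_X = 4)
y_subs = ["Cl", "Br"]                # substituents allowed in site Y (N_Y = 2)

# All X/Y combinations, in the same order as compounds 1-8 of the Free-Wilson matrix
compounds = list(product(x_subs, y_subs))

def free_wilson_row(x, y):
    """Indicator variables: 1 if the substituent is present in the site, else 0."""
    return [int(x == s) for s in x_subs] + [int(y == s) for s in y_subs]

I = np.array([free_wilson_row(x, y) for x, y in compounds])
print(I)

# Synthetic responses from hypothetical additive contributions (illustration only)
contrib_x = np.array([0.3, -0.1, 0.0, -0.2])
contrib_y = np.array([0.1, -0.1])
y = 3.0 + I[:, :4] @ contrib_x + I[:, 4:] @ contrib_y

# Least-squares fit with intercept; lstsq returns the minimum-norm solution
A = np.column_stack([np.ones(len(I)), I])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
print(np.linalg.matrix_rank(A))      # smaller than the number of columns
```

Because the indicators of each site sum to one for every compound, the design matrix is rank deficient, so the individual coefficients are identifiable only after a constraint is imposed (for example, expressing contributions relative to the mean response). The fitted responses, in contrast, are unique.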


4.05.4.3 LSER Approach

LSERs constitute the basis on which the effects of solvent–solute interactions on physicochemical properties and reactivity parameters are studied. In general, a property P of a species A in a solvent S can be expressed as

$$
P_{A,S} = \sum_j \varphi_j(A, S) \tag{13}
$$

where $\varphi_j$ are complex functions of both solvents and solutes.87 By assuming that these functions can be factorized into two contributions separately dependent on solute and solvent, the property can be represented as

$$
P_{A,S} = \sum_j f_j(A)\, g_j(S) \tag{14}
$$

where f are functions of the solute and g are functions of the solvent.

The underlying philosophy of the LSER is based on the possibility of studying these two functions after a proper choice of the reference systems and properties. Moreover, it has been recognized that solution properties P mainly depend on three factors: a cavity term, a polar term, and a hydrogen-bond term:

P = intercept + cavity term + dipolarity/polarizability term + hydrogen-bond term

Therefore, a typical LSER is expressed as88

$$
P_{A,S} = b_0 + b_1 \left(\delta_H^2\right)_1 V_2 + b_2 \pi^*_1 \pi^*_2 + b_3 \alpha_1 \beta_2 + b_4 \beta_1 \alpha_2 \tag{15}
$$

where b are the estimated regression coefficients, and the subscripts 1 and 2 in the solvent/solute property parameters refer to the solvent S and the solute A, respectively. This equation is usually known as the solvatochromic equation, and the parameters of dipolarity/polarizability and hydrogen bonding as solvatochromic parameters. The term ‘solvatochromic’ derives from the origin of this approach, namely the effect that a solvent has on the color of an indicator, which is used for the quantitative determination of some molecular attributes (solvatochromic parameters).

From the general solvatochromic equation, two special cases can be derived. When dealing with the effects of different solvents on the properties of a specific solute, the general equation is written explicitly in terms of the solvent parameters:

$$
P_{A,S_i} = b_0 + b_1 \left(\delta_H^2\right)_{1,i} + b_2 \pi^*_{1,i} + b_3 \alpha_{1,i} + b_4 \beta_{1,i} \tag{16}
$$

This equation has been used in several correlations of solvent effects on solute properties such as reaction rates and equilibrium constants of solvolyses, energies of electronic transitions, solvent-induced shifts in ultraviolet/visible, infrared, and nuclear magnetic resonance spectroscopy, fluorescence lifetimes, and formation constants of hydrogen-bonded and Lewis acid/base complexes.89

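Table 2 Free–Wilson matrix of eight toluene derivatives

| Compound | Site X: C2H5 | Site X: F | Site X: Cl | Site X: Br | Site Y: Cl | Site Y: Br |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 1 | 1 | 0 |
| 8 | 0 | 0 | 0 | 1 | 0 | 1 |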


Conversely, when dealing with solubility, lipophilicity, or other properties of a set of different solutes in a specific solvent, the general equation is written explicitly in terms of the solute parameters:

$$
P_{A,S} = b_0 + b_1 V_i + b_2 \pi^*_{2,i} + b_3 \beta_{2,i} + b_4 \alpha_{2,i} \tag{17}
$$

This equation has mainly been used in correlations of the aqueous solubility of compounds, octanol/water partition coefficients, and some other partition parameters, together with some biological properties.89–91
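The step from the general solvatochromic equation to its two special cases can be written out explicitly. The derivation below is a sketch that assumes the usual Kamlet–Taft form of the general equation, with $\delta_H^2$ the cavity (cohesive energy density) parameter, V the solute volume, $\pi^*$ the dipolarity/polarizability, and $\alpha$ and $\beta$ the hydrogen-bond acidity and basicity; the primed and double-primed coefficients are notation introduced here for clarity.

```latex
% General form (subscript 1 = solvent, subscript 2 = solute)
P_{A,S} = b_0 + b_1 (\delta_H^2)_1 V_2 + b_2 \pi^*_1 \pi^*_2
        + b_3 \alpha_1 \beta_2 + b_4 \beta_1 \alpha_2

% Case 1: one solute, many solvents S_i.
% The solute quantities V_2, \pi^*_2, \beta_2, \alpha_2 are constants
% and are absorbed into the regression coefficients:
P_{A,S_i} = b_0 + b'_1 (\delta_H^2)_{1,i} + b'_2 \pi^*_{1,i}
          + b'_3 \alpha_{1,i} + b'_4 \beta_{1,i},
\quad b'_1 = b_1 V_2,\; b'_2 = b_2 \pi^*_2,\; b'_3 = b_3 \beta_2,\; b'_4 = b_4 \alpha_2

% Case 2: one solvent, many solutes A_i.
% The solvent quantities are constants and are absorbed instead:
P_{A_i,S} = b_0 + b''_1 V_{2,i} + b''_2 \pi^*_{2,i}
          + b''_3 \beta_{2,i} + b''_4 \alpha_{2,i},
\quad b''_1 = b_1 (\delta_H^2)_1,\; b''_2 = b_2 \pi^*_1,\; b''_3 = b_3 \alpha_1,\; b''_4 = b_4 \beta_1
```

The coefficients estimated for a special case are therefore not the same numbers as those of the general equation: they carry the properties of the fixed solute or of the fixed solvent.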

Recently, the terms of the LSER equations were redefined by Abraham et al.92 as

$$
P = c + e\,E + s\,S + a\,A + b\,B + l\,L \tag{18}
$$

where E is the solute excess molar refractivity, S is the solute dipolarity/polarizability, A and B are the overall (summation) hydrogen-bond acidity and basicity, respectively, and L is the logarithm of the gas–hexadecane partition coefficient. The terms c, e, s, a, b, and l are the regression coefficients to be estimated.

4.05.4.4 Group Contribution Methods

GCMs search for relationships between structural properties and a physicochemical or biological response based on the following general model:

$$
P = f(G_1, G_2, \ldots, G_m; n_1, n_2, \ldots, n_m) \tag{19}
$$

where the experimental property P of the compound is a function of m group contributions $G_j$ and their occurrences $n_j$.93 The group contributions, also known as fragmental constants, are numerical quantities associated with substructures of the molecule, such as single atoms, atom pairs, atom-centered substructures, molecular fragments, and functional groups. For example, atom contribution models exhibit a one-to-one correspondence between atoms and property contributions, that is, the molecular property is a function of all the single atomic properties. The specification of the structural groups depends on the particular GCM scheme adopted.

Generally, the application of GCM to a molecule requires the following steps:

1. Identification of all groups in the molecule applicable to the particular GCM scheme.
2. Calculation of fragmental constants measuring contributions to the molecular property of the considered groups by employing the function associated with the particular GCM.
3. Evaluation of some correction factors that should account for interactions among molecular groups.

The group contributions are usually estimated by multivariate regression analysis on chemicals of known properties, but they can also be experimental, theoretical, or user-defined quantities. When estimation of group contributions is carried out by regression analysis, large training sets of chemicals are required to obtain reliable estimates. Usually, a battery of group contributions (a set of scalar parameters) is defined taking into account several structural characteristics of the molecules. If correction factors are accounted for, the GCM models are usually called additive-constitutive models.

Linear GCM models are defined as the following:

$$
y_i = k_0 + \sum_{j=1}^{m} G_j I_{ij} \qquad \text{or} \qquad y_i = k_0 + \sum_{j=1}^{m} G_j n_{ij} \tag{20}
$$

where $k_0$ is a model-specific constant, j runs over the m groups defined within the GCM scheme, and $G_j$ is the contribution of the jth group. $I_{ij}$ and $n_{ij}$ are substructure descriptors: $I_{ij}$ is a binary variable equal to 1 if the jth group is present in the ith molecule and 0 otherwise, and $n_{ij}$ is the number of occurrences of the jth group in the ith molecule.
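A minimal sketch of the linear GCM with occurrence descriptors is given below. It builds the binary descriptors from an occurrence matrix and then estimates the group contributions by least squares, the estimation route mentioned in the text. The occurrence matrix, the constant, and the contributions are all hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical occurrence matrix n_ij: 5 molecules (rows) x 3 groups (columns)
n = np.array([[2, 0, 1],
              [1, 1, 0],
              [3, 0, 0],
              [0, 2, 1],
              [1, 1, 2]])

# Binary presence/absence descriptors I_ij derived from the occurrences
I = (n > 0).astype(int)

# Hypothetical model constant and group contributions used to generate responses
k0 = 0.5
G = np.array([0.25, -0.40, 1.10])
y = k0 + n @ G                        # linear GCM with occurrence descriptors

# Estimate k0 and the group contributions back from (n, y) by least squares
A = np.column_stack([np.ones(len(n)), n])
est, *_ = np.linalg.lstsq(A, y, rcond=None)
print(est)                            # first entry estimates k0, the rest G_j
```

In this noise-free toy case the contributions are recovered exactly; with real data, reliable estimates require training sets that are large relative to the number of groups, as noted in the text.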


Nonlinear GCM models are usually defined as

$$
y_i = k_0 + \sum_{j=1}^{m} G_j n_{ij} - \left( \sum_{j=1}^{m} G_j n_{ij} \right)^2 \tag{21}
$$

Moreover, mixed GCM models are defined by adding one or more descriptors of the whole molecular structure to the group descriptors:

$$
y_i = k_0 + \sum_{j=1}^{m} G_j n_{ij} + \sum_{j'=1}^{p} \Phi_{ij'} \tag{22}
$$

where the second summation runs over the p molecular descriptors defined in the GCM scheme and $\Phi_{ij'}$ is the j′th descriptor value for the ith molecule.

The group contribution approach was extensively applied for the estimation of the octanol/water partition coefficient, which is a powerful lipophilicity descriptor. Examples are the Nys-Rekker method,94 Broto-Moreau-Vandycke log P,95 Ghose-Crippen log P (ALOGP),96 Moriguchi log P (MLOGP),97 and Klopman log P (KLOGP).98

Furthermore, group contribution models were proposed for several molecular property estimations, such as boiling and melting points,99,100 molar refractivity,101 pKa,102 critical temperatures, solubilities,103 soil sorption coefficients,104 and several thermodynamic properties.105,106 Another well-known group contribution model is that proposed by Atkinson for the evaluation of reaction rate constants with hydroxyl radicals of organic compounds.107

An example of GCM for the calculation of the topological polar surface area (TPSA) of molecules is given below. TPSA is calculated according to the model proposed by Ertl et al.,108 whose group contributions are listed in Table 3.

Table 3 List of surface group contributions of polar atom types

| No. | Atom type | PSA contribution (G) |
|---|---|---|
| 1 | `[N](-*)(-*)-*` | 3.24 |
| 2 | `[N](-*)=*` | 12.36 |
| 3 | `[N]#*` | 23.79 |
| 4 | `[N](-*)(=*)=*` (b) | 11.68 |
| 5 | `[N](=*)#*` (c) | 13.60 |
| 6 | `[N]1(-*)-*-*-1` (d) | 3.01 |
| 7 | `[NH](-*)-*` | 12.03 |
| 8 | `[NH]1-*-*-1` (d) | 21.94 |
| 9 | `[NH]=*` | 23.85 |
| 10 | `[NH2]-*` | 26.02 |
| 11 | `[N+](-*)(-*)(-*)-*` | 0.00 |
| 12 | `[N+](-*)(-*)=*` | 3.01 |
| 13 | `[N+](-*)#*` (e) | 4.36 |
| 14 | `[NH+](-*)(-*)-*` | 4.44 |
| 15 | `[NH+](-*)=*` | 13.97 |
| 16 | `[NH2+](-*)-*` | 16.61 |
| 17 | `[NH2+]=*` | 25.59 |
| 18 | `[NH3+]-*` | 27.64 |
| 19 | `[n](:*):*` | 12.89 |
| 20 | `[n](:*)(:*):*` | 4.41 |
| 21 | `[n](-*)(:*):*` | 4.93 |
| 22 | `[n](=*)(:*):*` (f) | 8.39 |
| 23 | `[nH](:*):*` | 15.79 |
| 24 | `[n+](:*)(:*):*` | 4.10 |
| 25 | `[n+](-*)(:*):*` | 3.88 |
| 26 | `[nH+](:*):*` | 14.14 |
| 27 | `[O](-*)-*` | 9.23 |
| 28 | `[O]1-*-*-1` (d) | 12.53 |
| 29 | `[O]=*` | 17.07 |
| 30 | `[OH]-*` | 20.23 |
| 31 | `[O-]-*` | 23.06 |
| 32 | `[o](:*):*` | 13.14 |
| 33 | `[S](-*)-*` | 25.30 |
| 34 | `[S]=*` | 32.09 |
| 35 | `[S](-*)(-*)=*` | 19.21 |
| 36 | `[S](-*)(-*)(=*)=*` | 8.38 |
| 37 | `[SH]-*` | 38.80 |
| 38 | `[s](:*):*` | 28.24 |
| 39 | `[s](=*)(:*):*` | 21.70 |
| 40 | `[P](-*)(-*)-*` | 13.59 |
| 41 | `[P](-*)=*` | 34.14 |
| 42 | `[P](-*)(-*)(-*)=*` | 9.81 |
| 43 | `[PH](-*)(-*)=*` | 23.47 |

An asterisk (*) stands for any non-hydrogen atom, - for a single bond, = for a double bond, # for a triple bond, and : for an aromatic bond; an atomic symbol in lowercase means that the atom is part of an aromatic system. (b) As in nitro group. (c) Middle nitrogen in azide group. (d) Atom in a three-membered ring. (e) Nitrogen in isocyano group. (f) As in pyridine N-oxide.


The TPSA of a molecule is determined by the summation of tabulated surface contributions of polar atom types as

$$
\mathrm{TPSA}_i = \sum_{j=1}^{m} G_j n_{ij} \tag{23}
$$

where the sum runs over the defined types of polar fragments (see Table 3), $n_{ij}$ is the frequency of the jth polar fragment type in the ith molecule, and $G_j$ is the surface contribution of the jth fragment type. The surface contributions were calculated by least-squares fitting of the TPSA-based fragments to the single-conformer 3D PSA of a training set consisting of 34 810 drug-like molecules taken from the World Drug Index database. The statistical parameters of the model are $R^2 = 0.982$ and $s = 7.83$.
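The summation of Equation (23) can be sketched in a few lines of Python. The dictionary holds a small subset of the contributions listed in Table 3, keyed by atom-type pattern; the fragment counts of the example molecules are assigned by hand and are an assumption of this sketch, since no substructure matching is performed here.

```python
# Subset of the polar atom-type contributions of Table 3 (pattern -> G)
PSA_CONTRIBUTION = {
    "[OH]-*": 20.23,       # hydroxyl oxygen
    "[O]=*": 17.07,        # doubly bonded oxygen
    "[O](-*)-*": 9.23,     # ether-type oxygen
    "[NH2]-*": 26.02,      # primary amine nitrogen
    "[N]#*": 23.79,        # nitrile nitrogen
}

def tpsa(fragment_counts):
    """Equation (23): sum of G_j * n_ij over the polar fragment types present."""
    return sum(PSA_CONTRIBUTION[pattern] * count
               for pattern, count in fragment_counts.items())

# Hand-assigned counts: a carboxylic acid group has one [OH]-* and one [O]=*
acid = tpsa({"[OH]-*": 1, "[O]=*": 1})
# An ester group has one [O]=* and one [O](-*)-*
ester = tpsa({"[O]=*": 1, "[O](-*)-*": 1})
print(round(acid, 2), round(ester, 2))
```

Only polar atom types contribute, so molecules with the same polar groups receive the same TPSA regardless of the size of their hydrocarbon skeleton.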

4.05.4.5 Cluster Significance Analysis

Cluster significance analysis (CSA) is at once a QSAR method and a variable selection method, proposed for determining which molecular descriptors of a set of compounds are associated with a biological response. The active compounds are expected to be similar to each other in the chemical space defined by the relevant descriptors and will therefore cluster.

This approach, originally proposed for binary response variables,109 was extended to quantitative biological responses, scaled between 0 and 1, under the name of generalized cluster significance analysis (GCSA).110

Let X be a data matrix of n rows (i.e., the compounds) and p columns (i.e., the descriptors) and y the vector of the n biological responses. The mean square distance $\mathrm{MSD}_j$ was proposed to measure the tightness of the cluster of active compounds with respect to each jth molecular descriptor:

$$
\mathrm{MSD}_j = \frac{\sum_{s=1}^{n-1} \sum_{t=s+1}^{n} y_s y_t \left( x_{sj} - x_{tj} \right)^2}{n(n-1)} \tag{24}
$$

where n is the number of compounds, $y_s$ and $y_t$ are the biological responses of compounds s and t, and $x_{sj}$ and $x_{tj}$ are the jth descriptor values of the two compounds. A small MSD value indicates that the considered descriptor has a good capability to cluster compounds with the same biological activity.

The MSD calculated as above is proportional to that calculated as

$$
\mathrm{MSD}_j = \sum_{i=1}^{n} y_i \left( x_{ij} - \bar{x}_{jW} \right)^2 \tag{25}
$$

where the weighted mean is calculated as

$$
\bar{x}_{jW} = \frac{\sum_{i=1}^{n} y_i x_{ij}}{\sum_{i=1}^{n} y_i} \tag{26}
$$

To obtain a statistical evaluation of the clustering capability of each descriptor, a test for significance is performed by randomly permuting the responses and recalculating the MSD values from the permuted responses; this calculation is repeated N times (e.g., N = 100 000). Then, for any given descriptor, the number $c_j$ of permutations giving a value less than or equal to $\mathrm{MSD}_j$ is used to obtain the significance level (‘p-value’) and the standard error s of this estimate:

$$
p_j = \frac{c_j}{N} \tag{27}
$$

$$
s_j = \sqrt{\frac{p_j \left( 1 - p_j \right)}{N}} \tag{28}
$$


The best descriptor is chosen based on the minimum p-value.

If some descriptors are being considered together, the corresponding MSD random values are added together, as are the corresponding actual MSD values, before the count is taken.

Therefore, the selection of the best subset model can be performed by forward stepwise selection starting from the variable with the lowest p-value (the current model); next, each of the variables that are not yet included in the current model is added to it in turn, producing a set of candidates with corresponding p-values. The candidate model with the lowest p-value is selected, and the process is repeated on the new current model.
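The CSA calculations for a single descriptor can be sketched in Python as follows. The two MSD functions implement the pairwise form and the weighted-mean form, and the permutation test returns the p-value and its standard error. The data are synthetic (three active compounds with tightly clustered descriptor values), and the number of permutations is kept small for speed; function names are hypothetical.

```python
import numpy as np

def msd_pairwise(x, y):
    """Pairwise form: response-weighted squared distances, divided by n(n - 1)."""
    n = len(x)
    total = 0.0
    for s in range(n - 1):
        for t in range(s + 1, n):
            total += y[s] * y[t] * (x[s] - x[t]) ** 2
    return total / (n * (n - 1))

def msd_weighted(x, y):
    """Weighted-mean form: response-weighted spread around the weighted mean."""
    x_w = np.sum(y * x) / np.sum(y)
    return np.sum(y * (x - x_w) ** 2)

def csa_p_value(x, y, n_perm=2000, seed=0):
    """Permutation test: fraction of permuted MSD values <= the observed MSD."""
    rng = np.random.default_rng(seed)
    observed = msd_weighted(x, y)
    count = 0
    for _ in range(n_perm):
        if msd_weighted(x, rng.permutation(y)) <= observed:
            count += 1
    p = count / n_perm
    return p, np.sqrt(p * (1.0 - p) / n_perm)

# Synthetic descriptor values; the first three compounds are the actives
x = np.array([0.10, 0.20, 0.15, 0.90, 0.50, 0.70, 0.30, 0.95])
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

n = len(x)
lhs = msd_pairwise(x, y) * n * (n - 1)      # numerator of the pairwise form
rhs = np.sum(y) * msd_weighted(x, y)        # sum of responses times weighted form
p, se = csa_p_value(x, y)
print(lhs, rhs, p, se)
```

The two printed sums agree, which makes the proportionality explicit: the pairwise numerator equals the weighted-mean form multiplied by the sum of the responses. Since permuting the responses leaves that sum unchanged, either form gives the same p-value, and the weighted-mean form is cheaper to evaluate.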

4.05.4.6 Read-Across Approach

Recently adopted to fill data gaps for chemicals, read-across111 is a nonformalized approach in which endpoint information for one chemical (called a ‘source chemical’) is used to predict the same endpoint for another chemical (called a ‘target chemical’) that is considered to be similar in some way (usually on the basis of structural similarity). In principle, read-across can be applied to characterize physicochemical properties, fate, human health effects, and ecotoxicity. It can be either qualitative or quantitative, depending on whether the data being used are categorical or numerical in nature.

To estimate the properties of a given substance, read-across can be performed in a one-to-one manner (one analog used to make an estimation) or in a many-to-one manner (two or more analogs used). Within the context of a chemical category, read-across can also be performed in a one-to-many or in a many-to-many manner.
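Since read-across is not a formalized method, the following Python sketch shows only one possible quantitative scheme, chosen here for illustration: source chemicals are ranked by Tanimoto similarity of their structural-fragment sets, and the target value is the similarity-weighted mean of the k most similar analogs. The fragment sets, endpoint values, and function names are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of structural fragments."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def read_across(target, sources, k=1):
    """Predict the target endpoint from its k most similar source chemicals.

    sources is a list of (fragment_set, endpoint_value) pairs;
    k = 1 is one-to-one read-across, k > 1 is many-to-one.
    """
    ranked = sorted(sources, key=lambda src: tanimoto(target, src[0]),
                    reverse=True)[:k]
    weights = [tanimoto(target, fragments) for fragments, _ in ranked]
    values = [value for _, value in ranked]
    if sum(weights) == 0:                       # no similar analog: plain mean
        return sum(values) / len(values)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical source chemicals: (fragments, endpoint value)
sources = [
    ({"ring", "OH"}, 3.2),
    ({"ring", "Cl", "NO2"}, 4.0),
    ({"COOH"}, 1.0),
]
target = {"ring", "OH", "Cl"}

one_to_one = read_across(target, sources, k=1)
many_to_one = read_across(target, sources, k=2)
print(one_to_one, many_to_one)
```

The choice of similarity measure, of k, and of the weighting is a modeling decision that should be justified case by case; a qualitative read-across would transfer a category label instead of averaging numerical values.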

4.05.5 Molecular Descriptors

In recent decades, much scientific research has focused on how to capture the information encoded in the molecular structure and convert it, by a theoretical pathway, into one or more numbers used to establish quantitative relationships between structures and properties, biological activities, or other experimental properties. Molecular descriptors are formally mathematical representations of a molecule obtained by a well-specified algorithm applied to a defined molecular representation or by a well-specified experimental procedure: “The molecular descriptor is the final result of a logic and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment.”112

Molecular descriptors play a fundamental role in chemistry, pharmaceutical sciences, environmental protection policy, toxicology, ecotoxicology, health research, and quality control. Evidence of the interest of the scientific community in molecular descriptors is provided by the huge number of descriptors proposed to date: more than 3000 descriptors112 derived from different theories and approaches are currently defined and computable by using dedicated software tools.

Each molecular descriptor takes into account a small part of the whole chemical information contained in the real molecule, and, as a consequence, the number of descriptors is continuously increasing with the increasing demand for deeper investigations of chemical and biological systems.

Different descriptors are different ways or perspectives of viewing a molecule, taking into account the various features of its chemical structure. Molecular descriptors have by now become some of the most important variables used in molecular modeling and, consequently, are handled by statistics, chemometrics, and chemoinformatics.

The availability of molecular descriptors has not only offered a new opportunity to search for new relationships but has also greatly changed the research paradigm in this field: the use of molecular descriptors calculated from theory has made it possible, for the first time, to link experimental knowledge to theoretical information arising from the molecular structure. Whereas until the 1960s–70s molecular modeling mainly consisted of searching for mathematical relationships between experimentally measured quantities, nowadays it is mainly performed by searching for relationships between a measured property and molecular descriptors able to capture structural chemical information (Figure 2).

A general consideration about the use of molecular descriptors in modeling problems concerns their information content. This depends on the kind of molecular representation used and on the algorithm defined for its calculation. There are simple molecular descriptors derived by counting some atom types or structural fragments in the molecule, as well as physicochemical and bulk properties such as molecular weight, number of hydrogen bond donors/acceptors, and number of OH-groups.

Other molecular descriptors are derived from algorithms applied to a topological representation and are usually called topological or 2D descriptors. Others are derived from the spatial (x, y, z) coordinates of the molecule and are usually called geometrical or 3D descriptors; another class, called 4D descriptors, is derived from the interaction energies between the molecule, embedded in a grid, and some probe.

In Figure 7, a (very) simplified scheme of the major classes of molecular descriptors is shown.

It is true that geometrical 3D/4D descriptors have a higher information content than simpler descriptors, such as counting descriptors or topological descriptors, which often show relevant levels of degeneracy. Hence, many people think that it is better to use the most informative descriptors in all modeling processes. This is incorrect because the ‘best descriptors’ are those whose information content is comparable with the information content of the response being modeled. In effect, too much information in the independent variables (the descriptors) with respect to the response is often seen as noise by the model, thus giving unstable or nonpredictive models. For example, a property whose values are equal or similar for isomeric structures is better modeled by a simple descriptor with degenerate values for isomeric structures. In this case, descriptors able to discriminate among the isomeric structures carry redundant information that cannot be integrated in the model. In conclusion, a best descriptor (or set of descriptors) valid for all problems does not exist.

In general, molecular descriptors, besides the trivial invariance properties, should satisfy some basic requirements. A list of desirable requirements for chemical descriptors suggested by Randić113 is shown in Table 4.

Many software packages calculate wide sets of different theoretical descriptors from SMILES, 2D graphs, or 3D x, y, z spatial coordinates. Some of the most popular are mentioned here: ADAPT,114 OASIS,115 CODESSA,116 MolConn-Z,117 and DRAGON.118 A website wholly dedicated to molecular descriptors was created in 2007 by the Milano Chemometrics and QSAR Research Group (http://www.moleculardescriptors.eu), where, together with information about software and books, news and tutorials concerning molecular descriptors are provided.

Figure 7 General scheme of the different sources of molecular descriptors. The scheme relates each molecular representation to its descriptor classes: atom list (0D; counting, summing); substructure list (1D; counting, structural keys); molecular graph (2D; graph invariants: topostructural descriptors, topochemical descriptors, topographic descriptors, topological information indices); molecular geometry, x, y, z coordinates (3D; geometrical, quantum-chemical, steric/bulk, and molecular surface descriptors); interaction energy values (4D; grid-based QSAR techniques).

4.05.5.1 Molecular Structure Representations

The molecular representation is the way in which a molecule, that is, a phenomenological real body, is symbolically represented by a specific formal procedure and conventional rules. The quantity of chemical information that is transferred to the symbolic representation of the molecule depends on the kind of representation.119,120

The simplest molecular representation is the chemical formula, which is the list of the different atom types, each accompanied by a subscript representing the number of occurrences of that atom in the molecule. For example, the chemical formula of p-chlorotoluene is C7H7Cl, indicating the presence in the molecule of 15 atoms: $N_C = 7$, $N_H = 7$, and $N_{Cl} = 1$. This representation is independent of any knowledge concerning the molecular structure, and hence, molecular descriptors obtained from the chemical formula can be called 0D descriptors. Examples are the atom number, molecular weight, atom-type counts, and, in general, constitutional descriptors and any function of the atomic properties.
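The calculation of 0D descriptors from a chemical formula can be sketched in Python as follows. The parser handles only simple formulas without parentheses or charges, and the atomic masses are rounded standard values included for illustration; both the function name and the mass table are assumptions of this sketch.

```python
import re

# Rounded standard atomic masses for a few elements (illustrative subset)
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "Cl": 35.45}

def parse_formula(formula):
    """Return atom-type counts for a simple formula such as 'C7H7Cl'."""
    counts = {}
    # An element symbol is an uppercase letter optionally followed by a lowercase one
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

counts = parse_formula("C7H7Cl")                 # p-chlorotoluene
atom_number = sum(counts.values())               # total number of atoms
molecular_weight = sum(ATOMIC_MASS[s] * k for s, k in counts.items())
print(counts, atom_number, round(molecular_weight, 3))
```

All isomers sharing the formula C7H7Cl receive identical values for these descriptors, which is the degeneracy expected of 0D descriptors.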

The atomic properties constitute the weights used to characterize molecule atoms; the most common atomic properties are atomic mass, atomic charge, covalent and van der Waals radii, atomic polarizability, and hydrophobic atomic constants.

The substructure list representation can be considered as a 1D representation of a molecule and consists of a list of structural fragments of a molecule; the list can be only a partial list of fragments, functional groups, or substituents of interest, thus not requiring a complete knowledge of the molecule structure. The descriptors derived by this representation can be referred to as 1D descriptors and are typically used in substructural analysis and substructure searching with a common name of molecular fingerprints.

The 2D representation of a molecule considers how the atoms are connected, that is, it defines the connectivity of atoms in the molecule in terms of the presence and nature of chemical bonds. Approaches based on the molecular graph allow a 2D representation of a molecule, usually known as topological representation.

A molecular graph is usually denoted as G = (V, E), where V is a set of vertices which correspond to the molecule atoms and E is a set of elements representing the binary relationship between pairs of vertices; unordered vertex pairs are called edges, which correspond to bonds between atoms.

A molecular graph obtained excluding all the hydrogen atoms is called H-depleted molecular graph, whereas a molecular graph where hydrogen atoms are also included is called H-filled molecular graph (or, simply, molecular graph). In Figure 8, examples of H-depleted molecular graphs are given for 2-methyl-3-butenoic acid, 1-ethyl-2-methylcyclobutane, and 5-methyl-1,3,4-oxathiazol-2-one.

Table 4 List of desirable requirements for molecular descriptors

| No. | Requirement |
|---|---|
| 1 | Should have structural interpretation |
| 2 | Should have good correlation with at least one property |
| 3 | Should preferably discriminate among isomers |
| 4 | Should be possible to apply to local structure |
| 5 | Should be possible to generalize to ‘higher’ descriptors |
| 6 | Should preferably be independent |
| 7 | Should be simple |
| 8 | Should not be based on properties |
| 9 | Should not be trivially related to other descriptors |
| 10 | Should be possible to construct efficiently |
| 11 | Should use familiar structural concepts |
| 12 | Should have the correct size dependence |
| 13 | Should change gradually with gradual change in structures |


The molecular graph depicts the connectivity of atoms in a molecule irrespective of the metric parameters such as equilibrium interatomic distances between nuclei, bond angles, and torsion angles. Thus, a molecular graph is a topological representation of the molecule, and it is from this that many molecular descriptors are derived. These are 2D descriptors and usually are graph invariants known as TIs.

Two-dimensional representations alternative to the molecular graph are the linear notation systems, such as the Wiswesser Line Notation (WLN) system121 and the SMILES notation.122

The 3D representation views a molecule as a rigid geometrical object in space and allows a representation not only of the nature and connectivity of the atoms but also of the overall spatial configuration of the molecule. This representation of a molecule is called geometrical representation and defines a molecule in terms of the atom types constituting the molecule and the set of (x, y, z) coordinates associated with each atom. Figure 9 shows a geometrical representation of lactic acid. Molecular descriptors derived from this representation are called 3D descriptors, and examples are the geometrical descriptors, several steric descriptors, and size descriptors.

Several molecular descriptors derive from multiple molecular representations and are therefore difficult to classify. For example, graph invariants derived from a molecular graph weighted by properties obtained by computational chemistry are both 2D and 3D descriptors.

The bulk representation of a molecule describes the molecule in terms of a physical object with 3D attributes such as bulk and steric properties, surface area, and volume.

The stereoelectronic representation (or lattice representation) of a molecule is a molecular description related to those molecular properties arising from electron distribution and from the interaction of the molecule with probes characterizing the space surrounding it (e.g., MIFs). This representation is typical of the GRID-based QSAR techniques. Descriptors at this level can be considered 4D descriptors, being characterized by a scalar field, that is, a lattice of scalar numbers, derived from the 3D molecular geometry (Figure 10).

Finally, the stereodynamic representation of a molecule is a time-dependent representation that adds structural properties to the 3D representations, such as flexibility, conformational behavior, and transport properties. Dynamic QSAR is an example of a multiconformational approach.123,124

4.05.5.2 0D Descriptors or Count Descriptors

All the molecular descriptors for which no information about molecular structure and atom connectivities is needed belong to the class of 0D descriptors. Atom and bond counts, as well as sums or averages of the atomic properties, are typical of this class of descriptors. These descriptors can always be easily calculated, are naturally interpreted, do not require optimization of the molecular structure, and are independent of any conformational problem. They usually show a very high degeneration, that is, they have equal values for several molecules, such as isomers. Their information content is low, but nevertheless they can play an important role in modeling several physicochemical properties or take part in more complex models.

Figure 8 Some molecular graph representations of molecules.

Figure 9 The 3D structure representation of a molecule.

4.05.5.3 1D Descriptors or Fingerprints

All the molecular descriptors that can be calculated from substructural information about the molecule belong to the 1D descriptors. Counts of functional groups and substructure fragments, as well as atom-centered descriptors, are the best-known 1D descriptors. These descriptors are often presented as a fingerprint, that is, a binary vector where 1 indicates the presence of the defined substructure and 0 its absence. A relevant advantage in describing molecules by fingerprints is the possibility of performing quick calculations for molecule similarity/diversity problems.
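The binary-vector idea can be illustrated with a small sketch. The fragment dictionary `KEYS` and the hand-assigned fragment sets are hypothetical, and the Tanimoto coefficient is used here only as one common similarity measure for binary vectors; the text above does not prescribe a specific coefficient.

```python
def fingerprint(present_fragments, fragment_keys):
    """Binary vector: 1 if the fragment is present in the molecule, 0 otherwise."""
    return [1 if key in present_fragments else 0 for key in fragment_keys]


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: bits set in both vectors / bits set in at least one."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 1.0


# Hypothetical fragment dictionary and hand-assigned fragment sets.
KEYS = ["COOH", "OH", "C=C", "C=O", "NH2", "ring"]
lactic_acid = fingerprint({"COOH", "OH", "C=O"}, KEYS)
butenoic_acid = fingerprint({"COOH", "C=C", "C=O"}, KEYS)

similarity = tanimoto(lactic_acid, butenoic_acid)
print(lactic_acid, butenoic_acid, similarity)
```

Because only bitwise operations and counts are involved, such comparisons scale well to large collections of molecules.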

Like 0D descriptors, these descriptors can usually be easily calculated, are naturally interpreted, do not require optimization of the molecular structure, and are independent of any conformational problem. They usually show a medium-high degeneration and are often very useful in modeling both physicochemical and biological properties.

4.05.5.4 2D Descriptors or Topological Descriptors

TIs are molecular descriptors based on a graph representation of the molecule and represent graph-theoretical properties that are preserved by isomorphism, that is, properties with identical values for isomorphic graphs. A graph invariant may be a characteristic polynomial, a sequence of numbers, or a single numerical index obtained by the application of algebraic operators to matrices representing molecular graphs and whose values are independent of vertex numbering or labeling.

TIs are usually derived from an H-depleted molecular graph. They can be sensitive to one or more structural features of the molecule such as size, shape, symmetry, branching, and cyclicity and can also encode chemical information concerning atom type and bond multiplicity. In fact, TIs are usually divided into two categories: topostructural and topochemical indices.125 Topostructural indices encode only information on the adjacency and distance of atoms in the molecular structure; topochemical indices quantify information on topology but also specific chemical properties of atoms such as their chemical identity and hybridization state.

Figure 10 A lattice of grid points with an embedded molecule.


Topological information indices are graph invariants, based on information theory and calculated as information content of specified equivalence relationships on the molecular graph.

In general, TIs do not uniquely characterize molecular topology; different structures may have some of the same TIs. A consequence of this nonuniqueness is that TIs do not, in general, allow the molecule to be reconstructed.

There are several ways to obtain topological descriptors. Simple TIs consist in the counting of some specific graph elements; examples are the Hosoya Z index,126 path counts,127 walk counts, self-returning walk counts,28 Kier shape descriptors,128 and path/walk shape indices.129 However, the most common TIs are derived by applying some algebraic operators (e.g., the Wiener operator) to a matrix representation of the molecular structure, such as adjacency and distance matrices. Among them are the Wiener index,130 spectral indices,131 and Harary indices.132
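As an illustration of an index obtained from a matrix representation, the following sketch computes the Wiener index as the sum of the topological distances over all vertex pairs, that is, half the sum of the entries of the distance matrix. It assumes a connected, unweighted H-depleted graph; the function names and the vertex numbering of the two butane isomers are illustrative assumptions.

```python
from collections import deque


def distance_matrix(n_vertices, edges):
    """Topological distance matrix (shortest path lengths in bonds) by breadth-first search."""
    neighbours = [[] for _ in range(n_vertices)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    matrix = []
    for start in range(n_vertices):
        dist = [-1] * n_vertices          # -1 marks vertices not reached yet
        dist[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        matrix.append(dist)
    return matrix


def wiener_index(n_vertices, edges):
    """Sum of topological distances over all unordered vertex pairs (connected graph assumed)."""
    d = distance_matrix(n_vertices, edges)
    return sum(d[i][j] for i in range(n_vertices) for j in range(i + 1, n_vertices))


n_butane = [(0, 1), (1, 2), (2, 3)]    # C-C-C-C chain
isobutane = [(0, 1), (1, 2), (1, 3)]   # central carbon bonded to three methyls
print(wiener_index(4, n_butane), wiener_index(4, isobutane))
```

The two isomers have the same atom and bond counts (identical 0D descriptors) but different Wiener indices, which shows the lower degeneration of topological descriptors.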

Molecular matrices are the most common mathematical tool to encode structural information of molecules. Very popular molecular matrices are the graph-theoretical matrices, a huge number of which were proposed in the last decades in order to derive TIs and describe molecules from a topological point of view. Graph-theoretical matrices are matrices derived from a molecular graph G (often from an H-depleted molecular graph). A comprehensive collection of graph-theoretical matrices is reported by Janežič et al.133 Vertex matrices are undoubtedly the graph-theoretical matrices most frequently used for characterizing a molecular graph. The matrix entries encode different information about pairs of vertices such as their connectivities, topological distances, and sums of the weights of the atoms along the connecting paths; the diagonal entries can encode chemical information about the vertices. From vertex matrices a huge number of TIs were proposed.

Other topological molecular descriptors can be obtained by applying suitable functions to local vertex invariants (LOVIs), these being numerical representations of the atoms derived from molecular graphs. The most common functions are atom and/or bond additive, resulting in descriptors that correlate well with physicochemical properties that are atom and/or bond additive themselves. For example, the Zagreb indices,31 the Randić connectivity index,134 the related higher-order connectivity indices,135 and the Balaban distance connectivity indices136 are derived according to this approach.
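A short sketch of this additive construction uses the vertex degree as the local vertex invariant. The standard definitions are assumed here: the first Zagreb index as the sum of squared vertex degrees, the second Zagreb index as the sum over bonds of the product of the degrees of the bonded atoms, and the Randić connectivity index as the sum over bonds of the reciprocal square root of that product. Function names are illustrative.

```python
import math


def vertex_degrees(n_vertices, edges):
    """Local vertex invariant: number of bonded non-hydrogen neighbours."""
    degrees = [0] * n_vertices
    for i, j in edges:
        degrees[i] += 1
        degrees[j] += 1
    return degrees


def zagreb_and_randic(n_vertices, edges):
    """Return (first Zagreb index, second Zagreb index, Randic connectivity index)."""
    deg = vertex_degrees(n_vertices, edges)
    m1 = sum(d * d for d in deg)                                    # atom-additive
    m2 = sum(deg[i] * deg[j] for i, j in edges)                     # bond-additive
    chi = sum(1.0 / math.sqrt(deg[i] * deg[j]) for i, j in edges)   # bond-additive
    return m1, m2, chi


n_butane = [(0, 1), (1, 2), (2, 3)]
isobutane = [(0, 1), (1, 2), (1, 3)]
print(zagreb_and_randic(4, n_butane))
print(zagreb_and_randic(4, isobutane))
```

Each index is a plain sum of atom or bond contributions, which is why such descriptors tend to track properties that are themselves approximately additive.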

Particular TIs are derived from weighted molecular graphs where vertices and/or edges are weighted by quantities representing some 3D features of the molecule, like those obtained by computational chemistry. The graph invariants obtained in this way encode both information on molecular topology and molecular geometry. BCUT descriptors137 are an example of these topological descriptors. Graph invariants have been successfully applied in characterizing the structural similarity/diversity of molecules and in QSAR/QSPR modeling.

4.05.5.5 3D Descriptors or Geometrical Descriptors

Another class of molecular descriptors, called geometrical or 3D descriptors, is derived from a geometrical representation of the molecule, that is, from the x, y, z Cartesian coordinates of the molecule atoms. Some of the best-known geometrical descriptors are briefly presented here.

WHIM56 descriptors are molecular descriptors based on statistical indices calculated on the projections of the atoms along principal axes of the molecule. They are built in such a way as to capture relevant molecular 3D information regarding molecular size, shape, symmetry, and atom distribution with respect to invariant reference frames. The algorithm consists in performing a PCA on the centered Cartesian coordinates of a molecule by using a weighted covariance matrix obtained from different weighting schemes for the atoms. For each weighting scheme, a set of statistical indices is calculated on the atoms projected onto each principal component, that is, the scores.

Gravitational indices57 are geometrical descriptors reflecting the mass distribution in a molecule, defined as

$$G_1 = \sum_{i=1}^{A-1} \sum_{j=i+1}^{A} \frac{m_i m_j}{r_{ij}^{2}} \qquad (29)$$

$$G_2 = \sum_{b=1}^{B} \left( \frac{m_i m_j}{r_{ij}^{2}} \right)_b \qquad (30)$$


where m_i and m_j are the atomic masses of the considered atoms, r_ij the corresponding interatomic distances, and A and B the number of atoms and bonds of the molecule, respectively. The G1 index takes into account all atom pairs in the molecule, whereas the G2 index is restricted to pairs of bonded atoms. These indices are related to the bulk cohesiveness of the molecules, accounting simultaneously for both atomic masses (volumes) and their distribution within the molecular space.
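Equations (29) and (30) can be sketched directly from the atomic masses, the Cartesian coordinates, and the bond list. The three-atom input below is a toy system with made-up masses and coordinates, chosen only so that the arithmetic is easy to follow; the function name is hypothetical.

```python
from itertools import combinations


def gravitational_indices(masses, coords, bonds):
    """Return (G1, G2): G1 sums over all atom pairs, G2 only over bonded pairs."""

    def pair_term(i, j):
        # m_i * m_j / r_ij^2, with r_ij^2 the squared Euclidean distance
        r_squared = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
        return masses[i] * masses[j] / r_squared

    g1 = sum(pair_term(i, j) for i, j in combinations(range(len(masses)), 2))
    g2 = sum(pair_term(i, j) for i, j in bonds)
    return g1, g2


# Toy linear system (illustrative values, not a real molecule).
masses = [1.0, 2.0, 3.0]
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
bonds = [(0, 1), (1, 2)]

g1, g2 = gravitational_indices(masses, coords, bonds)
print(g1, g2)
```

For this toy input the bonded pairs contribute 2 and 1.5, and the single nonbonded pair adds 1/3 to G1 only.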

EVA descriptors58 were proposed to extract chemical structural information from mid- and near-infrared spectra. The approach is to use, as a multivariate descriptor, the vibrational frequencies of a molecule, a fundamental molecular property characterized reliably and easily from the potential energy function. The EVA descriptor is a function of the eigenvalues obtained from the normal coordinate matrix; these eigenvalues correspond to the fundamental vibrational frequencies of the molecule, which can be calculated using standard quantum or molecular mechanical methods from computational chemistry.

The EEVA descriptors60 are analogous to the EVA descriptors, but semiempirical molecular orbital energies, that is, the eigenvalues of the Schrödinger equation, are used instead of the vibrational frequencies of the molecule.

3D-MoRSE descriptors59 are based on the idea of obtaining information from the 3D atomic coordinates by the transform used in electron diffraction studies for preparing theoretical scattering curves. The derived expression is the following:

$$I(s) = \sum_{i=1}^{A-1} \sum_{j=i+1}^{A} w_i w_j \frac{\sin(s\, r_{ij})}{s\, r_{ij}} \qquad (31)$$

where I(s) is the scattered electron intensity, w an atomic property (e.g., the atomic number), r_ij the interatomic distance between the ith and the jth atoms, and A the number of atoms. Radial distribution function (RDF) descriptors138 are based on the distance distribution in the geometrical representation of a molecule and constitute an RDF code that shows certain characteristics in common with the 3D-MoRSE descriptors.
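A minimal sketch of Equation (31) evaluates the signal at a chosen scattering parameter s. The geometry and weights below are illustrative (atomic numbers on an idealized linear triatomic with unit bond lengths), the function name is hypothetical, and the limit value sin(x)/x = 1 is used at s = 0.

```python
import math
from itertools import combinations


def morse_signal(s, weights, coords):
    """I(s) = sum over atom pairs of w_i * w_j * sin(s * r_ij) / (s * r_ij)."""
    total = 0.0
    for i, j in combinations(range(len(weights)), 2):
        x = s * math.dist(coords[i], coords[j])
        sinc = 1.0 if x == 0.0 else math.sin(x) / x   # limit of sin(x)/x at x = 0
        total += weights[i] * weights[j] * sinc
    return total


# Illustrative input: atomic numbers as weights, idealized unit bond lengths.
weights = [6, 8, 8]
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]

# A descriptor vector is obtained by sampling I(s) at several values of s.
signal = [morse_signal(s, weights, coords) for s in (0.0, 1.0, 2.0, 3.0)]
print(signal)
```

Sampling s on a fixed grid gives a fixed-length vector regardless of the number of atoms, which is what makes the transform usable as a descriptor.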

The GETAWAY61 descriptors are derived from the molecular influence matrix (H), which is a representation of the molecular structure, defined as

$$\mathbf{H} = \mathbf{M} \left( \mathbf{M}^{\mathrm{T}} \mathbf{M} \right)^{-1} \mathbf{M}^{\mathrm{T}} \qquad (32)$$

where M is the molecular matrix consisting of the centered Cartesian coordinates x, y, z of the molecule atoms in a chosen conformation. Atomic coordinates are assumed to be calculated with respect to the geometrical center of the molecule in order to obtain translational invariance. The molecular influence matrix is a symmetric A × A matrix, where A represents the number of atoms, and shows rotational invariance with respect to the molecule coordinates, thus being independent of molecule alignment rules. The diagonal elements h_ii of this matrix range from 0 to 1 and encode atomic information related to the ‘influence’ of each molecule atom in determining the whole shape of the molecule; in effect, mantle atoms always have higher h_ii values than atoms near the molecule center. GETAWAY descriptors are obtained by using double-weighted autocorrelation functions, where one weighting scheme is the leverage and the other an atomic property (e.g., atomic mass).
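The properties stated for H can be checked numerically with a short NumPy sketch of Equation (32). The coordinates are arbitrary illustrative values for a five-atom, nonplanar arrangement, and the function name is hypothetical. The pseudo-inverse is used in place of the plain inverse as a design choice of this sketch, so that planar or linear geometries (where M^T M is singular) do not fail.

```python
import numpy as np


def molecular_influence_matrix(coords):
    """H = M (M^T M)^-1 M^T, with M the coordinates centered on the geometrical center."""
    m = np.asarray(coords, dtype=float)
    m = m - m.mean(axis=0)                       # centering gives translational invariance
    return m @ np.linalg.pinv(m.T @ m) @ m.T     # pinv: sketch-level safeguard for singular cases


# Arbitrary illustrative coordinates (five atoms, not coplanar).
coords = [
    (0.0, 0.0, 0.0),
    (1.5, 0.0, 0.0),
    (2.0, 1.4, 0.0),
    (3.4, 1.6, 0.5),
    (-0.6, -0.9, 1.1),
]

H = molecular_influence_matrix(coords)
leverages = np.diag(H)   # h_ii: the 'influence' of each atom on the molecular shape
print(np.round(leverages, 3))
```

Rotating the coordinates before calling the function leaves H unchanged, which is the alignment independence mentioned above.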

As a geometrical representation involves the knowledge of the relative positions of the atoms in 3D space, that is, the (x, y, z) atomic coordinates of the molecule atoms, geometrical descriptors usually provide more information and discrimination power than topological descriptors, also for similar molecular structures and molecule conformations. Despite their high information content, geometrical descriptors usually show some drawbacks. They require geometry optimization and therefore entail a computational overhead. Moreover, for flexible molecules, several molecule conformations are available: on one hand, new information is available and can be exploited, but, on the other hand, the problem complexity can significantly increase.

For these reasons, topological descriptors, fingerprints based on fragment counts, and other simple descriptors are usually preferred for the screening of large databases of molecules. On the contrary, searching for relationships between molecular structures and complex properties, such as biological activities, can often efficiently be performed by using geometrical descriptors, exploiting their large information content. Moreover, it is important to remember that the biologically active conformation of the studied chemicals is seldom known. Some authors overcome this problem by using a multiconformation dynamic approach.123,124


4.05.5.6 4D Descriptors or Grid-Based Descriptors

GRID62 and CoMFA63 approaches were the first methods based not uniquely on the molecular structure but on the calculation of the interaction energy between molecule and probe. The focus of these approaches is to identify and characterize quantitatively the interactions between the molecule and the receptor’s active site.

They place the molecules in a 3D lattice constituted by several thousands of evenly spaced grid points and use a probe (steric, electrostatic, hydrophilic, etc.) to map the surface of the molecule on the basis of the molecule interaction with the probe.

QSAR models are obtained by the application of PLS regression to the interaction field matrix. It should be noted that the use of the grid points as molecular descriptors requires the careful step of aligning the considered molecules in such a way that each of the thousands of grid points represents, for all the molecules, the same kind of information and not spurious information because of the lack of invariance in the rotation of the molecules in the grid.

Besides the two most popular methods, GRID and CoMFA, the other known methods based on this approach are CoMSIA,64 Compass,65 G-WHIM descriptors,66 Voronoi Field Analysis,67 SOMFA,139 VolSurf descriptors,68 and GRIND.69 Although these descriptors are often called 3D descriptors, they can be more properly called 4D descriptors (or grid-based descriptors) because another source of information, given by the interaction energy with a specific probe at each point of a 3D grid embedding the molecule, is added to the geometrical information. Therefore, the molecular descriptors are the MIFs generated by probes. These scalar fields can be efficiently visualized and used to think visually about new drug candidates, thus proving very helpful in the drug discovery process.140,141

An advantage of these approaches is that final results show where and how to modify the compounds to reach the desired values of the studied molecular property. On the other hand, a drawback is the need for molecular alignment in order to achieve molecular comparability and the selection of the most appropriate conformation.

The alignment determines to what extent the descriptors differ from one molecule to the next. Consequently, it substantially influences the results of the evaluation. Hence, significant and relevant results can only be expected if the alignment was carried out properly and unambiguously. Often, the need for an alignment limits the application of certain descriptors to homogeneous data sets, and even then the alignment is not always easily performed. As a consequence, different research groups started to develop alignment-independent molecular descriptors. The first set of descriptors based on scalar fields but alignment independent were G-WHIM descriptors,66 based on the theoretical principles of the WHIM descriptors56 but applied to the MIFs. VolSurf68 and GRIND69 descriptors are also independent of any previous alignment of the molecules.

4.05.6 Molecular Descriptor Selection

In the last few years, the scientific community has paid great attention to variable selection techniques, namely the selection of molecular descriptors in QSARs. As there are thousands of descriptors available for describing a molecule and often there is no a priori knowledge about which molecular features are more responsible for a specific property, subsets of the most appropriate descriptors are searched for by using different strategies.

It is now indisputable that model reliability is affected not only by the presence of noise and of correlated or redundant descriptors, but also by the presence of irrelevant descriptors. Therefore, variable selection techniques are largely used to remedy this situation and improve the accuracy and the prediction power of classification or regression models.

The exhaustive search, sometimes called all possible models (APMs), can be applied only in the simplest cases, the search space becoming impractically large when there are many molecular descriptors: in effect, given p candidate variables, the number of APMs containing a number k of variables between 1 and V (V < p) is

$$\text{Total number of models} = \sum_{k=1}^{V} \frac{p!}{k!\,(p-k)!} < 2^{p} \qquad (33)$$


The total maximum number of models obtained from p candidate variables is exactly 2^p − 1 (the model with zero variables is not counted, it being not very useful). For example, given 50 candidate variables, the total number of possible models containing from 1 to 5 variables is 2 369 935. Two main approaches can be used for extracting nonredundant but relevant variables from the pool of available variables: the variable reduction and the variable subset selection (VSS).
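The counts quoted above follow directly from Equation (33) and can be reproduced with a few lines of Python; the function name `number_of_models` is illustrative.

```python
from math import comb


def number_of_models(p, v_max):
    """Number of variable subsets of size 1..v_max drawn from p candidate variables."""
    return sum(comb(p, k) for k in range(1, v_max + 1))


print(number_of_models(50, 5))    # models with 1 to 5 variables out of 50
print(number_of_models(50, 50))   # all nonempty subsets, equal to 2**50 - 1
```

Even with the model size capped at five variables, the search space already exceeds two million models, which motivates the heuristic search strategies discussed in this section.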

4.05.6.1 Variable Reduction

When the molecular descriptors to be used in a model are chosen on the basis of general principles and not accounting for a specific goal (i.e., some experimental property to model), the term ‘variable reduction’ can be more properly used than variable selection. By variable reduction techniques, molecular descriptors are chosen by comparison among the descriptors themselves regardless of the specific molecular property that needs to be modeled.

For instance, descriptors can be selected on the basis of their information content (e.g., the Shannon entropy or the most commonly used standardized entropy): descriptors with high information content are more effective in discriminating different molecules and, thus, are expected to be more effective in modeling any property of molecules. Moreover, in order to avoid redundant information, the check of descriptor pairwise correlations is advisable. One of the two descriptors having an absolute correlation value higher than a predefined threshold (often selected in the range 0.90–0.99) has to be discarded, but which of the two descriptors should be deleted from the data set? A good solution may be to delete the descriptor showing the highest average correlation with the other descriptors in the data set or the lowest variance or entropy.
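The pairwise-correlation rule can be sketched as follows. This is one possible reading of the procedure, not a reference implementation: the most correlated pair is examined first, and the member with the higher average absolute correlation with the other retained descriptors is dropped. The function name, the default threshold, and the small data matrix are illustrative assumptions; constant columns are assumed to have been removed beforehand, since their correlation is undefined.

```python
import numpy as np


def reduce_by_correlation(X, threshold=0.95):
    """Return the indices of the columns kept after pairwise-correlation filtering."""
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        R = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(R, 0.0)
        i, j = np.unravel_index(np.argmax(R), R.shape)   # most correlated pair
        if R[i, j] <= threshold:
            break                                        # no pair above the threshold
        mean_corr = R.sum(axis=1) / (len(keep) - 1)      # average |r| with the others
        drop = i if mean_corr[i] >= mean_corr[j] else j
        del keep[drop]
    return keep


# Illustrative data: column 1 is nearly a multiple of column 0, column 2 is not.
X = np.array([
    [1.0, 2.1, 3.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 7.8, 1.0],
    [5.0, 10.1, 5.0],
    [6.0, 11.9, 9.0],
])
kept = reduce_by_correlation(X, threshold=0.95)
print(kept)
```

Replacing the average-correlation tie-break by the lowest variance or entropy, as also suggested above, only changes the line that chooses `drop`.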

Variable reduction can also be performed by multivariate techniques such as PCA-based feature selection, iteratively deleting variables with the largest loadings in the last components or retaining the variables with the largest loadings in the first components.142,143

Moreover, all the clustering methods applied on the transposed matrix of the original data, where descriptors become rows and molecules columns, can be used for variable reduction purposes. Representative descriptors are chosen (one or more) from each cluster. Together with the classical clustering methods (k-means, Jarvis-Patrick method, hierarchical clustering, etc.), the SOMs, such as the Kohonen maps, are nowadays very popular because of their efficiency and simple use. Optimal design techniques, such as D-optimal design, can also be used for the same purposes.

Other variable reduction techniques are based on the ranking of the molecular descriptors according to their global correlation with the other descriptors in the data set. In this regard, the method called K-correlation analysis exploits the K-multivariate correlation index for the iterative ranking from the most correlated to the least correlated descriptors.144,145 In this approach, the K-correlation of the data set, constituted by p – 1 variables after deleting one variable, is calculated and such value is attributed to the excluded variable. The procedure is repeated excluding in turn one variable at a time. Then, the variable showing the lowest K-value is definitively eliminated. The whole procedure is repeated on the remaining variables until only two variables remain. In other words, at each step the variable that shows the highest global correlation with all the other variables is excluded from the data set.

4.05.6.2 Variable Subset Selection

VSS techniques, unlike variable reduction techniques, take into account the specific property to be modeled. For instance, in regression analysis, these techniques aim at finding the subset of molecular descriptors that lead to the best predictive model for the studied property.

A number of different variable selection techniques are nowadays available: besides the classical SWR, proposed by Efroymson146 in the late 1960s and based on alternating forward selections and backward eliminations, other more powerful techniques were devised and are largely used for variable selection purposes. Owing to the huge number of possible combinations of descriptors, this high-complexity problem is often solved by machine learning techniques: Genetic Algorithms (GAs),147,148 Simulated Annealing (SA),149 Tabu Search (TS),150 and Evolutionary Programming (EP)151 are the most common in QSAR research. More recent techniques, and thus less known, are Artificial Ants,152 Particle Swarm,153 and an approach based on Projection Pursuit using robust estimators.154

Several modifications of the original PLS regression method were also proposed with the aim of performing variable selection and, among them, Iterative Variable Selection for PLS (IVS-PLS)155,156 and Uninformative Variable Elimination by PLS (UVE-PLS)157 are the most popular.

The general approach to the VSS is shown in Figure 11. The first step (A) is the definition of the algorithm performing the selection of one or more variables within the whole set of candidate variables. This step can be performed by selecting the variables by a random strategy or by using a genetic strategy (based on repeated reproduction and mutation steps), or other approaches. Then, from each subset of variables, a model is calculated.

The second step (B) is the evaluation of the quality associated with each model by using proper optimization functions (often called fitness functions). In this phase, both the method for estimating the models and the fitness function to be optimized have been previously defined.

The most popular regression methods used in the model estimation are OLS (or MLR), PLS, BP-ANN, and the k-NN estimator.

In regression studies, the most popular fitness function is the prediction ability (Q2) based on leave-one-out (LOO) or leave-more-out (LMO), even if the LOO procedure is the most common during the model selection phase.

However, the acceptability of a final regression model (step C) should not be evaluated simply by looking at its prediction ability but by also considering additional rules. For instance, models whose differences between R2 and Q2 (obtained by the LOO procedure) are too large158 should be rejected because a significant decrease in the prediction ability can be expected in their practical use on new chemicals. Therefore, in order to prevent the acceptance of models that are not truly predictive and/or are chance correlated, severe optimization functions need to be used, such as the AIC index,159 the LOF function,160 the FIT function,161 and the RQK functions,162 these last including more than one rule for the model acceptability.
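The comparison between R2 and Q2 (LOO) for an OLS model can be sketched as below. The sketch assumes the common convention Q2 = 1 − PRESS/TSS, with TSS taken around the mean of all responses, and uses the exact OLS identity that the leave-one-out residual equals the ordinary residual divided by (1 − h_ii). The function name, the data, and the acceptance threshold `max_gap` are illustrative assumptions, not values taken from the text.

```python
import numpy as np


def r2_and_q2_loo(X, y):
    """Return (R2, Q2 leave-one-out) for an OLS model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xd = np.column_stack([np.ones(len(y)), X])        # design matrix with intercept
    H = Xd @ np.linalg.pinv(Xd.T @ Xd) @ Xd.T         # hat (leverage) matrix
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))    # exact LOO residuals for OLS
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(residuals ** 2) / tss
    q2 = 1.0 - np.sum(loo_residuals ** 2) / tss       # 1 - PRESS / TSS
    return r2, q2


# Illustrative one-descriptor data set.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]

r2, q2 = r2_and_q2_loo(X, y)
max_gap = 0.1   # hypothetical threshold on R2 - Q2
print(r2, q2, (r2 - q2) <= max_gap)
```

The leverage shortcut avoids refitting the model n times, which matters when the fitness function is evaluated for thousands of candidate subsets.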

The iterative step (D) depends on the chosen variable selection technique. During the iterative procedure, the conditions for the stop are checked, and the accepted models are properly managed.

Figure 11 General scheme of the variable selection approach. [Flowchart steps: set of candidate descriptors; (A) variable subset selection technique, giving a subset of selected descriptors; (B) estimate of the model optimization function; (C) evaluation of the model acceptability; (D) iteration; (E) check of the predictive ability of the selected models.]

The simplest VSS techniques preserve only the best model at each step, providing a final unique model at the end of the optimization procedure. Other strategies (typically, e.g., those based on the GAs) provide a population of accepted models or more than one population of models.

The final selected model(s) is processed further (E) to check its effective predictive ability and the possible presence of chance correlation. To this end, strong validation procedures are applied, such as bootstrap or the use of an external data set; chance correlation can be evaluated by the Y-randomization test (see Section 4.05.7.3).

An example of a variable selection algorithm is given in Figure 12, as proposed by Zheng and Tropsha.164 The whole optimization procedure is managed by the SA algorithm. The crucial step B is performed by the LOO validation technique and the fitness function is the Q2 evaluated by the k-NN algorithm. This procedure was applied for variable selection, but the best final model, applicable for reliable prediction, was selected after validation on an external data set.163,164

4.05.6.3 Consensus Modeling

Owing to the large availability of different models predicting the same molecular property, such as those models selected by GAs, the consensus modeling strategy165–167 can be used in order to produce more reliable estimates of the studied property. This strategy can be applied for both regression and classification purposes.

Consensus analysis consists in selecting not just one model, but more than one. Predictions are performed using the average response obtained from all the selected models or, better, using the weighted average response, taking as the statistical weight the leverage h_k of the object from each kth model, as in Equation (34); Equation (35) additionally includes a model-quality weight w_k:

$$\bar{y} = \frac{\sum_{k=1}^{M} \dfrac{y_k}{h_k}}{\sum_{k=1}^{M} \dfrac{1}{h_k}} \qquad (34)$$

$$\bar{y} = \frac{\sum_{k=1}^{M} w_k \left( \dfrac{y_k}{h_k} \right)}{\sum_{k=1}^{M} \dfrac{1}{h_k}} \qquad (35)$$

where M is the number of selected models and y_k is the response estimated by the kth model.167 The leverage is a measure of the ‘distance’ of the object from the model, that is, small leverages correspond to objects well represented by the model, whereas high leverages represent objects far from the model, whose response is thus likely to be extrapolated and less reliable. In Equation (35), the weight w_k takes into account the quality (e.g., the Q2 LOO) of each model and is defined as

$$w_k = \frac{Q_k^{2}}{\sum_{k=1}^{M} Q_k^{2}} \qquad (36)$$

Figure 12 Variable subset selection based on simulated annealing and k-NN method. [Flowchart labels: subset of randomly selected descriptors; leave-one-out compound; prediction of the response of the excluded compound by the average k-NN estimate; Q2 estimate of the model; select the best model; outer loop: simulated annealing.]

A consensus modeling can also be performed by evaluating, for each sample, the standard deviation of the responses predicted by the selected models, thus obtaining a measure of the convergence of all the selected models toward a unique response.

Once a set of acceptable models has been obtained, the models to be used for consensus analysis can be chosen simply taking into account the variables in each model, possibly preferring models with simple and interpretable variables. In this stage, attention has to be paid to models with variables that are different but highly correlated among themselves, because the average prediction from these models will be biased toward a reduced source of information.

In order to avoid the selection of models that are only seemingly diverse, because of the presence of different descriptors, but closely correlated among themselves, a measure of distance between two models can be accounted for.167,168 This distance, called model distance, allows the selection of models which are really diverse and, then, to perform a consensus analysis taking into account different molecular characteristics. The model distance takes into account the correlation of variables within and between models and allows the finding of clusters of similar models, catching the most diverse models in such a way as to preserve maximum information and diversity.
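The leverage-weighted average of Equation (34) and the spread of the individual predictions can be sketched for a single object as follows. The three predictions and leverages are made-up illustrative values, the function name is hypothetical, and the sample standard deviation is used as an assumption, since the text does not specify which estimator is meant.

```python
from statistics import mean, stdev


def leverage_weighted_consensus(predictions, leverages):
    """Equation (34): average of y_k weighted by 1/h_k over the M selected models."""
    numerator = sum(y / h for y, h in zip(predictions, leverages))
    denominator = sum(1.0 / h for h in leverages)
    return numerator / denominator


# Illustrative values for one object predicted by M = 3 models.
predictions = [2.0, 2.4, 3.0]   # y_k
leverages = [0.1, 0.2, 0.5]     # h_k: the third model extrapolates the most

consensus = leverage_weighted_consensus(predictions, leverages)
simple_average = mean(predictions)
spread = stdev(predictions)     # convergence of the models toward a unique response
print(consensus, simple_average, spread)
```

The weighted value is pulled toward the prediction of the model in which the object has the smallest leverage, that is, the model that represents the object best.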

Comparing two models means comparing two p-dimensional binary vectors where each position corresponds to a variable. The most common way to represent the relationships between two binary vectors, represented here by models A and B, is a two-way table as shown in Table 5, where a is the number of cases with 1 in the same position in both vectors, d the number of cases with 0 in the same position in both vectors, b the number of cases such that for a given position there is 1 in vector A and 0 in vector B, and c the number of cases such that for a given position there is 1 in vector B and 0 in vector A. Therefore, b and c represent the number of variables not shared by the two models: b is the number of variables in model A but not in model B, and c the number of variables in model B but not in model A. The degree of similarity between the two models is in some way related to a and d, whereas their degree of diversity is related to b and c.

The most common distance measure for two binary vectors I_A and I_B, which represent two models A and B, is the Hamming distance d_H, defined as

$$d_H = b + c \qquad (37)$$

where b and c are the numbers defined earlier. It has been demonstrated that the Hamming distance usually overestimates the distance between two models, neglecting the variable correlations.
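The two-way table and the Hamming distance can be computed with a few lines of Python. The two binary vectors are those of the 10-variable worked example used in this section for models A and B; the function name is illustrative.

```python
def two_way_table(vector_a, vector_b):
    """Return (a, b, c, d) for two binary vectors of equal length."""
    a = sum(1 for x, y in zip(vector_a, vector_b) if x == 1 and y == 1)  # in both models
    b = sum(1 for x, y in zip(vector_a, vector_b) if x == 1 and y == 0)  # only in model A
    c = sum(1 for x, y in zip(vector_a, vector_b) if x == 0 and y == 1)  # only in model B
    d = sum(1 for x, y in zip(vector_a, vector_b) if x == 0 and y == 0)  # in neither model
    return a, b, c, d


I_A = [0, 0, 1, 1, 1, 1, 1, 0, 1, 0]   # model A: x3, x4, x5, x6, x7, x9
I_B = [1, 0, 1, 0, 0, 0, 0, 0, 1, 1]   # model B: x1, x3, x9, x10

a, b, c, d = two_way_table(I_A, I_B)
hamming_distance = b + c
print((a, b, c, d), hamming_distance)
```

Every unshared variable counts as one full unit of distance here, even when it is perfectly correlated with a variable of the other model, which is the overestimation noted above.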

In order to measure the distance between two binary vectors I_A and I_B while also accounting for variable correlation, the model distance can be calculated as follows.

As the first step, all the pairs of variables of a model having a correlation equal to 1 have to be identified and excluded from further calculations. Note that, if the models to be analyzed have been searched for by any variable selection procedure based on least squares regression, the case of pairs of variables in a model with a correlation equal to 1 is not possible. In any case, all these redundant variables should definitely be excluded from the model, together with the variables common to the two models, which are deleted for practical reasons. At this point, the number of diverse variables in the two models is calculated, this number being b′ + c′. The expression resembles that used for the Hamming distance, but the symbols b′ and c′ replace b and c because, in this case, a preliminary variable reduction has been made.

Table 5 Two-way table collecting variable frequencies between two binary vectors, represented by models A and B

|            | Model B: 1 | Model B: 0 |
|------------|------------|------------|
| Model A: 1 | a          | b          |
| Model A: 0 | c          | d          |

To better explain this stage, let us look at an example. Suppose a set of 10 ordered variables is given, and let model A (I_A) consist of six variables and model B (I_B) of four variables, with two common variables (x3 and x9). Then their binary vector representations are

$$I_A = [\,0\ 0\ 1\ 1\ 1\ 1\ 1\ 0\ 1\ 0\,] \qquad I_B = [\,1\ 0\ 1\ 0\ 0\ 0\ 0\ 0\ 1\ 1\,]$$

and the corresponding phenotypic representations:

$$A: x_3, x_4, x_5, x_6, x_7, x_9 \qquad B: x_1, x_3, x_9, x_{10}$$

Now suppose that the variables x4 and x5 of model A have a correlation equal to 1 and the same holds for variables x9 and x10 of model B. Therefore, in both models one of the two variables, either x4 or x5 in model A and either x9 or x10 in model B, has to be deleted. Moreover, the common variables also have to be deleted, namely x3 and x9 (or x10, which is equivalent to x9).

Then, the reduced models will be composed of the following variables:

$$A: x_5, x_6, x_7 \qquad B: x_1$$

and their binary vectors will become

$$I_A = [\,0\ 0\ 0\ 0\ 1\ 1\ 1\ 0\ 0\ 0\,] \qquad I_B = [\,1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\,]$$

It results that b′ = 3 and c′ = 1. For these reduced models the Hamming distance is equal to 4, whereas for the original models it would be 6.
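The counts of the two-way table and the two Hamming distances quoted for this example can be checked with a few lines of Python. This is only a minimal sketch: the function names `two_way_counts` and `hamming` are illustrative and do not come from the cited work.

```python
def two_way_counts(ia, ib):
    """Return the two-way table counts (a, b, c, d) for two binary vectors."""
    if len(ia) != len(ib):
        raise ValueError("vectors must have the same length")
    a = sum(1 for x, y in zip(ia, ib) if x == 1 and y == 1)  # in both models
    b = sum(1 for x, y in zip(ia, ib) if x == 1 and y == 0)  # only in model A
    c = sum(1 for x, y in zip(ia, ib) if x == 0 and y == 1)  # only in model B
    d = sum(1 for x, y in zip(ia, ib) if x == 0 and y == 0)  # in neither model
    return a, b, c, d


def hamming(ia, ib):
    """Hamming distance d_H = b + c (Equation (37))."""
    _, b, c, _ = two_way_counts(ia, ib)
    return b + c


# Original models A and B of the worked example
IA = [0, 0, 1, 1, 1, 1, 1, 0, 1, 0]
IB = [1, 0, 1, 0, 0, 0, 0, 0, 1, 1]
# Reduced models after removing redundant and common variables
IA_RED = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
IB_RED = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(hamming(IA, IB), hamming(IA_RED, IB_RED))  # 6 4
```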

The second step of the procedure deals with the evaluation of the correlation among all the variables of the two reduced models. It involves the calculation of the cross-correlation matrix C_AB, which contains the correlations between all the possible pairs of variables of the two models. This matrix has b′ rows, that is, the number of variables in the reduced model A, and c′ columns, that is, the number of variables in the reduced model B. The counterpart of C_AB (size b′ × c′) is the cross-correlation matrix C_BA (size c′ × b′), that is, its transpose.

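The cross-correlation matrix can be transformed into a symmetric matrix as follows:

$$\mathbf{Q}_A = \mathbf{C}_{AB}\,\mathbf{C}_{BA} \qquad (b' \times b') \tag{38}$$

$$\mathbf{Q}_B = \mathbf{C}_{BA}\,\mathbf{C}_{AB} \qquad (c' \times c') \tag{39}$$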

The nonzero eigenvalues of both matrices Q_A and Q_B coincide, and the sum r_AB of the square roots of these eigenvalues gives the desired information related to the intermodel variable correlation:

$$r_{AB} = \sum_j \sqrt{\lambda_j} \tag{40}$$

Finally, the model distance d²_M is derived by modifying the Hamming distance as follows:

$$d_M^2(A,B) = b' + c' - 2\sum_j \sqrt{\lambda_j} = b' + c' - 2\,r_{AB} \tag{41}$$

As is easily seen, if no preliminary variable reduction is carried out, that is, b′ = b and c′ = c, and no correlation exists between the two variable blocks, that is, r_AB = 0, the model distance coincides with the Hamming distance.
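The second step can be sketched numerically as follows. The code assumes NumPy and takes as input the data columns of the reduced variable blocks (the variables left in A and in B after removing redundant and common variables); it uses the form b′ + c′ − 2 Σ √λ_j of the model distance. The function name `model_distance_sq` and the toy data are illustrative assumptions, not part of the original method description.

```python
import numpy as np


def model_distance_sq(XA, XB):
    """Squared model distance between two reduced variable blocks.

    XA: (n, b') array with the variables left only in model A
    XB: (n, c') array with the variables left only in model B
    """
    XA = np.asarray(XA, dtype=float)
    XB = np.asarray(XB, dtype=float)
    b_red, c_red = XA.shape[1], XB.shape[1]
    # Pearson correlations among all columns; the off-diagonal block is C_AB (b' x c')
    full = np.corrcoef(XA, XB, rowvar=False)
    C_AB = full[:b_red, b_red:]
    # Q_A = C_AB C_BA, with C_BA the transpose of C_AB (Equation (38))
    Q_A = C_AB @ C_AB.T
    # Eigenvalues of a symmetric matrix; clip tiny negative round-off values
    eigvals = np.clip(np.linalg.eigvalsh(Q_A), 0.0, None)
    r_AB = np.sqrt(eigvals).sum()  # Equation (40)
    return b_red + c_red - 2.0 * r_AB


# Two perfectly correlated single-variable models: distance 0
same = model_distance_sq([[1], [2], [3], [4]], [[2], [4], [6], [8]])
# Two uncorrelated single-variable models: distance equals the Hamming distance (2)
uncorrelated = model_distance_sq([[1], [-1], [1], [-1]], [[1], [1], [-1], [-1]])
```

The two toy cases illustrate the limiting behavior described in the text: uncorrelated blocks give back the Hamming distance, while perfectly correlated variables make two formally different models coincide.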

The model distance satisfies the first two main postulates for a distance measure:

1. d_ij = d_ji

2. d_ii = 0

Moreover, it was observed that the model distance does not always satisfy the triangle inequality:

$$d_{ij} + d_{jk} \ge d_{ik}$$

thus belonging to the class of non-Euclidean distances.


Consensus modeling has been employed in different QSAR studies,169–172 often giving better statistical fits and predictive abilities than the single models; moreover, consensus analysis has also been shown to diminish the effect of noisy data.173

4.05.7 Principles for QSAR Modeling

In recent years, the basic philosophy underlying QSAR modeling has been changing to account for the new requirements that QSAR models should satisfy in order to be used effectively.

First of all, in order to be reproducible, all the models have to be fully described; in other words, the methods used for their calculation and assessment have to be well defined, as well as the molecular descriptors appearing in the models, the modeled property, and the chemicals used in the training set. Furthermore, for several years, QSAR models have been developed not only on congeneric data sets but also on noncongeneric sets of compounds, because of the need to obtain more general relationships and to exploit the great potential of the large data sets of compounds now available. Moreover, evaluation of the model AD has been widely recognized as a safe tool to predict responses for new chemicals while avoiding extrapolation, and validation has by now entered the common practice of QSAR modeling. In effect, QSAR models are accepted only if validated, that is, some predictive ability parameter has to be estimated for a reliable use of the model on new compounds.

All the general principles for properly producing valid QSAR models were recently taken into account by the OECD (http://www.oecd.org/document/23/0,2340,en_2649_201185_33957015_1_1_1_1,00.html) and formally declared fundamental tools to estimate data on chemicals by QSARs.

The New Chemicals Policy of the European Commission (REACH) (http://eur-lex.europa.eu/LexUriServ/site/en/oj/2006/l_396/l_39620061230en00010849.pdf) explicitly states that at the chemical registration level the registrant “should include information from alternative sources (e.g., from QSARs) which may assist in identifying the presence or absence of hazardous properties of the substance and which can, in certain cases, replace the results of animal tests. Obviously, for the purposes of the REACH legislation, it is essential to use QSAR models that produce reliable estimates, that is, validated QSAR models” (http://ecb.jrc.it/qsar/). Model validation has been the subject of much recent debate in the scientific and regulatory communities.163,164,174–182 After the REACH legislation, it was considered important to develop an internationally recognized set of principles for QSAR validation, to provide regulatory bodies with a scientific basis for making decisions on the acceptability of QSAR estimates of regulatory endpoints and to promote the mutual acceptance of QSAR models.

Several principles for assessing the validity of QSARs were proposed in 2004 by the OECD Work Programme on QSARs and are now known as the OECD Principles for QSAR model validation for regulatory purposes (http://www.oecd.org/dataoecd/33/37/37849783.pdf):

To facilitate the consideration of a (Q)SAR model for regulatory purposes, it should be associated with the following information: 1) a defined endpoint; 2) an unambiguous algorithm; 3) a defined domain of applicability; 4) appropriate measures of goodness-of-fit, robustness and predictivity; 5) a mechanistic interpretation, if possible.

Some considerations about these basic principles for QSAR modeling are discussed below.

4.05.7.1 Unambiguous Model Algorithm

For a QSAR model to be acceptable in chemical regulations, it must be clearly defined and easily and continuously applicable, in such a way that the calculations for the prediction of the endpoint can be reproduced by everyone, including for new chemicals. Thus, the unambiguous algorithm is characterized not only by the mathematical method of calculation used but also by the specific molecular descriptors required in the model equation. Therefore, the exact procedure used to calculate the descriptors, including compound pretreatment (e.g., energy minimization and partial charge calculation), the software employed, and the variable selection method for QSAR model development should be considered integral parts of the overall definition of an unambiguous algorithm.


4.05.7.2 Applicability Domain

The concept of AD concerns the predictive use of QSAR/QSPR models and is therefore closely related to the concept of model validation. In other words, the AD is a concept related to the quality of the QSAR/QSPR model predictions and to the prevention of potential misuse of the model's results. A key component of prediction quality is to define when a QSAR/QSPR model is suitable to predict a property/activity of a new compound, that is, to define the model's AD.164,174,176–178,180,181

A model will yield reliable predictions when model assumptions are fulfilled and unreliable predictions when they are violated. In particular, for QSAR/QSPR models based on statistical mining techniques, the training set and the model prediction space are the basis for estimating the chemical space where predictions are reliable.

Two basic approaches were proposed to evaluate the AD. The first approach to AD evaluation is the analysis of the training set, which has its grounds in statistics, because interpolated predictions are more reliable than extrapolated ones. Extrapolation is not a problem in principle, because extrapolated results from theoretically well-founded models can often be reliable. However, QSAR/QSPR models are usually based on empirical and limited experimental evidence and/or are only locally valid; therefore, extrapolation always results in higher uncertainty and usually in unreliable predictions.

Different approaches to estimate interpolation regions in a multivariate space were evaluated by Jaworska,178,179 based on (1) ranges of the descriptor space; (2) distance-based methods, using Euclidean, Manhattan, and Mahalanobis distances, the Hotelling T² method, and leverage values; and (3) probability density distribution methods based on parametric and nonparametric approaches. Both ranges and distance-based methods were also evaluated in the principal component space.
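The simplest of these criteria, the range-based one, can be illustrated with a short sketch: a query compound is considered inside the interpolation region only if every descriptor value falls within the minimum and maximum observed in the training set. The function names and the descriptor values below are hypothetical and serve only as an illustration.

```python
def descriptor_ranges(training):
    """Return the (min, max) of each descriptor over the training set rows."""
    columns = list(zip(*training))
    return [(min(col), max(col)) for col in columns]


def in_range_domain(query, ranges):
    """True if every descriptor of the query lies within the training ranges."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(query, ranges))


# Hypothetical training set: three compounds described by two descriptors
training = [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
ranges = descriptor_ranges(training)

inside = in_range_domain([2.5, 12.0], ranges)   # interpolation
outside = in_range_domain([4.0, 12.0], ranges)  # first descriptor out of range
```

A bounding box of this kind ignores correlations among descriptors and empty regions inside the box, which is one reason why distance-based and density-based criteria are also considered.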

One of the common tools used to visualize the AD of a QSAR model is the plot of standardized residuals in prediction (r_i) versus leverage values (h_i) for each ith sample. This plot, called the Williams plot, allows an immediate and simple graphical detection of both the response outliers (i.e., compounds with standardized residuals in prediction greater than three standard deviation units, r_i > 3σ) and the structurally influential chemicals in a model (h_i > h*), where h* is a threshold value, usually 2 or 3 times the average leverage value. In effect, when the leverage value of a compound is lower than the critical value h*, the probability of accordance between predicted and actual values is as high as that for the training set chemicals. Conversely, a high leverage chemical is structurally distant from the other chemicals; thus, it can be considered outside the AD of the model.
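The quantities behind a Williams plot can be sketched with NumPy as below. The leverage of a compound is computed as h = xᵀ(XᵀX)⁻¹x from the training model matrix X; the sketch assumes a linear model with an intercept, uses the fact that the average training leverage equals p′/n (p′ being the number of model parameters), and flags response outliers on the absolute standardized residual. These choices, the function names, and the default factor of 3 are assumptions made for illustration.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage of each query row with respect to the training model matrix."""
    Xt = np.asarray(X_train, dtype=float)
    Xq = np.asarray(X_query, dtype=float)
    # Add an intercept column (assumption: the model has a constant term)
    Xt = np.column_stack([np.ones(len(Xt)), Xt])
    Xq = np.column_stack([np.ones(len(Xq)), Xq])
    core = np.linalg.pinv(Xt.T @ Xt)
    # h_i = x_i^T (X^T X)^-1 x_i for every query row
    return np.einsum("ij,jk,ik->i", Xq, core, Xq)


def leverage_threshold(X_train, factor=3.0):
    """h* as a multiple of the average training leverage p'/n."""
    n, p = np.asarray(X_train).shape
    return factor * (p + 1) / n


def williams_flags(std_residuals, h, h_star, limit=3.0):
    """Return (response outlier, high leverage) boolean arrays."""
    r = np.asarray(std_residuals, dtype=float)
    h = np.asarray(h, dtype=float)
    return np.abs(r) > limit, h > h_star


X = [[0.0], [1.0], [2.0], [3.0]]            # toy training descriptor
h_train = leverages(X, X)                   # sums to p' = 2
h_star = leverage_threshold(X)              # 3 * 2 / 4 = 1.5
h_new = leverages(X, [[10.0]])              # far from the training range
```

Plotting the standardized residuals against the leverages, with horizontal lines at ±3 and a vertical line at h*, reproduces the layout of the plot described in the text.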

Figure 13 shows the Williams plot of a model for polar narcotics in Pimephales promelas as an example.183 Here, chemical 347 is wrongly predicted (r_i > 3σ); it is a test chemical completely outside the AD of the model, because its leverage value is beyond the vertical leverage threshold line; thus, it is both a response outlier and a high leverage chemical.

Two other chemicals (squares at 0.35 h) slightly exceed the critical leverage value but are close to three chemicals of the training set (rhombuses), which are slightly influential in the model development. The predictions for these test chemicals can be considered as reliable as those of the training chemicals. Chemical 283 is wrongly predicted (r_i > 3σ), but in this case it belongs to the model AD, being within the cutoff leverage value. Therefore, although the predicted response for chemical 347 should not be accepted because it is not reliable, the prediction for chemical 283 should be.

Another approach to AD evaluation is based on the similarity/diversity, evaluated in the model descriptor space, of the considered compound with respect to those belonging to the training set; in fact, a QSAR/QSPR prediction should be reliable if the compound is, in some way, similar to one or more compounds present in the training set.184 High similarity is simply another way to use the interpolation ability of the model in place of extrapolation.

A stepwise procedure based on four stages was also proposed.177 General parametric requirements are imposed in the first stage, specifying in the domain only those chemicals that fall in the range of variation of the physicochemical properties of the chemicals in the training set. Such properties (e.g., molecular weight, absorption, water solubility, and volatility) are not usually the driving forces for the studied phenomenon, but they may implicitly affect the measured endpoint, for example, by reducing the bioavailability of chemicals. The second stage defines similarity measures that can be used to quantify the structural similarity between pairs of molecules. Atom-centered fragments are the molecular descriptors used to determine such a similarity. The third stage in defining the domain is based on a mechanistic understanding of the modeled phenomenon. This goal is very difficult to reach because the structure and mathematical formalism of the model, the computational method used for its derivation, the accepted hypotheses, and so forth should be taken into account. The suggested approach is an attempt to reduce the diversity in this matter, where the analysis is focused on functional groups whose reactivity modulates the studied endpoint and on structural fragments used in group contribution models. Finally, in the fourth stage, the reliability of simulated metabolism (metabolites, pathways, and maps) is taken into account in assessing the reliability of predictions, if metabolic activation of chemicals is a part of the QSAR model.

In any case, regardless of the specific method chosen, AD evaluation is always a very important task in order to avoid unreliable predictions and misuse of the results.

4.05.7.3 Validation

For several years, model validation has been a crucial part of the development of QSAR models, because interest has focused on the use of effective predictive models.163,164,174,182,185–191

In chemometrics, the term validation assumes a specific meaning in the framework of finding and producing models: it consists of evaluating the predictive ability of the model and detecting model pathologies such as chance correlation, redundancy, and useless model complexity.

In general terms, model validation can be carried out by using a subset of the available data as the training set to build the model and the remaining part of the data as the test set to evaluate the predictive ability of the model, comparing the test set experimental responses with the predicted ones (Figure 14).

Several statistical parameters are used to estimate the model quality. Among these, the most popular parameters are the coefficient of determination R², measuring the model fitting ability (Equation (42)), and the corresponding coefficient Q², measuring the model predictive ability (Equation (43)):

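$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad 0 \le R^2 \le 1 \tag{42}$$

$$R^2_{cv} \equiv Q^2 = 1 - \frac{PRESS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_{i/i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad Q^2 \le 1 \tag{43}$$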

Figure 13 Williams plot for an externally validated model for polar narcotics (leverage cutoff value: 2.5 h*). Horizontal axis: hat (leverage) values; vertical axis: standardized residuals; symbols distinguish training set and prediction set chemicals, with chemicals 347 and 283 labeled. Reproduced from Papa, E.; Villa, F.; Gramatica, P. Statistically Validated QSARs, Based on Theoretical Descriptors, for Modeling Aquatic Toxicity of Organic Chemicals in Pimephales promelas (Fathead Minnow). J. Chem. Inf. Model. 2005, 45, 1256–1266.


where RSS is the residual sum of squares; PRESS is the predictive error sum of squares; y_i and ŷ_i are the experimental and estimated responses, respectively; and the notation i/i indicates that the response of the ith sample is evaluated when the ith sample does not participate in the model building, that is, it is not included in the training set. The average response ȳ is calculated from the training set samples, and TSS stands for the total sum of squares.

R² is the most widely used measure of the ability of a QSAR model to reproduce the data in the training set (goodness of fit), but it says nothing about the model's predictivity. It is important to remember that, in contrast to the fitting parameter R², which increases as more and more descriptors are added (until there is dangerous overfitting), the value of Q² generally increases only when the added predictors are useful in predicting left-out compounds. Both R² and Q² are often reported as percentages, that is, in terms of the percentage of explained variance.

Several validation techniques were proposed to estimate model predictivity, thus leading to different Q² values. Validation techniques differ among themselves in how the objects are partitioned into training and test sets.

The most natural approach to model validation is the so-called ‘training/test splitting’, that is, a technique based on the extraction from the original data set of a percentage of objects (usually from 10 to 50%), which therefore do not participate in the model building but are only used to test the model prediction ability. The training/test splitting should be performed randomly thousands of times, each time leaving a fixed percentage of objects in the test set. A single random training/test splitting is not recommended because the results depend too strongly on that one random split.192

Without doubt, LOO and LMO cross-validation techniques are the most popular techniques for internal validation.191,193–195 By these techniques, each object or each group of objects is put only once in the test set. By the LOO technique, each object is in turn left out, the model is built by using the n − 1 remaining objects, and the prediction is calculated for the object left out.
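As an illustration of Equations (42) and (43), the following sketch computes R² and the LOO Q² for an ordinary least squares model with NumPy. It assumes a linear model with an intercept and uses the mean response of all n objects in TSS; the helper names and the small data set are hypothetical.

```python
import numpy as np


def fit_predict(X_tr, y_tr, X_te):
    """Fit OLS with an intercept on the training rows and predict the test rows."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ coef


def r2_and_q2_loo(X, y):
    """Return (R2, Q2_LOO) following Equations (42) and (43)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    tss = np.sum((y - y.mean()) ** 2)
    rss = np.sum((y - fit_predict(X, y, X)) ** 2)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i  # leave the ith object out
        y_hat = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
        press += (y[i] - y_hat) ** 2
    return 1.0 - rss / tss, 1.0 - press / tss


# Hypothetical data: one descriptor, six compounds
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
r2, q2 = r2_and_q2_loo(X, y)
```

For least squares models PRESS can never be smaller than RSS, so the Q² obtained in this way never exceeds R².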

By the LMO technique, the data set is partitioned into k cross-validation groups (usually from 2 to 10), each containing a number n_v of objects (approximately n/k); at each run the model is built using the n − n_v objects of the training set. The responses of each cross-validation group are predicted using the partial model built from the objects belonging to the remaining groups.

The predictive power (Q²_LOO and Q²_LMO, from LOO and LMO procedures, respectively) is calculated by summing the squared differences obtained for the n predictions.
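The LMO procedure can be sketched in the same way. The code below assumes NumPy, an ordinary least squares model with an intercept, and a random assignment of the objects to k groups controlled by a seed; the name `q2_lmo` and its parameters are illustrative choices.

```python
import numpy as np


def q2_lmo(X, y, k=5, seed=0):
    """Q2 from leave-many-out cross-validation with k groups of about n/k objects."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(n), k)  # random partition into k groups
    press = 0.0
    for test_idx in groups:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False  # the group is left out of model building
        A = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        y_hat = np.column_stack([np.ones(len(test_idx)), X[test_idx]]) @ coef
        press += np.sum((y[test_idx] - y_hat) ** 2)  # each object predicted once
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / tss


# Hypothetical noise-free data: every partial model predicts the left-out group exactly
X = [[float(i)] for i in range(10)]
y = [2.0 * i + 1.0 for i in range(10)]
q2 = q2_lmo(X, y, k=5)
```

Setting k equal to n reduces the procedure to LOO, while repeating the call with different seeds and averaging the results gives a simple Monte Carlo variant.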

As it has been demonstrated that, for infinite samples,196 Q²_LOO tends to R², it is obvious that Q²_LOO can very often be too optimistic, and thus, it results in an unreliable estimate of the prediction power.

An interesting variant of the LMO technique is the Monte Carlo LMO cross-validation (MCCV), by which the partition of the objects into cross-validation groups is carried out randomly and repeated several times.

Bootstrap validation185,186,197 is another commonly used technique by which the training set is repeatedly (thousands of times) built by including the same initial number (n) of objects, but selected from the raw data with replacement, that is, repeated objects are allowed in the training set. Since repeated objects are allowed, some objects are left out of each training set, and each test set is composed of the objects that are not selected in the training set. This procedure allows models to be built from training sets that always have the same size and the model predictive performance to be checked on test sets having different objects and sizes.
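The construction of one bootstrap training/test pair can be sketched with the Python standard library alone. The function name `bootstrap_split` is illustrative; in practice the split would be repeated many times and a model refitted on each training set.

```python
import random


def bootstrap_split(n, rng):
    """Draw n object indices with replacement; unselected objects form the test set."""
    train = [rng.randrange(n) for _ in range(n)]  # repeated indices are allowed
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
train, test = bootstrap_split(10, rng)
# The training set always has n entries; the test set size varies from draw to draw
```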

Figure 14 General scheme of the validation of QSAR/QSPR models (diagram labels: data set; training set and test set; partial models; fitting and internal prediction; external set; final model; external prediction).


The bagging validation method, introduced by Breiman,198 is a modification of the bootstrap method, both with and without replacement, which produces samples of smaller size than the original one (often n/2). Bagging involves replacing an estimator by the average (or mode) of the values it takes when computed from B resamples.

Boosted leave-many-out (boosted LMO)199 is a validation method used to systematically vary the balance between representativeness and diversity of training and test sets, proposed in the framework of CoMFA. It is a nonparametric method employed to create external data sets including balanced sampling across the response range and/or structural classes and maximizing training set diversity by a predefined criterion. The former emphasizes making both test and training sets as representative as possible. The latter favors assignment of the ‘most unusual’ compounds to the training set to increase the statistical power of the models obtained.

Moreover, in the QSAR field, it is nowadays advised to carry out an external validation of models that were previously internally validated at the model development step.163,164,182

Together with training and test sets, an additional data set, called the external data set, is advised to obtain an unbiased estimate of the prediction ability of a model. The external data set is usually built as a further test set using some deterministic algorithm to split the raw data. A single random splitting is not recommended because the validation results would depend too strongly on that one random split. Therefore, clustering methods are commonly used, such as k-means, Jarvis-Patrick, and hierarchical clustering methods, together with more recent approaches based on Kohonen maps (or SOMs)200 and sphere-exclusion algorithms.190 Moreover, D-optimal and distance-based optimal designs can also be efficiently used.169,201,202

These techniques allow a partition of the objects by exploiting different similarity/diversity analyses, spanning the whole chemical space and trying to perform the partition by a uniform covering. The objects are then selected in such a way that the training set objects are evenly distributed within the whole chemical space and the external set objects satisfy some condition of closeness to the training set objects.

A comparison among SOMs, Kennard-Stone design, D-optimal design, and random splitting was performed by Wu et al.203 The best models were built when Kennard-Stone and D-optimal designs were used; SOMs performed better than random selection, and D-optimal design was slightly better than random selection.
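As an example of a deterministic, distance-based split, a common formulation of the Kennard-Stone design can be sketched as follows: the two most distant objects are selected first, and then the object whose minimum distance to the already selected ones is largest is added repeatedly. The code uses Euclidean distances in the descriptor space and only the standard library; the function name and the toy coordinates are illustrative, and the cited comparison may have used different settings.

```python
import math


def kennard_stone(points, n_select):
    """Return (selected, remaining) object indices by a maximin distance design."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of objects")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Start from the pair of objects that are farthest apart
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    selected = [i0, j0]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Pick the object farthest from its nearest already-selected object
        nxt = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining


# Toy one-dimensional descriptor space: five objects, three go to the training set
training_idx, external_idx = kennard_stone([[0.0], [1.0], [2.0], [3.0], [10.0]], 3)
```

The selected objects span the descriptor space as uniformly as possible, so the objects left over tend to lie close to at least one training object.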

While the test set is used during the optimization procedure, that is, in searching for the best subset of model variables or the best architecture of neural networks, the external set is only used to evaluate the predictive performance of the final selected model(s).

The external Q²_EXT is among the most used parameters for evaluating the prediction ability for the external set and is defined as

$$Q^2_{EXT} = 1 - \frac{\sum_{i=1}^{n_{TEST}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{TEST}} (y_i - \bar{y}_{TR})^2} \qquad Q^2_{EXT} \le 1 \tag{44}$$

where the sums run over the external samples and the average response ȳ_TR is that obtained from the training set samples.167
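Equation (44) translates directly into a few lines of Python. The function name and the numerical values below are hypothetical and serve only to show the two reference cases: perfect external predictions give 1, and predictions no better than the training set mean give 0.

```python
def q2_ext(y_ext, y_pred, y_train_mean):
    """External Q2 of Equation (44): sums run over the external samples only."""
    press = sum((y - p) ** 2 for y, p in zip(y_ext, y_pred))
    ss = sum((y - y_train_mean) ** 2 for y in y_ext)  # deviations from the training mean
    return 1.0 - press / ss


y_ext = [1.0, 2.0, 3.0]  # hypothetical external responses
perfect = q2_ext(y_ext, [1.0, 2.0, 3.0], y_train_mean=2.5)
no_skill = q2_ext(y_ext, [2.5, 2.5, 2.5], y_train_mean=2.5)
```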

The limiting condition for external validation is the total number of samples, as it is difficult to build a meaningful training set, test set, and external data set simultaneously when few samples are available. In effect, if very few chemicals are available, the predictivity of a model cannot be verified by checking only a few chemicals, as in such cases the results could be obtained by chance, and it is impossible to derive general conclusions.

The role of external validation is more or less important depending on whether the model variables are (1) already univocally defined or (2) selected among several candidate variables by using some selection procedure. In effect, for case (1), if the external data set is chosen in such a way that its samples are similar to those of training and test sets, the selected external samples depend on the whole data set, that is, they are selected because they are in some way represented by other samples. Consequently, even if the samples of the external data set do not participate in the model building, they are not completely independent of the training samples, and thus, a good prediction ability can be reasonably expected.

In case (2), selection of the external data set is based on the information given by all the candidate variables, looking at the distribution of the compounds in the whole chemical space. External compounds are usually selected to be similar to training compounds in the original chemical space, because by definition they must not participate in the optimization procedure. However, the chemical space used to select external compounds is obviously different from the chemical space associated with each model defined by a small number of selected molecular descriptors, and, accordingly, similarity relationships among compounds may change significantly. In other words, external compounds may differ from the training compounds within a specific model space, which makes them a useful tool to assess the general applicability of the model. In conclusion, external validation allows models lacking sufficient generalizability to be detected.

On the other hand, external compounds that are uniformly distributed with the training compounds in the whole chemical space could turn out to be outliers in the chemical space of the specific models obtained, and a low prediction ability then simply indicates that the external samples are not represented in the model chemical space.

External validation is not proposed as an alternative to internal validation but as an additional validation step to be taken when models are obtained by variable selection procedures. In effect, the best models should be selected by optimizing internal validation parameters, usually the LOO Q². Then, only the good models, stable and internally predictive, are subjected to external validation.

It is not unusual that models with high internal predictivity, verified by different internal validation methods, but externally less predictive or even absolutely unpredictive, are present in the population of models developed, for example, by a GA technique.

An example of this situation is highlighted in Table 6, which lists the first 30 models of a GA population of PAH mutagenicity models (TA100 on 48 PAHs).204

Table 6 GA population of models(a) for 48 nitro-PAH mutagenicity (31 in training and 17 in prediction set): fitting (R²), internal validation (Q²_LOO and Q²_BOOT), and external validation (Q²_EXT) parameters

| ID | Model descriptors | R² | Q²_LOO | Q²_BOOT | Q²_EXT |
|----|-------------------|-------|-------|-------|-------|
| 1 | PW2 SIC1 | 85.70 | 82.44 | 82.36 | 72.27 |
| 2 | PW2 CIC1 | 84.88 | 80.78 | 80.71 | 75.34 |
| 3 | X1A MATS1e | 82.42 | 79.32 | 79.00 | 85.75 |
| 4 | Mv MATS2e | 83.37 | 79.04 | 79.25 | 84.27 |
| 5 | Mv MATS1e | 81.76 | 78.47 | 78.42 | 74.86 |
| 6 | Mv GATS2m | 81.57 | 77.87 | 78.10 | 69.13 |
| 7 | GATS1e VED2 | 81.07 | 77.64 | 77.68 | 88.06 |
| 8 | Xt nPyr | 80.25 | 77.48 | 77.41 | 81.71 |
| 9 | Mv PW2 | 80.95 | 77.39 | 77.97 | 71.85 |
| 10 | PW2 IC1 | 80.89 | 77.04 | 77.32 | 60.07 |
| 11 | JGI3 VED2 | 80.27 | 76.76 | 76.91 | 66.67 |
| 12 | Mp LUMO | 80.78 | 76.54 | 76.55 | 70.13 |
| 13 | Mv LUMO | 80.26 | 76.15 | 76.11 | 63.74 |
| 14 | BELe8 HATS4u | 80.53 | 76.10 | 76.17 | 47.59 |
| 15 | IC1 VED2 | 80.17 | 76.09 | 76.55 | 80.94 |
| 16 | Xt MATS1e | 80.23 | 76.08 | 75.96 | 86.79 |
| 17 | PW2 HIC | 80.14 | 75.99 | 76.16 | 69.62 |
| 18 | SIC1 VED2 | 79.92 | 75.78 | 76.11 | 81.65 |
| 19 | VED2 Hy | 79.55 | 75.52 | 75.63 | 86.98 |
| 20 | VED2 R6u+ | 79.27 | 75.52 | 75.50 | 27.18 |
| 21 | HATS3u R3v | 79.55 | 75.52 | 75.23 | 0 |
| 22 | Mv MATS2m | 79.25 | 75.37 | 75.64 | 69.21 |
| 23 | Xt BELm2 | 79.89 | 75.35 | 75.40 | 69.54 |
| 24 | GGI3 VED2 | 79.10 | 75.34 | 75.58 | 63.50 |
| 25 | BELe8 R4u+ | 80.06 | 75.32 | 75.30 | 50.23 |
| 26 | SIC2 BEHm8 | 79.14 | 75.13 | 75.48 | 61.48 |
| 27 | VED2 RTe | 78.65 | 75.13 | 75.32 | 69.76 |
| 28 | CIC2 VED2 | 79.49 | 75.06 | 75.08 | 77.75 |
| 29 | SIC2 BELv5 | 79.40 | 75.02 | 75.36 | 58.31 |
| 30 | X1A LUMO | 79.13 | 74.96 | 74.91 | 78.98 |

(a) In bold the models with reduced predictive performance in external validation in comparison to internal validation.

Reproduced from Gramatica, P.; Pilutti, P.; Papa, E. Approaches for Externally Validated QSAR Modelling of Nitrated Polycyclic Aromatic Hydrocarbon Mutagenicity. SAR QSAR Environ. Res. 2007, 18, 169–178.


Some models (in bold) appear stable and predictive by internal validation parameters (Q²_LOO and Q²_BOOT), but are less predictive (or even unpredictive: Q²_EXT ≈ 0) when applied to external chemicals that were never presented to the GA during model development. It is also important to note that the less predictive models (in bold) are based on different kinds of molecular descriptors; thus, model instability cannot be attributed to a particular descriptor. The best combination of modeling variables must be chosen in this GA population from the models that guarantee, first of all, stability and internal predictivity and, additionally, external predictive ability.

Moreover, if the model is found by using some variable selection technique from a huge number of potential candidate variables, high correlations with the modeled response can occur in the models purely by chance.205–207 Therefore, it is very important to evaluate whether the models provided by some variable selection tool are good only by chance.

A statistical tool able to detect the presence of chance correlation in a model and/or a lack of model robustness is the permutation test (also known as Y-scrambling or Y-randomization)208 and its recent developments such as progressive scrambling209 and other variants.210 This validation technique consists of repeating the calculation procedure with randomized responses and subsequently assessing the probability of the resultant statistics. Frequently, it is used along with cross-validation. It is expected that models obtained for the same data but with randomized responses should have low values of the quality parameter (e.g., Q² or R²). If, instead, models based on the randomized responses sometimes reach high Q² (R²) values, the original model is rejected because of a suspected chance correlation.
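A basic Y-scrambling run can be sketched as follows, assuming NumPy, an ordinary least squares model with an intercept, and R² as the quality parameter. The empirical p-value formula (count + 1)/(permutations + 1) is a common convention for permutation tests rather than something prescribed by the cited works, and the function names and data are illustrative.

```python
import numpy as np


def r2_ols(X, y):
    """R2 of an OLS fit with an intercept."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float).reshape(len(y), -1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    return 1.0 - rss / np.sum((y - y.mean()) ** 2)


def y_scrambling(X, y, n_perm=200, seed=0):
    """Compare the true R2 with R2 values obtained for randomly permuted responses."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    r2_true = r2_ols(X, y)
    r2_perm = np.array([r2_ols(X, rng.permutation(y)) for _ in range(n_perm)])
    # Fraction of scrambled models at least as good as the true one
    p_value = (np.sum(r2_perm >= r2_true) + 1) / (n_perm + 1)
    return r2_true, r2_perm, p_value


# Hypothetical data with a real structure-response relationship
X = [[float(i)] for i in range(20)]
y = [3.0 * i + np.sin(i) for i in range(20)]
r2_true, r2_perm, p_value = y_scrambling(X, y)
```

When variable selection is part of the modeling procedure, the selection step should be repeated inside each permutation as well; otherwise the test underestimates the risk of chance correlation.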

4.05.7.4 Model Descriptor Interpretability

Regarding the interpretability of the descriptors, it is important to take into account that the modeled response is frequently the result of a series of complex biological or physicochemical mechanisms; thus, it is very difficult and reductionist to ascribe too much importance to the mechanistic meaning of the molecular descriptors used in a QSAR model. Moreover, it must also be highlighted that in multivariate models such as MLR models, even though the interpretation of a single molecular descriptor can certainly be useful, it is only the combination of the selected set of descriptors that is able to model the studied endpoint. If the main aim of QSAR modeling is to fill the gaps in available data, the modeler's attention should be focused on model quality. In relation to this point, Livingstone states:211 “The need for interpretability depends on the application, since a validated mathematical model relating a target property to chemical features may, in some cases, be all that is necessary, though it is obviously desirable to attempt some explanation of the ‘mechanism’ in chemical terms, but it is often not necessary, per se.” Zefirov and Palyulin175 took the same position, differentiating predictive QSARs, where attention essentially concerns the best prediction quality, from descriptive QSARs, where major attention is paid to descriptor interpretability.

4.05.7.5 Summaries of QSAR Models

Several parameters are available for describing QSAR model quality. This topic exceeds the scope of this chapter; therefore, only a simple example of the parameters necessary for regression QSAR models is given here.

Typical regression QSAR models are usually reported (or should be reported) as in the following example,where both R2 and Q2 are reported as percentages:

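$$
\begin{aligned}
\log P &= 3.1\,(\pm 0.15) - 0.0056\,(\pm 0.0002)\,X_1 + 12\,(\pm 1)\,X_2 \\
n &= 15 \qquad Q^2_{\mathrm{LOO}} = 93.6 \qquad \mathrm{RMSEP} = 0.792 \\
R^2 &= 97.7 \qquad \mathrm{RMSEC} = 0.821
\end{aligned}
\tag{45}
$$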

where X1 and X2 are two molecular descriptors, log P is the studied property, n is the number of training samples, and the subscript LOO indicates that validation was performed by the LOO technique. The numbers in the equation are the regression coefficients with their uncertainties.

Other useful parameters to be considered are the RMSEs (root mean square errors) calculated on the training set (also called SDEC or SEC) and on the test set (also called SDEP or SEP), representing the average errors in fitting and in prediction, respectively, and defined as

$$
\mathrm{SDEC} \equiv \mathrm{RMSEC} = \sqrt{\frac{\mathrm{RSS}}{n}}
\tag{46}
$$

$$
\mathrm{SDEP} \equiv \mathrm{RMSEP} = \sqrt{\frac{\mathrm{PRESS}}{n}}
\tag{47}
$$
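As a rough illustration of how the quantities in such a summary relate to one another, the sketch below computes R2, Q2 by LOO, RMSEC, and RMSEP for a one-descriptor least-squares model. It assumes the usual definitions R2 = 1 − RSS/TSS and Q2 = 1 − PRESS/TSS, with TSS taken around the mean of the full training set; the function names `fit_line` and `regression_summary` and the toy data are hypothetical and are not taken from the chapter.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def regression_summary(x, y):
    """Return n, R2 and Q2_LOO (as percentages), RMSEC and RMSEP.

    x and y must be plain lists of equal length.
    """
    n = len(x)
    b0, b1 = fit_line(x, y)
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    # RSS: residuals of the model fitted on all training objects
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # PRESS: each object is predicted by a model fitted without it (LOO)
    press = 0.0
    for i in range(n):
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a0 + a1 * x[i])) ** 2
    return {
        "n": n,
        "R2": 100.0 * (1.0 - rss / tss),
        "Q2_LOO": 100.0 * (1.0 - press / tss),
        "RMSEC": math.sqrt(rss / n),
        "RMSEP": math.sqrt(press / n),
    }


# Toy data (assumed for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

summary = regression_summary(x, y)
for name, value in summary.items():
    print(name, round(value, 3) if isinstance(value, float) else value)
```

For an ordinary least-squares model each LOO residual is at least as large in magnitude as the corresponding fitting residual, so PRESS ≥ RSS, and therefore Q2 ≤ R2 and RMSEP ≥ RMSEC whenever both pairs are computed on the same training objects in this way.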

Obviously, other estimates of the predictive power, obtained on an external test set (also reporting the number of objects in the external set), by LMO (together with the percentage of objects left out in each step), or by bootstrap techniques, should be reported, together with other specific information about the adopted validation techniques:

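$$
\begin{aligned}
n &= 15 \qquad Q^2_{\mathrm{LMO}(20\%)} = 91.1 \qquad Q^2_{\mathrm{LMO}(30\%)} = 90.3 \qquad Q^2_{\mathrm{BOOT}} = 90.6 \\
n &= 5 \qquad Q^2_{\mathrm{EXT}} = 88.0 \qquad \mathrm{RMSEP}_{\mathrm{EXT}} = 0.872
\end{aligned}
\tag{48}
$$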
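The additional statistics in a report of this kind could be obtained along the following lines. This is only a sketch under stated assumptions: it uses a one-descriptor least-squares model, LMO is implemented as repeated random exclusion of a fixed fraction of objects, and the external Q2 is computed as 1 − PRESS_EXT / Σ(y_i − ȳ_training)² over the external objects, which is one of several definitions in use. The helper names `fit_line`, `press`, `q2_lmo`, and `external_validation` and the data are hypothetical.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def press(model, x, y):
    """Sum of squared prediction errors of a fitted line on the given objects."""
    b0, b1 = model
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def q2_lmo(x, y, fraction, n_iter=200, seed=0):
    """Q2 (percentage) by repeatedly leaving out a fraction of the objects."""
    rng = random.Random(seed)
    n = len(x)
    n_out = max(1, round(fraction * n))
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    total_press = 0.0
    for _ in range(n_iter):
        out = set(rng.sample(range(n), n_out))
        keep = [i for i in range(n) if i not in out]
        model = fit_line([x[i] for i in keep], [y[i] for i in keep])
        total_press += press(model, [x[i] for i in out], [y[i] for i in out])
    # Compare mean squared prediction error with mean squared deviation from the mean
    mean_sq_error = total_press / (n_iter * n_out)
    return 100.0 * (1.0 - mean_sq_error / (tss / n))


def external_validation(x_tr, y_tr, x_ext, y_ext):
    """Q2_EXT (percentage) and RMSEP_EXT for objects never used in model fitting."""
    model = fit_line(x_tr, y_tr)
    press_ext = press(model, x_ext, y_ext)
    y_tr_mean = sum(y_tr) / len(y_tr)
    ss_ext = sum((yi - y_tr_mean) ** 2 for yi in y_ext)
    return {
        "n_ext": len(x_ext),
        "Q2_EXT": 100.0 * (1.0 - press_ext / ss_ext),
        "RMSEP_EXT": math.sqrt(press_ext / len(x_ext)),
    }


# Toy data (assumed): 15 training objects and 5 external objects
x_train = [float(i) for i in range(1, 16)]
y_train = [0.5 * xi + 1.0 + 0.3 * (-1) ** i for i, xi in enumerate(x_train)]
x_ext = [2.5, 5.5, 8.5, 11.5, 14.5]
y_ext = [0.5 * xi + 1.0 + 0.2 * (-1) ** i for i, xi in enumerate(x_ext)]

q2_lmo_20 = q2_lmo(x_train, y_train, fraction=0.20)
q2_lmo_30 = q2_lmo(x_train, y_train, fraction=0.30)
ext = external_validation(x_train, y_train, x_ext, y_ext)
print(round(q2_lmo_20, 1), round(q2_lmo_30, 1), ext)
```

Reporting the left-out percentage alongside each LMO value, and the number of external objects alongside the external statistics, lets a reader judge how demanding each validation step was.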

4.05.8 Conclusions

The scientific community is showing growing interest in the QSAR field. Several chemometric methods were conceived specifically to solve QSAR problems, in response to the demand for an ever deeper understanding of chemical systems and their relationships with biological systems.

Several questions remain open and are matters of debate, such as the validation strategies needed to obtain predictive models, the interpretability of complex molecular descriptors, and the introduction of new modeling tools.

Nowadays, the need to deal with biological systems described by peptide/protein or DNA sequences, to describe proteomics maps, or to give effective answers to ecological and health problems pushes further toward new frontiers, where mathematics, statistics, chemistry, and biology and their interrelationships may produce new and useful knowledge.

References

1. Martin, Y. C. Advances in the Methodology of Quantitative Drug Design. In Drug Design, Vol. VIII; Ariens, E. J., Ed.; Academic Press: New York, NY, 1979; pp 1–72.

2. Kubinyi, H., Ed. 3D QSAR in Drug Design. Theory, Methods, and Applications; ESCOM: Leiden, The Netherlands, 1993; 760 pp.

3. Hansch, C.; Leo, A. Exploring QSAR. Fundamentals and Applications in Chemistry and Biology; American Chemical Society: Washington, DC, 1995.

4. van de Waterbeemd, H.; Testa, B.; Folkers, G., Eds. Computer-Assisted Lead Finding and Optimization; Wiley-VCH: Weinheim, Germany, 1997; 554 pp.

5. Devillers, J., Ed. Comparative QSAR; Taylor & Francis: Washington, DC, 1998; 371 pp.

6. Kubinyi, H.; Folkers, G.; Martin, Y. C., Eds. 3D QSAR in Drug Design – Vol. 3; Kluwer/ESCOM: Dordrecht, The Netherlands, 1998; 352 pp.

7. Kubinyi, H.; Folkers, G.; Martin, Y. C., Eds. 3D QSAR in Drug Design – Vol. 2; Kluwer/ESCOM: Dordrecht, The Netherlands, 1998; 416 pp.

8. Martin, Y. C. 3D QSAR: Current State, Scope, and Limitations. In 3D QSAR in Drug Design; Kubinyi, H., Folkers, G., Martin, Y. C., Eds.; Kluwer/ESCOM: Dordrecht, The Netherlands, 1998; Vol. 3, pp 3–23.

9. Charton, M.; Charton, B. I. Advances in Quantitative Structure–Property Relationships; JAI Press: Amsterdam, The Netherlands, 2002; 228 pp.

10. Gasteiger, J. Handbook of Chemoinformatics. From Data to Knowledge in 4 Volumes; Wiley-VCH: Weinheim, Germany, 2003; 1870 pp.

11. Oprea, T. I. 3D QSAR Modeling in Drug Design. In Computational Medicinal Chemistry for Drug Discovery; Bultinck, P., De Winter, H., Langenaeker, W., Tollenaere, J. P., Eds.; Marcel Dekker: New York, NY, 2004; pp 571–616.

12. Crum-Brown, A. On the Theory of Isomeric Compounds. Trans. R. Soc. Edinb. 1864, 23, 707–719.

13. Crum-Brown, A. On an Application of Mathematics to Chemistry. Proc. R. Soc. (Edinb.) 1866, VI (73), 89–90.

14. Crum-Brown, A.; Fraser, T. R. On the Connection between Chemical Constitution and Physiological Action. Part 1. On the Physiological Action of Salts of the Ammonium Bases, Derived from Strychnia, Brucia, Thebia, Codeia, Morphia and Nicotia. Trans. R. Soc. Edinb. 1868, 25, 151–203.

15. Korner, W. Studi sulla Isomeria delle Così Dette Sostanze Aromatiche a Sei Atomi di Carbonio. Gazz. Chim. Ital. 1874, 4, 242.

16. Mills, E. J. On Melting Point and Boiling Point as Related to Composition. Philos. Mag. 1884, 17, 173–187.

17. Richet, M. C. Note sur la Rapport entre la Toxicite et les Proprietes Physiques des Corps. Compt. Rend. Soc. Biol. (Paris) 1893, 45, 775–776.

18. Meyer, H. Zur Theorie der Alkoholnarkose. Arch. Exp. Pathol. Pharmacol. 1899, 42, 109–118.

19. Overton, E. Studien uber die Narkose, zugleich ein Beitrag zur allgemeinen Pharmakologie; Verlag Gustav Fischer: Jena, Germany, 1901; 141 pp.

20. Traube, I. Theorie der Osmose und Narkose. Arch. fur die ges. Physiol. 1904, 105, 541–558.

21. Wiener, H. Influence of Interatomic Forces on Paraffin Properties. J. Chem. Phys. 1947, 15, 766.

22. Platt, J. R. Influence of Neighbor Bonds on Additive Bond Properties in Paraffins. J. Chem. Phys. 1947, 15, 419–420.

23. Fujita, T.; Iwasa, J.; Hansch, C. A New Substituent Constant, π, Derived from Partition Coefficients. J. Am. Chem. Soc. 1964, 86, 5175–5180.

24. Gordon, M.; Scantlebury, G. R. Non-Random Polycondensation: Statistical Theory of the Substitution Effect. Trans. Faraday Soc. 1964, 60, 604–621.

25. Smolenskii, E. A. Application of the Theory of Graphs to Calculations of the Additive Structural Properties of Hydrocarbons. Russ. J. Phys. Chem. 1964, 38, 700–702.

26. Spialter, L. The Atom Connectivity Matrix (ACM) and Its Characteristic Polynomial (ACMCP). J. Chem. Doc. 1964, 4, 261–269.

27. Balaban, A. T.; Harary, F. Chemical Graphs. V. Enumeration and Proposed Nomenclature of Benzenoid Catacondensed Polycyclic Aromatic Hydrocarbons. Tetrahedron 1968, 24, 2505–2516.

28. Harary, F. Graph Theory; Addison-Wesley: Reading, MA, 1969.

29. Kier, L. B. Molecular Orbital Theory in Drug Research; Academic Press: New York, NY, 1971.

30. Cammarata, A. Interrelationship of the Regression Models Used for Structure–Activity Analyses. J. Med. Chem. 1972, 15, 573–577.

31. Gutman, I.; Trinajstic, N. Graph Theory and Molecular Orbitals. Total π-Electron Energy of Alternant Hydrocarbons. Chem. Phys. Lett. 1972, 17, 535–538.

32. Hosoya, H. Topological Index as a Sorting Device for Coding Chemical Structures. J. Chem. Doc. 1972, 12, 181–183.

33. Pauling, L. The Additivity of the Energies of Normal Covalent Bonds. Proc. Natl. Acad. Sci. USA 1932, 14, 414–416.

34. Pauling, L. The Nature of the Chemical Bond; Cornell University Press: Ithaca, NY, 1939.

35. Coulson, C. A. The Electronic Structure of Some Polyenes and Aromatic Molecules. VII. Bonds of Fractional Order by the Molecular Orbital Method. Proc. R. Soc. London A 1939, 169, 413–428.

36. Sanderson, R. T. Electronegativity I. Orbital Electronegativity of Neutral Atoms. J. Chem. Educ. 1952, 29, 540–546.

37. Fukui, K.; Yonezawa, Y.; Shingu, H. Theory of Substitution in Conjugated Molecules. Bull. Chem. Soc. Jpn. 1954, 27, 423–427.

38. Mulliken, R. S. Electronic Population Analysis on LCAO-MO Molecular Wave Functions. I. J. Chem. Phys. 1955, 23, 1833–1840.

39. Hammett, L. P. Reaction Rates and Indicator Acidities. Chem. Rev. 1935, 17, 67–79.

40. Hammett, L. P. The Effect of Structure upon the Reactions of Organic Compounds. Benzene Derivatives. J. Am. Chem. Soc. 1937, 59, 96–103.

41. Hammett, L. P. Linear Free Energy Relationships in Rate and Equilibrium Phenomena. Trans. Faraday Soc. 1938, 34, 156–165.

42. Taft, R. W. Polar and Steric Substituent Constants for Aliphatic and o-Benzoate Groups from Rates of Esterification and Hydrolysis of Esters. J. Am. Chem. Soc. 1952, 74, 3120–3128.

43. Taft, R. W. The General Nature of the Proportionality of Polar Effects of Substituent Groups in Organic Chemistry. J. Am. Chem. Soc. 1953, 75, 4231–4238.

44. Taft, R. W. Linear Steric Energy Relationships. J. Am. Chem. Soc. 1953, 75, 4538–4539.

45. Hansch, C.; Maloney, P. P.; Fujita, T.; Muir, R. M. Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients. Nature 1962, 194, 178–180.

46. Hansch, C.; Muir, R. M.; Fujita, T.; Maloney, P. P.; Geiger, F.; Streich, M. The Correlation of Biological Activity of Plant Growth Regulators and Chloromycetin Derivatives with Hammett Constants and Partition Coefficients. J. Am. Chem. Soc. 1963, 85, 2817–2824.

47. Free, S. M.; Wilson, J. W. A Mathematical Contribution to Structure–Activity Studies. J. Med. Chem. 1964, 7, 395–399.

48. Kubinyi, H. Free Wilson Analysis. Theory, Applications and Its Relationship to Hansch Analysis. Quant. Struct. -Act. Relat. 1988, 7, 121–133.

49. Balaban, A. T.; Harary, F. The Characteristic Polynomial Does Not Uniquely Determine the Topology of a Molecule. J. Chem. Doc. 1971, 11, 258–259.

50. Balaban, A. T., Ed. Chemical Applications of Graph Theory; Academic Press: New York, NY, 1976; 390 pp.

51. Randic, M. On the Recognition of Identical Graphs Representing Molecular Topology. J. Chem. Phys. 1974, 60, 3920–3928.

52. Randic, M. On Characterization of Molecular Branching. J. Am. Chem. Soc. 1975, 97, 6609–6615.

53. Kier, L. B.; Hall, L. H.; Murray, W. J.; Randic, M. Molecular Connectivity. I: Relationship to Nonspecific Local Anesthesia. J. Pharm. Sci. 1975, 64, 1971–1974.

54. Rohrbaugh, R. H.; Jurs, P. C. Descriptions of Molecular Shape Applied in Studies of Structure/Activity and Structure/Property Relationships. Anal. Chim. Acta 1987, 199, 99–109.

55. Stanton, D. T.; Jurs, P. C. Development and Use of Charged Partial Surface Area Structural Descriptors in Computer-Assisted Quantitative Structure–Property Relationship Studies. Anal. Chem. 1990, 62, 2323–2329.

56. Todeschini, R.; Lasagni, M.; Marengo, E. New Molecular Descriptors for 2D- and 3D-Structures, Theory. J. Chemom. 1994, 8, 263–273.

57. Katritzky, A. R.; Mu, L.; Lobanov, V. S.; Karelson, M. Correlation of Boiling Points with Molecular Structure. 1. A Training Set of 298 Diverse Organics and a Test Set of 9 Simple Inorganics. J. Phys. Chem. 1996, 100, 10400–10407.

58. Ferguson, A. M.; Heritage, T. W.; Jonathon, P.; Pack, S. E.; Phillips, L.; Rogan, J.; Snaith, P. J. EVA: A New Theoretically Based Molecular Descriptor for Use in QSAR/QSPR Analysis. J. Comput. Aided Mol. Des. 1997, 11, 143–152.

59. Schuur, J.; Selzer, P.; Gasteiger, J. The Coding of the Three-Dimensional Structure of Molecules by Molecular Transforms and Its Application to Structure-Spectra Correlations and Studies of Biological Activity. J. Chem. Inf. Comput. Sci. 1996, 36, 334–344.

60. Tuppurainen, K. EEVA (Electronic Eigenvalue): A New QSAR/QSPR Descriptor for Electronic Substituent Effects Based on Molecular Orbital Energies. SAR QSAR Environ. Res. 1999, 10, 39–46.

61. Consonni, V.; Todeschini, R.; Pavan, M. Structure/Response Correlations and Similarity/Diversity Analysis by GETAWAY Descriptors. Part 1. Theory of the Novel 3D Molecular Descriptors. J. Chem. Inf. Comput. Sci. 2002, 42, 682–692.

62. Goodford, P. J. A Computational Procedure for Determining Energetically Favorable Binding Sites on Biologically Important Macromolecules. J. Med. Chem. 1985, 28, 849–857.

63. Cramer, R. D. III; Patterson, D. E.; Bunce, J. D. Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. J. Am. Chem. Soc. 1988, 110, 5959–5967.

64. Klebe, G.; Abraham, U.; Mietzner, T. Molecular Similarity Indices in a Comparative Analysis (CoMSIA) of Drug Molecules to Correlate and Predict Their Biological Activity. J. Med. Chem. 1994, 37, 4130–4146.

65. Jain, A. N.; Koile, K.; Chapman, D. Compass: Predicting Biological Activities from Molecular Surface Properties. Performance Comparisons on a Steroid Benchmark. J. Med. Chem. 1994, 37, 2315–2327.

66. Todeschini, R.; Moro, G.; Boggia, R.; Bonati, L.; Cosentino, U.; Lasagni, M.; Pitea, D. Modeling and Prediction of Molecular Properties. Theory of Grid-Weighted Holistic Invariant Molecular (G-WHIM) Descriptors. Chemom. Intell. Lab. Syst. 1997, 36, 65–73.

67. Chuman, H.; Karasawa, M.; Fujita, T. A Novel 3-Dimensional QSAR Procedure – Voronoi Field Analysis. Quant. Struct. -Act. Relat. 1998, 17, 313–326.

68. Cruciani, G.; Pastor, M.; Guba, W. VolSurf: A New Tool for the Pharmaceutic Optimization of Lead Compounds. Eur. J. Pharm. Sci. 2000, 11 (Suppl.), S29–S39.

69. Pastor, M.; Cruciani, G.; McLay, I. M.; Pickett, S. D.; Clementi, S. GRid-INdependent Descriptors (GRIND): A Novel Class of Alignment-Independent Three-Dimensional Molecular Descriptors. J. Med. Chem. 2000, 43, 3233–3243.

70. Kubinyi, H. QSAR in Drug Design. In Handbook of Chemoinformatics; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, Germany, 2003; Vol. 4, pp 1532–1554.

71. Kohonen, T. Self-Organization and Associative Memory; Springer: Berlin, Germany, 1989.

72. Zupan, J.; Novic, M.; Gasteiger, J. Neural Networks with Counter-Propagation Learning Strategy Used for Modelling. Chemom. Intell. Lab. Syst. 1995, 27, 175–187.

73. Livingstone, D. J.; Salt, D. W. Regression Analysis for QSAR Using Neural Networks. Bioorg. Med. Chem. Lett. 1992, 2, 213–218.

74. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.

75. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, C.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958.

76. Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, 1995.

77. Worth, A. P.; Cronin, M. T. D. Embedded Cluster Modelling – A Novel Method for Analysing Embedded Data Sets. Quant. Struct. -Act. Relat. 1999, 18, 229–235.

78. Todeschini, R.; Ballabio, D.; Consonni, V.; Mauri, A.; Pavan, M. CAIMAN (Classification and Influence Matrix Analysis): A New Classification Method Based on Leverage-Scaled Functions. Chemom. Intell. Lab. Syst. 2007, 87, 3–17.

79. Sabljic, A. Predictions of the Nature and Strength of Soil Sorption of Organic Pollutants by Molecular Topology. J. Agric. Food Chem. 1984, 32, 243–246.

80. Halfon, E.; Galassi, S.; Bruggemann, R.; Provini, A. Selection of Priority Properties to Assess Environmental Hazard of Pesticides. Chemosphere 1996, 33, 1543–1562.

81. Bruggemann, R.; Pudenz, S.; Carlsen, L.; Sørensen, P. B.; Thomsen, M.; Mishra, R. K. The Use of Hasse Diagrams as a Potential Approach for Inverse QSAR. SAR QSAR Environ. Res. 2001, 11, 473–487.

82. Pavan, M.; Mauri, A.; Todeschini, R. Total Ranking Models by the Genetic Algorithms Variable Subset Selection (GA-VSS) Approach for Environmental Priority Settings. Anal. Bioanal. Chem. 2004, 380, 430–444.

83. Pavan, M.; Consonni, V.; Todeschini, R. Partial Ranking Models by Genetic Algorithms Variable Subset Selection (GA-VSS) Approach for Environmental Priority Settings. MATCH Commun. Math. Comput. Chem. 2005, 54, 583–609.

84. Gordeeva, E. V.; Molchanova, M. S.; Zefirov, N. S. General Methodology and Computer Program for the Exhaustive Restoring of Chemical Structures by Molecular Connectivity Indices. Solution of the Inverse Problem in QSAR/QSPR. Tetrahedron Comput. Method. 1990, 3, 389–415.

85. Zefirov, N.; Palyulin, V. A.; Skvortsova, M. I.; Baskin, I. I. Inverse Problems in QSAR. In QSAR and Molecular Modelling: Concepts, Computational Tools and Biological Applications; Sanz, F., Giraldo, J., Manaut, F., Eds.; Prous Science: Barcelona, Spain, 1995; pp 40–41.

86. Tarko, L.; Ivanciuc, O. QSAR Modeling of the Anticonvulsant Activity of Phenylacetanilides with PRECLAV (Property Evaluation by Class Variables). MATCH Commun. Math. Comput. Chem. 2001, 44, 201–214.

87. Kamlet, M. J.; Abboud, J.-L. M.; Taft, R. W. An Examination of Linear Solvation Energy Relationships. Prog. Phys. Org. Chem. 1981, 13, 485–630.

88. Kamlet, M. J.; Doherty, P. J.; Taft, R. W.; Abraham, M. H.; Veith, G. D.; Abraham, D. J. Solubility Properties in Polymers and Biological Media. 8. An Analysis of the Factors that Influence Toxicities of Organic Nonelectrolytes to the Golden Orfe Fish (Leuciscus idus melanotus). Environ. Sci. Technol. 1987, 21, 149–155.

89. Kamlet, M. J.; Doherty, R. M.; Abboud, J.-L. M.; Abraham, M. H.; Taft, R. W. Solubility. A New Look. Chemtech 1986, 16, 566–576.

90. Kamlet, M. J.; Abraham, M. H.; Doherty, R. M.; Taft, R. W. Solubility Properties in Polymers and Biological Media. 4. Correlations of Octanol/Water Partition Coefficients with Solvatochromic Parameters. J. Am. Chem. Soc. 1984, 106, 464–466.

91. Kamlet, M. J.; Doherty, R. M.; Carr, P. W.; Mackay, D.; Abraham, M. H.; Taft, R. W. Linear Solvation Energy Relationships. 44. Parameter Estimation Rules that Allow Accurate Prediction of Octanol/Water Partition Coefficients and Other Solubility and Toxicity Properties of Polychlorinated Biphenyls and Polycyclic Aromatic Hydrocarbons. Environ. Sci. Technol. 1988, 22, 503–509.

92. Abraham, M. H.; Ibrahim, A.; Acree, W. E. Jr. Air to Blood Distribution of Volatile Organic Compounds: A Linear Free Energy Analysis. Chem. Res. Toxicol. 2005, 18, 904–911.

93. Reinhard, M.; Drefahl, A. Handbook for Estimating Physicochemical Properties of Organic Compounds; Wiley: New York, NY, 228 pp.

94. Nys, G. G.; Rekker, R. F. Statistical Analysis of a Series of Partition Coefficients with Special Reference to the Predictability of Folding of Drug Molecules. The Introduction of Hydrophobic Fragmental Constants (f Values). Eur. J. Med. Chem. 1973, 8, 521–535.

95. Broto, P.; Moreau, G.; Vandycke, C. Molecular Structures: Perception, Autocorrelation Descriptor and SAR Studies. System of Atomic Contributions for the Calculation of the n-Octane/Water Partition Coefficients. Eur. J. Med. Chem. 1984, 19, 71–78.

96. Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional-Structure-Directed Quantitative Structure–Activity Relationships. I. Partition Coefficients as a Measure of Hydrophobicity. J. Comput. Chem. 1986, 7, 565–577.

97. Moriguchi, I.; Hirono, S.; Liu, Q.; Nakagome, I.; Matsushita, Y. Simple Method of Calculating Octanol/Water Partition Coefficient. Chem. Pharm. Bull. 1992, 40, 127–130.

98. Klopman, G.; Li, J. Y.; Wang, S.; Dimayuga, M. Computer Automated log P Calculations Based on an Extended Group Contribution Approach. J. Chem. Inf. Comput. Sci. 1994, 34, 752–781.

99. Wang, S.; Milne, G. W. A.; Klopman, G. Graph Theory and Group Contributions in the Estimation of Boiling Points. J. Chem. Inf. Comput. Sci. 1994, 34, 1242–1250.

100. Krzyzaniak, J. F.; Myrdal, P. B.; Simamora, P.; Yalkowsky, S. H. Boiling Point and Melting Point Prediction for Aliphatic, Non-Hydrogen-Bonding Compounds. Ind. Eng. Chem. Res. 1995, 34, 2530–2535.

101. Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional-Structure-Directed Quantitative Structure–Activity Relationships. 2. Modeling Dispersive and Hydrophobic Interactions. J. Chem. Inf. Comput. Sci. 1987, 27, 21–35.

102. Perrin, D. D.; Dempsey, B.; Serjeant, E. P. pKa Prediction for Organic Acids and Bases; Chapman & Hall: London, UK, 1981.

103. Klopman, G.; Wang, S.; Balthasar, D. M. Estimation of Aqueous Solubility of Organic Molecules by the Group Contribution Approach. Application to the Study of Biodegradation. J. Chem. Inf. Comput. Sci. 1992, 32, 474–482.

104. Tao, S.; Piao, H.; Dawson, R.; Lu, X.; Hu, H. Estimation of Organic Carbon Normalized Sorption Coefficient (KOC) for Soils Using the Fragment Constant Method. Environ. Sci. Technol. 1999, 33, 2719–2725.

105. Yoneda, Y. An Estimation of the Thermodynamic Properties of Organic Compounds in the Ideal Gas State. I. Acyclic Compounds and Cyclic Compounds with a Ring of Cyclopentane, Cyclohexane, Benzene or Naphthalene. Bull. Chem. Soc. Jpn. 1979, 52, 1297–1314.

106. Reid, R. C.; Prausnitz, J. M.; Poling, B. E. The Properties of Gases and Liquids; McGraw-Hill: New York, NY, 1988.

107. Atkinson, R. A Structure–Activity Relationship for the Estimation of Rate Constants for the Gas-Phase Reactions of OH Radicals with Organic Compounds. Int. J. Chem. Kinet. 1987, 19, 799–828.

108. Ertl, P.; Rohde, B.; Selzer, P. Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties. J. Med. Chem. 2000, 43, 3714–3717.

109. McFarland, J. W.; Gans, D. J. Cluster Significance Analysis: A New QSAR Tool for Asymmetric Data Sets. Drug Inf. J. 1990, 24, 705–711.

110. Rose, V. S.; Wood, J. Generalized Cluster Significance Analysis and Stepwise Cluster Significance Analysis with Conditional Probabilities. Quant. Struct. -Act. Relat. 1998, 17, 348–356.

111. Worth, A. P.; Bassan, A.; Fabjan, E.; Gallegos Saliner, A.; Netzeva, T. I.; Patlewicz, G.; Pavan, M.; Tsakovska, I. The Use of Computational Methods in the Grouping and Assessment of Chemicals – Preliminary Investigations. Eur. Tech. Rep. 2008, in press.

112. Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, Germany, 2000; 668 pp.

113. Randic, M. Molecular Bonding Profiles. J. Math. Chem. 1996, 19, 375–392.

114. ADAPT. Jurs, P. C., Pennsylvania State University (PA).

115. Mekenyan, O.; Karabunarliev, S.; Bonchev, D. The OASIS Concept for Predicting Biological Activity of Chemical Compounds. J. Math. Chem. 1990, 4, 207–215.

116. CODESSA – Reference Manual 2.0. Katritzky, A. R.; Lobanov, V. S.; Karelson, M., Gainesville (FL).

117. MolConn-Z: A Program for Molecular Topology Analysis 3. Hall Associates Consulting, Quincy (MA).

118. DRAGON (Software for molecular descriptor calculations) 5.5. Talete s.r.l., Via V. Pisani 13, Milano (Italy).

119. Testa, B.; Kier, L. B. The Concept of Molecular Structure in Structure–Activity Relationship Studies and Drug Design. Med. Res. Rev. 1991, 11, 35–48.

120. Jurs, P. C.; Dixon, J. S.; Egolf, L. M. Representations of Molecules. In Chemometrics Methods in Molecular Design; van de Waterbeemd, H., Ed.; VCH Publishers: New York, NY, 1995; Vol. 2, pp 15–38.

121. Smith, E. G.; Baker, P. A. The Wiswesser Line-Formula Chemical Notation (WLN); Chemical Information Management: Cherry Hill, NJ, 1975.

122. Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.

123. Mekenyan, O.; Ivanov, J.; Veith, G. D.; Bradbury, S. P. Dynamic QSAR: A New Search for Active Conformations and Significant Stereoelectronic Indices. Quant. Struct. -Act. Relat. 1994, 13, 302–307.

124. Mekenyan, O.; Nikolova, N.; Schmieder, P. Dynamic 3D QSAR Techniques: Applications in Toxicology. J. Mol. Struct. (Theochem) 2003, 622, 147–165.

125. Basak, S. C.; Gute, B. D.; Grunwald, G. D. Use of Topostructural, Topochemical, and Geometric Parameters in the Prediction of Vapor Pressure: A Hierarchical QSAR Approach. J. Chem. Inf. Comput. Sci. 1997, 37, 651–655.

126. Hosoya, H. Topological Index. A Newly Proposed Quantity Characterizing the Topological Nature of Structural Isomers of Saturated Hydrocarbons. Bull. Chem. Soc. Jpn. 1971, 44, 2332–2339.

127. Randic, M.; Wilkins, C. L. Graph Theoretical Ordering of Structures as a Basis for Systematic Searches for Regularities in Molecular Data. J. Phys. Chem. 1979, 83, 1525–1540.

128. Kier, L. B. A Shape Index from Molecular Graphs. Quant. Struct. -Act. Relat. 1985, 4, 109–116.

129. Randic, M. Novel Shape Descriptors for Molecular Graphs. J. Chem. Inf. Comput. Sci. 2001, 41, 607–613.

130. Wiener, H. Structural Determination of Paraffin Boiling Points. J. Am. Chem. Soc. 1947, 69, 17–20.

131. Ivanciuc, O.; Balaban, A. T. The Graph Description of Chemical Structures. In Topological Indices and Related Descriptors in QSAR and QSPR; Devillers, J., Balaban, A. T., Eds.; Gordon and Breach Science Publishers: Amsterdam, The Netherlands, 1999; pp 59–167.

132. Ivanciuc, O.; Balaban, T.-S.; Balaban, A. T. Design of Topological Indices. Part 4. Reciprocal Distance Matrix, Related Local Vertex Invariants and Topological Indices. J. Math. Chem. 1993, 12, 309–318.

133. Janezic, D.; Milicevic, A.; Nikolic, S.; Trinajstic, N. Graph Theoretical Matrices in Chemistry; University of Kragujevac: Kragujevac, Serbia, 2007; 205 pp.

134. Randic, M. Graph Theoretical Approach to Local and Overall Aromaticity of Benzenoid Hydrocarbons. Tetrahedron 1975, 31, 1477–1481.

135. Kier, L. B.; Hall, L. H. The Nature of Structure–Activity Relationships and Their Relation to Molecular Connectivity. Eur. J. Med. Chem. 1977, 12, 307–312.

136. Balaban, A. T. Highly Discriminating Distance-Based Topological Index. Chem. Phys. Lett. 1982, 89, 399–404.

137. Burden, F. R. A Chemically Intuitive Molecular Index Based on the Eigenvalues of a Modified Adjacency Matrix. Quant. Struct. -Act. Relat. 1997, 16, 309–314.

138. Raevsky, O. A.; Trepalin, S. V.; Razdol’skii, A. N. New QSAR Descriptors Calculated from Interatomic Interaction Spectra. Pharm. Chem. J. 2000, 34, 646–649.

139. Robinson, D. D.; Winn, P. J.; Lyne, P. D.; Richards, W. G. Self-Organizing Molecular Field Analysis: A Tool for Structure–Activity Studies. J. Med. Chem. 1999, 42, 573–583.

140. Buolamwini, J. K.; Assefa, H. CoMFA and CoMSIA 3D QSAR and Docking Studies on Conformationally-Restrained Cinnamoyl HIV-1 Integrase Inhibitors: Exploration of a Binding Mode at the Active Site. J. Med. Chem. 2002, 45, 841–852.

141. Xu, M.; Zhang, A.; Han, S.; Wang, L.-S. Studies of 3D-Quantitative Structure–Activity Relationships on a Set of Nitroaromatic Compounds: CoMFA, Advanced CoMFA and CoMSIA. Chemosphere 2002, 48, 707–715.

142. Jolliffe, I. T. Discarding Variables in a Principal Component Analysis. I. Artificial Data. Appl. Stat. 1972, 21, 160–173.

143. Jolliffe, I. T. Discarding Variables in a Principal Component Analysis. II. Real Data. Appl. Stat. 1973, 22, 21–31.

144. Todeschini, R. Data Correlation, Number of Significant Principal Components and Shape of Molecules. The K Correlation Index. Anal. Chim. Acta 1997, 348, 419–430.

145. Todeschini, R.; Consonni, V.; Maiocchi, A. The K Correlation Index: Theory Development and Its Applications in Chemometrics. Chemom. Intell. Lab. Syst. 1998, 46, 13–29.

146. Efroymson, M. A. Multiple Regression Analysis. In Mathematical Methods for Digital Computers; Ralston, A., Wilf, H. S., Eds.; Wiley: New York, NY, 1960.

147. Leardi, R. Application of Genetic Algorithms to Feature Selection under Full Validation Conditions and to Outlier Detection. J. Chemom. 1994, 8, 65–79.

148. Luke, B. T. Evolutionary Programming Applied to the Development of Quantitative Structure–Activity Relationships and Quantitative Structure–Property Relationships. J. Chem. Inf. Comput. Sci. 1994, 34, 1279–1287.

149. Zheng, W.; Tropsha, A. Novel Variable Selection Quantitative Structure–Property Relationship Approach Based on the k-Nearest-Neighbor Principle. J. Chem. Inf. Comput. Sci. 2000, 40, 185–194.

150. Baumann, K.; Albert, H.; von Korff, M. A Systematic Evaluation of the Benefits and Hazards of Variable Selection in Latent Variable Regression. Part I. Search Algorithm, Theory and Simulations. J. Chemom. 2002, 16, 339–350.

151. Kubinyi, H. Variable Selection in QSAR Studies. I. An Evolutionary Algorithm. Quant. Struct. -Act. Relat. 1994, 13, 285–294.

152. Agrafiotis, D. K.; Cedeno, W.; Lobanov, V. S. On the Use of Neural Network Ensembles in QSAR and QSPR. J. Chem. Inf. Comput. Sci. 2002, 42, 903–911.

153. Cedeno, W.; Agrafiotis, D. K. Using Particle Swarms for the Development of QSAR Models Based on K-Nearest Neighbor and Kernel Regression. J. Comput. Aided Mol. Des. 2003, 17, 255–263.

154. Lin, Z. H.; Xingguo, C.; Zhide, H. A New Approach for the Identification of Important Variables. Chemom. Intell. Lab. Syst. 2006, 80, 130–135.

155. Lindgren, F.; Geladi, P.; Rannar, S.; Wold, S. Interactive Variable Selection (IVS) for PLS. Part I: Theory and Algorithms. J. Chemom. 1994, 8, 349–363.

156. Lindgren, F.; Geladi, P.; Berglund, A.; Sjostrom, M.; Wold, S. Interactive Variable Selection (IVS) for PLS. Part II: Chemical Applications. J. Chemom. 1995, 9, 331–342.

157. Centner, V.; Massart, D. L.; de Noord, O. E.; De Jong, S.; Vandeginste, B. G. M.; Sterna, C. Elimination of Uninformative Variables for Multivariate Calibration. Anal. Chem. 1996, 68, 3851–3858.

158. Sutter, J. M.; Peterson, T. A.; Jurs, P. C. Prediction of Gas Chromatographic Retention Indices of Alkylbenzene. Anal. Chim. Acta 1997, 342, 113–122.

159. Akaike, H. A New Look at the Statistical Model Identification. IEEE Trans. Automat. Contr. 1974, AC-19, 716–723.

160. Friedman, J. H. Multivariate Adaptive Regression Splines; Report; Laboratory of Computational Statistics – Department of Statistics: Stanford, CA.

161. Kubinyi, H. Evolutionary Variable Selection in Regression and PLS Analyses. J. Chemom. 1996, 10, 119–133.

162. Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. Detecting ‘Bad’ Regression Models: Multicriteria Fitness Functions in Regression Analysis. Anal. Chim. Acta 2004, 515, 199–208.

163. Golbraikh, A.; Tropsha, A. Beware of q2! J. Mol. Graph. Model. 2002, 20, 269–276.

164. Tropsha, A.; Gramatica, P.; Gombar, V. K. The Importance of Being Earnest: Validation Is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb. Sci. 2003, 22, 69–77.

165. Sutherland, J. J.; Weaver, D. F. Development of Quantitative Structure–Activity Relationships and Classification Models for Anticonvulsant Activity of Hydantoin Analogues. J. Chem. Inf. Comput. Sci. 2003, 43, 1028–1036.

166. van Rhee, A. M. Use of Recursion Forest in the Sequential Screening Process: Consensus Selection by Multiple Recursion Trees. J. Chem. Inf. Model. 2003, 43, 941–948.

167. Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. MOBYDIGS: Software for Regression and Classification Models by GeneticAlgorithms. In Chemometrics: Genetic Algorithms and Artificial Neural Networks; Leardi, R., Ed.; Elsevier: Amsterdam, TheNetherlands, 2003; pp 141–167.

168. Todeschini, R.; Consonni, V.; Pavan, M. A Distance Measure between Models: A Tool for Similarity/Diversity Analsysis of ModelPopulations. Chemom. Intell. Lab. Syst. 2004, 70, 55–61.

169. Gramatica, P.; Pilutti, P.; Papa, E. Validated QSAR Prediction of OH Tropospheric Degradation of VOCs: Splitting into Training-Test Sets and Consensus Modeling. J. Chem. Inf. Comput. Sci. 2004, 44, 1794–1802.

170. Asikainen, A. H.; Ruuskanen, J.; Tuppurainen, K. A. Consensus kNN QSAR: A Versatile Method for Predicting the EstrogenicActivity of Organic Compounds In Silico. A Comparative Study with Five Estrogen Receptors and a Large, Diverse Set ofLigands. Environ. Sci. Technol. 2004, 38, 6724–6729.

171. Baurin, N.; Mozziconacci, J. C.; Arnoult, E.; Chavatte, P.; Marot, C.; Morin-Allory, L. 2D QSAR Consensus Prediction for High-Throughput Virtual Screening. An Application to COX-2 Inhibition Modeling and Screening of the NCI Database. J. Chem. Inf.Comput. Sci. 2004, 44, 276–285.

172. Gramatica, P.; Giani, E.; Papa, E. Statistical External Validation and Consensus Modeling: A QSPR Case Study for Koc

Prediction. J. Mol. Graph. Model. 2007, 25, 755–766.173. Votano, J. R.; Parham, M.; Hall, L. H.; Kier, L. B.; Oloff, S.; Tropsha, A.; Xie, Q.; Tong, W. Three New Consensus QSAR Models

for the Prediction of Ames Genotoxicity. Mutagenesis 2004, 19, 365–377.174. Eriksson, L.; Jaworska, J. S.; Worth, A. P.; Cronin, M. T. D.; McDowell, R. M.; Gramatica, P. Methods for Reliability, Uncertainty

Assessment, and Applicability Evaluations of Regression Based and Classification QSARs. Environ. Health Perspect. 2003, 111,1361–1375.

175. Zefirov, N. S.; Palyulin, V. A. QSAR for Boiling Points of ‘Small’ Sulfides. Are the ‘High-Quality Structure-Property-Activity Regressions’ the Real High Quality QSAR Models? J. Chem. Inf. Comput. Sci. 2001, 41, 1022–1027.

176. Jaworska, J. S.; Nikolova-Jeliazkova, N.; Aldenberg, T. Review of Methods for Applicability Domain Estimation; Report; The European Commission – Joint Research Centre: Ispra, Italy.

177. Dimitrov, S.; Dimitrova, G.; Pavlov, T.; Dimitrova, N.; Patlewicz, G.; Niemela, J.; Mekenyan, O. A Stepwise Approach for Defining the Applicability Domain of SAR and QSAR Models. J. Chem. Inf. Model. 2005, 45, 839–849.

178. Jaworska, J. S.; Nikolova-Jeliazkova, N.; Aldenberg, T. QSAR Applicability Domain Estimation by Projection of the Training Set in Descriptor Space: A Review. ATLA 2005, 33, 445–459.

179. Netzeva, T. I.; Worth, A. P.; Aldenberg, T.; Benigni, R.; Cronin, M. T. D.; Gramatica, P.; Jaworska, J. S.; Kahn, S.; Klopman, G.; Marchant, C. A.; Myatt, G.; Nikolova-Jeliazkova, N.; Patlewicz, G.; Perkins, R.; Roberts, D. W.; Schultz, T. W.; Stanton, D. T.; van de Sandt, J. J. M.; Tong, W. D.; Veith, G. D.; Yang, C. H. Current Status of Methods for Defining the Applicability Domain of (Quantitative) Structure–Activity Relationships. ATLA 2005, 33, 155–173.

180. Nikolova-Jeliazkova, N.; Jaworska, J. S. An Approach to Determining Applicability Domains for QSAR Group Contribution Models: An Analysis of SRC KOWWIN. ATLA 2005, 33, 461–470.

181. Tetko, I. V.; Bruneau, P.; Mewes, H.-W.; Rohrer, D. C.; Poda, G. I. Can We Estimate the Accuracy of ADME-Tox Predictions? Drug Discov. Today 2006, 11, 700–707.

182. Gramatica, P. Principles of QSAR Models Validation: Internal and External. QSAR Comb. Sci. 2007, 26, 694–701.

183. Papa, E.; Villa, F.; Gramatica, P. Statistically Validated QSARs, Based on Theoretical Descriptors, for Modeling Aquatic Toxicity of Organic Chemicals in Pimephales promelas (Fathead Minnow). J. Chem. Inf. Model. 2005, 45, 1256–1266.

184. Nikolova, N.; Jaworska, J. S. Approaches to Measure Chemical Similarity – A Review. QSAR Comb. Sci. 2003, 22, 1006–1026.

185. Efron, B. The Jackknife, the Bootstrap and Other Resampling Plans; Society for Industrial and Applied Mathematics: Philadelphia, PA, 92 pp.

186. Cramer, R. D. III; Bunce, J. D.; Patterson, D. E.; Frank, I. E. Crossvalidation, Bootstrapping and Partial Least Squares Compared with Multiple Regression in Conventional QSAR Studies. Quant. Struct.-Act. Relat. 1988, 7, 18–25.

187. Wold, S. Validation of QSAR’s. Quant. Struct.-Act. Relat. 1991, 10, 191–193.

188. Wold, S.; Eriksson, L. Statistical Validation of QSAR Results. Validation Tools. In Chemometrics Methods in Molecular Design; van de Waterbeemd, H., Ed.; VCH Publishers: Weinheim, Germany, 1995; Vol. 2, pp 309–318.

189. Burden, F. R.; Brereton, R. G.; Walsh, P. T. A Comparison of Cross-Validation and Non-Cross-Validation Techniques: Application to Polycyclic Aromatic Hydrocarbons Electronic Absorption Spectra. Analyst 1997, 122, 1015–1022.

190. Golbraikh, A.; Shen, M.; Xiao, Z.; Xiao, Y.-D.; Lee, K.-H.; Tropsha, A. Rational Selection of Training and Test Sets for the Development of Validated QSAR Models. J. Comput. Aided Mol. Des. 2003, 17, 241–253.

191. Baumann, K. Cross-Validation as the Objective Function for Variable-Selection Techniques. Trends Analyt. Chem. 2003, 22, 395–406.

192. Lanteri, S. Full Validation Procedures for Feature Selection in Classification and Regression Problems. Chemom. Intell. Lab. Syst. 1992, 15, 159–169.

193. Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictors. J. R. Stat. Soc. 1974, B 36, 111–147.

194. Wold, S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics 1978, 20, 397–405.

195. Osten, D. W. Selection of Optimal Regression Models via Cross-Validation. J. Chemom. 1988, 2, 39.

196. Miller, A. J. Subset Selection in Regression; Chapman & Hall: London, UK, 1990; 230 pp.

197. Efron, B. Better Bootstrap Confidence Intervals. J. Am. Stat. Assoc. 1987, 82, 171–200.

198. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 26, 123–140.

199. Clark, R. D. Boosted Leave-Many-Out Cross-Validation: The Effect of Training and Test Set Diversity on PLS Statistics. J. Comput. Aided Mol. Des. 2003, 17, 265–275.

200. Guha, R.; Serra, J. R.; Jurs, P. C. Generation of QSAR Sets with a Self-Organizing Map. J. Mol. Graph. Model. 2004, 23, 1–14.

201. Snarey, M.; Terrett, N. K.; Willett, P.; Wilton, D. J. Comparison of Algorithms for Dissimilarity-Based Compound Selection. J. Mol. Graph. Model. 1997, 15, 372–385.

202. Golbraikh, A.; Tropsha, A. Predictive QSAR Modeling Based on Diversity Sampling of Experimental Datasets for the Training and Test Set Selection. Mol. Divers. 2002, 5, 231–243.

203. Wu, W.; Walczak, B.; Massart, D. L.; Heuerding, S.; Erni, F.; Last, I. R.; Prebble, K. A. Artificial Neural Networks in Classification of NIR Spectral Data: Design of the Training Set. Chemom. Intell. Lab. Syst. 1996, 33, 35–46.

204. Gramatica, P.; Pilutti, P.; Papa, E. Approaches for Externally Validated QSAR Modelling of Nitrated Polycyclic Aromatic Hydrocarbon Mutagenicity. SAR QSAR Environ. Res. 2007, 18, 169–178.

205. Clark, M.; Cramer, R. D. III The Probability of Chance Correlation Using Partial Least Squares (PLS). Quant. Struct.-Act. Relat. 1993, 12, 137–145.

206. Baumann, K.; Stiefl, N. Validation Tools for Variable Subset Regression. J. Comput. Aided Mol. Des. 2004, 18, 549–562.

207. Nicholls, A.; MacCuish, N. E.; MacCuish, J. D. Variable Selection and Model Validation of 2D and 3D Molecular Descriptors. J. Comput. Aided Mol. Des. 2004, 18, 451–474.

208. Lindgren, F.; Hansen, B.; Karcher, W.; Sjostrom, M.; Eriksson, L. Model Validation by Permutation Tests: Applications to Variable Selection. J. Chemom. 1996, 10, 521–532.

209. Clark, R. D.; Fox, P. C. Statistical Variation in Progressive Scrambling. J. Comput. Aided Mol. Des. 2004, 18, 563–576.

210. Rucker, C.; Rucker, G.; Meringer, M. y-Randomization and Its Variants in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345–2357.

211. Livingstone, D. J. The Characterization of Chemical Structures Using Molecular Properties. A Survey. J. Chem. Inf. Comput. Sci. 2000, 40, 195–209.

Biographical Sketches

Roberto Todeschini is full professor of chemometrics at the Department of Environmental Sciences of the University of Milano–Bicocca (Milano, Italy), where he established the Milano Chemometrics and QSAR Research Group. His main research activities concern chemometrics in all its aspects, QSAR, molecular descriptors, multicriteria decision making, and software development. He is President of the International Academy of Mathematical Chemistry, President of the Italian Chemometric Society, and ‘ad honorem’ professor of the University of Azuay (Cuenca, Ecuador). He is author of more than 150 publications in international journals and of the books The Data Analysis Handbook, by I. E. Frank and R. Todeschini; Elsevier, 1994 and Handbook of Molecular Descriptors, by R. Todeschini and V. Consonni; Wiley-VCH, 2000.

Viviana Consonni received her Ph.D. in Chemical Sciences from the University of Milano in 2000 and is now full researcher of chemometrics at the Department of Environmental Sciences of the University of Milano–Bicocca (Milano, Italy). She is a member of the Milano Chemometrics and QSAR Research Group and has 10 years of experience in multivariate analysis, QSAR, molecular descriptors, multicriteria decision making, and software development. She is author of more than 25 publications in peer-reviewed journals and of the book Handbook of Molecular Descriptors, by R. Todeschini and V. Consonni; Wiley-VCH, 2000. She received an Award for distinguished young researchers from the International Academy of Mathematical Chemistry in 2006.

Paola Gramatica is full professor of Environmental Chemistry, and formerly Associate Professor of Organic Chemistry, at the University of Insubria (Varese, Italy). She has been the head of the QSAR Research Unit in Environmental Chemistry and Ecotoxicology since 1995, at the Department of Structural and Functional Biology (DBSF), now under her direction. Her present research field is QSAR modeling and the application of chemometric methods to environmental organic pollutants. Recent studies deal with tropospheric oxidations of VOC, POP persistence, pesticide partition properties, PAH mutagenicity, BCF, and endocrine disruptor (ED) modeling. Her main field of interest concerns persistent, bioaccumulative, and toxic (PBT) chemicals and the validation of QSAR models. She is author of more than 100 papers in international journals (more than 60 in the QSAR field) and about 200 presentations at meetings. She is a member of the Managing Boards of the Environmental and Cultural Heritage Division of the Italian Chemical Society (SCI) and of the SCI Interdivisional Group of Green Chemistry, and also a member of the OECD Expert Group on QSARs.
