QSAR Design of Discovery Libraries for Solids Based on QSAR Models 2005 QSAR and rial Science

8/14/2019 QSAR Design of Discovery Libraries for Solids Based on QSAR Models 2005 QSAR and rial Science

1/16

Design of Discovery Libraries for Solids Based on QSAR ModelsD. Farrusseng a, C. Klanner b, L. Baumes a, M. Lengliz a, C. Mirodatos a, F. Schth b*a Institut de Recherches sur la Catalyse CNRS 2, Av. A. Einstein, F-69626 Villeurbanne, Franceb MPI fr Kohlenforschung, Kaiser-Wilhelm-Platz 1, D-45470 Mlheim, Germany, E-mail: [email protected]

Keywords: Combinatorial chemistry, High-throughput screening, Material library design, Virtualscreening, Descriptor, Data mining, Heterogeneous catalysis

Received: 17. 09. 2004; Accepted: 27. 10. 2004

AbstractA method is described which is used to construct a descriptor vector of solid catalysts inthe oxidation of propene. Different methods are described which allow one to construct acorrelation between characteristics of the catalysts and their performance in propeneoxidation. Successful descriptor vectors are generated which predict catalytic performancesubstantially better than statistically expected. These descriptor vectors do not containexplicit information on the elemental composition of the catalysts any more, but onlyparameters that are either derived from the elemental composition, such as the enthalpyof oxide formation, or are related to the synthetic method. The general concept canprobably be extended to the development of descriptors for solids to be used in otherapplications as well.

1 Introduction

Over only a few years, the Quantitative Structure Activi-ty Relationship (QSAR) approach has substantiallychanged the process of drug discovery, because it enablesone to select an adapted and as opposed to a random se-lection improved subset of experiments among an infin-ite number of candidate molecules by applying virtualscreening techniques. In addition, the number of com-pounds that can be taken into consideration is increasedby orders of magnitude with respect to a conventionalstrategy. This helps to increase the chance of discoverieswhile saving experiments by discarding compounds beforethey have to be synthesized. Depending on whether thegoal is to design discovery or targeted libraries, differentcriteria for library design are applied. For a discovery li-brary, samples that constitute the subset should be as di-verse as possible and the set should cover the whole searchspace. For a targeted library, on the other hand, one seeks

drug candidates that are similar to a lead structure. Thus,the diversity profiling of drugs/molecules is the key con-cept in the design of libraries of molecules. It relies on thesimilar property principle, that is, the assumption thatstructurally similar molecules should have similar biologi-cal activities. The similarity/diversity of two molecules canbe assessed by measuring a distance between them ac-cording to different methods, such as, for instance, Euclidi-an metrics or Tanimoto indices. It is clear that the distancecalculation depends strongly on the coordinates of the twomolecules, which in turn rely on the way they have beenencoded. Molecules and drugs, which can be described at

the atomic level, can be encoded as fingerprints that ena-ble their molecular features or molecular characteristics tobe captured through the so-called descriptors. These de-scriptors form a vector of variables, so that a library of compounds can be described by a tabular array where arow represents a molecule and the columns the differentdescriptors. Molecules can be described by a vast numberof descriptors that are related to structural features or mo-lecular properties. Commonly, property descriptors aresub-divided into one-, two-, and three-dimensional (1D,2D, and 3D, respectively) descriptors, indicating the re-quired type of structural representation of a molecule forits calculation. 1D descriptors include bulk properties andphysiochemical parameters, for instance, molecularweight. Properties that can be computed from a 2D-struc-ture representation include, for example, defined structur-al fragments and connectivity indices. 3D properties are,for instance, the solvent-accessible surface area, molecularvolumes, or spatial pharmacophores. After selection of the

relevant descriptors, the diversity of any two moleculescan be assessed by computing the distance between them,based on the descriptors.

The combinatorial approach has recently been extendedfrom drug discovery to materials science and catalysis [1 6]. Unfortunately, the QSAR approach cannot be equallywell transferred to the discovery of materials, because (i)solids can hardly be characterized at the atomic scale,which is a serious obstacle for fingerprint encoding, asmentioned above for molecules, and (ii) the similar prop-erty principle is not generally applicable to solids in mate-rials science [7]. Hence, truly QSAR and virtual screening

78 QSAR Comb. Sci. 2005 , 24 DOI: 10.1002/qsar.200420066 2005 WILEY-VCH Verlag GmbH& Co. KGaA, Weinheim

Full Papers


2/16

methods based on descriptors have not been applied tocatalysts so far, even if data-mining methods have been re-cently used [8 13]. Therefore, there is no method todayfor the library design of solids that can guarantee that theproperties of the solids will be diverse. It is the high com-plexity of solids, as compared to molecules or drugs, which

makes the design of libraries a serious challenge [14]. Oneof the main hurdles is the description of a solid, especially,if the solid has not been synthesized in reality and, there-fore, no characterization results are available.

The motivation of this work is to demonstrate for thefirst time an implementation strategy for the building of aQSAR analogue model for solids. Since the solids are notrepresented as structures, in the following, the term QPAR(Quantitative Property Activity Relationship) will beused. The methodology to generate and select relevant de-scriptors is described. Different QPAR models are com-pared and discussed with respect to previous knowledgeand the available literature. A short account on some as-

pects of this work has been given recently [15].

2 Materials and Methods

2.1 Workflow Overview

Figure 1 gives a general description of the workflow usedin this study, which, however, is generic and should be ap-plicable to other problems as well. The data-processingsteps to generate the catalyst descriptors are shown on theleft side in Figure 1, while the workflow to assess and clas-

sify the catalytic behavior experimentally is depicted onthe right side. The whole process consists of the followingstages: (1) collection and synthesis of a diverse library of solids, in which the selection is based upon experience andchemical intuition; (2) description of these solids by nu-merous attributes; (3) selection of a subset of relevant and

uncorrelated attributes to create a first possible descriptorvector; (4) high-throughput (HT) testing of the whole li-brary of solids in a catalytic reaction and classification of the catalysts in distinct classes of performance, resulting inclusters containing solids which exhibit similar catalyticproperties; and (5) computing QPAR models between de-scriptor vectors and catalytic performances, where in thisprocess the descriptor vectors are further modified. De-tails of each step are reported in the following sections.

2.2 Sample Collection and HT Synthesis

For the first stage, we collected approximately 500 differ-

ent solids with the aim to cover a wide chemical space.Hence, samples had to be as diverse as possible with re-spect to the elements, the material types, and the synthesisprocedures, which also means that compounds were in-cluded in the study for which it was clear that they were es-sentially inactive in the target reaction, such as silica. Thisinitial diverse library was created based on a priori knowl-edge and intuition. A first selection of 367 catalysts wascarried out among solids already available in our laborato-ries. Then, 100 additional solids were synthesized by meansof HT equipment in order to expand the chemical spacewith respect to the element and support representation.

QSAR Comb. Sci. 2005 , 24 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim 79

Figure 1. General workflow for the development of descriptors for solid catalysts; for explanation, see text.

Design of Discovery Libraries for Solids Based on QSAR Models


3/16

Thus, we ensure that most of the elements of the periodictable (excluding exotics) were represented, and that theoccurrence of each element was well distributed over thelibrary. As it is far beyond the scope of this work, the prep-aration method of each solid catalyst will not be discussedin detail. However, synthetic methods include impregna-

tion, ion exchange, precipitation, co-precipitation, deposi-tion precipitation, the activated-carbon route [16], sol-gelsynthesis, and others.

2.3 Computation of Catalyst Attributes

A Microsoft Access database was implemented to recordthe description of the catalyst synthesis and to perform inan automatic manner the calculation of thousands of at-tributes (also called meta data) for each catalyst. A varietyof different information is recorded in the database, name-ly, (1) synthesis parameters of the catalysts, (2) elementalcomposition of the catalysts, (3) properties of constituent

elements, (4) properties of constituent-element oxides, and(5) properties of constituent-element ions (Figure 2). Theentries concerning the properties of elements, element ox-ides, and element ions were collected from the Handbookof Chemistry and Physics [17] and other sources of physi-co-chemical data. The workflow that describes the combi-natorial generation of the attributes is shown in Figure 3.By combining the elemental composition of a catalyst andthe respective properties of the elements, ions, or oxides,3100 attributes were computed in a combinatorial mannerusing operators such as the average of a certain propertyfor the constituents of the catalyst, the maximum, the min-imum, weighted averages, and so on. For instance, fromthe enthalpies of formation of the element oxides, one cancompute the average enthalpy of formation, or the spreadbetween the highest and the lowest enthalpy of formation.Such values should, in some complex way, be related tothe availability of oxygen at the surface of a solid catalyst,and thus to the performance in an oxidation reaction. Themotivation to generate a vast number of attributes is thefact that relevant and discriminative attributes are a priorinot known, and obviously the relationship between prop-erties of a catalyst and its performance is not simple.

Finally, different types of attribute sets were generated:X1 accounted for the composition of all catalysts (60 in to-

tal), X2 consists of parameters calculated on the basis of X1 and physical data (3100 in total), and X3 contains syn-thesis parameters of the solid catalysts. The last attributeset consists of 19 categorical (mostly binary) variableswhich provide information on the last synthesis step,namely main synthesis parameters, precursor types, addi-tive types, etc.

2.4 Search Space Reduction the Descriptor Vector

The number of attributes must be reduced since there areno modeling techniques available which are able to deal

with 3000 attributes. We assumed that a set of around 75attributes would be a maximum with respect to the num-ber of catalysts (467), otherwise the risk of self learningwould be too high. The task thus consists of selecting dis-criminative attributes that are not strongly correlated be-tween each others. A ranking of the 3119 attributes (X2

and X3 attribute sets), based on their discriminative pow-er, was carried out using the Relief algorithm [18]. Then,56 attributes were selected from the 3100 X2 attributeswith respect to their ranking and in order to map differentproperties (Table 1). This ensemble of 56 X2 attributesplus the 19 X3 attributes form a first descriptor vectorwhich is used as input for the QPAR model (and whichwas further reduced during the model building). Note thatthis descriptor vector does not contain information on theelemental composition any more. This only enters the vec-tor via the computed values that are correlated to the ele-ments present in the catalyst.

2.5 HT Testing and Clustering of Catalytic Performance

The library of solids has been tested in the gas-phase oxi-dation reaction of propene with oxygen. This reaction of-fers the advantage to provide a wide spectrum of possibleproducts from C1 to C6 including a number of alkenes andoxygenates. A complete list is given in Table 2. Data analy-sis serves to classify the performances of all catalysts into asmall number of distinct classes that are representative of typical catalytic behavior. An automated HT set-up hasbeen used to evaluate the performance of each catalyst. Itconsists of a set of mass-flow controllers for oxygen, pro-pene, and nitrogen, which feed the reagents via a commonline into a 16-fold plug flow reactor. The principle set-upof this system corresponds to the one described in Ref.[19]. A gas chromatograph equipped with a capillary col-umn and flame-ionization detector in combination with amethanizer has been used for analysis. Catalytic tests werecarried out with a gas consisting of 1% propene, 5% oxy-gen (slightly over-stoichiometric for full oxidation to waterand carbon dioxide), and nitrogen as balance, at a spacevelocity of 225 mL h 1 (gcat) 1 at five different tempera-tures (200, 250, 300, 400, and 500 8 C). Catalyst was used asgrains of between 250 and 500 m m in size. Propylene con-version and the selectivity to 27 products were determined

for each catalyst. Each test was repeated twice. Each fullcycle to analyze 16 catalysts needs 8 h. This means that thecatalysts were not analyzed at the same time on stream.However, the activation or deactivation information wasalso taken into account to some extent by calculating thedifference between the propene conversion of the first andthe second measurement. Using this procedure, the per-formance of the 467 catalysts was described by 120 varia-bles: propene conversion, 21 selectivities, deactivation be-havior, and mass balance (a measure for possible cokingor residue formation on the catalysts) at five different tem-peratures. Because of obvious strong correlations in prod-

80 2005 WILEY-VCH Verlag GmbH & C o. KGaA, Weinheim QSAR Comb. Sci. 2005 , 24

Full Papers D. Farrusseng et al.


4/16

uct selectivities accompanied with low variance, somecomposite variables were constructed by the combinationof variables to reduce the complexity of the problem: Thevariables mass balance at all temperatures and temporalbehavior at 200 8 C, 250 8 C, and 300 8 C have been discarded,as their information content was extremely low. It is ex-

pected, that this will be different for other reaction condi-tions under net reducing conditions, where coke formationcan be a serious problem. 17 variables with high informa-tion content have been taken directly for analysis: conver-sion, selectivity to CO and to CO 2 at all five temperatures,and temporal behavior at 400 8 C and 500 8 C. Four variables


Figure 2. Simplified scheme of the database for solid catalysts.



5/16

with medium information content were taken into ac-count by summing up the respective variable for all five

temperatures giving formaldehyde all temperatures , acetalde-hyde all temperatures , acrolein all temperatures , and benzene all temperatures .Variables that were assumed to be important from a chem-ical point of view, but had a low variance, were taken intoaccount by grouping products with similar chemical func-tion or chemical structure The values of the respectivegroups for all temperatures were summed up, givingS C3H 6O all temperatures (propionaldehyde and acetone), Sacids all temperatures (acetic acid and acrylic acid), S alkanes alltemperatures (ethane, butane, and pentane), S C4 hydro-carbons all temperatures (1-butene, 2-butene, and 2-methylpro-pene), S C5 hydrocarbons all temperatures (2-methyl-1-butene, 2-

methyl-2-butene, and pentenes), and S C6 hydro-carbons all temperatures (S C6H 12). Then, a principal component

analysis (PCA) was carried out to decrease the dimension-ality and to orthogonalize the dataset. From the scores onthe PCs axis, distinct catalytic classes were generated bymeans of hierachical clustering (Ward s distance) and thek-means technique as implemented in Statistica 6.1.

2.6 QPAR Model

The task here consists of modeling the correlation betweenthe descriptors (56 19) and the performance clusters,which represent typical catalytic behavior. In other words,we seek classification models that enable us to assign a cat-


Figure 3. Scheme for the generation of the attribute set X2. For the elements only one operation was necessary, for oxides and ionstwo steps were needed; first the different states for one element had to be taken into account and in the second step the differentconstituents in the catalyst were generated. (SD standard deriation).



6/16


Table 1. A list of the 56 attributes from the attribute set X2 used for the correlation. 1 24 are element properties and 25 56 are ox-ide properties. Since often more than one oxide exists, an additional operation to calculate the value for an element is necessary, forinstance, the mean of the densities of different oxides for one element.

Number Code Property Calculation for cata-lyst

1 meanbc_1ie first ionization energy mean all metals and semi-metals

2 meanbc_ar atomic radius mean all metals and semi-metals3 difec_ar difference from highest to lowest value4 meanbc_bseo bond strength element oxygen mean all metals and semi-metals5 difec_bseo difference from highest to lowest value6 meanbc_bsee bond strength element e lement mean all metals and semi-metals7 difec_bsee difference from highest to lowest value8 meanec_ea electron affinity mean all elements9 meanbc_ea mean all metals and semi-metals

10 difec_ea difference from highest to lowest value11 meanec_pe Pauling electronegativity mean all elements12 meanbc_pe mean all metals and semi-metals13 difec_pe difference from highest to lowest value14 minec_nffefmsmo normalized formation free-enthalpy for most stable

metal oxideminimum value all elements

15 maxec_nffefmsmo maximum value all elements16 minec_sedmsmoom smallest formation free-enthalpydifference from

the most stable metal oxide to another metal oxideminimum value

17 maxec_sedmsmoom maximum value all elements18 wmec_ms molar mass weighted mean all elements19 meanec_ms mean all elements20 meanbc_ms mean all metals and semi-metals21 wmbc_no number of element oxides weighted mean all metals and half metals22 minec_no minimum value all elements23 maxec_no maximum value all elements24 nvec number of elements in catalyst number all elements

Number Code Property Calculation for ele-ment

Calculation for cata-lyst

25 meanec_moe_d density of oxides mean mean all elements

26 meanec_moe_dc dielectric constant of oxides mean mean all elements27 meanec_meanvho-

soe_ffeformation free-enthalpy of oxides mean value for the

highest oxidation statemean all elements

28 meanec_difoe_ffe difference highest tolowest value

mean all elements

29 meanec_moe_ffe mean mean all elements30 meanec_meanvho-

soe_mpmelting point of oxides mean value for the

highest oxidation statemean all elements

31 meanec_moe_mp mean mean all elements32 meanec_difoe_mp difference highest to

lowest valuemean all elements

33 meanec_nvoe_os oxidation states of oxides number mean all elements34 sumec_nvoe_os number sum all elements35 meanbc_mie_cn coordination number of ions mean mean all metals and

semi-metals36 meanbc_difie_cn difference highest to

lowest valuemean all metals andsemi-metals

37 meanec_mie_cn mean mean all elements38 meanec_difie_cn difference highest to


39 meanbc_mie_icp ionic covalency parameter of ions mean mean all metals andsemi-metals

40 meanbc_difie_icp difference highest tolowest value

mean all metals andsemi-metals

41 meanec_mie_icp mean mean all elements42 meanec_difie_icp difference highest to




7/16

alyst to a specific cluster. The model quality is assessed bymeans of prediction-rate criteria. Two classification tech-niques were used: Artificial Neural Networks (ANN) andClassification tree as implemented in Statistica 6.1.

In the search for appropriate ANN models, both Multi-Layer Perceptron (MLP) and Probabilistic Neural Net-works (PNN) were applied using the Intelligent Solver of Statistica. In order to get robust models, we have enabledthe solver to discard attributes during the screening of theneural networks. In addition, after models were built, apruning was performed for discarding irrelevant variables.Pruning refers to a sensitivity analysis, e.g., a ranking of

variables based on their discriminative power. The datasetof 467 catalysts was randomly divided into three subsets.The learning step was performed on one half of the wholedataset, the verification step on one quarter, and the inde-pendent testing step on the remaining quarter.

For the Classification tree models, we used the C&RTmethod with the Gini criteria as splitting conditions andFACT-style direct stopping as pruning rule (Statistica 6.1).Cross-validation was carried out on one third of the data-set to ensure that models were not prone to overlearning.


Table 1. (cont.)

Number Code Property Calculation for ele-ment

Calculation for cata-lyst

43 meanbc_mie_ir ionic radius of ions mean mean all metals andsemi-metals

44 meanbc_difie_ir difference highest tolowest value


45 meanec_mie_ir mean mean all elements46 meanec_difie_ir difference highest to


47 meanbc_mie_l optical basicity of ions mean mean all metals andsemi-metals

48 meanbc_difie_l difference highest tolowest value


49 meanec_mie_l mean mean all elements50 meanec_difie_l difference highest to


51 meanec_maxvhosie_os oxidation states of Ions maximum value forthe highest oxidationstate

mean all elements

52 meanec_difie_os difference highest tolowest value

mean all elements

53 sumec_sumie_os sum sum all elements54 sumec_nvie_os number sum all elements55 sumec_minvhosie_os minimum value for

the highest oxidationstate

sum all elements

56 sumec_maxvhosie_os maximum value forthe highest oxidationstate

sum all elements

Table 2. Products of the reaction clearly identified by gas chromatography.

C1 methane C 3 propene C 4 butane C 6 hexaneformaldehyde allyl alcohol 2-methylpropene cyclohexanecarbon monoxide propylene oxide 1-butene benzenecarbon dioxide acetone

propionaldehyde2-butene

C2 ethane acrolein C 5 pentaneacetaldehyde acrylic acid 2-methyl-1-buteneacetic acid 2-methyl-2-butene



8/16

2.7 Quality Assessment of QPAR Models

The assessment was performed by comparing the predic-tions made by the model with our observations. The resultsare reported in a so-called confusion matrix. This table re-veals the number of correctly classified catalysts, how

many catalysts were misclassified, and for which classesthe misclassification occurred. The prediction rate is de-rived from the confusion matrix to estimate the quality of the prediction. It accounts for the correctly classified casesin the respective predicted class and can be considered asstatistical benchmark for the quality of the prediction,since it can directly be compared to the ratio given inthe confusion matrix. This ratio gives the statistical distri-bution probabilities into the respective classes for the orig-inal dataset. Hence, a model becomes meaningful whenthe prediction rates are higher than the corresponding ra-tios.

3 Results

3.1 Feature Selection Descriptor Vector

In order to assess the relevance of the different attributes,the Relief algorithm was used, which indicates the discrim-inative power of the attributes with respect to the differentclasses of catalytic behavior. Among the 3119 attributes, 5attributes describing the synthesis procedure are by far themost relevant and 11 attributes related to the synthesis arein the top 50. Therefore, all 19 synthesis attributes from at-tribute set X3 were selected, as their influence on the cata-lytic performance was estimated to be very strong. Thisalso corresponds to experience in the field of catalysis,where it is known that the synthesis method to create acatalyst is of paramount importance in determining theperformance. On the other hand, clear trends of the rele-vance of the continuous attributes from the attribute setX2 can hardly be identified. Nevertheless, it was noticea-ble that attributes calculated with simple operators such asthe minimum of a set of values for a given catalyst, themaximum, or the average have more discriminative powerthan other, more complex operators such as standard devi-ation. Attributes generated by the complex operators

where thus discarded. In order to reduce the number of at-tributes, strongly correlated attributes (e.g., melting pointand boiling point of elements) and attributes based onproperties that were only accessible for few elements (forinstance, heat capacity of element oxides) were discarded.Finally, among the rest we decided to pick 56 attributesfrom the X2 attribute set according to their ranking, in or-der to map all defined properties. The list of continuousdescriptors is reported in Table 1. A PCA analysis on the56 continuous attributes was carried out in order to get in-sights on the features of the dataset. The Eigenvalue plotshows that up to eight PCs are required to explain 75% of

the variance, which indicates the high dimensionality of the search space (Figure 4). In addition, the loading plotsPC1 vs. PC2 and PC3 versus PC4 enable us to visualize thefairly good covering of the variable search space for the at-tributes which contain most information (Figures 5a and

5b). From the distribution study of the attributes one canextract that most of them follow a normal distribution, asshown for the example of the attribute meanecmie_l (#49)reported in Figure 6a. This is in contrast with the attributeset X1, which accounts for the elemental compositions(Figure 6b). Indeed, the box and whisker plot indicatesthat the median and quartiles for most of the variables re-garding the elemental compositions are equal to zero. Thisresults from the fact that a catalyst contains typically be-tween three and eight elements, which implies that all oth-er elements are assigned the value 0. Therefore, in addi-tion to higher information content of the attributes, gener-ating the X2 attribute set enables us to obtain normallydistributed attributes, which is beneficial with respect tothe QPAR modeling task. In conclusion, this ensemble of 56 X2 attributes plus the 19 attributes from the X3 form

the initial descriptor vector that is used as input for theQPAR model.

3.2 Data Analysis and Clustering of CatalyticPerformance

Catalytic performance for 467 very different catalytic ma-terials in a reaction such as propene oxidation is highlycomplex. Altogether 120 variables fully describe the cata-lytic performance. As already explained in the methodssection, various variables among the 120 have been group-ed since they were obviously very strongly correlated re-


Figure 4. Eigenvalue plot and the corresponding percentage of variance for each principal component.



9/16

sulting in a reduced dataset of 27. In order to make the da-taset orthogonal for allowing proper clustering, a subse-quent PCA step was carried out. Taking the first eight PCs79.4% of the variance based information is retained. Mostof the variables are represented above 75%, except thevariables accounting for the selectivities to formaldehydeand acetaldehyde (35 and 45%, respectively). The firstPCs are closely correlated with specific catalytic behavior.High PC1 levels mean high conversion at low temperatureand high CO 2 selectivity, whereas low PC1 levels meanhigh CO selectivity. High PC2 values, one the other hand,are related with high selectivity for alkane and alkene for-mation. Some obvious trends concerning the variables

conversion, CO, and CO 2 at all temperatures can also bedistinguished in the PC1 PC2 and PC1 PC3 loadingplots (Figures 7a and 7b). All conversion variables and allCO 2 variables are directly correlated and both inverselywith all CO variables. It can also be observed that all othervariables are independent (orthogonal) of propene conver-sion, CO, and CO 2. The variables benzene, S acids, S alka-nes, S C4 hydrocarbons, S C5 hydrocarbons, and S C6 hy-drocarbons are strongly correlated and orthogonal to allremaining variables.

Different clustering methods were applied to classifythe catalytic behavior of the 467 catalysts. The clusteringwas performed on the PCA scores (coordinates of catalystsin the space defined by the first 8 PCs axis). The selectioncriteria were (1) to generate a minimum number of classes


Figure 6. a) Typical distribution of normalized values shown for the example of the attribute meanecmie (#49). b) Box and whiskerplot for the distribution of the elements in the catalysts. Since most catalysts only contain few elements, most values for a given cata-lyst are zero. Since most materials are oxides, only the median for oxygen differs substantially from zero. Quartiles significantly differ-ing from zero are only observed for silicon and aluminium, which occur frequently in the supports.

Figure 5. Loading plots for the PCA analysis of the selected56 continuous attributes from the attribute set X2. Projection of all variables on the PC1/PC2 plane (a) and the PC3/PC4 plane(b).



10/16

which represent the different obvious catalytic behavior(coverage), (2) to avoid that a class is represented by lessthan 15 catalysts of the whole population (representative-ness), and (3) to get classes as distinct as possible. Addi-tionally, the goal of the clustering was to identify each clus-ter with distinct chemical behavior. With respect to these

criteria, the hierarchical clustering indicates that four dis-tinct classes can be generated when cutting at a linkagedistance of 150 (Figure 8). On the other hand, k-meansclustering enables us to generate five distinct classes with ahigher degree of discrimination. The distribution of the467 catalysts in the five clusters is shown in Figure 9. It re-veals that four classes encompass about 100 catalysts,whereas the last class contains only 17 catalysts, which rep-resent about 4% of the whole dataset. The results of the k-means clustering are shown on the PCA score plot (Fig-ure 10). For further analysis, these five clusters were inves-tigated, since they fulfill all our requirements. In addition

it is possible to assign a specific catalytic behavior to eachclass. The chemical significance of each of the clusters is:

cluster #1: low conversion, high selectivity to CO 2,cluster #2: medium conversion, high selectivity to CO 2,cluster #3: low conversion, high selectivity to CO, partialoxidation products,cluster #4: low selectivity to (CO 2 CO), hydrocarbons,andcluster #5: high conversion, high selectivity to CO 2.

3.3 QPAR Model

After the selection of the 56 (X2) plus 19 (X3) attributesand the identification of five distinct classes of catalytic be-havior, a model was built in order to establish QuantitativeProperty Activity Relationships between the attributesand the classes. In addition, because models can highlightthe most discriminative attributes, a last selection of attrib-


Figure 7. Loading plots for the PCA analysis of the catalyticperformance data. Projection of all variables on the PC1/PC2plane (a) and the PC1/PC3 plane (b).

Figure 8. Hierarchical tree plot for all 467 catalysts. Cuttingbetween 100 and 200 generates four well-separated clusters.

Figure 9. Number of cases in each cluster for k-means clusteranalysis based on eight principal components.



11/16

utes was performed, yielding the final descriptor that hasthe most robust predictive power. Two different classifica-tion tools were used to build the QPAR model, namelyANN and Classification tree techniques.

Both MLP and PNN neural networks with two hiddenlayers gave high overall prediction rates. The best PNNnetwork found is characterized by 51 nodes in the inputlayer, and 119 and 240 in the first and second hidden lay-ers, respectively. For this model, the results of the predic-tions for the training, verification and testing datasets, re-spectively, are reported in the contingency table (Table 3).Values reported on the diagonal correspond to correctclassifications while the other entries account for misclassi-

fications. For example, in the quality assessment for thetraining dataset, 54 catalysts were correctly classified incluster #1, whereas one was misclassified in cluster #2, and3 misclassified in cluster #3. Prediction rates (number of correctly classified catalysts divided by the number of allcatalysts assigned to this class) have been calculated foreach catalyst class. The tables reveal very good learningperformance with prediction rates above 93% for allclasses. The prediction rates for the verification and thetest datasets at about 57% indicate that over learning didnot take place, i.e., the model can perform predictionswith a high confidence level on new catalysts. It has to

be pointed out that the PNN model enables good predic-tions to be performed on cluster #4 when considering thevery low number of catalysts used in the learning step(eight and three catalysts for learning and verification, re-spectively). The lowest prediction performance is associat-ed with cluster #2. This results from the fact that catalystsbelonging to cluster #5 are misclassified in cluster #2 andvice versa. These misclassifications originate from the jux-taposition of the two clusters which can be seen in thePCA plots. Chemically, the performance of the catalysts inthese two clusters is related (differing only with respect tothe level of conversion), so that this misclassification isless severe than, for instance, misclassification into cluster#3 or #4.

Applying the pruning method, the number of attributes

was reduced from 75 to 51 resulting in the generation of adescriptor vector with more condensed information con-tent. Similar results were obtained with MLP neural net-works. The best architecture found consists of 45 nodes inthe input layers, and 144 and 27 in the first and second hid-den layers, respectively.

The Classification tree analysis was based on a furtherreduced attribute set of 42 attributes based on a weightanalysis by ANN. A loss matrix had to be used to achievediscriminative models. A loss matrix consists of a squarematrix of coefficients (Table 4) multiplied by a vector of class probabilities to form a vector of cost estimates. The


Figure 10. Score plots showing the results of k-means clusteranalysis based on eight principal components.

Table 3. Contingency table for the ANN analysis, reporting thepredictions for the training (top), verification (middle) and test-ing (bottom) datasets. ANN: Probabilistic Neural Network(PNN) with 51 input nodes: 119 nodes in the first layer and240 nodes in the second layer.

training 1 2 3 4 5 sum ratio pred. rate sensibility

1 predicted 54 0 2 1 0 57 0.24 0.95 9,832 predicted 1 69 1 0 2 73 0.30 0.95 0.963 predicted 3 1 67 0 0 61 0.25 0.93 0.954 predicted 0 0 0 7 0 7 0.03 1.00 0.865 predicted 0 2 0 0 40 42 0.18 0.95 0.95sum/mean 68 72 80 8 42 240 0.96 0.93

verification 1 2 3 4 5 sum ratio pred. rate sensibility

1 predicted 14 2 5 2 2 25 0.27 0.56 0.452 predicted 10 18 1 0 7 36 0.26 0.50 0.623 predicted 7 2 22 0 0 31 0.25 0.71 0.794 predicted 0 0 0 1 1 2 0.03 0.50 0.335 predicted 0 7 0 0 18 20 0.20 0.65 0.57sum/mean 31 29 28 3 23 114 0.59 0.55

test 1 2 3 4 5 sum ratio pred. rate sensibility




12/16

diagonal of a loss matrix is always equal to zero, because acorrect classification has zero costs. The value 1 indicatesno adjusted preference, while all values > 1 induce a pref-erence for certain classifications. With the loss matrix asset, it becomes more costly to misclassify cases from

small clusters than from large clusters. In other words, us-ing a loss matrix enables to improve the prediction rates

on smaller classes. This is desirable, because these smallclusters contain the catalysts producing partial oxidationproducts or hydrocarbons, which are more valuable prod-ucts than CO 2 and CO. The search for best combinationsof misclassification costs was carried out by trial and errorto obtain high prediction rates in all five classes.

The best classification tree yielded a model based ononly 23 attributes, 33 split nodes, and 34 terminal nodes(leaves). Because the large size of the classification treesresults in a complex scheme, a complete graph represent-ing the tree structure cannot be shown here. However, thefirst nodes and a few leaves of a simpler model are depict-

ed in Figure 11 as example. The bar chart on top of thetree presents the distribution of the catalysts in the fiveclusters as already shown in Figure 9. The whole dataset isfirstly divided into two smaller datasets according to themost discriminative splitting rule i.e. whether catalystsshow values for maxec_nffefmsmo (#15) higher or lower

than 153.1. Then, successive splitting conditions are usedto generate nodes that are aimed at creating clusters whichare clearly separated from each other. At the end of eachbranch, the terminal nodes contain the results of the pre-dictions. For example, when all splitting conditions whichdefine the terminal node * are fulfilled, the prediction rateis 84% with respect to cluster #1 whereas the initial proba-bility of cluster #1 is 25% (node 1). The results of the ter-minal nodes are gathered by cluster prediction and thenreported in the confusion matrix (Table 5) which showsthe prediction rates for all catalyst classes and for thelearning and cross-validation datasets, respectively.

Judging from the prediction performance, it is obvious

that the learning has proceeded rather well since the mod-el can predict all five clusters satisfactorily, with an overallprediction rate of 0.68. In order to validate the model, aprediction test was carried out with independent catalysts(one third of the dataset). Also here, the prediction ratesfor each class are well above the distribution probabilitiesalthough the prediction performance is significantly inferi-or with respect to the training set. In addition, also thegood prediction for the catalytic behavior #4 has to bepointed out. Indeed, the model allows us to predict, with aconfidence rate at around 40%, that a catalyst would be-long to the class of partial oxidation catalysts. In contrast,


Table 4. Loss matrix for the Classification tree analysis which in-dicates the costs for misclassification of catalysts in the re-spective classes.

Loss matrix 1 2 3 4 5

1-predicted 0 1 4 5 12-predicted 3 0 1.2 5 33-predicted 1.1 1 0 1.5 14-predicted 1 1 1 0 15-predicted 1 1.5 1.1 1.1 0

Figure 11. Schematic representation of the first nodes and leaves of the classification tree. The numbers in the boxes and the barsrepresent the number of cases left in each class after application of the corresponding splitting rule.



13/16

with random selection, one would only have a 4% chanceof classifying such a catalyst.

From the tree structure and splitting conditions, 34 rulesare derived yielding an explicit model. A prediction rulecorresponds to the ensemble of splitting conditions fromthe top node to a terminal node. The collection of all path-ways to each terminal node in text form results in a rec-ipe which enables us to predict straightforwardly the cat-alytic behavior of a new catalyst, in contrast to ANN. Asan example, some of the 34 rules are reported in Table 6.

4 Discussion

In the course of the work on this project, several issueswere encountered which are important and which shall beaddressed in this section, together with the discussion of the results and the wider implications this work may have.

4.1 Synthesis Coding

Appropriate encoding of the synthesis procedure, even ina simplified form, is a very difficult task, if one deals withsolid catalysts. On the other hand, the synthesis protocol

often is as important as the chemical composition with re-spect to the catalytic performance. Great care thereforehas to be taken to capture the essentials of the syntheticprocedure.

In general, all solids have been synthesized in severalsteps. However the number of steps is a matter of defini-

tion. For example, Cu/MCM-41 (MCM-41: mesoporoussilica as discovered by scientists of Mobil Oil Corporation[20]) would intuitively be classified as a two-step reaction(preparation of the support, which is not commerciallyavailable and has to be synthesized, and subsequent im-pregnation), while Cu/SiO 2 made by impregnation of acommercial SiO 2 support would normally be assumed tobe a one-step reaction. However, although the supportwas bought from a supplier, its synthesis, for instance viaflame pyrolysis, should also be considered as a syntheticstep, resulting, altogether, in a two-step reaction. Consid-ering an ion-exchange reaction, it is also a matter of defini-tion whether each exchange step counts as a reaction step

or whether the whole exchange procedure is considered asa single step. This creates a problem in coding the syn-thesis procedure: the more steps that are defined and themore precisely each single step is described, the more en-tries that are equal to zero and therefore without informa-tion are obtained. This results from the fact that each cata-lyst has to be described with the same attributes in orderto allow meaningful correlation in the model-buildingstep. For example, assuming that each catalyst should bedescribed by four synthesis steps and that each step con-sists of 15 parameters (altogether 60 parameters), the en-try for a catalyst synthesized in a single step (for instanceby precipitation) would contain at least 45 variables with-out any information. Hence, the goal was to find a goodcompromise between a precise description and a suffi-ciently simple coding, by either discarding or regroupinginformation to avoid the above-mentioned problem. Forthis study, it was decided to restrict the information only tothe last synthetic step which was encoded by 19 differentcategorical attributes: coding the type of the synthesis re-action (such as ion-exchange or impregnation), solvents,precursors, and the presence of supports and additives,such as chlorine, alkali metal, or others.

In a more advanced stage of this technology and on abroader data basis of catalysts, one can and should cer-


Table 5. Contingency table for the Classification tree analysiswhich reports the predictions for the training and testing data-sets.

training 1 2 3 4 5 sum ratio pred. rate sensibility

1 predicted 42 14 13 0 4 80 0.25 0.61 0.642 predicted 5 58 0 0 7 70 0.28 0.83 0.67

3 predicted 16 1 54 1 1 73 0.24 0.74 0.744 predicted 6 5 5 10 1 27 0.04 0.37 0.915 predicted 1 8 1 0 49 59 0.20 0.83 0.79sum/mean 77 86 73 11 62 309 0.68 0.75

test 1 2 3 4 5 sum ratio pred. rate sensibility


Table 6. Example for rules that allow the prediction of the performance cluster into which a catalyst falls. Explanation of symbols:see in Table 1. Both sets of rules predominantly sort catalysts into cluster #1.

Cluster 1 2 3 4 5 1 2 3 4 5

Terminal node 14 Terminal node 26Cases 10 3 0 0 0 13 0 3 0 0Rule 1 maxec_nffefmsmo 153.1 maxec_nffefmsmo 153.1Rule 2 meanbc_bseo 185.3 meanbc_bseo 185.3Rule 3 meanec_ea 0.75 meanec_ea > 0.75Rule 4 meanbc_difie_l 0.09 meanec_difie_l > 0.11Rule 5 sumec_minvhosie_os 7



14/16

tainly encode the synthesis of the solids in more detail.However, on the basis of less than 500 catalysts, this didnot seem to be appropriate.

4.2 Design of Relevant Descriptors

Software-based methods alone were not sufficient to re-duce the more than 3000 attributes to the significant onesthat could be used to construct suitable descriptor vectors.Chemical knowledge and intuition has thus been used todesign possibly useful elements of the descriptor vector.This shall be illustrated for some examples: Two propertieswere introduced to account for the stability of oxides andtheir ability to change oxidation states. These propertieswere assumed to be correlated to the availability of oxygenatoms for an oxidation reaction, such as the propene oxi-dation. A normalized formation free enthalpy of forma-tion of element oxides has been introduced (calculation isperformed by dividing formation free enthalpy by the

number of metal atoms in the formula unit). The oxidewith the lowest value of all possible oxides in the catalystis referred to as the most stable oxide. Additionally, thedifference between the most stable oxide and the second-most stable oxide has been taken into account. Consider-ing another attribute based on element oxide characteris-tics, the sum of all accessible oxidation states in oxides of all elements in the catalyst was expected to correlate withthe reduction and oxidation properties of the catalyst. In-tuitively, one would expect that oxidation and reductionsteps would be easier if many oxidation states are accessi-ble under ambient conditions

4.3 QPAR Models

The comparison of prediction performance based on thewhole dataset for ANN and Classification tree models isshown in Figure 12. The prediction rates are compared tothe initial probability for a catalyst to belong to a givencluster. The gap between the prediction rate and respec-tive initial probability reveals how well the model can pre-dict performance. From the bar chart it is obvious thatclassification trees performed less well than ANN. On theother hand, the simplicity/readability of Classification treemodels is a major advantage since much information can

be gained on the most relevant descriptors and guidelines,as will be discussed in the following section. It should alsobe pointed out that many more catalysts should bescreened to allow a fully reliable comparison between thetwo modeling techniques, especially since a significant partof the different datasets were reserved either for verifica-tions or cross-validations.

When complex catalytic reactions are analyzed, such asthe partial oxidation of hydrocarbons, it is obvious that avery limited number of catalyst types will have the desiredperformance, compared to the infinite number of potentialcandidates. From a statistical point of view, it follows that

the most interesting catalytic behavior corresponds to theclass that contains the smallest number of catalysts. This isan issue for the modeling task, because the learning onlimited numbers of cases cannot be optimal, and becausethe construction of the models is usually biased by the dis-tribution of the classes, i.e., models tend to learn better onlarger classes than on smaller ones. For the latter problem,the bias can be corrected by using cost-misclassificationsystems that allow the learning on targeted classes to beforced. The design of the cost matrix is not straightforward

and is usually carried out by trial-and-error processes.

4.4 Descriptor Vector

The goal of this study was the evaluation of a descriptorvector that consists of relevant attributes. This descriptorvector should allow the correlation of properties of solidcatalysts with the respective performance of the catalysts,and would thereby enable the design of libraries suitablefor HT testing. On the other hand, however, from the in-spection of classification rules or attributes selected forthe descriptor vector by ANN, one may learn more about


Figure 12. Prediction rates for each cluster in comparison tothe statistical probability of a catalyst to belong to the cluster.a) ANN of the MLP type based on 45 attributes. b) Decisiontree C&RT based on 23 attributes



15/16

the decisive factors that influence the catalytic behavior of solids in certain catalytic reactions.

For the different types of ANN that were tested in thisstudy, certain attributes were consistently selected as inputvariables by many ANN. If one analyzes these attributes,one can identify certain trends. Considering the attributes

related to properties of the elements, variables based onthe atomic radius (ar, #2, 3), the electron affinity (ea #8, 9,10), the normalized formation free-enthalpy of the moststable metal oxide (nffefmsmo, #14, 15), and the smallestenergy difference between the most stable metal oxideand another metal oxide (sedmsmoom, #16, 17) seemed tobe of major importance. The number of elements in a cata-lyst (nvec, #24) was also significant. When examining theattributes related to the element oxides, only two attrib-utes based on the melting point (mp, #31, 32) seemed tohave any significance. When analyzing the attributes relat-ed to element ions, the ionic radius (ir), coordination num-ber (cn, #35 38), and ionic covalent parameter (icp, #39

42) seemed to be the most important variables. The list of synthesis attributes contains the highest number of attrib-utes (eight) that were selected as input variables in all net-works. This is reasonable from a chemical point of view, assynthesis parameters refer directly to the experimentallytested solid catalyst.

All types of analysis revealed that the stability of oxideshas a strong impact on the performance of catalysts in theoxidation of propene. This is a result that chemical intu-ition would also have given. However, in the framework of this study this conclusion was discovered without addition-al interference by a chemist, and one could therefore, withsome justification, say that the methodology used in theframework of this investigation has implemented chemicalintuition on a basic level in a software program.

5 Conclusions

We have described the implementation of a methodologythat is the basis for a virtual screening of complex solidswith respect to their catalytic properties. These solids aregenerated at random from available elements via a set of different synthetic procedures. Screening in-silico then al-lows one to identify those samples which should be experi-

mentally investigated. Since the coding is set up in a man-ner that corresponds relatively closely to a synthetic proce-dure, there is a high probability that the suggested samplescan indeed be synthesized.

This approach could be very valuable, especially in reac-tions where no good lead is available as yet, since for suchreaction thousands of samples may have to be tested be-fore some activity is discovered at all. Prescreening to re-duce the number of tests to be performed is therefore al-most mandatory. Moreover, if descriptors with high pre-dictive power are discovered, one may be able to extract

information on the relevant properties that a catalystshould have for a specific reaction.

At this point it is not clear just how general descriptorvectors for catalytic reactions will be, i.e., whether a de-scriptor for propene oxidation catalysts, such as developedin this study, will also be valid for other alkenes or even

for hydrocarbon oxidation reactions. The establishment of a much broader database is necessary to verify and furtherdevelop the descriptor concepts for catalysts. It is expectedthat there will be an intimate interplay between the refine-ment of descriptors and the testing of new solids in catalyt-ic reactions. The broader the experimental database be-comes, the more discriminative will be the descriptors de-veloped on this basis, which in turn allows more focusedcatalytic testing.

In addition, the concept seems to be more broadly appli-cable. Any field in which the correlation between proper-ties of solids and performance in a given application iscomplex and development is, to a large extent, empirically

governed, may benefit from the possibility of virtualscreening as the first stage of a high-throughput program.It will be interesting to see how fast these methods willfind their way into the laboratories active in this field.

Acknowledgements

We thank the Marie Curie Fellowship Association and re-gion Rho ne-Alpes (Programme EuroDoc) for having sup-ported the student s mobility and training. In addition, wewould like to thank the Leibniz program of the DFG and

the FCI who provided funding in addition to the basicfunding by the Max-Planck-Gesellschaft and the CNRS.

References

[1] I. E. Maxwell, Nature 1998 , 394, 325.[2] B. Jandeleit, D. J. Schaefer, T. S. Powers, H. W. Turner,

W. H. Weinberg, Angew. Chem. Int. Ed. 1999 , 38, 2494.[3] W. F. Maier, Angew. Chem. Int. Ed. 1999 , 38, 1216.[4] S. Senkan, Angew. Chem. Int. Ed. 2001, 40, 312.[5] Y. Yamada, T. Kobayashi, Chem. Sens. 1999 , 15, 100.[6] J. M. Newsam, F. Schth, Biotechnol. Bioeng. 1999 , 61, 203.[7] C. Klanner, D. Farrusseng, L. Baumes, C. Mirodatos, F.

Schth, QSAR Comb. Sci. 2003 , 22, 729.[8] D. Wolf, O. V. Buyevskaya, M. Baerns, Appl. Catal., A:

Gen. 2000 , 200, 63.[9] U. Rodemerck, M. Baerns, M. Holena, D. Wolf, Appl. Surf.

Sci. 2004 , 223, 168.[10] A. Corma, J. M. Serra, A. Chica, in Principles and Methods

for Accelerated Catalyst Design and Testing (Eds.: E. G.Derouane, V. Parmon, F. Lemos, F. R. Ribeiro), Kluwer,Dordrecht, The Netherlands 2002 , p.153.

[11] J. M. Serra, A. Corma, E. Argente, S. Valero, V. Botti, Appl. Catal., A: Gen. 2003 , 254, 133.

[12] J. M. Serra, A. Corma, D. Farrusseng, L. Baumes, C. Miro-datos, C. Flego, C. Perego, Catal. Today 2003 , 82, 67.




16/16

[13] D. Farrusseng, L. Baumes, C. Mirodatos, in High-Through- put Analysis: A Tool For Combinatorial Materials Science(Eds.: R. A. Potyrailo., E. J. Amis.), Kluwer Academic/Ple-num Publishers, New York 2004, p.551.

[14] J. Cawse, Experimental Design for Combinatorial and HighThroughput Materials Development , John Wiley & Sons,Weinheim, Germany 2002.

[15] C. Klanner, D. Farrusseng, L. Baumes, M. Lengliz, C. Miro-datos, F. Schth, Angew. Chem. Int. Ed. 2004 , 43, 5347.[16] M. Schwickardi, T. Johann, W. Schmidt, F. Schth, Chem.

Mater. 2002 , 14, 3913.

[17] For example: Handbook of Chemistry and Physics , 77th Ed-ition (Eds.: D. R. Lide, H. P. R. Frederikse.) CRC Press,Boca Raton 1996 1997.

[18] K. Kira and L. Rendell, A practical approach to feature se-lection. In: Proceedings of the 9th International Conferenceon Machine Learning (Aberdeen, July 1992), D. Sleeman &P. Edwards (eds.), Morgan Kaufmann 1992, pp. 249 256

Aberdeen, Scotland.[19] C. Hoffmann, A. Wolf, F. Schth, Angew. Chem. Int. Ed.1999 , 38, 2800.

[20] a) C. T. Kresge, M. E Leonowicz, W. J Roth, J. C.Vartuli,J. S Beck, Nature 1992 , 359, 710.



QSAR Design of Discovery Libraries for Solids Based on QSAR Models 2005 QSAR and rial Science

Documents

Transcript of QSAR Design of Discovery Libraries for Solids Based on QSAR Models 2005 QSAR and rial Science