Importance of structural information in predicting human acute toxicity from in vitro cytotoxicity...



Toxicology and Applied Pharmacology 246 (2010) 38–48

Contents lists available at ScienceDirect

Toxicology and Applied Pharmacology

journal homepage: www.elsevier.com/locate/ytaap

Importance of structural information in predicting human acute toxicity from in vitro cytotoxicity data

Soyoung Lee 1, Keunwan Park 1, Hee-Sung Ahn, Dongsup Kim ⁎

Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, 305-701, South Korea

⁎ Corresponding author. Fax: +82 42 350 4310. E-mail address: [email protected] (D. Kim).

1 The authors wish it to be known that the first two authors should be regarded as joint first authors.

0041-008X/$ – see front matter © 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.taap.2010.04.004

Article info

Article history: Received 13 January 2010; Revised 29 March 2010; Accepted 7 April 2010; Available online 14 April 2010

Keywords: Human acute toxicity; Cytotoxicity; Prediction; Molecular descriptor; Variable selection; Multiple linear regression

Abstract

In this study, we tried to assess the utility of the structural information of drugs for predicting human acute toxicity from in vitro basal cytotoxicity, and to interpret the informative quality and the pharmacokinetic meaning of each structural descriptor. For this, human acute toxicity data of 67 drugs were taken from the literature with their basal cytotoxicity data, and used to develop predictive models. A series of multiple linear regression analyses were performed to construct feasible regression models by combining molecular descriptors and cytotoxicity data. We found that although the molecular descriptors alone had only moderate correlation with human acute toxicity, they were highly useful for explaining the discrepancy between in vitro cytotoxicity and human acute toxicity. Among many possible models, we selected the most explanatory models by changing the number and the type of combined molecular descriptors. The results showed that our selected models had high predictive power (R²: between 0.7 and 0.87). Our analysis indicated that those successful models increased the prediction accuracies by providing information on human pharmacokinetic parameters, which are the major reason for the difference between human acute toxicity and cytotoxicity. In addition, we performed a clustering analysis on selected molecular descriptors to assess their informative qualities. The results indicated that the number of single bonds, the number of hydrogen bond donors and valence connectivity indices are closely related to linking cytotoxicity to acute toxicity, which provides insightful explanation about human toxicity beyond cytotoxicity.

© 2010 Elsevier Inc. All rights reserved.

Introduction

Drug toxicity is one of the most important issues in every drug discovery process. It is one of the main reasons for attrition in drug development; a report by van de Waterbeemd and Gifford (2003) indicated that 21% of all failures were attributed to animal toxicity (11%) and adverse effects in man (10%). In addition, numerous drugs have had to be withdrawn from the market because of toxicity (Li, 2004). Thus, the accurate assessment of drug toxicity is one of the major challenges in drug development.

Conventional toxicity evaluation methods use in vivo animal testing, in vitro cytotoxicity experiments and in silico computational prediction (Johnson and Wolfgang, 2000; Li, 2007; Valerio, 2009). Evaluation using animal testing is the most reliable method, but it is too expensive for screening drugs at the preclinical stage. Testing toxicity by using in vitro experiments (Li, 2007) is comparatively reliable and inexpensive, but the prediction accuracy is not sufficiently high for clinical use. These tests measure an in vitro toxicity value, and thus the accurate prediction of in vivo toxicity is difficult. Computational prediction that uses the chemical properties of drugs and statistical techniques (Dearden, 2003; Valerio, 2009) is less expensive, but it is of lower accuracy and should not be used alone. Although there have been numerous attempts to develop low-cost toxicity screening methods by using in vitro experiments or in silico computations, there is no sufficiently accurate method for clinical use.

Recently, there has been pioneering work to improve the evaluation accuracies of these two techniques by combining data for in vitro experimental endpoints with the structural information of drugs (Lessigiarska et al., 2006). Lessigiarska et al. tried to predict human and rodent lethal doses of 26 chemicals as alternatives to animal testing by using animal in vivo testing data, in vitro experimental data and structural descriptors. This attempt provided new insight into toxicity prediction, and it might be improved by other computational tools (Cronin et al., 2001; Lin et al., 2002; Basak et al., 2003; Muskal et al., 2003; Oberg, 2004; Roy et al., 2005; Swamidass et al., 2005; Yuan et al., 2007a,b).

Our previous research indicated that the structural information of drugs improved the prediction of human in vivo hepatic clearance from in vitro hepatocyte experimental data (Lee and Kim, 2007). Structural information is expressed as numerical values that characterize properties of molecules, such as molecular weight, lipophilicity, and the number of atoms; these values are called "molecular descriptors" in QSAR (Quantitative Structure-Activity Relationship) studies. Based on those previous results, we tried to find significant molecular descriptors that could improve the evaluation accuracy of cytotoxicity experiments, and then to interpret the informative meaning of the descriptors. The current study is meaningful because of (1) the larger dataset (67 chemicals) than in the previous work (26 chemicals), (2) the improvement of prediction performance, (3) the assessment of the informative qualities of each molecular descriptor, and (4) the explanation of the pharmacokinetic meaning of chemical properties for the difference between human acute toxicity and cytotoxicity.

Table 1
The log IC50 values (logVitro) and the log of the 50% lethal concentration (logVivo) values of 67 chemicals.

| ID | Chemical name | CAS number | logVitro (a) | logVivo (a) |
|----|---------------|------------|--------------|-------------|
| 1 | 2,4-Dichlorophenoxyacetic acid | 94-75-7 | −2.87 | −2.43 |
| 2 | 5-Fluorouracil | 51-21-8 | −4.30 | −3.69 |
| 3 | Acetaminophen | 103-90-2 | −3.48 | −2.66 |
| 4 | Acetonitrile | 75-05-8 | −0.68 | −2.82 |
| 5 | Acetylsalicylic acid | 50-78-2 | −2.36 | −2.20 |
| 6 | Amiodarone hydrochloride | 19774-82-4 | −4.58 | −4.95 |
| 7 | Amitriptyline hydrochloride | 549-18-8 | −4.64 | −5.34 |
| 8 | d-Amphetamine sulfate | 51-63-8 | −3.26 | −3.88 |
| 9 | Arsenic trioxide | 1327-53-3 | −4.87 | −5.23 |
| 10 | Atropine sulfate monohydrate | 5908-99-6 | −3.91 | −5.70 |
| 11 | Cadmium (II) chloride | 10108-64-2 | −5.65 | −6.06 |
| 12 | Caffeine | 58-08-2 | −3.08 | −3.29 |
| 13 | Carbamazepine | 298-46-4 | −3.34 | −3.79 |
| 14 | Chloral hydrate | 302-17-0 | −2.95 | −3.18 |
| 15 | Chloramphenicol | 56-75-7 | −3.14 | −3.35 |
| 16 | Chlormethiazole | 533-45-9 | −2.85 | −3.49 |
| 17 | Chloroquine diphosphate | 50-63-5 | −4.64 | −4.88 |
| 18 | Chlorpromazine hydrochloride | 69-09-0 | −4.82 | −6.46 |
| 19 | cis-Platinum | 15663-27-1 | −5.25 | −4.68 |
| 20 | Codeine | 76-57-3 | −3.28 | −5.19 |
| 21 | Colchicine | 64-86-8 | −6.92 | −7.19 |
| 22 | Cyclosporine A | 59865-13-3 | −4.38 | −6.22 |
| 23 | Diazepam | 439-14-5 | −2.78 | −4.49 |
| 24 | Dichlorvos | 62-73-7 | −3.77 | −3.70 |
| 25 | Digoxin | 20830-75-5 | −3.18 | −7.38 |
| 26 | Dimethylformamide | 68-12-2 | −1.14 | −2.23 |
| 27 | Diquat dibromide | 85-00-7 | −4.38 | −3.55 |
| 28 | Disopyramide | 3737-09-5 | −2.57 | −4.06 |
| 29 | Ethanol | 64-17-5 | −0.83 | −0.80 |
| 30 | Ethylene glycol | 107-21-1 | −0.38 | −1.50 |
| 31 | Glufosinate ammonium | 77182-82-2 | −2.12 | −1.99 |
| 32 | Glutethimide | 77-21-4 | −3.07 | −3.47 |
| 33 | Hexachlorophene | 70-30-4 | −4.96 | −3.13 |
| 34 | Isoniazid | 54-85-3 | −1.83 | −3.35 |
| 35 | Isopropyl alcohol | 67-63-0 | −1.20 | −0.94 |
| 36 | Lindane | 58-89-9 | −3.27 | −5.98 |
| 37 | Lithium sulfate | 10377-48-7 | −1.81 | −2.25 |
| 38 | Malathion | 121-75-5 | −2.88 | −5.73 |
| 39 | Maprotiline | 10262-69-8 | −4.70 | −5.53 |
| 40 | Meprobamate | 57-53-4 | −2.58 | −3.37 |
| 41 | Mercury (II) chloride | 7487-94-7 | −4.80 | −4.71 |
| 42 | Methadone hydrochloride | 1095-90-5 | −3.97 | −6.01 |
| 43 | Nicotine | 54-11-5 | −2.49 | −5.10 |
| 44 | Orphenadrine hydrochloride | 341-69-5 | −3.95 | −4.64 |
| 45 | Paraquat dichloride | 1910-42-5 | −4.07 | −5.02 |
| 46 | Parathion | 56-38-2 | −3.67 | −5.65 |
| 47 | Pentachlorophenol | 87-86-5 | −3.72 | −3.08 |
| 48 | Phenobarbital | 50-06-6 | −2.57 | −3.44 |
| 49 | Phenol | 108-95-2 | −3.12 | −3.41 |
| 50 | Potassium chloride | 7447-40-7 | −1.05 | −1.98 |
| 51 | Potassium cyanide | 151-50-8 | −3.00 | −3.89 |
| 52 | Procainamide hydrochloride | 614-39-1 | −2.79 | −3.24 |
| 53 | Propranolol hydrochloride | 318-98-9 | −4.27 | −4.95 |
| 54 | Quinidine sulfate dihydrate | 6591-63-5 | −4.26 | −4.52 |
| 55 | Rifampicine | 13292-46-1 | −3.99 | −3.81 |
| 56 | Sodium bicarbonate | 144-55-8 | −1.03 | −0.95 |
| 57 | Sodium chloride | 7647-14-5 | −1.14 | −1.27 |
| 58 | Sodium fluoride | 7681-49-4 | −2.72 | −3.24 |
| 59 | Sodium selenate | 13410-01-0 | −3.75 | −4.59 |
| 60 | Sodium valproate | 1069-66-5 | −2.00 | −2.20 |
| 61 | Strychnine | 57-24-9 | −3.19 | −5.12 |
| 62 | Thallium sulfate | 7446-18-6 | −4.80 | −5.09 |
| 63 | Theophylline | 58-55-9 | −3.06 | −3.29 |
| 64 | Thioridazine hydrochloride | 130-61-0 | −4.19 | −4.76 |
| 65 | Verapamil hydrochloride | 152-11-4 | −4.14 | −5.21 |
| 66 | Warfarin | 81-81-2 | −3.09 | −3.81 |
| 67 | Xylene | 1330-20-7 | −2.17 | −3.90 |

(a) The unit of logVitro and logVivo is the log of moles per liter.

Materials and methods

Data collection. The dataset consisted of 67 compounds obtained from the literature (Sjostrom et al., 2008) by the ACuteTox project (Clemedson, 2008; AcuteTox, 2009), an integrated project started in January 2005 and intended to develop a simple and robust in vitro testing strategy for the prediction of human acute systemic toxicity. In this project, Sjostrom et al. collected human acute toxicity data (LC50 values) from medical and forensic reports such as papers, poison information centers, and on-line databases (e.g. Poisindex, Thomson Micromedex, HSDB, and ChemIDPlus). They obtained in vitro cytotoxicity data (IC50 values) using the in vitro 3T3 NRU assay, which had been employed in the ICCVAM (US Interagency Coordinating Committee on the Validation of Alternative Methods)/ECVAM (European Centre for the Validation of Alternative Methods) validation study (see website http://iccvam.niehs.nih.gov/methods/acutetox/acutetox.htm). The original dataset in the literature consisted of 97 reference chemicals, including drugs, industrial chemicals and biocides (Sjostrom et al., 2008). Chemicals that had no IC50 value or no human LC50 value were removed, and the final dataset is listed in Table 1.

Molecular descriptor calculation and cleaning. To properly represent chemicals, the structural descriptors were calculated. They were categorized by their intrinsic properties into five categories: constitutional, physicochemical, electrostatic, topological and geometrical descriptors. The constitutional descriptors are simple descriptors that reflect only the molecular composition of the compound, and electrostatic descriptors quantify information about the electrostatic properties of a molecule. Geometrical descriptors indicate geometrical information from the molecular structure, and the topological descriptors were calculated from graph-theoretic information about the atomic connectivity indices. The physicochemical descriptors describe physicochemical properties of a molecule. For all descriptor calculations, preADME software (Lee et al., 2004) was used. The numbers of the calculated geometrical, physicochemical, topological, electrostatic and constitutional descriptors were 22, 131, 688, 79 and 150, respectively.

Before performing the analysis, the molecular descriptors were filtered using the following criteria: (1) the descriptors that had at least one missing value were removed; (2) the descriptors that included many identical values (>80%) were also removed. As a result, 543 molecular descriptors remained (geometrical: 22, physicochemical: 33, topological: 445, electrostatic: 0, constitutional: 40).
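The two filtering criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the descriptor table is a hypothetical dictionary of value lists, a missing value is represented by `None`, and "many identical values (>80%)" is read as a single value being shared by more than 80% of the compounds. The function name `filter_descriptors` and the toy data are not from the study.

```python
from collections import Counter


def filter_descriptors(table, max_identical_fraction=0.8):
    """Keep descriptors with no missing values and no dominant repeated value.

    `table` maps a descriptor name to its list of values (one per compound).
    A missing value is represented by None (an assumption of this sketch).
    """
    kept = {}
    for name, values in table.items():
        # Criterion (1): drop descriptors with at least one missing value.
        if any(v is None for v in values):
            continue
        # Criterion (2): drop descriptors where one value is shared by more
        # than 80% of the compounds (assumed reading of "identical values").
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count / len(values) > max_identical_fraction:
            continue
        kept[name] = values
    return kept


# Hypothetical toy descriptor table for five compounds (not data from the study).
toy_table = {
    "desc_ok": [1.0, 2.5, 3.1, 0.4, 2.2],
    "desc_missing": [1.0, None, 3.1, 0.4, 2.2],
    "desc_constant": [0, 0, 0, 0, 0],
}
kept = filter_descriptors(toy_table)
print(sorted(kept))
```

In this toy table only `desc_ok` survives: one descriptor has a missing value and the other is constant across all compounds.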

Multiple linear regression models. The predictors for human acute toxicity were modeled by combining cytotoxicity data and molecular descriptors. Multiple linear regression analysis was used for searching appropriate and explanatory models. Various predictors were constructed by changing the number and type of molecular descriptors.

1. Model using the in vitro–in vivo correlation
This model uses only cytotoxicity data to predict human acute toxicity as follows:

$$\mathrm{logVivo} = b + a \times \mathrm{logVitro} \tag{1}$$

where logVivo and logVitro are the log values of LC50 and IC50, respectively, and a and b are the coefficients of this linear regression equation. This equation is based on the linear relationship between the log of acute toxicity and the log of cytotoxicity.


2. Model using cytotoxicity and the best explanatory molecular descriptor(s)
Among the descriptors in each category (geometrical, physicochemical, topological, and constitutional descriptors; all electrostatic descriptors were excluded by the data cleaning process), one or two molecular descriptor(s) were selected that showed the best statistical fit with human acute toxicity when the cytotoxicity value was used as an additional explanatory variable. The criterion of statistical fitness was the adjusted R², which is a modification of the multiple R² that is adjusted for the number of explanatory terms in a model. The adjusted R² is useful for comparing models having different numbers of explanatory variables, as in the present case. In addition, the models whose maximum variance inflation factor (VIF) value was greater than 5 were not considered even though they had a high adjusted R² (the VIF was used to detect the severity of the multicollinearity, and a common rule of thumb is that if the VIF of a variable is larger than 5, then the multicollinearity is high). The models for adding one (Eq. (2)) and two (Eq. (3)) descriptors are as follows:

$$\mathrm{logVivo} = b + a_1 \times \mathrm{logVitro} + a_2 \times \mathrm{MD}_1 \tag{2}$$

$$\mathrm{logVivo} = b + a_1 \times \mathrm{logVitro} + a_2 \times \mathrm{MD}_1 + a_3 \times \mathrm{MD}_2 \tag{3}$$

where MD_i is the i-th selected molecular descriptor, and b, a_1, a_2 and a_3 are the regression coefficients. Eq. (2) selects the most informative molecular descriptor, together with the log of cytotoxicity, within each category, whereas Eq. (3) considers all possible combinations of two descriptors. For the four categories of molecular descriptors, 3 (top three most accurate models) × 4 (categories) = 12 predictors were modeled.

3. Model using cytotoxicity and various molecular descriptors: the best model
This step was designed to obtain the most accurate model. While we tried to assess the utility and meaning of molecular descriptors to human acute toxicity by selecting one or two descriptors, the purpose of this model was to build the model with the best performance.
The regression models using more than two descriptors were constructed from all categories of molecular descriptors by the forward-selection technique based on the adjusted R². That is, the selection procedure that chooses the structural descriptor giving the highest adjusted R² was repeated until there was no remaining descriptor that increased the adjusted R² value of the regression model. The final equation is as follows:

$$\mathrm{logVivo} = b + a_1 \times \mathrm{logVitro} + \sum_{i=1}^{s} \left(a_{i+1} \times \mathrm{MD}_i\right) \tag{4}$$

where s is the number of selected molecular descriptors; eight descriptors were selected here.

4. Model using only molecular descriptor(s)
This model used only structural descriptors, without logVitro values, to predict logVivo values in order to determine whether they were directly correlated with human acute toxicity or just helped to reduce the difference between acute toxicity and cytotoxicity. One or two molecular descriptors were included in these models in the same manner, as follows:

$$\mathrm{logVivo} = b + a_1 \times \mathrm{MD}_1 \tag{5}$$

$$\mathrm{logVivo} = b + a_1 \times \mathrm{MD}_1 + a_2 \times \mathrm{MD}_2 \tag{6}$$

5. Model using cytotoxicity and predicted pharmacokinetic parameters
This model used several predicted values of important pharmacokinetic parameters together with cytotoxicity values, in order to compare the ability of molecular descriptors to correct for the difference between human acute toxicity and cytotoxicity with that of predicted pharmacokinetic parameters. The predicted parameters were Caco2 cell permeability (Yazdanian et al., 1998), MDCK cell permeability (Yazdanian et al., 1998), human intestinal absorption (HIA) (Yee, 1997), and skin permeability (Lee et al., 2004) for describing the absorption of a chemical, and blood-brain barrier (BBB) penetration (Ajay et al., 1999) and plasma protein binding (Saiakhov et al., 2000) for describing its distribution in the human body. All parameters were predicted by using preADME software. One (Eq. (7)), two (Eq. (8)), or all parameters (Eq. (9)) were used as explanatory factors with cytotoxicity as follows:

$$\mathrm{logVivo} = b + a_1 \times \mathrm{logVitro} + a_2 \times \mathrm{PK}_1 \tag{7}$$

$$\mathrm{logVivo} = b + a_1 \times \mathrm{logVitro} + a_2 \times \mathrm{PK}_1 + a_3 \times \mathrm{PK}_2 \tag{8}$$

$$\mathrm{logVivo} = b + a_1 \times \mathrm{logVitro} + \sum_{i=1}^{6} a_{i+1} \times \mathrm{PK}_i \tag{9}$$

where PK_1 to PK_6 are the predicted pharmacokinetic parameters.
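The model families above share one fitting step (ordinary least squares) and one selection criterion (the adjusted R²). The following is a minimal pure-Python sketch of forward selection in the spirit of Eq. (4), not the authors' implementation: the solver, the function names (`solve`, `fit_ols`, `adjusted_r2`, `forward_select`) and the toy data are all assumptions made for illustration, and the VIF filter is omitted.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ols(columns, y):
    """Least-squares fit of y on an intercept plus predictor columns; returns (beta, R2)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[r][a] * X[r][b] for r in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[r][a] * y[r] for r in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    predicted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n, p):
    """Adjusted R2 for n samples and p explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def forward_select(log_vitro, descriptors, log_vivo):
    """Add descriptors one at a time while the adjusted R2 keeps increasing."""
    n = len(log_vivo)
    selected = []
    _, r2 = fit_ols([log_vitro], log_vivo)
    best = adjusted_r2(r2, n, 1)
    while True:
        candidates = []
        for name in descriptors:
            if name in selected:
                continue
            cols = [log_vitro] + [descriptors[s] for s in selected] + [descriptors[name]]
            _, r2 = fit_ols(cols, log_vivo)
            candidates.append((adjusted_r2(r2, n, len(cols)), name))
        if not candidates:
            break
        score, name = max(candidates)
        if score <= best:
            break
        selected.append(name)
        best = score
    return selected, best


# Hypothetical toy data (not from the study): logVivo depends on logVitro and md_signal.
log_vitro = [-4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5, -3.2, -2.2]
descriptors = {
    "md_signal": [3.0, 0.0, 2.0, 5.0, 1.0, 4.0, 0.0, 2.0, 5.0, 1.0],
    "md_other": [0.3, 0.9, 0.1, 0.7, 0.5, 0.2, 0.8, 0.4, 0.6, 0.0],
}
noise = [0.05, -0.04, 0.02, -0.03, 0.01, 0.04, -0.05, 0.03, -0.02, -0.01]
log_vivo = [-1.0 + 0.8 * v - 0.5 * s + e
            for v, s, e in zip(log_vitro, descriptors["md_signal"], noise)]

selected, best_adj_r2 = forward_select(log_vitro, descriptors, log_vivo)
print(selected, best_adj_r2)
```

The `adjusted_r2` helper follows the definition given with Table 2; for example, R² = 0.600 with n = 66 and p = 1 gives about 0.594, matching the in vitro-only model reported there.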

Statistical validation. The statistical significance of the regression equations was assessed by using various criteria: multiple R², adjusted R², mean squared error (MSE), Akaike information criterion (AIC), correlation coefficient and variance inflation factor (VIF). In addition, the models were validated by leave-one-out (LOO) cross-validation and 10-fold cross validation. LOO removes one sample, trains the regression model with the remaining ones, and then evaluates the removed sample by using the trained model. This procedure is repeated once for each sample. The performance of the LOO procedure is estimated by the MSE and the correlation coefficient between the predicted and real values. LOO is a special case of k-fold cross validation where k equals the number of samples; it is known to be less biased and is widely used for small datasets. However, the accuracy estimate of the procedure tends to have higher variance than that of 10-fold cross validation, since each estimate is determined by only one sample. Thus, a 10-fold cross validation procedure was also adopted to confirm the model significance and generality. That is, the 67 compounds from the ACuteTox dataset were randomly shuffled and divided into 10 sets. Nine of the ten sets were regarded as the training data, and the resulting model was applied to the remaining set. This procedure was repeated 10 times (the number of divided sets), and the averaged correlation coefficient and MSE over the repetitions were used to estimate the model quality.
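The LOO and k-fold procedures can be illustrated with the in vitro-only model of Eq. (1). The sketch below is an assumption-laden illustration, not the authors' code: it uses only the first ten compounds of Table 1, the helper names (`fit_line`, `cross_validated_predictions`, `mse`) are hypothetical, and 5 folds stand in for the 10 folds of the study because the subset is so small.

```python
import random


def fit_line(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_predictions(x, y, k, seed=0):
    """k-fold cross validation; k == len(x) gives leave-one-out."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    predictions = [None] * len(x)
    for fold in range(k):
        held_out = set(order[fold::k])
        train_x = [x[i] for i in order if i not in held_out]
        train_y = [y[i] for i in order if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        # Predict only the samples that were left out of this training run.
        for i in held_out:
            predictions[i] = intercept + slope * x[i]
    return predictions


def mse(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# logVitro and logVivo of compounds 1-10 in Table 1 (a subset used only for illustration).
log_vitro = [-2.87, -4.30, -3.48, -0.68, -2.36, -4.58, -4.64, -3.26, -4.87, -3.91]
log_vivo = [-2.43, -3.69, -2.66, -2.82, -2.20, -4.95, -5.34, -3.88, -5.23, -5.70]

intercept, slope = fit_line(log_vitro, log_vivo)
train_mse = mse(log_vivo, [intercept + slope * v for v in log_vitro])

loo_pred = cross_validated_predictions(log_vitro, log_vivo, k=len(log_vitro))
loo_mse = mse(log_vivo, loo_pred)

five_fold_pred = cross_validated_predictions(log_vitro, log_vivo, k=5)
five_fold_mse = mse(log_vivo, five_fold_pred)

print(train_mse, loo_mse, five_fold_mse)
```

For least-squares fits the LOO error is never smaller than the training error, which is consistent with the slightly lower LOO accuracies reported in Table 2.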

Outlier detection. An outlier is a value in a set of data that does not fit with the rest of the data. Thus, for building a robust regression model, outliers were detected by traditional statistics (studentized residual and DFFITS) and regression models were constructed after removing outlier samples. The studentized residual is defined by the following equation:

$$t_i = \frac{e_i}{\sqrt{\mathrm{MSE}_i \, (1 - h_{ii})}} \tag{10}$$

where e_i is the residual for the i-th sample and MSE_i is the mean squared error for the regression model when the i-th sample is left out. The i-th diagonal element of the hat matrix H, h_ii, is known to be the leverage of y_i (the i-th response variable) on ŷ_i (the estimate for y_i). The observation with the largest h_ii can be said to have the most extreme variable values, whereas the one with the smallest h_ii can be said to be the most typical. Accordingly, the studentized residual t_i considers the residual of the i-th sample together with its influential power. Empirically, a good rule of thumb for identifying an extreme observation is the criterion t_i > 2.


Table 2
The equation and the prediction accuracy of each model. The columns r, R², Adj_R², MSE, AIC and MaxVIF refer to the self test; LOO r, Q² and LOO MSE refer to the leave-one-out validation.

| Category | Model No. | Regression equation | Sample size (outlier ID) | r | R² | Adj_R² | MSE | AIC | MaxVIF | LOO r | Q² | LOO MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| In vitro | 0 | −1.04685 + 0.90709 × logVitro | 66 (33) | 0.775 | 0.600 | 0.594 | 0.895 | 185.96 | N/A | 0.762 | 0.581 | 0.940 |
| Constitutional | 1 | −0.089 − 0.045 × No_CaC + 0.848 × logVitro | 63 (25,33,36,38) | 0.851 | 0.724 | 0.715 | 0.563 | 150.55 | 1.21 | 0.837 | 0.701 | 0.612 |
| Constitutional | 2 | −1.06 − 0.071 × No_CsC + 0.79 × logVitro | 66 (55) | 0.836 | 0.699 | 0.689 | 0.677 | 169.55 | 1.05 | 0.809 | 0.654 | 0.783 |
| Constitutional | 3 | −0.947 + 0.236 × Faction_of_aromatic_atoms + 0.923 × logVitro | 64 (25,33,38) | 0.825 | 0.681 | 0.671 | 0.661 | 163.14 | 1.07 | 0.807 | 0.651 | 0.726 |
| Constitutional | 4 | −1.08 − 0.0359 × No_single_bonds + 0.314 × No_H_bond_donors + 0.746 × logVitro | 66 (22) | 0.849 | 0.721 | 0.707 | 0.606 | 164.25 | 2.14 | 0.815 | 0.664 | 0.732 |
| Constitutional | 5 | −1.163 − 0.942 × No_CsC + 0.222 × No_H_bond_donors + 0.807 × logVitro | 67 | 0.847 | 0.717 | 0.703 | 0.646 | 168.78 | 1.72 | 0.808 | 0.653 | 0.778 |
| Constitutional | 6 | −1.112 − 0.912 × No_CsC + 1.775 × No_double_bonds + 0.809 × logVitro | 67 | 0.837 | 0.701 | 0.687 | 0.661 | 172.36 | 1.90 | 0.817 | 0.667 | 0.739 |
| Geometrical | 7 | −1.0 − 0.006 × 2D_VSA_hydrophobic_unsat + 0.78 × logVitro | 65 (25,36) | 0.820 | 0.672 | 0.661 | 0.668 | 166.24 | 1.20 | 0.801 | 0.642 | 0.730 |
| Geometrical | 8 | −1.09 − 0.003 × 2D_VSA_hydrophobic + 0.71 × logVitro | 66 (55) | 0.815 | 0.665 | 0.654 | 0.752 | 176.51 | 1.20 | 0.787 | 0.620 | 0.862 |
| Geometrical | 9 | −0.929 − 1.2 × Fraction_of_2D_VSA_hydrophobic_unsat + 0.837 × logVitro | 65 (25,36) | 0.809 | 0.654 | 0.643 | 0.705 | 169.70 | 1.06 | 0.791 | 0.626 | 0.764 |
| Geometrical | 10 | −1.157 − 0.0054 × 2D_VSA_hydrophobic_sat + 0.01067 × 2D_VSA_Hbond_donor + 0.704 × logVitro | 66 (22) | 0.822 | 0.676 | 0.660 | 0.703 | 174.05 | 1.53 | 0.779 | 0.607 | 0.858 |
| Geometrical | 11 | −1.396 − 0.0025 × 2D_VSA_hydrophobic + 2.456 × Fraction_of_2D_VSA_Hbond_donor + 0.696 × logVitro | 67 | 0.822 | 0.675 | 0.659 | 0.720 | 178.12 | 1.22 | 0.793 | 0.628 | 0.828 |
| Geometrical | 12 | −1.418 − 0.0027 × 2D_VSA_hydrophobic_sat + 2.58 × Fraction_of_2D_VSA_Hbond_donor + 0.724 × logVitro | 67 | 0.820 | 0.673 | 0.657 | 0.724 | 178.48 | 1.15 | 0.782 | 0.612 | 0.873 |
| Physicochemical | 13 | −0.91 − 0.092 × AlogP98_024_C + 0.0818 × logVitro | 64 (25,36,38) | 0.840 | 0.706 | 0.697 | 0.592 | 156.02 | 1.08 | 0.825 | 0.681 | 0.644 |
| Physicochemical | 14 | −1.075 − 0.2426 × AlogP98_008_C + 0.817 × logVitro | 67 | 0.831 | 0.691 | 0.681 | 0.745 | 172.73 | 1.01 | 0.807 | 0.652 | 0.774 |
| Physicochemical | 15 | −1.1135 − 0.2489 × AlogP98_002_C + 0.7887 × logVitro | 67 | 0.821 | 0.674 | 0.663 | 0.722 | 176.33 | 1.04 | 0.799 | 0.639 | 0.800 |
| Physicochemical | 16 | −1.161 − 0.403 × AlogP98_008_C − 0.0349 × Solvation_free_energy + 0.8773 × logVitro | 67 | 0.858 | 0.736 | 0.723 | 0.603 | 164.18 | 2.38 | 0.842 | 0.709 | 0.645 |
| Physicochemical | 17 | −1.286 − 0.328 × AlogP98_008_C + 0.198 × AlogP98_050_H + 0.806 × logVitro | 67 | 0.857 | 0.734 | 0.721 | 0.608 | 164.72 | 1.38 | 0.828 | 0.686 | 0.698 |
| Physicochemical | 18 | −1.164 − 0.295 × AlogP98_008_C + 0.00262 × SK_MP + 0.874 × logVitro | 67 | 0.847 | 0.717 | 0.704 | 0.645 | 168.69 | 1.35 | 0.818 | 0.669 | 0.737 |
| Topological | 19 | −1.033 − 42.02 × Bound_charge_index_06 + 0.752 × logVitro | 64 (33,36,55) | 0.865 | 0.749 | 0.741 | 0.563 | 152.85 | 1.17 | 0.847 | 0.718 | 0.635 |
| Topological | 20 | −1.087 − 0.265 × VChi_04_path + 0.712 × logVitro | 65 (54,55) | 0.863 | 0.744 | 0.735 | 0.583 | 157.43 | 1.13 | 0.848 | 0.719 | 0.641 |
| Topological | 21 | −1.123 − 0.34 × VChi_05_path + 0.721 × logVitro | 65 (54,55) | 0.859 | 0.738 | 0.729 | 0.596 | 158.85 | 1.12 | 0.845 | 0.715 | 0.651 |
| Topological | 22 | −1.174 − 0.991 × Valence_charge_index_03 + 0.74 × Delta_chi_04_path_cluster + 0.758 × logVitro | 67 | 0.881 | 0.776 | 0.766 | 0.494 | 152.93 | 6.01 | 0.860 | 0.740 | 0.579 |
| Topological | 23 | −1.066 − 0.53 × Valence_charge_index_02 + 0.525 × Delta_chi_04_path_cluster + 0.751 × logVitro | 67 | 0.876 | 0.767 | 0.756 | 0.515 | 155.73 | 3.97 | 0.855 | 0.731 | 0.597 |
| Topological | 24 | −1.0556 − 0.539 × Valence_charge_index_02 + 0.371 × Delta_chi_03_path + 0.787 × logVitro | 67 | 0.872 | 0.760 | 0.748 | 0.531 | 157.73 | 4.42 | 0.855 | 0.732 | 0.595 |
| All | 25 | −1.002 − 1.019 × VChi_04_path_cluster + 0.3888 × Solvation_chi_04_path_cluster + 0.006 × 2D_VSA_Hbond_donor + 0.158 × AlogP98_026_C − 0.0027 × ATS_Moreau_Bruto_09_mass_average + 0.0013 × 2D_VSA_polar + 0.191 × ATS_Geary_09_polarizability − 0.319 × ATS_Geary_10_E_state + 0.839 × logVitro | 65 (27,23) | 0.934 | 0.873 | 0.852 | 0.279 | 122.55 | 16.3 | 0.910 | 0.820 | 0.412 |

* The least significant (largest) p-value in the test of variable significance (t-statistic) was 0.085, in model #9; all other variables had smaller p-values (e.g., 2.42e-10 and 1.03e-7 for the two variables in model #22).
r: Pearson correlation coefficient.
R²: Square of r.
Adj_R²: Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where p is the total number of variables and n is the sample size.
MSE: Mean squared error.
AIC: Akaike information criterion.
MaxVIF: Maximum of the variance inflation factor.


Moreover, the DFFITS is defined by the following equation:

$$\mathrm{DFFITS}_i = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2} t_i \tag{11}$$

A high h_ii represents high leverage, and a large t_i indicates the possibility of an extreme observation (outlier). Thus, DFFITS combines studentized residuals and leverage, and so measures outliers and influential points simultaneously. One rule of thumb for DFFITS is |DFFITS_i| > 2√(p/n), where p is the sum of the diagonal elements of the hat matrix (the sum of the h_ii values over all i) and n is the number of samples. In summary, the two criteria, t_i > 2 and |DFFITS_i| > 2√(p/n), were used for detecting outliers in the present study; they select the outliers that are extremely located compared to the others and simultaneously likely to be influential points.
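As an illustration of Eqs. (10) and (11), the sketch below computes both diagnostics for a one-predictor regression. It rests on several assumptions: the leverage uses the closed form for simple linear regression, the left-out MSE is obtained from the standard deletion identity rather than by refitting, the cutoff is applied to the absolute value of t_i, and the data are a hypothetical toy set with one planted outlier. The function names are not from the study.

```python
import math


def outlier_diagnostics(x, y):
    """Return (t_i, DFFITS_i, h_ii) for each sample of a one-predictor regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Leverage h_ii of simple linear regression (diagonal of the hat matrix).
    leverages = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    sse = sum(e * e for e in residuals)
    n_params = 2  # intercept and slope
    results = []
    for e, h in zip(residuals, leverages):
        # MSE of the model fitted without sample i (deletion identity, no refit needed).
        mse_deleted = (sse - e * e / (1.0 - h)) / (n - 1 - n_params)
        t = e / math.sqrt(mse_deleted * (1.0 - h))   # Eq. (10)
        dffits = math.sqrt(h / (1.0 - h)) * t        # Eq. (11)
        results.append((t, dffits, h))
    return results


def flag_outliers(x, y):
    """Indices meeting both criteria: |t_i| > 2 and |DFFITS_i| > 2*sqrt(p/n)."""
    results = outlier_diagnostics(x, y)
    n = len(x)
    p = sum(h for _, _, h in results)  # trace of the hat matrix
    cutoff = 2.0 * math.sqrt(p / n)
    return [i for i, (t, d, _) in enumerate(results)
            if abs(t) > 2.0 and abs(d) > cutoff]


# Hypothetical toy data: y is close to 2x, except the last point, which is shifted by +10.
x = [float(v) for v in range(1, 11)]
shifts = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, 10.0]
y = [2.0 * xi + s for xi, s in zip(x, shifts)]

flagged = flag_outliers(x, y)
print(flagged)
```

Only the planted point (index 9) meets both criteria in this toy set; the other points have sizeable ordinary residuals because the outlier tilts the fitted line, but their studentized residuals stay well below 2.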

Hierarchical clustering of molecular descriptors according to their activity profile. The molecular descriptors selected for their higher explanatory power were clustered hierarchically to reveal their inter-relationship. For the clustering, the dissimilarity between descriptor pairs was defined by the following relationship:

$$\mathrm{dist}(i, j) = 1 - \mathrm{abs}(r_{ij}) \tag{12}$$

where r_ij is the correlation coefficient between molecular descriptors i and j, each represented by a vector composed of its values for each compound. For joining and defining clusters, the complete linkage method, which uses the maximum distance between sets of observations, was used. The hierarchical tree was generated by the 'hclust' function of R software (http://www.r-project.org).
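The dissimilarity of Eq. (12) and complete-linkage merging can be sketched in pure Python as follows. This is a small re-implementation for illustration only, not the R 'hclust' call used in the study; the descriptor names and values are hypothetical, and the output is a simple list of merges rather than a dendrogram object.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def dissimilarity(u, v):
    """Eq. (12): strongly correlated or anti-correlated descriptors are close."""
    return 1.0 - abs(pearson(u, v))


def complete_linkage(vectors):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in vectors]
    merges = []

    def cluster_distance(a, b):
        # Complete linkage: the largest pairwise dissimilarity between members.
        return max(dissimilarity(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, ia, ib = min(
            (cluster_distance(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        merges.append((clusters[ia], clusters[ib], d))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [merged]
    return merges


# Hypothetical descriptor vectors over five compounds (not data from the study).
toy_descriptors = {
    "desc_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "desc_B": [2.0, 4.0, 6.0, 8.0, 10.5],
    "desc_C": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with desc_A
    "desc_D": [1.0, 3.0, 2.0, 5.0, 1.0],
}
merges = complete_linkage(toy_descriptors)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

Because the distance uses the absolute correlation, `desc_A` and `desc_C` are merged first even though they are anti-correlated, which is the intended behavior when grouping descriptors by shared information content.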

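Table 3
The definitions of selected molecular descriptors.

| Category | Names of molecular descriptors | De |
|---|---|---|
| Constitutional | No_CaC | Nu |
| Constitutional | No_CsC | Nu |
| Constitutional | Faction_of_aromatic_atoms | Fra |
| Constitutional | No_single_bonds | Nu |
| Constitutional | No_double_bonds | Nu |
| Constitutional | No_H_bond_donors | Nu |
| Geometrical | 2D_VSA_hydrophobic | 2D |
| Geometrical | 2D_VSA_hydrophobic_unsat | 2D |
| Geometrical | 2D_VSA_hydrophobic_sat | 2D |
| Geometrical | 2D_VSA_Hbond_donor | 2D |
| Geometrical | 2D_VSA_polar | 2D |
| Geometrical | Fraction_of_2D_VSA_Hbond_donor | Fra |
| Geometrical | Fraction_of_2D_VSA_hydrophobic_unsat | Fra |
| Physicochemical | AlogP98_002_C (a) | C i |
| Physicochemical | AlogP98_008_C (a) | C i |
| Physicochemical | AlogP98_024_C (a) | C i |
| Physicochemical | AlogP98_026_C (a) | C i |
| Physicochemical | AlogP98_050_H (a) | H |
| Physicochemical | Solvation_free_energy | So |
| Physicochemical | SK_MP | Me |
| Topological | Bound_charge_index_06 | Ga |
| Topological | ATS_Geary_09_polarizability | Ge |
| Topological | ATS_Geary_10_E_state | Ge |
| Topological | ATS_Moreau_Bruto_09_mass_average | Av |
| Topological | VChi_04_path | Kie |
| Topological | VChi_04_path_cluster | Kie |
| Topological | VChi_05_path | Kie |
| Topological | Delta_Chi_03_path | Dif |
| Topological | Delta_Chi_04_path_cluster | Dif |
| Topological | Valence_charge_index_02 | Ga |
| Topological | Valence_charge_index_03 | Ga |

(a) AlogP98 is octanol-water partition coefficient calculated by Ghose atom additive metho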

Results

Prediction models using selected molecular descriptors

The prediction models with the selected molecular descriptors(see Supplement 1 for the values) are listed in Table 2 for eachcategory of molecular descriptors, and the definitions of each one areexplained in Table 3. Each model was made without outliers, and theprediction accuracies were tested by the trained multiple linearregression model, leave-one-out validation and 10-fold cross valida-tion (Fig. 1). The results showed that the accuracies using leave-one-out (Table 2) or 10-fold cross validation (Supplement 2) were slightlylower than those using the trained multiple linear regression models,but the differences were not significant.

It is clear that the model that selected variables among allcategories of molecular descriptors (model #25, see the Table 2 fornumbering) showed the best performance (adjusted R2=0.852),but the other models that used molecular descriptor(s) in a specificcategory showed reasonably accurate results (adjusted R2 valueswere around 0.7). The informative power of each category wasordered by topological, physicochemical, constitutional and geo-metrical descriptors (the averages of the adjusted R2 values were0.746, 0.698, 0.695, and 0.656, respectively). Other performancemeasures including the multiple R2, MSE and AIC showed consistentresults.

The prediction accuracies of models using the best one informativemolecular descriptor were lower than those using two descriptors.The averages of the adjusted R2 values were 0.692 and 0.699 formodels when using one and two constitutional descriptors, respec-tively, and those values were 0.653 and 0.659 using geometricaldescriptors. However, the differences of the average of adjusted R2

values were increased from 0.680 to 0.716 for physicochemicaldescriptors and from 0.735 to 0.757 for topological descriptors using

finitions of molecular descriptors

mber of aromatic bonds between C and Cmber of single bonds between C and Cction of aromatic atomsmber of single bondsmber of double bondsmber of hydrogen bond donorsvan der Waals partial hydrophobic surface areavan der Waals partial hydrophobic surface area of hydrophobic saturated groupsvan der Waals partial hydrophobic surface area of hydrophobic unsaturated groupsvan der Waals partial surface area of hbond donorsvan der Waals polar surface areaction of 2D Van der Waals Hbond donor surface areaction of 2D Van der Waals hydrophobic unsaturated surface arean CH2R2 (R represents any group linked through carbon)n CHR2X (X represents any heteroatom (O, N, S, P, Se and halogens))n R–CH-Rn R–CX-R (X represents any heteroatom (O, N, S, P, Se and halogens))attached to heteroatomlvation free energylting pointlvez bound charge index (J) of order 6ary autocorrelation function for polarizability of order 9ary autocorrelation function for E state of order 10erage of Moreau Bruto autocorrelation function for mass of order 9r & Hall valence connectivity index of order 4(path)r & Hall valence connectivity index of order 4(path/cluster)r & Hall valence connectivity index of order 5(path)ference between chi and VChi of order3(path)ference between chi and VChi of order4(path/cluster)lvez valence charge index (Gv) of order 2lvez valence charge index (Gv) of order 3

d.

Page 6: Importance of structural information in predicting human acute toxicity from in vitro cytotoxicity data

Fig. 1. Scatter plot of logVivo values versus predicted values fromdifferentmodels: (a)model #0 using logVitro only, (b)model #19 (the bestmodel using one descriptor), (c) model #22(the best model using two descriptors), and (d) model #25 (the model constructed by the forward selection).

43S. Lee et al. / Toxicology and Applied Pharmacology 246 (2010) 38–48

one and two molecular descriptors, respectively. The accuracyimprovement using physicochemical and topological descriptors(e.g., models #16 and #22) seemed to be slightly higher than theothers (e.g., models #4 and #10), and they also had no outliers,indicating that the former is a better model. On the other hand, themost accurate model by exhaustively searching two moleculardescriptor combinations regardless of the descriptor category wasthe same to the model using two topological descriptors (model #22).This result means that the selected topological descriptors have moreexplanatory power than the other categories, and that there is littlecombinational effect between different descriptor categories forpredicting acute toxicity.
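A minimal sketch of an exhaustive two-descriptor search of this kind is given below. It assumes ordinary least squares on logVitro plus each descriptor pair, ranked by adjusted R²; the function names (`solve`, `adjusted_r2`, `best_pair`) and the dictionary input are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def adjusted_r2(columns, y):
    """Least-squares fit of y on the given columns plus an intercept."""
    n, k = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k + 1)] for a in range(k + 1)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k + 1)]
    beta = solve(xtx, xty)
    pred = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    y_mean = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def best_pair(log_vitro, descriptors, log_vivo):
    """Try every descriptor pair alongside logVitro; return (adjusted R2, pair)."""
    return max(
        (adjusted_r2([log_vitro, descriptors[a], descriptors[b]], log_vivo), (a, b))
        for a, b in combinations(sorted(descriptors), 2)
    )
```

With a few dozen candidate descriptors the number of pairs stays in the hundreds, so a brute-force loop like this is cheap; the adjusted R² penalty matters mainly when models with different numbers of descriptors are compared.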

The models using only molecular descriptors, without cytotoxicity, show the explanatory power of the structural descriptors alone. Most descriptors selected in these models were different from those in the models that also used cytotoxicity data. Several topological descriptors and molecular weight were selected for MD1 in Eqs. (5) and (6), and each showed comparable correlations with human acute toxicity (Supplement 3). However, the assistant descriptors, MD2 in Eq. (6), showed weak correlations with acute toxicity values, which indicates that they might provide information complementary to MD1. Overall, the models using only one or two structural descriptors showed lower prediction accuracy than those that also used cytotoxicity data. Thus, cytotoxicity clearly provided much more information on human acute toxicity than did the structural descriptors; the structural descriptors complement the cytotoxicity data rather than providing sufficient information on human acute toxicity by themselves. This result is consistent with previous results (Lessigiarska et al., 2006), which suggested that cytotoxicity data seemed to be the best surrogate of in vivo toxicity. On the other hand, the models using pharmacokinetic parameters with cytotoxicity were not sufficiently accurate (Supplement 4), as detailed in the Discussion.

Outliers of each regression model

The outliers for each prediction model are listed in Table 2, and the structures of all of them are shown in Fig. 2. Digoxin, lindane, and malathion turned out to be outliers due to their logVitro values: the molecular descriptors of these three drugs were zero, so their regression models were determined by logVitro values only, with no molecular descriptor information. The other three drugs, cyclosporin A, hexachlorophene, and rifampicine, were outliers because of extreme values in specific molecular descriptors (Fig. 2).

The outlier for the model using No_CaC, hexachlorophene, had a much higher value (No_CaC = 12) than the average (5.0). The No_single_bonds and No_H_bond_donors of cyclosporin A were 184 and 5, whereas the averages were 26.4 and 1.4, respectively. The 2D_VSA_hydrophobic_sat and 2D_VSA_Hbond_donor of cyclosporin A were 1035 and 53.12, whereas the averages were 144.4 and 20.1, respectively. The Bound_charge_index_06 values of hexachlorophene and rifampicine were 0.0328 and 0.0495, respectively, and the average was 0.0126. The box plots in Fig. 2 display the position of each outlier's value within the whole distribution of the corresponding molecular descriptor. For example, cyclosporin A was an outlier due to the extreme values of molecular descriptors C1–C4 (Fig. 2; C1: No_single_bonds, C2: No_H_bond_donors, C3: 2D_VSA_hydrophobic_sat, C4: 2D_VSA_Hbond_donor). Here, the capital 'C' abbreviates 'cyclosporin'. Similarly, hexachlorophene and rifampicine were outliers of the models using molecular descriptors H1 (No_CaC), H2 (Bound_charge_index_06) and R1 (Bound_charge_index_06) due to their extreme values. Here, the capitals 'H' and 'R' abbreviate 'hexachlorophene' and 'rifampicine'. In most box plots, the values of the outliers lay beyond the central range of the distribution (the box, from the first quartile to the third quartile), and several of them were the maximum of the distribution. Based on these observations, the extremely high descriptor values of these three drugs seemed to interfere with the construction of well-fitted regression models. (The descriptors for which the outliers had the maximum or the minimum value are listed in Supplement 5.) Therefore, one should be cautious when applying the proposed models to unknown compounds whose properties are similar to those of the outliers of the corresponding model.

Fig. 2. The structures of all outliers in the prediction models. The box plots on the right side of the structures represent the distributions of the outlier-related molecular descriptors. The capitals C, H, and R below the box plots indicate cyclosporin A, hexachlorophene, and rifampicine, respectively. C1–C4 are No_single_bonds, No_H_bond_donors, 2D_VSA_hydrophobic_sat, and 2D_VSA_Hbond_donor, respectively. H1, H2, and R1 are No_CaC, Bound_charge_index_06 (hexachlorophene), and Bound_charge_index_06 (rifampicine), respectively. Small circles are outlier values in each box plot, and the large circles indicate the position of each drug's value in the distribution of each molecular descriptor.

Fig. 3. The hierarchical clustering tree of molecular descriptors.

Clustering analysis on selected descriptors

Clustering molecular descriptors helps in understanding descriptors that are difficult to interpret. Several molecular descriptors from different categories can reflect similar chemical properties. For example, the number of single and double bonds increases with the size of a molecule, and the logP value is correlated with hydrophobicity. The correlations between all selected molecular descriptors were calculated to find the similarity between them. In particular, since the topological descriptors may represent different information depending on the endpoint or compound class (Randic, 2001), more understandable descriptors such as molecular weight can be used to help interpret the information encoded in topological descriptors.

Broadly, four groups emerge when the descriptors are hierarchically clustered (Fig. 3). The first group could be represented by single-bonded carbons, which were closely related to the hydrophobicity of local environments. AlogP98_002_C and AlogP98_008_C were correlated with the number of single bonds (No_single_bonds) or the number of single bonds between carbons (No_CsC) (Fig. 3). In addition, the Kier and Hall valence connectivity indices (VChi_04_path, VChi_05_path) and the Galvez valence charge index (Valence_charge_index) of higher order were also strongly correlated with the above descriptors. Similarly, the second group was represented by delta connectivity indices (Delta_chi_04_path_cluster, Delta_chi_03_path, derivatives of valence connectivity indices) and the van der Waals partial hydrophobic surface area (2D_VSA_hydrophobic). This group was more closely related to the first group than the others were. The delta connectivity indices of high order seemed to be correlated with global hydrophobic indices or solvation free energy. The third group contained fractional information on the hydrophobic surface area of unsaturated carbons (Fraction_of_2D_VSA_hydrophobic_unsat) and aromatic carbons (Fraction_of_aromatic_atoms). The other quantities (logP value of carbon in R–CH-R, number of aromatic atoms) were also correlated with those descriptors. The fourth group was represented by hydrogen-bonding related terms: the (fraction of) van der Waals surface area of hydrogen bond donors and the number of hydrogen bond donors. Since the 050_H atom type represents H attached to a heteroatom, AlogP98_050_H was also related to them.

Discussion

Molecular descriptors of each regression model

Most regression models with three explanatory variables, composed of cytotoxicity and two molecular descriptors, had high accuracy, except those using geometrical descriptors. The models using topological descriptors showed the best performance based on the adjusted R², followed by physicochemical, constitutional and geometrical descriptors. Notably, in all models the two combined descriptors came from different groups (see above for the grouping). For example, the best-fit model with two topological descriptors used valence_charge_index_03 (group1) and delta_chi_04_path_cluster (group2). In most cases, the explanatory variables were combined from (group1 and group2) or (group1 and group4). It is reasonable to attribute the performance improvement from using two descriptors rather than one to the addition of new information from a different chemical property group. Accordingly, the significant performance improvement obtained with descriptor combinations (e.g., model #22) suggested that both descriptors could correct the difference between acute toxicity and cytotoxicity.

Since the inter-correlations between explanatory variables were low in most regression models, judging by the maximum VIF values, these descriptor combinations seemed to be mutually complementary. In addition, the relationship between molecular descriptors and human acute toxicity could be established easily because of the independence of the explanatory variables. As mentioned before, the descriptors in group1 represented a single-bonded local environment, which was also related to the logP value, and the descriptors in group2 provided information on valence connectivity indices and global hydrophobicity. In the representative model for this combination (model #22), human acute toxicity was negatively correlated with the valence_charge_index of high order (group1) and positively correlated with the delta connectivity indices (group2). In this case, it was likely that these two descriptors expressed different valence connectivity features and covered different local environments (see the order of each descriptor), which might explain the good performance.
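For reference, the variance inflation factor is conventionally defined as follows, where $R_j^2$ is the coefficient of determination obtained by regressing the j-th explanatory variable on all the other explanatory variables. This is the standard textbook definition and is assumed, not confirmed, to be the one used here; the cutoff applied to the maximum VIF is not stated in this section.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}}
```

A value close to 1 means that variable j is nearly uncorrelated with the others, and the value grows without bound as $R_j^2$ approaches 1.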

In addition, most models using two descriptors, one from group1 and one from group4, showed relatively high accuracy. Group4 provided information on hydrogen bonding (in particular, hydrogen-bond donors). Model #4 was the representative predictor of such models, using the number of single bonds in group1 and the number of hydrogen bond donors in group4. In this model, the number of single bonds was negatively correlated with human acute toxicity; in contrast, the number of hydrogen bond donors had a positive correlation. In summary, human acute toxicity was strongly correlated with cytotoxicity data, and the accuracy could be increased when the descriptors of group1 were included in the basic in vitro model.

Among all descriptor categories, topological descriptors had the highest explanatory power. These descriptors were devised to represent features of compounds that are somewhat different from those captured by other categories (Randic, 2001). Thus, using topological descriptors together with descriptors from other categories is useful, but the direct interpretation of their pharmacokinetic meaning is very difficult. Instead, we can infer their meaning from interpretable descriptors in other categories that are strongly correlated with them (the same clustered group). Most of them can be expressed using one or two descriptors from other categories. For example, VChi_05_path in group1 (Fig. 3) can be understood as a balanced combination of the number of rings (No_Rings) and the number of single bonds between carbons (No_CsC). The R² of the regression model (VChi_05_path = 0.51589 No_Rings + 0.14453 No_CsC) was 0.9589. The Valence_charge_index_02 of group1 can similarly be expressed by a regression model using the number of single bonds and the number of single bonds between carbon and oxygen (Valence_charge_index_02 = 0.070079 No_single_bonds + 0.136250 No_CsO, R² = 0.9616).
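To make the two reported regressions concrete, the short sketch below simply evaluates them. The coefficients are the ones quoted in the text; the absence of an intercept follows the equations as printed, and the example compound (2 rings, 10 C–C single bonds, 26 single bonds, 3 C–O single bonds) is hypothetical.

```python
def vchi_05_path(no_rings, no_csc):
    """Approximate VChi_05_path from ring and C-C single-bond counts."""
    return 0.51589 * no_rings + 0.14453 * no_csc

def valence_charge_index_02(no_single_bonds, no_cso):
    """Approximate Valence_charge_index_02 from single-bond counts."""
    return 0.070079 * no_single_bonds + 0.136250 * no_cso

# Hypothetical compound, used only to show how the counts enter the formulas.
print(round(vchi_05_path(2, 10), 5))
print(round(valence_charge_index_02(26, 3), 5))
```

Both expressions grow with simple bond and ring counts, which is what allows the otherwise opaque topological indices to be read as measures of molecular size and saturation.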

The role of molecular descriptors

The improvement in accuracy obtained by adding molecular descriptors to cytotoxicity does not by itself indicate that these descriptors are strongly correlated with human acute toxicity. To examine whether they are correlated with acute toxicity or whether they explain the difference between in vitro and in vivo toxicity, we built a model using only the molecular descriptors selected in model #22 (Valence_charge_index_03 and Delta_Chi_04_path_cluster), which showed the best performance among models #1 to #24. The R² decreased from 0.77 with cytotoxicity to 0.39 without cytotoxicity. This observation showed that the molecular descriptors provided information not on human acute toxicity itself, but on the difference between acute toxicity and cytotoxicity.

The next important question was what kind of information a molecular descriptor provides to correct the difference between human acute toxicity and cytotoxicity. The first possible reason for the difference is the limitation of toxicity detection by a specific test (3T3 NRU here). The test can detect only the basal toxicity that interferes with normal cell survival. Moreover, it targets a specific cell line, and thus cannot represent the viability of the various cells in the human body. The other possible reason is the bioavailability of the drug. Pharmacokinetic processes such as absorption, distribution, metabolism and excretion cause different toxicity results. We searched previous reports in the literature for support for these two hypotheses. We found no evidence for the first reason, but various reports gave hints concerning the second.

The selected molecular descriptors have been used to predict various pharmacokinetic parameters. Hydrophobic surface area has been used to predict blood brain barrier penetration (Stanton et al., 2004); in this study, Fraction_of_2D_VSA_hydrophobic_unsat, 2D_VSA_hydrophobic_unsat, 2D_VSA_hydrophobic_sat and 2D_VSA_hydrophobic describe this property. Another very important factor in determining pharmacokinetic properties is hydrophobicity, which has been used together with acidity to predict solubility, human plasma protein binding and volume of distribution (Lobell and Sivarajah, 2003); in this study, the selected descriptors of hydrophobicity were AlogP98_002_C, AlogP98_008_C, AlogP98_024_C and AlogP98_050_H. Two selected descriptors, melting point (SK_MP) and solvation free energy (Solvation_free_energy), have been used to predict drug absorption (Chu and Yalkowsky, 2009) and solubility (Luder et al., 2007), respectively. Recently, it was observed that the connectivity index provided information on the toxicity of substituted-benzenes (Chen et al., 2009). In this study, two connectivity indexes, VChi_04_path and VChi_05_path, provided useful information. This index was also useful for the prediction of the octanol–air partition coefficients of semivolatile organic compounds (Zhao et al., 2005). The valence charge index has also been important for predicting various pharmacodynamic, pharmacokinetic or toxicological parameters (Turner et al., 2003; Yap et al., 2006). In this study, Valence_charge_index_02 and Valence_charge_index_03 provided information on the valence charge.

Table 4. Lists and properties of chemical groups.

| Chemical group | Chemical name | AbsErr* | Half-life | MW | AlogP98 |
|---|---|---|---|---|---|
| Chemical group1 | Digoxin | 4.2 | 3.5 to 5 days | 780.9 | 2.00 |
| | Malathion | 2.85 | 8–24 h | 330.4 | 2.16 |
| | Lindane | 2.71 | 18 h | 290.8 | 4.16 |
| | Nicotine | 2.61 | 15–20 h | 162.2 | 1.24 |
| Chemical group2 | Glufosinate ammonium | 0.13 | NA** | 198.2 | −3.65 |
| | Sodium chloride | 0.13 | NA** | 58.4 | −2.67 |
| | Mercury (II) chloride | 0.09 | NA** | 271.5 | 0 |
| | Sodium bicarbonate | 0.08 | NA** | 84.0 | −0.516 |
| | Dichlorvos | 0.07 | NA** | 221.0 | 1.03 |
| | Ethanol | 0.03 | NA** | 46.1 | −0.0092 |

AbsErr*: the absolute value of logVivo minus logVitro. NA**: not available.

Various successful prediction methods for pharmacokinetic parameters have used the structural information of chemicals, which means that molecular descriptors provide information on the reaction of the human body to a chemical. In particular, molecular descriptors have been used to predict bioavailability (Pintore et al., 2003; Turner et al., 2004; Ma et al., 2008), absorption (Egan et al., 2000; Stenberg et al., 2001; Hou et al., 2007), distribution (Jansson et al., 2008; Paixao et al., 2009), metabolism (Cronin, 2003; Ekins et al., 2005; Madden and Cronin, 2006; Li et al., 2008) and plasma protein binding (Liu et al., 2005; Ma et al., 2008). These pharmacokinetic parameters are very important factors in determining the blood concentration and the residence time of chemicals in the human body, and thus they affect toxicity. Some of the successful descriptors selected in this study might account for the pharmacokinetic parameters that affect toxicity in the human body.

In summary, two observations, that (1) molecular descriptor information can correct the difference between cytotoxicity data and human acute toxicity values and (2) molecular descriptors, in particular those selected in this study, can predict certain basic pharmacokinetic parameters, suggest that the improvement in prediction accuracy obtained by adding molecular descriptors to cytotoxicity data may be due to correction of the pharmacokinetic difference between in vivo measurements and in vitro experiments.

It is remarkable that the prediction accuracies of models using predicted pharmacokinetic parameters instead of molecular descriptors (Eqs. (7)–(9)) were lower than those of models using molecular descriptors (Supplement 4). PreADME predicts pharmacokinetic parameters from a combination of molecular descriptors, so the prediction error may be compounded when these inaccurate predictions are used as explanatory variables. If measured pharmacokinetic parameters could be obtained, they might give improved results.

Difference between in vivo and in vitro toxicity

Some chemicals, namely ethanol, dichlorvos, sodium bicarbonate, mercury (II) chloride, sodium chloride and glufosinate ammonium (denoted as chemical group2 in Table 4), can be predicted accurately from cytotoxicity tests; their absolute errors between logVivo and logVitro were lower than 0.15. Such chemicals are good examples of basal toxicity. However, some chemicals, such as digoxin, malathion, lindane and nicotine (denoted as chemical group1), had absolute errors higher than 2.5. Their acute toxicities were much higher than their cytotoxicities. We compared six important pharmacokinetic parameters (half-life, mechanism of action, absorption property, plasma protein binding and biotransformation) from DrugBank and two structural descriptors (molecular weight and hydrophobicity) that are the most important for predicting pharmacokinetic parameters (Blaauboer, 2003; Chae et al., 2005; Cronin and Mark, 2006). However, we could not find any factors distinguishing the two groups (Supplement 6). The lack of pharmacokinetic data also prevented progress in the study. Nevertheless, chemical group1 had a relatively long half-life (38.5 hours on average) compared to the whole chemical space (13.5 hours on average, excluding extreme values: amiodarone (58 days), chloroquine (1–2 months), and warfarin (a week)). Their long residence time in the human body could increase their toxic effects, which might explain the observation that their acute toxicities are much higher than their cytotoxicities.

On the other hand, chemical group2 includes many simple inorganic compounds such as sodium bicarbonate, mercury (II) chloride and sodium chloride. Although their pharmacokinetic parameters cannot be analyzed, they can be inferred from structural descriptors that are important for predicting pharmacokinetic parameters. These compounds have low molecular weights (see Fig. 4a) and small AlogP98 values (see Fig. 4b) compared to the average of the whole chemical space, whereas chemical group1 has very large values for both properties compared to the average values of all chemicals.

Fig. 4. Average molecular descriptor values of chemical group1, chemical group2, and the whole chemical space. (a) average molecular weights and (b) average AlogP98 values.

It is known that the absorption of some chemicals such as chitosan is significantly influenced by molecular weight (that is, as the molecular weight increases, the absorption decreases) (Chae et al., 2005). In this case, small chemicals tend to be well absorbed, and thus they might show little in vivo–in vitro difference due to absorption in the human body. Similarly, for reactive chemicals, hydrophobicity is known to be important for transport and distribution. Due to its contribution to pharmacokinetics, log P is also found in models for mammalian toxicities (Cronin and Mark, 2006).
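The grouping by absolute error used in this subsection can be sketched as below. The AbsErr values are those listed in Table 4 and the cutoffs (2.5 and 0.15) are the ones quoted in the text; the function names and the "intermediate" label for everything in between are assumptions added for illustration.

```python
def abs_error(log_vivo, log_vitro):
    """AbsErr as defined under Table 4: |logVivo - logVitro|."""
    return abs(log_vivo - log_vitro)

def assign_group(err, high=2.5, low=0.15):
    """Label a chemical by how far in vivo toxicity departs from cytotoxicity."""
    if err > high:
        return "chemical group1"   # acute toxicity far from cytotoxicity
    if err < low:
        return "chemical group2"   # well predicted by cytotoxicity alone
    return "intermediate"          # hypothetical label, not used in the paper

# AbsErr values as listed in Table 4.
table4_abs_err = {
    "Digoxin": 4.2, "Malathion": 2.85, "Lindane": 2.71, "Nicotine": 2.61,
    "Glufosinate ammonium": 0.13, "Sodium chloride": 0.13,
    "Mercury (II) chloride": 0.09, "Sodium bicarbonate": 0.08,
    "Dichlorvos": 0.07, "Ethanol": 0.03,
}

groups = {name: assign_group(err) for name, err in table4_abs_err.items()}
for name, group in groups.items():
    print(f"{name}: {group}")
```

Since the errors are on a log scale, the 2.5 cutoff corresponds to in vivo and in vitro values that differ by more than two orders of magnitude.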

Relationships with previous results

As mentioned in the Introduction, our study extended the previous work by Lessigiarska et al. The prediction methods developed in the current study and by Lessigiarska et al. differ from previous toxicity screening tests in that they compensate for the weak points of in vitro experiments and in silico computations by using both kinds of information together. Although the main concept of our method and Lessigiarska's method is similar, our study tried to improve on it with a larger dataset, a precise assessment of the informative qualities of molecular descriptors, and an explanation of how the pharmacokinetic meaning of chemical properties relates to the difference between human acute toxicity and cytotoxicity.

Our results have some similarities to and differences from Lessigiarska's. For example, lindane and hexachlorophene were also outliers in their study, and digoxin served as an influential point, as in some models in our study. In spite of the differences between the experimental systems, such as cell type and end point, this correspondence indicates that these compounds commonly behave as outliers. However, in our study, most models using two molecular descriptors and logVitro had high accuracy without outliers.

Previously, the additional descriptors used for relating human acute toxicity and cytotoxicity were the electronic reactivity property, size/shape property and the number of hydrogen bonds (Lessigiarska et al., 2006). The most effective descriptors in our models (Table 2) were topological descriptors, number of hydrogen bond donors, number of single bonds and hydrophobicity, which overlapped with the previous research.

The most critical difference was that in their study, the QSAR model was comparable with the QSAAR (quantitative structure–activity–activity relationship) model including in vitro toxicity, and in some cases, the QSAR model was more accurate than the QSAAR model. However, in our results, the in vitro toxicity information was superior to the structural descriptors. For example, the difference in correlation between the best model using one structural descriptor and in vitro toxicity (model #20) and the best model using two structural descriptors without cytotoxicity (Supplement 3) was 0.134 (0.848 − 0.714). This means that cytotoxicity carries information that the structural descriptors cannot represent. Another possibility is that the smaller dataset of the previous study could be the cause of the difference.

Conclusion

In this study, models for predicting human acute toxicity were constructed from combinations of cytotoxicity and structural descriptors. Although the in vitro experiment was comparatively reliable for estimating human toxicity, its prediction accuracy was not sufficiently high for clinical use. Thus, we assumed that the deficiency of in vitro experiments could be remedied by adding structural information, which is plausible owing to the numerous computational methods that have been developed to handle various molecular properties. A successful model would be helpful for predicting and modeling human toxicity without additional cost.

Our exhaustive search to close the gap between in vivo and in vitro toxicity showed that adding a few structural descriptors significantly improved the prediction accuracy, while simultaneously reducing the number of outliers. Among these descriptors, topological descriptors gave very useful information that was complementary to other information. Molecular descriptors could correct for the difference between cytotoxicity and human acute toxicity, but on their own they had only a small correlation with human acute toxicity. The selected molecular descriptors were highly related to several pharmacokinetic parameters, which could be the reason for the difference between in vivo and in vitro toxicity.

Many previous chemoinformatic methods lack transparency, partly because many of them are implemented in commercial software. In this study, we tried to maintain the maximal level of transparency of the data and of all prediction models, which should help anyone who wishes to use our method for more efficient drug development. Although we used a larger dataset than previous studies, the dataset needs to be larger still to be of practical help in drug development. With a sufficiently large dataset and various types of in vitro data, this approach should improve the reliability of in vitro experiments, which can provide valuable information for toxicity screening.

In conclusion, human acute toxicity was strongly correlated with cytotoxicity data. The limited ability of in vitro tests to predict in vivo data could be compensated for by using structural descriptors related to human pharmacokinetic parameters.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MEST) (2009-0086964). We thank Jung Moon-Soul and the members of the Protein Bio-Informatics Laboratory (PBIL) for helpful discussions.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.taap.2010.04.004.

References

ACuteTox (2009). http://www.acutetox.org.Ajay, Bemis, G.W., Murcko, M.A., 1999. Designing libraries with CNS activity. J. Med.

Chem. 42, 4942–4951.Basak, S.C., Balasubramanian, K., Gute, B.D., Mills, D., Gorczynska, A., Roszak, S., 2003.

Prediction of cellular toxicity of halocarbons from computed chemodescriptors: ahierarchical QSAR approach. J. Chem. Inf. Comput. Sci. 43, 1103–1109.

Blaauboer, B.J., 2003. The integration of data on physico-chemical properties, in vitro-derived toxicity data and physiologically based kinetic and dynamic as modelling atool in hazard and risk assessment. A commentary. Toxicol. Lett. 138, 161–171.

Chae, S.Y., Jang, M.K., Nah, J.W., 2005. Influence of molecular weight on oral absorptionof water soluble chitosans. J. Control. Release 102, 383–394.

Chen, Q., Kou, Y.W., Wang, Q., Chen, H., Yuan, J., 2009. A molecular fragments variableconnectivity index for studying the toxicity (Vibrio fischeri pT50) of substituted-benzenes. J. Environ. Sci. Health A Tox. Hazard. Subst. Environ. Eng. 44, 288–294.

Chu, K.A., Yalkowsky, S.H., 2009. An interesting relationship between drug absorptionand melting point. Int. J. Pharm. 373, 24–40.

Clemedson, C., 2008. The European ACuteTox project: a modern integrative in vitroapproach to better prediction of acute toxicity. Clin. Pharmacol. Ther. 84, 200–202.

Cronin, D., Mark, T., 2006. The role of hydrophobicity in toxicity prediction. Curr.Comput. Aided Drug Design 2, 405–413.

Cronin, M.T., 2003. Computer-aided prediction of drug toxicity and metabolism. Exs259–278.

Cronin, M.T., Manga, N., Seward, J.R., Sinks, G.D., Schultz, T.W., 2001. Parametrization ofelectrophilicity for the prediction of the toxicity of aromatic compounds. Chem.Res. Toxicol. 14, 1498–1505.

Dearden, J.C., 2003. In silico prediction of drug toxicity. J. Comput. Aided Mol. Des. 17,119–127.

Egan, W.J., Merz Jr., K.M., Baldwin, J.J., 2000. Prediction of drug absorption usingmultivariate statistics. J. Med. Chem. 43, 3867–3877.

Ekins, S., Andreyev, S., Ryabov, A., Kirillov, E., Rakhmatulin, E.A., Bugrim, A., Nikolskaya, T., 2005. Computational prediction of human drug metabolism. Expert Opin. Drug Metab. Toxicol. 1, 303–324.

Hou, T., Wang, J., Li, Y., 2007. ADME evaluation in drug discovery. 8. The prediction of human intestinal absorption by a support vector machine. J. Chem. Inf. Model 47, 2408–2415.

http://www.r-project.org R.

Jansson, R., Bredberg, U., Ashton, M., 2008. Prediction of drug tissue to plasma concentration ratios using a measured volume of distribution in combination with lipophilicity. J. Pharm. Sci. 97, 2324–2339.

Johnson, D.E., Wolfgang, G.H., 2000. Predicting human safety: screening and computational approaches. Drug Discov. Today 5, 445–454.

Lee, S., Kim, D., 2007. A new method for predicting human hepatic clearance from in vitro experimental data using molecular descriptors. Arch. Pharm. Res. 30, 182–190.

Lee, S.K., Chang, G.S., Lee, I.H., Chung, J.E., Sung, K.Y., No, K.T., 2004. The PreADME: PC-based program for batch prediction of ADME properties. EuroQSAR.

48 S. Lee et al. / Toxicology and Applied Pharmacology 246 (2010) 38–48

Lessigiarska, I., Worth, A.P., Netzeva, T.I., Dearden, J.C., Cronin, M.T., 2006. Quantitative structure–activity-activity and quantitative structure–activity investigations of human and rodent toxicity. Chemosphere 65, 1878–1887.

Li, A.P., 2004. Accurate prediction of human drug toxicity: a major challenge in drug development. Chem. Biol. Interact. 150, 3–7.

Li, A.P., 2007. Human-based in vitro experimental systems for the evaluation of human drug safety. Curr. Drug Saf. 2, 193–199.

Li, H., Sun, J., Fan, X., Sui, X., Zhang, L., Wang, Y., He, Z., 2008. Considerations and recent advances in QSAR models for cytochrome P450-mediated drug metabolism prediction. J. Comput. Aided Mol. Des. 22, 843–855.

Lin, Z., Yu, H., Wei, D., Wang, G., Feng, J., Wang, L., 2002. Prediction of mixture toxicity with its total hydrophobicity. Chemosphere 46, 305–310.

Liu, J., Yang, L., Li, Y., Pan, D., Hopfinger, A.J., 2005. Prediction of plasma protein binding of drugs using Kier–Hall valence connectivity indices and 4D-fingerprint molecular similarity analyses. J. Comput. Aided Mol. Des. 19, 567–583.

Lobell, M., Sivarajah, V., 2003. In silico prediction of aqueous solubility, human plasma protein binding and volume of distribution of compounds from calculated pKa and AlogP98 values. Mol. Divers. 7, 69–87.

Luder, K., Lindfors, L., Westergren, J., Nordholm, S., Kjellander, R., 2007. In silico prediction of drug solubility. 3. Free energy of solvation in pure amorphous matter. J. Phys. Chem. B 111, 7303–7311.

Ma, C.Y., Yang, S.Y., Zhang, H., Xiang, M.L., Huang, Q., Wei, Y.Q., 2008. Prediction models of human plasma protein binding rate and oral bioavailability derived by using GA-CG-SVM method. J. Pharm. Biomed. Anal. 47, 677–682.

Madden, J.C., Cronin, M.T., 2006. Structure-based methods for the prediction of drug metabolism. Expert Opin. Drug Metab. Toxicol. 2, 545–557.

Muskal, S.M., Jha, S.K., Kishore, M.P., Tyagi, P., 2003. A simple and readily integratable approach to toxicity prediction. J. Chem. Inf. Comput. Sci. 43, 1673–1678.

Oberg, T., 2004. A QSAR for baseline toxicity: validation, domain of application, and prediction. Chem. Res. Toxicol. 17, 1630–1637.

Paixao, P., Gouveia, L.F., Morais, J.A., 2009. Prediction of drug distribution within blood. Eur. J. Pharm. Sci. 36, 544–554.

Pintore, M., van de Waterbeemd, H., Piclin, N., Chretien, J.R., 2003. Prediction of oral bioavailability by adaptive fuzzy partitioning. Eur. J. Med. Chem. 38, 427–431.

Randic, M., 2001. The connectivity index 25 years after. J. Mol. Graph. Model. 20, 19–35.

Roy, D.R., Parthasarathi, R., Maiti, B., Subramanian, V., Chattaraj, P.K., 2005. Electrophilicity as a possible descriptor for toxicity prediction. Bioorg. Med. Chem. 13, 3405–3412.

Saiakhov, R.D., Stefan, L.R., Klopman, G., 2000. Multiple computer-automated structure evaluation model of the plasma protein binding affinity of diverse drugs. Perspect. Drug Discov. Design 19, 133–155.

Sjostrom, M., Kolman, A., Clemedson, C., Clothier, R., 2008. Estimation of human blood LC50 values for use in modeling of in vitro–in vivo data of the ACuteTox project. Toxicol. In Vitro 22, 1405–1411.

Stanton, D.T., Mattioni, B.E., Knittel, J.J., Jurs, P.C., 2004. Development and use of hydrophobic surface area (HSA) descriptors for computer-assisted quantitative structure–activity and structure–property relationship studies. J. Chem. Inf. Comput. Sci. 44, 1010–1023.

Stenberg, P., Norinder, U., Luthman, K., Artursson, P., 2001. Experimental and computational screening models for the prediction of intestinal drug absorption. J. Med. Chem. 44, 1927–1937.

Swamidass, S.J., Chen, J., Bruand, J., Phung, P., Ralaivola, L., Baldi, P., 2005. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics 21 (Suppl 1), i359–368.

Turner, J.V., Maddalena, D.J., Agatonovic-Kustrin, S., 2004. Bioavailability prediction based on molecular structure for a diverse series of drugs. Pharm. Res. 21, 68–82.

Turner, J.V., Maddalena, D.J., Cutler, D.J., Agatonovic-Kustrin, S., 2003. Multiple pharmacokinetic parameter prediction for a series of cephalosporins. J. Pharm. Sci. 92, 552–559.

Valerio Jr., L.G., 2009. In silico toxicology for the pharmaceutical sciences. Toxicol. Appl. Pharmacol. 241, 356–370.

van de Waterbeemd, H., Gifford, E., 2003. ADMET in silico modelling: towards prediction paradise? Nat. Rev. Drug Discov. 2, 192–204.

Yap, C.W., Xue, Y., Li, H., Li, Z.R., Ung, C.Y., Han, L.Y., Zheng, C.J., Cao, Z.W., Chen, Y.Z., 2006. Prediction of compounds with specific pharmacodynamic, pharmacokinetic or toxicological property by statistical learning methods. Mini Rev. Med. Chem. 6, 449–459.

Yazdanian, M., Glynn, S.L., Wright, J.L., Hawi, A., 1998. Correlating partitioning and Caco-2 cell permeability of structurally diverse small molecular weight compounds. Pharm. Res. 15, 1490–1494.

Yee, S., 1997. In vitro permeability across Caco-2 cells (colonic) can predict in vivo (small intestinal) absorption in man—fact or myth. Pharm. Res. 14, 763–766.

Yuan, H., Wang, Y., Cheng, Y., 2007a. Local and global quantitative structure–activity relationship modeling and prediction for the baseline toxicity. J. Chem. Inf. Model. 47, 159–169.

Yuan, H., Wang, Y.Y., Cheng, Y.Y., 2007b. Mode of action-based local QSAR modeling for the prediction of acute toxicity in the fathead minnow. J. Mol. Graph. Model. 26, 327–335.

Zhao, H., Zhang, Q., Chen, J., Xue, X., Liang, X., 2005. Prediction of octanol–air partition coefficients of semivolatile organic compounds based on molecular connectivity index. Chemosphere 59, 1421–1426.