Navigating in vitro bioactivity data: investigating ...Navigating in vitro bioactivity data:...

Navigatinginvitrobioactivitydata:investigatingavailableresourcesusingmodelcompounds

StenIlmjärv1,2,*,FionaAugsburger1,JervenTjallingBolleman3,RobinLiechti2,AlanJamesBridge3,

JennySandström4,VincentJaquet1,IoannisXenarios2,3,Karl-HeinzKrause1,*

Affiliation

1DepartmentofPathologyandImmunology,CentreMedicalUniversitaire,UniversityofGeneva,Geneva,Switzerland;2Vital-ITGroup,SIBSwissInstituteofBioinformatics,Lausanne,Switzerland;3Swiss-Protgroup,SIBSwissInstituteofBioinformatics,CentreMedicalUniversitaire,GenevaSwitzerland;4SCAHTSwissCentreforAppliedHumanToxicology,Basel,Switzerland

Keywords

chemicals,drugs,compounds,chemicaldatabases,invitro,dataintegration,ontology,semantic

web,InChI,InChIKey,SMILES,high-throughput,omics

*Addressforcorrespondence

[email protected]&[email protected]

ABSTRACTModernmedicineandanincreasinglycomplexenvironmentcontributetoexposureofhumansto

a large number of chemical compounds, that can potentially be toxic. Although widely used,

compound testing in animals has important limitations. In vitro testing provides a promising

alternative.However,becauseoftherelativeinaccessibilityandfragmentationofavailabledata,

theinvitroapproachlargelyunderperformsitspotential.Theaimofthisstudyistoinvestigate

howavailablepubliconlineresources(toolsanddatabases)supportaccessinganddistributionof

in vitro compound data.We examined 19 public online resources,mapped their features, and

evaluatedtheirusabilitywithasetoffourmodelcompounds(aspirin,rosiglitazone,valproicacid,

and tamoxifen). By investigating compound names and identifiers, we observed extensive

variationand inconsistencies in available resources: the synonymsweredifferent, compounds’

structuralidentifiers(InChI,InChIKey,SMILESandIUPACsystematicname)underperformedin

omics databases, identification of compound relatedmetadata (e.g. concentrations used in the

experiments) fromomicsexperimentswascomplexandnoneoftheavailableresourcesclearly

distinguishedbetweeninvivoandinvitrodata.Inaddition,weestimatedaccessibilityofselected

public resources using computational queries. Only a few public resources provided access to

compound-related data using semantic web technology. The general quality of experiment

annotations created further difficulties in identifying data of interest. Therefore,we identified

severalstandardizedontologieswithpotentialtoprovideanincreasedaccuracyforextensivedata

.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint

https://doi.org/10.1101/248898

http://creativecommons.org/licenses/by-nc-nd/4.0/

2

retrievalofinvitrocompounddata.Furthermore,usingtheexamplesofourmodelcompounds,we

provide recommendations on the use of ontologies by suggesting specific ontology terms to

annotateinvitroexperimentaldatawhenbeingpublished.

IntroductionThe number of chemical compounds in public databases is growing and experimental data

concerning these compounds are accumulating. As of today, both the CAS RegistrySM and

PubChem1containmorethan90millioncompoundswithmanynewcompoundsbeingaddedeach

day.Amongthesecompoundsaredrugs,chemicals,environmentalcontaminantsandtoxins,allof

which potentially could elicit effects that could have implications on human health and/or

environment.Mostofthesecompoundshavenotbeencharacterizedfromatoxicologicalpointof

view.Indeed,theUSbasedToxNetdatabaseHazardousSubstancesDataBank(HSDB)2wherean

expertpanelreviewsindividualcompounds,containslessthan6000records(asof12.12.2017).

Alternatively, ChEMBL3, which predominantly consists of literature extracted bioactivity data,

containsaround1.73milliondistinctcompounds(asof12.12.2017).Thedifferenceinthenumber

ofcompoundsregisteredandthenumberofcompoundswhosetoxiceffectiswellcharacterized,

demonstratesthelimitedcapacityofcurrentmethodstoassessacompound’sbioactivityinaliving

system.Invitro testsystemswithhigh-throughputperformanceandpotentialscalabilityaimto

bridge that gap. It will further increase the amount of data and provide new knowledge of

compound’sbiologicalproperties.

Animaltestingofcompoundtoxicityandefficacyisrequestedfordrugdevelopment,butitisslow,

expensive and raise concerns about animal welfare4. Alternatives to animal experiments is

strongly encouraged according to the 3R (replacement, reduction and refinement) principles.

Therefore,large-scaleinitiativeshavebeenlaunchedtoinvestigatecompound’stoxicityinrapid,

quantitative and scalable in vitro systems (e.g. Tox215 and EU-ToxRisk6). These bioassays can

measureendocrinedisruption7, thegenerationof reactiveoxygenspecies (ROS)8orchanges in

gene expression of biomarkers in developing and mature neurons9. These assays represent

differentaspectsofthehumanbiologywhereanaberrationofhomeostasiscansuggestadverse

consequencestothehumanhealth.Comparedtoanimalmodels,invitrobioassaysallowstudying

concentration-response relationships over a large concentration range, including those

representative of human exposure4. By combining compound’s structural information, known

molecular properties and data from bioassays and omics experiments, researchers can better

describecompoundeffectpathwaysforasystematicunderstanding.

Presentlythereisnosimplewaytoaccessinvitrocompounddatainaquickandsynopticmanner.

Instead,dataisfragmentedacrossmanydifferentresources10andinterestedpartiesneedtoinvest

invaluabletimeandefforttodevelopanexpertiseinordertonavigatethesesystemsefficiently11.

Thediversityofcompoundsynonymsanditsidentifiers,lackofprecisemetadataandannotations,

canleadtofalseconclusionsandtodifficultiesidentifyingthecompoundcorrectlyafterithasbeen

published12.Toimprovethereproducibilityofexperimentalresultsandtotestnewhypotheses


https://doi.org/10.1101/248898


3

(e.g.developmentofpredictivecomputationalmodels),availabilityandaccessibilityofrawdatais

crucial.

Inthisstudy,weanalyzed19publicresourcesthatpotentiallyallowaresearchertoinvestigate

compound-specific in vitro data. We demonstrate that modest adoption of semantic web

technologies and poor annotations of experimental metadata, in particular when describing

experimental conditions, represent a major obstacle for high quality data integration and its

reusability.Lastly,weprovideinsighthowstandardizedontologiescouldimprovetheannotation

ofcompoundrelatedexperimentaldata.

MaterialsandmethodsChoiceofmodelcompoundsforresourceanalysis

Inordertosearchtheresourcesandtoidentifytheirpotentialsandpitfalls,wechosefourmodel

compounds that were used to evaluate public resources throughout this study (Table 1):

rosiglitazone,valproicacid,aspirin(acetylsalicylicacid),andtamoxifen.Belowthemotivestowhy

thesewerechosenasmodelcompoundsarelisted:

Tamoxifenisacommonlyuseddrugtotreathormone-dependentbreastcancers.Tamoxifenwas

included in the study because of the several known side-effects13 and its relatively complex

structure.

Valproicacid isanantiepilepticdrugwhichhasbeeninclinicaluseforabouthalfacentury. Its

mechanismsofactionaremultipleandincludeinhibitionofhistonedeacetylases14.Itstherapeutic

use is well accepted but it has been found to cause severe developmental toxicity if given to

pregnantwomen15.Thehistoryofthedrugmadeitinterestingforustostudyiflaterdiscoveries

of toxic effects were translated to better data representation and to additional experiments

performedwiththecompound.

Aspirin(acetylsalicylicacid)isoneoftheoldestsyntheticdrugs.Itsmechanismofaction,inhibition

of prostaglandin synthesis through acetylation of platelet cyclooxygenase (COX)16,makes it an

anti-inflammatory,analgesicandantipyreticdrug.Aspirinwasinterestingforourstudysincefor

alongtimeithasbeenwidelyusedasareadilyavailablepainkiller,thuspotentiallyhavingmany

historicalsynonymsandexperimentaldata.Ofnote,thenameaspirinitselfderivesfromatrade

name.

Rosiglitazone is an antidiabetic drug of Peroxisome Proliferator-Activated Receptors (PPAR)

agonists family, introduced into the market in 1999. It was included in the analysis because

comparedtoothermodelcompoundsitisrelativelynew.Also,similartotamoxifen,thecomplex

chemicalstructureofrosiglitazoneenabledustostudythestructuralidentifiersofmoreintricate

molecule.

Overviewofdatabasescontainingcompoundinformation

Togetanoverviewofpublicresourcescontainingcompoundinformation,bothcompoundspecific

resources (e.g. databases containing physicochemical properties such as structure) and omics

databaseswereinvestigatedusingourmodelcompounds.Primarily,wefocusedonresourcesthat

couldpotentiallyaccommodateorlinktoinvitrocompounddata.Downstreaminformationsuch


https://doi.org/10.1101/248898


4

astargetinteractionsandpharmacologicalinferencewerealsoconsideredastheycanbederived

frominvitroexperimentsandhencerawdatacouldpotentiallybebacktracked.Resourcefeatures

included in our analysis involved the scope of supported compound identifiers and search

functionalities,accessibilityofrawdata,supportforontologiesandprogrammaticdataaccess.

Table1.Tableofmodelcompoundsusedinthestudyandtheiridentifiers.

Aspirin

ChEBIID CHEBI:15365

PubChemCID CID2244

InChIKey BSYNRYMUTXBXSQ-UHFFFAOYSA-N

InChI InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)

SMILES CC(=O)Oc1ccccc1C(O)=O

Rosiglitazone

ChEBIID CHEBI:50122

PubChemCID CID77999

InChIKey YASAKCUCGLMORW-UHFFFAOYSA-N

InChIInChI=1S/C18H19N3O3S/c1-21(16-4-2-3-9-19-16)10-11-24-14-7-5-13(6-8-14)12-15-17(22)20-18(23)25-15/h2-9,15H,10-12H2,1H3,(H,20,22,23)

SMILES CN(CCOc1ccc(CC2SC(=O)NC2=O)cc1)c1ccccn1

Valproicacid

ChEBIID CHEBI:39867

PubChemCID CID3121

InChIKey NIJJYAXOARWZEE-UHFFFAOYSA-N

InChI InChI=1S/C8H16O2/c1-3-5-7(6-4-2)8(9)10/h7H,3-6H2,1-2H3,(H,9,10)

SMILES CCCC(CCC)C(O)=O

Tamoxifen ChEBIID CHEBI:41774

PubChemCID CID2733526

InChIKey NKANXQFJJICGDU-QPLCGJKRSA-N

InChIInChI=1S/C26H29NO/c1-4-25(21-11-7-5-8-12-21)26(22-13-9-6-10-14-22)23-15-17-24(18-16-23)28-20-19-27(2)3/h5-18H,4,19-20H2,1-3H3/b26-25-

SMILES CC\C(c1ccccc1)=C(/c1ccccc1)c1ccc(OCCN(C)C)cc1


https://doi.org/10.1101/248898


5

The public resources we used in our study were PubChem1, ChEMBL3, ChEBI17, Chemistry

Dashboard(CompTox)18,ChemSpider19,BioSamples20,ArrayExpress21,ExpressionAtlas22,PRIDE

PRoteomics IDEntifications (PRIDE)23,Humanmetabolomedatabase (HMDB)24, TheToxin and

Toxin Target Database (T3DB)25, Gene Expression Omnibus (GEO)26, UniProt27, BindingDB28,

DrugBank29,ZINC30andthreeof theToxicologyDataNetwork(TOXNET)databases:Hazardous

SubstanceDataBank(HSDB)2,ChemIDPlus31,andComparativeToxicogenomicsDatabase(CTD)32

(Table2).

Table 2. A list of resources used in the study, their categorization and the number of estimatedcompoundsintheseresourcesatthetimeofthestudy.

Database No.ofcompounds Checkedat Datatype

ArrayExpress - - Raw

BindingDB 635301 December-2017 Curated

BioSamples - - Raw

ChEBI 53495 December-2017 Curated

ChEMBL 1735442 December-2017 Curated

ChemIDPlus 418376 December-2017 Curated

ChemSpider ~62million December-2017 Curated

CompTox ~760000 December-2017 Raw/Curated

CTD 15381 December-2017 Curated

DrugBank 10562 December-2017 Curated

ExpressionAtlas - - Raw/Curated

GEO - - Raw

HMDB 114100 December-2017 Curated

HSDB 5929 December-2017 Curated

PRIDE - - Raw

PubChem >94500000 December-2017 Raw/Curated

T3DB 3673 December-2017 Curated

UniProt - - Curated

ZINC15 >100000000 December-2017 Curated


https://doi.org/10.1101/248898


6

Modelcompounds,compoundnamingandidentifiers

To demonstrate the nomenclature and data retrieval complexity of public resources, we used

ChEBIasourreferencedatabasesinceitisextensivelycuratedandChEBIontology17wasusedin

several other resources (eg.PubChemandChEMBL). In addition,ChEBI containedanontology

term for all of our four model compounds (Table 1). For an overview of compound names,

synonymsandidentifiers,weexaminedthepublicresourcesinTable2.Ourscopeofidentifiers

and registry numbers included Simplified Molecular Input Line Entry System (SMILES),

International Chemical Identifier (InChI), International Chemical IdentifierKey (InChIKey) and

International Union of Pure and Applied Chemistry (IUPAC) name. SMiles ARbitrary Target

Specification(SMARTS)werenotincludedinouranalysissincetheyarenotshowninChEBI.Also,

weexcludedChemicalAbstractServiceRegistryNumber(CASRN)becausetheaccuracyofCASRN

in the public domain are not absolute and reliable information can only be accessed by paid

servicesprovidedbyChemicalAbstractService(CAS)10.

Dataaccessandintegrativeinitiatives

Theabilityofaresourcetointegratedataandcompoundinformationwasevaluatedbysearching

cross-referencesbetweenresourcesusingtheselectedmodelcompounds.Toassessdatasharing

and access to public data collections of the selected resources, we studied the use of

representationalstatetransfer(RESTorRESTful)applicationprogramminginterface(API)and

resourcedescriptionframework(RDF)technologies.

Analysisofontologyusage

Tosupporttheuseofdatacross-referencingfromindividualexperimentsweevaluatedifandto

whatextent, thepublicresourcesutilizedontologies in theirdataannotationandpresentation.

With the exception of Cellosaurus33, the rest of the ontologieswe considered important were

included either in the OBO Foundry34, NCBI BioPortal35 or Ontology Lookup Service by EBI36.

Althoughtheremightbeotherontologiesnotincludedinthesecollections,theywereconsidered

lessvisibleandthereforelesslikelytobetakenupbytheresearchcommunity.

ResultsDescriptionandinterconnectivityofpublicresourcescontainingcompounddata

Here,ouraimwastostudywhatarethepublicresourcescontainingcompoundinformationandif

one resource could be used to identify a compound in another resource. Broadly, compound

relatedinformationcanbeseparatedintotwocategories:1)knowledgethathasbeenmanuallyor

automaticallycuratedfrompublications,whichoftenresultfrompublishedorunpublishedraw

data(e.g.halfmaximaleffectconcentration(IC50)valuesorcompoundinteractionswithtargets),

and 2) the raw data itself, either from omics experiments that interrogate many potential

interactionssimultaneouslyorfromdedicatedbioassaysthatmeasureaspecificendpointovera

rangeofconcentrations.Therefore, theresourcescanbeclassifiedaseithercontainingcurated

information from scientific publications or some other reliable documents (examples include

HazardousSubstanceDataBank(HSDB),ChEMBL),ordatabasesthatcontainrawnumericaldata

dedicatedtocertaindatatypes(examplesincludeArrayExpress,PRIDE).InthecaseofPubChem,


https://doi.org/10.1101/248898


7

it supports both curated information and raw data of bioassays as uploaded by its users. An

overviewofdatabaseswiththetypeofinformationtheycontaincanbefoundinTable2.

Weobserved,thatoftenaresourceisactuallyreusingdatafromotherdatabases,thuscombining

information fromseveral resources,orprovidingcross-references to them.Assuch,DrugBank,

PubChem,andChEBIhadthehighestnumberofincomingconnectionsfromotherresourcesand

PubChem, ChEMBL, ChEBI and ChemSpider had the highest number of outgoing connections,

connectingroughlytohalfofthepublicresourcesinourstudylist(Figure1).Mostoftheresources

had cross-references to another resource, especially when they contained data curated from

scientificdocuments.However,withtheexceptionsofUniProt,ChEBIandChEMBL,nocompound-

specificresourceshadcross-referencestodatabasescontainingdatafromomicsexperiments.At

the time of the study we were able to identify references in ChEBI to in vitro raw data in

Figure1.Representationofcross-references(representingeitherdirect linkstootherresourceorsampleddata fromtheotherresource)betweenresources inachorddiagram.Greyresourcesaredepicting compound-specific resources whereas, the red ones depict omics resources. UniProt isneithercompound-specificoromicsresourceandisthereforecoloredblue.Outgoingconnectionsareclosertothecircumferencewhereasincomingconnectionsarefurtherfromthecircumference.Thenumbersonthescalerepresentthecombinednumberofbothincomingandoutgoingconnectionsforeveryresource.

GEO

HMDB

HSDB

CompTox

T3DB

PRIDEPubChem

UniProt

ZINC

ArrayExpress

BindingDB

ChEBI

BioSamples

ChEMBL

ChemIDPlus ChemSpider

ExpressionAtlas

CTD

DrugBank

00

6

120

0

6

0

6

00

61218

0

6

12

0

6

0

60

6

120

6

12

18

0

0

6

12

180

6 0 6

12

0

60

6

0

6

1218


https://doi.org/10.1101/248898


8

ArrayExpressforthreeoutoffourmodelcompounds(rosiglitazone:E-GEOD-5509,E-GEOD-5679,

E-GEOD-30147; valproic acid: E-GEOD-1615, E-GEOD-14973, E-GEOD-23909, E-TABM-903, E-

TABM-1205,tamoxifen:E-GEOD-2225,E-GEOD-12665,E-MTAB-5319,E-TABM-562).Foraspirin,

we were able to find 73 datasets in ArrayExpress, representing both in vivo and in vitro

experimentsthathaveuntilnownotbeencross-referencedinChEBI.

Identificationofdataincompound-specificpublicresources

Here, ouraimwas to studydifferentwaysa chemical compound is identifiedbetweenvarious

publicresources.Indeed,achemicalcompoundcanbeassociatedwithmanyidentifiers,suchasa

tradename,agenericname,aregistrynumber,auniquedatabaseidentifier(e.g.PubChemCIDor

ChEMBL ID) and its structure-derived representations, referred here as structural identifiers

(SMILES, InChI, InChIKey – different types of compound identifiers are shown in Table 3).

Potentially any of these can be used to search a compound from an online public resource.

However,weobservedthatnotallcompoundsynonymsandidentifiersareidenticalindifferent

resources.Asanexample,thecompoundrosiglitazonecontains151depositor-suppliedsynonyms

in the PubChem37, whereas in ChEBI only two synonyms were provided38. Consequently, the

PubChem depositor-supplied synonym for rosiglitazone termed Gaudil failed to recognize the

compoundinChEBI.

Table3.Atableofcompoundidentifiers,theirfeaturesandexamples.AbetteroverviewofcompoundidentifiersisprovidedbyWarr,201139.Ofnote,thereareseveralstructure-basedidentifiersandnotallofthemallowforsubstructuresearches.Inaddition,somestructure-basedidentifiers(e.g.InChIKeys)arebettersuitedforcross-referencingpurposes.

Identifier

Standardized

Uniquerepresentation

ofacom

pound

Machine-readable

structurenotation

Oftenusedin

annotatingdata

Permitssubstructure

search

Canbeusedforcross-

referencing

Examples

Compoundname ✓ Rosiglitazone

Tradename ✓ Avandia

Compounddatabaseidentifier

✓ CHEMBL121

Registrynumber ✓ ✓ 122320-73-4(CASnumber)

Chemicalstructuredrawing

✓ ✓ ✓

Structure-basedidentifier ✓ ✓ ✓ ✓ ✓

SMILES,InChI,InChIKey,IUPACname


https://doi.org/10.1101/248898


9

To further illustrate the complications in searching information for a compound from several

distinct databases, we first used compound names of four model compounds to investigate

compoundspecificresources(BindingDB,ChEBI,ChEMBL,ChemIDPlus,ChemSpider,CompTox,

CTD,DrugBank,HMDB,HSDB,PubChem,T3DBandZINC15)usingfree-textbasedsearch.Wewere

abletofindrecordsforallfourmodelcompoundsinalloftheresources(withoneexceptionof

rosiglitazone that was not found in T3DB) andwe recorded their structural identifierswhere

available(SupplementaryTable1-4).Interestingly,weobservedthatnotallstructuralidentifiers

wereidenticalbetweentheresources.ThatwasespeciallytrueforSMILESwherefromtheeleven

resourcesthatreportedSMILES,wefound8differentSMILESforrosiglitazoneandtamoxifen,5for

aspirinand3forvalproicacid(SupplementaryTable1-4).TheresultswerebetterforInChIKeys

(establishedbyInternationalUnionofPureandAppliedChemistry(IUPAC)),whereweobserved

asingleidentifierforbothaspirin,valproicacidandtamoxifenbutthreedifferentInChIKeysfor

rosiglitazone. Things were more complicated with IUPAC systematic names since in several

resourcesitwasunclearifthereportednamewasactuallyaIUPACsystematicnameandtherefore

theseresourceswerediscardedfromtheanalysis.Consequently,IUPACnamewasrecordedonly

fromChEBI, ChemSpider, CompTox, DrugBank, HMDB, PubChem and T3DB. Even though few

resources reported IUPAC systematic names,we observed large variability among them.More

specifically,wefound3differentnamesforaspirin,4forrosiglitazone,1forvalproicacidand5for

tamoxifen(SupplementaryTable1-4).

These resultshighlight that InChIKeysamongvariousdatabasesaremoreuniquecompared to

SMILESorIUPACsystematicnames.Tothatend,aglobalresourceUniChemhasprovidedacross-

referencing service that connects 38 individual database identifiers of various resources using

InChIKeys40. However, this service is only useful when one already knows the compound’s

databaseidentifierortheInChIKey.Currently,itcannotbeusedwithotherstructuralidentifiers

orwithcompoundnames.

Identificationofcompounddatafromomicsdatabases

Theidentityofthechemicalcompoundscanbeambiguous,sincecompoundsareoftenmentioned

bynamewithouttheaccompanyingstructurerepresentations12.Thisisespeciallytrueforomics

resourceswheredatauploadedbytheresearcherscanbeannotatedwithdifferentsynonymsof

thesamecompound.Toinvestigatethisissue,weutilizedaweb-basedfree-textsearchtoresearch

omics data from ArrayExpress, ExpressionAtlas, BioSamples, GEO and PRIDE resources using

compounds’structuralidentifiers(SMILES,InChIandInChIKey)asreportedinChEBI.Fromallthe

resources,wewereabletoretrievedataatleastfortwomodelcompoundsusingcompoundnames

(SupplementaryTable5). Inaddition, the IUPACsystematicnamesofaspirin,rosiglitazone and

valproic acid retrieved datasets from ArrayExpress, BioSamples and GEO. Interestingly, in

BioSamples,wewereabletoretrievedatasetsforvalproicacidalsowithSMILES.However,these

datasets actually corresponded to the sodium salt ofvalproic acid, that has a slightly different

SMILES representation inChEBI compared tovalproic acid.Moreover, these sampleswerenot

retrievedwhenthecompoundnamewasusedinstead.


https://doi.org/10.1101/248898


10

Thelatterhighlightsthatthebestwaytoidentifycompoundrelateddatafromomicsresourcesis

touse compoundnames,whichwould require the researcher to consider all of the compound

synonyms. Therefore, to estimate the variability of compound annotation in sample labels,we

retrievedthename,synonymsandstructural identifiers foreachof the fourmodelcompounds

fromtheChEMBLpublicSPARQLendpoint(thecodeisavailableathttps://github.com/whetlake/

ivcdp).ThesewerethenusedtoidentifysamplesandsamplelabelsintheBioSamplesdatabase

usingBioSamplespublicSPARQLendpoint.Forrosiglitazoneandtamoxifenonlythesampleswith

therespectivenamewasfoundinanyofthesamplelabels.Foraspirin,sampleswerefoundusing

bothaspirin,asparin,asprin,leviusandmeasurin(SupplementaryTable6-9).Thetermmeasurin

can also be derived from sample labels that contain the word measuring, making it highly

susceptibletoretrieving falsesamples,notassociatedwithaspirin.Surprisingly, thecompound

nameacetylsalicylicacidhighlightedinChEBIwasnotfoundinanyofthesamplelabels.Valproic

acidretrievedresultsalsoforvalproate,depakoteand44089.Thelatterisasynonymofvalproic

acid in ChEMBL but none of the associated sampleswere actually associated to valproic acid.

Similartoweb-basedfree-textsearch,inouranalysis,nosamplescontainedstructuralidentifiers

ofthemodelcompoundswithinthesamplelabels.Also,allthesamplesretrievedwereuniquei.e.

alternativecompoundlabelswerenotusedtoannotatethesamesample.Inaddition,caremustbe

taken when using free-text based approach to programmatically search for compound data.

Indeed, we were able to retrieve samples where sample labels indicated a negation of the

compound,suchas“Norosiglitazonepresent”(SAMEA86299841).

Identificationofinvitrocompounddatafrompublicresources

With the exception of ZINC15, none of the public resources clearly distinguished compounds

whoseeffecthasbeenmeasuredininvitroexperiments.Incontrast,ZINC15reportsasubsetsof

compoundsthathavebeenreportedorinferredactiveininvitrodirectbindingassays42.Another

approach to identify in vitro data in public resources is to browse the study description for

referencesof invitroexperimentrelatedkeywords like“invitro”, “cell-line”orspecificcell-line

names(e.g.“HeLa”).Forexample,PubChem,ChEMBL,ChEBI,ArrayExpress,ExpressionAtlas,GEO,

BioSamplesandPRIDEcanbeusedtomapthesekeywordswithsampledescriptionsusingfree-

textsearch.Assuch,wewereabletoretrieveabioassayrecordfromPubChemwiththetitle“In

VitroCytotoxicityAgainstOvarianCarcinomaCellLineCH1”43whichclearlyrepresentsaninvitro

experiment.ChEMBLprovidesamoreprecisesearch,whichallowstoretrievedataoncompounds

associatedwithspecificcell-linesorinvitroassays.

Inaddition,wecanidentifyinvitrodatabyusingconcentrationunitsassociatedwithmolarityor

molality. For example, we used previously identified microarray samples from ArrayExpress

experimentE-GEOD-5679wheredendriticcellsweretreatedwithrosiglitazone44.Fromthefull

experimentdescription,weidentifiedthatrosiglitazonewasusedat2.5µMconcentrationlevel,

indicating that this was indeed an in vitro experiment. However, samples themselves had no

attributionofconcentrationunitinneitherArrayExpress44orBioSamples(SAMEG1922645).Since

the actual concentration could only be retrieved from the experiment description or from the


https://doi.org/10.1101/248898


11

originalpublication,thisapproachiscurrentlynotscalabletolarge-scaleautomaticidentification

ofinvitrodata,thuslimitingitsuse.

Insummary,thisanalysisdemonstratesthatpresentlynoconcretewaytospecificallyidentifyin

vitrodatausingfree-textbasedsearchesexistsinanyofthestudiedresources.AlthoughChEMBL

providesidentificationofcompoundinformationrelatedtocell-linesandinvitroassays,onewould

havetocoverallavailablecell-linesandassaysinordertogetacompleteoverviewoftheavailable

invitrodata.

Accessinginvitrocompounddatafrompublicresourcesprogrammatically

Toreusealreadypublisheddata,relevantdataneedtobeidentifiedcorrectlyandmadeavailable

to the researcher. Therefore, most public resources provide access to data through a bulk

download(Table4),whichcanbeverylarge(e.g.thePubChemcompounddatabaseincompressed

ASNformatisabout60GB,asofMarch2017).Alternatively,severalpublicresourcesalsoprovide

online programmatic data access. Here, we researched the utilization of RESTful API

(Representational state transfer application programming interface) andResourceDescription

Framework(RDF)technologiesbytheselectedpublicresources.

Table4.Alistofresourcesusedinthestudyandtheirsupportfordifferentdataaccess.

Bulkdownload

RESTfulAPI

RDFsupport

SPARQLendpoint

Substructureand/orsimilarity

search

ArrayExpress ✓ ✓

BindingDB ✓ ✓ ✓

BioSamples ✓ ✓ ✓

ChEBI ✓ ✓ ✓

ChEMBL ✓ ✓ ✓ ✓ ✓

ChemIDPlus ✓ ✓ ✓

ChemSpider ✓ ✓ ✓

CompTox ✓ ✓

CTD ✓

DrugBank ✓ ✓

ExpressionAtlas ✓ ✓ ✓ ✓

GEO ✓

HMDB ✓ ✓

HSDB ✓ ✓

PRIDE ✓ ✓

PubChem ✓ ✓ ✓ ✓

T3DB ✓

UniProt ✓ ✓ ✓ ✓

ZINC15 ✓ ✓


https://doi.org/10.1101/248898


12

In aRESTfulquery, thedata request is constructed into a singleURL.ThismakesRESTfulAPI

simpletouseandplatformindependent.Outofthe19resourcesinourstudy,10providedfree

access to theirRESTfulAPI (Table4),whereasaccess toDrugBankAPIwasprovided fora fee.

ContrarytoRESTfulAPIwherenodataisstoredlocally,thedatainRDFcompatibleformatcanbe

usedforbulkdownloadandsubsequent introductionintoa localdatabaseforqueryexecution.

However, a preferred way of data acquisition is through a public SPARQL endpoint, which

facilitatesqueryingtheserviceproviderdirectly,thusalwaysretrievingthemostup-to-datedata.

HereweobservedthatonlyBioSamples,ChEMBL,ExpressionAtlasandUniProtprovideapublic

SPARQLendpoint(Table4).AcquiringdatausingaSPARQLendpointcanbeslowercomparedto

RESTful data access, since the latter is better optimized for specific, recurrent query requests.

However,SPARQLquerieshave thebenefitofbeinghighlycustomizableby theresearcherand

thereforetheyprovidetheflexibilitythatwillcatertotheresearchers’uniqueneeds.Also,since

RDFisaninherentpartofthe“linkeddata”concept,itcanbeusedtofindrelationshipsbetween

datasets indifferent resourcesand is thereforeextremelyuseful fordata integrationpurposes,

suchasconnectingcompoundseffectinoneresourcetoitsphysicochemicalpropertiesinanother.

Identificationofrelevantontologiestoannotateinvitrocompounddata

Ontology isacollectionofspecificallydefinedandcontrolledvocabularyconsistingofontology

terms. Ontologies allow to formally describe concepts and to define the relationship between

them46. Individual ontologies can consist from few tomany thousands of ontology terms. For

example, ChEBI ontology is made up over 100 000 ontology terms. By using ontologies, the

researchercancreatebettertargetedandmoreprecisequerieswithinthepublicresources.This

necessitatesthattheresearcherisabletoidentifyandapplyrelevantontologytermsfortheirdata.

For this purpose, we identified three resources that allow to browse and search for specific

ontologies and ontology terms: i) OntoBee47 is the default service containing a collection of

ontologiesasprovidedbytheOpenBiologicalOntology(OBO)foundry34,aswellasii)BioPortal35

and iii)OntologyLookupService (OLS)36whichareontologybrowsersdevelopedbyNCBI and

EMBL-EBI, respectively. These services can be used to browse through a number of different

ontologies, toviewrelationsbetweentermsandretrievedefinitionsforbiomedicalvocabulary.

Currently (as of 20.12.2017), OLS contains 204, OntoBee 193 and BioPortal 678 ontologies.

Important to our study, we identified 10 ontologies that are useful in relation to chemical

compounds and can be used for annotating as well as programmatically accessing in vitro

compound data (Table 5). As such, CHEMical INFormation ontology (CHEMINF)48 provides

ontologytermsforresourceandstructureidentifiers.Forexample,InChIKeycanbeidentifiedwith

a specific Universal Resource Identifier (URI) (http://semanticscience.org/resource/CHEMINF

_000059)inCHEMINFontology.


https://doi.org/10.1101/248898


13

Table5.Overviewofontologiesthatcanbeusedtoannotatedoridentifyinvitrodata.

ChEBI ChemicalEntitiesofBiologcalInterest17

Highlycuratedcontrolledvocabularythathasbecomethe“goldstandard”forannotationofthesmallmolecularentities.

CHEMINF CHEMicalINFormationontology48

Canbeusedtoannotatechemicalinformation,includingchemicalstructureandchemicalproperties.Importantly,itincludesmanyofthedatabaseandstructureidentifierssuchChEMBLIDorInChIKey,respectively.

OBI • OntologyforBiomedicalInvestigations49

Canbeusedtodescribeabiomedicalinvestigation,suchasprotocols,instrumentsandexperimentaldesign.Importantly,itallowstoannotatedatapointsofspecificdosesusedintheexperiment(http://purl.obolibrary.org/obo/OBI_0000984)thatexaminedose-response(http://purl.obolibrary.org/obo/OBI_0001418)relationships.Furthermore,itcanbeusedexplicitlytoannotateaninvitro(http://purl.obolibrary.org/obo/OBI_0001285)experiment.

CLO Cell-lineOntology50

CLOisusefulfordescribingacell-line.Itaimstobringtogetheravailablecell-linedatafrommultiplesourcesintoacommonformat50.Importantly,itcanbeusedtospecifyacell-linethatwasmaintainedinaninvitroenvironment(http://purl.obolibrary.org/obo/CL_0001034),thusallowingidentificationofinvitroinvestigation.

Cellosaurus TheCellosaurus:acelllineknowledgeresource33

Cellosaurusisusedforreferencingcell-lines.Asofversion24(November2017)itcontained99021cell-lineswithreferencestooriginalpublicationsandcross-referencestomanycell-linecatalogues.Theyalsoprovidereferencestoresources,wheresamples(BioSamples,ChEMBL,etc.)couldbeannotatedwiththecell-linescontainedinCellosaurus.Similarly,itallowssearchofinvitrostudiesthroughtheidentificationofspecificcell-lines.

BAO BioAssayOntology51

BAOwasdevelopedtodescribeandmodeldiverseandcomplexassays51.Importantly,BAOprovidesdefinitionsforabsoluteorrelativeIC50(http://www.bioassayontology.org/bao#BAO_0000197,http://www.bioassayontology.org/bao#BAO_0000198,respectively)valuesandforseveralotherendpointsbasedonbioassaymeasurements.

UO UnitsofMeasurement52

UOprovidesunitsofmeasurement.Importantlyonecanuseunitsofmolarity(http://purl.obolibrary.org/obo/UO_0000061)suchasmicromolar(http://purl.obolibrary.org/obo/UO_0000064)todefineconcentrationsininvitroexperiments,thusallowingidentificationofinvitrodata.

UBERON Uberon,anintegrativemulti-speciesanatomyontology53

UBERONisanintegratedcross-speciesontologythatfacilitatesreasoningfromhigher-orderanatomicallevelstomorespecificsublevels.Forexample,CLOusesUBERONtoaddorgan-specificannotationstocell-lines50.Therefore,UBERONcanbeusedtotranslateknowledgefromcell-linestoorgansorviceversa.


https://doi.org/10.1101/248898


14

Usingontologiestoretrieveinvitrocompounddata

Since the information stored in an RDF triplet is formulated as subject-predicate-object, each

element can be a term defined in an existing ontology. Therefore, we investigated if using

ontologies enabled us to retrieve samples directly related to in vitro ontology terms.We used

BioSamplespublicSPARQLendpointasourtargetdatabasesinceitcombinessampleannotations

from ArrayExpress, PRIDE Archive and European Nucleotide Archive (ENA) and as shown

previously,wewereabletoretrievedatafromBioSamplesforallourmodelcompoundsusingboth

compoundandIUPACsystematicnames.Tooursurprise,noneofthesamplesinBioSampleswere

annotated with the in vitro ontology term (http://purl.obolibrary.org/obo/OBI

_0001285) found in Ontology for Biomedical Investigation (OBI) ontology or with structural

identifier ontology terms (InChI, InChIKey, SMILES) from either ChEBI (Chemical Entities of

BiologicalInterest),CHEMINForEDAM(Bioinformaticsoperations,datatypes,formats,identifiers

andtopics)ontologies.However,wedidfindsamplesforallourmodelcompoundsusingChEBI

ontology terms (rosiglitazone: http://purl.obolibrary.org/obo/CHEBI_50122, aspirin:

http://purl.obolibrary.org/obo/CHEBI_15365, valproic acid: http://purl.obolibrary.org/obo/

CHEBI_39867,tamoxifen:http://purl.obolibrary.org/obo/CHEBI_41774).

Next,weinvestigatedifsamplesofourmodelcompoundsretrievedusingChEBIontologyterms

wereassociatedwithamolarityunit(http://purl.obolibrary.org/obo/UO_0000061)usingtheUO

(Unitsofmeasurement)ontology.Indeed,suchsamplesexistedforrosiglitazoneandtamoxifenbut

notforaspirinandvalproicacid.Alternatively,wewereinterestedifanyofthesamplesassociated

with the model compounds were also associated with a cell-line ontology term

(http://www.ebi.ac.uk/efo/EFO_0000322)fromExperimentalFactorOntology(EFO).Nosamples

werefoundforaspirinbutwewereabletofindseveralsamplesforrosiglitazoneandtamoxifen

andasinglesampleforvalproicacid(SupplementaryTables10-12),effectivelyidentifyinginvitro

compounddata.

In summary,despite the fact thatwewereable to retrieve somesamples representing in vitro

experimentsrelatedtoourmodelcompounds,theyareoflimiteduseforthesearchofpresently

EDAM EDAM:anontologyofbioinformaticsoperations,typesofdataandidentifiers,topicsandformats54

EDAMcanbeusedtodescribeaninvitroscientificexperiment,suchastheoutputandcomputationalproceduresrelatedtothedataanddataformats.SimilartoCHEMINFitprovidesvocabularyforcompoundstructureidentifiers/formatsuchasInChIKey(http://edamontology.org/format_1199),howeveritismoreusefulinannotatingthedown-streamdataanalysisandlesspracticalindescribingtheexperimentaldesign.

EFO ExperimentalFactorOntology55

EFOwasdevelopedtoannotatethelargequantityofdomain-independentgeneexpressiondata55storedinArrayExpress.Althoughnotspecificallydevelopedforusewithchemicalcompounds,itcontainsspecificclassesdescribingchemicalentities(e.g.chemicalcompoundsanddrugs).EFOontologycanbeusedtoidentifygeneexpressiondatathatisrelatedtochemicalcompounds.


https://doi.org/10.1101/248898


15

availablecompounddata,asthesedataaremostlydepositedwithoutassociatedontologyterms.

However,theuseofontologiescouldbeapowerfultoolforsearchstrategiesonceannotationsof

experimentalmetadataimproves.

DiscussionInthisstudy,weusedfourmodelcompounds(aspirin,rosiglitazone,valproicacidandtamoxifen)

to investigate 19 public resources potentially suited for identification and access to in vitro

compound data. Fragmentation and problematic accessibility of in vitro data poses a major

obstaclefortheiroptimaluseandhencereducestheirscientificandpracticalimpact.Quickand

preciseidentificationandinformationretrievalisnecessaryforefficientuseofpublisheddata,to

saveresearcherandregulatorsalikevaluabletimeandtoallowquickandeasyidentificationof

relevantdatasets56.

Fragmentationofpublicresourcesandintegrationefforts

Therearemanyresourcesthatprovidecompound-specificdata.Therearealsoseveraldatabases

thatcontainomicsdata.Ageneralconsensusisthatrawdatafromexperimentsshouldbepublicly

accessible,anditisthepolicyofhighqualityscientificjournalsandofseveralgrantingagenciesto

requirepubliclyaccessibledepositionofresearchdata.Intheory,thisisanexcellentopportunity

forbothinvitroandinsilicotoxicologysinceitwouldallowtocomparedataproducedbydifferent

research laboratories and help to estimate the impact of different experimental conditions.

However,atpresent,thistheoreticalopportunitydoesnotliveuptoitsfullpotential.

Effortshavebeenmadetointegratecompoundinformationoromicsdatafromseveralresources.

Indeed, in our analysis we observed that many curated resources dedicated to chemical

compounds contain cross-references to each other. Some of them also include data from one

another. This can lead to databases overlapping in their content57. As an example of cross-

references,ChemSpiderreportstohavelinkstomorethan500resourcesbutnonerepresenting

anomicsdataresource.Ontheotherside,BioSamplesdatabaseintegratesmetadatafromomics

resources such as ArrayExpress, EuropeanNucleotide Archive (ENA) and PRIDE but does not

readily associate this metadata with compound-specific resources or compound identifiers.

Although resources such as UniProt, ChEMBL and ChEBI included links to ArrayExpress,

ExpressionAtlas or PRIDE,we observed that thesewere not exhaustive,with several potential

linkstoomicsdatasetsmissing.

InadditiontoBioSamples,thereareotherpublicandprivateinitiativesaimingatbridgingthegap

between the discrepancies of field-specific databases. OmicsDI (Omics Discovery Index)58 and

Repositive IO (https://repositive.io) aim to integrate metadata from many different domains,

including genomics, transcriptomics andmetabolomics. Although they lack cross-references to

compound-specificresources,theycanstillbeusedtoidentifycompound-related invitroomics

data.Furthermore,Repositive IOallows itscommunitydrivensearchplatformtobepopulated

with annotation tags. This makes data identification a crowd-sourcing task, which may

substantiallyimprovetheannotationofthemetadata.However,currentlytheseannotationssuffer

from the lack of a common vocabulary and may lead to confusion when different terms that


https://doi.org/10.1101/248898


16

correspondtothesamemeaningareused.Instead,asystemusingstandardizedontologiesshould

befavored.Inthatcasecompoundsamplescouldbedirectlyrelatedtothecompound’sInChIKeys

asanannotationtagdescribingthemetadata.Similaradditionscouldbemadewithconcentration

unitsandcell-lines.

Variabilityofcompoundnamesandidentifiers

It iswellknownthat free-textsearchofsystematicchemicalnameshas lowprecisionandhigh

error rate12, especially considering the variability in synonyms between resources as

demonstratedforourmodelcompounds.Therefore,searchingacompoundnameoritssynonym

fromoneresourcemightfailtoretrieveresultsfromanotherresource.Also,coveringthewhole

scopeof compound identifierswill take substantial amountof researcher’s time, sincea single

compoundcan containhundredsof synonymsand several identifiers. Furthermore, compound

namesthemselvesareambiguous,especiallyinomicsresourcesthatarenotspecificallydesigned

tostorecompoundinformationwhereassociatedmetadatafailstoincorporateuniquecompound

identifiers. For example, as identified in our study,measurin,which according to ChEMBL is a

synonymofaspirin (Compound ID:CHEMBL25), canpotentiallybemisledwithsampleswhere

labelsincludethetermmeasuring.Indeed,twooutoffourmodelcompoundsinourstudywere

annotatedusingalternativelabelsthatwereassociatedwithdifferentsamples.

Ideally,onlyasingleuniquestructurerepresentationexistsforeachuniquecompound.Wetested

this by using the name of themodel compound as search input and recorded four structural

identifiers:InChI,InChIKey,SMILESandIUPACname.MostvariabilitywasobservedforSMILES

andIUPACnameswheremorecomplexmolecule,suchasrosiglitazoneandtamoxifenhadmultiple

structural identifier representationsbetween resources. In the caseof InChIand InChIKey, the

variation,wasconsiderablylower,albeitnotabsent.Thereforecaremustbetakennottoattribute

bioassayresultstoalternativestereochemicalstructures57.

Theobservedvariabilityinstructuralidentifiersisnotsurprising,sinceoftenmolecularformulas

arenotunequivocallyuniquerepresentationsofachemicalcompound39.Nevertheless,although

severalInChIKeyscanbederivedfromasingleInChI,theyarestillconsideredsufficientlyunique

atprovidinganadequatecollisionresistancei.e. it isunlikelythattwodifferentcompoundsare

associated with one InChIKey59. However, including all omics resources, there are some

compound-specificresourcesthatdon’tallowdataidentificationbasedoncompound’sstructural

identifiers.Thiscancauseproblemsfordata integrationtaskswhichiswhycompoundsshould

alwaysbesupplementedwithuniquestructuralidentifiers,suchasInChIKeys,thatwouldenhance

theirdownstreamidentification.

Usingontologiestointegrateandsearchforinvitrocompounddata

Onewaytoincreaseannotationqualityistouseexistingontologiesthatenhancethequalityand

consequentlytheprecisionofidentifyingdatacorrectly.Inaddition,sinceontologiesareapartof

the “linkeddata”concept, theycanbeused to integrate information fromseveral resourcesby

usingfederatedqueries.Tothisend,ontologybrowsers,suchasOntologyLookupService(OLS)36

are excellent tools for finding terms relevant for your data, thus also promoting the reuse of

existingontologycontent35.


https://doi.org/10.1101/248898


17

In this study, we focused on resources that provided Resource Description Framework (RDF)

compliantdataorapublicSPARQLendpointfordataqueriessincetheyinherentlytakeadvantage

ofontologies.We failed to retrievea single sample inBioSamples resource thatwoulddirectly

relatetoinvitrodesignontologytermfromOntologyforBiomedicalInvestigation(OBI)ontology.

Inaddition,wesearchedBioSamplesfordatasetscharacterizedwithmolarconcentrationunitor

cell-lineontologyterms.Similarly, theresultswere far fromcomprehensive.Wethinkthat this

approach can be further improved once better ontology mappings of the metadata become

available.Prospectively,thereisaclearpotential,providedabetterannotationofdata,forexample

byutilizingtheCellosaurusontology33thatcontainsacontrolledvocabularyforaround100000

celllines.

UsingSPARQLandothersemanticwebtechnologies,weshowedthatassociationsbetweenraw

datasamplesandontologytermsdoexistandhencetheambiguityofidentifyingandretrievingin

vitrodata,canbereduced.Nevertheless,despitethecurrentstateofdataannotation,inorderto

integrate data from different public resources, the compound formatting needs to be further

harmonizedandthedomainvocabularystandardized56.Moreimportantly,thenewandexisting

resourcesneedtoadjustthemselvestotakefulladvantageofexistingstandardizedontologies.So

far,theyhaveonlybeenimplementedbytheselectedfew.

Toacertainextent,thereareinitiativesthatutilizesemanticwebtechnologiestosolveproblems

withcompounddataintegration.Assuch,theOpenPHACTSproject,encouragesdataresourcesto

publishtheirdataasRDF, inorder tobuildan“openpharmaceuticalspace”wheredatacanbe

accessedthroughauser-friendlysoftwareinterface10.Asoftoday,itisnotclearhowthesoftware

canbeusedbyresearcherswhodonothaveexplicitknowledgeinAPIprogramming.Nevertheless,

it can potentially be used to build new, researcher friendly applications on Open PHACTS

interoperabledata.Inparallel,OpenToxaimstoprovideasingleaccesstobothinvivoandinvitro

toxicity data and a framework for storing and executing predictive computational models60.

SimilartoOpenPHACTS,thisinitiativeisnotapplicabletonon-expertresearcher,whoislooking

forcompoundassociatedrawomicsdata,thusagapconnectingcompoundstotheirrawinvitro

datastillpersists.

Invitrotoxicology:themagnitudeofthechallengeahead(conclusionandout-look)

Tens of thousands of compounds need to be tested in order to correctly evaluate the

pharmaceuticalchemicalspace61.Realizingthistaskusing invivomodels isexpensiveandtime

consuming.Instead,alternativetestmethodssuchashigh-throughputinvitroassaysprovidethe

capabilitytotestthousandsofcompoundsacrossawiderangeofconcentrations.Togetherwith

predictive in silico applications, such as Toxtree62, which is an open-source application

commissionedbytheEuropeanCommissionJointResearchCenter’sEuropeanChemicalBureau,

theyhavethepotentialtoimproveriskassessmentofchemicalsforthebenefitofhumanhealth

andenvironment.


https://doi.org/10.1101/248898


18

Figure2.Agraphicsillustratingtheproblemsofintegratingknowledgebetweencompoundofinterestanddifferenttypesofdataresources.Theproblemscanbesolvedwithintegratedapproachesusingontologies,semantic-webtechnologiesandbetterannotationofthedata.

There already exists a substantial corpus of resources that contain data on a large number of

chemicalcompounds.Thesedataandtheirsourcesarediverseandtheyneedtobeintegratedin

Compound data is generated by researchers, screening facilities and assay developers

Data for compound of interest is (A) curated from publications into compound-specificdatabases or (B) raw data is stored in high-throughput omics databases.

A SINGLE COMPOUND DATA ACCESS POINT WOULD RESOLVE COMPOUND IDENTIFICATIONISSUES BETWEEN RESOURCES AND ACCELERATE DATA RETRIEVAL AND ANALYSIS

Curation of dataOntologies Semantic-web technologies

+ +AZ

There is little communication between compound-specific (A) and high-throughput omics (B) dataresources.

Identification of compound data in compound-specific resources (A) can be done easily but attentionmust be given to compound naming and identifiers that can differ between resources.Due to compound naming and identifier ambinguity the identification of compound data inhigh-throughput omics resources (B) is complicated and time consuming.

compound of inte

rest

=

Semantic web technologies can be used to connect knowledge in different resources based on shared ontology terms.

A unifying resources that takes advantage of ontologies, semantic web technologies and clean data annotation will provide an invaluable service to researchers globally, improve metadata quality, researcher’s efficiency and save considerable amount of time and money.

Curation of existing and new data with ontology terms will improve data annotation. This will reduce ambiquity in accessing relevant compound information.

Standardized ontologies can be used to identify relevant data associated with specific keywords (e.g. in vitro) that fit experimental description.


https://doi.org/10.1101/248898


19

order to answer the systemic effects of compounds (Figure 2). Accessing published datawith

correctcompoundinformationisessential.Theproblemsencounteredinaccessingdataonour

modelcompounds,demonstratethatusingtheresultsfrompublicationsstoredinpublicresources

andcross-referencingthemwithomcisdatastillrequiressubstantial investigativecapacity.We

recognize,thattherearetwosolutionsfortheproblemsofaccessinginvitroactivitydata:curation

ofalreadyexistingdataandimprovedannotationsbytheexperimenters.Curationcouldtransform

poorlyandhardtoreachcompoundinformationintoarichandrelevantresource.Withoutnew

experimental efforts, it could right away enhance our knowledge of toxicology and decrease

unnecessaryadditionalexperimentation,includinganimaltesting.Theproblemofthecurationis

unfortunatelyinlargepartsinsurmountable:themanpowerandexpertiserequiredtotransform

thepresent“datagraveyard”intoalivingresourcewouldnecessitatefinancialmeanswhichsimply

donotexist.Thus,wefullysupporttheconceptofacurationeffortbutwearealsoperfectlyaware

of its limitations. However, efforts similar to SourceData63, that allows to annotated already

publishedfiguresinpublication,couldprovideapotentialcure.Alternativetocurationisimproved

annotationby theexperimenter themselves.Obviously, this isaprospectiveeffort thatwillnot

applytoexistingdata. It isalso limitedbythewillingnessandexpertiseof theexperimenterto

standardize data presentation. For example, the use of proper compound identifiers (e.g.

InChIKeys)andontologytermsisnotpartof thepresentscientificculture.However,theuseof

ontologybrowserscouldalleviatethistransition.Nevertheless,currentlyresearchersreceivelittle

to no reward for the broad accessibility and perennity of their data. Thus, while improved

annotationbytheexperimenterisanoblegoalforthefuture,thiswillnothappenautomatically.

Novelapproaches forscientificpublishingsuchas theuseofproperannotationswithontology

termsandattributionofcredittoscientistsfortheirdatawillbeneededtoreallymakeanimpact.

Wouldtheeffortsnecessaryforgeneralaccessiontoinvitrocompounddatabeworththemoney

and time? Considering the success of UniProt which incorporates extensively curated and

trustworthyproteindata,theanswerisyes.IndeedacarefulanalysispublishedintheEMBL-EBI

valuereport64estimated46%increaseinresearchefficiencyforscientistsaccessinginformation

relevanttotheirresearchquestion.Witharound400000uniquevisitorspermonth,thereported

estimationshowsanenormouscost-effectbenefitfortheresearchercommunity.Theinterestin

chemical compounds isevenbigger:PubChemalonereceivesabout1millionuniqueusersper

month65. However, omics resources do not readily link to PubChem (or other compound

databases)andoftenambiguouscompoundnamesareusedtoannotatethedata.Thus,thetime

lostbythecommunitytoassemblecompoundinformationisconsiderable.Thishighlightstheneed

for an improved resource that would enhance the efficiency and speed of accessing raw and

analyzedcompounddatainareliable,simplifiedandintuitivemanner.Itwouldallowresearchers

tofocusondataanalysisanditsinterpretationinsteadofcollectionandcuration.Andfinally,such

aresourcecouldcontributetopublichealthbyallowingabetteridentificationandmanagement

ofpotentiallyhazardouscompounds.


https://doi.org/10.1101/248898


20

References

1 Kim,S.etal.PubChemSubstanceandCompounddatabases.NucleicAcidsRes44,D1202-1213,doi:10.1093/nar/gkv951(2016).

2 Fonger,G.C.,Hakkinen,P.,Jordan,S.&Publicker,S.TheNationalLibraryofMedicine's(NLM)HazardousSubstancesDataBank(HSDB):background,recentenhancementsandfutureplans.Toxicology325,209-216,doi:10.1016/j.tox.2014.09.003(2014).

3 Bento,A.P.etal.TheChEMBLbioactivitydatabase:anupdate.NucleicAcidsRes42,D1083-1090,doi:10.1093/nar/gkt1031(2014).

4 Krewski,D.etal.Toxicitytestinginthe21stcentury:avisionandastrategy.JToxicolEnvironHealthBCritRev13,51-138,doi:10.1080/10937404.2010.483176(2010).

5 Tice,R.R.,Austin,C.P.,Kavlock,R.J.&Bucher,J.R.Improvingthehumanhazardcharacterizationofchemicals:aTox21update.EnvironHealthPerspect121,756-765,doi:10.1289/ehp.1205784(2013).

6 Daneshian,M.,Kamp,H.,Hengstler,J.,Leist,M.&vandeWater,B.Highlightreport:LaunchofalargeintegratedEuropeaninvitrotoxicologyproject:EU-ToxRisk.ArchToxicol90,1021-1024,doi:10.1007/s00204-016-1698-7(2016).

7 Huang,R.etal.ProfilingoftheTox2110Kcompoundlibraryforagonistsandantagonistsoftheestrogenreceptoralphasignalingpathway.SciRep4,5664,doi:10.1038/srep05664(2014).

8 Schmidt,B.Z.etal.Invitroacuteanddevelopmentalneurotoxicityscreening:anoverviewofcellularplatformsandhigh-throughputtechnicalpossibilities.ArchToxicol91,1-33,doi:10.1007/s00204-016-1805-9(2017).

9 Colaianna,M.etal.Fingerprintingofneurotoxiccompoundsusingamouseembryonicstemcelldualluminescencereporterassay.ArchToxicol91,365-391,doi:10.1007/s00204-016-1690-2(2017).

10 Williams,A.J.etal.OpenPHACTS:semanticinteroperabilityfordrugdiscovery.DrugDiscovToday17,1188-1198,doi:10.1016/j.drudis.2012.05.016(2012).

11 Sansone,S.A.etal.Towardinteroperablebiosciencedata.NatGenet44,121-126,doi:10.1038/ng.1054(2012).

12 Murray-Rust,P.,Mitchell,J.B.&Rzepa,H.S.Communicationandre-useofchemicalinformationinbioscience.BMCBioinformatics6,180,doi:10.1186/1471-2105-6-180(2005).

13 Yang,G.,Nowsheen,S.,Aziz,K.&Georgakilas,A.G.ToxicityandadverseeffectsofTamoxifenandotheranti-estrogendrugs.PharmacolTher139,392-404,doi:10.1016/j.pharmthera.2013.05.005(2013).

14 Phiel,C.J.etal.Histonedeacetylaseisadirecttargetofvalproicacid,apotentanticonvulsant,moodstabilizer,andteratogen.JBiolChem276,36734-36741,doi:10.1074/jbc.M101287200(2001).

15 Tomson,T.etal.Withdrawalofvalproicacidtreatmentduringpregnancyandseizureoutcome:ObservationsfromEURAP.Epilepsia57,e173-177,doi:10.1111/epi.13437(2016).

16 Smith,J.B.&Willis,A.L.Aspirinselectivelyinhibitsprostaglandinproductioninhumanplatelets.NatNewBiol231,235-237(1971).

17 Degtyarenko,K.etal.ChEBI:adatabaseandontologyforchemicalentitiesofbiologicalinterest.NucleicAcidsRes36,D344-350,doi:10.1093/nar/gkm791(2008).

18 Williams,A.J.etal.TheCompToxChemistryDashboard:acommunitydataresourceforenvironmentalchemistry.JCheminform9,61,doi:10.1186/s13321-017-0247-6(2017).


https://doi.org/10.1101/248898


21

19 Pence,H.E.&Williams,A.ChemSpider:AnOnlineChemicalInformationResource.JournalofChemicalEducation87,1123-1124,doi:10.1021/ed100697w(2010).

20 Faulconbridge,A.etal.UpdatestoBioSamplesdatabaseatEuropeanBioinformaticsInstitute.NucleicAcidsRes42,D50-52,doi:10.1093/nar/gkt1081(2014).

21 Kolesnikov,N.etal.ArrayExpressupdate--simplifyingdatasubmissions.NucleicAcidsRes43,D1113-1116,doi:10.1093/nar/gku1057(2015).

22 Petryszak,R.etal.ExpressionAtlasupdate--anintegrateddatabaseofgeneandproteinexpressioninhumans,animalsandplants.NucleicAcidsRes44,D746-752,doi:10.1093/nar/gkv1045(2016).

23 Vizcaino,J.A.etal.2016updateofthePRIDEdatabaseanditsrelatedtools.NucleicAcidsRes44,D447-456,doi:10.1093/nar/gkv1145(2016).

24 Wishart,D.S.etal.HMDB3.0--TheHumanMetabolomeDatabasein2013.NucleicAcidsRes41,D801-807,doi:10.1093/nar/gks1065(2013).

25 Wishart,D.etal.T3DB:thetoxicexposomedatabase.NucleicAcidsRes43,D928-934,doi:10.1093/nar/gku1004(2015).

26 Barrett,T.etal.NCBIGEO:archiveforfunctionalgenomicsdatasets--update.NucleicAcidsRes41,D991-995,doi:10.1093/nar/gks1193(2013).

27 TheUniProt,C.UniProt:theuniversalproteinknowledgebase.NucleicAcidsRes45,D158-D169,doi:10.1093/nar/gkw1099(2017).

28 Gilson,M.K.etal.BindingDBin2015:Apublicdatabaseformedicinalchemistry,computationalchemistryandsystemspharmacology.NucleicAcidsRes44,D1045-1053,doi:10.1093/nar/gkv1072(2016).

29 Law,V.etal.DrugBank4.0:sheddingnewlightondrugmetabolism.NucleicAcidsRes42,D1091-1097,doi:10.1093/nar/gkt1068(2014).

30 Sterling,T.&Irwin,J.J.ZINC15--LigandDiscoveryforEveryone.JChemInfModel55,2324-2337,doi:10.1021/acs.jcim.5b00559(2015).

31 Tomasulo,P.ChemIDplus-supersourceforchemicalanddruginformation.MedRefServQ21,53-59,doi:10.1300/J115v21n01_04(2002).

32 Davis,A.P.etal.TheComparativeToxicogenomicsDatabase:update2017.NucleicAcidsRes45,D972-D978,doi:10.1093/nar/gkw838(2017).

33 Bairoch,A.TheCellosaurus:acelllineknowledgeresource.http://web.expasy.org/cellosaurus/.

34 Smith,B.etal.TheOBOFoundry:coordinatedevolutionofontologiestosupportbiomedicaldataintegration.NatBiotechnol25,1251-1255,doi:10.1038/nbt1346(2007).

35 Whetzel,P.L.etal.BioPortal:enhancedfunctionalityvianewWebservicesfromtheNationalCenterforBiomedicalOntologytoaccessanduseontologiesinsoftwareapplications.NucleicAcidsRes39,W541-545,doi:10.1093/nar/gkr469(2011).

36 Jupp,S.,Burdett,T.,Leroy,C.&Parkinson,H.E.inSWAT4LS.118-119.

37 NationalCenterforBiotechnologyInformation.PubChemCompoundDatabase;CID=77999,https://pubchem.ncbi.nlm.nih.gov/compound/77999(accessedJan.3,2018).

38 CHEBI:50122,https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:50122(accessed:Jan.3,2017).

39 Warr,W.A.Representationofchemicalstructures.WileyInterdisciplinaryReviews:ComputationalMolecularScience1,557-579,doi:10.1002/wcms.36(2011).

40 Chambers,J.etal.UniChem:aunifiedchemicalstructurecross-referencingandidentifiertrackingsystem.JCheminform5,3,doi:10.1186/1758-2946-5-3(2013).


https://doi.org/10.1101/248898


22

41 Samplemeta-dataareavailableintheBioSamplesdatabase(http://www.ebi.ac.uk/biosamples)underaccessionnumbersSAMEA862998.

42 ZINC15Subsets,http://zinc15.docking.org/substances/subsets/.

43 NationalCenterforBiotechnologyInformation.PubChemBioAssayDatabase;AID=47506,https://pubchem.ncbi.nlm.nih.gov/bioassay/47506(accessedJan.3,2018).

44 MicroarraydataareavailableintheArrayExpressdatabase(http://www.ebi.ac.uk/arrayexpress)underaccessionnumberE-GEOD-5679.

45 Samplemeta-dataareavailableintheBioSamplesdatabase(http://www.ebi.ac.uk/biosamples)underaccessionnumbersSAMEG19226.

46 Bard,J.B.&Rhee,S.Y.Ontologiesinbiology:design,applicationsandfuturechallenges.NatRevGenet5,213-222,doi:10.1038/nrg1295(2004).

47 Ong,E.etal.Ontobee:Alinkedontologydataservertosupportontologytermdereferencing,linkage,queryandintegration.NucleicAcidsRes45,D347-D352,doi:10.1093/nar/gkw918(2017).

48 Hastings,J.etal.Thechemicalinformationontology:provenanceanddisambiguationforchemicaldataonthebiologicalsemanticweb.PLoSOne6,e25513,doi:10.1371/journal.pone.0025513(2011).

49 Bandrowski,A.etal.TheOntologyforBiomedicalInvestigations.PLoSOne11,e0154556,doi:10.1371/journal.pone.0154556(2016).

50 Sarntivijai,S.etal.CLO:Thecelllineontology.JBiomedSemantics5,37,doi:10.1186/2041-1480-5-37(2014).

51 Abeyruwan,S.etal.EvolvingBioAssayOntology(BAO):modularization,integrationandapplications.JBiomedSemantics5,S5,doi:10.1186/2041-1480-5-S1-S5(2014).

52 Unitsofmeasurement(UO),https://code.google.com/archive/p/unit-ontology/.

53 Mungall,C.J.,Torniai,C.,Gkoutos,G.V.,Lewis,S.E.&Haendel,M.A.Uberon,anintegrativemulti-speciesanatomyontology.GenomeBiol13,R5,doi:10.1186/gb-2012-13-1-r5(2012).

54 Ison,J.etal.EDAM:anontologyofbioinformaticsoperations,typesofdataandidentifiers,topicsandformats.Bioinformatics29,1325-1332,doi:10.1093/bioinformatics/btt113(2013).

55 Malone,J.etal.ModelingsamplevariableswithanExperimentalFactorOntology.Bioinformatics26,1112-1118,doi:10.1093/bioinformatics/btq099(2010).

56 Harland,L.etal.Empoweringindustrialresearchwithsharedbiomedicalvocabularies.DrugDiscovToday16,940-947,doi:10.1016/j.drudis.2011.09.013(2011).

57 Williams,A.J.,Ekins,S.&Tkachenko,V.Towardsagoldstandard:regardingqualityinpublicdomainchemistrydatabasesandapproachestoimprovingthesituation.DrugDiscovToday17,685-701,doi:10.1016/j.drudis.2012.02.013(2012).

58 Perez-Riverol,Y.etal.DiscoveringandlinkingpublicomicsdatasetsusingtheOmicsDiscoveryIndex.NatBiotechnol35,406-409,doi:10.1038/nbt.3790(2017).

59 Pletnev,I.etal.InChIKeycollisionresistance:anexperimentaltesting.JCheminform4,39,doi:10.1186/1758-2946-4-39(2012).

60 Hardy,B.etal.Collaborativedevelopmentofpredictivetoxicologyapplications.JCheminform2,7,doi:10.1186/1758-2946-2-7(2010).

61 Cumming,J.G.,Davis,A.M.,Muresan,S.,Haeberlein,M.&Chen,H.Chemicalpredictivemodellingtoimprovecompoundquality.NatRevDrugDiscov12,948-962,doi:10.1038/nrd4128(2013).


https://doi.org/10.1101/248898


23

62 Patlewicz,G.,Jeliazkova,N.,Safford,R.J.,Worth,A.P.&Aleksiev,B.AnevaluationoftheimplementationoftheCramerclassificationschemeintheToxtreesoftware.SARQSAREnvironRes19,495-524,doi:10.1080/10629360802083871(2008).

63 Liechti,R.etal.SourceData:asemanticplatformforcuratingandsearchingfigures.NatMethods14,1021-1022,doi:10.1038/nmeth.4471(2017).

64 Beagrie,N.H.,J.TheValueandImpactoftheEuropeanBioinformaticsInstitute,FullReport.(2016).

65 Kim,S.,Thiessen,P.A.,Bolton,E.E.&Bryant,S.H.PUG-SOAPandPUG-REST:webservicesforprogrammaticaccesstochemicalinformationinPubChem.NucleicAcidsRes43,W605-611,doi:10.1093/nar/gkv396(2015).


https://doi.org/10.1101/248898


24

CompetinginterestsTheauthorsdeclarethattheyhavenocompetinginterests.

AcknowledgementsSIandJSwerepartiallysupportedbySwissCentreofAppliedHumanToxicology(SCAHT).


https://doi.org/10.1101/248898


Navigating in vitro bioactivity data: investigating ...Navigating in vitro bioactivity data:...

Documents

Transcript of Navigating in vitro bioactivity data: investigating ...Navigating in vitro bioactivity data:...