Navigating in vitro bioactivity data: investigating ...Navigating in vitro bioactivity data:...
Transcript of Navigating in vitro bioactivity data: investigating ...Navigating in vitro bioactivity data:...
Navigatinginvitrobioactivitydata:investigatingavailableresourcesusingmodelcompounds
StenIlmjärv1,2,*,FionaAugsburger1,JervenTjallingBolleman3,RobinLiechti2,AlanJamesBridge3,
JennySandström4,VincentJaquet1,IoannisXenarios2,3,Karl-HeinzKrause1,*
Affiliation
1DepartmentofPathologyandImmunology,CentreMedicalUniversitaire,UniversityofGeneva,Geneva,Switzerland;2Vital-ITGroup,SIBSwissInstituteofBioinformatics,Lausanne,Switzerland;3Swiss-Protgroup,SIBSwissInstituteofBioinformatics,CentreMedicalUniversitaire,GenevaSwitzerland;4SCAHTSwissCentreforAppliedHumanToxicology,Basel,Switzerland
Keywords
chemicals,drugs,compounds,chemicaldatabases,invitro,dataintegration,ontology,semantic
web,InChI,InChIKey,SMILES,high-throughput,omics
*Addressforcorrespondence
[email protected]&[email protected]
ABSTRACTModernmedicineandanincreasinglycomplexenvironmentcontributetoexposureofhumansto
a large number of chemical compounds, that can potentially be toxic. Although widely used,
compound testing in animals has important limitations. In vitro testing provides a promising
alternative.However,becauseoftherelativeinaccessibilityandfragmentationofavailabledata,
theinvitroapproachlargelyunderperformsitspotential.Theaimofthisstudyistoinvestigate
howavailablepubliconlineresources(toolsanddatabases)supportaccessinganddistributionof
in vitro compound data.We examined 19 public online resources,mapped their features, and
evaluatedtheirusabilitywithasetoffourmodelcompounds(aspirin,rosiglitazone,valproicacid,
and tamoxifen). By investigating compound names and identifiers, we observed extensive
variationand inconsistencies in available resources: the synonymsweredifferent, compounds’
structuralidentifiers(InChI,InChIKey,SMILESandIUPACsystematicname)underperformedin
omics databases, identification of compound relatedmetadata (e.g. concentrations used in the
experiments) fromomicsexperimentswascomplexandnoneoftheavailableresourcesclearly
distinguishedbetweeninvivoandinvitrodata.Inaddition,weestimatedaccessibilityofselected
public resources using computational queries. Only a few public resources provided access to
compound-related data using semantic web technology. The general quality of experiment
annotations created further difficulties in identifying data of interest. Therefore,we identified
severalstandardizedontologieswithpotentialtoprovideanincreasedaccuracyforextensivedata
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
2
retrievalofinvitrocompounddata.Furthermore,usingtheexamplesofourmodelcompounds,we
provide recommendations on the use of ontologies by suggesting specific ontology terms to
annotateinvitroexperimentaldatawhenbeingpublished.
IntroductionThe number of chemical compounds in public databases is growing and experimental data
concerning these compounds are accumulating. As of today, both the CAS RegistrySM and
PubChem1containmorethan90millioncompoundswithmanynewcompoundsbeingaddedeach
day.Amongthesecompoundsaredrugs,chemicals,environmentalcontaminantsandtoxins,allof
which potentially could elicit effects that could have implications on human health and/or
environment.Mostofthesecompoundshavenotbeencharacterizedfromatoxicologicalpointof
view.Indeed,theUSbasedToxNetdatabaseHazardousSubstancesDataBank(HSDB)2wherean
expertpanelreviewsindividualcompounds,containslessthan6000records(asof12.12.2017).
Alternatively, ChEMBL3, which predominantly consists of literature extracted bioactivity data,
containsaround1.73milliondistinctcompounds(asof12.12.2017).Thedifferenceinthenumber
ofcompoundsregisteredandthenumberofcompoundswhosetoxiceffectiswellcharacterized,
demonstratesthelimitedcapacityofcurrentmethodstoassessacompound’sbioactivityinaliving
system.Invitro testsystemswithhigh-throughputperformanceandpotentialscalabilityaimto
bridge that gap. It will further increase the amount of data and provide new knowledge of
compound’sbiologicalproperties.
Animaltestingofcompoundtoxicityandefficacyisrequestedfordrugdevelopment,butitisslow,
expensive and raise concerns about animal welfare4. Alternatives to animal experiments is
strongly encouraged according to the 3R (replacement, reduction and refinement) principles.
Therefore,large-scaleinitiativeshavebeenlaunchedtoinvestigatecompound’stoxicityinrapid,
quantitative and scalable in vitro systems (e.g. Tox215 and EU-ToxRisk6). These bioassays can
measureendocrinedisruption7, thegenerationof reactiveoxygenspecies (ROS)8orchanges in
gene expression of biomarkers in developing and mature neurons9. These assays represent
differentaspectsofthehumanbiologywhereanaberrationofhomeostasiscansuggestadverse
consequencestothehumanhealth.Comparedtoanimalmodels,invitrobioassaysallowstudying
concentration-response relationships over a large concentration range, including those
representative of human exposure4. By combining compound’s structural information, known
molecular properties and data from bioassays and omics experiments, researchers can better
describecompoundeffectpathwaysforasystematicunderstanding.
Presentlythereisnosimplewaytoaccessinvitrocompounddatainaquickandsynopticmanner.
Instead,dataisfragmentedacrossmanydifferentresources10andinterestedpartiesneedtoinvest
invaluabletimeandefforttodevelopanexpertiseinordertonavigatethesesystemsefficiently11.
Thediversityofcompoundsynonymsanditsidentifiers,lackofprecisemetadataandannotations,
canleadtofalseconclusionsandtodifficultiesidentifyingthecompoundcorrectlyafterithasbeen
published12.Toimprovethereproducibilityofexperimentalresultsandtotestnewhypotheses
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
3
(e.g.developmentofpredictivecomputationalmodels),availabilityandaccessibilityofrawdatais
crucial.
Inthisstudy,weanalyzed19publicresourcesthatpotentiallyallowaresearchertoinvestigate
compound-specific in vitro data. We demonstrate that modest adoption of semantic web
technologies and poor annotations of experimental metadata, in particular when describing
experimental conditions, represent a major obstacle for high quality data integration and its
reusability.Lastly,weprovideinsighthowstandardizedontologiescouldimprovetheannotation
ofcompoundrelatedexperimentaldata.
MaterialsandmethodsChoiceofmodelcompoundsforresourceanalysis
Inordertosearchtheresourcesandtoidentifytheirpotentialsandpitfalls,wechosefourmodel
compounds that were used to evaluate public resources throughout this study (Table 1):
rosiglitazone,valproicacid,aspirin(acetylsalicylicacid),andtamoxifen.Belowthemotivestowhy
thesewerechosenasmodelcompoundsarelisted:
Tamoxifenisacommonlyuseddrugtotreathormone-dependentbreastcancers.Tamoxifenwas
included in the study because of the several known side-effects13 and its relatively complex
structure.
Valproicacid isanantiepilepticdrugwhichhasbeeninclinicaluseforabouthalfacentury. Its
mechanismsofactionaremultipleandincludeinhibitionofhistonedeacetylases14.Itstherapeutic
use is well accepted but it has been found to cause severe developmental toxicity if given to
pregnantwomen15.Thehistoryofthedrugmadeitinterestingforustostudyiflaterdiscoveries
of toxic effects were translated to better data representation and to additional experiments
performedwiththecompound.
Aspirin(acetylsalicylicacid)isoneoftheoldestsyntheticdrugs.Itsmechanismofaction,inhibition
of prostaglandin synthesis through acetylation of platelet cyclooxygenase (COX)16,makes it an
anti-inflammatory,analgesicandantipyreticdrug.Aspirinwasinterestingforourstudysincefor
alongtimeithasbeenwidelyusedasareadilyavailablepainkiller,thuspotentiallyhavingmany
historicalsynonymsandexperimentaldata.Ofnote,thenameaspirinitselfderivesfromatrade
name.
Rosiglitazone is an antidiabetic drug of Peroxisome Proliferator-Activated Receptors (PPAR)
agonists family, introduced into the market in 1999. It was included in the analysis because
comparedtoothermodelcompoundsitisrelativelynew.Also,similartotamoxifen,thecomplex
chemicalstructureofrosiglitazoneenabledustostudythestructuralidentifiersofmoreintricate
molecule.
Overviewofdatabasescontainingcompoundinformation
Togetanoverviewofpublicresourcescontainingcompoundinformation,bothcompoundspecific
resources (e.g. databases containing physicochemical properties such as structure) and omics
databaseswereinvestigatedusingourmodelcompounds.Primarily,wefocusedonresourcesthat
couldpotentiallyaccommodateorlinktoinvitrocompounddata.Downstreaminformationsuch
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
4
astargetinteractionsandpharmacologicalinferencewerealsoconsideredastheycanbederived
frominvitroexperimentsandhencerawdatacouldpotentiallybebacktracked.Resourcefeatures
included in our analysis involved the scope of supported compound identifiers and search
functionalities,accessibilityofrawdata,supportforontologiesandprogrammaticdataaccess.
Table1.Tableofmodelcompoundsusedinthestudyandtheiridentifiers.
Aspirin
ChEBIID CHEBI:15365
PubChemCID CID2244
InChIKey BSYNRYMUTXBXSQ-UHFFFAOYSA-N
InChI InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
SMILES CC(=O)Oc1ccccc1C(O)=O
Rosiglitazone
ChEBIID CHEBI:50122
PubChemCID CID77999
InChIKey YASAKCUCGLMORW-UHFFFAOYSA-N
InChIInChI=1S/C18H19N3O3S/c1-21(16-4-2-3-9-19-16)10-11-24-14-7-5-13(6-8-14)12-15-17(22)20-18(23)25-15/h2-9,15H,10-12H2,1H3,(H,20,22,23)
SMILES CN(CCOc1ccc(CC2SC(=O)NC2=O)cc1)c1ccccn1
Valproicacid
ChEBIID CHEBI:39867
PubChemCID CID3121
InChIKey NIJJYAXOARWZEE-UHFFFAOYSA-N
InChI InChI=1S/C8H16O2/c1-3-5-7(6-4-2)8(9)10/h7H,3-6H2,1-2H3,(H,9,10)
SMILES CCCC(CCC)C(O)=O
Tamoxifen ChEBIID CHEBI:41774
PubChemCID CID2733526
InChIKey NKANXQFJJICGDU-QPLCGJKRSA-N
InChIInChI=1S/C26H29NO/c1-4-25(21-11-7-5-8-12-21)26(22-13-9-6-10-14-22)23-15-17-24(18-16-23)28-20-19-27(2)3/h5-18H,4,19-20H2,1-3H3/b26-25-
SMILES CC\C(c1ccccc1)=C(/c1ccccc1)c1ccc(OCCN(C)C)cc1
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
5
The public resources we used in our study were PubChem1, ChEMBL3, ChEBI17, Chemistry
Dashboard(CompTox)18,ChemSpider19,BioSamples20,ArrayExpress21,ExpressionAtlas22,PRIDE
PRoteomics IDEntifications (PRIDE)23,Humanmetabolomedatabase (HMDB)24, TheToxin and
Toxin Target Database (T3DB)25, Gene Expression Omnibus (GEO)26, UniProt27, BindingDB28,
DrugBank29,ZINC30andthreeof theToxicologyDataNetwork(TOXNET)databases:Hazardous
SubstanceDataBank(HSDB)2,ChemIDPlus31,andComparativeToxicogenomicsDatabase(CTD)32
(Table2).
Table 2. A list of resources used in the study, their categorization and the number of estimatedcompoundsintheseresourcesatthetimeofthestudy.
Database No.ofcompounds Checkedat Datatype
ArrayExpress - - Raw
BindingDB 635301 December-2017 Curated
BioSamples - - Raw
ChEBI 53495 December-2017 Curated
ChEMBL 1735442 December-2017 Curated
ChemIDPlus 418376 December-2017 Curated
ChemSpider ~62million December-2017 Curated
CompTox ~760000 December-2017 Raw/Curated
CTD 15381 December-2017 Curated
DrugBank 10562 December-2017 Curated
ExpressionAtlas - - Raw/Curated
GEO - - Raw
HMDB 114100 December-2017 Curated
HSDB 5929 December-2017 Curated
PRIDE - - Raw
PubChem >94500000 December-2017 Raw/Curated
T3DB 3673 December-2017 Curated
UniProt - - Curated
ZINC15 >100000000 December-2017 Curated
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
6
Modelcompounds,compoundnamingandidentifiers
To demonstrate the nomenclature and data retrieval complexity of public resources, we used
ChEBIasourreferencedatabasesinceitisextensivelycuratedandChEBIontology17wasusedin
several other resources (eg.PubChemandChEMBL). In addition,ChEBI containedanontology
term for all of our four model compounds (Table 1). For an overview of compound names,
synonymsandidentifiers,weexaminedthepublicresourcesinTable2.Ourscopeofidentifiers
and registry numbers included Simplified Molecular Input Line Entry System (SMILES),
International Chemical Identifier (InChI), International Chemical IdentifierKey (InChIKey) and
International Union of Pure and Applied Chemistry (IUPAC) name. SMiles ARbitrary Target
Specification(SMARTS)werenotincludedinouranalysissincetheyarenotshowninChEBI.Also,
weexcludedChemicalAbstractServiceRegistryNumber(CASRN)becausetheaccuracyofCASRN
in the public domain are not absolute and reliable information can only be accessed by paid
servicesprovidedbyChemicalAbstractService(CAS)10.
Dataaccessandintegrativeinitiatives
Theabilityofaresourcetointegratedataandcompoundinformationwasevaluatedbysearching
cross-referencesbetweenresourcesusingtheselectedmodelcompounds.Toassessdatasharing
and access to public data collections of the selected resources, we studied the use of
representationalstatetransfer(RESTorRESTful)applicationprogramminginterface(API)and
resourcedescriptionframework(RDF)technologies.
Analysisofontologyusage
Tosupporttheuseofdatacross-referencingfromindividualexperimentsweevaluatedifandto
whatextent, thepublicresourcesutilizedontologies in theirdataannotationandpresentation.
With the exception of Cellosaurus33, the rest of the ontologieswe considered important were
included either in the OBO Foundry34, NCBI BioPortal35 or Ontology Lookup Service by EBI36.
Althoughtheremightbeotherontologiesnotincludedinthesecollections,theywereconsidered
lessvisibleandthereforelesslikelytobetakenupbytheresearchcommunity.
ResultsDescriptionandinterconnectivityofpublicresourcescontainingcompounddata
Here,ouraimwastostudywhatarethepublicresourcescontainingcompoundinformationandif
one resource could be used to identify a compound in another resource. Broadly, compound
relatedinformationcanbeseparatedintotwocategories:1)knowledgethathasbeenmanuallyor
automaticallycuratedfrompublications,whichoftenresultfrompublishedorunpublishedraw
data(e.g.halfmaximaleffectconcentration(IC50)valuesorcompoundinteractionswithtargets),
and 2) the raw data itself, either from omics experiments that interrogate many potential
interactionssimultaneouslyorfromdedicatedbioassaysthatmeasureaspecificendpointovera
rangeofconcentrations.Therefore, theresourcescanbeclassifiedaseithercontainingcurated
information from scientific publications or some other reliable documents (examples include
HazardousSubstanceDataBank(HSDB),ChEMBL),ordatabasesthatcontainrawnumericaldata
dedicatedtocertaindatatypes(examplesincludeArrayExpress,PRIDE).InthecaseofPubChem,
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
7
it supports both curated information and raw data of bioassays as uploaded by its users. An
overviewofdatabaseswiththetypeofinformationtheycontaincanbefoundinTable2.
Weobserved,thatoftenaresourceisactuallyreusingdatafromotherdatabases,thuscombining
information fromseveral resources,orprovidingcross-references to them.Assuch,DrugBank,
PubChem,andChEBIhadthehighestnumberofincomingconnectionsfromotherresourcesand
PubChem, ChEMBL, ChEBI and ChemSpider had the highest number of outgoing connections,
connectingroughlytohalfofthepublicresourcesinourstudylist(Figure1).Mostoftheresources
had cross-references to another resource, especially when they contained data curated from
scientificdocuments.However,withtheexceptionsofUniProt,ChEBIandChEMBL,nocompound-
specificresourceshadcross-referencestodatabasescontainingdatafromomicsexperiments.At
the time of the study we were able to identify references in ChEBI to in vitro raw data in
Figure1.Representationofcross-references(representingeitherdirect linkstootherresourceorsampleddata fromtheotherresource)betweenresources inachorddiagram.Greyresourcesaredepicting compound-specific resources whereas, the red ones depict omics resources. UniProt isneithercompound-specificoromicsresourceandisthereforecoloredblue.Outgoingconnectionsareclosertothecircumferencewhereasincomingconnectionsarefurtherfromthecircumference.Thenumbersonthescalerepresentthecombinednumberofbothincomingandoutgoingconnectionsforeveryresource.
GEO
HMDB
HSDB
CompTox
T3DB
PRIDEPubChem
UniProt
ZINC
ArrayExpress
BindingDB
ChEBI
BioSamples
ChEMBL
ChemIDPlus ChemSpider
ExpressionAtlas
CTD
DrugBank
00
6
120
0
6
0
6
00
61218
0
6
12
0
6
0
60
6
120
6
12
18
0
0
6
12
180
6 0 6
12
0
60
6
0
6
1218
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
8
ArrayExpressforthreeoutoffourmodelcompounds(rosiglitazone:E-GEOD-5509,E-GEOD-5679,
E-GEOD-30147; valproic acid: E-GEOD-1615, E-GEOD-14973, E-GEOD-23909, E-TABM-903, E-
TABM-1205,tamoxifen:E-GEOD-2225,E-GEOD-12665,E-MTAB-5319,E-TABM-562).Foraspirin,
we were able to find 73 datasets in ArrayExpress, representing both in vivo and in vitro
experimentsthathaveuntilnownotbeencross-referencedinChEBI.
Identificationofdataincompound-specificpublicresources
Here, ouraimwas to studydifferentwaysa chemical compound is identifiedbetweenvarious
publicresources.Indeed,achemicalcompoundcanbeassociatedwithmanyidentifiers,suchasa
tradename,agenericname,aregistrynumber,auniquedatabaseidentifier(e.g.PubChemCIDor
ChEMBL ID) and its structure-derived representations, referred here as structural identifiers
(SMILES, InChI, InChIKey – different types of compound identifiers are shown in Table 3).
Potentially any of these can be used to search a compound from an online public resource.
However,weobservedthatnotallcompoundsynonymsandidentifiersareidenticalindifferent
resources.Asanexample,thecompoundrosiglitazonecontains151depositor-suppliedsynonyms
in the PubChem37, whereas in ChEBI only two synonyms were provided38. Consequently, the
PubChem depositor-supplied synonym for rosiglitazone termed Gaudil failed to recognize the
compoundinChEBI.
Table3.Atableofcompoundidentifiers,theirfeaturesandexamples.AbetteroverviewofcompoundidentifiersisprovidedbyWarr,201139.Ofnote,thereareseveralstructure-basedidentifiersandnotallofthemallowforsubstructuresearches.Inaddition,somestructure-basedidentifiers(e.g.InChIKeys)arebettersuitedforcross-referencingpurposes.
Identifier
Standardized
Uniquerepresentation
ofacom
pound
Machine-readable
structurenotation
Oftenusedin
annotatingdata
Permitssubstructure
search
Canbeusedforcross-
referencing
Examples
Compoundname ✓ Rosiglitazone
Tradename ✓ Avandia
Compounddatabaseidentifier
✓ CHEMBL121
Registrynumber ✓ ✓ 122320-73-4(CASnumber)
Chemicalstructuredrawing
✓ ✓ ✓
Structure-basedidentifier ✓ ✓ ✓ ✓ ✓
SMILES,InChI,InChIKey,IUPACname
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
9
To further illustrate the complications in searching information for a compound from several
distinct databases, we first used compound names of four model compounds to investigate
compoundspecificresources(BindingDB,ChEBI,ChEMBL,ChemIDPlus,ChemSpider,CompTox,
CTD,DrugBank,HMDB,HSDB,PubChem,T3DBandZINC15)usingfree-textbasedsearch.Wewere
abletofindrecordsforallfourmodelcompoundsinalloftheresources(withoneexceptionof
rosiglitazone that was not found in T3DB) andwe recorded their structural identifierswhere
available(SupplementaryTable1-4).Interestingly,weobservedthatnotallstructuralidentifiers
wereidenticalbetweentheresources.ThatwasespeciallytrueforSMILESwherefromtheeleven
resourcesthatreportedSMILES,wefound8differentSMILESforrosiglitazoneandtamoxifen,5for
aspirinand3forvalproicacid(SupplementaryTable1-4).TheresultswerebetterforInChIKeys
(establishedbyInternationalUnionofPureandAppliedChemistry(IUPAC)),whereweobserved
asingleidentifierforbothaspirin,valproicacidandtamoxifenbutthreedifferentInChIKeysfor
rosiglitazone. Things were more complicated with IUPAC systematic names since in several
resourcesitwasunclearifthereportednamewasactuallyaIUPACsystematicnameandtherefore
theseresourceswerediscardedfromtheanalysis.Consequently,IUPACnamewasrecordedonly
fromChEBI, ChemSpider, CompTox, DrugBank, HMDB, PubChem and T3DB. Even though few
resources reported IUPAC systematic names,we observed large variability among them.More
specifically,wefound3differentnamesforaspirin,4forrosiglitazone,1forvalproicacidand5for
tamoxifen(SupplementaryTable1-4).
These resultshighlight that InChIKeysamongvariousdatabasesaremoreuniquecompared to
SMILESorIUPACsystematicnames.Tothatend,aglobalresourceUniChemhasprovidedacross-
referencing service that connects 38 individual database identifiers of various resources using
InChIKeys40. However, this service is only useful when one already knows the compound’s
databaseidentifierortheInChIKey.Currently,itcannotbeusedwithotherstructuralidentifiers
orwithcompoundnames.
Identificationofcompounddatafromomicsdatabases
Theidentityofthechemicalcompoundscanbeambiguous,sincecompoundsareoftenmentioned
bynamewithouttheaccompanyingstructurerepresentations12.Thisisespeciallytrueforomics
resourceswheredatauploadedbytheresearcherscanbeannotatedwithdifferentsynonymsof
thesamecompound.Toinvestigatethisissue,weutilizedaweb-basedfree-textsearchtoresearch
omics data from ArrayExpress, ExpressionAtlas, BioSamples, GEO and PRIDE resources using
compounds’structuralidentifiers(SMILES,InChIandInChIKey)asreportedinChEBI.Fromallthe
resources,wewereabletoretrievedataatleastfortwomodelcompoundsusingcompoundnames
(SupplementaryTable5). Inaddition, the IUPACsystematicnamesofaspirin,rosiglitazone and
valproic acid retrieved datasets from ArrayExpress, BioSamples and GEO. Interestingly, in
BioSamples,wewereabletoretrievedatasetsforvalproicacidalsowithSMILES.However,these
datasets actually corresponded to the sodium salt ofvalproic acid, that has a slightly different
SMILES representation inChEBI compared tovalproic acid.Moreover, these sampleswerenot
retrievedwhenthecompoundnamewasusedinstead.
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
10
Thelatterhighlightsthatthebestwaytoidentifycompoundrelateddatafromomicsresourcesis
touse compoundnames,whichwould require the researcher to consider all of the compound
synonyms. Therefore, to estimate the variability of compound annotation in sample labels,we
retrievedthename,synonymsandstructural identifiers foreachof the fourmodelcompounds
fromtheChEMBLpublicSPARQLendpoint(thecodeisavailableathttps://github.com/whetlake/
ivcdp).ThesewerethenusedtoidentifysamplesandsamplelabelsintheBioSamplesdatabase
usingBioSamplespublicSPARQLendpoint.Forrosiglitazoneandtamoxifenonlythesampleswith
therespectivenamewasfoundinanyofthesamplelabels.Foraspirin,sampleswerefoundusing
bothaspirin,asparin,asprin,leviusandmeasurin(SupplementaryTable6-9).Thetermmeasurin
can also be derived from sample labels that contain the word measuring, making it highly
susceptibletoretrieving falsesamples,notassociatedwithaspirin.Surprisingly, thecompound
nameacetylsalicylicacidhighlightedinChEBIwasnotfoundinanyofthesamplelabels.Valproic
acidretrievedresultsalsoforvalproate,depakoteand44089.Thelatterisasynonymofvalproic
acid in ChEMBL but none of the associated sampleswere actually associated to valproic acid.
Similartoweb-basedfree-textsearch,inouranalysis,nosamplescontainedstructuralidentifiers
ofthemodelcompoundswithinthesamplelabels.Also,allthesamplesretrievedwereuniquei.e.
alternativecompoundlabelswerenotusedtoannotatethesamesample.Inaddition,caremustbe
taken when using free-text based approach to programmatically search for compound data.
Indeed, we were able to retrieve samples where sample labels indicated a negation of the
compound,suchas“Norosiglitazonepresent”(SAMEA86299841).
Identificationofinvitrocompounddatafrompublicresources
With the exception of ZINC15, none of the public resources clearly distinguished compounds
whoseeffecthasbeenmeasuredininvitroexperiments.Incontrast,ZINC15reportsasubsetsof
compoundsthathavebeenreportedorinferredactiveininvitrodirectbindingassays42.Another
approach to identify in vitro data in public resources is to browse the study description for
referencesof invitroexperimentrelatedkeywords like“invitro”, “cell-line”orspecificcell-line
names(e.g.“HeLa”).Forexample,PubChem,ChEMBL,ChEBI,ArrayExpress,ExpressionAtlas,GEO,
BioSamplesandPRIDEcanbeusedtomapthesekeywordswithsampledescriptionsusingfree-
textsearch.Assuch,wewereabletoretrieveabioassayrecordfromPubChemwiththetitle“In
VitroCytotoxicityAgainstOvarianCarcinomaCellLineCH1”43whichclearlyrepresentsaninvitro
experiment.ChEMBLprovidesamoreprecisesearch,whichallowstoretrievedataoncompounds
associatedwithspecificcell-linesorinvitroassays.
Inaddition,wecanidentifyinvitrodatabyusingconcentrationunitsassociatedwithmolarityor
molality. For example, we used previously identified microarray samples from ArrayExpress
experimentE-GEOD-5679wheredendriticcellsweretreatedwithrosiglitazone44.Fromthefull
experimentdescription,weidentifiedthatrosiglitazonewasusedat2.5µMconcentrationlevel,
indicating that this was indeed an in vitro experiment. However, samples themselves had no
attributionofconcentrationunitinneitherArrayExpress44orBioSamples(SAMEG1922645).Since
the actual concentration could only be retrieved from the experiment description or from the
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
11
originalpublication,thisapproachiscurrentlynotscalabletolarge-scaleautomaticidentification
ofinvitrodata,thuslimitingitsuse.
Insummary,thisanalysisdemonstratesthatpresentlynoconcretewaytospecificallyidentifyin
vitrodatausingfree-textbasedsearchesexistsinanyofthestudiedresources.AlthoughChEMBL
providesidentificationofcompoundinformationrelatedtocell-linesandinvitroassays,onewould
havetocoverallavailablecell-linesandassaysinordertogetacompleteoverviewoftheavailable
invitrodata.
Accessinginvitrocompounddatafrompublicresourcesprogrammatically
Toreusealreadypublisheddata,relevantdataneedtobeidentifiedcorrectlyandmadeavailable
to the researcher. Therefore, most public resources provide access to data through a bulk
download(Table4),whichcanbeverylarge(e.g.thePubChemcompounddatabaseincompressed
ASNformatisabout60GB,asofMarch2017).Alternatively,severalpublicresourcesalsoprovide
online programmatic data access. Here, we researched the utilization of RESTful API
(Representational state transfer application programming interface) andResourceDescription
Framework(RDF)technologiesbytheselectedpublicresources.
Table4.Alistofresourcesusedinthestudyandtheirsupportfordifferentdataaccess.
Bulkdownload
RESTfulAPI
RDFsupport
SPARQLendpoint
Substructureand/orsimilarity
search
ArrayExpress ✓ ✓
BindingDB ✓ ✓ ✓
BioSamples ✓ ✓ ✓
ChEBI ✓ ✓ ✓
ChEMBL ✓ ✓ ✓ ✓ ✓
ChemIDPlus ✓ ✓ ✓
ChemSpider ✓ ✓ ✓
CompTox ✓ ✓
CTD ✓
DrugBank ✓ ✓
ExpressionAtlas ✓ ✓ ✓ ✓
GEO ✓
HMDB ✓ ✓
HSDB ✓ ✓
PRIDE ✓ ✓
PubChem ✓ ✓ ✓ ✓
T3DB ✓
UniProt ✓ ✓ ✓ ✓
ZINC15 ✓ ✓
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
12
In aRESTfulquery, thedata request is constructed into a singleURL.ThismakesRESTfulAPI
simpletouseandplatformindependent.Outofthe19resourcesinourstudy,10providedfree
access to theirRESTfulAPI (Table4),whereasaccess toDrugBankAPIwasprovided fora fee.
ContrarytoRESTfulAPIwherenodataisstoredlocally,thedatainRDFcompatibleformatcanbe
usedforbulkdownloadandsubsequent introductionintoa localdatabaseforqueryexecution.
However, a preferred way of data acquisition is through a public SPARQL endpoint, which
facilitatesqueryingtheserviceproviderdirectly,thusalwaysretrievingthemostup-to-datedata.
HereweobservedthatonlyBioSamples,ChEMBL,ExpressionAtlasandUniProtprovideapublic
SPARQLendpoint(Table4).AcquiringdatausingaSPARQLendpointcanbeslowercomparedto
RESTful data access, since the latter is better optimized for specific, recurrent query requests.
However,SPARQLquerieshave thebenefitofbeinghighlycustomizableby theresearcherand
thereforetheyprovidetheflexibilitythatwillcatertotheresearchers’uniqueneeds.Also,since
RDFisaninherentpartofthe“linkeddata”concept,itcanbeusedtofindrelationshipsbetween
datasets indifferent resourcesand is thereforeextremelyuseful fordata integrationpurposes,
suchasconnectingcompoundseffectinoneresourcetoitsphysicochemicalpropertiesinanother.
Identificationofrelevantontologiestoannotateinvitrocompounddata
Ontology isacollectionofspecificallydefinedandcontrolledvocabularyconsistingofontology
terms. Ontologies allow to formally describe concepts and to define the relationship between
them46. Individual ontologies can consist from few tomany thousands of ontology terms. For
example, ChEBI ontology is made up over 100 000 ontology terms. By using ontologies, the
researchercancreatebettertargetedandmoreprecisequerieswithinthepublicresources.This
necessitatesthattheresearcherisabletoidentifyandapplyrelevantontologytermsfortheirdata.
For this purpose, we identified three resources that allow to browse and search for specific
ontologies and ontology terms: i) OntoBee47 is the default service containing a collection of
ontologiesasprovidedbytheOpenBiologicalOntology(OBO)foundry34,aswellasii)BioPortal35
and iii)OntologyLookupService (OLS)36whichareontologybrowsersdevelopedbyNCBI and
EMBL-EBI, respectively. These services can be used to browse through a number of different
ontologies, toviewrelationsbetweentermsandretrievedefinitionsforbiomedicalvocabulary.
Currently (as of 20.12.2017), OLS contains 204, OntoBee 193 and BioPortal 678 ontologies.
Important to our study, we identified 10 ontologies that are useful in relation to chemical
compounds and can be used for annotating as well as programmatically accessing in vitro
compound data (Table 5). As such, CHEMical INFormation ontology (CHEMINF)48 provides
ontologytermsforresourceandstructureidentifiers.Forexample,InChIKeycanbeidentifiedwith
a specific Universal Resource Identifier (URI) (http://semanticscience.org/resource/CHEMINF
_000059)inCHEMINFontology.
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
13
Table5.Overviewofontologiesthatcanbeusedtoannotatedoridentifyinvitrodata.
ChEBI ChemicalEntitiesofBiologcalInterest17
Highlycuratedcontrolledvocabularythathasbecomethe“goldstandard”forannotationofthesmallmolecularentities.
CHEMINF CHEMicalINFormationontology48
Canbeusedtoannotatechemicalinformation,includingchemicalstructureandchemicalproperties.Importantly,itincludesmanyofthedatabaseandstructureidentifierssuchChEMBLIDorInChIKey,respectively.
OBI • OntologyforBiomedicalInvestigations49
Canbeusedtodescribeabiomedicalinvestigation,suchasprotocols,instrumentsandexperimentaldesign.Importantly,itallowstoannotatedatapointsofspecificdosesusedintheexperiment(http://purl.obolibrary.org/obo/OBI_0000984)thatexaminedose-response(http://purl.obolibrary.org/obo/OBI_0001418)relationships.Furthermore,itcanbeusedexplicitlytoannotateaninvitro(http://purl.obolibrary.org/obo/OBI_0001285)experiment.
CLO Cell-lineOntology50
CLOisusefulfordescribingacell-line.Itaimstobringtogetheravailablecell-linedatafrommultiplesourcesintoacommonformat50.Importantly,itcanbeusedtospecifyacell-linethatwasmaintainedinaninvitroenvironment(http://purl.obolibrary.org/obo/CL_0001034),thusallowingidentificationofinvitroinvestigation.
Cellosaurus TheCellosaurus:acelllineknowledgeresource33
Cellosaurusisusedforreferencingcell-lines.Asofversion24(November2017)itcontained99021cell-lineswithreferencestooriginalpublicationsandcross-referencestomanycell-linecatalogues.Theyalsoprovidereferencestoresources,wheresamples(BioSamples,ChEMBL,etc.)couldbeannotatedwiththecell-linescontainedinCellosaurus.Similarly,itallowssearchofinvitrostudiesthroughtheidentificationofspecificcell-lines.
BAO BioAssayOntology51
BAOwasdevelopedtodescribeandmodeldiverseandcomplexassays51.Importantly,BAOprovidesdefinitionsforabsoluteorrelativeIC50(http://www.bioassayontology.org/bao#BAO_0000197,http://www.bioassayontology.org/bao#BAO_0000198,respectively)valuesandforseveralotherendpointsbasedonbioassaymeasurements.
UO UnitsofMeasurement52
UOprovidesunitsofmeasurement.Importantlyonecanuseunitsofmolarity(http://purl.obolibrary.org/obo/UO_0000061)suchasmicromolar(http://purl.obolibrary.org/obo/UO_0000064)todefineconcentrationsininvitroexperiments,thusallowingidentificationofinvitrodata.
UBERON Uberon,anintegrativemulti-speciesanatomyontology53
UBERONisanintegratedcross-speciesontologythatfacilitatesreasoningfromhigher-orderanatomicallevelstomorespecificsublevels.Forexample,CLOusesUBERONtoaddorgan-specificannotationstocell-lines50.Therefore,UBERONcanbeusedtotranslateknowledgefromcell-linestoorgansorviceversa.
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
14
Usingontologiestoretrieveinvitrocompounddata
Since the information stored in an RDF triplet is formulated as subject-predicate-object, each
element can be a term defined in an existing ontology. Therefore, we investigated if using
ontologies enabled us to retrieve samples directly related to in vitro ontology terms.We used
BioSamplespublicSPARQLendpointasourtargetdatabasesinceitcombinessampleannotations
from ArrayExpress, PRIDE Archive and European Nucleotide Archive (ENA) and as shown
previously,wewereabletoretrievedatafromBioSamplesforallourmodelcompoundsusingboth
compoundandIUPACsystematicnames.Tooursurprise,noneofthesamplesinBioSampleswere
annotated with the in vitro ontology term (http://purl.obolibrary.org/obo/OBI
_0001285) found in Ontology for Biomedical Investigation (OBI) ontology or with structural
identifier ontology terms (InChI, InChIKey, SMILES) from either ChEBI (Chemical Entities of
BiologicalInterest),CHEMINForEDAM(Bioinformaticsoperations,datatypes,formats,identifiers
andtopics)ontologies.However,wedidfindsamplesforallourmodelcompoundsusingChEBI
ontology terms (rosiglitazone: http://purl.obolibrary.org/obo/CHEBI_50122, aspirin:
http://purl.obolibrary.org/obo/CHEBI_15365, valproic acid: http://purl.obolibrary.org/obo/
CHEBI_39867,tamoxifen:http://purl.obolibrary.org/obo/CHEBI_41774).
Next,weinvestigatedifsamplesofourmodelcompoundsretrievedusingChEBIontologyterms
wereassociatedwithamolarityunit(http://purl.obolibrary.org/obo/UO_0000061)usingtheUO
(Unitsofmeasurement)ontology.Indeed,suchsamplesexistedforrosiglitazoneandtamoxifenbut
notforaspirinandvalproicacid.Alternatively,wewereinterestedifanyofthesamplesassociated
with the model compounds were also associated with a cell-line ontology term
(http://www.ebi.ac.uk/efo/EFO_0000322)fromExperimentalFactorOntology(EFO).Nosamples
werefoundforaspirinbutwewereabletofindseveralsamplesforrosiglitazoneandtamoxifen
andasinglesampleforvalproicacid(SupplementaryTables10-12),effectivelyidentifyinginvitro
compounddata.
In summary,despite the fact thatwewereable to retrieve somesamples representing in vitro
experimentsrelatedtoourmodelcompounds,theyareoflimiteduseforthesearchofpresently
EDAM EDAM:anontologyofbioinformaticsoperations,typesofdataandidentifiers,topicsandformats54
EDAMcanbeusedtodescribeaninvitroscientificexperiment,suchastheoutputandcomputationalproceduresrelatedtothedataanddataformats.SimilartoCHEMINFitprovidesvocabularyforcompoundstructureidentifiers/formatsuchasInChIKey(http://edamontology.org/format_1199),howeveritismoreusefulinannotatingthedown-streamdataanalysisandlesspracticalindescribingtheexperimentaldesign.
EFO ExperimentalFactorOntology55
EFOwasdevelopedtoannotatethelargequantityofdomain-independentgeneexpressiondata55storedinArrayExpress.Althoughnotspecificallydevelopedforusewithchemicalcompounds,itcontainsspecificclassesdescribingchemicalentities(e.g.chemicalcompoundsanddrugs).EFOontologycanbeusedtoidentifygeneexpressiondatathatisrelatedtochemicalcompounds.
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
15
availablecompounddata,asthesedataaremostlydepositedwithoutassociatedontologyterms.
However,theuseofontologiescouldbeapowerfultoolforsearchstrategiesonceannotationsof
experimentalmetadataimproves.
DiscussionInthisstudy,weusedfourmodelcompounds(aspirin,rosiglitazone,valproicacidandtamoxifen)
to investigate 19 public resources potentially suited for identification and access to in vitro
compound data. Fragmentation and problematic accessibility of in vitro data poses a major
obstaclefortheiroptimaluseandhencereducestheirscientificandpracticalimpact.Quickand
preciseidentificationandinformationretrievalisnecessaryforefficientuseofpublisheddata,to
saveresearcherandregulatorsalikevaluabletimeandtoallowquickandeasyidentificationof
relevantdatasets56.
Fragmentationofpublicresourcesandintegrationefforts
Therearemanyresourcesthatprovidecompound-specificdata.Therearealsoseveraldatabases
thatcontainomicsdata.Ageneralconsensusisthatrawdatafromexperimentsshouldbepublicly
accessible,anditisthepolicyofhighqualityscientificjournalsandofseveralgrantingagenciesto
requirepubliclyaccessibledepositionofresearchdata.Intheory,thisisanexcellentopportunity
forbothinvitroandinsilicotoxicologysinceitwouldallowtocomparedataproducedbydifferent
research laboratories and help to estimate the impact of different experimental conditions.
However,atpresent,thistheoreticalopportunitydoesnotliveuptoitsfullpotential.
Effortshavebeenmadetointegratecompoundinformationoromicsdatafromseveralresources.
Indeed, in our analysis we observed that many curated resources dedicated to chemical
compounds contain cross-references to each other. Some of them also include data from one
another. This can lead to databases overlapping in their content57. As an example of cross-
references,ChemSpiderreportstohavelinkstomorethan500resourcesbutnonerepresenting
anomicsdataresource.Ontheotherside,BioSamplesdatabaseintegratesmetadatafromomics
resources such as ArrayExpress, EuropeanNucleotide Archive (ENA) and PRIDE but does not
readily associate this metadata with compound-specific resources or compound identifiers.
Although resources such as UniProt, ChEMBL and ChEBI included links to ArrayExpress,
ExpressionAtlas or PRIDE,we observed that thesewere not exhaustive,with several potential
linkstoomicsdatasetsmissing.
InadditiontoBioSamples,thereareotherpublicandprivateinitiativesaimingatbridgingthegap
between the discrepancies of field-specific databases. OmicsDI (Omics Discovery Index)58 and
Repositive IO (https://repositive.io) aim to integrate metadata from many different domains,
including genomics, transcriptomics andmetabolomics. Although they lack cross-references to
compound-specificresources,theycanstillbeusedtoidentifycompound-related invitroomics
data.Furthermore,Repositive IOallows itscommunitydrivensearchplatformtobepopulated
with annotation tags. This makes data identification a crowd-sourcing task, which may
substantiallyimprovetheannotationofthemetadata.However,currentlytheseannotationssuffer
from the lack of a common vocabulary and may lead to confusion when different terms that
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
16
correspondtothesamemeaningareused.Instead,asystemusingstandardizedontologiesshould
befavored.Inthatcasecompoundsamplescouldbedirectlyrelatedtothecompound’sInChIKeys
asanannotationtagdescribingthemetadata.Similaradditionscouldbemadewithconcentration
unitsandcell-lines.
Variabilityofcompoundnamesandidentifiers
It iswellknownthat free-textsearchofsystematicchemicalnameshas lowprecisionandhigh
error rate12, especially considering the variability in synonyms between resources as
demonstratedforourmodelcompounds.Therefore,searchingacompoundnameoritssynonym
fromoneresourcemightfailtoretrieveresultsfromanotherresource.Also,coveringthewhole
scopeof compound identifierswill take substantial amountof researcher’s time, sincea single
compoundcan containhundredsof synonymsand several identifiers. Furthermore, compound
namesthemselvesareambiguous,especiallyinomicsresourcesthatarenotspecificallydesigned
tostorecompoundinformationwhereassociatedmetadatafailstoincorporateuniquecompound
identifiers. For example, as identified in our study,measurin,which according to ChEMBL is a
synonymofaspirin (Compound ID:CHEMBL25), canpotentiallybemisledwithsampleswhere
labelsincludethetermmeasuring.Indeed,twooutoffourmodelcompoundsinourstudywere
annotatedusingalternativelabelsthatwereassociatedwithdifferentsamples.
Ideally,onlyasingleuniquestructurerepresentationexistsforeachuniquecompound.Wetested
this by using the name of themodel compound as search input and recorded four structural
identifiers:InChI,InChIKey,SMILESandIUPACname.MostvariabilitywasobservedforSMILES
andIUPACnameswheremorecomplexmolecule,suchasrosiglitazoneandtamoxifenhadmultiple
structural identifier representationsbetween resources. In the caseof InChIand InChIKey, the
variation,wasconsiderablylower,albeitnotabsent.Thereforecaremustbetakennottoattribute
bioassayresultstoalternativestereochemicalstructures57.
Theobservedvariabilityinstructuralidentifiersisnotsurprising,sinceoftenmolecularformulas
arenotunequivocallyuniquerepresentationsofachemicalcompound39.Nevertheless,although
severalInChIKeyscanbederivedfromasingleInChI,theyarestillconsideredsufficientlyunique
atprovidinganadequatecollisionresistancei.e. it isunlikelythattwodifferentcompoundsare
associated with one InChIKey59. However, including all omics resources, there are some
compound-specificresourcesthatdon’tallowdataidentificationbasedoncompound’sstructural
identifiers.Thiscancauseproblemsfordata integrationtaskswhichiswhycompoundsshould
alwaysbesupplementedwithuniquestructuralidentifiers,suchasInChIKeys,thatwouldenhance
theirdownstreamidentification.
Usingontologiestointegrateandsearchforinvitrocompounddata
Onewaytoincreaseannotationqualityistouseexistingontologiesthatenhancethequalityand
consequentlytheprecisionofidentifyingdatacorrectly.Inaddition,sinceontologiesareapartof
the “linkeddata”concept, theycanbeused to integrate information fromseveral resourcesby
usingfederatedqueries.Tothisend,ontologybrowsers,suchasOntologyLookupService(OLS)36
are excellent tools for finding terms relevant for your data, thus also promoting the reuse of
existingontologycontent35.
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
17
In this study, we focused on resources that provided Resource Description Framework (RDF)
compliantdataorapublicSPARQLendpointfordataqueriessincetheyinherentlytakeadvantage
ofontologies.We failed to retrievea single sample inBioSamples resource thatwoulddirectly
relatetoinvitrodesignontologytermfromOntologyforBiomedicalInvestigation(OBI)ontology.
Inaddition,wesearchedBioSamplesfordatasetscharacterizedwithmolarconcentrationunitor
cell-lineontologyterms.Similarly, theresultswere far fromcomprehensive.Wethinkthat this
approach can be further improved once better ontology mappings of the metadata become
available.Prospectively,thereisaclearpotential,providedabetterannotationofdata,forexample
byutilizingtheCellosaurusontology33thatcontainsacontrolledvocabularyforaround100000
celllines.
UsingSPARQLandothersemanticwebtechnologies,weshowedthatassociationsbetweenraw
datasamplesandontologytermsdoexistandhencetheambiguityofidentifyingandretrievingin
vitrodata,canbereduced.Nevertheless,despitethecurrentstateofdataannotation,inorderto
integrate data from different public resources, the compound formatting needs to be further
harmonizedandthedomainvocabularystandardized56.Moreimportantly,thenewandexisting
resourcesneedtoadjustthemselvestotakefulladvantageofexistingstandardizedontologies.So
far,theyhaveonlybeenimplementedbytheselectedfew.
Toacertainextent,thereareinitiativesthatutilizesemanticwebtechnologiestosolveproblems
withcompounddataintegration.Assuch,theOpenPHACTSproject,encouragesdataresourcesto
publishtheirdataasRDF, inorder tobuildan“openpharmaceuticalspace”wheredatacanbe
accessedthroughauser-friendlysoftwareinterface10.Asoftoday,itisnotclearhowthesoftware
canbeusedbyresearcherswhodonothaveexplicitknowledgeinAPIprogramming.Nevertheless,
it can potentially be used to build new, researcher friendly applications on Open PHACTS
interoperabledata.Inparallel,OpenToxaimstoprovideasingleaccesstobothinvivoandinvitro
toxicity data and a framework for storing and executing predictive computational models60.
SimilartoOpenPHACTS,thisinitiativeisnotapplicabletonon-expertresearcher,whoislooking
forcompoundassociatedrawomicsdata,thusagapconnectingcompoundstotheirrawinvitro
datastillpersists.
Invitrotoxicology:themagnitudeofthechallengeahead(conclusionandout-look)
Tens of thousands of compounds need to be tested in order to correctly evaluate the
pharmaceuticalchemicalspace61.Realizingthistaskusing invivomodels isexpensiveandtime
consuming.Instead,alternativetestmethodssuchashigh-throughputinvitroassaysprovidethe
capabilitytotestthousandsofcompoundsacrossawiderangeofconcentrations.Togetherwith
predictive in silico applications, such as Toxtree62, which is an open-source application
commissionedbytheEuropeanCommissionJointResearchCenter’sEuropeanChemicalBureau,
theyhavethepotentialtoimproveriskassessmentofchemicalsforthebenefitofhumanhealth
andenvironment.
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
18
Figure2.Agraphicsillustratingtheproblemsofintegratingknowledgebetweencompoundofinterestanddifferenttypesofdataresources.Theproblemscanbesolvedwithintegratedapproachesusingontologies,semantic-webtechnologiesandbetterannotationofthedata.
There already exists a substantial corpus of resources that contain data on a large number of
chemicalcompounds.Thesedataandtheirsourcesarediverseandtheyneedtobeintegratedin
Compound data is generated by researchers, screening facilities and assay developers
Data for compound of interest is (A) curated from publications into compound-specificdatabases or (B) raw data is stored in high-throughput omics databases.
A SINGLE COMPOUND DATA ACCESS POINT WOULD RESOLVE COMPOUND IDENTIFICATIONISSUES BETWEEN RESOURCES AND ACCELERATE DATA RETRIEVAL AND ANALYSIS
Curation of dataOntologies Semantic-web technologies
+ +AZ
There is little communication between compound-specific (A) and high-throughput omics (B) dataresources.
Identification of compound data in compound-specific resources (A) can be done easily but attentionmust be given to compound naming and identifiers that can differ between resources.Due to compound naming and identifier ambinguity the identification of compound data inhigh-throughput omics resources (B) is complicated and time consuming.
compound of inte
rest
=
Semantic web technologies can be used to connect knowledge in different resources based on shared ontology terms.
A unifying resources that takes advantage of ontologies, semantic web technologies and clean data annotation will provide an invaluable service to researchers globally, improve metadata quality, researcher’s efficiency and save considerable amount of time and money.
Curation of existing and new data with ontology terms will improve data annotation. This will reduce ambiquity in accessing relevant compound information.
Standardized ontologies can be used to identify relevant data associated with specific keywords (e.g. in vitro) that fit experimental description.
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
19
order to answer the systemic effects of compounds (Figure 2). Accessing published datawith
correctcompoundinformationisessential.Theproblemsencounteredinaccessingdataonour
modelcompounds,demonstratethatusingtheresultsfrompublicationsstoredinpublicresources
andcross-referencingthemwithomcisdatastillrequiressubstantial investigativecapacity.We
recognize,thattherearetwosolutionsfortheproblemsofaccessinginvitroactivitydata:curation
ofalreadyexistingdataandimprovedannotationsbytheexperimenters.Curationcouldtransform
poorlyandhardtoreachcompoundinformationintoarichandrelevantresource.Withoutnew
experimental efforts, it could right away enhance our knowledge of toxicology and decrease
unnecessaryadditionalexperimentation,includinganimaltesting.Theproblemofthecurationis
unfortunatelyinlargepartsinsurmountable:themanpowerandexpertiserequiredtotransform
thepresent“datagraveyard”intoalivingresourcewouldnecessitatefinancialmeanswhichsimply
donotexist.Thus,wefullysupporttheconceptofacurationeffortbutwearealsoperfectlyaware
of its limitations. However, efforts similar to SourceData63, that allows to annotated already
publishedfiguresinpublication,couldprovideapotentialcure.Alternativetocurationisimproved
annotationby theexperimenter themselves.Obviously, this isaprospectiveeffort thatwillnot
applytoexistingdata. It isalso limitedbythewillingnessandexpertiseof theexperimenterto
standardize data presentation. For example, the use of proper compound identifiers (e.g.
InChIKeys)andontologytermsisnotpartof thepresentscientificculture.However,theuseof
ontologybrowserscouldalleviatethistransition.Nevertheless,currentlyresearchersreceivelittle
to no reward for the broad accessibility and perennity of their data. Thus, while improved
annotationbytheexperimenterisanoblegoalforthefuture,thiswillnothappenautomatically.
Novelapproaches forscientificpublishingsuchas theuseofproperannotationswithontology
termsandattributionofcredittoscientistsfortheirdatawillbeneededtoreallymakeanimpact.
Wouldtheeffortsnecessaryforgeneralaccessiontoinvitrocompounddatabeworththemoney
and time? Considering the success of UniProt which incorporates extensively curated and
trustworthyproteindata,theanswerisyes.IndeedacarefulanalysispublishedintheEMBL-EBI
valuereport64estimated46%increaseinresearchefficiencyforscientistsaccessinginformation
relevanttotheirresearchquestion.Witharound400000uniquevisitorspermonth,thereported
estimationshowsanenormouscost-effectbenefitfortheresearchercommunity.Theinterestin
chemical compounds isevenbigger:PubChemalonereceivesabout1millionuniqueusersper
month65. However, omics resources do not readily link to PubChem (or other compound
databases)andoftenambiguouscompoundnamesareusedtoannotatethedata.Thus,thetime
lostbythecommunitytoassemblecompoundinformationisconsiderable.Thishighlightstheneed
for an improved resource that would enhance the efficiency and speed of accessing raw and
analyzedcompounddatainareliable,simplifiedandintuitivemanner.Itwouldallowresearchers
tofocusondataanalysisanditsinterpretationinsteadofcollectionandcuration.Andfinally,such
aresourcecouldcontributetopublichealthbyallowingabetteridentificationandmanagement
ofpotentiallyhazardouscompounds.
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
20
References
1 Kim,S.etal.PubChemSubstanceandCompounddatabases.NucleicAcidsRes44,D1202-1213,doi:10.1093/nar/gkv951(2016).
2 Fonger,G.C.,Hakkinen,P.,Jordan,S.&Publicker,S.TheNationalLibraryofMedicine's(NLM)HazardousSubstancesDataBank(HSDB):background,recentenhancementsandfutureplans.Toxicology325,209-216,doi:10.1016/j.tox.2014.09.003(2014).
3 Bento,A.P.etal.TheChEMBLbioactivitydatabase:anupdate.NucleicAcidsRes42,D1083-1090,doi:10.1093/nar/gkt1031(2014).
4 Krewski,D.etal.Toxicitytestinginthe21stcentury:avisionandastrategy.JToxicolEnvironHealthBCritRev13,51-138,doi:10.1080/10937404.2010.483176(2010).
5 Tice,R.R.,Austin,C.P.,Kavlock,R.J.&Bucher,J.R.Improvingthehumanhazardcharacterizationofchemicals:aTox21update.EnvironHealthPerspect121,756-765,doi:10.1289/ehp.1205784(2013).
6 Daneshian,M.,Kamp,H.,Hengstler,J.,Leist,M.&vandeWater,B.Highlightreport:LaunchofalargeintegratedEuropeaninvitrotoxicologyproject:EU-ToxRisk.ArchToxicol90,1021-1024,doi:10.1007/s00204-016-1698-7(2016).
7 Huang,R.etal.ProfilingoftheTox2110Kcompoundlibraryforagonistsandantagonistsoftheestrogenreceptoralphasignalingpathway.SciRep4,5664,doi:10.1038/srep05664(2014).
8 Schmidt,B.Z.etal.Invitroacuteanddevelopmentalneurotoxicityscreening:anoverviewofcellularplatformsandhigh-throughputtechnicalpossibilities.ArchToxicol91,1-33,doi:10.1007/s00204-016-1805-9(2017).
9 Colaianna,M.etal.Fingerprintingofneurotoxiccompoundsusingamouseembryonicstemcelldualluminescencereporterassay.ArchToxicol91,365-391,doi:10.1007/s00204-016-1690-2(2017).
10 Williams,A.J.etal.OpenPHACTS:semanticinteroperabilityfordrugdiscovery.DrugDiscovToday17,1188-1198,doi:10.1016/j.drudis.2012.05.016(2012).
11 Sansone,S.A.etal.Towardinteroperablebiosciencedata.NatGenet44,121-126,doi:10.1038/ng.1054(2012).
12 Murray-Rust,P.,Mitchell,J.B.&Rzepa,H.S.Communicationandre-useofchemicalinformationinbioscience.BMCBioinformatics6,180,doi:10.1186/1471-2105-6-180(2005).
13 Yang,G.,Nowsheen,S.,Aziz,K.&Georgakilas,A.G.ToxicityandadverseeffectsofTamoxifenandotheranti-estrogendrugs.PharmacolTher139,392-404,doi:10.1016/j.pharmthera.2013.05.005(2013).
14 Phiel,C.J.etal.Histonedeacetylaseisadirecttargetofvalproicacid,apotentanticonvulsant,moodstabilizer,andteratogen.JBiolChem276,36734-36741,doi:10.1074/jbc.M101287200(2001).
15 Tomson,T.etal.Withdrawalofvalproicacidtreatmentduringpregnancyandseizureoutcome:ObservationsfromEURAP.Epilepsia57,e173-177,doi:10.1111/epi.13437(2016).
16 Smith,J.B.&Willis,A.L.Aspirinselectivelyinhibitsprostaglandinproductioninhumanplatelets.NatNewBiol231,235-237(1971).
17 Degtyarenko,K.etal.ChEBI:adatabaseandontologyforchemicalentitiesofbiologicalinterest.NucleicAcidsRes36,D344-350,doi:10.1093/nar/gkm791(2008).
18 Williams,A.J.etal.TheCompToxChemistryDashboard:acommunitydataresourceforenvironmentalchemistry.JCheminform9,61,doi:10.1186/s13321-017-0247-6(2017).
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
21
19 Pence,H.E.&Williams,A.ChemSpider:AnOnlineChemicalInformationResource.JournalofChemicalEducation87,1123-1124,doi:10.1021/ed100697w(2010).
20 Faulconbridge,A.etal.UpdatestoBioSamplesdatabaseatEuropeanBioinformaticsInstitute.NucleicAcidsRes42,D50-52,doi:10.1093/nar/gkt1081(2014).
21 Kolesnikov,N.etal.ArrayExpressupdate--simplifyingdatasubmissions.NucleicAcidsRes43,D1113-1116,doi:10.1093/nar/gku1057(2015).
22 Petryszak,R.etal.ExpressionAtlasupdate--anintegrateddatabaseofgeneandproteinexpressioninhumans,animalsandplants.NucleicAcidsRes44,D746-752,doi:10.1093/nar/gkv1045(2016).
23 Vizcaino,J.A.etal.2016updateofthePRIDEdatabaseanditsrelatedtools.NucleicAcidsRes44,D447-456,doi:10.1093/nar/gkv1145(2016).
24 Wishart,D.S.etal.HMDB3.0--TheHumanMetabolomeDatabasein2013.NucleicAcidsRes41,D801-807,doi:10.1093/nar/gks1065(2013).
25 Wishart,D.etal.T3DB:thetoxicexposomedatabase.NucleicAcidsRes43,D928-934,doi:10.1093/nar/gku1004(2015).
26 Barrett,T.etal.NCBIGEO:archiveforfunctionalgenomicsdatasets--update.NucleicAcidsRes41,D991-995,doi:10.1093/nar/gks1193(2013).
27 TheUniProt,C.UniProt:theuniversalproteinknowledgebase.NucleicAcidsRes45,D158-D169,doi:10.1093/nar/gkw1099(2017).
28 Gilson,M.K.etal.BindingDBin2015:Apublicdatabaseformedicinalchemistry,computationalchemistryandsystemspharmacology.NucleicAcidsRes44,D1045-1053,doi:10.1093/nar/gkv1072(2016).
29 Law,V.etal.DrugBank4.0:sheddingnewlightondrugmetabolism.NucleicAcidsRes42,D1091-1097,doi:10.1093/nar/gkt1068(2014).
30 Sterling,T.&Irwin,J.J.ZINC15--LigandDiscoveryforEveryone.JChemInfModel55,2324-2337,doi:10.1021/acs.jcim.5b00559(2015).
31 Tomasulo,P.ChemIDplus-supersourceforchemicalanddruginformation.MedRefServQ21,53-59,doi:10.1300/J115v21n01_04(2002).
32 Davis,A.P.etal.TheComparativeToxicogenomicsDatabase:update2017.NucleicAcidsRes45,D972-D978,doi:10.1093/nar/gkw838(2017).
33 Bairoch,A.TheCellosaurus:acelllineknowledgeresource.http://web.expasy.org/cellosaurus/.
34 Smith,B.etal.TheOBOFoundry:coordinatedevolutionofontologiestosupportbiomedicaldataintegration.NatBiotechnol25,1251-1255,doi:10.1038/nbt1346(2007).
35 Whetzel,P.L.etal.BioPortal:enhancedfunctionalityvianewWebservicesfromtheNationalCenterforBiomedicalOntologytoaccessanduseontologiesinsoftwareapplications.NucleicAcidsRes39,W541-545,doi:10.1093/nar/gkr469(2011).
36 Jupp,S.,Burdett,T.,Leroy,C.&Parkinson,H.E.inSWAT4LS.118-119.
37 NationalCenterforBiotechnologyInformation.PubChemCompoundDatabase;CID=77999,https://pubchem.ncbi.nlm.nih.gov/compound/77999(accessedJan.3,2018).
38 CHEBI:50122,https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:50122(accessed:Jan.3,2017).
39 Warr,W.A.Representationofchemicalstructures.WileyInterdisciplinaryReviews:ComputationalMolecularScience1,557-579,doi:10.1002/wcms.36(2011).
40 Chambers,J.etal.UniChem:aunifiedchemicalstructurecross-referencingandidentifiertrackingsystem.JCheminform5,3,doi:10.1186/1758-2946-5-3(2013).
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
22
41 Samplemeta-dataareavailableintheBioSamplesdatabase(http://www.ebi.ac.uk/biosamples)underaccessionnumbersSAMEA862998.
42 ZINC15Subsets,http://zinc15.docking.org/substances/subsets/.
43 NationalCenterforBiotechnologyInformation.PubChemBioAssayDatabase;AID=47506,https://pubchem.ncbi.nlm.nih.gov/bioassay/47506(accessedJan.3,2018).
44 MicroarraydataareavailableintheArrayExpressdatabase(http://www.ebi.ac.uk/arrayexpress)underaccessionnumberE-GEOD-5679.
45 Samplemeta-dataareavailableintheBioSamplesdatabase(http://www.ebi.ac.uk/biosamples)underaccessionnumbersSAMEG19226.
46 Bard,J.B.&Rhee,S.Y.Ontologiesinbiology:design,applicationsandfuturechallenges.NatRevGenet5,213-222,doi:10.1038/nrg1295(2004).
47 Ong,E.etal.Ontobee:Alinkedontologydataservertosupportontologytermdereferencing,linkage,queryandintegration.NucleicAcidsRes45,D347-D352,doi:10.1093/nar/gkw918(2017).
48 Hastings,J.etal.Thechemicalinformationontology:provenanceanddisambiguationforchemicaldataonthebiologicalsemanticweb.PLoSOne6,e25513,doi:10.1371/journal.pone.0025513(2011).
49 Bandrowski,A.etal.TheOntologyforBiomedicalInvestigations.PLoSOne11,e0154556,doi:10.1371/journal.pone.0154556(2016).
50 Sarntivijai,S.etal.CLO:Thecelllineontology.JBiomedSemantics5,37,doi:10.1186/2041-1480-5-37(2014).
51 Abeyruwan,S.etal.EvolvingBioAssayOntology(BAO):modularization,integrationandapplications.JBiomedSemantics5,S5,doi:10.1186/2041-1480-5-S1-S5(2014).
52 Unitsofmeasurement(UO),https://code.google.com/archive/p/unit-ontology/.
53 Mungall,C.J.,Torniai,C.,Gkoutos,G.V.,Lewis,S.E.&Haendel,M.A.Uberon,anintegrativemulti-speciesanatomyontology.GenomeBiol13,R5,doi:10.1186/gb-2012-13-1-r5(2012).
54 Ison,J.etal.EDAM:anontologyofbioinformaticsoperations,typesofdataandidentifiers,topicsandformats.Bioinformatics29,1325-1332,doi:10.1093/bioinformatics/btt113(2013).
55 Malone,J.etal.ModelingsamplevariableswithanExperimentalFactorOntology.Bioinformatics26,1112-1118,doi:10.1093/bioinformatics/btq099(2010).
56 Harland,L.etal.Empoweringindustrialresearchwithsharedbiomedicalvocabularies.DrugDiscovToday16,940-947,doi:10.1016/j.drudis.2011.09.013(2011).
57 Williams,A.J.,Ekins,S.&Tkachenko,V.Towardsagoldstandard:regardingqualityinpublicdomainchemistrydatabasesandapproachestoimprovingthesituation.DrugDiscovToday17,685-701,doi:10.1016/j.drudis.2012.02.013(2012).
58 Perez-Riverol,Y.etal.DiscoveringandlinkingpublicomicsdatasetsusingtheOmicsDiscoveryIndex.NatBiotechnol35,406-409,doi:10.1038/nbt.3790(2017).
59 Pletnev,I.etal.InChIKeycollisionresistance:anexperimentaltesting.JCheminform4,39,doi:10.1186/1758-2946-4-39(2012).
60 Hardy,B.etal.Collaborativedevelopmentofpredictivetoxicologyapplications.JCheminform2,7,doi:10.1186/1758-2946-2-7(2010).
61 Cumming,J.G.,Davis,A.M.,Muresan,S.,Haeberlein,M.&Chen,H.Chemicalpredictivemodellingtoimprovecompoundquality.NatRevDrugDiscov12,948-962,doi:10.1038/nrd4128(2013).
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
23
62 Patlewicz,G.,Jeliazkova,N.,Safford,R.J.,Worth,A.P.&Aleksiev,B.AnevaluationoftheimplementationoftheCramerclassificationschemeintheToxtreesoftware.SARQSAREnvironRes19,495-524,doi:10.1080/10629360802083871(2008).
63 Liechti,R.etal.SourceData:asemanticplatformforcuratingandsearchingfigures.NatMethods14,1021-1022,doi:10.1038/nmeth.4471(2017).
64 Beagrie,N.H.,J.TheValueandImpactoftheEuropeanBioinformaticsInstitute,FullReport.(2016).
65 Kim,S.,Thiessen,P.A.,Bolton,E.E.&Bryant,S.H.PUG-SOAPandPUG-REST:webservicesforprogrammaticaccesstochemicalinformationinPubChem.NucleicAcidsRes43,W605-611,doi:10.1093/nar/gkv396(2015).
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint
24
CompetinginterestsTheauthorsdeclarethattheyhavenocompetinginterests.
AcknowledgementsSIandJSwerepartiallysupportedbySwissCentreofAppliedHumanToxicology(SCAHT).
.CC-BY-NC-ND 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted January 17, 2018. . https://doi.org/10.1101/248898doi: bioRxiv preprint