Post on 23-Jan-2018
Query Federation over Biomedical Linked Open Data
Applications in Pharmacovigilance and Question-Answering
Maulik R. Kamdar, Musen Lab
Pre-proposal Talk, August 9, 2017
The data and knowledge discovery bottleneck
Biomedical queries:
• What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?
• List molecular characteristics of the antineoplastic drugs that target EGFR and have Mol. Wt < 300 g/mol.
[Figure: Desirable Drugs, Molecular characteristics, Protein Targets, Downstream Genes…]
Biomedical Informatics Research Methods
Open PHACTS. Williams, et al. Drug Discovery Today, 2012
Post-marketing surveillance for detecting drug-drug interactions and the adverse reactions
Jane P. F. Bai and Darrell R. Abernethy. Annual Review of Pharmacology and Toxicology 53 (2013)
Mechanism-based prediction
Jane P. F. Bai and Darrell R. Abernethy. Annual Review of Pharmacology and Toxicology 53 (2013)
Isolated databases and knowledge bases
DISTRIBUTED DATA and KNOWLEDGE
• Formats (XML, CSV, MySQL Database, etc.)
• Entity Notations (Ensembl, Entrez, HGNC, etc.)
• Schemas (Small Compound, Compound, etc.)
A novel data integration method to tackle the challenges of inconsistencies, incompleteness and heterogeneity across data and knowledge sources
Semantic Web Technologies
Berners-Lee, Scientific American 2001
Tim Berners-Lee: The next Web of open, linked data (TED Talk 2009)
RDF: Publishing data as a graph
Resource Description Framework (RDF), with nodes named by Uniform Resource Identifiers
Gleevec (Mol. Wt.: 589.25 g/mol, Half-Life: 18 hours) inhibits PDGFR, involved in signal transduction.
[Figure: RDF graph of the sentence above. Nodes: Gleevec (KEGG:D01441, http://bio2rdf.org/kegg:D01441), Gleevec (DrugB:DB00619, http://bio2rdf.org/drugbank:DB00619), PDGFR, GO:0007165 (Signal Transduction), 589.25, “18 hours”. Edge labels: mol_weight, half-life, x-ref, target, name, type (Inhibits), process.]
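SPARQL: Querying the graph
SPARQL Query Language
What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?
[Figure: the same graph as a query pattern. The drug nodes are replaced by variables (?), the mol_weight value by the condition < 1000, the half-life value by ?half-life and the target node by ?target; the type Inhibits and the process GO:0007165 (Signal Transduction) stay fixed.]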
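The graph and the query pattern can be sketched in a few lines of Python. This is only an illustration: the triples are a hand-written reading of the Gleevec example, the blank node `_:t1` and the prefixed names are assumptions, and the matcher is a naive stand-in for a real SPARQL engine.

```python
# Toy triples for the Gleevec example; node and predicate names are
# illustrative assumptions, not the exact Bio2RDF vocabulary.
TRIPLES = [
    ("kegg:D01441", "mol_weight", 589.25),
    ("kegg:D01441", "target", "_:t1"),
    ("_:t1", "name", "PDGFR"),
    ("_:t1", "type", "Inhibits"),
    ("_:t1", "process", "GO:0007165"),
    ("kegg:D01441", "x-ref", "drugbank:DB00619"),
    ("drugbank:DB00619", "half-life", "18 hours"),
]

def is_var(term):
    """Variables are strings that start with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")

def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if is_var(term):
                if term in candidate and candidate[term] != value:
                    ok = False
                    break
                candidate[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)

QUERY = [
    ("?drug", "mol_weight", "?mw"),
    ("?drug", "target", "?t"),
    ("?t", "type", "Inhibits"),
    ("?t", "process", "GO:0007165"),
    ("?drug", "x-ref", "?xref"),
    ("?xref", "half-life", "?half_life"),
]

# The numeric condition plays the role of a SPARQL FILTER.
results = [b["?half_life"] for b in match(QUERY, TRIPLES) if b["?mw"] < 1000]
print(results)
```

Each tuple in `QUERY` corresponds to one edge of the query figure, and shared variables such as `?drug` join the edges together.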
Life Sciences Linked Open Data (LSLOD) Cloud
Saleem, M., Kamdar, MR. et al., Journal of Web Semantics 2014.
Callahan, A., et al., ISWC 2013.
Ebola Virus knowledge base queries the LSLOD cloud
Maulik R. Kamdar and Michel Dumontier. An Ebola Virus-centered Knowledge Base. Database (2015)
Querying statistics:
• 40+ Linked Data Sources
• 10,000+ classes, object and data properties
• 30,000+ edges
Automated querying across the LSLOD cloud
Initial insights:
• Minimal sharing of common vocabularies and ontologies
• Hub nodes:
  • rdfs:label
  • dc:title
  • bio2rdf:identifier
Maulik R. Kamdar, et al. (manuscript in preparation)
Mining the LSLOD cloud is not as easy as it seems…
• Isolated SPARQL endpoints or RDF dumps (with varying support for SPARQL operators)
• Different URI notations, with no explicit x-ref links
  • http://bio2rdf.org/uniprot:P45059
  • http://purl.uniprot.org/uniprot/P45059
• Malformed URIs, unavailable SPARQL endpoints, etc.
  • http://bio2rdf.org/kegg:map00010 versus http://bio2rdf.org/kegg:00010
  • http://bio2rdf.org/go:0030307\”
• Heterogeneity in the LSLOD cloud datasets
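One way to see the URI notation problem is to reduce each URI to a (namespace, identifier) key before comparing entities. The sketch below assumes only the two URI layouts shown in the bullets above; the patterns and the `normalize` helper are hypothetical and would need extending for other sources.

```python
import re

# Assumed URI layouts, taken from the example URIs in the bullets above.
PATTERNS = [
    re.compile(r'^http://bio2rdf\.org/(?P<ns>[a-z0-9_]+):(?P<id>[^\s"\\]+)'),
    re.compile(r'^http://purl\.uniprot\.org/(?P<ns>uniprot)/(?P<id>[^\s"\\]+)'),
]

def normalize(uri):
    """Return a (namespace, identifier) key, or None for unrecognized URIs."""
    for pattern in PATTERNS:
        found = pattern.match(uri.strip())
        if found:
            return (found.group("ns"), found.group("id"))
    return None

# Two notations for the same UniProt entry collapse to one key.
print(normalize("http://bio2rdf.org/uniprot:P45059"))
print(normalize("http://purl.uniprot.org/uniprot/P45059"))
```

Trailing debris such as the stray `\”` in the malformed GO URI is dropped because the identifier pattern stops at quotes and backslashes. Differences inside the identifier itself, such as `map00010` versus `00010`, are not reconciled by this sketch.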
Inconsistent attribute values: Different actual values and data types
[Figure: Gleevec has molecular-weight 493.61 in one source (clinical features) and mol_weight 589.25 in the other (biological features).]
Name                  Mol. Formula (KEGG)         Mol. Formula (DrugBank)
Lepirudin             C287H440N80O111S6           C287H440N80O110S6
Pyridoxal Phosphate   C8H10NO6P.H2O               C8H10NO6P
Cevimeline            (C10H17NOS)2.2HCl.H2O       C10H17NOS
Cisplatin             PtCl2.2NH3                  Cl2H4N2Pt
Sodium bicarbonate    NaHCO3                      CHNaO3
Incompleteness: Completely unique entities across sources
E1: Drug
Findings similar to “Will the correct drugs please stand up?” - Southan et al. GCC 2016
Heterogeneity in the LSLOD cloud
• Inconsistent attribute values for entities
• Incomplete entities
• Incomplete relations between entities
• Inconsistent URI labels for classes, relations and attributes
Label mismatch: Different labels for classes, relations and attributes
[Figure: Gleevec molecular-weight 493.61 (clinical features) versus Gleevec mol_weight 589.25 (biological features).]
Uniform Resource Identifier (URI)                                 Parsed Label
http://bio2rdf.org/drugbank_vocabulary:Molecular-Weight           MolecularWeight
http://www.biopax.org/release/biopax-level3.owl#molecularWeight   MolecularWeight
http://semanticscience.org/resource/CHEMINF_000198                molecular weight calculated by pipeline pilot
http://mo-ld.org/mine_vocabulary:hasMolecularWeight               HasMolecularWeight
http://bio2rdf.org/kegg_vocabulary:mol_weight                     MolWeight
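The parsed labels in the table can be reproduced from the URIs with a small helper. This is a sketch of one plausible parsing rule, not necessarily the one used for the talk; `parse_label` is a hypothetical name.

```python
import re

def parse_label(uri):
    """Derive a CamelCase label from the last segment of a URI."""
    # Keep the text after the last '#', '/' or ':'.
    local = re.split(r"[#/:]", uri)[-1]
    # Split on case changes, hyphens and underscores.
    words = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", local)
    return "".join(word[0].upper() + word[1:] for word in words)

for uri in [
    "http://bio2rdf.org/drugbank_vocabulary:Molecular-Weight",
    "http://www.biopax.org/release/biopax-level3.owl#molecularWeight",
    "http://mo-ld.org/mine_vocabulary:hasMolecularWeight",
    "http://bio2rdf.org/kegg_vocabulary:mol_weight",
]:
    print(parse_label(uri))
```

Opaque identifiers such as CHEMINF_000198 carry no readable name in the URI, so their label has to come from the source itself rather than from parsing. Even after parsing, MolecularWeight, HasMolecularWeight and MolWeight remain three distinct strings, which is the mismatch the slide points at.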
Heterogeneity in the LSLOD cloud (continued)
• Inconsistent graph patterns for SPARQL queries
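Model mismatch: Different graph patterns to capture granularity
[Figure: one source links Gleevec to PDGFR with a single drug-target edge (clinical features); the other links Gleevec through a target edge to a node with name PDGFR, type Inhibits and source PubMed:21152856 (biological features).]
Graph patterns found in other sources:
• E1 <--drug-- gene-drug-Association --gene--> E2
• E1 <--chemical-- Chemical-Gene-Association --gene--> E2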
And many other problems…
Maulik R. Kamdar, et al. (manuscript in preparation)
Data warehousing: Transforming data under one uniform schema and uniform notations
WAREHOUSING
Open PHACTS (Williams, 2012), Data Graphs
✓ Efficient query execution
✓ Complete results
✗ Data copies
✗ Inflexible, not scalable
Query federation: Rewriting and executing queries across different sources
QUERY FEDERATION
What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?
[Figure: the full query (Drug; molecular-weight < 1000; target; process = “GO:0007165”; half-life) shown with two source-specific sub-queries: (Drug; molecular-weight < 1000; target; half-life) and (Drug; molecular-weight < 1000; target; process = “GO:0007165”).]
Schwarte, et al. ISWC 2012
Label mismatch makes simplistic query federation difficult…
[Figure: Gleevec molecular_weight 493.61 (clinical features) versus Gleevec mol_weight 589.25 (biological features).]
Mapping source schemas to an ontology
Callahan, et al. Journal of Biomedical Semantics 2013
SemanticScience Integrated Ontology: Chemical Entity, Protein and Process, linked by isParticipantIn
Q: Chemicals that participate in the same processes as Saccharomyces proteins
• Saccharomyces Genome Database: U2AF1 (Protein) hasAnnotation RNA Splicing (GO_Code)
• Comparative Toxicogenomics Database: Vinclozolin (Chemical) participates RNA Splicing (Biological Process)
Using ontology mapping rules for query rewriting
What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?
Query:
  ?s a <Drug>  ?s <hasMolWt> ?mw  ?s <hasTarget> ?protein  ?s <hasHalfLife> ?hl  ?mw < 1000  ?protein <hasGO> <GO:0007165>
Rewritten queries:
  ?s a <Drug>  { ?s <molecular-weight> ?mw }  ?s <drug-target> ?protein  { ?s <half-life> ?hl }  ?mw < 1000
  ?s a <Drug>  ?s <mol_wt> ?mw  { ?s <target> ?protein }  ?protein <hasGO> <GO:0007165>
Mapping rules:
  ?Drug hasTarget ?Protein
    ?Drug DrugBank:drug-target ?Protein
    ?Drug KEGG:target ?Protein
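A label-level mapping rule amounts to a lookup from a model predicate to one predicate per source. The sketch below is a minimal illustration under that assumption; the dictionary layout and the `rewrite` helper are hypothetical, and only the predicate names come from the mapping rule above.

```python
# One model relation mapped to one predicate per source (from the slide).
MAPPING_RULES = {
    "hasTarget": {"DrugBank": "drug-target", "KEGG": "target"},
}

def rewrite(pattern, source):
    """Rewrite a (subject, predicate, object) pattern for one source.

    Returns None when the source has no mapping for the predicate.
    """
    subject, predicate, obj = pattern
    local = MAPPING_RULES.get(predicate, {}).get(source)
    return (subject, local, obj) if local else None

model_pattern = ("?Drug", "hasTarget", "?Protein")
print(rewrite(model_pattern, "DrugBank"))
print(rewrite(model_pattern, "KEGG"))
```

Because each rule swaps exactly one predicate for another, the rewritten pattern always has the same shape as the model pattern, whatever the source graph actually looks like.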
This does not solve the model mismatch problem…
[Figure: Gleevec drug-target PDGFR (clinical features) versus Gleevec target a node with name PDGFR, type Inhibits and source PubMed:21152856 (biological features).]
Mapping rules:
  ?Drug hasTarget ?Protein
    ?Drug DrugBank:drug-target ?Protein
    ?Drug KEGG:target ?Protein
Query results: PDGFR
Proposal: Mapping graph patterns to a model
[Figure: LSLOD cloud source schemas (target/name, drug-target, mol_weight, molecularWeight/value) are expressed as graph patterns, which are mapped to a model.]
Proposal: Using graph patterns for query rewriting
What are the half-lives of drugs that have Mol. Wt < 1000 g/mol and inhibit proteins involved in signal transduction?
Query:
  ?s a <Drug>  ?s <hasMolWt> ?mw  ?s <hasTarget> ?protein  ?s <hasHalfLife> ?hl  ?mw < 1000 g/mol  ?protein <hasGO> <GO:0007165>
Rewritten queries:
  ?s a <Drug>  { ?s <molecular-weight> ?mw }  ?s <drug-target> ?protein  { ?s <half-life> ?hl }  ?mw < 1000 g/mol
  ?s a <Drug>  ?s <mol_wt> ?mw  { ?s <target> ?protein_blank  ?protein_blank <link> ?protein }  ?protein <hasGO> <GO:0007165>
Mapping rules:
  ?Drug hasTarget ?Protein
    ?Drug DrugBank:drug-target ?Protein
    ?Drug KEGG:target ?blank  KEGG:link ?Protein
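The proposal can be sketched by letting a mapping rule expand one model triple into a list of source triples, with fresh variables for intermediate nodes. This is an illustration only: the rule table, the `?s`/`?o` placeholders and the `rewrite` helper are assumptions, while the DrugBank and KEGG patterns follow the mapping rules above.

```python
from itertools import count

# Each model relation maps to a graph pattern (a list of triples) per source.
# "?s" and "?o" are placeholders for the subject and object of the model triple.
GRAPH_PATTERN_RULES = {
    "hasTarget": {
        "DrugBank": [("?s", "DrugBank:drug-target", "?o")],
        "KEGG": [("?s", "KEGG:target", "?blank"), ("?blank", "KEGG:link", "?o")],
    },
}
_fresh = count()  # used to keep intermediate variables unique per rewrite

def rewrite(pattern, source):
    """Expand one model triple into the graph pattern of the given source."""
    subject, predicate, obj = pattern
    template = GRAPH_PATTERN_RULES[predicate][source]
    suffix = next(_fresh)
    rename = {"?s": subject, "?o": obj}
    expanded = []
    for triple in template:
        expanded.append(tuple(
            rename.get(term, f"{term}_{suffix}" if term.startswith("?") else term)
            for term in triple
        ))
    return expanded

model_pattern = ("?Drug", "hasTarget", "?Protein")
print(rewrite(model_pattern, "DrugBank"))
print(rewrite(model_pattern, "KEGG"))
```

The KEGG rewrite yields two triples joined on a fresh intermediate variable, which is what lets the rewritten query reach the protein behind the target node.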
PhLeGrA – Linked Graph Analytics in Pharmacology
[Figure: system architecture. Components: Queries, Data Model, Mapping Rules, Query Federation, Life Sciences Linked Open Data Cloud.]
Phlegra is a spider genus of the Salticidae family, commonly termed jumping spiders.
Input data model: Underlying mechanisms behind drug-adverse reaction associations
[Figure: Enzyme, Drug 1 and Drug 1 (Inactive State); then Drug 2 (Targets Enzyme) and Drug 1 (Increased Toxicity).]
J. Jia et al. Nature Reviews Drug Discovery, 2009.
Input data model
Concept
E1   Drug
E2   Protein
E3   Pathway
E4   Adverse Drug Reaction
Relation
R1   Drug hasTarget Protein
R2   Drug hasEnzyme Protein
R3   Drug hasTransporter Protein
R4   Protein isPresentIn Pathway
R5   Pathway isImplicatedIn ADR
Input mapping rules: Graph patterns mapped to Drug hasTarget Protein
Graph patterns, one per source:
• E1 <--drug-- Target-Relation --target--> E2
• E1 <--drug-- gene-drug-Association --gene--> E2
• E1 --target--> _:blank --link--> E2
• E1 <--chemical-- Chemical-Gene-Association --gene--> E2
[Figure: example of the third pattern. Gleevec target a blank node with link PDGFR, type Inhibits and source PubMed:21152856.]
A k-partite network can be generated as output
[Figure: system architecture. Components: Queries, Data Model, Mapping Rules, Query Federation, Life Sciences Linked Open Data Cloud; output node types: Drug, Protein, Pathway, Adverse Reaction.]
Maulik R. Kamdar and Mark A. Musen. PhLeGrA: Graph Analytics in Pharmacology over the Web of Life Sciences Linked Open Data Cloud. International Conference on World Wide Web (WWW) (2017)
• Entities and relations from 4 different sources are retrieved to create the k-partite network.
• This k-partite network is generated in < 1 day.
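A k-partite network keeps each node in exactly one typed partition and only allows edges between partitions. The sketch below assembles one from typed relation records and enumerates Drug-to-ADR paths. Apart from Gleevec and PDGFR, the identifiers are placeholders, and the record layout and helper names are assumptions rather than PhLeGrA's actual data structures.

```python
from collections import defaultdict

# Hypothetical query results, typed by the data model relations R1, R4 and R5.
# "pathway:example" and "adr:example" are placeholder identifiers.
RELATIONS = [
    ("Drug", "Gleevec", "hasTarget", "Protein", "PDGFR"),
    ("Protein", "PDGFR", "isPresentIn", "Pathway", "pathway:example"),
    ("Pathway", "pathway:example", "isImplicatedIn", "ADR", "adr:example"),
]

def build_k_partite(relations):
    """Group nodes into typed partitions and collect the edges between them."""
    partitions = defaultdict(set)
    edges = set()
    for src_type, src, relation, dst_type, dst in relations:
        if src_type == dst_type:
            raise ValueError("a k-partite network has no edges inside a partition")
        partitions[src_type].add(src)
        partitions[dst_type].add(dst)
        edges.add((src, relation, dst))
    return dict(partitions), edges

def paths_from(edges, start, hops):
    """Enumerate node paths with exactly `hops` edges starting at `start`."""
    if hops == 0:
        return [[start]]
    found = []
    for src, _, dst in edges:
        if src == start:
            found.extend([start] + tail for tail in paths_from(edges, dst, hops - 1))
    return found

partitions, edges = build_k_partite(RELATIONS)
print(sorted(partitions))
print(paths_from(edges, "Gleevec", 3))
```

Each three-hop path from a drug to an adverse reaction is one candidate mechanism of the kind the data model describes.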
A graph analytics module to rank the mechanisms
[Figure: system architecture. Components: Queries, Data Model, Mapping Rules, Query Federation, Life Sciences Linked Open Data Cloud, Graph Analytics Module; node types: Drug, Protein, Pathway, Adverse Reaction.]
Implementing a network-based apriori algorithm
• Inputs – Outcomes database:
  – US FDA Adverse Event Reporting System (FAERS): 2013-2015
  – 3 million case reports with Drugs, Adverse Reactions, Indications, Doses etc.
• Association: {Drug}n --> ADR
  – Filtering nodes and paths based on the Support statistic.
  – Predicting if an association exists based on the Network-based Relative Reporting Ratio statistic.
  – Ranking underlying mechanisms based on the Confidence statistic.
Harpaz, et al. 2010, Inokuchi, et al. 2000
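The three statistics can be illustrated on a toy set of case reports. The sketch uses the standard report-count definitions of support, confidence and the relative reporting ratio (observed over expected co-reporting). It does not implement the network-based variant from the talk, whose exact form is not given on the slide, and the reports are invented for illustration.

```python
# Toy case reports: (set of drugs, set of adverse reactions). Invented data.
REPORTS = [
    ({"drugA", "drugB"}, {"adr1"}),
    ({"drugA"}, {"adr1"}),
    ({"drugA"}, set()),
    ({"drugB"}, {"adr2"}),
]

def association_stats(drugs, adr, reports):
    """Return (support, confidence, relative reporting ratio) for {drugs} --> adr."""
    drugs = set(drugs)
    n = len(reports)
    n_drugs = sum(1 for d, _ in reports if drugs <= d)   # reports with all drugs
    n_adr = sum(1 for _, a in reports if adr in a)       # reports with the ADR
    n_both = sum(1 for d, a in reports if drugs <= d and adr in a)
    support = n_both / n
    confidence = n_both / n_drugs if n_drugs else 0.0
    # Observed co-reports divided by the count expected under independence.
    rrr = (n_both * n) / (n_drugs * n_adr) if n_drugs and n_adr else 0.0
    return support, confidence, rrr

print(association_stats({"drugA"}, "adr1", REPORTS))
```

A ratio above 1 means the drug set and the reaction are reported together more often than independence would predict, which is the usual signal criterion for this family of measures.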
Validation of the approach
• “Silver” standard validation sets:
  – Observational Medical Outcomes Partnership (OMOP) dataset
  – Exploring and Understanding Adverse Drug Reactions (EU-ADR) dataset
  – Drugs.com and MediSpan Drug-drug interactions dataset (Iyer, et al. 2014)
• Baseline methods:
  – Bayesian Confidence Propagation Neural Network (BCPNN)
  – Gamma Poisson Shrinkage (GPS)
Dataset       Unique Drugs   Unique ADRs   Positive Assoc.   Negative Assoc.
OMOP          155            4             137               158
EU-ADR        59             9             44                39
Iyer, et al.  252            9             315               288
Preliminary results show comparable performance
Dataset       BCPNN   GPS    Network-based RRR
OMOP          0.70    0.70   0.72
EU-ADR        0.75    0.76   0.78
Iyer, et al.  0.81    0.83   0.82
Maulik R. Kamdar and Mark A. Musen. Mechanism-based Pharmacovigilance over the Life Sciences Linked Open Data Cloud. American Medical Informatics Association (AMIA) Annual Symposium (2017)
The story so far…
• Comparable performance with existing baseline methods used to detect signals in US FAERS datasets for pharmacovigilance.
• Event-specific thresholds can lead to an AUROC statistic > 0.75 for more than 146 adverse reactions.
• Mechanism-based pharmacovigilance with confidence statistics for underlying mechanisms.
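For readers unfamiliar with the AUROC statistic mentioned above, it can be computed directly as the fraction of positive/negative pairs that a score ranks correctly. The sketch below uses that pairwise definition on invented scores; it is not the evaluation code behind the reported numbers.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Invented scores for four drug-ADR pairs; label 1 marks a known association.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise form is quadratic in the number of examples, which is acceptable for validation sets of a few hundred associations.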
Plans for dissertation
• I will perform a comparative evaluation of my pattern-based federation method with existing methods (FedX, SPLENDID) – Query complexity, query execution time, completeness.
• I will combine my past research on question-answering over the LSLOD cloud with the updated query federation method.
ReVeaLD: Real-time Visual Explorer and Aggregator of Linked Data
List molecular characteristics of the antineoplastic drugs that target EGFR and have Mol. Wt < 300 g/mol.
Maulik R. Kamdar, et al. ReVeaLD: A user-driven, domain-specific interactive search platform for biomedical research. Journal of Biomedical Informatics (2014)
https://www.youtube.com/watch?v=6HHK4ASIkJM
Plans for dissertation (continued)
• I will evaluate an approach to automate the generation of the mapping rules used by the query federation method.
Semi-automating generation of mapping rules
Approach to map graph patterns:
• Word Embeddings
• Graph Levenshtein Distance
• Instance-level Similarities
What I queried so far:
• 40+ Linked Data Sources
• 10,000+ classes, object and data properties
• 30,000+ edges
Plans for dissertation (continued)
• I plan to get domain-specific feedback from the PharmGKB team after the updated applications are deployed online.
Acknowledgments
Musen Lab
- Tania Tudorache
- Csongor Nyulas
- Matthew Horridge
- Simon Walk
- Rafael Gonçalves
- Josef Hardi
- Marcos Martinez
- Martin O’Connor
- John Graybeal
- Alex Screnchuk
And others…
BMI Students
Mark Musen, Russ Altman, Jure Leskovec, Michel Dumontier, Teri Klein, Rainer Winnenberg, Juan Banda, Erik Van Mulligen, Amrapali Zaveri, Stefan Decker, Mary Jeanne Oliva, Joan Menees, Ayla Akgul, Steve Bagley