Discovery from Linking Open Data (LOD) Annotated Datasets
Louiqa Raschid, University of Maryland
PAnG/PSL/ANAPSID/Manjal
Agenda
• Motivation
• Challenges
• Solution approaches
• Emergence of biological datasets in the cloud of Linked Data.
• Biological objects (e.g., genes or proteins) or clinical trials are annotated with controlled vocabulary terms from ontologies such as GO, MeSH, SNOMED, NCI Thesaurus.
• Links form a graph that captures meaningful knowledge.
• Sense making of annotation graphs can explain phenomena, identify anomalies and potentially lead to discovery.
Agenda
• Motivation
– Drug re-purposing
– Cross ontology patterns and literature imprint
– Cross genome analysis
• Challenges
• Solution approaches
Signature: set of mRNAs that increase or decrease in patients, where the change is significant w.r.t. the general population. Compute a similarity score in [-1, +1].
Of 16,000 pairings, 2,664 were significant (q < 0.05); half with an opposite relationship. 53 diseases had significant candidate therapeutic drug-disease relationships.
Sirota et al. Findings
• Efficacy (literature) for 2 drugs: topiramate and prednisolone.
• Evaluated efficacy of cimetidine (over gefitinib) for lung adenocarcinoma.
• Methodology does not provide avenues for explanation, validation or discovery.
Sirota et al.: Identified anomaly in this cluster
Limitations and Extensions
• Sirota et al.
– Anomaly in drug cluster but their methodology does not allow further investigation.
• Sims et al.
– Methodology is limited to co-occurrence analysis.
• Cannot exploit heterogeneous evidence from LOD sources.
• Cannot exploit knowledge in ontologies.
• Finding patterns in graph datasets and visualization and explanation.
Agenda
• Challenges
– Exploiting LOD to create datasets.
– Knowledge captured in ontologies.
– Similarity metrics/distances tuned for ontologies.
– Discovering and validating patterns in graphs.
– Literature imprint.
– Heterogeneous evidence.
– Reasoning with uncertainty.
Solution Approaches
• PAnG
• PSL
• Manjal
• ANAPSID
• Thanks to our collaborators / domain experts:
• Olivier Bodenreider, NLM, NIH
• Sherri de Coronado, NCI, NIH
• Andreas Thor, University of Leipzig
• Louiqa Raschid ++ at UMD
• Lise Getoor ++ at UMD
• Padmini Srinivasan ++ University of Iowa
• Maria Esther Vidal ++ Universidad Simon Bolivar
Solution approaches
• Integrated access for heterogeneous data sources: adaptive query processing for SPARQL endpoints
• The Arabidopsis Information Resource
• Gene Ontology
• ClinicalTrials
• Patterns in ANnotation Graphs
• PSL: Annotation computation by knowledge propagation
• PAnG: Pattern identification using dense subgraphs and graph summaries.
• Manjal – Text Mining for MEDLINE
• Annotation Visualizer – Visualize and explore annotations and patterns
Motivation: Gene Annotation Graphs
• Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms
• Prediction of new annotations as hypothesis for experiments
– Link prediction is predicting new functional annotations for a gene
Link Prediction Framework
• Dense Subgraph (optional)
– Focus on highly connected subgraphs
• Graph summarization:
– Identify basic pattern (structure) of the graph
• Link Prediction
– Predicted links reinforce underlying graph pattern
[Figure: link prediction pipeline. Tripartite Annotation Graph (TAG) → Dense Subgraph (optional filter, with distance restriction) → Graph Summarization (with cost model; produces graph summary) → Link Prediction (with scoring function) → ranked list of predicted links.]
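One way to read "predicted links reinforce underlying graph pattern" is sketched below in Python. It assumes the pattern is given as groups of nodes (super nodes) and links between groups (super edges), and proposes every node pair that a super edge implies but the graph lacks. The function name `predict_links` and the support-based score are hypothetical stand-ins for illustration; the slides do not specify the scoring function.

```python
from itertools import product

def predict_links(edges, super_nodes, super_edges):
    """Rank node pairs implied by a super edge but absent from the graph.

    Score (an assumption, not the slides' scoring function): the fraction
    of the super edge's node pairs that are already observed.
    """
    observed = set(edges)
    ranked = []
    for s, t in super_edges:
        pairs = set(product(super_nodes[s], super_nodes[t]))
        if not pairs:
            continue  # empty super node: nothing to predict
        support = len(pairs & observed) / len(pairs)
        for pair in pairs - observed:
            ranked.append((support, pair))
    # Highest support first; ties broken by pair name for a stable order.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked

# Toy data with made-up node names (terms t*, genes g*).
super_nodes = {"T": {"t1", "t2"}, "G": {"g1", "g2", "g3"},
               "U": {"t3"}, "H": {"g4", "g5"}}
super_edges = [("T", "G"), ("U", "H")]
edges = [("t1", "g1"), ("t1", "g2"), ("t1", "g3"),
         ("t2", "g1"), ("t2", "g2"), ("t3", "g4")]
ranked = predict_links(edges, super_nodes, super_edges)
```

Here the missing pair in the nearly complete T-G block outranks the missing pair in the sparser U-H block, which matches the intuition that a prediction is stronger when it completes a denser pattern.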
Dense Subgraph
• Motivation: graph area that is rich or dense with annotation is an “interesting region”
• Density of a subgraph = number of induced edges / number of vertices
• Tripartite graph with node set (A, B, C) is converted into bipartite graph with (A, C)
– Weighted edges = number of shared b’s
– Apply technique of [1]
• Distance restriction for DSG possible
– Hierarchically arranged ontology terms
– All node pairs of A and C are within a given distance
[1] Saha et al. Dense subgraphs with restrictions and applications to gene annotation graphs. RECOMB, 2010
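The projection and density definitions above can be sketched in a few lines of Python. This is an illustration only, not the algorithm of [1]: it shows how the (A, C) edge weights and the density of a candidate node set would be computed, not how the densest subgraph is found. The function names, the `weighted` option and the node names are assumptions, and node names in A and C are assumed not to collide.

```python
from collections import defaultdict
from itertools import product

def project_tripartite(ab_edges, bc_edges):
    """Collapse a tripartite graph (A, B, C) into a weighted bipartite
    graph on (A, C): weight(a, c) = number of b's linked to both."""
    b_to_a, b_to_c = defaultdict(set), defaultdict(set)
    for a, b in ab_edges:
        b_to_a[b].add(a)
    for b, c in bc_edges:
        b_to_c[b].add(c)
    weights = defaultdict(int)
    for b in b_to_a.keys() & b_to_c.keys():
        for a, c in product(b_to_a[b], b_to_c[b]):
            weights[(a, c)] += 1
    return dict(weights)

def density(weights, nodes, weighted=False):
    """Induced edges / number of vertices. With weighted=True the edge
    weights are summed instead of counted (an assumption)."""
    nodes = set(nodes)
    if not nodes:
        return 0.0
    induced = [w for (a, c), w in weights.items() if a in nodes and c in nodes]
    return (sum(induced) if weighted else len(induced)) / len(nodes)

# Toy tripartite graph with made-up node names.
ab_edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]
bc_edges = [("b1", "c1"), ("b2", "c1"), ("b2", "c2")]
weights = project_tripartite(ab_edges, bc_edges)
```

In this toy graph `a1` and `c1` share both `b1` and `b2`, so their projected edge has weight 2, while every other (A, C) pair shares only `b2`.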
Graph Summarization
• Minimum description length approach [2]
– Loss-free; employs cost model
• Graph summary = Signature + Corrections
• Signature: graph pattern / structure
– Super nodes = complete partitioning of nodes
– Super edges = edges between super nodes = all edges between nodes of super nodes
• Corrections: edges e between individual nodes
– Additions: e ∈ G but e ∉ signature
– Deletions: e ∉ G but e ∈ signature
[2] Navlakha et al. Graph summarization with bounded error. SIGMOD, 2008
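To make "loss-free" concrete, the following Python sketch rebuilds the original edge set from a signature plus corrections. It assumes edges are (term, gene) pairs and that the cost is simply the number of super edges plus the number of corrections; the function names and toy node names are hypothetical, and the cost model used in the talk may differ.

```python
from itertools import product

def expand_summary(super_nodes, super_edges, additions, deletions):
    """Rebuild the original edges: expand each super edge into all pairs
    between its two super nodes, drop deletions, then add additions."""
    edges = set()
    for s, t in super_edges:
        edges.update(product(super_nodes[s], super_nodes[t]))
    return (edges - set(deletions)) | set(additions)

def summary_cost(super_edges, additions, deletions):
    """Assumed cost model: one unit per super edge and per correction."""
    return len(super_edges) + len(additions) + len(deletions)

# Toy summary with made-up node names; super nodes partition all nodes.
super_nodes = {"T": {"t1", "t2"}, "G": {"g1", "g2", "g3"}, "U": {"t3"}}
super_edges = [("T", "G")]
additions = {("t3", "g1")}   # in G but not covered by the signature
deletions = {("t2", "g3")}   # implied by the signature but not in G

graph = expand_summary(super_nodes, super_edges, additions, deletions)
cost = summary_cost(super_edges, additions, deletions)
```

The six original edges are recovered exactly from one super edge and two corrections, so under the assumed cost model the summary costs 3 units against 6 for listing every edge.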
[Figure: graph summarization examples. Annotation graphs between Plant Ontology terms (PO_20030, PO_9006, PO_37, PO_20038) and genes (HY5, PHOT1, CIB5, CRY2, COP1, CRY1), each shown next to its equivalent graph summary with genes grouped into super nodes.]
DSG+GS
PSL
Distance metrics
Different retrieved sets of lung cancer related clinical trials
• Identify 100 clinical trials using the search keyword “lung cancer” in CONDITION. Retrieve CT, CONDITION and INTERVENTION. Create a dense subgraph (almost clique; highly connected subgraph). Create a graph summary to visualize the output.
• Retrieve 100 trials using “lung carcinoma” in the CONDITION field.
• Retrieve 100 trials using “lung carcinoma” in any field.
Retrieve 100 clinical trials using search keyword “lung cancer”. Create a dense subgraph (almost clique; highly connected subgraph). Create a graph summary to visualize the output.
100 clinical trials using search keyword “lung carcinoma” for CONDITION.
100 clinical trials using search keyword “lung carcinoma” for ALL FIELDS.
Questions?
PAnG/PSL/ANAPSID/Manjal