Post on 06-Aug-2015
Enabling knowledge management in the Agronomic Domain
Pierre Larmande, Institute of Computational Biology (IBC)
Pierre.larmande@ird.fr
Outline
• Data integration challenges in the Life Sciences
• Ontologies / Semantic Web Technologies
• Agronomic Linked Data project
• NGS Big Data project
Data landscape in the Life Sciences
• The availability of biological data has increased
• Advancements in:
  • computational biology
  • genome sequencing
  • high-throughput technologies
• Integrative approaches are necessary to understand the functioning of biological systems
• A lack of effective approaches to integrate data has created a gap between data and knowledge
• Need for an effective method to bridge the gap between data and its underlying meaning
• Harvest the power of overlaying different data sets
Data integration challenges
• Today’s Web content is suitable for human consumption
• Collection of documents
  • links establish connections between documents
• Low on data interoperability and lacks semantics.
Today’s Web
• Ontologies are formal representations of knowledge: definitions of concepts, their attributes and the relations between them.
• Integrating data and improving machine interoperability and data analysis require a conceptual scaffold.
• Ontological terms used across databases
  • provide cross-domain common entry points in the description.
• An array of ontologies is being used to bring structured integration of various datasets.
• The Open Biomedical Ontologies (OBO) initiative:
  • serves as an umbrella for well-structured, orthogonal ontologies.
  • Ontologies are represented in OBO format and OWL
Ontologies
Semantic Web Technology
• An extension of the current Web technologies.
• Enables navigation and meaningful use of digital resources.
• Supports aggregation and integration of information from diverse sources.
• Based on common and standard formats.
Resource Description Framework (RDF)
• Framework for representing information about resources on the Web
• Provides a labeled connection between two resources
• Uses Uniform Resource Identifiers (URIs)
• Statements take the form of triples:
Subject Predicate Object
<Gene_A> <codes_for> <Protein_A>
RDF Triple
• Combining the triples results in a directed, labeled graph.
[Figure: example graph with nodes <Gene_A>, <Protein_A>, <BP_A>, <MF_A>, <Gene_X> and edge labels <has_function>, <regulates>]
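The idea that triples combine into a directed, labeled graph can be sketched in a few lines of plain Python. This is only an illustration: the `codes_for` triple comes from the slide, while the way the other nodes are connected and the helper name `build_graph` are assumptions made for the example.

```python
from collections import defaultdict

# (subject, predicate, object) triples. Only the first one is stated on the
# slide; the other two connections are assumed for illustration.
triples = [
    ("Gene_A", "codes_for", "Protein_A"),
    ("Protein_A", "has_function", "MF_A"),
    ("Gene_X", "regulates", "Gene_A"),
]


def build_graph(triples):
    """Index triples as subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(triples)

# Each edge is directed (subject to object) and labeled by its predicate.
for subject, edges in graph.items():
    for predicate, obj in edges:
        print(f"<{subject}> --{predicate}--> <{obj}>")
```

A real RDF store would use full URIs rather than short names, but the graph shape is the same.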
SPARQL
• Language for querying RDF models (graphs)
• Powerful, flexible
• Its syntax is similar to that of SQL
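To give a feel for how a SPARQL query matches a graph, here is a toy triple-pattern matcher in plain Python. It is not a SPARQL engine and handles a single pattern only; the `match` function, the `Gene_B` triple and the short names are hypothetical and exist just for the example. The equivalent SPARQL query is shown in the comment.

```python
# Equivalent SPARQL (short names stand in for full URIs):
#   SELECT ?protein WHERE { <Gene_A> <codes_for> ?protein . }

triples = [
    ("Gene_A", "codes_for", "Protein_A"),
    ("Gene_B", "codes_for", "Protein_B"),  # hypothetical extra triple
]


def match(pattern, triples):
    """Return variable bindings for one triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results


print(match(("Gene_A", "codes_for", "?protein"), triples))
# [{'?protein': 'Protein_A'}]
```

A full query engine joins the bindings of several such patterns, which is where the SQL-like flexibility comes from.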
Agronomic Linked Data (AgroLD)
• Semantic Web-based system that captures knowledge pertaining to plant data
• Aim:
  • Capability to answer complex real-life questions
  • Efficient information integration / retrieval
  • Easy extensibility
AgroLD – Phase I
• AgroLD will be developed in phases:
• Phase I: includes data on Oryza spp. and Arabidopsis thaliana
• SPARQL endpoint: http://bioportal.lirmm.fr:8081/test/
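A query could be sent to such an endpoint roughly as follows. This sketch only builds the HTTP request, following the standard SPARQL Protocol convention of passing the query in a `query` URL parameter. It assumes the address on the slide accepts that convention and is still reachable, neither of which is confirmed by the slides; the sample query and the `build_request` helper are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Address taken from the slide; it may be a web form rather than a protocol
# endpoint, and it may no longer be online.
ENDPOINT = "http://bioportal.lirmm.fr:8081/test/"

# Generic illustrative query: fetch any ten triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)

# Sending it requires network access:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```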
AgroLD
• Integrates information from:
  • Ontologies: Gene Ontology, Plant Ontology, Plant Trait Ontology, Plant Environment Ontology, Crop Ontology
  • Other information sources:
    • Gene/Protein information: GOA, Gramene, TAIR, OryzaBase, UniProt
    • QTL information: Gramene
    • Pathway information: RiceCyc, AraCyc
    • Germplasm information: Oryza Tag Line
    • Homology prediction: GreenPhyl
Information in AgroLD
[Figure: schema diagram. Node types: Biological Process, Cellular Component, Molecular Function, Interaction, Gene, Taxon, Protein, Protein Modification, Pathway, Target Gene. Relations: has_participant, contains, has_function, has_agent, acts_on, is_member_of, codes_for, occurs_in, has_source]
Knowledge representation in AgroLD
Future directions
• Phase II: wider and deeper coverage to promote comparative analysis
  • Include varied data types: gene expression data, protein-protein interaction, transcription factor-target gene
• Developing methods to aid the process of hypothesis generation, e.g. inference rules.
• Pluggable with workflow platforms, e.g. Galaxy, OpenAlea (VirtualPlants).
• Engage with biologists to mobilise ‘user-pull’:
  • Develop real-world use cases, e.g. studying the molecular mechanism of panicle differentiation in rice
Aims of Gigwa
• Manage genomic, transcriptomic and genotyping data resulting from NGS analyses
• Handle large VCF files to filter, query and extract data
• Provide a web user interface to make the system accessible for all
Main features
o Based on NoSQL technology
o Handles VCF (Variant Call Format) files and annotations
o Supports multiple variant types: SNPs, InDels, SSRs, SVs
o Powerful genotyping queries
o Easily scalable with MongoDB sharding
o Transparent access
Demo – datatype selection
Several types of selection: Variant search is restricted to the selected reference sequences and subset of individuals
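The kind of restriction described here (selected reference sequences plus a subset of individuals) can be sketched on raw VCF lines in plain Python. This is not how Gigwa works internally, since the slides say it stores data in MongoDB; it is a minimal in-memory illustration. The sample data, the individual names and the `select_variants` function are made up, and the code assumes the genotype (GT) is the first sub-field of each sample column, as the VCF specification requires when GT is present.

```python
# Tiny made-up VCF content: one meta line, the column header, two variants.
VCF_LINES = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "ind1", "ind2", "ind3"]),
    "\t".join(["chr1", "100", ".", "A", "G", "50", "PASS", ".", "GT",
               "0/0", "0/1", "1/1"]),
    "\t".join(["chr2", "200", ".", "C", "T", "60", "PASS", ".", "GT",
               "0/1", "0/0", "0/0"]),
]


def select_variants(lines, sequences, individuals):
    """Yield (sequence, position, genotypes) for the selected sequences,
    keeping only the genotypes of the selected individuals."""
    samples = []
    for line in lines:
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        if fields[0] not in sequences:
            continue
        # Assumption: GT is the first colon-separated sub-field.
        genotypes = {name: value.split(":")[0]
                     for name, value in zip(samples, fields[9:])}
        yield (fields[0], int(fields[1]),
               {name: genotypes[name] for name in individuals if name in genotypes})


rows = list(select_variants(VCF_LINES, {"chr1"}, ["ind1", "ind3"]))
for row in rows:
    print(row)
```

For files too large to scan this way, an indexed store such as the MongoDB back end mentioned above is the more realistic choice.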