Post on 30-Dec-2015
1
Data Integration and Extraction over Molecular Biological Data
Cui Tao
supported by NSF
2
Motivation
Online biological data:
Highly diverse in granularity and variety
Various formats
Different terminologies, ID systems, units
3
How to Build a Gene Extraction Ontology?
Concepts
Relationship sets
Constraints
Data Frames
4
How to Build a Gene Extraction Ontology?
RNA sequence pattern: (G*A*U*C*)*
DNA sequence pattern: (G*A*T*C*)*
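The two expressions above accept any string over a four-letter nucleotide alphabet (U for RNA, T for DNA). The following is a minimal sketch of how such a recognizer might be used, not the deck's actual implementation. It assumes the equivalent character-class form `[GATC]+` / `[GAUC]+`, which avoids the nested quantifiers and rejects the empty string. The function name `classify_sequence` and the label "ambiguous" are hypothetical.

```python
import re

# Character-class equivalents of the slide's patterns.
# Assumption: at least one nucleotide is required (the slide's form also
# accepts the empty string).
DNA = re.compile(r"[GATC]+")
RNA = re.compile(r"[GAUC]+")


def classify_sequence(text: str) -> str:
    """Return 'DNA', 'RNA', 'ambiguous', or 'none' for a candidate string."""
    candidate = text.strip().upper()
    is_dna = DNA.fullmatch(candidate) is not None
    is_rna = RNA.fullmatch(candidate) is not None
    if is_dna and is_rna:
        # Contains neither T nor U, so both alphabets fit.
        return "ambiguous"
    if is_dna:
        return "DNA"
    if is_rna:
        return "RNA"
    return "none"
```

A string with neither T nor U fits both patterns, so a recognizer built on these expressions alone would need surrounding context (for example a nearby keyword) to decide between the two.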
5
Knowledge Sources
Gene Ontology: thousands of terms (Molecular Function, Biological Process, Cellular Component)
All Species Toolkit: 1,231,935 species names
Protein Databases: thousands of protein names
6
Extraction Rules
Statistical NLP
Machine learning
Naïve Bayes
Hidden Markov Models
Decision Trees
7
Integration
13
Integration
Information Hidden behind Links
17
Query-based Extraction
Query the gene extraction ontology
Find applicable resources
Fill out forms
Extract information
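The steps above can be sketched as a small planning routine. This is an illustration under stated assumptions, not the system described in the deck. The resource names (`gene-db`, `protein-db`, `mutant-db`), the form field `gene_name`, and all function names are hypothetical. No network access is performed: the sketch only plans which forms would be filled and which concepts would be extracted from each resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A hypothetical online source and the ontology concepts it covers."""
    name: str
    form_field: str
    provides: frozenset


# Hypothetical registry; a real system would derive this from the ontology.
RESOURCES = [
    Resource("gene-db", "gene_name", frozenset({"Gene Name", "Gene Sequence"})),
    Resource("protein-db", "gene_name", frozenset({"Protein Function"})),
    Resource("mutant-db", "gene_name", frozenset({"Mutant", "Mutant Function"})),
]


def find_applicable_resources(wanted):
    """Keep the resources that cover at least one requested concept."""
    return [r for r in RESOURCES if r.provides & wanted]


def fill_form(resource, gene_name):
    """Build the form payload that would be submitted to the resource."""
    return {resource.form_field: gene_name}


def plan_extraction(gene_name, wanted):
    """Return one planned step per applicable resource."""
    wanted = frozenset(wanted)
    return [
        {
            "resource": r.name,
            "form": fill_form(r, gene_name),
            "extract": sorted(r.provides & wanted),
        }
        for r in find_applicable_resources(wanted)
    ]
```

For instance, `plan_extraction("alfR", {"Gene Sequence", "Protein Function"})` yields two steps, one for each hypothetical resource that covers a requested concept.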
18
Query-based Extraction
Example: “Find the alfR gene, its sequence, its protein's function, and any mutant that inhibits this gene.”
[Figure: ontology concepts Gene, Gene Name, Gene Sequence, Mutant, Protein Function, Mutant Function]
23
Contribution
Provides a way to automatically integrate online biological data from different sources
Provides an approach that finds the proper online resources, fills out online forms, and extracts data according to the user's query