Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data...

47
Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387). Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources Doina Caragea, Jyotishman Pathak, Jie Bao, Adrian Silvescu, Carson Andorf, Drena Dobbs and Vasant Honavar July 26, 2005

description

 

Transcript of Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data...

Page 1: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Doina Caragea, Jyotishman Pathak, Jie Bao, Adrian Silvescu, Carson Andorf, Drena Dobbs and Vasant Honavar

July 26, 2005

Page 2: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Semantic Web Vision

Page 3: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Background and Motivation

Transformation of biology from a data poor science into a data rich science

Proliferation of autonomous, semantically heterogeneous, distributed data sources (more than 500 data repositories of interest to molecular biologists alone)

Needed: Software tools for knowledge acquisition from semantically heterogeneous distributed data sources

InterProMIPS

Swissprot

Page 4: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

INDUS (INtelligent Data Understanding System)

SwissProt, OSwissProt

Learning Algorithms

MIPS, OMIPS InterPro,OInterPro

Ontology O’Ontology O

O

Goal: knowledge discovery from large,

distributed, semantically heterogeneous data

Page 5: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Outline

Ontology-Based Information Integration Learning Classifiers from Semantically Heterogeneous Data INDUS: Information Integration and Knowledge Acquisition

System Summary and Work in Progress

Page 6: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Semantically Heterogeneous Data

Protein ID Protein Name Protein Sequence Prosite Motifs EC Number

P35626Beta-adrenergic

receptor kinase 2

MADLEAVLAD

VSYLMAMEKS

RGS

PROT_KIN_DOM

PH_DOMAIN

2.7.1.126

Beta-adrenergic

receptor kinase

Q12797Aspartyl/asparaginyl

beta-hydroxylase

MAQRKNAKSS

GNSSSSGSGS

TPR

TPR_REGION

TPR

1.14.11.16

Peptide-aspartate

beta-dioxygenase

Data sources need to be made self-describing by specifying the relevant meta data

Accession Number AN

Gene AA Sequence LengthPfam

DomainsMIPS Funcat

P32589 SSE1

STPFGLDLGN

NNSVLAVARN

692 HSP70 16.01 protein binding

P07278 BCY1

VSSLPKESQA

ELQLFQNEIN

415 RIIa

16.19.01 cyclic nucleotide

binding (cAMP, cGMP, etc.)

D1

D2

Page 7: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Meta Data

Schema – structure of data Specification of the attributes of the data and their types

Ontology – conceptualization of semantics of data Domains of attributes and relationships between values

Protein ID : Swissprot ID

Protein Name: String

Protein Sequence: AA String

Prosite Motifs: Motifs

EC Number: EC Hierarchy

Schema for protein data in D1

Page 8: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Attribute value hierarchy

An attribute value hierarchy (AVH) is a partial order ontology over the values of attributes of data

Example: MIPS Funcat Hierarchy

Page 9: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Making data sources self-describing - Ontology-extended data source

Accession Number: MIPS ID

Gene:

Gene ID

Length:

Positive Integer

Prosite Motifs: Motifs

MIPS Funcat:

MIPS Hierarchy

Data

Schema

Ontology

+

+

P32589 SSE1STPFGLDLGNNNSVLAVARN 692 HSP70 16.01 protein binding

P07278 BCY1VSSLPKESQAELQLFQNEIN 415 RIIa

16.19.01 cyclic nucleotidebinding (cAMP, cGMP.)

Page 10: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

User view

PID: Swissprot ID

Source:

Species String

Protein:

AA String

Structural Class: SCOP

GO Function:

GO Hierarchy

MIPS Swissprot

User Schema

Data Sources of Interest

User ViewUser OntologyA user view is given by:

a set of ontology-extended data sources that are of interest to the user

a user schema and ontology (defining a virtual data source)

a set of mappings from data source schemas and ontologies to the user schema and ontology

Page 11: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Mappings

The interoperation between the schema and ontology associated with a data source and a user schema and ontology is facilitated by specifying mappings at: Schema Level: between attributes in different schemas Ontology Level: between values of the attributes described in

different ontologies

Page 12: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Mappings at schema level

Protein ID: Swissprot ID

Protein Name: String

Protein Sequence:

AA String

Prosite Motifs:

AA String

EC Number:

EC Hierarchy

Accession No AN:

MIPS ID

Gene:

Gene ID

AA Sequence:

AA String

Length:

Pos Integer

MIPS Funcat:

MIPS Hierarchy

Pfam Motifs:

Motifs

D1

D2

PID: Swissprot ID

Protein:

AA String

GO Function:

GO HierarchyDU

Source:

Species String

Page 13: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Mappings at schema level

Protein ID : D1≡ PID : DU

Accession Number AN : D2≡ PID : DU

Protein ID: Swissprot ID

Protein Name: String

Protein Sequence:

AA String

Prosite Motifs:

AA String

EC Number:

EC Hierarchy

Accession No AN:

MIPS ID

Gene:

Gene Set

AA Sequence:

AA String

Length:

Pos Integer

MIPS Funcat:

MIPS Hierarchy

Pfam Motifs:

Motifs

D1

D2

PID: Swissprot ID

Protein:

AA String

GO Function:

GO HierarchyDU

Source:

Species String

Page 14: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Mappings at schema level

Protein ID : D1≡ PID : DU

Accession Number AN : D2≡ PID : DU

Protein Sequence : D1≡ AA Composition : DU

AA Sequence : D2 ≡ AA Composition : DU

Protein ID: Swissprot ID

Protein Name: String

Protein Sequence:

AA StringProsite Motifs: AA String

EC Number: EC Hierarchy

Accession No AN: MIPS ID

Gene: Gene ID

AA Sequence:

AA StringLength: Pos Integer

MIPS Funcat: MIPS Hierarchy

Pfam Motifs: Motifs

D1

D2

PID: Swissprot ID

Protein:

AA StringGO Function:GO Hierarchy

DUSource: Species String

Page 15: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Mappings at schema level

Protein ID : D1≡ PID : DU

Accession Number AN : D2≡ PID : DU

Protein Sequence : D1≡ AA Composition : DU

AA Sequence : D2 ≡ AA Composition : DU

EC Number : D1 ≡ GO Function : DU’

MIPS Funcat : D2 ≡ GO Function : DU

Protein ID: SwissProt ID

Protein Name: String

Protein Sequence:

AA String

Prosite Motifs:

AA String

EC Number:

EC Hierarchy

Accession No AN:

MIPS ID

Gene:

Gene ID

AA Sequence:

AA String

Length:

Pos Integer

MIPS Funcat:

MIPS Hierarchy

Pfam Motifs:

Motifs

D1

D2

PID: SwissProt ID

Protein:

AA String

GO Function:

GO HierarchyDU

Source:

Species String

Page 16: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Mappings at ontology level

DUDUD1

Page 17: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Mappings at ontology level

EC 2.7.1.126 : D1 ≡ GO 0047696 : DU

DUD1

Page 18: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Mappings at ontology level

DU

EC 2.7.1 : D1 GO 00047696 : DU

D1

Page 19: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Mappings at ontology level

D1

EC 2.7.1.126: D1 GO 0004672 : DU

DU

Page 20: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Integration ontology

An ontology (OU,) is called an integration ontology of a set of data source ontologies O1,…,OK if there exists K partial injective mappings Φ1,…,ΦK from O1,…,OK, respectively, to OU such that: x i y implies Φi(x) Φi(y), for all x,yOi

Order preservation

(x:Oi op y:OU)IC, then (Φi(x) op y), for all x Oi and y OU

Semantic correspondence preservation

We provide user-friendly tools for specifying semantic correspondences that are used to infer mappings semi-automatically

The consistency of the set of mappings between data source schemas and ontologies and user schema and ontology can be checked using a reasoner

Page 21: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Sample Query

Return ALL proteins whose GO function isa nucleotide binding

Return ALL proteins whose GO function isa kinase activity OR those that are involved in the GO process phosphate metabolism

Page 22: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Outline

Ontology-Based Information Integration Learning Classifiers from Semantically Heterogeneous Data INDUS: Information Integration and Knowledge Acquisition

System Summary and Work in Progress

Page 23: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Learning classifiers from data

DataLabeled Examples

Learner Classifier(hypothesis)

Standard learning algorithms assume centralized access to data

Unlabeled Examples

Classification

Learning

Classifier Class

Page 24: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Human and yeast protein training data

GO 0016208: AMP binding

GO 0005515: protein binding

GO 0004597: peptide-aspartate

GO 0047696: beta-adrenergic-receptor kinase

activity

GO Function

VSSLPKESQA

ELQLFQNEIN

STPFGLDLGN

NNSVLAVARN

MAQRKNAKSS

GNSSSSGSGS

MADLEAVLAD

VSYLMAMEKS

Sequence

Mainly alpha

Alpha betaYeastP39708

Mainly alpha YeastQ01574

Not KnownHumanQ12797

Mainly beta

Few Secondary Structures

HumanP35626

Structural Classes

SourcePID

Attributes/Features/Variables Class/Label

Examples/Instances/Cases

Page 25: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Probabilistic models for protein function classification

GO 0016208: AMP binding

GO 0005515: protein binding

GO 0004597: peptide-aspartate

GO 0047696: beta-adrenergic-receptor

kinase activity

GO Function

VSSLPKESQA

ELQLFQNEIN

STPFGLDLGN

NNSVLAVARN

MAQRKNAKSS

GNSSSSGSGS

MADLEAVLAD

VSYLMAMEKS

Sequence

P39708

Q01574

Q12797

P35626

PIDNaïve Bayes AlgorithmNaïve Bayes Algorithm–Very simple algorithm, works Very simple algorithm, works surprisingly well in practicesurprisingly well in practice–Treats every sequence Treats every sequence SS as a as a “bag” of amino-acids “bag” of amino-acids AA11,…,A,…,Ann

–““Gold standard” for evaluating Gold standard” for evaluating other methodsother methods

c(S)= argmaxc j ∈C

P(c j | A1,...,An )

Most probable class of c(S) is:

=argmaxc j ∈C

P(A1,...,An | c j )P(c j )

P(A1,..., An )€

=argmaxc j ∈Y

P(A1,..., An | c j )P(c j )

=argmaxc j ∈C

P(c j ) P(Ai | c j )i

Page 26: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Learning classifiers from data revisited

Learning = Information extraction + Hypothesis generation

Query s(D,hi->hi+1)

Answer s(D,hi->hi+1)

Data D

Learner Partial hypothesis hi

Information extraction = Sufficient statistics gathering

Hypothesis Generationhi+1R(hi , s(D, hi->hi+1))

Statistical query formulation

Page 27: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Sufficient statistics for learning classifiers

A statistic s(D) is called a sufficient statistic for a parameter θ if s(D) captures all the information about the parameter θ, contained in the data D. We are interested in minimal sufficient statistics [Cassela and Berger, 2001].

A statistic sL(D) is called a sufficient statistic for learning a hypothesis h using a learning algorithm L applied to a data set D if there exists an algorithm that takes sL(D) as input and outputs h [Caragea et al., 2004a].

Page 28: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Naïve Bayes learning as information gathering and hypothesis generation

count(AminoAcid,Class) and count(Class)Sufficient statistics:

c(S = A1,...,An ) = argmaxc j ∈C

P(c j ) P(Ai | c j )i

∏Naïve Bayes class:

Query answering engine

For each ai & For each cj

Counts

Counts(Ai|cj), Counts(cj)

P(cj) & P(ai|cj)

Compute

Naïve Bayes

Data

Page 29: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Learning classifiers from distributed data

Information extraction from distributed data + Hypothesis generation

Query s(D,hi->hi+1)

Answer s(D,hi->hi+1)

Query Decomposition

Answer Composition

D1

D2

DK

q1

q2

qK

Statistical Query Formulation

Hypothesis Generationhi+1R(hi , s(D, hi->hi+1 ))

LearnerPartial hypothesis hi

Query answering engine

Page 30: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Learning classifiers from semantically heterogeneous data sources

O

Query s(D,hi)

Answer s(D,hi)

Query Decomposition

Answer Composition

D1,O1

D2, O2

DK, OK

Ontology

M(O1...OK , O)

Mappings from O1 … OK to O

Statistical Query Formulation

Hypothesis Generationhi+1R(hi , s(D, hi))

LearnerPartial hypothesis hi

q2

qK

q1

Page 31: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Outline

Ontology-Based Information Integration Learning Classifiers from Semantically Heterogeneous Data INDUS: Information Integration and Knowledge Acquisition

System Summary and Work in Progress

Page 32: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Ontology-based information integration in INDUS

Page 33: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Capabilities of INDUS

INDUS provides support for: Specification and update of schemas and ontologies Specification of mappings between ontologies Registration of new data sources Specification of user views Specification and execution of queries across distributed,

semantically heterogeneous data sources Learning classifiers from semantically heterogeneous data

Page 34: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

INDUS Tools

Ontology Editor for specifying or modifying ontologies Schema Editor for specifying or modifying data source

schemas Mapping Editor for specifying mappings between

ontologies and between schemas Data Editor for registering data sources with INDUS View Editor for defining user views Query Interface for formulating queries and displaying

results

Page 35: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

INDUS Users: Domain Ontologists

A domain ontologist can: Specify or update ontologies Specify or update schemas Specify or update mappings between ontologies Specify or update mappings between schemas

Page 36: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

INDUS Users: Data Providers

A data provider can: Associate a predefined schema and ontology with a data

source Specify data source location, type and access procedures Register a data source Act as a domain ontologist

Page 37: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

INDUS Users: Domain Experts

A domain expert can specify an application view, i.e., Select data sources of interest in an application domain Select an application specific schema Select an application specific ontology Select relevant mappings

A domain expert can serve as Domain ontologist Data provider

Page 38: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

INDUS Users: Domain Scientists

A domain scientist can Select an application view Formulate and execute queries

A domain scientist can act as Domain ontologist Data provider Domain expert

Page 39: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

INDUS

Some features of INDUS Clear distinction between structure and semantics of data Data integration from a user perspective - User-specifiable

ontologies and mappings (no single global ontology) Data integration on the fly Semantic integrity of queries ensured by means of

semantics preserving mappings

Page 40: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Related work

Information integration: [Sheth and Larson, 1990; Davidson et al., 2001; Eckman, 2003; Levy, 1998]

Biological data integration: SRS [Etzold et al., 2003], K2 [Tannen et al., 2003], Kleisli [Chen et al., 2003], IBM’s Discovery Link [Haas et al., 2001], TAMBIS [Stevens et al., 2003], Bio-Mediator [Shaker et al., 2004], etc.

Ontology and mappings editors: Protégé [Noy et al., 2000], Clio [Eckman et al., 2002], DAG-Edit etc.

Ontology-extended relational algebra: [Bonatti et al., 2003]

Page 41: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Outline

Ontology-Based Information Integration Learning Classifiers from Semantically Heterogeneous Data INDUS: Information Integration and Knowledge Acquisition

System Summary and Work in Progress

Page 42: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Summary

Page 43: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Work in progress

Ontologies and mappings Support for more expressive ontologies (beyond hierarchies)

[Bao et al., 2005] Support for interactive specification of mappings between

ontologies, including automated generation of candidate mappings

Support for modular ontologies and mappings [Bao and Honavar, 2004]

Scalability: efficient mechanisms for storage, manipulation, retrieval and use of large ontologies and mappings

More powerful reasoning to ensure the semantic integrity of mappings

Support for import, export, and sharing of ontologies and mappings (e.g. OBO and OWL)

Page 44: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Work in progress

Query Processing Query optimization under access, bandwidth and

computational constraints Implementation of data retrieval procedures (iterators) for

widely used bioinformatics data sources Support for data caching and data sharing

Page 45: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Work in progress

Knowledge Acquisition Support for learning classifiers and other predictive models

from semantically heterogeneous data [Caragea et al., 2005]

Support for statistical queries - including queries over partially specified data [Caragea et al., 2004 ]

Support for annotating and sharing results of knowledge acquisition

Page 46: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

Work in progress

Applications in bioinformatics - data driven discovery of macromolecular sequence-structure-function relationships Prediction of protein function [Andorf et al., 2004] Prediction of protein-protein, protein-DNA and protein-RNA

interfaces [Yan et al., 2004] Analysis, visualization, and interpretation of gene expression data

[Caragea et al., 2005] Modeling and discovery of gene regulatory networks

Usability studies Design of better user interfaces Performance evaluation

Page 47: Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources

Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).

Iowa State University Department of Computer ScienceArtificial Intelligence Research Laboratory

http://www.cild.iastate.edu/software/indus.html