A Metadata Integration Assistant Generator for Heterogeneous Databases


Page 1: A Metadata Integration Assistant Generator for Heterogeneous Databases

Young-Kwang Nam

Joseph Goguen

Guilian Wang

A Metadata Integration Assistant Generator for Heterogeneous Databases

Page 2: A Metadata Integration Assistant Generator for Heterogeneous Databases

Data Integration in Synthetic Scientific Applications

[Figure: applications send a query to the integration system, which exposes a global unified schema/ontology over data sources 1 … n, each with its own local schema/ontology, and returns an integrated result without inconsistency, etc.]

Page 3: A Metadata Integration Assistant Generator for Heterogeneous Databases

Why Difficult: Data Heterogeneity

• Platform & System Heterogeneity
  – OS, Hardware
  – DBMSs, concurrency control and recovery capabilities

• Syntactic & Structural Heterogeneity
  – Machine-readable aspects of representation
  – Data models, schemas

• Semantic Heterogeneity
  – Naming conflicts: synonyms, homonyms
  – Scaling & precision conflicts
  – Sampling rates, error distribution, etc.

Page 4: A Metadata Integration Assistant Generator for Heterogeneous Databases

More Difficult: Flexible Integration

• No all-encompassing system satisfies everyone:
  – frequent update of sources
  – frequent change of user requirements
  – non-published data from one’s own lab

• To domain scientists, simplicity and readability are more desirable than completeness or exhaustiveness

• Domain knowledge is crucial for
  – solving heterogeneities
  – query optimization

• Desirable to support domain scientists in doing data integration on their own

Page 5: A Metadata Integration Assistant Generator for Heterogeneous Databases

A Common Data Integration Architecture

[Figure: a query goes to the mediator, which presents an integrated view (materialized or virtual) over data sources 1 … n, each reached through its own wrapper; the result is returned to the user.]

Page 6: A Metadata Integration Assistant Generator for Heterogeneous Databases

Structural vs. Semantic (w.r.t. Mediation Level)

• Structural approach (Mediated schema approach)
  – integration by generating a mediated schema that characterizes a set of data sources

• Semantic approach (Ontology-based approach)
  – difficult to integrate structural aspects of sources from a semantic perspective, due to semantics embedded in local schemas and implicit assumptions
  – integration by sharing a common ontology among the different data sources

Page 7: A Metadata Integration Assistant Generator for Heterogeneous Databases

Global-as-view vs. Local-as-view (w.r.t. Mapping Direction)

• Global-as-view approach
  – each item in the global schema/ontology is a view (query) over the source schemas/ontologies
  – $\mathrm{query}(G) = \mathrm{query}(f(S_1, S_2, \ldots, S_n))$
  – straightforward query rewriting

• Local-as-view approach
  – each source is a view (query) over the global schema/ontology
  – $\mathrm{query}(G) = \mathrm{query}(f_1^{-1}(S_1), f_2^{-1}(S_2), \ldots, f_n^{-1}(S_n))$
  – easy adding or removing of sources

Page 8: A Metadata Integration Assistant Generator for Heterogeneous Databases

Representative Systems

• TSIMMIS (Stanford & IBM, 1995)

• MedMaker (Stanford, 1996)

• MIX (SDSC&UCSD, 2000)

• IM (AT&T, 1996)

• Clio+Garlic (IBM, 2000)

• DIXSE (UT, 2001)

• XYLEME (2001)

• HERMES (UMD, 1994)

• SIMS (USC, 1996)

• Observer (UG, 1996)

• Infosleuth (MCC, 1997)

• COIN (MIT, 1999)

• Ontobroker (Ger., 2000)

• KIND (SDSC&UCSD, 2001)

Page 9: A Metadata Integration Assistant Generator for Heterogeneous Databases

Our Approach

• Virtual Integration: retrieve data and resolve conflicts at query time, easy maintenance

• Structural Approach: take users’ knowledge on data semantics hidden in structural information as input to achieve semantic mediation

• Local-as-view: easily adds or removes sources, convenient to fit applications

• GUI for specifying semantic mappings by assigning the same index to nodes (paths) that have the same meaning

• Automatically generate DDXMI for query decomposition

• Semantic functions

Page 10: A Metadata Integration Assistant Generator for Heterogeneous Databases

Current Prototype Architecture

[Figure: a user query (XML query) enters the query generator/collector, which consults the DDXMI (column or path for each DB) to produce query1 … queryn; each query runs on its own XML/DB engine over XML/DB1 … XML/DBn, and result1 … resultn flow back to the collector.]

Page 11: A Metadata Integration Assistant Generator for Heterogeneous Databases

Distributed Database XML Metadata Interface (DDXMI)

• Includes database or XML document names or location information

• Contains table column or XML path information

• Gives function or operation names for resolving semantic issues about table columns or XML elements and attributes

Page 12: A Metadata Integration Assistant Generator for Heterogeneous Databases

DDXMI DTD

<!ELEMENT DDXMIA (DDXMI.header, DDXMI.isequivalent, documentspec)>
<!ELEMENT DDXMI.header (documentation,version,date,authorization)>
<!ELEMENT documentation (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT authorization (#PCDATA)>
<!ELEMENT DDXMI.isequivalent (source,destination*)*>
<!ELEMENT source (#PCDATA)>
<!ELEMENT destination (#PCDATA)>
<!ELEMENT documentspec (document, (elementname,operation*)*)>
<!ELEMENT document (#PCDATA)>
<!ELEMENT elementname (#PCDATA)>
<!ELEMENT operation (#PCDATA)>
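To make the DTD concrete, here is a minimal Python sketch that builds a document with the element names and nesting order declared above. The function name `build_ddxmi`, the header values, and the way operations are attached to element names are assumptions for illustration only; the slides do not show how the prototype fills these fields. Note that this sketch writes one `destination` element per path, which `destination*` permits, whereas the 1:N example later in the deck puts comma-separated paths inside a single `destination`.

```python
import xml.etree.ElementTree as ET


def build_ddxmi(document, mappings, operations=None):
    """Build a DDXMI-style tree using the element names from the DTD.

    mappings:   list of (source_path, [destination_paths]) pairs.
    operations: optional dict from element name to a list of operation names
                (an assumed way of pairing them; not taken from the slides).
    """
    root = ET.Element("DDXMIA")

    # Header children appear in the order required by the DTD.
    header = ET.SubElement(root, "DDXMI.header")
    for tag, text in [("documentation", "example mapping"),  # placeholder values
                      ("version", "1.0"),
                      ("date", "unknown"),
                      ("authorization", "unknown")]:
        ET.SubElement(header, tag).text = text

    # (source, destination*)* : each source is followed by its destinations.
    equiv = ET.SubElement(root, "DDXMI.isequivalent")
    for source, destinations in mappings:
        ET.SubElement(equiv, "source").text = source
        for dest in destinations:
            ET.SubElement(equiv, "destination").text = dest

    # (document, (elementname, operation*)*)
    spec = ET.SubElement(root, "documentspec")
    ET.SubElement(spec, "document").text = document
    for name, ops in (operations or {}).items():
        ET.SubElement(spec, "elementname").text = name
        for op in ops:
            ET.SubElement(spec, "operation").text = op
    return root


root = build_ddxmi(
    "book3.xml",
    [("/book", ["/bookstore/book"]), ("/book/title", ["/bookstore/book/title"])],
    {"/bookstore/book/author/name": ["fstring", "lstring"]},
)
print(ET.tostring(root, encoding="unicode"))
```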

Page 13: A Metadata Integration Assistant Generator for Heterogeneous Databases

How to generate DDXMI

• Define a master DTD (global schema) based on application requirements for choosing elements or tables from the distributed systems

• Parse the master DTD and generate a path for each element, from the root to the current element

• Assign the master index number to each site element node that has the same meaning as the master DTD node

• A function name may be included for some nodes

• Generate the DDXMI file automatically by collecting the paths that share the same index number
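The path-generation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the prototype's code: DTD parsing is skipped and the master tree is written by hand as nested tuples, and the numbering rule (a child's index is its parent's index followed by the child's 1-based position) is inferred from the master index numbers shown on later slides. The names `index_paths` and `MASTER_TREE` are hypothetical.

```python
def index_paths(name, children, prefix="", index="1"):
    """Yield (index, path) for a node and all its descendants, depth first."""
    path = f"{prefix}/{name}"
    yield index, path
    for position, (child, grandchildren) in enumerate(children, start=1):
        # Child index = parent index followed by the child's position.
        yield from index_paths(child, grandchildren, path, f"{index}{position}")


# Hand-written stand-in for the parsed master DTD: (element, [children]).
MASTER_TREE = ("book", [
    ("price", []),
    ("author", [("full_name", [("first_name", []), ("last_name", [])])]),
    ("title", []),
    ("year", []),
    ("publisher", []),
    ("editor", [("affiliation", []), ("full_name", [])]),
])

# Index 0 is reserved for the document name, as on the slides.
master_index = {"0": "book.xml", **dict(index_paths(*MASTER_TREE))}

for index, path in master_index.items():
    print(index, path)
```

Concatenating positions becomes ambiguous once a node has more than nine children, so a real tool would need a separator or fixed-width positions.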

Page 14: A Metadata Integration Assistant Generator for Heterogeneous Databases

Generate Master Index

Page 15: A Metadata Integration Assistant Generator for Heterogeneous Databases

Site1 : Book1 DTD Tree

[Figure: Book1 DTD tree; each node is annotated with an index number and a function name]

Page 16: A Metadata Integration Assistant Generator for Heterogeneous Databases

Book1 Path Information

Site1 Index
0     book1.xml
1     /bib/book
11    /bib/book/price
12    /bib/book/author
1211  /bib/book/author/first
1212  /bib/book/author/last
13    /bib/book/title
15    /bib/book/publisher
16    /bib/book/editor
161   /bib/book/editor/affiliation
162   /bib/book/editor/last
162   /bib/book/editor/first

Master Index
0     book.xml
1     /book
11    /book/price
12    /book/author
121   /book/author/full_name
1211  /book/author/full_name/first_name
1212  /book/author/full_name/last_name
13    /book/title
14    /book/year
15    /book/publisher
16    /book/editor
161   /book/editor/affiliation
162   /book/editor/full_name
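Collecting over the same index numbers amounts to a join between the master index and a site index. The Python sketch below shows one way to do that join, using a subset of the Book1 entries from this slide; the function name `collect_mappings` and the output shape (master path mapped to a list of site paths) are assumptions made for illustration.

```python
from collections import defaultdict


def collect_mappings(master_index, site_entries):
    """Group site paths under the master path that carries the same index."""
    by_index = defaultdict(list)
    for index, path in site_entries:
        by_index[index].append(path)
    # Master paths without a site counterpart map to an empty list.
    return {master_path: by_index.get(index, [])
            for index, master_path in master_index.items()}


MASTER = {
    "1": "/book",
    "11": "/book/price",
    "13": "/book/title",
    "14": "/book/year",
    "162": "/book/editor/full_name",
}

# A list of pairs, because one index (162) occurs twice at Site1.
SITE1 = [
    ("1", "/bib/book"),
    ("11", "/bib/book/price"),
    ("13", "/bib/book/title"),
    ("162", "/bib/book/editor/last"),
    ("162", "/bib/book/editor/first"),
]

mappings = collect_mappings(MASTER, SITE1)
for source, destinations in mappings.items():
    print(source, "->", destinations)
```

In this sketch a master path such as /book/year, which Book1 does not provide, ends up with an empty destination list, and the repeated index 162 yields a 1:N mapping.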

Page 17: A Metadata Integration Assistant Generator for Heterogeneous Databases

Site 2 : Book2 DTD Tree

Page 18: A Metadata Integration Assistant Generator for Heterogeneous Databases

Book2 Path Information

Site2 Index
0     book2.xml
1     /arts/book
12    /arts/book/author
1211  /arts/book/author/firstname
1212  /arts/book/author/lastname
13    /arts/book/title
15    /arts/book/publisher

Master Index: as listed on Page 16

Page 19: A Metadata Integration Assistant Generator for Heterogeneous Databases

Site 3 : Book3 DTD Tree

Page 20: A Metadata Integration Assistant Generator for Heterogeneous Databases

Book3 Path Information

Site3 Index
0     book3.xml
1     /bookstore/book
11    /bookstore/book/price
12    /bookstore/book/author
1211  /bookstore/book/author/name
1212  /bookstore/book/author/name
13    /bookstore/book/title

Master Index: as listed on Page 16

Page 21: A Metadata Integration Assistant Generator for Heterogeneous Databases

XML Query Languages

• XQL: takes a document point of view

• XML-QL: takes a database point of view

• Quilt: draws from both areas
  – proposed by Don Chamberlin, Jonathan Robie, and Daniela Florescu
  – Kweelt (University of Washington), an XML query engine based on Quilt, is used in our prototype

• XQuery proposal follows Quilt closely

Page 22: A Metadata Integration Assistant Generator for Heterogeneous Databases

How to generate site queries

• Parse the master query, a query over the global schema

• When a path is encountered, depending on its kind, get the corresponding path name from the DDXMI file and substitute it

• If there is no corresponding path in the DDXMI, then put it as a null value
  – no queries generated for that site
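A rough Python sketch of this substitution step follows. It makes two simplifications that are mine, not the slides': the master query is not parsed, so path positions are pre-marked with `{name}` placeholders, and a single missing path suppresses the whole site query, which is only one reading of the last bullet. The names `generate_site_query`, `MASTER_QUERY`, `SITE1` and `SITE3` are hypothetical.

```python
from string import Formatter


def generate_site_query(template, site_paths):
    """Fill the path placeholders of a master query with site paths.

    Returns None when any placeholder has no site path (null value),
    meaning no query is generated for that site.
    """
    fields = [name for _, name, _, _ in Formatter().parse(template) if name]
    if any(site_paths.get(name) is None for name in fields):
        return None
    return template.format(**site_paths)


# Master query from the 1:1 mapping slide, with path positions marked.
MASTER_QUERY = (
    'FOR $book IN document("{document}")//{book}'
    '[{publisher} = "Addison-Wesley"] '
    'RETURN <book> $book/{title} </book>'
)

SITE1 = {"document": "book1.xml", "book": "book",
         "publisher": "publisher", "title": "title"}
# Book3 has no publisher path, so its entry is a null value.
SITE3 = {"document": "book3.xml", "book": "book",
         "publisher": None, "title": "title"}

print(generate_site_query(MASTER_QUERY, SITE1))
print(generate_site_query(MASTER_QUERY, SITE3))
```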

Page 23: A Metadata Integration Assistant Generator for Heterogeneous Databases

How to get site element names

[Figure: master index tree (book with children price, author, publisher, year, title, editor; full_name with first_name and last_name; editor with affiliation and full_name) shown beside a site index tree (bookstore/book/price_info/price), linked through the DDXMI.]

DDXMI

<source>book</source> <destination>bookstore/book</destination>

<source>book/price</source> <destination>bookstore/book/price_info/price</destination>

[In Quilt Query]

1. book -> bookstore/book

2. price -> bookstore/book/price_info/price, with the prefix cut, leaving price_info/price

Page 24: A Metadata Integration Assistant Generator for Heterogeneous Databases

1:1 Mapping Example

FOR $book IN document("book.xml")//book[publisher = "Addison-Wesley"]
RETURN <book> $book/title </book>

[Figure: master index tree mapped one-to-one to Book1 (bib/book with publisher and title), Book2 (arts/book with publisher and title), and Book3 (bookstore/book with title only).]

Page 25: A Metadata Integration Assistant Generator for Heterogeneous Databases

Query Execution Result

Page 26: A Metadata Integration Assistant Generator for Heterogeneous Databases

1:N Mapping Example

FOR $edi IN document("book.xml")//book/editor
RETURN <editor> $edi/full_name </editor>

[Figure: master index tree beside Book1 (bib/book/editor with children last and first), Book2 (arts/book), and Book3 (bookstore/book); only Book1 has an editor.]

DDXMI

<source>/book/editor/full_name</source>
<destination>/bib/book/editor/last,/bib/book/editor/first</destination>
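One way to use such a 1:N entry during query generation is to split the comma-separated destination and rewrite each path relative to the query variable. The Python sketch below assumes exactly that; the function name `expand_destination`, its parameters, and the shape of the resulting RETURN expression are illustrative guesses, since the transcript does not show the site query that the prototype generates for this case.

```python
def expand_destination(variable, site_root, destination):
    """Turn a comma-separated DDXMI destination into variable-relative paths.

    variable:  query variable already bound to site_root, e.g. "$edi".
    site_root: site path the variable ranges over, e.g. "/bib/book/editor".
    """
    relative = []
    for path in (p.strip() for p in destination.split(",")):
        if not path.startswith(site_root + "/"):
            raise ValueError(f"{path} is not under {site_root}")
        # Cut the prefix that the variable already covers.
        relative.append(f"{variable}{path[len(site_root):]}")
    return ", ".join(relative)


returned = expand_destination(
    "$edi",
    "/bib/book/editor",
    "/bib/book/editor/last,/bib/book/editor/first",
)
print(f"RETURN <editor> {returned} </editor>")
```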

Page 27: A Metadata Integration Assistant Generator for Heterogeneous Databases

Query Execution Result

Page 28: A Metadata Integration Assistant Generator for Heterogeneous Databases

N:1 Mapping Example

FOR $a IN document("book.xml")//book//author
RETURN <author> $a/last_name, $a/first_name </author>

[Figure: master index tree (author/full_name with last_name and first_name) beside Book1 (bib/book/author with last and first), Book2 (arts/book/author with lastname and firstname), and Book3 (bookstore/book/author/name); the single Book3 name element carries both operations below.]

<operation>lstring</operation>

<operation>fstring</operation>

Page 29: A Metadata Integration Assistant Generator for Heterogeneous Databases

Query Generation Result

import split as UDF_split;

FUNCTION fstring($str){ split(" ",$str)[1]}

FUNCTION lstring($str){ split(" ",$str)[2]}

FOR $a IN document("book3.xml") //book//author

RETURN <author> fstring($a/name), lstring($a/name)</author>
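For readers unfamiliar with Quilt, the two conversion functions above can be mirrored in Python roughly as follows. This assumes that the Quilt `[1]` and `[2]` are 1-based positions into the split result, which matches their use for first and last name here; the sample name is made up.

```python
def fstring(name):
    """First space-separated token, mirroring split(" ", $str)[1]."""
    return name.split(" ")[0]


def lstring(name):
    """Second space-separated token, mirroring split(" ", $str)[2]."""
    return name.split(" ")[1]


full_name = "Jane Doe"  # hypothetical value of /bookstore/book/author/name
print(fstring(full_name), lstring(full_name))
```

Like the original functions, this only suits names of the form "first last": a single-token name raises an IndexError in `lstring`, and a middle name would be returned as the last name.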

Page 30: A Metadata Integration Assistant Generator for Heterogeneous Databases

Query Execution Result

Page 31: A Metadata Integration Assistant Generator for Heterogeneous Databases

Semantic Function Involved Example

FOR $book IN document("book.xml")//book
RETURN <book> $book/title, $book/author, $book/price </book>

<operation>div(100)</operation>

[Figure: master index tree beside Book1 (bib/book/price), Book2 (arts/book), and Book3 (bookstore/book/price); the div(100) operation is attached to a price mapping.]
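A semantic function such as `div(100)` is presumably a scaling conversion applied to the site value before it is returned under the master schema. The Python sketch below shows one way to interpret such an operation string; the parsing rule, the `OPERATIONS` table and the function name `apply_operation` are assumptions, and the transcript does not say which site's price the operation applies to.

```python
import re

# Supported operation names; only "div" appears on the slide.
OPERATIONS = {
    "div": lambda value, argument: value / argument,
}


def apply_operation(operation, value):
    """Apply an operation string such as "div(100)" to a numeric value."""
    match = re.fullmatch(r"(\w+)\((\d+(?:\.\d+)?)\)", operation)
    if match is None or match.group(1) not in OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    name, argument = match.group(1), float(match.group(2))
    return OPERATIONS[name](value, argument)


# Hypothetical site price of 2500 units, rescaled by the operation.
print(apply_operation("div(100)", 2500))
```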

Page 32: A Metadata Integration Assistant Generator for Heterogeneous Databases

Query Execution Result

Page 33: A Metadata Integration Assistant Generator for Heterogeneous Databases

Remaining Issues

• Handle attributes: one DTD has an attribute but others don’t, or an attribute in one DTD is an element in others

• More efficient way of generating the DDXMI file automatically when there are many paths in the master DTD, e.g., tree:tree mapping: if two paths are indicated as the same and have the same children, then the index numbers should be generated automatically

• Migrate to XML schemas, instead of DTDs

• Support JOIN, PRODUCT generated by queries

• Move to XQuery and a query engine with distributed query support

• Integrate the individual site query results into one return, as a single data source ready for further analysis

• Provide mechanisms for removing redundancy

• Justify the semantics of the query generated

Page 34: A Metadata Integration Assistant Generator for Heterogeneous Databases

Conclusion

• Our prototype uses distributed metadata to generate a GUI tool for describing mappings between master and local databases by assigning index numbers and specifying conversion function names

• Uses Quilt as its XML query language

• A DDXMI file is generated based on the mappings, and is used to translate queries over the virtual master database into sub-queries to local databases

• An experiment testing feasibility is reported, in which 3 different bibliography databases are integrated

• Implemented with Java Webserver and JavaCC

• Move to real applications, e.g. in the context of NSF project SEEK (Science Environment for Ecological Knowledge)