CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research...

CIS607, Fall 2005CIS607, Fall 2005Semantic Information Semantic Information

IntegrationIntegrationArticle Name: Clio Grows Up: From Research Article Name: Clio Grows Up: From Research Prototype to Industrial ToolPrototype to Industrial Tool

Name: DH(Dong Hwi) kwakName: DH(Dong Hwi) kwakDate: 10/26/05 4:00 pmDate: 10/26/05 4:00 pm

IntroductionIntroduction

Mapping between different Mapping between different representations of data are important.representations of data are important.

Because it is necessary to integrate Because it is necessary to integrate and exchange of data residing at a and exchange of data residing at a multiple sites, in different formats (or multiple sites, in different formats (or schemas), and even under different schemas), and even under different data models (such as relational or XML)data models (such as relational or XML)

Data Exchange vs. Data Data Exchange vs. Data IntegrationIntegration

Data Exchange (or data translation): Data Exchange (or data translation): The task of restructuring data from a The task of restructuring data from a source format (or schema) into a source format (or schema) into a target format(or schema).target format(or schema).

Data Integration (or federation) : The Data Integration (or federation) : The ability to query a set of ability to query a set of heterogeneous data sources via a heterogeneous data sources via a virtual unified target schema.virtual unified target schema.

Relationship or MappingRelationship or Mapping

In both cases (data exchange, data In both cases (data exchange, data integration), relationship or mapping integration), relationship or mapping must first be established between the must first be established between the source schema and the target schema.source schema and the target schema.

Two complementary levelTwo complementary level Syntactic one: schema matchingSyntactic one: schema matching Operational one: instances over the Operational one: instances over the

source schema with instances over target source schema with instances over target schemaschema

Move actual data from source to a targetMove actual data from source to a target Answer queriesAnswer queries

Clio SystemClio System

Two important componentTwo important component Mapping Generation ComponentMapping Generation Component

Takes as input correspondences between the source and Takes as input correspondences between the source and target schemastarget schemas

Generates a schema mapping consisting of Generates a schema mapping consisting of logical logical mappingmapping that provide an interpretation of the given that provide an interpretation of the given correspondencescorrespondences

Query Generation ComponentQuery Generation Component To convert a set of logical mappings into an executable To convert a set of logical mappings into an executable

transformation scripttransformation script SQL, SQL/XML, XQuery and XSLTSQL, SQL/XML, XQuery and XSLT

The user can interact with the system through GUI The user can interact with the system through GUI during a designduring a design

The user can view, add, and remove The user can view, add, and remove correspondencescorrespondences

Clio SystemClio System Two main Two main

componentcomponent Mapping Mapping

GenerationGeneration Query Query

GenerationGeneration

Mapping & Query Mapping & Query GenerationGeneration

Figure 2 illustrates an Figure 2 illustrates an actual Clio screenshot actual Clio screenshot showing portions of showing portions of two gene expression two gene expression schemas and schemas and correspondencescorrespondences

Left: Source schemaLeft: Source schema Relational schemaRelational schema

Right: Target schemaRight: Target schema XML schemaXML schema


Mapping GenerationMapping Generation For any two related elements in the source For any two related elements in the source

schema, for which there exist correspondences schema, for which there exist correspondences into two related elements in the target schema.into two related elements in the target schema.

FACTOR_NAMEFACTOR_NAME and and BIOLOGY_DESCBIOLOGY_DESC: There is a foreign key : There is a foreign key that links the that links the EXPERIMENTFACOREXPERIMENTFACOR table to the table to the EXPERIMENTSETEXPERIMENTSET table table

biology_descbiology_desc and and factor_namefactor_name: the latter is a child of : the latter is a child of the former in XML schema.the former in XML schema.

Therefore there will be a mapping that maps related Therefore there will be a mapping that maps related instances of instances of FACTOR_NAMEFACTOR_NAME and and BIOLOGY_DESC BIOLOGY_DESC into into related instances related instances biology_descbiology_desc and and factor_namefactor_name


Generation of Generation of tableauxtableaux The first step of the The first step of the

algorithm is to algorithm is to generate all the basic generate all the basic ways in which ways in which elements relate to elements relate to each other within one each other within one schemaschema

In figure 3 we show In figure 3 we show several of the source several of the source and target tableaux and target tableaux that are generated that are generated for our examplefor our example


Generation of logical Generation of logical mappingmapping The second step of The second step of

the algorithm is the the algorithm is the generation of logical generation of logical mappingmapping

The basic algorithm The basic algorithm pairs all the existing pairs all the existing tableaux in the tableaux in the source with the source with the existing tableaux in existing tableaux in the target, and find the target, and find the correspondences the correspondences that are covered by that are covered by each paireach pair

MM11 is obtained from (S is obtained from (S22, T, T22)) MM33 is obtained from the pair(S is obtained from the pair(S44, T, T44))


Mapping languageMapping language Skolem functions: This functions can Skolem functions: This functions can

explicitly represent target elements for explicitly represent target elements for which no source value is givenwhich no source value is given

For example, the mapping m1 will not For example, the mapping m1 will not specify a value for the specify a value for the @id@id attribute attribute under under exp_factorexp_factor

A Skolem function creates a unique A Skolem function creates a unique value for this attributesvalue for this attributes


Query GenerationQuery Generation Each logical mapping is compiled into a Each logical mapping is compiled into a

query graphquery graph For each logical mapping, query For each logical mapping, query

generators walk the relevant part of the generators walk the relevant part of the target schema and create the necessary target schema and create the necessary join and grouping conditionjoin and grouping condition

In Figure 5, the XQuery fragment In Figure 5, the XQuery fragment produces the target <exp_set>produces the target <exp_set>


Lines 1-4 in Figure 5 Lines 1-4 in Figure 5 implements join of two implements join of two tables tables EXPERIMENTFACOREXPERIMENTFACOR and and EXPERIMENTSETEXPERIMENTSET

Lines 7-10 output the Lines 7-10 output the attributes within attributes within exp_setexp_set

Lines 25-31 produce an Lines 25-31 produce an actual actual exp_factorexp_factor elementelement

Line 30 crates a value Line 30 crates a value for the for the idid element element

Practical ChallengesPractical Challenges(Mapping Generation)(Mapping Generation)

Redundancy CheckRedundancy Check If there a is one-to-one mapping of the If there a is one-to-one mapping of the

variables of Tvariables of T11 into the T into the T22, T, T11 is a sub- is a sub-tableau of Ttableau of T22

Can significantly reduce the amount of Can significantly reduce the amount of irrelevant mappingsirrelevant mappings


Hybrid AlgorithmHybrid Algorithm There are two phases in mapping There are two phases in mapping

generation (precomputation, insertion of generation (precomputation, insertion of correspondences)correspondences)

The separation enables users to speed up The separation enables users to speed up the addition of correspondences in the GUIthe addition of correspondences in the GUI

The disadvantage of this separation might The disadvantage of this separation might be occurred when the schemas are largebe occurred when the schemas are large

A large amount of memory may be needed to A large amount of memory may be needed to hold all the data structurehold all the data structure


Hybrid AlgorithmHybrid Algorithm The main idea behind this algorithm is to The main idea behind this algorithm is to

precompute only a bounded number of source precompute only a bounded number of source tableaux and target tableauxtableaux and target tableaux

Sometimes correspondences between elements Sometimes correspondences between elements may fail because of the deeper schema treesmay fail because of the deeper schema trees

Generate a source tableau that includes all the Generate a source tableau that includes all the set-type elements -> The tableaux are closed set-type elements -> The tableaux are closed under the chase. -> Thus including all the other under the chase. -> Thus including all the other schema elements associated via foreign keyschema elements associated via foreign key

The data structures holding the sub-tableaux The data structures holding the sub-tableaux and the sub-skeleton relationship are updated.and the sub-skeleton relationship are updated.

The algorithm may lose its completeness (small The algorithm may lose its completeness (small price)price)


Performance evaluation: mapping MAGE-MLPerformance evaluation: mapping MAGE-ML MAGE-ML is a complex XML schemaMAGE-ML is a complex XML schema Two experiments are performedTwo experiments are performed

Control the nesting level of the precomputed tableaux Control the nesting level of the precomputed tableaux (maximum 6 nested level of sets), no limits on the total (maximum 6 nested level of sets), no limits on the total number of precomputed tableaux (Lower bound)number of precomputed tableaux (Lower bound)

Control the nesting level of the precomputed tableaux Control the nesting level of the precomputed tableaux (maximum 6 nested level of sets), the total number of (maximum 6 nested level of sets), the total number of precomputed tableaux(maxium 110 per schema precomputed tableaux(maxium 110 per schema (actual improvement)(actual improvement)


First experimentFirst experiment Load the MAGE-ML (source): less than 1 secLoad the MAGE-ML (source): less than 1 sec Precompute all(1030) tableaux: 2.6 secPrecompute all(1030) tableaux: 2.6 sec Computing subtableaux relationship: 74 secComputing subtableaux relationship: 74 sec Memory to hold the data structure: 335 MBMemory to hold the data structure: 335 MB Loading the MAGE-ML schema: run out of memoryLoading the MAGE-ML schema: run out of memory

Second experimentSecond experiment Precomputation of the tableaux(116 now): 0.5 secPrecomputation of the tableaux(116 now): 0.5 sec Computing subtableaux relationship: 0.7 secComputing subtableaux relationship: 0.7 sec Memory to hold the data structure: 163 MBMemory to hold the data structure: 163 MB The amount of memory needed to hold everything: 251 The amount of memory needed to hold everything: 251

MBMB Overall, the performance of Overall, the performance of hybrid algorithmhybrid algorithm is is

quite acceptable.quite acceptable.

Practical ChallengesPractical Challenges(Query Generation: Deep Union)(Query Generation: Deep Union)

There are two drawbacksThere are two drawbacks There is no duplicate removal within and among There is no duplicate removal within and among

query fragmentsquery fragments There is no grouping of data among query There is no grouping of data among query

fragmentsfragments (OrderID, ItemID)(OrderID, ItemID)

Input data :{(oInput data :{(o11,i,i11),(o),(o11,i,i22)})} Ouput data: {(oOuput data: {(o11,(i,(i11,i,i22)),(o)),(o11,(i,(i11,i,i22))}))} Second query: {(oSecond query: {(o11,(i,(i33))}))}

We would expect this second tuple to be merged We would expect this second tuple to be merged with previous result and produce only one tuple with previous result and produce only one tuple for ofor o11 with {(o with {(o11,(i,(i11,i,i2, 2, ii33)} We call this special )} We call this special union operation is union operation is deep uniondeep union

Remaining ChallengesRemaining Challenges

Complex mappings are need a more Complex mappings are need a more expressive correspondence selection expressive correspondence selection mechanism than that supported by Cliomechanism than that supported by Clio

Exploring the need for logical mapping Exploring the need for logical mapping that nest other logical mapping insidethat nest other logical mapping inside

Mapping adaptation issues when source Mapping adaptation issues when source and target schemas changeand target schemas change

CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research...

Documents

Transcript of CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research...