CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research...
-
date post
22-Dec-2015 -
Category
Documents
-
view
215 -
download
2
Transcript of CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research...
CIS607, Fall 2005CIS607, Fall 2005Semantic Information Semantic Information
IntegrationIntegrationArticle Name: Clio Grows Up: From Research Article Name: Clio Grows Up: From Research Prototype to Industrial ToolPrototype to Industrial Tool
Name: DH(Dong Hwi) kwakName: DH(Dong Hwi) kwakDate: 10/26/05 4:00 pmDate: 10/26/05 4:00 pm
IntroductionIntroduction
Mapping between different Mapping between different representations of data are important.representations of data are important.
Because it is necessary to integrate Because it is necessary to integrate and exchange of data residing at a and exchange of data residing at a multiple sites, in different formats (or multiple sites, in different formats (or schemas), and even under different schemas), and even under different data models (such as relational or XML)data models (such as relational or XML)
Data Exchange vs. Data Data Exchange vs. Data IntegrationIntegration
Data Exchange (or data translation): Data Exchange (or data translation): The task of restructuring data from a The task of restructuring data from a source format (or schema) into a source format (or schema) into a target format(or schema).target format(or schema).
Data Integration (or federation) : The Data Integration (or federation) : The ability to query a set of ability to query a set of heterogeneous data sources via a heterogeneous data sources via a virtual unified target schema.virtual unified target schema.
Relationship or MappingRelationship or Mapping
In both cases (data exchange, data In both cases (data exchange, data integration), relationship or mapping integration), relationship or mapping must first be established between the must first be established between the source schema and the target schema.source schema and the target schema.
Two complementary levelTwo complementary level Syntactic one: schema matchingSyntactic one: schema matching Operational one: instances over the Operational one: instances over the
source schema with instances over target source schema with instances over target schemaschema
Move actual data from source to a targetMove actual data from source to a target Answer queriesAnswer queries
Clio SystemClio System
Two important componentTwo important component Mapping Generation ComponentMapping Generation Component
Takes as input correspondences between the source and Takes as input correspondences between the source and target schemastarget schemas
Generates a schema mapping consisting of Generates a schema mapping consisting of logical logical mappingmapping that provide an interpretation of the given that provide an interpretation of the given correspondencescorrespondences
Query Generation ComponentQuery Generation Component To convert a set of logical mappings into an executable To convert a set of logical mappings into an executable
transformation scripttransformation script SQL, SQL/XML, XQuery and XSLTSQL, SQL/XML, XQuery and XSLT
The user can interact with the system through GUI The user can interact with the system through GUI during a designduring a design
The user can view, add, and remove The user can view, add, and remove correspondencescorrespondences
Clio SystemClio System Two main Two main
componentcomponent Mapping Mapping
GenerationGeneration Query Query
GenerationGeneration
Mapping & Query Mapping & Query GenerationGeneration
Figure 2 illustrates an Figure 2 illustrates an actual Clio screenshot actual Clio screenshot showing portions of showing portions of two gene expression two gene expression schemas and schemas and correspondencescorrespondences
Left: Source schemaLeft: Source schema Relational schemaRelational schema
Right: Target schemaRight: Target schema XML schemaXML schema
Mapping & Query Mapping & Query GenerationGeneration
Mapping GenerationMapping Generation For any two related elements in the source For any two related elements in the source
schema, for which there exist correspondences schema, for which there exist correspondences into two related elements in the target schema.into two related elements in the target schema.
FACTOR_NAMEFACTOR_NAME and and BIOLOGY_DESCBIOLOGY_DESC: There is a foreign key : There is a foreign key that links the that links the EXPERIMENTFACOREXPERIMENTFACOR table to the table to the EXPERIMENTSETEXPERIMENTSET table table
biology_descbiology_desc and and factor_namefactor_name: the latter is a child of : the latter is a child of the former in XML schema.the former in XML schema.
Therefore there will be a mapping that maps related Therefore there will be a mapping that maps related instances of instances of FACTOR_NAMEFACTOR_NAME and and BIOLOGY_DESC BIOLOGY_DESC into into related instances related instances biology_descbiology_desc and and factor_namefactor_name
Mapping & Query Mapping & Query GenerationGeneration
Generation of Generation of tableauxtableaux The first step of the The first step of the
algorithm is to algorithm is to generate all the basic generate all the basic ways in which ways in which elements relate to elements relate to each other within one each other within one schemaschema
In figure 3 we show In figure 3 we show several of the source several of the source and target tableaux and target tableaux that are generated that are generated for our examplefor our example
Mapping & Query Mapping & Query GenerationGeneration
Generation of logical Generation of logical mappingmapping The second step of The second step of
the algorithm is the the algorithm is the generation of logical generation of logical mappingmapping
The basic algorithm The basic algorithm pairs all the existing pairs all the existing tableaux in the tableaux in the source with the source with the existing tableaux in existing tableaux in the target, and find the target, and find the correspondences the correspondences that are covered by that are covered by each paireach pair
MM11 is obtained from (S is obtained from (S22, T, T22)) MM33 is obtained from the pair(S is obtained from the pair(S44, T, T44))
Mapping & Query Mapping & Query GenerationGeneration
Mapping languageMapping language Skolem functions: This functions can Skolem functions: This functions can
explicitly represent target elements for explicitly represent target elements for which no source value is givenwhich no source value is given
For example, the mapping m1 will not For example, the mapping m1 will not specify a value for the specify a value for the @id@id attribute attribute under under exp_factorexp_factor
A Skolem function creates a unique A Skolem function creates a unique value for this attributesvalue for this attributes
Mapping & Query Mapping & Query GenerationGeneration
Query GenerationQuery Generation Each logical mapping is compiled into a Each logical mapping is compiled into a
query graphquery graph For each logical mapping, query For each logical mapping, query
generators walk the relevant part of the generators walk the relevant part of the target schema and create the necessary target schema and create the necessary join and grouping conditionjoin and grouping condition
In Figure 5, the XQuery fragment In Figure 5, the XQuery fragment produces the target <exp_set>produces the target <exp_set>
Mapping & Query Mapping & Query GenerationGeneration
Lines 1-4 in Figure 5 Lines 1-4 in Figure 5 implements join of two implements join of two tables tables EXPERIMENTFACOREXPERIMENTFACOR and and EXPERIMENTSETEXPERIMENTSET
Lines 7-10 output the Lines 7-10 output the attributes within attributes within exp_setexp_set
Lines 25-31 produce an Lines 25-31 produce an actual actual exp_factorexp_factor elementelement
Line 30 crates a value Line 30 crates a value for the for the idid element element
Practical ChallengesPractical Challenges(Mapping Generation)(Mapping Generation)
Redundancy CheckRedundancy Check If there a is one-to-one mapping of the If there a is one-to-one mapping of the
variables of Tvariables of T11 into the T into the T22, T, T11 is a sub- is a sub-tableau of Ttableau of T22
Can significantly reduce the amount of Can significantly reduce the amount of irrelevant mappingsirrelevant mappings
Practical ChallengesPractical Challenges(Mapping Generation)(Mapping Generation)
Hybrid AlgorithmHybrid Algorithm There are two phases in mapping There are two phases in mapping
generation (precomputation, insertion of generation (precomputation, insertion of correspondences)correspondences)
The separation enables users to speed up The separation enables users to speed up the addition of correspondences in the GUIthe addition of correspondences in the GUI
The disadvantage of this separation might The disadvantage of this separation might be occurred when the schemas are largebe occurred when the schemas are large
A large amount of memory may be needed to A large amount of memory may be needed to hold all the data structurehold all the data structure
Practical ChallengesPractical Challenges(Mapping Generation)(Mapping Generation)
Hybrid AlgorithmHybrid Algorithm The main idea behind this algorithm is to The main idea behind this algorithm is to
precompute only a bounded number of source precompute only a bounded number of source tableaux and target tableauxtableaux and target tableaux
Sometimes correspondences between elements Sometimes correspondences between elements may fail because of the deeper schema treesmay fail because of the deeper schema trees
Generate a source tableau that includes all the Generate a source tableau that includes all the set-type elements -> The tableaux are closed set-type elements -> The tableaux are closed under the chase. -> Thus including all the other under the chase. -> Thus including all the other schema elements associated via foreign keyschema elements associated via foreign key
The data structures holding the sub-tableaux The data structures holding the sub-tableaux and the sub-skeleton relationship are updated.and the sub-skeleton relationship are updated.
The algorithm may lose its completeness (small The algorithm may lose its completeness (small price)price)
Practical ChallengesPractical Challenges(Mapping Generation)(Mapping Generation)
Performance evaluation: mapping MAGE-MLPerformance evaluation: mapping MAGE-ML MAGE-ML is a complex XML schemaMAGE-ML is a complex XML schema Two experiments are performedTwo experiments are performed
Control the nesting level of the precomputed tableaux Control the nesting level of the precomputed tableaux (maximum 6 nested level of sets), no limits on the total (maximum 6 nested level of sets), no limits on the total number of precomputed tableaux (Lower bound)number of precomputed tableaux (Lower bound)
Control the nesting level of the precomputed tableaux Control the nesting level of the precomputed tableaux (maximum 6 nested level of sets), the total number of (maximum 6 nested level of sets), the total number of precomputed tableaux(maxium 110 per schema precomputed tableaux(maxium 110 per schema (actual improvement)(actual improvement)
Practical ChallengesPractical Challenges(Mapping Generation)(Mapping Generation)
First experimentFirst experiment Load the MAGE-ML (source): less than 1 secLoad the MAGE-ML (source): less than 1 sec Precompute all(1030) tableaux: 2.6 secPrecompute all(1030) tableaux: 2.6 sec Computing subtableaux relationship: 74 secComputing subtableaux relationship: 74 sec Memory to hold the data structure: 335 MBMemory to hold the data structure: 335 MB Loading the MAGE-ML schema: run out of memoryLoading the MAGE-ML schema: run out of memory
Second experimentSecond experiment Precomputation of the tableaux(116 now): 0.5 secPrecomputation of the tableaux(116 now): 0.5 sec Computing subtableaux relationship: 0.7 secComputing subtableaux relationship: 0.7 sec Memory to hold the data structure: 163 MBMemory to hold the data structure: 163 MB The amount of memory needed to hold everything: 251 The amount of memory needed to hold everything: 251
MBMB Overall, the performance of Overall, the performance of hybrid algorithmhybrid algorithm is is
quite acceptable.quite acceptable.
Practical ChallengesPractical Challenges(Query Generation: Deep Union)(Query Generation: Deep Union)
There are two drawbacksThere are two drawbacks There is no duplicate removal within and among There is no duplicate removal within and among
query fragmentsquery fragments There is no grouping of data among query There is no grouping of data among query
fragmentsfragments (OrderID, ItemID)(OrderID, ItemID)
Input data :{(oInput data :{(o11,i,i11),(o),(o11,i,i22)})} Ouput data: {(oOuput data: {(o11,(i,(i11,i,i22)),(o)),(o11,(i,(i11,i,i22))}))} Second query: {(oSecond query: {(o11,(i,(i33))}))}
We would expect this second tuple to be merged We would expect this second tuple to be merged with previous result and produce only one tuple with previous result and produce only one tuple for ofor o11 with {(o with {(o11,(i,(i11,i,i2, 2, ii33)} We call this special )} We call this special union operation is union operation is deep uniondeep union
Remaining ChallengesRemaining Challenges
Complex mappings are need a more Complex mappings are need a more expressive correspondence selection expressive correspondence selection mechanism than that supported by Cliomechanism than that supported by Clio
Exploring the need for logical mapping Exploring the need for logical mapping that nest other logical mapping insidethat nest other logical mapping inside
Mapping adaptation issues when source Mapping adaptation issues when source and target schemas changeand target schemas change