Data Exchange with Data-Metadata Translations

Post on 30-Dec-2015


Paolo Papotti (Università Roma Tre), Wang-Chiew Tan (UC Santa Cruz), Mauricio A. Hernández (IBM Almaden Research Center). Data Exchange with Data-Metadata Translations. VLDB 2008. August 24 -- Auckland, New Zealand. Data exchange scenarios may involve metadata transformations.

Transcript of Data Exchange with Data-Metadata Translations

Data Exchange with Data-Metadata Translations

Mauricio A. Hernández, IBM Almaden Research Center

Wang-Chiew Tan, UC Santa Cruz

Paolo Papotti, Università Roma Tre

VLDB 2008

August 24 -- Auckland, New Zealand

Data-Metadata Translations

• Data exchange scenarios may involve metadata transformations.

– E.g., Pivot/Unpivot in spreadsheets. [example from Miller98]

• Mapping systems support Data-to-Data transformations with fixed schemas.

• Goal: Extend mapping systems to support Data-Metadata Translations.

Mapping Systems

[Diagram: source schema S, target schema T, GUI, declarative (internal) representation, executable code (XSLT, XQuery, Java), data exchange from instance I to instance J]

• IBM Clio

• HepTox

• MS ADO.net

• Altova MapForce

• StylusStudio

• BEA Aqualogic

Outline

1. Data and Metadata translations

– Data-to-Data

– Metadata-to-Data

– Data-to-Metadata

2. Generation Algorithms

– Mapping Generation

– Query Generation

3. Results & Discussion

– Experiments

– Related Work

– Conclusion

Data-to-Data

Mappings

• Mapping Generation Algorithm: [PVMHF 2002]

– Input: Source and Target schemas, and correspondences.

– Output: declarative schema mapping

• For example:

Source: Rcd
  Sales: SetOf Rcd
    country
    region
    style
    shipdate
    units
    price

Target: Rcd
  CountrySales: SetOf Rcd
    country
    Sales: SetOf Rcd
      style
      shipdate
      units
      id

for $s in Source.Sales
exists $t in Target.CountrySales, $c in $t.Sales
where $t.country = $s.country and $c.style = $s.style and
      $c.shipdate = $s.shipdate and $c.units = $s.units
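As a rough illustration of what the Data-to-Data mapping above asks for, the following Python sketch builds a target instance in which every source sale appears under a CountrySales record with the same country. Two choices here are assumptions of the sketch, not something the mapping states: all sales of one country are grouped under a single CountrySales record, and the unconstrained target field id is left unset. The function name is hypothetical; the rows are the Source.Sales rows used later in the talk.

```python
from collections import defaultdict

def exchange_data_to_data(sales):
    """Build Target.CountrySales from Source.Sales (illustrative sketch)."""
    by_country = defaultdict(list)
    for s in sales:
        # $c.style = $s.style, $c.shipdate = $s.shipdate, $c.units = $s.units
        by_country[s["country"]].append({
            "style": s["style"],
            "shipdate": s["shipdate"],
            "units": s["units"],
            "id": None,  # assumption: the mapping does not constrain id
        })
    # $t.country = $s.country; one record per country is an assumed grouping
    return [{"country": c, "Sales": rows} for c, rows in by_country.items()]

source_sales = [
    {"country": "USA", "region": "East", "style": "Tee", "shipdate": "12-07", "units": 11, "price": 1200},
    {"country": "USA", "region": "East", "style": "Elec.", "shipdate": "12-07", "units": 12, "price": 3600},
    {"country": "USA", "region": "West", "style": "Tee", "shipdate": "01-08", "units": 10, "price": 1600},
    {"country": "UK", "region": "West", "style": "Tee", "shipdate": "02-08", "units": 12, "price": 2000},
]
country_sales = exchange_data_to_data(source_sales)
```

Note that region and price have no correspondence in the mapping, so they do not appear in the target.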

• Query Generation into multiple query languages:

– Input: a data-to-data schema mapping

– Output: a query script (XQuery, XSLT, SQL, etc.)

for $s in Source.Sales
exists $t in Target.CountrySales, $c in $t.Sales
where $t.country = $s.country and $c.style = $s.style and
      $c.shipdate = $s.shipdate and $c.units = $s.units

for $x0 in $doc/Source/Sales
return (
  <CountrySales>
    <country> { $x0/country/text() } </country>
    …

"State-of-the-art" Metadata-to-Data

How can we transform the following source data into the corresponding target?

Source: Rcd
  Sales: SetOf Rcd
    month
    USA
    UK
    Italy

Target: Rcd
  Sales: SetOf Rcd
    month
    country
    units

Source.Sales
month  USA  UK   Italy
Jan    120  223  89
Feb    83   168  56

Target.Sales
month  country  units
Jan    USA      120
Jan    UK       223
Jan    Italy    89
Feb    USA      83
Feb    UK       168
Feb    Italy    56

Schema mappings m1, m2, m3 (one per label: "USA", "UK", "Italy"):

m1: for $s in Source.Sales
    exists $t in Target.Sales
    where $t.month = $s.month and $t.country = "USA" and $t.units = $s.USA

m2: for $s in Source.Sales
    exists $t in Target.Sales
    where $t.month = $s.month and $t.country = "UK" and $t.units = $s.UK

m3: for $s in Source.Sales
    exists $t in Target.Sales
    where $t.month = $s.month and $t.country = "Italy" and $t.units = $s.Italy

Metadata-to-Data: Our solution

Source: Rcd
  Sales: SetOf Rcd
    month
    USA
    UK
    Italy

Target: Rcd
  Sales: SetOf Rcd
    month
    country
    units

Placeholder countries (label, value) on the source schema:

– Select the elements to group (USA, UK, Italy)

– Copy elements' labels

– Copy elements' values

(Same Source.Sales and Target.Sales instances as in the "State-of-the-art" example.)

MetadatA-Data (MAD) mapping:

for $s in Source.Sales, $c in {"USA", "UK", "Italy"}
exists $t in Target.Sales
where $t.month = $s.month and $t.country = $c and $t.units = $s.($c)

– {"USA", "UK", "Italy"}: set of labels (strings)

– $t.country = $c: $c is a label value

– $s.($c): dynamic selection of the source element
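The MAD mapping above can be read as a nested loop over source records and labels. The Python sketch below is one way to picture it, assuming records are modeled as dictionaries; the function name unpivot is hypothetical and is not part of the mapping language.

```python
def unpivot(source_sales, labels):
    """Sketch of: for $s in Source.Sales, $c in labels exists $t in Target.Sales ..."""
    target_sales = []
    for s in source_sales:
        for c in labels:                 # $c ranges over the set of labels
            target_sales.append({
                "month": s["month"],     # $t.month = $s.month
                "country": c,            # $t.country = $c (the label used as a value)
                "units": s[c],           # $t.units = $s.($c) (dynamic selection)
            })
    return target_sales

source_sales = [
    {"month": "Jan", "USA": 120, "UK": 223, "Italy": 89},
    {"month": "Feb", "USA": 83, "UK": 168, "Italy": 56},
]
target_sales = unpivot(source_sales, ["USA", "UK", "Italy"])
```

A single rule parameterized by the label set replaces the three per-country mappings m1, m2 and m3, so adding a country only changes the label set.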

Data-to-Metadata

Now we want to support the opposite operation [example from Miller98]

Source: Rcd
  StockTicker: SetOf Rcd
    time
    symbol
    price

Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols (dynamic element)
      label
      value

The target schema depends on the source data

We define a target template: Nested Dynamic Output Schemas (ndos)

Run-time: The dynamic element defines the target instance and the target schema.

Data-to-Metadata: Heterogeneous records

Consider this mapping and this source instance:

Source: Rcd
  StockTicker: SetOf Rcd
    time
    symbol
    price

Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols
      label
      value

Source Instance

StockTicker (time: 0900, symbol: MSFT, price: 27.20)
StockTicker (time: 0900, symbol: IBM, price: 120.00)
StockTicker (time: 0905, symbol: MSFT, price: 27.30)

There are two possible interpretations for the target ndos:

First alternative: Heterogeneous target records

Computed Target Instance

Stockquotes (time: 0900, MSFT: 27.20)
Stockquotes (time: 0900, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30)

Computed Target Schema

Target: Rcd
  Stockquotes: SetOf Rcd
    time
    symbols: Choice
      MSFT
      IBM

Data-to-Metadata: Homogeneous records

Second alternative: Homogeneous target records

Computed Target Instance

Stockquotes (time: 0900, MSFT: 27.20, IBM: null)
Stockquotes (time: 0900, MSFT: null, IBM: 120.00)
Stockquotes (time: 0905, MSFT: 27.30, IBM: null)

Computed Target Schema

Target: Rcd
  Stockquotes: SetOf Rcd
    time
    MSFT
    IBM

Data-to-Metadata: Homogeneous records

Homogeneous records: natural solution for the Relational data model

Stockquotes(time: 0900, MSFT: 27.20, IBM: null)
Stockquotes(time: 0900, MSFT: null, IBM: 120.00)
Stockquotes(time: 0905, MSFT: 27.30, IBM: null)

Homogeneity Constraint: "For every pair of tuples t1 and t2, if a is a label in t1, then a is a label in t2"

for $t1 in Target.Stockquotes, $t2 in Target.Stockquotes, $a in dom($t1)
exists $a' in dom($t2)
where $a = $a'

Heterogeneous records: natural solution for semi-structured data models (XSD, DTD, JSON)

Stockquotes(time: 0900, MSFT: 27.20)
Stockquotes(time: 0900, IBM: 120.00)
Stockquotes(time: 0905, MSFT: 27.30)
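The two interpretations and the homogeneity constraint can be sketched in Python as follows. This is only an illustration under the assumption that records are dictionaries and that null is modeled as None; the function names are hypothetical, and the sketch does not guard against a symbol whose name collides with the fixed label time.

```python
def pivot_heterogeneous(ticker):
    """Each target record has 'time' plus one field labeled by the symbol value."""
    return [{"time": r["time"], r["symbol"]: r["price"]} for r in ticker]

def homogenize(records):
    """Pad every record so that all records carry the same labels (None as null)."""
    labels = []
    for r in records:
        for a in r:
            if a not in labels:
                labels.append(a)
    return [{a: r.get(a) for a in labels} for r in records]

def is_homogeneous(records):
    """For every pair t1, t2: every label a in dom(t1) is also in dom(t2)."""
    return all(a in t2 for t1 in records for t2 in records for a in t1)

ticker = [
    {"time": "0900", "symbol": "MSFT", "price": 27.20},
    {"time": "0900", "symbol": "IBM", "price": 120.00},
    {"time": "0905", "symbol": "MSFT", "price": 27.30},
]
hetero = pivot_heterogeneous(ticker)
homo = homogenize(hetero)
```

Here is_homogeneous(hetero) is False and is_homogeneous(homo) is True, which matches the two computed target instances shown on the slides.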

MAD Mapping Generation

Source.Sales
country  region  style  shipdate  units  price
USA      East    Tee    12-07     11     1200
USA      East    Elec.  12-07     12     3600
USA      West    Tee    01-08     10     1600
UK       West    Tee    02-08     12     2000

Source: Rcd
  Sales: SetOf Rcd
    country
    region
    style
    shipdate
    units
    price

Target: Rcd
  ByShipdateCountry: SetOf Choice
    dates
      label1
      value1: Rcd
        countries
          label2
          value2: SetOf Rcd
            style
            units
            price

<ByShipDateCountry>
  <12-07>
    <USA> <style>Tee</style><units>11</units><price>1200</price> </USA>
    <USA> <style>Elec.</style><units>12</units><price>3600</price> </USA>
  </12-07>
  <01-08>
    <USA> <style>Tee</style><units>10</units><price>1600</price> </USA>
  </01-08>
  <02-08>
    <UK> <style>Tee</style><units>12</units><price>2000</price> </UK>
  </02-08>
</ByShipDateCountry>

MAD Mapping Generation

Three Steps:

1. Modify schemas with dynamic placeholders

2. Compile mappings

3. Simplify mapping

This is what we get from Clio [PVMHF 02]:

for $s in Source.Sales
exists $t in Target.ByShipdateCountry, $y in dates, $u in case $t of $y,
       $z in countries, $v in $u.($z)
where $y = $s.shipdate and $z = $s.country and
      $v.style = $s.style and $v.units = $s.units and $v.price = $s.price and
      $u.($z) = SK[$s.shipdate, $s.country]

Simplified mapping:

for $s in Source.Sales
exists $t in Target.ByShipdateCountry, $u in case $t of $s.shipdate,
       $v in $u.($s.country)
where $v.style = $s.style and $v.units = $s.units and $v.price = $s.price and
      $u.($s.country) = SK[$s.shipdate, $s.country]
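To make the nested MAD mapping above more concrete, here is a minimal Python sketch of the target instance it describes for the Source.Sales rows of this example. It assumes the target is modeled as nested dictionaries (date label, then country label, then a list of records); the nested keys stand in for the grouping term SK[$s.shipdate, $s.country], so all sales with the same shipdate and country land in the same set. The function name is hypothetical.

```python
def exchange_by_shipdate_country(sales):
    """Sketch of the simplified mapping: labels come from shipdate and country values."""
    target = {}  # date label -> {country label -> list of (style, units, price) records}
    for s in sales:
        u = target.setdefault(s["shipdate"], {})   # case $t of $s.shipdate
        v_set = u.setdefault(s["country"], [])     # $u.($s.country), one set per (shipdate, country)
        v_set.append({"style": s["style"], "units": s["units"], "price": s["price"]})
    return target

sales = [
    {"country": "USA", "region": "East", "style": "Tee", "shipdate": "12-07", "units": 11, "price": 1200},
    {"country": "USA", "region": "East", "style": "Elec.", "shipdate": "12-07", "units": 12, "price": 3600},
    {"country": "USA", "region": "West", "style": "Tee", "shipdate": "01-08", "units": 10, "price": 1600},
    {"country": "UK", "region": "West", "style": "Tee", "shipdate": "02-08", "units": 12, "price": 2000},
]
by_shipdate_country = exchange_by_shipdate_country(sales)
```

The result has the same shape as the ByShipDateCountry document: two records under 12-07/USA and one each under 01-08/USA and 02-08/UK.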

Query Generation: two-phase algorithm [PVMHF 02]

Phase 1: Q1 shreds the source instance I over relational views of the target schema

Phase 2: Q2 assembles the target instance J from the relational views

[Diagram: source instance I (conforms-to S), query Q1, relational views, query Q2, target instance J (conforms-to T)]

New Query Generation

Phase 1: Q1 shreds the source instance I over relational views of the target ndos

Phase 2: Q2 assembles the target instance J from the relational views

Q3 computes the target schema T

Q4 is the optional post-processing

[Diagram: source instance I (conforms-to S), query Q1, relational views, query Q2, target instance J (conforms-to T), query Q3, target schema T (conforms-to ndos), query Q4]
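The division of labor between Q1, Q2 and Q3 can be pictured with a toy Python pipeline over the StockTicker example. This is only a sketch under strong assumptions: the relational view is taken to be a flat list of (time, label, value) tuples, the computed schema is reduced to an ordered list of labels, and Q4 is left out because the slides describe it only as optional post-processing. None of the function names come from the talk.

```python
def q1_shred(ticker):
    """Phase 1 (assumed view layout): flatten the source into (time, label, value) tuples."""
    return [(r["time"], r["symbol"], r["price"]) for r in ticker]

def q2_assemble(view):
    """Phase 2: build target records, using the label column as a field name."""
    return [{"time": t, label: value} for t, label, value in view]

def q3_schema(instance):
    """Compute the target schema as the ordered list of labels seen in the instance."""
    labels = []
    for record in instance:
        for a in record:
            if a not in labels:
                labels.append(a)
    return labels

ticker = [
    {"time": "0900", "symbol": "MSFT", "price": 27.20},
    {"time": "0900", "symbol": "IBM", "price": 120.00},
    {"time": "0905", "symbol": "MSFT", "price": 27.30},
]
view = q1_shred(ticker)
instance_j = q2_assemble(view)
schema_t = q3_schema(instance_j)
```

The point of the sketch is that the schema is an output of the exchange: schema_t is derived from the data rather than fixed in advance.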

MAD Clio vs. Commercial Tools

[Chart: query execution time [s] (log scale, 1 to 100000) vs. number of distinct labels (0 to 600). Labels: Commercial Tool, MAD Clio, Naive query, Dynamic query, Static query, Optimized query.]

48 source labels (10 MB): naïve 183 s, dynamic 14 s, optimized 10 s

MAD Clio Performance

12 target labels (10 MB): naïve 590 s, optimized 80 s [1 phase: 3 s]

Some Related Work

• Lots of related work in the relational setting:

– FIRA/FISQL [Wyss, Robertson 2005] has an excellent survey.

– SchemaSQL [Lakshmanan, Sadri, Subramanian 1996], FIRA/FISQL [Wyss, Robertson 2005]

  • Extensions to SQL to handle metadata as data

  • Only relational dynamic output schemas

  • Language and semantics, NO transformations from GUI

• In XML settings:

– HepTox [BCHLP 2005], commercial mapping tools [Altova MapForce, MS ADO.net, StylusStudio, BEA (Oracle) Aqualogic]

  • No dynamic elements in the target

MAD Clio

[Diagram: source schema S, target schema T, GUI, declarative (internal) representation, executable code (XSLT, XQuery, Java)]

New construct to iterate over elements' labels: placeholder

Target schema can be incomplete: nested dynamic output schema (ndos)

New constructs for the mapping language

New mapping & query generation algorithms, including a query to generate the target schema.

Data exchange with data-metadata support: Data to Data is a special case

Thank you.

Questions?

Metadata-to-Metadata

Metadata to Metadata: placeholder to dynamic element

Source document:

...<properties name="price" lang="en-us" date="01-01-2008" ... > <pval>48.15</pval></properties> ...

Target document:

...<price value="48.15" lang="en-us" date="01-01-2008" ... /> ...

Source: Rcd
  properties: SetOf Rcd
    @name
    @lang
    @date
    …
    @format
    pval

Placeholder <<attrs>> (label, value)

Target: Rcd
  label1
  value1: SetOf Rcd
    @value
    label2
    value2

Dynamic elements: <<names>>, <<elems>>

for $x1 in Source.properties, $x2 in { @lang, @date, …, @format }
exists $y1 in Target.($x1.@name)
where $y1.@value = $x1.pval and $y1.($x2) = $x1.($x2)
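For the Metadata-to-Metadata example above, the effect of the mapping on one properties element can be sketched in Python with the standard xml.etree.ElementTree module. Several things are assumptions of this sketch: the wrapper element names Source and Target, the explicit attribute list passed in (the slide elides part of it with "…"), and the choice to skip attributes that are absent from a given source element. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def properties_to_elements(source_root, attr_labels):
    """Turn each <properties name="X" ...> into an element named X (illustrative sketch)."""
    target_root = ET.Element("Target")  # assumed wrapper element
    for x1 in source_root.findall("properties"):
        y1 = ET.SubElement(target_root, x1.get("name"))  # Target.($x1.@name)
        y1.set("value", x1.findtext("pval"))             # $y1.@value = $x1.pval
        for x2 in attr_labels:                           # $x2 ranges over attribute labels
            if x1.get(x2) is not None:                   # assumption: skip absent attributes
                y1.set(x2, x1.get(x2))                   # $y1.($x2) = $x1.($x2)
    return target_root

source_xml = (
    '<Source><properties name="price" lang="en-us" date="01-01-2008">'
    "<pval>48.15</pval></properties></Source>"
)
target = properties_to_elements(ET.fromstring(source_xml), ["lang", "date", "format"])
price = target.find("price")
```

The element name comes from source data (the @name value) while the copied attribute names come from source metadata, which is why the slide pairs a placeholder with a dynamic element.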