Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates Alkis Simitsis...
-
date post
21-Dec-2015 -
Category
Documents
-
view
215 -
download
0
Transcript of Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates Alkis Simitsis...
Graph-Based Modeling of ETL Activities with Multi-Level
Transformations and Updates
Alkis Simitsis1, Panos Vassiliadis2, Manolis Terrovitis1, Spiros Skiadopoulos1
(1) National Technical University of Athens{asimi,mter,spiros}@dbnet.ece.ntua.gr
(2) University of [email protected]
DaWaK'05, Copenhagen, August 2005 2
Outline
Background & Motivation
Database updates and composite transformations
Zooming in and out the Architecture Graph
Conclusions and Future Work
DaWaK'05, Copenhagen, August 2005 3
Outline
Background & Motivation
Database updates and composite transformations
Zooming in and out the Architecture Graph
Conclusions and Future Work
DaWaK'05, Copenhagen, August 2005 4
Extract-Transform-Load (ETL)
Sources
Extract Transform & Clean
DW
Load
DSA
DaWaK'05, Copenhagen, August 2005 5
Add_SPK1
SUPPKEY=1
SK1
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SUPPKEY
$2€
COST DATE
DS.PS2 Add_SPK2
SUPPKEY=2
SK2
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SUPPKEYCOST DATE=SYSDATE
AddDate CheckQTY
QTY>0
U
DS.PS1
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
DIFF1
DS.PS_NEW1.PKEY,DS.PS_OLD1.PKEYDS.PS_NEW
1
DS.PS_OLD
1
DW.PARTSUPP
Aggregate1
PKEY, DAYMIN(COST)
Aggregate2
PKEY, MONTHAVG(COST)
V2
V1
TIME
DW.PARTSUPP.DATE,DAY
FTP1S1_PARTSU
PP
S2_PARTSUPP
FTP2
DS.PS_NEW
2
DIFF2
DS.PS_OLD
2
DS.PS_NEW2.PKEY,DS.PS_OLD2.PKEY
Motivation
Sources DW
DSA
DaWaK'05, Copenhagen, August 2005 6
Background
Traditional workflow modeling has treated workflows as graphs with control-flow semantics
We take advantage of the data-centric, script-based nature of ETL activities to model their internals as graphs, too, as a graph, which we call Architecture Graph
Our previous efforts [DMDW’02] handled simple cleanings and transformations and templates to simplify the definition of scenarios [CAiSE’03]
DaWaK'05, Copenhagen, August 2005 7
R
A1
A2
A3
A4
A1
A2
A3
A4
IN.A1
A1
A2
A3
A4
A1
A2
A3
A4
IN OUT
POPULATED_FIELD
IN
NotNull_A1
R
NotNull_A1 SK_A2
…
Legend:
Background [DMDW’02, CAiSE’03]
Black-box model: no semantics in the
graph
DaWaK'05, Copenhagen, August 2005 8
Why is it important?
What part of the scenario is affected if we delete an attribute? Which attributes/tables are involved in the population of an attribute?
Straightforward to follow the data propagation chain
How “good” is my design of the ETL scenario?Detection of inconsistencies
Detection of important, vulnerable or useless attributes
Well-defined quality measurement theory based on graphs
DaWaK'05, Copenhagen, August 2005 9
Graph-based modeling of data centric workflows is important!!!
PKEY
SUPPKEY
PKEY PKEY
SUPPKEY
PKEY
SUPPKEY
SKEY
AddSPK1 SK1IN INOUT OUT
PKEY
SK
SOURCE LOOKUP_PS
Vulnerable point in the scenario
Transitive: In Out Deg
SKEY 8 0 8
Avg/
attribute2.83 2.83 5.67
Avg/
entity5.67 5.67 11.33
DaWaK'05, Copenhagen, August 2005 10
Contribution
We incorporate update semantics in our graph-based modeling of ETL activities. We also cover complex transformations like negation, aggregation and self-joins.We introduce a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction.
Long version and ER’05: details for adding internal semantics
DaWaK'05, Copenhagen, August 2005 11
Outline
Background & Motivation
Database updates and composite transformations
Zooming in and out the Architecture Graph
Conclusions and Future Work
DaWaK'05, Copenhagen, August 2005 12
Updates and Transformations
In this paper, we consider adding graph modeling techniques for several kinds of activities:
Update (INS, UPD, DEL) activitiesAggregatesRules employing negation and aliasesFunctions
We use LDL++ as language that declaratively describes the semantics
DaWaK'05, Copenhagen, August 2005 13
Updates to the database
An update expression is of the formhead <- query part, update part
with the following semantics: 1. a query to the database for the tuples that abide by the
query part
2. we update the predicate of the update part as specified in the rule.
raise1(Name, Sal, NewSal) <-employee(Name, Sal), Sal = 1100, (a)NewSal = Sal * 1.1, (b)- employee(Name, Sal), (c)+ employee(Name, NewSal). (d)
DaWaK'05, Copenhagen, August 2005 14
Updates to the databaseraise1
inputFunctio
n_*output
Name
Sal
NewSal
Scale
Name
Sal
NewSal
Sal=
=1.1
=1100
Employee
Name
Sal
+/-
+
-
DaWaK'05, Copenhagen, August 2005 15
Updates to the database
A side-effect rule is treated as an activity, with the corresponding node. The output schema of the activity is derived from the structure of the predicate of the head of the rule.For every predicate with a + or – in the body of the rule, a respective provider edge from the output schema of the side-effect activity is assumed. For every predicate that appears in the rule without a + or – tag, we assume the respective input schema. Provider edges from this predicate towards these schemata are added as usual. The same applies for the attributes of the input and output schemata of the side effect activity.
DaWaK'05, Copenhagen, August 2005 16
Aggregation
Aggregation in LDL: 1. grouping of values to a bag and
2. application of an aggregate function over the values of the bag.
R16: aggregate1.a_in(skey,suppkey,date,qty,cost)<-dw.partsupp(skey,suppkey,date,qty,cost)
R17: temp(skey,day,<cost>) <-aggregate1.a_in(skey,suppkey,date,qty,cost).
R18: aggregate1.a_out(skey,day,min_cost) <-temp(skey,day,all_costs),aggr(min,all_costs,min_cost).
R19: v1(skey,day,min_cost) <-aggregate1.a_out(skey,day,min_cost).
DaWaK'05, Copenhagen, August 2005 17
Aggregationaggregate1
in temp aggr
head
skey
suppkey
<cost>
min_cost
day
min
all_costs
<>
min
g
day
qty
cost
skey
g
out
skey
min_cost
day
R17R18
head
=
=
DaWaK'05, Copenhagen, August 2005 18
Aggregation
Relations which create a set from the values of a field employ a pair of regulator edges through an intermediate node ‘<>’.
Provider relations for attributes used as groupers are tagged with ‘g’.
One of the attributes of the aggr function node consumes data from a constant that indicates which aggregate function should be used (e.g., avg, min, max).
DaWaK'05, Copenhagen, August 2005 19
Outline
Background & Motivation
Database updates and composite transformations
Zooming in and out the Architecture Graph
Conclusions and Future Work
DaWaK'05, Copenhagen, August 2005 20
Zooming in and out
a3
B.IN
B.OUT1
B.OUT2
S4
A B
D
C
E
S1 S2S3
S4
B.P
a1
a2
a1
a2
Activity Level
Schema Level
Attribute Level
There is a principled way of zooming in and out, in various levels of abstraction:
The attribute level
The schema level
The activity level
DaWaK'05, Copenhagen, August 2005 21
Zooming in and out
For each node x of the architecture graph G(V,E) representing a schema:
1. for each provider edge (xa,y) or (y,xa), involving an attribute of x and an entity y, external to x, introduce the respective provider edge between x and y (unless it already exists, of course);
2. remove the provider edges (xa,y) and (y,xa) of the previous step;
3. remove the nodes of the attributes of x and the respective part-of edges.
DaWaK'05, Copenhagen, August 2005 22
Zooming in and outaggregate1
in temp aggr
head
out
R17 R18
head
Gs
Ga
aggregate1
in temp aggr
head
skey
suppkey
<cost>
min_cost
day
min
all_costs
<>
min
g
day
qty
cost
skey
g
out
skey
min_cost
day
R17R18
head
=
=
DaWaK'05, Copenhagen, August 2005 23
Zooming in and outaggregate1
in temp aggr
head
skey
suppkey
<cost>
min_cost
day
min
all_costs
<>
min
g
day
qty
cost
skey
g
out
skey
min_cost
day
R17R18
head
=
=
aggregate1
in temp aggr
head
out
R17 R18
head
Gs
Ga
Measure Definition Ga Gs
Size Size(G) 7 25
Length Max provider path 3 2
Complexity 0.5*ext. edges + int. edges
8 36
Cohesion F_IN+F_OUT
F (IN+OUT)1 0.75
DaWaK'05, Copenhagen, August 2005 24
Outline
Background & Motivation
Database updates and composite transformations
Zooming in and out the Architecture Graph
Conclusions and Future Work
DaWaK'05, Copenhagen, August 2005 25
Summary
We have incorporated update semantics in our graph-based modeling of ETL activities. We also cover complex transformations like negation, aggregation and self-joins. We have introduced a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction
Long version and ER’05: internal semantics for activities and quality measures for the design of ETL activities
http://www.cs.uoi.gr/~pvassil/publications/2005_ER_AG/ETL_blueprints_long.pdf
DaWaK'05, Copenhagen, August 2005 26
This work is part of the Arktos II project
Future work includes research in
What-if analysis of ETL scenarios
Measures for the quality of the design of ETL scenarios
http://www.cs.uoi.gr/~pvassil/projects/arktos_II
On-going/Future WorkArktos II
DaWaK'05, Copenhagen, August 2005 27http://www.cs.uoi.gr/~pvassil/projects/arktos_II
Thank you!Arktos II
DaWaK'05, Copenhagen, August 2005 29
Vision – big picture
The major research goal is to be able to have a metadata repository that incorporates metadata
on the static part of an information system, i.e., tables, constraints, query forms, etcon the dynamic part, i.e., data-centric software modules
We invest on a graph-based modeling approach, based on the flexibility of graphs as modeling tools
DaWaK'05, Copenhagen, August 2005 30
Preliminaries
Data types
Constants
Attributes
RecordSets
Function types
Functions
R
$2€
1
PKEY
my$2€
Integer
DaWaK'05, Copenhagen, August 2005 31
Relationships
Instance-Of Relationships
Part-Of Relationships
Regulator Relationships
Provider Relationships
Derived Provider
Relationships
DaWaK'05, Copenhagen, August 2005 32
Activities
Name
Input Schemata
Output Schema
Rejections Schema
Parameter List
Output/Rejection Operational Semantics
OutputActivity
Parameters
Input 1
Input 2Rejected
Rows
DaWaK'05, Copenhagen, August 2005 33
Importance Metrics
Dependency: the in-degree of the node with respect to the provider edges;
Responsibility: the out-degree of the node with respect to the provider edges;
Degree: dependency + responsibility
Local vs. Transitive
DaWaK'05, Copenhagen, August 2005 34
Functions
Functions are treated as any other predicate in LDL, with the following special characteristics:The function involves a list of parameters, the last of which is the return value of the function.All function parameters referenced in the body of the rule either as homonyms with attributes, of other predicates or through equalities with such attributes, are linked through equality regulator relationships with these attributes.The return value is possibly connected to the output through a provider relationship (or with some other predicate of the body, through a regulator relationship).
DaWaK'05, Copenhagen, August 2005 35
Aliases & Negation
Alias relationships. An alias relationship is introduced whenever the same predicate appears in the same rule (e.g., in the case of a self-join). All the nodes representing these occurrences of the same predicate are connected through alias relationships to denote their semantic interrelationship. Note that due to the fact that intra-activity programs do not directly interact with external recordsets or activities, this practically involves the rare case of internal intermediate rules Negation. When a predicates appears negated in a rule body, then the respective part-of edge between the rule and the literal’s node is tagged with ‘⌐’. Note that negated predicates can appear only in the rule body.
DaWaK'05, Copenhagen, August 2005 36
Activity semantics in LDLR06:a_in1(pkey,suppkey,date,qty,cost)<- ps(pkey,suppkey,date,qty,cost).R07:a_in2(pkey,source,skey)<-
lookUp(l_pkey,source,l_skey), pkey=l_pkey, skey=l_skey,source=1.R08:a_out(pkey,suppkey,date,qty,cost,skey)<-
a_in1(pkey,date,qty,cost),a_in2(pkey,source,l_skey).
R09:D2E.a_in(skey,suppkey,date,qty,cost)<-
sk.a_out(pkey,suppkey,date,qty,cost,skey)
DaWaK'05, Copenhagen, August 2005 37
Activities
pkey
suppkey
date
qty
cost
DSA_PS
R06 head
SK
Program
R08
a_in1 a_outd2e.a_in
R09
pkey
suppkey
date
qty
cost
l_key
source
l_skey
lookup
R07 head
a_in2
pkey
source
skey
=
=
1
headhead
pkey
suppkey
date
skey
cost
skey
suppkey
date
qty
cost
cost