Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates Alkis Simitsis...

38
Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates Alkis Simitsis 1 , Panos Vassiliadis 2 , Manolis Terrovitis 1 , Spiros Skiadopoulos 1 (1) National Technical University of Athens {asimi,mter,spiros}@dbnet.ece.ntua.gr (2) University of Ioannina [email protected]
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    215
  • download

    0

Transcript of Graph-Based Modeling of ETL Activities with Multi-Level Transformations and Updates Alkis Simitsis...

Graph-Based Modeling of ETL Activities with Multi-Level

Transformations and Updates

Alkis Simitsis1, Panos Vassiliadis2, Manolis Terrovitis1, Spiros Skiadopoulos1

(1) National Technical University of Athens{asimi,mter,spiros}@dbnet.ece.ntua.gr

(2) University of [email protected]

DaWaK'05, Copenhagen, August 2005 2

Outline

Background & Motivation

Database updates and composite transformations

Zooming in and out the Architecture Graph

Conclusions and Future Work

DaWaK'05, Copenhagen, August 2005 3

Outline

Background & Motivation

Database updates and composite transformations

Zooming in and out the Architecture Graph

Conclusions and Future Work

DaWaK'05, Copenhagen, August 2005 4

Extract-Transform-Load (ETL)

Sources

Extract Transform & Clean

DW

Load

DSA

DaWaK'05, Copenhagen, August 2005 5

Add_SPK1

SUPPKEY=1

SK1

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SUPPKEY

$2€

COST DATE

DS.PS2 Add_SPK2

SUPPKEY=2

SK2

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SUPPKEYCOST DATE=SYSDATE

AddDate CheckQTY

QTY>0

U

DS.PS1

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

DIFF1

DS.PS_NEW1.PKEY,DS.PS_OLD1.PKEYDS.PS_NEW

1

DS.PS_OLD

1

DW.PARTSUPP

Aggregate1

PKEY, DAYMIN(COST)

Aggregate2

PKEY, MONTHAVG(COST)

V2

V1

TIME

DW.PARTSUPP.DATE,DAY

FTP1S1_PARTSU

PP

S2_PARTSUPP

FTP2

DS.PS_NEW

2

DIFF2

DS.PS_OLD

2

DS.PS_NEW2.PKEY,DS.PS_OLD2.PKEY

Motivation

Sources DW

DSA

DaWaK'05, Copenhagen, August 2005 6

Background

Traditional workflow modeling has treated workflows as graphs with control-flow semantics

We take advantage of the data-centric, script-based nature of ETL activities to model their internals as graphs, too, as a graph, which we call Architecture Graph

Our previous efforts [DMDW’02] handled simple cleanings and transformations and templates to simplify the definition of scenarios [CAiSE’03]

DaWaK'05, Copenhagen, August 2005 7

R

A1

A2

A3

A4

A1

A2

A3

A4

IN.A1

A1

A2

A3

A4

A1

A2

A3

A4

IN OUT

POPULATED_FIELD

IN

NotNull_A1

R

NotNull_A1 SK_A2

Legend:

Background [DMDW’02, CAiSE’03]

Black-box model: no semantics in the

graph

DaWaK'05, Copenhagen, August 2005 8

Why is it important?

What part of the scenario is affected if we delete an attribute? Which attributes/tables are involved in the population of an attribute?

Straightforward to follow the data propagation chain

How “good” is my design of the ETL scenario?Detection of inconsistencies

Detection of important, vulnerable or useless attributes

Well-defined quality measurement theory based on graphs

DaWaK'05, Copenhagen, August 2005 9

Graph-based modeling of data centric workflows is important!!!

PKEY

SUPPKEY

PKEY PKEY

SUPPKEY

PKEY

SUPPKEY

SKEY

AddSPK1 SK1IN INOUT OUT

PKEY

SK

SOURCE LOOKUP_PS

Vulnerable point in the scenario

Transitive: In Out Deg

SKEY 8 0 8

Avg/

attribute2.83 2.83 5.67

Avg/

entity5.67 5.67 11.33

DaWaK'05, Copenhagen, August 2005 10

Contribution

We incorporate update semantics in our graph-based modeling of ETL activities. We also cover complex transformations like negation, aggregation and self-joins.We introduce a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction.

Long version and ER’05: details for adding internal semantics

DaWaK'05, Copenhagen, August 2005 11

Outline

Background & Motivation

Database updates and composite transformations

Zooming in and out the Architecture Graph

Conclusions and Future Work

DaWaK'05, Copenhagen, August 2005 12

Updates and Transformations

In this paper, we consider adding graph modeling techniques for several kinds of activities:

Update (INS, UPD, DEL) activitiesAggregatesRules employing negation and aliasesFunctions

We use LDL++ as language that declaratively describes the semantics

DaWaK'05, Copenhagen, August 2005 13

Updates to the database

An update expression is of the formhead <- query part, update part

with the following semantics: 1. a query to the database for the tuples that abide by the

query part

2. we update the predicate of the update part as specified in the rule.

raise1(Name, Sal, NewSal) <-employee(Name, Sal), Sal = 1100, (a)NewSal = Sal * 1.1, (b)- employee(Name, Sal), (c)+ employee(Name, NewSal). (d)

DaWaK'05, Copenhagen, August 2005 14

Updates to the databaseraise1

inputFunctio

n_*output

Name

Sal

NewSal

Scale

Name

Sal

NewSal

Sal=

=1.1

=1100

Employee

Name

Sal

+/-

+

-

DaWaK'05, Copenhagen, August 2005 15

Updates to the database

A side-effect rule is treated as an activity, with the corresponding node. The output schema of the activity is derived from the structure of the predicate of the head of the rule.For every predicate with a + or – in the body of the rule, a respective provider edge from the output schema of the side-effect activity is assumed. For every predicate that appears in the rule without a + or – tag, we assume the respective input schema. Provider edges from this predicate towards these schemata are added as usual. The same applies for the attributes of the input and output schemata of the side effect activity.

DaWaK'05, Copenhagen, August 2005 16

Aggregation

Aggregation in LDL: 1. grouping of values to a bag and

2. application of an aggregate function over the values of the bag.

R16: aggregate1.a_in(skey,suppkey,date,qty,cost)<-dw.partsupp(skey,suppkey,date,qty,cost)

R17: temp(skey,day,<cost>) <-aggregate1.a_in(skey,suppkey,date,qty,cost).

R18: aggregate1.a_out(skey,day,min_cost) <-temp(skey,day,all_costs),aggr(min,all_costs,min_cost).

R19: v1(skey,day,min_cost) <-aggregate1.a_out(skey,day,min_cost).

DaWaK'05, Copenhagen, August 2005 17

Aggregationaggregate1

in temp aggr

head

skey

suppkey

<cost>

min_cost

day

min

all_costs

<>

min

g

day

qty

cost

skey

g

out

skey

min_cost

day

R17R18

head

=

=

DaWaK'05, Copenhagen, August 2005 18

Aggregation

Relations which create a set from the values of a field employ a pair of regulator edges through an intermediate node ‘<>’.

Provider relations for attributes used as groupers are tagged with ‘g’.

One of the attributes of the aggr function node consumes data from a constant that indicates which aggregate function should be used (e.g., avg, min, max).

DaWaK'05, Copenhagen, August 2005 19

Outline

Background & Motivation

Database updates and composite transformations

Zooming in and out the Architecture Graph

Conclusions and Future Work

DaWaK'05, Copenhagen, August 2005 20

Zooming in and out

a3

B.IN

B.OUT1

B.OUT2

S4

A B

D

C

E

S1 S2S3

S4

B.P

a1

a2

a1

a2

Activity Level

Schema Level

Attribute Level

There is a principled way of zooming in and out, in various levels of abstraction:

The attribute level

The schema level

The activity level

DaWaK'05, Copenhagen, August 2005 21

Zooming in and out

For each node x of the architecture graph G(V,E) representing a schema:

1. for each provider edge (xa,y) or (y,xa), involving an attribute of x and an entity y, external to x, introduce the respective provider edge between x and y (unless it already exists, of course);

2. remove the provider edges (xa,y) and (y,xa) of the previous step;

3. remove the nodes of the attributes of x and the respective part-of edges.

DaWaK'05, Copenhagen, August 2005 22

Zooming in and outaggregate1

in temp aggr

head

out

R17 R18

head

Gs

Ga

aggregate1

in temp aggr

head

skey

suppkey

<cost>

min_cost

day

min

all_costs

<>

min

g

day

qty

cost

skey

g

out

skey

min_cost

day

R17R18

head

=

=

DaWaK'05, Copenhagen, August 2005 23

Zooming in and outaggregate1

in temp aggr

head

skey

suppkey

<cost>

min_cost

day

min

all_costs

<>

min

g

day

qty

cost

skey

g

out

skey

min_cost

day

R17R18

head

=

=

aggregate1

in temp aggr

head

out

R17 R18

head

Gs

Ga

Measure Definition Ga Gs

Size Size(G) 7 25

Length Max provider path 3 2

Complexity 0.5*ext. edges + int. edges

8 36

Cohesion F_IN+F_OUT

F (IN+OUT)1 0.75

DaWaK'05, Copenhagen, August 2005 24

Outline

Background & Motivation

Database updates and composite transformations

Zooming in and out the Architecture Graph

Conclusions and Future Work

DaWaK'05, Copenhagen, August 2005 25

Summary

We have incorporated update semantics in our graph-based modeling of ETL activities. We also cover complex transformations like negation, aggregation and self-joins. We have introduced a systematic way of transforming the Architecture Graph to allow zooming in and out at multiple levels of abstraction

Long version and ER’05: internal semantics for activities and quality measures for the design of ETL activities

http://www.cs.uoi.gr/~pvassil/publications/2005_ER_AG/ETL_blueprints_long.pdf

DaWaK'05, Copenhagen, August 2005 26

This work is part of the Arktos II project

Future work includes research in

What-if analysis of ETL scenarios

Measures for the quality of the design of ETL scenarios

http://www.cs.uoi.gr/~pvassil/projects/arktos_II

On-going/Future WorkArktos II

DaWaK'05, Copenhagen, August 2005 27http://www.cs.uoi.gr/~pvassil/projects/arktos_II

Thank you!Arktos II

DaWaK'05, Copenhagen, August 2005 28

Backup Slides

DaWaK'05, Copenhagen, August 2005 29

Vision – big picture

The major research goal is to be able to have a metadata repository that incorporates metadata

on the static part of an information system, i.e., tables, constraints, query forms, etcon the dynamic part, i.e., data-centric software modules

We invest on a graph-based modeling approach, based on the flexibility of graphs as modeling tools

DaWaK'05, Copenhagen, August 2005 30

Preliminaries

Data types

Constants

Attributes

RecordSets

Function types

Functions

R

$2€

1

PKEY

my$2€

Integer

DaWaK'05, Copenhagen, August 2005 31

Relationships

Instance-Of Relationships

Part-Of Relationships

Regulator Relationships

Provider Relationships

Derived Provider

Relationships

DaWaK'05, Copenhagen, August 2005 32

Activities

Name

Input Schemata

Output Schema

Rejections Schema

Parameter List

Output/Rejection Operational Semantics

OutputActivity

Parameters

Input 1

Input 2Rejected

Rows

DaWaK'05, Copenhagen, August 2005 33

Importance Metrics

Dependency: the in-degree of the node with respect to the provider edges;

Responsibility: the out-degree of the node with respect to the provider edges;

Degree: dependency + responsibility

Local vs. Transitive

DaWaK'05, Copenhagen, August 2005 34

Functions

Functions are treated as any other predicate in LDL, with the following special characteristics:The function involves a list of parameters, the last of which is the return value of the function.All function parameters referenced in the body of the rule either as homonyms with attributes, of other predicates or through equalities with such attributes, are linked through equality regulator relationships with these attributes.The return value is possibly connected to the output through a provider relationship (or with some other predicate of the body, through a regulator relationship).

DaWaK'05, Copenhagen, August 2005 35

Aliases & Negation

Alias relationships. An alias relationship is introduced whenever the same predicate appears in the same rule (e.g., in the case of a self-join). All the nodes representing these occurrences of the same predicate are connected through alias relationships to denote their semantic interrelationship. Note that due to the fact that intra-activity programs do not directly interact with external recordsets or activities, this practically involves the rare case of internal intermediate rules Negation. When a predicates appears negated in a rule body, then the respective part-of edge between the rule and the literal’s node is tagged with ‘⌐’. Note that negated predicates can appear only in the rule body.

DaWaK'05, Copenhagen, August 2005 36

Activity semantics in LDLR06:a_in1(pkey,suppkey,date,qty,cost)<- ps(pkey,suppkey,date,qty,cost).R07:a_in2(pkey,source,skey)<-

lookUp(l_pkey,source,l_skey), pkey=l_pkey, skey=l_skey,source=1.R08:a_out(pkey,suppkey,date,qty,cost,skey)<-

a_in1(pkey,date,qty,cost),a_in2(pkey,source,l_skey).

R09:D2E.a_in(skey,suppkey,date,qty,cost)<-

sk.a_out(pkey,suppkey,date,qty,cost,skey)

DaWaK'05, Copenhagen, August 2005 37

Activities

pkey

suppkey

date

qty

cost

DSA_PS

R06 head

SK

Program

R08

a_in1 a_outd2e.a_in

R09

pkey

suppkey

date

qty

cost

l_key

source

l_skey

lookup

R07 head

a_in2

pkey

source

skey

=

=

1

headhead

pkey

suppkey

date

skey

cost

skey

suppkey

date

qty

cost

cost

DaWaK'05, Copenhagen, August 2005 38

Zooming in and out

A

A.IN1

A.IN2

P

A.OUT

A.REJ

P1

P2

A

A.IN1

A.IN2

P

A.OUT

A.REJ