A Domain Specific Model for Generating ETL Workflows from ...


1

Wesley Deneke

A Domain Specific Model for Generating ETL

Workflows from Business Intents

Thesis directors

Wing-Ning Li

Craig Thompson

Thesis committee

Gordon Beavers

Rick Couvillion

2

Outline

Problem Context

Thesis and Objectives

Approach

Prototype

Results

Contributions and Future Work

3

The Big Picture

“Big Data”

4

How to leverage this data?

Data processing

Raw data ➔ Knowledge

5

How to leverage this data?

Data Sources

Diverse

Large

Changing

Processing Tasks

Objective specific

Manipulate data

6

Extract Transform Load (ETL)

Extract

Transform

Load

ETL Tools

7

Extract Transform Load (ETL)

Solutions represented as workflows.

8

Workflow Representation

Directed Acyclic Graph

G is a pair (V, E), where:

V is a set of vertices

E is a set of ordered pairs of vertices that denote directed edges connecting vertex pairs

G contains no directed cycles
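A minimal sketch of this definition in Python, assuming a workflow is given as a list of vertices and a list of ordered pairs. The function name `topological_order` and the three-step example workflow are illustrative and not taken from the thesis. It uses Kahn's algorithm to produce an execution order and to reject graphs that contain a cycle.

```python
from collections import deque

def topological_order(vertices, edges):
    """Return the vertices of G = (V, E) in an order that respects every edge.

    Raises ValueError if the graph contains a cycle (i.e. it is not a DAG).
    """
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start with every vertex that has no incoming edge.
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for nxt in successors[v]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(vertices):
        raise ValueError("graph has a cycle, so it is not a valid workflow DAG")
    return order

# Hypothetical three-step workflow: extract -> transform -> load.
V = ["extract", "transform", "load"]
E = [("extract", "transform"), ("transform", "load")]
order = topological_order(V, E)
```

Any order returned this way runs each operator only after all of its upstream operators.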

9

ETL Workflows

Data representation

Data Set

Data

Records

Data

Field

10

ETL Workflows

Operators

Options

Input

Fields

Output

Fields
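One way to picture the pieces named on the last two slides (data set, records, fields, and operators with options, input fields, and output fields) is the sketch below. The class layout, the `run` method, and the `UpperCase` operator are assumptions made for illustration. They do not describe any particular ETL tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, str]   # field name -> value
DataSet = List[Record]    # a data set is a list of records

@dataclass
class Operator:
    name: str
    input_fields: List[str]
    output_fields: List[str]
    # transform(record, options) returns a dict holding the output field values
    transform: Callable[[Record, Dict[str, str]], Record]
    options: Dict[str, str] = field(default_factory=dict)

    def run(self, data: DataSet) -> DataSet:
        result = []
        for record in data:
            missing = [f for f in self.input_fields if f not in record]
            if missing:
                raise KeyError(f"{self.name}: missing input fields {missing}")
            produced = self.transform(record, self.options)
            # Keep the original fields and append the declared output fields.
            result.append({**record, **{f: produced[f] for f in self.output_fields}})
        return result

# Hypothetical operator: upper-cases the field named by the "source" option.
def upper_case(record, options):
    return {"name_upper": record[options["source"]].upper()}

op = Operator("UpperCase", ["name"], ["name_upper"], upper_case, {"source": "name"})
out = op.run([{"name": "ada"}])
```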

11

ETL Workflows

12

Problem

Analysis

Design

Construction

Verification

Maintenance

ETL Workflow Specification

18

Problem

Analysis

Design

Construction

Verification

Maintenance

ETL Workflow Specification

Time consuming

Required expertise

Error prone

19

Problem

Analysis

Design

Construction

Verification

Maintenance

ETL Workflow Specification

Time consuming

Required expertise

Error prone

Operator logic

Workflow standards

Business rules

21

Problem

Automate the process of ETL workflow specification

22

Thesis Statement

ETL workflow specification can be automated in an extensible manner by translating high-level statements of intent into a set of ETL workflow requirements and generating ETL workflow solutions that accomplish these specifications.

23

Objectives

Better solution accuracy

Lower required level of expertise

Faster turnaround

Fewer errors

24

Approach

Out of Scope:

Field characterization

Unknown data resources

Unknown operators

Assumptions:

Sources and sinks given

Single data source

Properly characterized

Flat files of data records

Homogeneous data

Known state

Operators given

Well-defined

25

Approach

An extensible framework for creating domain-specific modeling languages that let users express the intent of a desired ETL solution at a high level of abstraction, and that automatically generates workflows satisfying those specifications.

26

ETL Domain Knowledge

Not uniform across domains

Not guaranteed to remain static

Capture ETL domain knowledge in a formal representation.

Domain-specific

27

Domain

US

Canada

Retail

Financial

Client

Client

A means of constraining the set of considerations.

28

Workflow ∪ State

S = (Q, Σ, δ, q0, F):

Q - set of states

Σ - set of valid input symbols

δ - set of state transitions

q0 - start state

F - set of accepting states
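The five-tuple above can be sketched as a small deterministic state machine. This reading is an assumption: data states play the role of Q, operator names play the role of input symbols, and a workflow is accepted when it drives the data from the start state to an accepting state. The state and operator names below are hypothetical.

```python
def accepts(transitions, start, accepting, symbols):
    """Run the machine over a sequence of input symbols.

    transitions maps (state, symbol) -> next state; a missing entry means
    the symbol is not valid in that state, so the sequence is rejected.
    """
    state = start
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in accepting

# Hypothetical data states and operators, for illustration only.
TRANSITIONS = {
    ("raw", "parse"): "parsed",
    ("parsed", "standardize"): "standardized",
}
START = "raw"
ACCEPTING = {"standardized"}

ok = accepts(TRANSITIONS, START, ACCEPTING, ["parse", "standardize"])
bad = accepts(TRANSITIONS, START, ACCEPTING, ["standardize"])
```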

29

Attributed Field

The data each field contains can be described at a higher level of abstraction.

Name

Data Type: varchar, int, float, timestamp

30

Attributed Field

Content Type

High-level concepts used to categorize data.

Semantic relationships

31

Attributed Field

State Attributes

Distinguishable qualities that data may possess.
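The four parts of an attributed field (name, data type, content type, state attributes) might be modeled as below. This is a sketch under assumptions: the class and method names are invented here, and treating a dotted content type such as `Address.Primary` as a specialization of `Address` is only one possible reading of "semantic relationships".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributedField:
    name: str                      # physical field name, e.g. a column header
    data_type: str                 # e.g. "varchar", "int", "float", "timestamp"
    content_type: str              # high-level concept, e.g. "Address.Primary"
    state: frozenset = frozenset() # state attributes, e.g. {"Standardized"}

    def with_state(self, *attributes):
        """Return a copy of this field with extra state attributes added."""
        return AttributedField(
            self.name, self.data_type, self.content_type,
            self.state | frozenset(attributes),
        )

    def is_a(self, content_type):
        """True if this field's content type equals or specializes the given one.

        Assumption: dotted names form a hierarchy (Address.Primary is an Address).
        """
        return (self.content_type == content_type
                or self.content_type.startswith(content_type + "."))

# Hypothetical field, for illustration only.
addr = AttributedField("addr1", "varchar", "Address.Primary")
standardized = addr.with_state("Standardized")
```

Keeping the class immutable means an operator never changes a field in place. It produces a new field description, which makes successive data states easy to compare.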

32

Attributed Field

33

Operators

Preconditions and Postconditions

34

Preconditions

Assertions that must be true prior to execution to guarantee the result produced.

Predicate expressions representing valid input

35

Preconditions

Required:

(Input1 && Input2) ||

(Input1 && Input3) ||

(Input3 && Input4 && Input6) ||

(Input3 && Input4 && Input5 && Input6)

Unparsed Name

Unparsed Address

Parsed Address

First Name

Middle Name

Last Name

Y:

Input1::Name.Full || Input3::Address.Primary +Corrected

Input1::Name.Full && Input2::Address.PrimaryStandardized

Input1::Name.Full && Input2::Address.Primary +Standardized
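The "Required" expression on this slide is a disjunction of conjunctions over operator inputs. A minimal way to evaluate such an expression, assuming each input is simply either bound to a field or not, is to store each conjunction as a set and test whether any of them is covered. The function name `satisfied` is illustrative.

```python
# Each inner set is one conjunction; the list as a whole is their disjunction.
REQUIRED = [
    {"Input1", "Input2"},
    {"Input1", "Input3"},
    {"Input3", "Input4", "Input6"},
    {"Input3", "Input4", "Input5", "Input6"},
]

def satisfied(required, bound_inputs):
    """True if at least one conjunction is fully covered by the bound inputs."""
    bound = set(bound_inputs)
    return any(clause <= bound for clause in required)

name_and_address = satisfied(REQUIRED, ["Input1", "Input3"])
name_only = satisfied(REQUIRED, ["Input1"])
```

In this form the fourth clause is redundant: any binding that satisfies it also satisfies the third.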

36

Postconditions

Assertions that will be true after execution, provided that the preconditions are satisfied.

37

Postconditions

Input1::Address.Primary +Standardized

Input1::Address.Primary +Standardized +Verified

Input1::Name.Full +Corrected

Input1::Name.Full +Filtered
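Postconditions of this shape add state attributes to a content type. The sketch below assumes a data state is a mapping from content type to a set of state attributes, and that applying an operator yields a new state rather than modifying the old one. The function name and the state layout are assumptions, not the thesis's implementation.

```python
def apply_postconditions(state, postconditions):
    """Return a new state with the operator's added attributes merged in.

    state:          dict of content type -> frozenset of state attributes
    postconditions: dict of content type -> iterable of attributes to add
    """
    new_state = dict(state)
    for content_type, added in postconditions.items():
        current = new_state.get(content_type, frozenset())
        new_state[content_type] = current | frozenset(added)
    return new_state

before = {"Address.Primary": frozenset(), "Name.Full": frozenset()}
after = apply_postconditions(
    before, {"Address.Primary": {"Standardized", "Verified"}}
)
```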

38

Workflow Engine

AI Planner

Initial State ➔ Source Data

Goal State ➔ Target State of Data

State Transitions ➔ Available Operators

Planning strategies:

Depth First

Breadth First

A*
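The planning formulation above can be sketched with a breadth-first search. Assumptions: a state is a set of facts, each operator has a set of required facts (preconditions) and a set of added facts (postconditions), and the goal is reached when all goal facts are present. The operator names and facts are hypothetical and are not the operators used in the prototype.

```python
from collections import deque

def plan_breadth_first(initial, goal, operators):
    """Return the shortest list of operator names that reaches the goal, or None.

    operators maps name -> (preconditions, postconditions), both sets of facts.
    """
    start = frozenset(initial)
    goal = frozenset(goal)
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for name, (pre, post) in operators.items():
            if frozenset(pre) <= state:          # operator is applicable
                nxt = state | frozenset(post)    # apply its postconditions
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [name]))
    return None

# Hypothetical operators, for illustration only.
OPERATORS = {
    "ParseAddress": ({"Address.Unparsed"}, {"Address.Primary"}),
    "Standardize": ({"Address.Primary"}, {"Address.Primary+Standardized"}),
    "Verify": ({"Address.Primary+Standardized"}, {"Address.Primary+Verified"}),
}

plan = plan_breadth_first(
    {"Address.Unparsed"}, {"Address.Primary+Verified"}, OPERATORS
)
```

Breadth-first search returns a plan with the fewest operators. Depth-first search and A* explore the same state space in a different order.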

39

Intent Language

Need an intuitive goal specification

Express in terms of the given domain

Familiar terminology

Understandable to users

Mapping:

High-level ➔ Low-level

Intent ➔ Goal State
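A minimal sketch of the intent-to-goal-state mapping, assuming each intent expands to a set of (content type, state attribute) pairs and that several intents combine by union. The intent names appear later in the deck, but the goal facts assigned to them here are invented for illustration.

```python
# Hypothetical mapping: which facts each intent requires in the goal state.
INTENT_GOALS = {
    "Address hygiene": {("Address.Primary", "Standardized")},
    "Validate names": {("Name.Full", "Corrected")},
}

def goal_state(intents, intent_goals=INTENT_GOALS):
    """Translate high-level intents into one low-level goal state."""
    goal = set()
    for intent in intents:
        if intent not in intent_goals:
            raise KeyError(f"unknown intent: {intent!r}")
        goal |= intent_goals[intent]
    return goal

goal = goal_state(["Address hygiene", "Validate names"])
```

Because the mapping is plain data, supporting a new domain would mean supplying a different table without changing the planner.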

40

Prototype

Characteristics:

Database storage

Object-oriented

Loosely coupled

41

Assertions

Field Mapping Assertion

Field State Assertion

42

GUI

43

Workflow Engine

Goal & Input

Matches

Planning

Tree

Apply

Postconditions
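The loop suggested by this slide (match goal and inputs, expand the planning tree, apply postconditions) could be sketched as a depth-first enumeration of every operator sequence that reaches the goal. This is an assumption about the engine's behavior. The operators `A` and `B` and their facts are hypothetical.

```python
def enumerate_workflows(state, goal, operators, used=()):
    """Yield every operator sequence that reaches the goal from the given state.

    operators maps name -> (preconditions, postconditions), both sets of facts.
    Each operator is used at most once per sequence.
    """
    if goal <= state:                      # goal matched: one complete workflow
        yield list(used)
        return
    for name, (pre, post) in operators.items():
        if name in used or not pre <= state:
            continue                       # already used or not applicable
        if post <= state:
            continue                       # would add nothing new
        # Apply postconditions and descend one level in the planning tree.
        yield from enumerate_workflows(state | post, goal, operators, used + (name,))

# Two hypothetical independent operators that both read fact "x".
OPERATORS = {
    "A": ({"x"}, {"a"}),
    "B": ({"x"}, {"b"}),
}

workflows = list(enumerate_workflows({"x"}, {"a", "b"}, OPERATORS))
```

Independent operators can be ordered either way, so both orderings appear as separate workflows. This shows how the number of generated workflows can grow quickly with the number of operators.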

44

Atomic Operators

Operator Decomposition

Determine canonical form

45

Results

46

Operators

AddressEditCheck

AddressEnhance

AddressSelect

ContactLink

IndustryCode

NameEditCheck

Parser

PremiumAddress

47

Intents

Address hygiene

Premium address hygiene

Change of address

Premium change of address

Filter profanity

Validate names

Determine industry demographic

Validate addresses

Delivery sequencing

Geocode addresses

Link contacts

48

Test Scenario 1

Workflow operators:

Parser

IndustryCode

AddressEnhance

AddressSelect

NameEditCheck

Intents:

Determine industry demographic

Address hygiene

Validate names

Initial state:

Full name

Unparsed primary address

Unparsed city/state/zip

49

Test Scenario 2

Intents:

Determine industry demographic

Address hygiene

Link contacts

Validate addresses

Validate names

Initial state:

Full name

Unparsed primary address

Unparsed city/state/zip

Workflow operators:

Parser

IndustryCode

AddressEnhance

AddressSelect

ContactLink

AddressEditCheck

NameEditCheck

50

Analysis

Accuracy

Scenario 1: 384 workflows

Scenario 2: 7680 workflows

Consolidated: 16 workflows

Ease of use

Scenario 1: 3 intents

Scenario 2: 5 intents

Reduced time

Both < 1 minute

Error free

Assuming proper modeling

51

Contributions

Represent and enforce ETL domain knowledge.

Automatic generation of ETL workflow solutions using AI planning.

Mapping between workflow requirements and higher-level abstractions called “intents”.

52

Future Work

Operator verification

Correctness

Equivalence

Optimization

Data heritage

Generic set operators

Intermediate goals

Input mappings

Goal indexing

Caching

Nested intent statements

Intent relationships

Result filtering

53

Questions