ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS...

41
ETL

Transcript of ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS...

Page 1: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

ETL

Page 2: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Mission of ETL The back room must support 4 key steps

– Extracting data from original sources – Data Quality assuring & cleaning data – Conforming the labels & measures in the data to

achieve consistency across the original sources – Delivering the data in a physical format that can be used

by query tools and report writers

Page 3: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Data Quality • Problems when data doesn’t mean what you

think it does – garbage-in-garbage-out

• Or when you don’t understand the data – complexity, lack of metadata

• Data quality problems are expensive and pervasive – cost hundreds of $ billion each year – resolving data quality problems is often the biggest

effort in business intelligence

Page 4: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Examples

• Can we interpret this data? – What do the fields mean? – What is the key? The measures?

• Data problems – Typos, multiple formats, missing / default values

• Metadata and domain expertise – Field three is Revenue. In dollars or euro?

T.Das|97336o8327|24.95|Y|-|0.0|1000 Ted J.|973-360-8779|2000|N|M|NY|1000

Page 5: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Examples • Dummy Values

• a clerk enters 999-99-9999 as a SSN rather than asking the customer, customers 235 years old

• Absence of Data • NULL for customer age

• Cryptic Data • 0 for male, 1 for female

• Contradicting Data • customers with different

addresses

• Violation of Business Rules • NULL for customer name on

an invoice

• Reused Primary Keys, Non-Unique Identifiers • a branch bank is closed.

Several years later, a new branch is opened, and the old identifier is used again

• Data Integration Problems • Multipurpose Fields,

Synonyms

5

Page 6: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Data Quality Factors • Accuracy

– data was recorded correctly without errors – example: “Jhn” vs. “John”

• Completeness – all relevant data was recorded, no lack of data – no partial knowledge of the records in a table or of the attributes

in a record • Uniqueness

– entities are recorded once – Example: the same customer entered twice

• Timeliness/Currency – data is kept up to date – example: Residence (Permanent) Address: out-dated vs. up-to-

dated • Consistency

– data agrees with itself, no discrepancies in data – example: ZIP Code and City consistent

Page 7: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Causes of Poor Data Quality – Business rules do not exist or there are no

standards for data capture – Standards may exist but are not enforced at the

point of data capture – Inconsistent data entry (incorrect spelling, use of

nicknames, middle names, or aliases) occurs – Data entry mistakes (character transposition,

misspellings, and so on) happen – Integration of data from systems with different data

standards is present – Data quality issues are perceived as time-

consuming and expensive to fix 7

Page 8: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Causes of Poor Data Quality

Source: The Data Warehousing Institute, Data Quality and the Bottom Line, 2002 8

Page 9: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

07/Oct/2006 BITS-Pilani 9

“A Properly designed ETL system extracts data from the

source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions… ETL makes or breaks the data warehouse…” Ralph Kimball

ETL

Page 10: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

MS BI Architecture

10 Source: The Data Warehouse Backroom

Page 11: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Definition of ETL • Extraction Transformation Loading – ETL • Get data out of sources and load into the DW

– Data is extracted from OLTP database, transformed to match the DW schema and loaded into the data warehouse database

– May also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets

11

Page 12: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Complexity of ETL

• ETL is a complex combination of process and technology – The most underestimated process in DW

development – The most time-consuming process in DW

development • Often, 80% of development time is spent on ETL

12

Page 13: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Data marts: Mini-warehouses, limited in scope

E

T

L

Separate ETL for each independent data mart

Data access complexity due to multiple data marts

Independent data marts

13 Periodic extraction data is not completely fresh in warehouse

Page 14: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

E T

L

Single ETL for enterprise data warehouse (EDW)

Simpler data access

ODS provides option for obtaining current data

Dependent data marts loaded from EDW

Dependent data marts

14

Page 15: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Data Staging Area (DSA) • Transit storage for data underway in the ETL process

– Creates a logical and physical separation between the source systems and the data warehouse

– Transformations/cleansing done here • No user queries (some do it) • Sequential operations (few) on large data volumes

– Performed by central ETL logic – Easily restarted – RDBMS or flat files? (DBMS have become better)

• Allows centralized backup/recovery – Often too time consuming to load all data marts from

scratch after failure – Backup/recovery facilities needed 15

Page 16: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

16

• To stage or not to stage? • A conflict between

– getting the data from the operational systems as fast as possible

– having the ability to restart without repeating the process from the beginning

• Reasons for staging – Recoverability: stage the data as soon as it has been

extracted from the source systems and immediately after major processing (cleaning, transformation, etc).

– Backup: can reload the data warehouse from the staging tables without going to the sources

– Auditing: lineage between the source data and the underlying transformations before the load to the data warehouse

Data Staging Area (DSA)

Page 17: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract

Capture/Extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse Extract

17

Page 18: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

• ETL process needs to effectively integrate systems that have different: – DBMS – Operating Systems – Hardware – Communication protocols

• Need to have a logical data map before the physical

data can be transformed

• The logical data map describes the relationship between the starting points and the ending points of your ETL system – usually presented in a table or spreadsheet

Extraction

Page 19: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

• The content of the logical data mapping document has been proven to be the critical element required to efficiently plan ETL processes

• The table type gives us our queue for the ordinal position of our data load processes—first dimensions, then facts

• The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process

• The transformation can contain anything from the absolute solution to nothing at all

Target Source Transformation

Table Name Column Name Data Type Table Name Column Name

Data Type

Logical data map

Page 20: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Transform

• Main step where the ETL adds value • Changes data to be inserted in DW • 3 major steps

– Data Cleansing – Data Integration – Other Transformations (includes replacement

of codes, derived values, calculating aggregates)

20

Page 21: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

21

Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Cleanse

Page 22: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Typical Cleansing Tasks • Standardization

– changing data elements to a common format defined by the data steward

• Gender Analysis – returns a person’s gender based on the elements of their

name (returns M(ale), F(emale) or U(nknown)) • Entity Analysis

– returns an entity type based on the elements in a name (i.e. customer names) (returns O(rganization), I(ndividual) or U(nknown))

• De-duplication – uses match coding to discard records that represent the

same person, company or entity • Matching and Merging

– matches records based on a given set of criteria and at a given sensitivity level

22

Page 23: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Typical Cleansing Tasks • Address Verification

– verifies that addresses are valid • Householding/Consolidation

– used to identify links between records, such as customers living in the same household or companies belonging to the same parent company

• Parsing – identifies individual data elements and separates

them into unique fields • Casing

– provides the proper capitalization to data elements that have been entered in all lowercase or all uppercase

23

Page 24: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Who is the biggest supplier? Anderson Construction $ 2,333.50 Briggs,Inc $ 8,200.10 Brigs Inc. $12,900.79 Casper Corp. $27,191.05 Caspar Corp $ 6,000.00 Solomon Industries $43,150.00 The Casper Corp $11,500.00

...

Cleansing Example

24

Page 25: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Cleansing Example

25

Page 26: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Cleansing Example

0 10,000 20,000 30,000 40,000 50,000

$ Spent

Casper Corp.

Solomon Ind.

Briggs Inc.

Anderson Cons.

26

Page 27: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Operational System of Records Data Warehouse

01 Mark Carver SAS SAS Campus Drive Cary, N.C.

02 Mark W. Craver [email protected]

03 Mark Craver Systems Engineer SAS

Mark Carver SAS SAS Campus Drive Cary, N.C.

Mark W. Craver [email protected]

Mark Craver Systems Engineer SAS

Data Matching Example

27

Page 28: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

01 Mark Craver Systems Engineer SAS SAS Campus Drive Cary, N.C. 27513 [email protected]

Data Matching Example

Mark Carver SAS SAS Campus Drive Cary, N.C.

Mark W. Craver [email protected]

Mark Craver Systems Engineer SAS

Operational System of Records Data Warehouse

DQ

28

Page 29: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Transform

29

Transform = convert data from format of operational system to format of data warehouse

Record-level: Selection–data partitioning Joining–data combining Aggregation–data summarization

Field-level: single-field–from one field to one field multi-field–from many fields to one, or one field to many

Page 30: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

30

Single-field transformation

In general–some transformation function translates data from old form to new form

Algorithmic transformation uses a formula or logical expression

Table lookup–another approach, uses a separate table keyed by source record code

Page 31: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

31

Multifield transformation

M:1–from many source fields to one target field

1:M–from one source field to many target fields

Page 32: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Loading

32

Load/Index= place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse

Page 33: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Loading tools • Backup and Disaster Recovery

– Restart logic, Error detection • Continuous loads during a fixed batch

window (usually overnight) – Maximize system resources to load data

efficiently in allotted time • Parallelization

– Several tenths GB per hour

33

Page 34: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

Loading and SCD • The data loading module consists of all the steps

required to administer slowly changing dimensions (SCD) and write the dimension to disk as a physical table in the proper dimensional format with correct primary keys, correct natural keys, and final descriptive attributes

• Creating and assigning the surrogate keys occur in this module

• When DW receives notification that an existing row in dimension has changed it gives out 3 types of responses – Type 1 – Type 2 – Type 3 (not in MS SQLServer 2008)

Page 35: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

ETL Tools

• ETL tools from the big vendors – Oracle Warehouse Builder – IBM DB2 Warehouse Manager – Microsoft Integration Services

• 100’s others: – http://en.wikipedia.org/wiki/Extract,_transform,_load#Tools

35

Page 36: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

SQL Server Integration Services

36

Page 37: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

SSIS: Data Extraction

• In SSIS extractions are managed via source adapters, called Data Flow Sources – source adapters read metadata from

connection managers, – connect to the source, – read raw data from the source, and – forward it to the pipeline

37

Page 38: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

SSIS: Data Extraction Connection Manager

• A connection manager is a logical representation of a physical source

• Different types of sources need different kind of information – E.g., to connect to an Excel document,

information concerning filename and path needs to be stored, which data sheets are being accessed and if there are column headers or not in the document.

38

Page 39: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

SSIS: Data Cleansing

• Establishing baseline indicators that assesses data quality on data sets and decide what do to when these indicators are violated (3 options): – fix the data – discard the record – stop ETL process

• In SSIS environment baseline indicators and violation rules can be set up by branching data of different quality in a data flow pipeline: – using data flow transformations such as conditional splits and by

applying SSIS Expression Language (a language similar to C#) – use Control flow tasks like Send Mail to alert the ETL staff

39

Page 40: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

SSIS: Data Flows Transformations

• Using data flow transformations such as: – fuzzy lookups that finds close or exact

matches by comparing different columns

– fuzzy groupings, which can add values for data matching similarities

• Data Flow Transformations such as: – Merge Join combines sorted data flows – Character Map to convert character

mappings – Data Conversion applies transformation

functions

40

Page 41: ETL - sozialinformatik.de · ETL makes or breaks the data warehouse…” Ralph Kimball. ETL . MS BI Architecture . 10 Source: The Data Warehouse Backroom . Definition of ETL •

SSIS: Data Loading

• Data Flow Destinations load data into the tables in a star join schema

• Load operations, e.g.: – Data Flow Task: extract data,

apply transformations and load data

– For Loop Container – Bulk Insert Task

41