ETL - sozialinformatik.de
ETL
Mission of ETL
The back room must support 4 key steps:
• Extracting data from the original sources
• Assuring data quality and cleaning the data
• Conforming the labels and measures in the data to achieve consistency across the original sources
• Delivering the data in a physical format that can be used by query tools and report writers
Data Quality
• Problems arise when data doesn't mean what you think it does
  – garbage in, garbage out
• Or when you don't understand the data
  – complexity, lack of metadata
• Data quality problems are expensive and pervasive
  – they cost hundreds of billions of dollars each year
  – resolving data quality problems is often the biggest effort in business intelligence
Examples
• Can we interpret this data?
  – What do the fields mean? What is the key? The measures?
• Data problems
  – typos, multiple formats, missing / default values
• Metadata and domain expertise
  – Field three is Revenue. In dollars or euros?

T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000
Examples
• Dummy values
  – a clerk enters 999-99-9999 as an SSN rather than asking the customer; customers 235 years old
• Absence of data
  – NULL for customer age
• Cryptic data
  – 0 for male, 1 for female
• Contradicting data
  – customers with different addresses
• Violation of business rules
  – NULL for customer name on an invoice
• Reused primary keys, non-unique identifiers
  – a branch bank is closed; several years later, a new branch is opened and the old identifier is used again
• Data integration problems
  – multipurpose fields, synonyms
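The problem classes above can be caught mechanically. A minimal Python sketch of such rule checks, assuming a hypothetical record layout (field names and the dummy-SSN list are illustrative, not from the slides):

```python
import re

# Hypothetical dummy values a clerk might enter instead of real data
DUMMY_SSNS = {"999-99-9999", "000-00-0000"}

def check_record(rec: dict) -> list[str]:
    """Return a list of data-quality violations for one customer record."""
    problems = []
    if rec.get("ssn") in DUMMY_SSNS:
        problems.append("dummy SSN")
    age = rec.get("age")
    if age is None:
        problems.append("missing age")          # absence of data
    elif not (0 <= age <= 120):
        problems.append("implausible age")      # e.g. a 235-year-old customer
    phone = rec.get("phone", "")
    if phone and not re.fullmatch(r"\d{3}-\d{3}-\d{4}", phone):
        problems.append("malformed phone")      # a typo like '97336o8327'
    return problems

bad = check_record({"ssn": "999-99-9999", "age": 235, "phone": "97336o8327"})
ok  = check_record({"ssn": "123-45-6789", "age": 42,  "phone": "973-360-8779"})
```

Real cleansing tools apply the same idea at scale, with rule catalogs instead of hard-coded checks.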
Data Quality Factors
• Accuracy
  – data was recorded correctly, without errors
  – example: "Jhn" vs. "John"
• Completeness
  – all relevant data was recorded, no lack of data
  – no partial knowledge of the records in a table or of the attributes in a record
• Uniqueness
  – entities are recorded once
  – example: the same customer entered twice
• Timeliness/Currency
  – data is kept up to date
  – example: permanent residence address, outdated vs. up to date
• Consistency
  – data agrees with itself, no discrepancies in the data
  – example: ZIP code and city are consistent
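Two of these factors, completeness and uniqueness, are easy to measure directly. A small Python sketch (the sample rows and key choice are illustrative):

```python
def completeness(rows, field):
    """Fraction of rows where `field` is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def duplicates(rows, key_fields):
    """Keys that occur more than once -- a uniqueness violation."""
    seen, dups = {}, set()
    for r in rows:
        key = tuple(r.get(f) for f in key_fields)
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > 1:
            dups.add(key)
    return dups

rows = [
    {"id": 1, "name": "John", "zip": "10001"},
    {"id": 2, "name": "Jhn",  "zip": ""},       # typo + missing ZIP
    {"id": 3, "name": "John", "zip": "10001"},  # same customer twice?
]
```

Profiling a source this way, before writing any ETL logic, is how the "expensive and pervasive" problems above are usually quantified.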
Causes of Poor Data Quality
• Business rules do not exist, or there are no standards for data capture
• Standards may exist but are not enforced at the point of data capture
• Inconsistent data entry (incorrect spelling; use of nicknames, middle names, or aliases)
• Data entry mistakes (character transposition, misspellings, and so on)
• Integration of data from systems with different data standards
• Data quality issues are perceived as time-consuming and expensive to fix
Causes of Poor Data Quality
Source: The Data Warehousing Institute, Data Quality and the Bottom Line, 2002
07/Oct/2006 BITS-Pilani
"A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions… ETL makes or breaks the data warehouse…"
Ralph Kimball
ETL
MS BI Architecture
Source: The Data Warehouse Backroom
Definition of ETL
• Extraction, Transformation, Loading – ETL
• Get data out of the sources and load it into the DW
  – data is extracted from the OLTP database, transformed to match the DW schema, and loaded into the data warehouse database
  – may also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets
Complexity of ETL
• ETL is a complex combination of process and technology
  – the most underestimated process in DW development
  – the most time-consuming process in DW development
  – often, 80% of development time is spent on ETL
Independent data marts
• Data marts: mini-warehouses, limited in scope
• Separate ETL process for each independent data mart
• Data access complexity due to multiple data marts
Dependent data marts
• Single ETL process for the enterprise data warehouse (EDW)
• Periodic extraction: data is not completely fresh in the warehouse
• Simpler data access
• An ODS provides an option for obtaining current data
• Dependent data marts are loaded from the EDW
Data Staging Area (DSA)
• Transit storage for data underway in the ETL process
  – creates a logical and physical separation between the source systems and the data warehouse
  – transformations/cleansing are done here
• No user queries (though some allow them)
• Sequential operations (few) on large data volumes
  – performed by central ETL logic
  – easily restarted
  – RDBMS or flat files? (DBMSs have become better)
• Allows centralized backup/recovery
  – often too time-consuming to load all data marts from scratch after a failure
  – backup/recovery facilities are needed
Data Staging Area (DSA)
• To stage or not to stage? A conflict between
  – getting the data from the operational systems as fast as possible
  – having the ability to restart without repeating the process from the beginning
• Reasons for staging
  – Recoverability: stage the data as soon as it has been extracted from the source systems, and immediately after major processing (cleaning, transformation, etc.)
  – Backup: the data warehouse can be reloaded from the staging tables without going back to the sources
  – Auditing: preserves the lineage between the source data and the underlying transformations before the load to the data warehouse
Extract
• Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
• Static extract = capturing a snapshot of the source data at a point in time
• Incremental extract = capturing only the changes that have occurred since the last static extract
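The static/incremental distinction can be sketched in a few lines of Python. This assumes, purely for illustration, that each source row carries a `last_modified` timestamp (many real sources do not, which is what makes incremental extraction hard):

```python
from datetime import datetime

def incremental_extract(source_rows, last_extract_time):
    """Pull only rows changed since the previous extract (incremental);
    passing datetime.min instead yields a full static snapshot."""
    return [r for r in source_rows if r["last_modified"] > last_extract_time]

source = [
    {"id": 1, "last_modified": datetime(2006, 10, 1)},
    {"id": 2, "last_modified": datetime(2006, 10, 6)},
]
delta = incremental_extract(source, datetime(2006, 10, 3))   # incremental
full  = incremental_extract(source, datetime.min)            # static extract
```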
Extraction
• The ETL process needs to effectively integrate systems that have different
  – DBMSs, operating systems, hardware, and communication protocols
• A logical data map is needed before the physical data can be transformed
• The logical data map describes the relationship between the starting points and the ending points of your ETL system
  – usually presented in a table or spreadsheet
Logical data map
• The content of the logical data mapping document has proven to be the critical element for efficiently planning ETL processes
• The table type gives us our cue for the ordinal position of the data load processes: first dimensions, then facts
• The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. The table must depict, without question, the course of action involved in the transformation process
• The transformation column can contain anything from the absolute solution to nothing at all

Columns of the logical data map:
Target (Table Name, Column Name, Data Type) | Source (Table Name, Column Name, Data Type) | Transformation
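A logical data map can be read mechanically. A minimal Python sketch, where the table names, column names, and transformations are all illustrative examples (not from the slides):

```python
# Each entry mirrors one row of the logical data mapping document:
# target table/column, source table/column, and the transformation rule.
LOGICAL_MAP = [
    {"target": ("dim_customer", "customer_name"),
     "source": ("crm.clients", "full_name"),
     "transform": str.strip},                  # near-trivial transformation
    {"target": ("dim_customer", "state_code"),
     "source": ("crm.clients", "state"),
     "transform": str.upper},                  # conform to upper-case codes
]

def apply_map(source_row):
    """Produce the target columns from one source row using the map."""
    out = {}
    for entry in LOGICAL_MAP:
        _, src_col = entry["source"]
        _, tgt_col = entry["target"]
        out[tgt_col] = entry["transform"](source_row[src_col])
    return out

row = apply_map({"full_name": " Ted J. ", "state": "ny"})
```

Keeping the map as data rather than code is what lets the same document serve both as planning blueprint and as executable specification.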
Transform
• The main step where the ETL adds value
• Changes data to be inserted into the DW
• 3 major steps:
  – data cleansing
  – data integration
  – other transformations (including replacement of codes, derived values, calculating aggregates)
Cleanse
• Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality
• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Typical Cleansing Tasks
• Standardization
  – changing data elements to a common format defined by the data steward
• Gender analysis
  – returns a person's gender based on the elements of their name (returns M(ale), F(emale), or U(nknown))
• Entity analysis
  – returns an entity type based on the elements in a name, e.g. customer names (returns O(rganization), I(ndividual), or U(nknown))
• De-duplication
  – uses match coding to discard records that represent the same person, company, or entity
• Matching and merging
  – matches records based on a given set of criteria and at a given sensitivity level
Typical Cleansing Tasks
• Address verification
  – verifies that addresses are valid
• Householding/Consolidation
  – used to identify links between records, such as customers living in the same household or companies belonging to the same parent company
• Parsing
  – identifies individual data elements and separates them into unique fields
• Casing
  – provides the proper capitalization for data elements that were entered in all lowercase or all uppercase
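Parsing and casing, the two simplest tasks above, can be sketched directly. A naive Python illustration (real tools handle titles, suffixes, and multi-word surnames, which this sketch deliberately ignores):

```python
def parse_name(raw: str) -> dict:
    """Parsing: split a free-text name field into separate elements."""
    parts = raw.split()
    return {"first": parts[0], "last": parts[-1],
            "middle": " ".join(parts[1:-1])}

def proper_case(value: str) -> str:
    """Casing: repair entries typed in all upper or all lower case."""
    return " ".join(w.capitalize() for w in value.split())

parsed = parse_name("MARK W. CRAVER")
cased  = proper_case("mark craver")
```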
Cleansing Example
Who is the biggest supplier?

Anderson Construction   $ 2,333.50
Briggs,Inc              $ 8,200.10
Brigs Inc.              $12,900.79
Casper Corp.            $27,191.05
Caspar Corp             $ 6,000.00
Solomon Industries      $43,150.00
The Casper Corp         $11,500.00
...
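Answering the question requires fuzzy grouping of the name variants before summing. A rough Python sketch using the standard library's `difflib` (the noise-word list and similarity threshold are illustrative choices, not a production matching algorithm):

```python
import re
from difflib import SequenceMatcher

# Words that carry no identity information for matching purposes
NOISE = {"the", "inc", "corp", "co", "industries", "construction"}

def norm(name: str) -> str:
    """Standardize a supplier name for comparison."""
    words = re.sub(r"[^a-z ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in NOISE)

def consolidate(spend: dict, threshold: float = 0.8) -> dict:
    """Group near-duplicate supplier names and sum their spend."""
    groups = {}
    for name, amount in spend.items():
        key = norm(name)
        for existing in groups:
            if SequenceMatcher(None, key, existing).ratio() >= threshold:
                key = existing          # fold into the existing group
                break
        groups[key] = groups.get(key, 0) + amount
    return groups

spend = {"Anderson Construction": 2333.50, "Briggs,Inc": 8200.10,
         "Brigs Inc.": 12900.79, "Casper Corp.": 27191.05,
         "Caspar Corp": 6000.00, "Solomon Industries": 43150.00,
         "The Casper Corp": 11500.00}
totals = consolidate(spend)
biggest = max(totals, key=totals.get)
```

With the variants merged, Casper Corp.'s combined spend overtakes Solomon Industries, which is exactly the kind of answer that raw, uncleansed data gets wrong.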
Cleansing Example
[Bar chart: $ spent per consolidated supplier (Casper Corp., Solomon Ind., Briggs Inc., Anderson Cons.), scale 0 to 50,000]
Data Matching Example
Without matching, the operational system of records passes three versions of the same person into the data warehouse:
01 Mark Carver, SAS, SAS Campus Drive, Cary, N.C.
02 Mark W. Craver, [email protected]
03 Mark Craver, Systems Engineer, SAS
Data Matching Example
With data quality (DQ) matching applied, the three operational records are merged into a single consolidated record in the data warehouse:
01 Mark Craver, Systems Engineer, SAS, SAS Campus Drive, Cary, N.C. 27513, [email protected]
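Once records are matched, a survivorship rule decides which value wins for each field. A toy Python sketch using "longest value wins" (a deliberately crude rule; the email address is a made-up placeholder, since the slide's address is redacted):

```python
def merge_records(matched: list) -> dict:
    """Survivorship: combine matched records, keeping for each field
    the longest non-empty value seen so far as the 'best' one."""
    merged = {}
    for rec in matched:
        for field, value in rec.items():
            if not value:
                continue
            if len(str(value)) > len(str(merged.get(field, ""))):
                merged[field] = value
    return merged

matched = [
    {"name": "Mark Carver", "company": "SAS",
     "address": "SAS Campus Drive, Cary, N.C."},
    {"name": "Mark W. Craver", "email": "mcraver@example.com"},  # hypothetical
    {"name": "Mark Craver", "title": "Systems Engineer", "company": "SAS"},
]
golden = merge_records(matched)
```

Real DQ tools use field-specific survivorship rules (most recent, most trusted source, most frequent), but the mechanics are the same.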
Transform
• Transform = convert data from the format of the operational system to the format of the data warehouse
• Record-level transformations:
  – Selection: data partitioning
  – Joining: data combining
  – Aggregation: data summarization
• Field-level transformations:
  – single-field: from one field to one field
  – multi-field: from many fields to one, or from one field to many
Single-field transformation
• In general, some transformation function translates data from the old form to the new form
• Algorithmic transformation uses a formula or logical expression
• Table lookup is another approach: it uses a separate table keyed by the source record's code
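Both single-field approaches in one Python sketch, reusing the cryptic gender codes from the earlier examples (the code table and the cents-to-dollars formula are illustrative):

```python
# Table lookup: a small code table keyed by the source record's code,
# decoding cryptic values like 0/1 for gender.
GENDER_LOOKUP = {"0": "Male", "1": "Female"}

def lookup_transform(code: str, table: dict, default: str = "Unknown") -> str:
    return table.get(code, default)

# Algorithmic transformation: a formula instead of a table.
def cents_to_dollars(cents: int) -> float:
    return cents / 100

decoded = lookup_transform("0", GENDER_LOOKUP)
```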
Multifield transformation
• M:1 – from many source fields to one target field
• 1:M – from one source field to many target fields
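Both directions in a short Python sketch, using an address concatenation for M:1 and a name split for 1:M (the field names are illustrative):

```python
def m_to_1(rec: dict) -> str:
    """M:1 -- combine several source fields into one target field."""
    return f'{rec["street"]}, {rec["city"]}, {rec["state"]} {rec["zip"]}'

def one_to_m(full_name: str) -> dict:
    """1:M -- split one source field into several target fields."""
    first, _, last = full_name.partition(" ")
    return {"first_name": first, "last_name": last}

addr = m_to_1({"street": "SAS Campus Drive", "city": "Cary",
               "state": "NC", "zip": "27513"})
name = one_to_m("Mark Craver")
```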
Loading
• Load/Index = place transformed data into the warehouse and create indexes
• Refresh mode: bulk rewriting of the target data at periodic intervals
• Update mode: only changes in the source data are written to the data warehouse
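The two load modes, reduced to their essence in Python (the target is modeled as a plain key-to-row dict for illustration):

```python
def refresh(target: dict, source: dict) -> dict:
    """Refresh mode: bulk-rewrite the whole target from the source."""
    return dict(source)

def update(target: dict, changes: dict) -> dict:
    """Update mode: apply only the changed source rows to the target."""
    merged = dict(target)
    merged.update(changes)
    return merged

target  = {1: "old", 2: "old"}
changes = {2: "new", 3: "new"}
after_update  = update(target, changes)
after_refresh = refresh(target, {1: "fresh"})
```

Note that refresh discards rows absent from the new source, while update leaves untouched rows alone; this difference is why update mode needs reliable change detection upstream.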
Loading tools
• Backup and disaster recovery
  – restart logic, error detection
• Continuous loads during a fixed batch window (usually overnight)
  – maximize system resources to load data efficiently in the allotted time
• Parallelization
  – throughput of several tens of GB per hour
Loading and SCD
• The data loading module consists of all the steps required to administer slowly changing dimensions (SCDs) and write the dimension to disk as a physical table in the proper dimensional format, with correct primary keys, correct natural keys, and final descriptive attributes
• Creating and assigning the surrogate keys occurs in this module
• When the DW receives notification that an existing row in a dimension has changed, it gives one of 3 types of response:
  – Type 1: overwrite the old attribute value
  – Type 2: add a new dimension row, preserving history
  – Type 3: track the change in an additional attribute (not in MS SQL Server 2008)
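The Type 1 and Type 2 responses can be sketched in Python. This is a simplified in-memory model, assuming rows with a surrogate key `sk`, a natural key, a current-row flag, and effective dates (column names are illustrative):

```python
from datetime import date

def scd_type1(dim_rows, natural_key, new_attrs):
    """Type 1: overwrite the attributes in place (history is lost)."""
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["current"]:
            row.update(new_attrs)
    return dim_rows

def scd_type2(dim_rows, natural_key, new_attrs, today=date(2006, 10, 7)):
    """Type 2: expire the current row and append a new one with a
    fresh surrogate key, preserving history."""
    new_sk = max(r["sk"] for r in dim_rows) + 1
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["current"]:
            row["current"] = False
            row["end_date"] = today
            dim_rows.append({**row, **new_attrs, "sk": new_sk,
                             "current": True, "start_date": today,
                             "end_date": None})
            break
    return dim_rows

dim = [{"sk": 1, "natural_key": "C42", "city": "Cary", "current": True,
        "start_date": date(2000, 1, 1), "end_date": None}]
dim = scd_type2(dim, "C42", {"city": "Raleigh"})

dim_t1 = [{"sk": 1, "natural_key": "C42", "city": "Cary", "current": True,
           "start_date": date(2000, 1, 1), "end_date": None}]
dim_t1 = scd_type1(dim_t1, "C42", {"city": "Raleigh"})
```

The surrogate key assignment in `scd_type2` is the step the slide refers to: fact rows loaded after the change will point at the new key, while old facts keep pointing at the historical row.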
ETL Tools
• ETL tools from the big vendors
  – Oracle Warehouse Builder
  – IBM DB2 Warehouse Manager
  – Microsoft Integration Services
• Hundreds of others:
  – http://en.wikipedia.org/wiki/Extract,_transform,_load#Tools
SQL Server Integration Services
SSIS: Data Extraction
• In SSIS, extractions are managed via source adapters, called Data Flow Sources. Source adapters:
  – read metadata from connection managers,
  – connect to the source,
  – read raw data from the source, and
  – forward it to the pipeline
SSIS: Data Extraction – Connection Manager
• A connection manager is a logical representation of a physical source
• Different types of sources need different kinds of information
  – e.g., to connect to an Excel document, the filename and path must be stored, along with which data sheets are being accessed and whether or not the document has column headers
SSIS: Data Cleansing
• Establish baseline indicators that assess data quality on data sets, and decide what to do when these indicators are violated (3 options):
  – fix the data
  – discard the record
  – stop the ETL process
• In the SSIS environment, baseline indicators and violation rules can be set up by branching data of different quality in a data flow pipeline:
  – using data flow transformations such as Conditional Split, and by applying the SSIS Expression Language (a language similar to C#)
  – using control flow tasks like Send Mail to alert the ETL staff
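The branching idea behind SSIS's Conditional Split is simple to illustrate outside SSIS. A Python analogy (not SSIS code; the validity rule is an example):

```python
def conditional_split(rows, is_valid):
    """Branch rows of different quality, like an SSIS Conditional Split:
    good rows continue down the pipeline, bad rows go to an error output."""
    good, bad = [], []
    for row in rows:
        (good if is_valid(row) else bad).append(row)
    return good, bad

rows = [{"age": 42}, {"age": 235}]
good, bad = conditional_split(rows, lambda r: 0 <= r["age"] <= 120)
```

In SSIS the predicate would be an SSIS Expression Language condition on the split, and the bad-row output might feed a Send Mail task.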
SSIS: Data Flow Transformations
• Matching transformations such as:
  – Fuzzy Lookup, which finds close or exact matches by comparing different columns
  – Fuzzy Grouping, which can add similarity values for data matching
• Other Data Flow Transformations such as:
  – Merge Join, which combines sorted data flows
  – Character Map, which converts character mappings
  – Data Conversion, which applies conversion functions
SSIS: Data Loading
• Data Flow Destinations load data into the tables of a star join schema
• Load operations, e.g.:
  – Data Flow Task: extract data, apply transformations, and load data
  – For Loop Container
  – Bulk Insert Task