Extract Transform Load (ETL)
M2: Extract-Transform-Load (ETL)
The only way to do great work is to love what you do. -- Steve Jobs --
WORAPOT JAKKHUPAN, PHD | WORAPOT.J@PSU.AC.TH | Room BSc.0406/7
Information and Communication Technology Programme, Faculty of Science, PSU
Objectives
• Databases vs. data warehouses
• ER diagrams for databases
• Data warehouse architecture
• ETL (data extraction, transformation, and loading) definitions
• ETL design principles
• ETL functions
Data warehouse architecture
Difference between Databases and Data Warehouse
• Database
  • A database is made up of a collection of tables that store a specific set of structured data.
  • A table contains a collection of rows and columns.
  • Each column (also called a field or attribute) is designed to store a certain type of information.
• OLTP
  • OLTP (Online Transaction Processing) is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE).
  • The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured in transactions per second.
Difference between Databases and Data Warehouse
• Data Warehouse
  • "A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process."
  • A data warehouse is designed for OLAP.
• OLAP
  • OLAP (Online Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques.
  • An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas.
Data Warehousing Components
[Diagram: several operational DBs feed an Extract, Transform, Load (ETL) process, which populates the data warehouse; the warehouse in turn serves OLAP and data mining.]
Data Flow
[Diagram omitted]
Source: Connolly & Begg (2001), Database Systems: A Practical Approach to Design, Implementation, and Management (3rd Edition), Addison Wesley
OLTP vs. OLAP
• IT systems can be divided into transactional (OLTP) and analytical (OLAP). In general, OLTP systems provide the source data for data warehouses, whereas OLAP systems help to analyse it.
OLTP vs. OLAP
OLTP systems                                        | OLAP systems
Holds current data                                  | Holds historical data
Stores detailed data                                | Stores detailed and summarized data
Data is dynamic                                     | Data is static
Repetitive processing                               | Ad hoc processing
Predictable pattern of usage                        | Unpredictable pattern of usage
Transaction-driven                                  | Analysis-driven
Application-oriented                                | Subject-oriented
Supports day-to-day operations                      | Supports strategic decisions
Serves a large number of clerical/operational users | Serves a small number of managers
Source: Connolly & Begg (2001), Database Systems: A Practical Approach to Design, Implementation, and Management (3rd Edition), Addison Wesley
Data Acquisition & Integration
• A process to populate a data warehouse.
• Three main functions (a minimal sketch follows this list):
  • Extract: retrieves data from a source system to produce new Source Data.
  • Transform: inspects, cleanses, and conforms the new Source Data into data ready for the data warehouse (called Load Data).
  • Load: updates a data warehouse using the data provided in the Load Data.
• These three functions are more commonly known as ETL.
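To make the three functions concrete, here is a minimal sketch in Python; the CSV source file, the column names, and the SQLite warehouse are assumptions for illustration, not part of the lecture material.

```python
import csv
import sqlite3

def extract(path):
    """Extract: retrieve rows from a source CSV file as the Source Data."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(source_rows):
    """Transform: inspect, cleanse, and conform Source Data into Load Data."""
    load_rows = []
    for row in source_rows:
        load_rows.append({
            "product": row["product"].strip().upper(),  # conform product names
            "amount": float(row["amount"]),             # cast text to a number
        })
    return load_rows

def load(load_rows, db_path="warehouse.db"):
    """Load: update the data warehouse using the Load Data."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (product TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (:product, :amount)", load_rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))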
DW Data Components
• Fact table
  • Tells "one version of the truth" about the subject.
  • Holds numerical measurements: sale amount, total customers, etc.
  • Holds key(s) to the dimension tables.
• Dimension table
  • Identifies the key cells of the fact table.
  • Supports drill down and roll up.
  • Describes the subject: product name, customer name, store location.
A star-schema sketch follows below.
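Here is a minimal star-schema sketch using SQLite; the table and column names (sales_fact, product_dim, store_dim) are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Dimension tables describe the subject: product name, store location, ...
con.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT)")
con.execute("CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_location TEXT)")
# The fact table holds the numerical measurement plus keys to the dimensions.
con.execute("""CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    sale_amount REAL)""")
# Roll up: total sale amount per store location.
cur = con.execute("""
    SELECT s.store_location, SUM(f.sale_amount)
    FROM sales_fact f JOIN store_dim s ON f.store_key = s.store_key
    GROUP BY s.store_location""")
```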
ETL Process
• The ETL process begins by defining a data source and identifying which data in that source you are interested in copying to a new destination.
• You may need to perform one or more transformations on the data for retrieval purposes; e.g., you may need to transform "True" or "False" (string type) into "1" or "0" (Boolean type), as sketched below.
• You also need a load sequence to inject the transformed data into the appropriate destination (also called the target system; in this unit, always a data warehouse or a section of a data warehouse architecture, e.g., a data mart).
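A tiny sketch of that string-to-Boolean transformation; the function name is illustrative only.

```python
def to_boolean(value):
    """Transform the source strings "True"/"False" into 1/0 for the target."""
    mapping = {"True": 1, "False": 0}
    return mapping[value]  # an unexpected value raises KeyError for inspection

assert to_boolean("True") == 1
assert to_boolean("False") == 0
```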
Source System Analysis
• It provides significant insight into and understanding of the enterprise and its data, so that a data warehouse can express the enterprise at any level.
• It examines the enterprise data for its informational content: the meaning of the data and how the data captures and expresses that meaning.
• At this stage, a data warehouse designer should focus on the enterprise and the analysis of its data.
• An early and common mistake in data warehouse design is to use source system analysis merely to search for source data that fits a preconceived definition of the data warehouse.
• The designer must be allowed to query and survey the enterprise data itself, not just a summary or description of it.
Source System Analysis Principles
• They explain what the data warehouse designer is looking for, including:
  • Multiple systems of record; e.g., selling products through a series of retail outlets (West Division, East Division).
  • Entity data, including physical and logical members, agents, facilities, and resources:
    • Physical entities can be touched and uniquely identified.
    • Logical entities cannot be touched; e.g., concepts, constructs, and hierarchies that organize and enhance the meaning of enterprise events and entities.
    • Entities can also describe and qualify each other by their associations; e.g., S Block can identify itself as a unique physical entity as well as identify the location of lecturer #123.
Source System Analysis Principles
• Granularity: a designer must be aware of the grain of every source data element, which is determined by its level of detail, hierarchical depth, or measurement precision.
• Latency: the time gap between an enterprise event and the moment its data becomes available. It determines the earliest moment data can be available to the data warehouse.
• Transaction data, also known as event data, identifies the moment when an enterprise performs its primary functions, e.g.:
  • Sales: the moment when a retail enterprise sells something.
  • Manufacturing: the moment when an assembly plant builds something.
  • Service: the moment when a consulting firm provides a service.
• Snapshot data expresses the cumulative effect of a series of transactions or events over a range of time, e.g., Web site hits per hour.
Source System Analysis Methods
• They explain how the data warehouse designer examines the source system to understand how the enterprise and its data interact.
• System documentation is a good start; it describes how an enterprise system is intended and expected to behave.
• The interaction of enterprise data is a good baseline from which to start.
• We should also document how an enterprise system misbehaves, creating unexpected data and results (anomalous data).
• Source system analysis is the first opportunity to protect the quality of the data in a data warehouse.
Data Flow, State Diagram & System Record
• A data flow diagram is used to identify where the data comes from, where it goes, and by what transport mechanism it moves.
• A data state diagram is used to capture the various business meanings of a data element as it flows through the data flow diagram.
  • It also indicates the relevance of a data element to the enterprise.
  • It also includes any physical indications of each state.
• The authoritative point of origin for each enterprise entity at any given state is the System of Record, from which the ETL application gets the data it loads into a data warehouse.
Business Rules
• Business rules govern the data in the source system.
• The data profile, data flow diagram, data state diagram, and system of record provide the best opportunity to identify the business rules.
• They come in three basic varieties (see the sketch after this list):
  • Intra-record business rules: Column A + Column B = Column C
  • Intra-dataset business rules: Row 1.Column A + Row 2.Column A = Row 3.Column B
  • Cross-dataset business rules: File 1.Column A = Table 2.Column B
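A minimal sketch of checks for the three varieties, assuming rows are plain dictionaries; the column letters follow the examples above.

```python
# Intra-record rule: within one row, Column A + Column B = Column C.
def check_intra_record(row):
    return row["A"] + row["B"] == row["C"]

# Intra-dataset rule: Row 1.Column A + Row 2.Column A = Row 3.Column B.
def check_intra_dataset(rows):
    return rows[0]["A"] + rows[1]["A"] == rows[2]["B"]

# Cross-dataset rule: File 1.Column A must match Table 2.Column B.
def check_cross_dataset(file_rows, table_rows):
    return {r["A"] for r in file_rows} == {r["B"] for r in table_rows}

assert check_intra_record({"A": 1, "B": 2, "C": 3})
```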
Target System Analysis
• The target system is a data warehouse, or a component of a data warehouse architecture.
• The designer needs to choose the data model, RDBMS, and business intelligence reporting architecture.
• The analysis should also indicate how the data warehouse will reflect the entities of the source system (e.g., purchase orders, machines, people, etc.) as those entities cycle through their states (e.g., reviewed, approved, commissioned, hired, etc.).
• Target system analysis should reveal and clarify the expectations of both the data warehouse designer and the customers.
• It also provides an opportunity to recognize and resolve discrepancies between the designer and the customers.
• Its goal is to create a set of expectations so explicit that they can be compared directly to the data in the data warehouse.
Data mapping
• It is the process by which an ETL analyst identifies the source data, specific to location, state, and timing, that will be used to satisfy the data requirements of a data warehouse.
• The transformations necessary to create the data elements, as they will be stored in a data warehouse, are also included in a data mapping.
• The Data Mapping document is an input into the Data Quality SLA and the Metadata SLA.
• The Data Mapping document must clearly and precisely identify the source data element that will be used, so that there is no ambiguity about the location, state, or timing of the extract of a data element.
• It must likewise identify the target data element that will be populated, so that there is no ambiguity about the location and state of the data element as stored in the data warehouse.
• It must also define the transformations necessary to create the data element as it will be stored in the data warehouse.
Types of data mapping
1. Simple data mapping

Source data element  | Transformation | Target data element
Length in kilometres | n/a            | Length in kilometres

2. Derived data mapping

Source data element  | Transformation | Target data element
Length in kilometres | × 1000         | Length in metres
Types of data mapping cont.
3. Recursive data mapping (see the sketch below)

Source data element  | Transformation | Target data element
Length in kilometres | n/a            | Length in kilometres
Price per metre      | n/a            | Price per metre

Source data element  | Transformation | Target data element
Price per metre      | × 1000         | Price per kilometre

Source data element                       | Transformation | Target data element
Length in kilometres, Price per kilometre | ×              | Total price
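The three mapping types can be sketched as small transformation functions; the field names (length_km, price_per_m) are invented for the example.

```python
# Simple mapping: the target element is the source element unchanged.
def map_length_km(row):
    return row["length_km"]

# Derived mapping: length in kilometres x 1000 -> length in metres.
def map_length_m(row):
    return row["length_km"] * 1000

# Recursive mapping: a derived element (price per kilometre) feeds a
# further derivation (total price).
def map_price_per_km(row):
    return row["price_per_m"] * 1000

def map_total_price(row):
    return row["length_km"] * map_price_per_km(row)

row = {"length_km": 2.5, "price_per_m": 4.0}
assert map_total_price(row) == 10000.0  # 2.5 km x 4000 per km
```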
ETL vs. ELT
[Diagram: In ETL, transaction source data is extracted from the source application into the ETL environment, transformed into transaction load data, and then loaded into the warehouse's transaction table. In ELT, the transaction source data is extracted and loaded into the data warehouse first, then transformed inside the warehouse into the transaction table of active and current data.]
ETL vs. ELT cont.
• In an ETL application, data is extracted from an operational system, a transform performs all data modifications to the Source Data, and a load application reads the Load Data and performs the necessary inserts, updates, and deletes to a data warehouse.
• An ELT application performs all the functions and purposes of an ETL application.
• The difference between an ETL application and an ELT application is the platform on which the application performs its functions.
• ELT has two advantages (a minimal ELT sketch follows below):
  • A data warehouse RDBMS platform is a powerful platform. All the resources (CPU seconds, throughput, etc.) of a data warehouse RDBMS platform are available to an ELT application.
  • A copy of look-up data need not be kept and maintained on the ELT platform, because the data warehouse RDBMS has access to all the data in the data warehouse.
• ELT has one disadvantage:
  • A portion of the data warehouse's resources (CPU seconds, throughput, etc.) is consumed by someone other than a data warehouse customer. Given sufficient data volumes and transformation complexity, this could adversely affect data warehouse customers.
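A minimal ELT sketch using SQLite as a stand-in for the warehouse RDBMS: the data is loaded raw, then transformed with SQL inside the database. The table names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stage_sales (product TEXT, amount TEXT)")
con.execute("CREATE TABLE sales_fact (product TEXT, amount REAL)")

# Load first: copy the raw source rows into a staging table inside the warehouse.
con.executemany("INSERT INTO stage_sales VALUES (?, ?)",
                [("widget", "19.90"), ("gadget", "5.00")])

# Then transform inside the warehouse, using the RDBMS's own resources.
con.execute("""
    INSERT INTO sales_fact (product, amount)
    SELECT UPPER(product), CAST(amount AS REAL) FROM stage_sales""")
con.commit()
```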
ETL Design Principles
• ETL applications are subject to unexpected circumstances and, therefore, should expect the unexpected to occur.
• An ETL analyst must work hard to make an ETL application bulletproof, so that each ETL application behaves as intended even when the source system does not.
• The ETL process principles (Principles 1 to 6) address the executable part of an ETL application, i.e., the code that moves, copies, and transforms data.
  • This part is similar to a manufacturing plant: it converts and transforms raw data (the materials) into a data warehouse (the finished product).
• The ETL staging principles (Principles 7 to 11) provide design principles for managing and controlling the creation and use of stage data and structures.
Principle 01: one thing at a time
• Multitasking conserves time and resources, but it runs contrary to the spirit of ETL: an ETL application assumes that nothing will go as planned and that some input values will be unreasonable or invalid.
• It is recommended to perform each action individually and then combine the separate result sets into one set of data.
• One Thing at a Time is essentially a granular, modular approach (sketched below). Its benefits include:
  • Creating the opportunity for Data Quality and Metadata functions to integrate within an ETL application.
  • Creating the opportunity to isolate violated assumptions.
  • Removing any question about the sequence and precedence of ETL functions, regardless of the language or platform.
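One possible reading of the granular, modular approach, assuming rows are dictionaries: each cleansing step checks exactly one assumption, so a violated assumption is easy to isolate.

```python
def drop_negative_amounts(rows):
    """One thing: reject rows whose amount is unreasonable."""
    return [r for r in rows if r["amount"] >= 0]

def drop_blank_products(rows):
    """One thing: reject rows with no product name."""
    return [r for r in rows if r["product"].strip()]

def cleanse(rows):
    """Perform each action individually; the surviving rows form one
    combined set of data, and each step can report what it rejected."""
    rows = drop_negative_amounts(rows)
    rows = drop_blank_products(rows)
    return rows
```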
Principle 02: Know When to Begin
• Operational systems rely on operational job schedulers to know when the conditions have been satisfied for a job to begin.
• ETL applications, however, rely on conditions within precedent data (i.e., Begin Conditions). When the precedent Begin Conditions have been satisfied, subsequent applications relying on those conditions can safely begin (see the sketch after this list).
  • An Extract application examines an operational source system prior to extracting data.
  • A Transform application examines the data provided by the preceding Extract applications.
  • A Load application examines the data provided by the preceding Transform applications to determine whether the Begin Conditions have been satisfied.
• Data Quality and Metadata information prove extremely helpful in these circumstances.
• Principle 02 is essentially a backward-looking design principle.
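A sketch of a Begin Condition check, assuming the preceding Extract writes a stage file plus a completion marker; the file names are invented.

```python
import os
import time

def begin_conditions_met(stage_file, marker_file):
    """A Transform application examines the data left by the preceding
    Extract application before it begins."""
    return os.path.exists(marker_file) and \
           os.path.exists(stage_file) and os.path.getsize(stage_file) > 0

# Wait until the precedent Extract has signalled completion.
while not begin_conditions_met("stage/sales.csv", "stage/sales.csv.done"):
    time.sleep(60)
```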
Principle 03: Know When to End
• This is a forward-looking design principle that requires an ETL application to examine the data it has created.
• An ETL application can verify, by examining its own output data, whether it has completed satisfactorily.
• The results of that final review can then be captured as Data Quality or Metadata information and shared with subsequent ETL applications (see the sketch below).
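A sketch of such a forward-looking end check: the application re-reads its own output and records the result for subsequent applications. The row-count criterion is an assumption for the example.

```python
def verify_own_output(path, expected_rows):
    """Examine the output this application just created and report
    whether the run completed satisfactorily."""
    with open(path) as f:
        actual = sum(1 for _ in f) - 1  # subtract the header line
    return {"rows_written": actual, "complete": actual == expected_rows}
```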
Principle 04: Large to Medium to Small
• A Large to Medium to Small design first assembles all applicable data elements and entities.
• Data that is no longer required is then dismissed. The final data set is a load-ready file that will be loaded into a data warehouse.
• At the initial stage, all applicable data is juxtaposed simultaneously, so the decision to exclude data is made in the broadest context possible, which allows the greatest control over data exclusion.
Principle 05: stage data integrity
• This is a design principle by which precedent applications create (store) a set of stage data exactly as it will be consumed by subsequent applications.
• Once created, a set of stage data can only be consumed as a single contiguous set by subsequent applications.
• It avoids unnecessary risk and increases the overall integrity of an ETL application.
• For example, suppose we have source raw-materials data from companies A, B, and C, and an application that extracts data describing raw materials from company A. We could take one of the following approaches:
  • Create a single set of stage data ABC (not a good solution; why?).
  • Create multiple sets of stage data: A, B, C, and ABC.
Principle 06: Know what you have
• This principle prompts an ETL application to take inventory of inbound data, rather than assume the inbound data contains everything that is expected.
• Information describing the contents of inbound data is available from two sources: metadata and the data itself.
• The output of the comparison between inbound data and expected data is a list of matches and a list of mismatches (missing data).
• Normally, a threshold on the missing data is used to choose a response, based on the history of data anomalies (see the sketch below).
[Diagram: inbound data elements A, B, and C are compared against the expected A, B, and C; the matches are "what you have" and the mismatches are "what you don't have".]
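A minimal sketch of taking inventory, treating the inbound and expected contents as sets; the 10% threshold is an assumed value.

```python
def take_inventory(inbound, expected):
    matches = expected & inbound       # what you have
    mismatches = expected - inbound    # what you don't have
    return matches, mismatches

expected = {"A", "B", "C"}
matches, missing = take_inventory({"A", "B"}, expected)
# A threshold on missing data chooses the response, based on the
# history of data anomalies.
if len(missing) / len(expected) > 0.10:
    print("Investigate missing inbound data:", missing)
```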
ETL Staging Principles
• Principle 07: Name the data – describes how to identify data and its features, origin, and destination with an appropriate level of granularity and control.
• Principle 08: Own the data – describes how to secure data to prevent interference by other applications, including other ETL and operational applications.
• Principle 09: Build the data – describes how to create a data set from its foundation.
• Principle 10: Type the data – describes how to protect ETL functions from incompatible data types.
• Principle 11: Land the data – describes the need to retain interim data beyond its immediate use.
ETL Functions
• ETL functions are designed to discern what has happened in the enterprise and bring that information to the data warehouse.
• Extract functions retrieve data from the source system and store it as stage data in the ETL environment.
• Transform functions are applied to staged datasets to derive the required information (sets of dimension data).
• Load functions load the data into the data warehouse.
Extract functions
1. Extract data from a contiguous dataset – a simple extract function (sketched below).
   [Diagram: Source system → Extract → Stage in the ETL environment]
2. Extract data from a data flow – needs a control mechanism based on the bundles of data flow records.
   [Diagram: Source system → Extract → Stage in the ETL environment]
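A sketch of the simple extract from a contiguous dataset: the source table is copied into a stage file in the ETL environment. The database, table, and file names are invented.

```python
import csv
import sqlite3

def extract_to_stage(source_db, stage_path):
    """Copy the source table, as one contiguous dataset, to stage data."""
    con = sqlite3.connect(source_db)
    rows = con.execute("SELECT product, amount FROM sales").fetchall()
    con.close()
    with open(stage_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "amount"])  # header for the stage file
        writer.writerows(rows)

extract_to_stage("operational.db", "stage/sales.csv")
```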
Data level Transform functions
• Row-level transformation is applied to every row in a staged dataset; these are the simplest transformations.
• Dataset-level transformation:
  • is performed within the context of a whole set of data;
  • must address the whole dataset at a time to derive the information necessary to update each individual row.
• Surrogate key generation (intra-dataset):
  • generates a sequential numeric value that uniquely identifies each row of the dataset;
  • a surrogate key is used here to uniquely identify each row in an ETL application, because transformed data sometimes lacks a key (see the sketch below).
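A sketch combining a row-level transformation with intra-dataset surrogate key generation; the field names are invented.

```python
def transform_stage(stage_rows):
    out = []
    for i, row in enumerate(stage_rows, start=1):
        out.append({
            "row_key": i,  # intra-dataset surrogate key: sequential and unique
            "product": row["product"].strip().upper(),  # row-level transformation
            "amount": float(row["amount"]),             # row-level transformation
        })
    return out
```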
Data warehouse level transformation
• These functions must be performed within the context of the data warehouse.
• By themselves they do not have all the knowledge necessary to derive the required data; they must use both the input data and data already in the data warehouse.
• Surrogate key generation (intra-data warehouse):
  • The identifier should be unique throughout the data warehouse.
  • The best way is to retrieve the maximum identifier in the data warehouse and assign max + 1 to the new row, as sketched below.
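A sketch of the max + 1 approach against a warehouse table; the table name reuses product_dim from the earlier star-schema sketch. In a concurrent warehouse this would need a lock or a database sequence, which is beyond this sketch.

```python
import sqlite3

def next_surrogate_key(con, table, key_column):
    """Retrieve the max identifier in the warehouse and assign max + 1."""
    (max_key,) = con.execute(
        f"SELECT COALESCE(MAX({key_column}), 0) FROM {table}").fetchone()
    return max_key + 1

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS product_dim "
            "(product_key INTEGER PRIMARY KEY, product_name TEXT)")
new_key = next_surrogate_key(con, "product_dim", "product_key")
con.execute("INSERT INTO product_dim VALUES (?, ?)", (new_key, "NEW PRODUCT"))
con.commit()
```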
Load Data
1. Load data from a stable and contiguous dataset – the simplest and most common method.
   [Diagram: Load data in the ETL environment → Load → Data warehouse]
2. Load data from a data flow – needs a control mechanism to ensure each row is loaded only once (see the sketch below).
   [Diagram: Load data in the ETL environment → Load → Data warehouse]
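A sketch of one possible load control for the data-flow case: the surrogate key doubles as the control mechanism, so a row that arrives twice is loaded only once. INSERT OR IGNORE is SQLite syntax; other RDBMSs use MERGE or ON CONFLICT clauses.

```python
import sqlite3

def load_exactly_once(con, load_rows):
    """Load rows from a data flow; the primary key ensures each row
    is loaded only once, even if it arrives again."""
    con.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
        row_key INTEGER PRIMARY KEY, product TEXT, amount REAL)""")
    con.executemany(
        "INSERT OR IGNORE INTO sales_fact VALUES (:row_key, :product, :amount)",
        load_rows)
    con.commit()
```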
ETL Beginning to end
[Diagram: Customer expectations drive the target system analysis, which yields the ETL direct requirements. Source system analysis and target system analysis together yield the Data Quality SLA and the Metadata SLA, the ETL indirect requirements. Guided by the ETL principles, the source data and the data mapping/logical design feed the physical design and then the ETL application, which delivers data to the data warehouse and so meets customer expectations.]
Closing remarks
• A data warehouse designer captures customers' expectations in the design of a data warehouse.
• A target system analysis captures the behaviour of data in a data warehouse design. These behaviours are expressed as direct requirements.
• Data mapping is a road map showing how an ETL application will achieve those data behaviours.
• The Data Quality SLA and the Metadata SLA capture the information necessary for customers to use the data in the data warehouse (indirect requirements):
  • Is the data complete?
  • Are there any anomalies?
  • When is the data available?
  • What is the profile of today's data?
• The direct and indirect requirements meet in a single physical design, which declares the physical hardware, platform, datasets, and jobs that make up the ETL application.
• The ETL application delivers data to a data warehouse that meets customer expectations.