Extract Transform Load (ETL)
M2: Extract-Transform-Load (ETL)
The only way to do great work is to love what you do. -- Steve Jobs --
WORAPOT JAKKHUPAN, PHD | WORAPOT.J@PSU.AC.TH | Room BSc.0406/7
Information and Communication Technology Programme, Faculty of Science, PSU
Objectives
• Databases vs. data warehouses
• ER diagrams for databases
• Data warehouse architecture
• ETL (data extraction, transformation, and loading) definitions
• ETL design principles
• ETL functions
Data warehouse architecture
Difference between Databases and Data Warehouse
• Database
  • A database is made up of a collection of tables that store a specific set of structured data.
  • A table contains a collection of rows and columns.
  • Each column (also called a field or attribute) is designed to store a certain type of information.
• OLTP
  • OLTP (Online Transaction Processing) is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE).
  • The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured in transactions per second.
Difference between Databases and Data Warehouse
• Data Warehouse
  • "A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process."
  • A data warehouse is designed for OLAP.
• OLAP
  • OLAP (Online Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques.
  • An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas.
Data Warehousing Components
[Diagram: several operational DBs feed an Extract, Transform, Load (ETL) process, which populates the data warehouse; the warehouse in turn serves OLAP and data mining.]
Data Flow
[Diagram omitted]
Source: Connolly & Begg (2001), Database Systems: A Practical Approach to Design, Implementation, and Management (3rd Edition), Addison Wesley
OLTP vs. OLAP
• IT systems can be divided into transactional (OLTP) and analytical (OLAP). In general, OLTP systems provide the source data for data warehouses, whereas OLAP systems help to analyse it.
OLTP vs. OLAP
OLTP systems                                        | OLAP systems
Holds current data                                  | Holds historical data
Stores detailed data                                | Stores detailed and summarized data
Data is dynamic                                     | Data is static
Repetitive processing                               | Ad hoc processing
Predictable pattern of usage                        | Unpredictable pattern of usage
Transaction-driven                                  | Analysis-driven
Application-oriented                                | Subject-oriented
Supports day-to-day operations                      | Supports strategic decisions
Serves a large number of clerical/operational users | Serves a small number of managers
Source: Connolly & Begg (2001), Database Systems: A Practical Approach to Design, Implementation, and Management (3rd Edition), Addison Wesley
Data Acquisition & Integration
• A process to populate a data warehouse.
• Three main functions (a minimal sketch follows this list):
  • Extract: retrieves data from a source system to produce new Source Data.
  • Transform: inspects, cleanses, and conforms the new Source Data into data ready for the data warehouse (called Load Data).
  • Load: updates a data warehouse using the data provided in the Load Data.
• These three functions are more commonly known as ETL.
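To make the three functions concrete, here is a minimal sketch in Python; the CSV source file, the column names, and the SQLite warehouse are assumptions for illustration, not part of the lecture material.

```python
import csv
import sqlite3

def extract(path):
    """Extract: retrieve rows from a source CSV file as the Source Data."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(source_rows):
    """Transform: inspect, cleanse, and conform Source Data into Load Data."""
    load_rows = []
    for row in source_rows:
        load_rows.append({
            "product": row["product"].strip().upper(),  # conform product names
            "amount": float(row["amount"]),             # cast text to a number
        })
    return load_rows

def load(load_rows, db_path="warehouse.db"):
    """Load: update the data warehouse using the Load Data."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact (product TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (:product, :amount)", load_rows)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))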
DW Data Components
• Fact table
  • Tells "one version of the truth" about the subject.
  • Holds numerical measurements: sale amount, total customers, etc.
  • Holds key(s) to the dimension tables.
• Dimension table
  • Identifies the key cells of the fact table.
  • Supports drill down and roll up.
  • Describes the subject: product name, customer name, store location.
A star-schema sketch follows below.
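Here is a minimal star-schema sketch using SQLite; the table and column names (sales_fact, product_dim, store_dim) are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Dimension tables describe the subject: product name, store location, ...
con.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT)")
con.execute("CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_location TEXT)")
# The fact table holds the numerical measurement plus keys to the dimensions.
con.execute("""CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    sale_amount REAL)""")
# Roll up: total sale amount per store location.
cur = con.execute("""
    SELECT s.store_location, SUM(f.sale_amount)
    FROM sales_fact f JOIN store_dim s ON f.store_key = s.store_key
    GROUP BY s.store_location""")
```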
ETL Process
• The ETL process begins by defining a data source and identifying which data in that source you are interested in copying to a new destination.
• You may need to perform one or more transformations on the data for retrieval purposes; e.g., you may need to transform "True" or "False" (string type) into "1" or "0" (Boolean type), as sketched below.
• You also need a load sequence to inject the transformed data into the appropriate destination (also called the target system; in this unit, always a data warehouse or a section of a data warehouse architecture, e.g., a data mart).
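A tiny sketch of that string-to-Boolean transformation; the function name is illustrative only.

```python
def to_boolean(value):
    """Transform the source strings "True"/"False" into 1/0 for the target."""
    mapping = {"True": 1, "False": 0}
    return mapping[value]  # an unexpected value raises KeyError for inspection

assert to_boolean("True") == 1
assert to_boolean("False") == 0
```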
Source System Analysis
• It provides significant insight into and understanding of the enterprise and its data, so that a data warehouse can express the enterprise at any level.
• It examines the enterprise data for its informational content: the meaning of the data and how the data captures and expresses that meaning.
• At this stage, a data warehouse designer should focus on the enterprise and the analysis of its data.
• An early and common mistake in data warehouse design is to use source system analysis merely to search for source data that fits a preconceived definition of the data warehouse.
• The designer must be allowed to query and survey the enterprise data itself, not just a summary or description of it.
Source System Analysis Principles
• They explain what the data warehouse designer is looking for, including:
  • Multiple systems of record; e.g., selling products through a series of retail outlets (West Division, East Division).
  • Entity data, including physical and logical members, agents, facilities, and resources:
    • Physical entities can be touched and uniquely identified.
    • Logical entities cannot be touched; e.g., concepts, constructs, and hierarchies that organize and enhance the meaning of enterprise events and entities.
    • Entities can also describe and qualify each other by their associations; e.g., S Block can identify itself as a unique physical entity as well as identify the location of lecturer #123.
Source System Analysis Principles
• Granularity: a designer must be aware of the grain of every source data element, which is determined by its level of detail, hierarchical depth, or measurement precision.
• Latency: the time gap between an enterprise event and the moment its data becomes available. It determines the earliest moment data can be available to the data warehouse.
• Transaction data, also known as event data, identifies the moment when an enterprise performs its primary functions, e.g.:
  • Sales: the moment when a retail enterprise sells something.
  • Manufacturing: the moment when an assembly plant builds something.
  • Service: the moment when a consulting firm provides a service.
• Snapshot data expresses the cumulative effect of a series of transactions or events over a range of time, e.g., Web site hits per hour.
Source System Analysis Methods
• They explain how the data warehouse designer examines the source system to understand how the enterprise and its data interact.
• System documentation is a good start; it describes how an enterprise system is intended and expected to behave.
• The interaction of enterprise data is a good baseline from which to start.
• We should also document how an enterprise system misbehaves, creating unexpected data and results (anomalous data).
• Source system analysis is the first opportunity to protect the quality of the data in a data warehouse.
Data Flow, State Diagram & System Record
• A data flow diagram is used to identify where the data comes from, where it goes, and by what transport mechanism it moves.
• A data state diagram is used to capture the various business meanings of a data element as it flows through the data flow diagram.
  • It also indicates the relevance of a data element to the enterprise.
  • It also includes any physical indications of each state.
• The authoritative point of origin for each enterprise entity at any given state is the System of Record, from which the ETL application gets the data it loads into a data warehouse.
Business Rules
• Business rules govern the data in the source system.
• The data profile, data flow diagram, data state diagram, and system of record provide the best opportunity to identify the business rules.
• They come in three basic varieties (see the sketch after this list):
  • Intra-record business rules: Column A + Column B = Column C
  • Intra-dataset business rules: Row 1.Column A + Row 2.Column A = Row 3.Column B
  • Cross-dataset business rules: File 1.Column A = Table 2.Column B
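A minimal sketch of checks for the three varieties, assuming rows are plain dictionaries; the column letters follow the examples above.

```python
# Intra-record rule: within one row, Column A + Column B = Column C.
def check_intra_record(row):
    return row["A"] + row["B"] == row["C"]

# Intra-dataset rule: Row 1.Column A + Row 2.Column A = Row 3.Column B.
def check_intra_dataset(rows):
    return rows[0]["A"] + rows[1]["A"] == rows[2]["B"]

# Cross-dataset rule: File 1.Column A must match Table 2.Column B.
def check_cross_dataset(file_rows, table_rows):
    return {r["A"] for r in file_rows} == {r["B"] for r in table_rows}

assert check_intra_record({"A": 1, "B": 2, "C": 3})
```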
Target System Analysis
• The target system is a data warehouse, or a component of a data warehouse architecture.
• The designer needs to choose the data model, RDBMS, and business intelligence reporting architecture.
• The analysis should also indicate how the data warehouse will reflect the entities of the source system (e.g., purchase orders, machines, people, etc.) as those entities cycle through their states (e.g., reviewed, approved, commissioned, hired, etc.).
• Target system analysis should reveal and clarify the expectations of both the data warehouse designer and the customers.
• It also provides an opportunity to recognize and resolve discrepancies between the designer and the customers.
• Its goal is to create a set of expectations so explicit that they can be compared directly to the data in the data warehouse.
Data mapping
• It is the process by which an ETL analyst identifies the source data, specific to location, state, and timing, that will be used to satisfy the data requirements of a data warehouse.
• The transformations necessary to create the data elements, as they will be stored in a data warehouse, are also included in a data mapping.
• The Data Mapping document is an input into the Data Quality SLA and the Metadata SLA.
• The Data Mapping document must clearly and precisely identify the source data element that will be used, so that there is no ambiguity about the location, state, or timing of the extract of a data element.
• It must likewise identify the target data element that will be populated, so that there is no ambiguity about the location and state of the data element as stored in the data warehouse.
• It must also define the transformations necessary to create the data element as it will be stored in the data warehouse.
Types of data mapping
1. Simple data mapping

Source data element  | Transformation | Target data element
Length in kilometres | n/a            | Length in kilometres

2. Derived data mapping

Source data element  | Transformation | Target data element
Length in kilometres | × 1000         | Length in metres
Types of data mapping cont.
3. Recursive data mapping (see the sketch below)

Source data element  | Transformation | Target data element
Length in kilometres | n/a            | Length in kilometres
Price per metre      | n/a            | Price per metre

Source data element  | Transformation | Target data element
Price per metre      | × 1000         | Price per kilometre

Source data element                       | Transformation | Target data element
Length in kilometres, Price per kilometre | ×              | Total price
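The three mapping types can be sketched as small transformation functions; the field names (length_km, price_per_m) are invented for the example.

```python
# Simple mapping: the target element is the source element unchanged.
def map_length_km(row):
    return row["length_km"]

# Derived mapping: length in kilometres x 1000 -> length in metres.
def map_length_m(row):
    return row["length_km"] * 1000

# Recursive mapping: a derived element (price per kilometre) feeds a
# further derivation (total price).
def map_price_per_km(row):
    return row["price_per_m"] * 1000

def map_total_price(row):
    return row["length_km"] * map_price_per_km(row)

row = {"length_km": 2.5, "price_per_m": 4.0}
assert map_total_price(row) == 10000.0  # 2.5 km x 4000 per km
```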
ETL vs. ELT
[Diagram: In ETL, transaction source data is extracted from the source application into the ETL environment, transformed into transaction load data, and then loaded into the warehouse's transaction table. In ELT, the transaction source data is extracted and loaded into the data warehouse first, then transformed inside the warehouse into the transaction table of active and current data.]
ETL vs. ELT cont.
• In an ETL application, data is extracted from an operational system, a transform performs all data modifications to the Source Data, and a load application reads the Load Data and performs the necessary inserts, updates, and deletes to a data warehouse.
• An ELT application performs all the functions and purposes of an ETL application.
• The difference between an ETL application and an ELT application is the platform on which the application performs its functions.
• ELT has two advantages (a minimal ELT sketch follows below):
  • A data warehouse RDBMS platform is a powerful platform. All the resources (CPU seconds, throughput, etc.) of a data warehouse RDBMS platform are available to an ELT application.
  • A copy of look-up data need not be kept and maintained on the ELT platform, because the data warehouse RDBMS has access to all the data in the data warehouse.
• ELT has one disadvantage:
  • A portion of the data warehouse's resources (CPU seconds, throughput, etc.) is consumed by someone other than a data warehouse customer. Given sufficient data volumes and transformation complexity, this could adversely affect data warehouse customers.
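A minimal ELT sketch using SQLite as a stand-in for the warehouse RDBMS: the data is loaded raw, then transformed with SQL inside the database. The table names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stage_sales (product TEXT, amount TEXT)")
con.execute("CREATE TABLE sales_fact (product TEXT, amount REAL)")

# Load first: copy the raw source rows into a staging table inside the warehouse.
con.executemany("INSERT INTO stage_sales VALUES (?, ?)",
                [("widget", "19.90"), ("gadget", "5.00")])

# Then transform inside the warehouse, using the RDBMS's own resources.
con.execute("""
    INSERT INTO sales_fact (product, amount)
    SELECT UPPER(product), CAST(amount AS REAL) FROM stage_sales""")
con.commit()
```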
ETL Design Principles
• ETL applications are subject to unexpected circumstances and, therefore, should expect the unexpected to occur.
• An ETL analyst must work hard to make an ETL application bulletproof, so that each ETL application behaves as intended even when the source system does not.
• The ETL process principles (Principles 1 to 6) address the executable part of an ETL application, i.e., the code that moves, copies, and transforms data.
  • This part is similar to a manufacturing plant: it converts and transforms raw data (the materials) into a data warehouse (the finished product).
• The ETL staging principles (Principles 7 to 11) provide design principles for managing and controlling the creation and use of stage data and structures.
Principle 01: one thing at a time
• Multitasking conserves time and resources, but it runs contrary to the spirit of ETL: an ETL application assumes that nothing will go as planned and that some input values will be unreasonable or invalid.
• It is recommended to perform each action individually and then combine the separate result sets into one set of data.
• One Thing at a Time is essentially a granular, modular approach (sketched below). Its benefits include:
  • Creating the opportunity for Data Quality and Metadata functions to integrate within an ETL application.
  • Creating the opportunity to isolate violated assumptions.
  • Removing any question about the sequence and precedence of ETL functions, regardless of the language or platform.
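One possible reading of the granular, modular approach, assuming rows are dictionaries: each cleansing step checks exactly one assumption, so a violated assumption is easy to isolate.

```python
def drop_negative_amounts(rows):
    """One thing: reject rows whose amount is unreasonable."""
    return [r for r in rows if r["amount"] >= 0]

def drop_blank_products(rows):
    """One thing: reject rows with no product name."""
    return [r for r in rows if r["product"].strip()]

def cleanse(rows):
    """Perform each action individually; the surviving rows form one
    combined set of data, and each step can report what it rejected."""
    rows = drop_negative_amounts(rows)
    rows = drop_blank_products(rows)
    return rows
```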
Principle 02: Know When to Begin
• Operational systems rely on operational job schedulers to know when the conditions have been satisfied for a job to begin.
• ETL applications, however, rely on conditions within precedent data (i.e., Begin Conditions). When the precedent Begin Conditions have been satisfied, subsequent applications relying on those conditions can safely begin (see the sketch after this list).
  • An Extract application examines an operational source system prior to extracting data.
  • A Transform application examines the data provided by the preceding Extract applications.
  • A Load application examines the data provided by the preceding Transform applications to determine whether the Begin Conditions have been satisfied.
• Data Quality and Metadata information prove extremely helpful in these circumstances.
• Principle 02 is essentially a backward-looking design principle.
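A sketch of a Begin Condition check, assuming the preceding Extract writes a stage file plus a completion marker; the file names are invented.

```python
import os
import time

def begin_conditions_met(stage_file, marker_file):
    """A Transform application examines the data left by the preceding
    Extract application before it begins."""
    return os.path.exists(marker_file) and \
           os.path.exists(stage_file) and os.path.getsize(stage_file) > 0

# Wait until the precedent Extract has signalled completion.
while not begin_conditions_met("stage/sales.csv", "stage/sales.csv.done"):
    time.sleep(60)
```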
Principle 03: Know When to End
• This is a forward-looking design principle that requires an ETL application to examine the data it has created.
• An ETL application can verify, by examining its own output data, whether it has completed satisfactorily.
• The results of that final review can then be captured as Data Quality or Metadata information and shared with subsequent ETL applications (see the sketch below).
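A sketch of such a forward-looking end check: the application re-reads its own output and records the result for subsequent applications. The row-count criterion is an assumption for the example.

```python
def verify_own_output(path, expected_rows):
    """Examine the output this application just created and report
    whether the run completed satisfactorily."""
    with open(path) as f:
        actual = sum(1 for _ in f) - 1  # subtract the header line
    return {"rows_written": actual, "complete": actual == expected_rows}
```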
Principle 04: Large to Medium to Small
• A Large to Medium to Small design first assembles all applicable data elements and entities.
• Data that is no longer required is then dismissed. The final data set is a load-ready file that will be loaded into a data warehouse.
• At the initial stage, all applicable data is juxtaposed simultaneously, so the decision to exclude data is made in the broadest context possible, which allows the greatest control over data exclusion.
Principle 05: stage data integrity
• This is a design principle by which precedent applications create (store) a set of stage data exactly as it will be consumed by subsequent applications.
• Once created, a set of stage data can only be consumed as a single contiguous set by subsequent applications.
• It avoids unnecessary risk and increases the overall integrity of an ETL application.
• For example, suppose we have source raw-materials data from companies A, B, and C, and an application that extracts data describing raw materials from company A. We could take one of the following approaches:
  • Create a single set of stage data ABC (not a good solution; why?).
  • Create multiple sets of stage data: A, B, C, and ABC.
Principle 06: Know what you have
• This principle prompts an ETL application to take inventory of inbound data, rather than assume the inbound data contains everything that is expected.
• Information describing the contents of inbound data is available from two sources: metadata and the data itself.
• The output of the comparison between inbound data and expected data is a list of matches and a list of mismatches (missing data).
• Normally, a threshold on the missing data is used to choose a response, based on the history of data anomalies (see the sketch below).
[Diagram: inbound data elements A, B, and C are compared against the expected A, B, and C; the matches are "what you have" and the mismatches are "what you don't have".]
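A minimal sketch of taking inventory, treating the inbound and expected contents as sets; the 10% threshold is an assumed value.

```python
def take_inventory(inbound, expected):
    matches = expected & inbound       # what you have
    mismatches = expected - inbound    # what you don't have
    return matches, mismatches

expected = {"A", "B", "C"}
matches, missing = take_inventory({"A", "B"}, expected)
# A threshold on missing data chooses the response, based on the
# history of data anomalies.
if len(missing) / len(expected) > 0.10:
    print("Investigate missing inbound data:", missing)
```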
ETL Staging Principles
• Principle 07: Name the data – describes how to identify data and its features, origin, and destination with an appropriate level of granularity and control.
• Principle 08: Own the data – describes how to secure data to prevent interference by other applications, including other ETL and operational applications.
• Principle 09: Build the data – describes how to create a data set from its foundation.
• Principle 10: Type the data – describes how to protect ETL functions from incompatible data types.
• Principle 11: Land the data – describes the need to retain interim data beyond its immediate use.
ETL Functions
• ETL functions are designed to discern what has happened in the enterprise and bring that information to the data warehouse.
• Extract functions retrieve data from the source system and store it as stage data in the ETL environment.
• Transform functions are applied to staged datasets to derive the required information (sets of dimension data).
• Load functions load the data into the data warehouse.
Extract functions
1. Extract data from a contiguous dataset – a simple extract function (sketched below).
   [Diagram: Source system → Extract → Stage in the ETL environment]
2. Extract data from a data flow – needs a control mechanism based on the bundles of data flow records.
   [Diagram: Source system → Extract → Stage in the ETL environment]
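A sketch of the simple extract from a contiguous dataset: the source table is copied into a stage file in the ETL environment. The database, table, and file names are invented.

```python
import csv
import sqlite3

def extract_to_stage(source_db, stage_path):
    """Copy the source table, as one contiguous dataset, to stage data."""
    con = sqlite3.connect(source_db)
    rows = con.execute("SELECT product, amount FROM sales").fetchall()
    con.close()
    with open(stage_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "amount"])  # header for the stage file
        writer.writerows(rows)

extract_to_stage("operational.db", "stage/sales.csv")
```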
Data level Transform functions
• Row-level transformation is applied to every row in a staged dataset; these are the simplest transformations.
• Dataset-level transformation:
  • is performed within the context of a whole set of data;
  • must address the whole dataset at a time to derive the information necessary to update each individual row.
• Surrogate key generation (intra-dataset):
  • generates a sequential numeric value that uniquely identifies each row of the dataset;
  • a surrogate key is used here to uniquely identify each row in an ETL application, because transformed data sometimes lacks a key (see the sketch below).
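A sketch combining a row-level transformation with intra-dataset surrogate key generation; the field names are invented.

```python
def transform_stage(stage_rows):
    out = []
    for i, row in enumerate(stage_rows, start=1):
        out.append({
            "row_key": i,  # intra-dataset surrogate key: sequential and unique
            "product": row["product"].strip().upper(),  # row-level transformation
            "amount": float(row["amount"]),             # row-level transformation
        })
    return out
```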
Data warehouse level transformation
• These functions must be performed within the context of the data warehouse.
• By themselves they do not have all the knowledge necessary to derive the required data; they must use both the input data and data already in the data warehouse.
• Surrogate key generation (intra-data warehouse):
  • The identifier should be unique throughout the data warehouse.
  • The best way is to retrieve the maximum identifier in the data warehouse and assign max + 1 to the new row, as sketched below.
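A sketch of the max + 1 approach against a warehouse table; the table name reuses product_dim from the earlier star-schema sketch. In a concurrent warehouse this would need a lock or a database sequence, which is beyond this sketch.

```python
import sqlite3

def next_surrogate_key(con, table, key_column):
    """Retrieve the max identifier in the warehouse and assign max + 1."""
    (max_key,) = con.execute(
        f"SELECT COALESCE(MAX({key_column}), 0) FROM {table}").fetchone()
    return max_key + 1

con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS product_dim "
            "(product_key INTEGER PRIMARY KEY, product_name TEXT)")
new_key = next_surrogate_key(con, "product_dim", "product_key")
con.execute("INSERT INTO product_dim VALUES (?, ?)", (new_key, "NEW PRODUCT"))
con.commit()
```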
Load Data
1. Load data from a stable and contiguous dataset – the simplest and most common method.
   [Diagram: Load data in the ETL environment → Load → Data warehouse]
2. Load data from a data flow – needs a control mechanism to ensure each row is loaded only once (see the sketch below).
   [Diagram: Load data in the ETL environment → Load → Data warehouse]
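A sketch of one possible load control for the data-flow case: the surrogate key doubles as the control mechanism, so a row that arrives twice is loaded only once. INSERT OR IGNORE is SQLite syntax; other RDBMSs use MERGE or ON CONFLICT clauses.

```python
import sqlite3

def load_exactly_once(con, load_rows):
    """Load rows from a data flow; the primary key ensures each row
    is loaded only once, even if it arrives again."""
    con.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
        row_key INTEGER PRIMARY KEY, product TEXT, amount REAL)""")
    con.executemany(
        "INSERT OR IGNORE INTO sales_fact VALUES (:row_key, :product, :amount)",
        load_rows)
    con.commit()
```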
ETL Beginning to end
[Diagram: Customer expectations drive the target system analysis, which yields the ETL direct requirements. Source system analysis and target system analysis together yield the Data Quality SLA and the Metadata SLA, the ETL indirect requirements. Guided by the ETL principles, the source data and the data mapping/logical design feed the physical design and then the ETL application, which delivers data to the data warehouse and so meets customer expectations.]
Closing remarks
• A data warehouse designer captures customers' expectations in the design of a data warehouse.
• A target system analysis captures the behaviour of data in a data warehouse design. These behaviours are expressed as direct requirements.
• Data mapping is a road map showing how an ETL application will achieve those data behaviours.
• The Data Quality SLA and the Metadata SLA capture the information necessary for customers to use the data in the data warehouse (indirect requirements):
  • Is the data complete?
  • Are there any anomalies?
  • When is the data available?
  • What is the profile of today's data?
• The direct and indirect requirements meet in a single physical design, which declares the physical hardware, platform, datasets, and jobs that make up the ETL application.
• The ETL application delivers data to a data warehouse that meets customer expectations.