Customised Etl
8/13/2019 Customised Etl
Wipro Confidential 2/9/2005
Customised ETL using Perl and PL/SQL
Daison Jose ([email protected])
Table of Contents
Introduction
ETL tool Vs building customized ETL routine
Code Based Tools
  Perl programming
    Pulling the data from the Source
    Delta load
    Filtering, Formatting and Data Cleansing
    Advantages
  SQL Loader
  PL/SQL
    Formatting and Filtering
    Advantages
    Implementation
    Calling Filter Rules in PL/SQL
Introduction
When starting a new data warehouse project, you must decide whether to buy or build your ETL (extract, transform and load) solution. A few constraints may prevent you from purchasing a product. After buying the hardware, database software and BI tools, you may not have enough money left in your capital budget for an ETL purchase.
If you are starting with a proof of concept, it is unlikely that you will have the time or the budget to go through an ETL product selection and purchase cycle in the time allotted.
ETL tool Vs building customized ETL routine
When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things:
- Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.
- Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch.
- Data volume: Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.
You may also see these tools classified as data cleansing tools, though here we should make a careful distinction. While data cleansing can certainly encompass the ETL functions, most true data cleansing tools are not architected to perform true ETL and, vice versa, most ETL tools provide only limited true data cleansing functions.
A quick example can show the difference between the two types of tools. Suppose you have an input data file containing the full title of a movie. Particularly in commercial services, this data may arrive in any number of formats; for this example, let's use "Men In Black (Ws)" and "Men in Black (Ws, Dub, Dol)". The requirement is to search for all literals such as Ws and Dol, replace them with their full forms (Widescreen, Dolby) and finally place them in order based on literal ranking. An example of a replaced title looks like Men In Black (Dolby, Widescreen, Dubbed). This type of complex string parsing is not generally a strong function of an ETL tool. On the other hand, the ETL tool will generally be better at efficiently looking up the literal "Ws" in a relational database movie table and returning the integer key value based on the literal.
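The title-parsing requirement described above can be sketched in a few lines of Perl. This is only an illustration: the lookup table, its rank values and the function name expand_title are assumptions chosen to reproduce the example title, not part of any existing tool.

```perl
use strict;
use warnings;

# Hypothetical lookup: abbreviation => [full text, rank]. The ranks are an
# assumption chosen to reproduce the ordering in the example title.
my %LITERALS = (
    Dol => [ 'Dolby',      1 ],
    Ws  => [ 'Widescreen', 2 ],
    Dub => [ 'Dubbed',     3 ],
);

# Expand the abbreviations inside the trailing parentheses of a title and
# sort them by rank. Titles without a parenthesised list are returned as-is.
sub expand_title {
    my ($title) = @_;
    return $title unless $title =~ /^(.*?)\s*\(([^)]*)\)\s*$/;
    my ($name, $list) = ($1, $2);

    my @entries;
    for my $abbr (split /\s*,\s*/, $list) {
        $abbr =~ s/^\s+|\s+$//g;
        # Unknown literals are kept unchanged and ranked last.
        push @entries, $LITERALS{$abbr} // [ $abbr, 999 ];
    }
    my @sorted = map { $_->[0] } sort { $a->[1] <=> $b->[1] } @entries;
    return "$name (" . join(', ', @sorted) . ")";
}

# prints: Men in Black (Dolby, Widescreen, Dubbed)
print expand_title('Men in Black (Ws, Dub, Dol)'), "\n";
```

Keeping the literals and their ranks in a hash means a new abbreviation can be supported by adding one entry rather than changing the parsing logic.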
Development environments are typically split two ways: GUI-based or code-based tools. Code-based tools are the most familiar and may not be considered "tools" independent of the language they represent. For example, Perl can be used as a code-based ETL tool, but it is also a more generalized programming language. The embedded transactional code languages within common database platforms (e.g., PL/SQL with Oracle, Transact-SQL with Microsoft SQL Server, etc.) may also provide ETL functionality, but are not limited to this capability. Aside from general programming languages, several tools on the market utilize a custom scripting language developed explicitly for the optimization of ETL routines.
Code Based Tools
This document covers how best to design and build a code-based ETL tool using Perl and PL/SQL with Oracle.
Extract
The process of reading data from a database or flat files.
Transform
The process of converting the extracted data from its original state into the form it needs to be in so it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining data with other data.
Load
The process of writing the data into the target database.
Perl programming
Pulling the data from the Source
Perl can be used to extract the data from the FTP server to the landing server.
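As a minimal sketch of this step, the core Net::FTP module can download a source file into a landing directory. The function name fetch_source_file and its named arguments (host, user, password, remote_file, landing_dir) are hypothetical; real values depend on the environment.

```perl
use strict;
use warnings;
use Net::FTP;
use File::Basename qw(basename);
use File::Spec;

# Download one file from the source FTP server into the landing directory.
# All argument names are placeholders for this sketch; the caller supplies
# host, user, password, remote_file and landing_dir.
sub fetch_source_file {
    my (%arg) = @_;

    my $ftp = Net::FTP->new($arg{host}, Timeout => 60)
        or die "Cannot connect to $arg{host}: $@";
    $ftp->login($arg{user}, $arg{password})
        or die "Login failed: " . $ftp->message;
    $ftp->binary;    # transfer bytes unchanged, no line-ending translation

    # Store the file under its base name inside the landing directory.
    my $local = File::Spec->catfile($arg{landing_dir}, basename($arg{remote_file}));
    $ftp->get($arg{remote_file}, $local)
        or die "Download of $arg{remote_file} failed: " . $ftp->message;
    $ftp->quit;

    return $local;   # path of the downloaded file
}
```

The function dies with the server's message on any failure, so a calling script can stop the load before later steps run on a missing or partial file.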
Delta load
In most cases, the data that we pull from the source system is a full load. No one wants to load the full data set into the database system every time. So before loading the data into the database, you can do a file-level comparison between today's data file and yesterday's data file and load only the delta. This reduces the amount of filtering needed at the flat-file level and the amount of data transferred over the network.
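One simple way to do such a file-level comparison is sketched below, assuming one record per line and that yesterday's file fits in memory. The function name write_delta and the file names are illustrative only.

```perl
use strict;
use warnings;

# Write to $delta_file every record of $today_file that does not appear
# verbatim in $yesterday_file. Returns the number of delta records.
sub write_delta {
    my ($yesterday_file, $today_file, $delta_file) = @_;

    # Remember every record seen yesterday.
    my %seen;
    open my $old, '<', $yesterday_file or die "Cannot read $yesterday_file: $!";
    while (my $line = <$old>) {
        chomp $line;
        $seen{$line} = 1;
    }
    close $old;

    # Copy only new or changed records from today's file.
    my $count = 0;
    open my $new, '<', $today_file or die "Cannot read $today_file: $!";
    open my $out, '>', $delta_file or die "Cannot write $delta_file: $!";
    while (my $line = <$new>) {
        chomp $line;
        next if $seen{$line};
        print {$out} "$line\n";
        $count++;
    }
    close $new;
    close $out or die "Cannot close $delta_file: $!";

    return $count;
}
```

This sketch detects new and changed records only; records that were present yesterday and are missing today would need a second pass in the opposite direction. For files too large to hold in a hash, sorting both files and comparing them line by line is a common alternative.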
Filtering, Formatting and Data Cleansing
Perl is a great utility for formatting, filtering and data cleansing. If you have millions of rows in a data file that need some kind of formatting and filtering, the best approach is to do it at the file-system level rather than loading the entire data set into the database. With the power of regular expressions (RegEx), we can do most of the filtering, formatting and cleaning. We have designed an architecture framework for this using Perl, called 4F: Flat File Filtering and Formatting (available in knet).
Any programmer/designer can easily plug this framework into their own application and utilize the power of Perl and RegEx.
Advantages
- A framework that can be used to perform data transformations, formatting and filtering of the data within files
- Simple to use, highly reusable and provides the capability to add, modify and delete rules without any code changes
- File data transformations: string manipulation based on pattern matching
- File data filtering: rejection of records based on predefined criteria
- RegEx-based rules specification: perform pattern matching and string manipulation
- Configurable rules list: simple procedure to add and remove data transformation and filtering rules
- Logging
Some of the filtering, formatting and data cleaning that has been done using this framework:
1. Deletion of an invalid column based on some criteria
2. Trimming of spaces
3. InitCap of columns
4. Applying proper case to columns
5. Replacing one specific word with another word based on a condition
6. Splitting of one column into two columns based on some condition, e.g. a / delimiter
7. Deleting the entire row from the flat file based on a condition
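The idea of a configurable, RegEx-based rules list can be illustrated with the short sketch below. It is not the 4F framework itself: the rule format, the field names (name, col, action, pattern, with) and the function apply_rules are assumptions made for this example, and a pipe-delimited record layout is assumed.

```perl
use strict;
use warnings;

# Hypothetical rules list. Each rule names a 0-based column, an action
# ('replace' or 'reject') and a regular expression.
my @RULES = (
    { name => 'trim spaces',     col => 1, action => 'replace', pattern => qr/^\s+|\s+$/, with => '' },
    { name => 'expand Ws',       col => 1, action => 'replace', pattern => qr/\bWs\b/,    with => 'Widescreen' },
    { name => 'reject blank id', col => 0, action => 'reject',  pattern => qr/^\s*$/ },
);

# Apply every rule, in list order, to one pipe-delimited record. Returns the
# transformed record, or undef when a reject rule matches.
sub apply_rules {
    my ($record, $rules) = @_;
    my @fields = split /\|/, $record, -1;    # -1 keeps trailing empty fields

    for my $rule (@$rules) {
        my ($col, $re) = @{$rule}{qw(col pattern)};
        next unless defined $fields[$col];
        if ($rule->{action} eq 'reject') {
            return undef if $fields[$col] =~ $re;
        }
        elsif ($rule->{action} eq 'replace') {
            my $with = $rule->{with};
            $fields[$col] =~ s/$re/$with/g;
        }
    }
    return join '|', @fields;
}

# Filter a few sample records; rejected records are skipped.
for my $record ('7|  Men in Black (Ws)  |2005', ' |No id|2005') {
    my $clean = apply_rules($record, \@RULES);
    print "$clean\n" if defined $clean;
}
```

Because the rules are plain data, adding or removing a rule means editing the list (or the file it is loaded from) rather than the processing loop, which is the property the advantages above describe.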
SQL Loader
To load the data from the flat file into the staging database, you can use SQL Loader. You can also implement simple filtering rules in the SQL Loader control file. This helps prevent unnecessary loading of bulk data into the database system.
OPTIONS (SKIP=0)
LOAD DATA
INFILE 'filename' "str '|'"
APPEND
INTO TABLE dvd_tmp
TRAILING NULLCOLS
(
  EMP_ID  sequence(MAX,1),
  NAME    TERMINATED BY "|",
  RELDATE TERMINATED BY "|" "COMMON.CHECK_DATE_COMM(:RELDATE)"
)
In the above control file definition, COMMON.CHECK_DATE_COMM is a PL/SQL packaged function that is used to validate the date column.
PL/SQL
PL/SQL can be used for filtering and formatting that cannot be done at the file level or using Perl. Some data filtering has to be applied based on values that exist in the database.
Formatting and Filtering
One approach to designing this filtering is to create a set of rules (covering filtering, formatting and cleaning) in the form of Oracle SQL statements and store them in a table. Call these filter rules and apply them wherever required.
Advantages
- Easy to manage the rules
- Flexibility to add/remove/modify the filter criteria without making any code change
- Logging
- Ability to know which filter rule has been applied
Implementation
Create a table called filt_crit in which to store all the filtering/formatting rules in the form of SQL statements. Here is the table definition for the filter table.
CREATE TABLE FILT_CRIT
(
  FILT_ID         NUMBER(4)           NOT NULL,
  FILT_NM         VARCHAR2(150 BYTE)  NOT NULL,
  FILT_SQL_STMT   CLOB                NOT NULL,
  REC_CRT_TS      DATE                NOT NULL,
  REC_UPD_TS      DATE                NOT NULL,
  REC_CRT_USR_ID  VARCHAR2(20 BYTE)   NOT NULL,
  REC_UPD_USR_ID  VARCHAR2(20 BYTE)   NOT NULL,
  PRCS_NM         VARCHAR2(150 BYTE)  NOT NULL,
  FILT_SEQ        NUMBER(3)           NOT NULL
);
Column Name      Details
FILT_ID          A unique identifier for the rules
FILT_NM          Name of the filter
FILT_SQL_STMT    Filtering/formatting rules in the form of SQL statements
FILT_SEQ         The sequence in which the rule has to be applied
PRCS_NM          The name of the process/application in which we are applying the rules
Example
FILT_ID          1
FILT_NM          Filter all items from EMPLOYEE table where employee id is 101
FILT_SQL_STMT    UPDATE EMPLOYEE SET del_flg = 'E', filt_id = 4, rec_upd_ts = SYSDATE, rec_upd_usr_id = USER WHERE emp_id = '101'
FILT_SEQ         1
PRCS_NM          PRE_TRANSFER
Calling Filter Rules in PL/SQL
1. Open a cursor on all the rows (rules) in the filter table, ordered by filt_seq
2. Loop through the cursor
3. Apply each filter rule
Pseudo code
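DECLARE
  -- Process whose rules are applied; 'PRE_TRANSFER' is the value from the example row
  vr_filterType filt_crit.prcs_nm%TYPE := 'PRE_TRANSFER';
  vr_filt_msg   VARCHAR2(4000);
BEGIN
  -- Read the rules for this process in the order given by filt_seq
  FOR cursor_record IN (SELECT filt_id, filt_nm, filt_sql_stmt
                          FROM filt_crit
                         WHERE prcs_nm = vr_filterType
                         ORDER BY filt_seq)
  LOOP
    -- Run the stored SQL statement (it must not end with a semicolon)
    EXECUTE IMMEDIATE TO_CHAR(cursor_record.filt_sql_stmt);
    vr_filt_msg := 'Filter ' || cursor_record.filt_id
                   || ' : Number of Rows affected : ' || SQL%ROWCOUNT;
    DBMS_OUTPUT.PUT_LINE(vr_filt_msg);
    COMMIT;
  END LOOP;
EXCEPTION
  WHEN OTHERS THEN
    -- Undo the failing rule's uncommitted changes and report the error to the caller
    ROLLBACK;
    RAISE;
END;
/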