
    Wipro Confidential 2/9/2005

Customised ETL using Perl and PL/SQL

Daison Jose ([email protected])


    Table of Contents

Introduction
ETL tool vs. building a customized ETL routine
Code Based Tools
Perl programming
    Pulling the data from the Source
    Delta load
    Filtering, Formatting and Data Cleansing
    Advantages
    SQL Loader
PL/SQL
    Formatting and Filtering
    Advantages
    Implementation
    Calling Filter Rules in PL/SQL


    Introduction

When starting a new data warehouse project, you must decide whether to buy or build your ETL (extract, transform and load). There are a few constraints that may prevent you from purchasing a product. After buying the hardware, database software and BI tools, you may not have enough money left in your capital budget for an ETL purchase.

If you are starting with a proof of concept, it is unlikely that you will have the time or the budget to go through an ETL product selection and purchase cycle in the time allotted.

ETL tool vs. building a customized ETL routine

When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things:

Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.

Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionality. Otherwise, it may be sufficient to simply build the ETL routine from scratch.

Data volume: Available commercial tools typically have features that can speed up data movement. Buying a commercial product is therefore a better approach when the volume of data transferred is large.

You may also see these tools classified as data cleansing tools, though here we should make a careful distinction. While data cleansing can certainly encompass the ETL functions, most true data cleansing tools are not architected to perform true ETL and, vice versa, most ETL tools provide only limited true data cleansing functions.

A quick example shows the difference between the two types of tools. Suppose you have an input data file containing the full title of a movie. Particularly in commercial services, this data may arrive in any manner of formats; for this example, let's use "Men In Black (Ws)" and "Men in Black (Ws, Dub, Dol)". The requirement is to search for literals like Ws, Dol, etc., replace them with Widescreen, Dolby and so on, and finally order them by literal ranking. An example of a replaced title is "Men In Black (Dolby, Widescreen, Dubbed)". This type of complex string parsing is not generally a strong function of an ETL tool. On the other hand, an ETL tool will generally be better at efficiently looking up the literal "Ws" in a relational database movie table and returning the integer numeric key value for that literal.
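As a sketch of how such parsing might look in Perl, the snippet below expands the literals and reorders them by rank. The literal map and ranking are illustrative assumptions, not the real reference data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical literal map and ranking -- in practice these would come
# from the project's reference data, not be hard-coded here.
my %expand = (Ws => 'Widescreen', Dol => 'Dolby', Dub => 'Dubbed');
my %rank   = (Dolby => 1, Widescreen => 2, Dubbed => 3);

sub normalise_title {
    my ($title) = @_;
    # Split "Name (lit, lit, ...)" into the name and the literal list.
    return $title unless $title =~ /^(.*?)\s*\(([^)]*)\)\s*$/;
    my ($name, $lits) = ($1, $2);
    # Expand each literal, then order by ranking (unknowns sort last).
    my @words = map { $expand{$_} // $_ } split /\s*,\s*/, $lits;
    @words = sort { ($rank{$a} // 99) <=> ($rank{$b} // 99) } @words;
    return "$name (" . join(', ', @words) . ")";
}

print normalise_title("Men In Black (Ws, Dub, Dol)"), "\n";
# -> Men In Black (Dolby, Widescreen, Dubbed)
```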

Development environments are typically split two ways: GUI-based or code-based tools. Code-based tools are the most familiar and may not be considered "tools" independent of the language they represent. For example, Perl can be used as a code-based ETL tool, but it is also a more generalized programming language. The embedded transactional code languages within common database platforms (e.g., PL/SQL with Oracle, Transact-SQL with Microsoft SQL Server, etc.) may also provide ETL functionality, but are not limited to this capability. Aside from general programming languages, several tools on the market utilize a custom scripting language developed explicitly for the optimization of ETL routines.


    Code Based Tools

This document covers how best to design and build a code-based ETL tool using Perl and PL/SQL with Oracle.

    Extract

    The process of reading data from a database or flat files.

    Transform

The process of converting the extracted data from its original state into the form it needs to be in so it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining data with other data.

    Load

    The process of writing the data into the target database.

    Perl programming

    Pulling the data from the Source

    Perl can be used to extract the data from the FTP server to the landing server.
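A minimal sketch of such a pull using the core Net::FTP module; the host, credentials and paths below are placeholders, not values from the original paper:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Net::FTP;

# Hedged sketch: pull one feed file from an FTP server to the landing
# server. All connection details are hypothetical placeholders.
sub pull_feed {
    my ($host, $user, $pass, $remote, $local) = @_;
    my $ftp = Net::FTP->new($host, Timeout => 60)
        or die "Cannot connect to $host: $@";
    $ftp->login($user, $pass) or die "Login failed: " . $ftp->message;
    $ftp->binary;    # feed files may not be plain ASCII
    $ftp->get($remote, $local) or die "Get failed: " . $ftp->message;
    $ftp->quit;
    return $local;
}

# Example (placeholders):
# pull_feed('ftp.example.com', 'user', 'secret',
#           '/outbound/daily_feed.dat', '/landing/daily_feed.dat');
```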

    Delta load

In most cases, the data we pull from the source system is a full load, and no one wants to load the full data set into the database every time. So before loading the data into the database, you can do a file-level comparison between today's data file and yesterday's and load only the delta. This reduces the amount of filtering needed at the flat-file level and saves on the amount of data transferred over the network.
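The file-level comparison can be sketched like this; file names are placeholders, and a real feed may need a key-based comparison rather than whole-line matching:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a file-level delta: keep only the lines of today's full
# extract that were not present in yesterday's file.
sub delta_lines {
    my ($yesterday, $today, $delta) = @_;

    # Index yesterday's lines for O(1) membership checks.
    open my $old, '<', $yesterday or die "open $yesterday: $!";
    my %seen;
    $seen{$_} = 1 while <$old>;
    close $old;

    # Write out only the lines that are new today.
    open my $new, '<', $today or die "open $today: $!";
    open my $out, '>', $delta or die "open $delta: $!";
    while (my $line = <$new>) {
        print {$out} $line unless $seen{$line};
    }
    close $new;
    close $out;
}

# Example (placeholder file names):
# delta_lines('feed_yesterday.dat', 'feed_today.dat', 'feed_delta.dat');
```

Note that the whole yesterday file is held in memory as hash keys; for very large feeds a sorted-file comparison would be the safer design.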


    Filtering, Formatting and Data Cleansing

Perl is a great utility for formatting, filtering and data cleansing. If you are dealing with millions of rows in a data file that need some kind of formatting and filtering, the best approach is to do it at the file-system level rather than loading the entire volume of data into the database. With the power of RegEx, we can do most of the filtering, formatting and cleaning. We have designed an architecture framework for this using Perl, called 4F: Flat File Filtering and Formatting (available in knet).

Any programmer or designer can easily plug this framework into their own application and utilize the power of Perl and RegEx.

Advantages

A framework that can be used to perform data transformations, formatting and filtering of the data within files:

- Simple to use, highly reusable, and provides the capability to add, modify and delete rules without any code changes
- File data transformations: string manipulation based on pattern matching
- File data filtering: rejection of records based on predefined criteria
- RegEx-based rules specification: perform pattern matching and string manipulation
- Configurable rules list: a simple procedure to add and remove data transformation and filtering rules
- Logging

Some of the filtering, formatting and data cleaning done using this framework:

1. Deletion of an invalid column based on some criteria
2. Trimming of spaces
3. InitCap of columns
4. Applying proper case to columns
5. Replacing one specific word with another based on a condition
6. Splitting one column into two columns based on some condition, e.g. a "/" delimiter
7. Deleting an entire row from the flat file based on a condition
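The rule-list idea behind the framework can be sketched as follows; the hard-coded rules here stand in for the configurable rules list that 4F actually reads, so names and rules are illustrative only:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative sketch of an ordered, pluggable RegEx rule list.
# Each rule has a name (for logging) and a transformation sub.
my @rules = (
    { name => 'trim',       apply => sub { $_[0] =~ s/^\s+|\s+$//g;        $_[0] } },
    { name => 'initcap',    apply => sub { $_[0] =~ s/\b(\w)/\u$1/g;       $_[0] } },
    { name => 'widescreen', apply => sub { $_[0] =~ s/\bWs\b/Widescreen/g; $_[0] } },
);

# Apply every rule, in order, to one record.
sub clean_record {
    my ($rec) = @_;
    $rec = $_->{apply}->($rec) for @rules;
    return $rec;
}

print clean_record('  men in black (Ws)  '), "\n";
# -> Men In Black (Widescreen)
```

Adding or removing a rule means editing the list, not the engine, which is the "no code change" property the framework advertises.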

    SQL Loader

To load the data from the flat file into the staging database, you can use SQL Loader. You can also implement simple filtering rules in the SQL Loader control file itself, which helps prevent unnecessary loading of bulk data into the database system.

OPTIONS (SKIP=0)
LOAD DATA
INFILE 'filename' "str '|'"
APPEND
INTO TABLE dvd_tmp
TRAILING NULLCOLS
(
  EMP_ID  SEQUENCE(MAX,1),
  NAME    TERMINATED BY "|",
  RELDATE TERMINATED BY "|" "COMMON.CHECK_DATE_COMM(:RELDATE)"
)

In the above control file definition, COMMON.CHECK_DATE_COMM is a PL/SQL package function that can be used to validate the date column.
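The paper names COMMON.CHECK_DATE_COMM but does not show its body. One plausible shape, assuming the feed carries dates as YYYYMMDD strings and invalid values should load as NULL rather than reject the row, is:

```sql
-- Hypothetical sketch only: format mask and NULL-on-error behaviour
-- are assumptions, not taken from the original paper.
CREATE OR REPLACE PACKAGE common AS
  FUNCTION check_date_comm (p_value IN VARCHAR2) RETURN DATE;
END common;
/
CREATE OR REPLACE PACKAGE BODY common AS
  FUNCTION check_date_comm (p_value IN VARCHAR2) RETURN DATE IS
  BEGIN
    RETURN TO_DATE(p_value, 'YYYYMMDD');
  EXCEPTION
    WHEN OTHERS THEN
      RETURN NULL;  -- bad value: load NULL, keep the row
  END check_date_comm;
END common;
/
```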

    PL/SQL

PL/SQL can be used to do the filtering and formatting that we cannot do at the file level or using Perl. Some data filtering has to be applied based on values that exist in the database.

    Formatting and Filtering

One approach to designing this filtering is to create a set of rules (covering filtering, formatting and cleaning) in the form of Oracle SQL statements and store them in a table. Call these the filter rules and apply them wherever required.

    Advantages

- Easy to manage the rules
- Flexibility to add/remove/modify the filter criteria without making any code change
- Logging
- Ability to know which filter rule has been applied

Implementation

Create a table called filt_crit in which we can store all the filtering/formatting rules in the form of SQL statements. Here is the definition of the filter table.

CREATE TABLE filt_crit
(
  FILT_ID         NUMBER(4)          NOT NULL,
  FILT_NM         VARCHAR2(150 BYTE) NOT NULL,
  FILT_SQL_STMT   CLOB               NOT NULL,
  REC_CRT_TS      DATE               NOT NULL,
  REC_UPD_TS      DATE               NOT NULL,
  REC_CRT_USR_ID  VARCHAR2(20 BYTE)  NOT NULL,
  REC_UPD_USR_ID  VARCHAR2(20 BYTE)  NOT NULL,
  PRCS_NM         VARCHAR2(150 BYTE) NOT NULL,
  FILT_SEQ        NUMBER(3)          NOT NULL
)

Column Name     Details
FILT_ID         A unique identifier for the rule
FILT_NM         Name of the filter
FILT_SQL_STMT   Filtering/formatting rule in the form of SQL
FILT_SEQ        The sequence in which the rule has to be applied
PRCS_NM         The name of the process/application in which the rules are applied


    Example

FILT_ID:       1
FILT_NM:       Filter all items from the EMPLOYEE table where the employee id is 101
FILT_SQL_STMT:
               UPDATE employee
                  SET del_flg        = 'E',
                      filt_id        = 1,
                      rec_upd_ts     = SYSDATE,
                      rec_upd_usr_id = USER
                WHERE emp_id = '101'
FILT_SEQ:      1
PRCS_NM:       PRE_TRANSFER

    Calling Filter Rules in PL/SQL

1. Open a cursor on all the rows (rules) in the filter table, ordered by filt_seq
2. Loop through the cursor
3. Apply each filter rule

Pseudo code

DECLARE
  vr_filterType  filt_crit.prcs_nm%TYPE := 'PRE_TRANSFER';
  vr_filt_msg    VARCHAR2(200);
BEGIN
  FOR cursor_record IN (SELECT filt_id, filt_nm, filt_sql_stmt
                          FROM filt_crit
                         WHERE prcs_nm = vr_filterType
                         ORDER BY filt_seq)
  LOOP
    EXECUTE IMMEDIATE TO_CHAR(cursor_record.filt_sql_stmt);
    vr_filt_msg := 'Number of rows affected: ' || SQL%ROWCOUNT || CHR(10) || CHR(10);
    COMMIT;
  END LOOP;
EXCEPTION
  WHEN OTHERS THEN
    ROLLBACK;
    RAISE;
END;