
en.SafeWatch Profiling 2.0

ETL Technical Guide

April 2013


Copyrights

Copyright © 2013 EastNets® Holding Ltd. All rights reserved. All contents, including images and graphics, trade names, and trademarks in this document are copyrighted, registered, or under registration process. You must obtain permission to reproduce any information, graphics, or images from this document. You do not need to obtain permission to cite, reference, or briefly quote this material as long as proper citation of the source of the information is made.

Trademarks

EastNets® is a registered trademark of EastNets® Holding Ltd., located at Dubai Internet City, Building No.2 Office G02. Tel: +97143912888 Fax: +97143918652 P.O. Box 500135 Dubai-UAE.

All brand and product names are trademarks under registration or registered trademarks of their respective companies. Technical specifications and availability are subject to change without notice.

Disclaimer

Although EastNets® has made every effort to make this document accurate, up-to-date, and complete, EastNets® offers no warranties, express or implied, related to this document. In no event shall EastNets® be liable for any loss of profits, loss of business, loss of use or data, interruption of business, or for indirect, special, incidental, or consequential damages of any kind arising from any error in this document.

Send us comments

EastNets® welcomes your comments and suggestions on the quality and usefulness of this document. Your input is an important part of the revision process.

If you find any errors or have any other suggestions to improve the document quality and clarity, please indicate the chapter and page number (if available).

Please send comments to: [email protected]


Table of Contents

1 OVERVIEW
2 HOW ETL WORKS
2.1 REPOSITORY PANEL
2.2 ETL JOBS
2.3 REFERENTIAL INTEGRITY VALIDATION
3 CREATING JOBS

Table of Figures

Figure 1: ETL Tool
Figure 2: Talend Open Studio for Data Integration
Figure 3: ‘Talend Open Studio for Data Integration’ Layout
Figure 4: Repository Panel
Figure 5: ETL Jobs Components and Steps
Figure 6: Add tMssql_Input/tOracle_Input component
Figure 7: Component Properties
Figure 8: Loading Parameters, Input Component, and Processing Input Parts
Figure 9: Connecting tContextLoad with tMSSQl_Input Component
Figure 10: Connecting tMSSQl_Input with tMap Component
Figure 11: Mapping
Figure 12: ETL Job Properties
Figure 13: Running Job
Figure 14: Referential Integrity Validation Example
Figure 15: Mapping
Figure 16: Creating Connection
Figure 17: Database Settings
Figure 18: Adding New Job
Figure 19: Searching for Components
Figure 20: Dragging Components
Figure 21: Connecting Components
Figure 22: tRunJob Settings
Figure 23: tJava Settings
Figure 24: tFileInputDelimited Settings
Figure 25: tOracleInput/tMssqlInput Settings
Figure 26: Mapping
Figure 27: tSchemaComplianceCheck Settings
Figure 28: Edit Schema
Figure 29: tFileOutputDelimited Settings


1 Overview

ETL is a back office system tool that extracts data from the customer data files and then transforms the data into the right format to be loaded into the en.SafeWatch Profiling Application Framework, in either a real-time or batch setup.

Figure 1: ETL Tool

The need for the ETL Tool arose when EastNets learned, during implementations of en.SafeWatch Profiling, that the manual process caused data challenges for many customers. These challenges prolonged projects and made them more difficult and expensive for EastNets to run. The manual process is as follows:

1. EastNets communicates the Data Layout document to the customer.
2. The customer creates procedures (pieces of code) to extract the required data from the core business solution to data files.
3. The I&S engineer reviews and validates the extracted files by importing the data and going through all import, validation, staging, and LTA processes.
4. The extracted data files include errors in the structure, data types, referential constraints, and field constraints, so multiple iterations of validation and script amendments are needed to arrive at proper data files. In addition, the data import processes show neither the completion percentage nor a clear problem log.

EastNets therefore introduced the ETL tool to solve the above issues. It will enable EastNets to reduce the data preparation phase from 4 months to 1 month or less, reduce the time to market, enhance the ROI, and increase customer satisfaction. It can be used for the Profiling and Anti-Fraud solutions, allowing EastNets development efforts to be used more efficiently through:

1. Data Structure:
- Number of fields.
- End of lines (CR/LF).
- Encoding (UTF-8 and UTF-16).

2. Validate Data Quality regarding all constraints:
- Field type.
- Field length.
- Date format.
- Mandatory/Optional field.
- Allowed values (e.g. only Y or N allowed).

3. Referential Integrity check
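
The structure and data quality checks listed above can be illustrated with a short sketch. It is written in Python purely for illustration (the actual jobs are built in Talend, which generates Java), and the layout, field names, and date format below are hypothetical; the real rules come from the en.SafeWatch Profiling Data Layout.

```python
from datetime import datetime

# Hypothetical layout, for illustration only:
# (name, type, max_length, mandatory, allowed_values)
LAYOUT = [
    ("account_id", "string", 20, True, None),
    ("open_date", "date", 10, True, None),
    ("balance", "number", 18, False, None),
    ("is_active", "string", 1, True, {"Y", "N"}),
]
DATE_FORMAT = "%Y-%m-%d"  # assumed date format


def validate_record(line, delimiter="|"):
    """Return a list of problems found in one delimited record."""
    fields = line.rstrip("\r\n").split(delimiter)
    # Structure check: the number of fields must match the layout.
    if len(fields) != len(LAYOUT):
        return [f"expected {len(LAYOUT)} fields, found {len(fields)}"]

    errors = []
    for value, (name, ftype, max_len, mandatory, allowed) in zip(fields, LAYOUT):
        if value == "":
            if mandatory:
                errors.append(f"{name}: mandatory field is empty")
            continue  # nothing else to check on an empty value
        if len(value) > max_len:
            errors.append(f"{name}: longer than {max_len} characters")
        if ftype == "number":
            try:
                float(value)
            except ValueError:
                errors.append(f"{name}: not a number")
        elif ftype == "date":
            try:
                datetime.strptime(value, DATE_FORMAT)
            except ValueError:
                errors.append(f"{name}: bad date format")
        if allowed is not None and value not in allowed:
            errors.append(f"{name}: value not allowed")
    return errors
```

A record with an empty error list would go to the output file; any other record would go to the error file together with its reasons.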

2 How ETL Works

In this section, we go through the steps needed to run the ETL Jobs used to load data for en.SafeWatch Profiling 2.0. To build the ETL tool and its jobs, we use “Talend Open Studio for Data Integration v5.0.3”, an extensive Java-based (and therefore platform-independent) user interface. It has a graphical ETL job editor as well as a Data Mapper that allows operators or data processing staff to connect to the different sources and perform transformations where necessary.

Note:- This guide is subject to change and update. The following is provided with this guide:

- Talend Open Studio for Data Integration v5.0.3
- The workspace that contains the ETL Jobs.
- A Properties file that contains the required parameters for the ETL Jobs (DB connection parameters, email parameters, etc.).

The following are the steps needed to run ETL Jobs:

1. Install the “Talend Open Studio for Data Integration”, and then open it.

2. Select the attached workspace for ETL in the Workspace field.

3. Select EASTNETSETL from the Project box.

4. Click Open.

Figure 2: Talend Open Studio for Data Integration

5. From the next screen click Start. The following screen will be displayed:


Figure 3: ‘Talend Open Studio for Data Integration’ Layout

2.1 Repository Panel

The Repository panel is the left side panel of the ‘Talend Open Studio for Data Integration’ layout. Here you can add new jobs, open jobs, add new parameters, add a connection to the database, etc. This panel includes the following:

- Job Designs: We divided the Jobs into five folders based on their functionality:
  Account Jobs
  Balance Jobs
  Customer Jobs
  Reference Tables Jobs
  Transactions Jobs
- Job Context: Includes the parameters we defined for each job.
- Routines: Where we created a Java class called FilterCharacters, which removes characters that are not allowed according to the en.SafeWatch Profiling 2.0 layout.
- DB Connections: Where we created the DB connection to the core banking system.

Figure 4: Repository Panel
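
To illustrate the kind of clean-up the FilterCharacters routine performs, here is a minimal Python sketch. It is not the routine itself (that is a Java class in the workspace), and the set of disallowed characters is an assumption: the sketch strips the field delimiter and control characters that would break a delimited file.

```python
import re

# Assumption: the "|" delimiter, line breaks, and tabs are the characters
# to remove. The real list is defined by the en.SafeWatch Profiling layout.
_DISALLOWED = re.compile(r"[|\r\n\t]")


def filter_characters(value):
    """Remove disallowed characters from a single field value."""
    if value is None:
        return ""
    return _DISALLOWED.sub("", value).strip()
```

Applying such a filter to every field before writing keeps a stray delimiter in the source data from shifting the columns of the output file.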


2.2 ETL Jobs

Every ETL Job in our ETL Tool has the following steps, parts, and components, which are connected in a specific manner to perform the job:

Figure 5: ETL Jobs Components and Steps

1. Represents loading needed parameters for the job from the Properties file.

2. Represents reading data from the core banking system using the tMSSQl_Input component (for Oracle, the tOracle_Input component is used) and passing the result to the tMap component, which checks the incoming data type of each column and maps it from the core banking table to a field in the output file, in the correct order, based on the en.SafeWatch Profiling 2.0 Data Layout.

3. Represents validating the data length of each column, using tSchemaComplianceCheck, for compliance with the en.SafeWatch Profiling 2.0 Data Layout, and moving incorrect or bad records to a bad file.

4. Represents writing the correct records, using the tFileOutputDelimited component, to a “|”-delimited file.

5. Represents building a listener for the job, which sends an email if an error in connecting to the core banking database kills the ETL job.

To sum up, the ETL Jobs work as follows:

- Input is read from the core banking system database.
- ETL processes the input data and prepares it to be compliant with the en.SafeWatch Profiling 2.0 Data Layout.
- Processed data is written to a “|”-delimited file that serves as input to en.SafeWatch Profiling 2.0.
- Error or bad records are written to a separate “|”-delimited file.
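
The flow summarized above (check each record, then route it to the good file or the bad file) can be sketched as follows. This is an illustrative Python sketch, not the Talend job; the function name, sample rows, and length limits are hypothetical.

```python
import csv
import io


def run_job(rows, max_lengths, good_out, bad_out, delimiter="|"):
    """Write compliant rows to good_out and all other rows to bad_out.

    rows        -- iterable of lists of strings, already in layout order
    max_lengths -- maximum allowed length per column (hypothetical values)
    """
    good = csv.writer(good_out, delimiter=delimiter, lineterminator="\n")
    bad = csv.writer(bad_out, delimiter=delimiter, lineterminator="\n")
    counts = {"good": 0, "bad": 0}
    for row in rows:
        # A row is compliant when it has the expected number of columns
        # and every value fits within its column's maximum length.
        fits = len(row) == len(max_lengths) and all(
            len(value) <= limit for value, limit in zip(row, max_lengths)
        )
        if fits:
            good.writerow(row)
            counts["good"] += 1
        else:
            bad.writerow(row)
            counts["bad"] += 1
    return counts


# Usage with in-memory files and made-up sample rows.
good_file, bad_file = io.StringIO(), io.StringIO()
counts = run_job(
    [["A1", "Savings"], ["A2", "A very long account type name"]],
    [10, 10],
    good_file,
    bad_file,
)
```

Returning the counts makes it easy to report how many records were accepted and how many were rejected by a run.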


Example:

In this example, we used the profiling database as input because we do not have a core banking system database to represent the input for our ETL Jobs. The following are the steps we followed to run the std_account job, which is currently the only completed job available for testing and explanation. The same can be done for other jobs:

1. Add tMssql_Input or tOracle_Input component:

a. Drag the input table (which is here the tAccount from the Profiling db) as shown below:

Figure 6: Add tMssql_Input/tOracle_Input component

b. Double click on the tMSSQL_Input component (tAccount table) to open the component properties section, and then enter the required query in the query attribute as shown below:

Figure 7: Component Properties

2. Connect the three job parts shown in the figure below (the Loading Parameters part, the Input Component, and the Processing Input part) by connecting the tMssql_Input or tOracle_Input component with the tFileInputDelimited component(s) on one side and with the tMap component on the other side:

Figure 8: Loading Parameters, Input Component, and Processing Input Parts

a. Right click on the tContextLoad component. A menu will be displayed.
b. Select Trigger > On Component Ok.
c. Drag the resulting arrow to the tMSSQl_Input component to connect it with the tContextLoad component as shown below:

Figure 9: Connecting tContextLoad with tMSSQl_Input Component

d. Right click on the tMSSQl_Input component. A menu will be displayed.
e. Select Row > Main.
f. Drag the resulting arrow to the tMap component to connect it with the tMSSQl_Input component as shown below:

Figure 10: Connecting tMSSQl_Input with tMap Component

3. Map the input data read from the database to the output file that is compliant with the en.SafeWatch Profiling 2.0 Data Layout.

a. Double click on the tMap component. The following screen will be displayed:

Figure 11: Mapping

b. Start mapping each input column read from the database, to an output field in the Output file.

c. Make sure that each field in the output file has the same type used in the input file and that the output file name is the same as that in the en.SafeWatch Profiling 2.0 Data Layout.

d. Click OK.

4. Take a look at the properties file “properties.txt” which contains all the parameters for the ETL jobs:


Figure 12: ETL Job Properties

5. Run the job as shown below:

Figure 13: Running Job

The files generated by running the job can be found at the paths specified in the “properties.txt” file. Each job should generate two files: one for the correct records and the other for error or bad records.
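
As an illustration of how job parameters can be loaded from a delimited properties file, the following Python sketch parses key/value lines into a dictionary. It assumes one "key,value" pair per line with a comma separator (the separator this guide uses for configuration files); the parameter names are hypothetical and are not the actual contents of properties.txt.

```python
def load_context(lines, separator=","):
    """Parse key/value lines into a dict, skipping blank lines."""
    context = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the first separator only, so values may contain it.
        key, _, value = line.partition(separator)
        context[key.strip()] = value.strip()
    return context


# Hypothetical parameter names, for illustration only.
sample = [
    "db_host,localhost",
    "db_port,1433",
    "output_dir,/data/etl/out",
    "",
]
context = load_context(sample)
```

Every value is kept as a string here; a real job would convert ports, flags, and similar parameters to the types it needs.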


2.3 Referential Integrity Validation

In phase 2 of the ETL, we added Referential Integrity Validation for the input files of en.SafeWatch Profiling 2.0. The reference table files, which represent lookup tables, are used to validate referential integrity for the jobs that need this kind of validation.

The following is a list of these jobs:

- std_account

- std_account_custom_field

- std_account_declaration

- std_corrbanking

- std_country_list_detail

- std_currency_list_detail

- std_customer

- std_customer_account

- std_customer_address

- std_customer_custom_field

- std_customer_customer

- std_customer_decalartion

- std_customer_financial

- std_customer_legal

- std_customer_phyiscal

- std_eodbalance

- std_transaction

- std_transaction_information

- std_transaction_list_detail


Example:

In this example, we used the profiling database as input because we do not have a core banking system database to represent the input for our ETL Jobs. The following are the steps we followed to add the Referential Integrity Validation to the std_account job, which is currently the only completed job available for testing and explanation:

Figure 14: Referential Integrity Validation Example

In the above figure, the stars represent the following:

Yellow Stars: The sub jobs that should be run before the main job in order to generate the needed lookup files for the Referential Integrity Validation.

Green Stars: The lookup files generated by the sub jobs.

Blue Star: The error file which holds the error rows that failed to pass the Referential Integrity Validation.

1. Add the tMssql_Input or tOracle_Input component (as described in the previous example).

2. Connect the tMssql_Input or tOracle_Input component to the tMap Component (as described in the previous example).

3. Connect the tFileInputDelimited component(s) (green stars) to the tMap Component (as described in the previous example).

4. Double click on the tMap component. The following screen will be displayed, where the numbers in the red squares represent the following:

(1) The MSSQL input component or Oracle input component that reads data from the core banking system database table.

(2) The lookup files generated from the sub jobs. These are used to check Referential Integrity.

(3) The output that will be written to a file. This file will later be the input data file for en.SafeWatch Profiling 2.0.

(4) The output that will be written to a file. This file will represent the error records that failed to pass the Referential Integrity Validation.

(5) The tSettingsMap that is used to manage joining between files.


Figure 15: Mapping

a. Map each field in (1) to a field in (3).
b. Map each field in (1) to a field in (4), then click on tSettingsMap (5) and select True for “Catch lookup inner join reject”.
c. Map each key field in (1) to its counterpart field in (2), then click on tSettingsMap (5), click on Join Model, and select “Inner Join”.
d. Click OK.

5. Run the Job (as described in the previous example).

Note:- You can use the tLogRow component to log generated results in the console or a file.
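
The inner join with reject capture described in this section amounts to the logic below. This is a minimal Python sketch with made-up data, not the tMap implementation: rows whose key exists in the lookup go to the main output, and all other rows go to the error output.

```python
def split_by_lookup(rows, key_index, lookup_keys):
    """Inner-join rows against lookup_keys and capture the rejects.

    rows        -- list of records (lists of strings)
    key_index   -- position of the foreign-key field in each record
    lookup_keys -- keys read from the lookup file
    """
    lookup = set(lookup_keys)  # set membership keeps each check fast
    matched, rejected = [], []
    for row in rows:
        if row[key_index] in lookup:
            matched.append(row)
        else:
            rejected.append(row)
    return matched, rejected


# Hypothetical sample: accounts that reference a currency lookup.
accounts = [["A1", "USD"], ["A2", "XXX"], ["A3", "EUR"]]
currencies = ["USD", "EUR", "GBP"]
matched, rejected = split_by_lookup(accounts, 1, currencies)
```

With these sample values, the record referencing the unknown currency "XXX" is the only one routed to the rejected list.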


3 Creating Jobs

This section describes how to create a job from scratch, using the std_account job as an example, since it is currently the only completed job available for testing and explanation:

1. Define the connection to the database. This is done once, and the connection is then reused by other jobs:

a. From the Repository panel, right click on Db Connection, and a context menu will be displayed:

b. From the context menu, click on Create Connection, and the following window will be displayed:

Figure 16: Creating Connection

c. Enter the required data (the connection Name field is mandatory) then click on Next, and the following window will be displayed:

Figure 17: Database Settings


d. Enter the needed Database Connection information in the specified fields.
e. Click on the Check button to check the database settings.
f. Click on Finish.

2. Create the empty job:

a. From the Repository panel, right click on Job Designs, and a context menu will be displayed:

b. From the context menu, click on Create Job, and the following window will be displayed:

Figure 18: Adding New Job

c. Enter the required data (the Job Name field is mandatory) then click on Finish.


3. Search for the desired components needed to build the job by using the Find Component field of the Palette panel:

Figure 19: Searching for Components

The needed components to build jobs that meet the profiling jobs workflow are as follows:

- tRunJob (Optional): It is used for running sub-jobs before running the main one.
- tJava: It is used to load the EASTNETS_CONFIG_HOME environment variable path.
- tContextLoad: It is used to load the needed configuration parameters.
- tFileInputDelimited: It is used to read the properties file that contains the needed configurations, in addition to reading the results of the sub-jobs.
- tMssqlInput/tOracleInput: It is used to write the SQL statement that retrieves data from the Core Banking System.
- tMap: Required. It is used to map the Core Banking System data to the output files needed for profiling, and to check referential integrity.
- tFileOutputDelimited: It is used to write and generate the files resulting from the job, and to write the error and bad files.
- tSchemaComplianceCheck: It is used to check whether the core banking system schema is compliant with the field lengths of the en.SafeWatch Profiling Data Layout.

4. Start building the job:

a. Drag the needed components to the job panel to look like the following figure:

Figure 20: Dragging Components

b. Connect the components together as described in the following figure:

Figure 21: Connecting Components


5. Set the needed configurations for each component as follows (the same applies to any job):

- tRunJob:

a. Select tRunJob. The settings section for this component will be displayed at the bottom panel.
b. In the Job field, specify the needed sub-job from the workspace.
c. Check the Die on Child Error checkbox.

Figure 22: tRunJob Settings

- tJava:

a. Select the tJava component. The settings section for this component will be displayed at the bottom panel.
b. In the Code text box, enter the lines as they appear in the figure below:

Figure 23: tJava Settings

- tFileInputDelimited:

a. Select the tFileInputDelimited component. The settings section for this component will be displayed at the bottom panel.
b. Specify the file path in the File name/Stream field.
c. Specify the Row Separator.
d. Specify the Field Separator. In this example we use “,” because we are dealing with configuration properties files, but for other input files we use “|” instead.
e. Edit the selected Schema.
f. Repeat the above steps for every tFileInputDelimited component used in the job.

Figure 24: tFileInputDelimited Settings

- tOracleInput/tMssqlInput:

a. Select the tOracleInput/tMssqlInput component. The settings section for this component will be displayed at the bottom panel.
b. To use a predefined connection on the workspace level, check the Use an existing connection checkbox. In this case, you can edit the connection properties if desired.

c. In case of a new connection, enter the needed connection information in the specified fields.

d. Enter your query in the Query textbox.


Figure 25: tOracleInput/tMssqlInput Settings

- tMap:

a. Double click on the tMap component. The following screen will be displayed, where the numbers in the red squares represent the following:

(1) The MSSQL input component or Oracle input component that reads data from the core banking system database table.

(2) The lookup files generated from the sub jobs. These are used to check Referential Integrity.

(3) The output that will be written to a file. This file will later be the input data file for en.SafeWatch Profiling 2.0.

(4) The output that will be written to a file. This file will represent the error records that failed to pass the Referential Integrity Validation.

(5) The tSettingsMap that is used to manage joining between files.

Figure 26: Mapping

b. Map each field in (1) to a field in (3).


c. Map each field in (1) to a field in (4), then click on tSettingsMap (5) and select True for “Catch lookup inner join reject”.

d. Map each key field in (1) to its counterpart field in (2), then click on tSettingsMap (5), click on Join Model, and select “Inner Join”.

e. Click OK.

- tSchemaComplianceCheck:

a. Select the tSchemaComplianceCheck component. The settings section for this component will be displayed at the bottom panel.

Figure 27: tSchemaComplianceCheck Settings

b. Click on Sync Columns.
c. Check the Check all columns from schema checkbox.
d. Click on Edit Schema, and the following popup will be displayed:

Figure 28: Edit Schema

e. Adjust the maximum field length.


f. Click on OK.

- tFileOutputDelimited:

a. Select the tFileOutputDelimited component. The settings section for this component will be displayed at the bottom panel.

Figure 29: tFileOutputDelimited Settings

b. Enter the File Path in the File Name field.
c. Specify the Row Separator.
d. Specify the Field Separator.
e. Click on Sync Columns to sync columns.
f. Click on Edit Schema to edit the selected schema.

6. Now you can run and test the job.
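
As a rough illustration of what the tJava step achieves (resolving the configuration location from the EASTNETS_CONFIG_HOME environment variable), here is a Python sketch. It is not the code from Figure 23; it assumes, for illustration only, that properties.txt sits directly in the directory the variable points to.

```python
import os
from pathlib import Path


def properties_path(env=None, file_name="properties.txt"):
    """Build the properties file path from EASTNETS_CONFIG_HOME.

    env defaults to the process environment; passing a dict makes the
    function easy to test. The file location is an assumption.
    """
    env = os.environ if env is None else env
    home = env.get("EASTNETS_CONFIG_HOME")
    if not home:
        raise RuntimeError("EASTNETS_CONFIG_HOME is not set")
    return Path(home) / file_name
```

Failing early when the variable is missing gives a clearer error than a later "file not found" from the component that reads the properties file.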

Note:- For more information on Talend Open Studio, refer to the following:

TalendOpenStudio_Components_RG_51b_EN (Components Reference)

TalendOpenStudio_DI_UG_51b_EN (User Guide)