Data Transfer Using eTL


A Technical Paper on
‘EAI – eTL implementation using JCAPS’

Author(s): PRASENJIT GHOSH ([email protected]), Emerson / Manufacturing
Reviewed by: Sudarsana Raju Sangaraju ([email protected]), Emerson / Manufacturing


Wipro – Confidential Page 2 of 14

Abstract

Extract, Transform, and Load (ETL) is a process in data warehousing that involves the following activities:

- Extraction of data from outside sources
- Transforming it to fit business needs
- Loading the transformed data into the enterprise data warehouse

More broadly, ETL refers to any process that loads data into a database. ETL can also be used for integration with legacy systems.

A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.

An ETL process can be created using almost any programming language, but creating one from scratch is quite complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes.

Although ETL is a concept that has been adopted by almost all middleware tools currently on the market, this technical paper describes how large data loads can be transferred between source and destination database systems using the ETL adapter in SeeBeyond Java CAPS.

Intended Audience

This document is intended to assist:

- Developers with a working knowledge of JCAPS
- Tech Leads with an Integration background


Table of Contents

1. Introduction .................................................................. 4
2. Solution Overview ............................................................. 4
   2.1. About the ETL ............................................................ 4
   2.2. Where it can be used (Business Scenarios) ................................ 4
   2.3. How ETL Works ............................................................ 5
   2.4. ETL Tool Scope and Constraints ........................................... 5
   2.5. ETL Supported Datatypes .................................................. 6
3. Sample Integration Solution – Transferring bulk data from Source to Target .... 7
   3.1. Use Case Scenario 1: Connection between two different DBs ................ 7
   3.2. Use Case Scenario 2: Connection between DB and Batch Local or Batch FTP . 8
        3.2.1. Business Process – bpJdeToFTP ..................................... 8
        3.2.2. Connectivity Map – cmJdeToFTP ..................................... 9
        3.2.3. Technical specifications .......................................... 9
        3.2.4. Data Transfer Logic / Code ........................................ 9
4. Comparison of ETL Transfers with eGate 5.0.5 and Monk transfer methods ........ 11
5. Known Issues & Workarounds / Solutions ........................................ 12
   5.1. Data Truncation Problem ................................................. 12
   5.2. Automap Inactive for Flat File OTD ...................................... 12
   5.3. java.lang.OutOfMemoryError: Java heap space ............................. 12
   5.4. Runtime Output .......................................................... 13
6. References .................................................................... 13
7. Acronyms and Glossary ......................................................... 14


1. Introduction

This paper explains the ETL process and the business scenarios that require ETL. It also contains a case study of an ETL implementation using the Java CAPS ETL tool to extract and transmit bulk/large volumes of data from legacy database systems. The paper presents the advantages accrued by using Java CAPS ETL in comparison to conventional modes of data extraction, with results. Finally, the lessons learned, do's and don'ts, and workarounds implemented to overcome the existing ETL tool constraints are presented in detail.

2. Solution Overview

2.1. About the ETL

Extraction, Transform, and Load (ETL) is a data integration methodology that extracts data from data sources, transforms and cleanses the data, then loads the data in a uniform format into one or more target data sources.

ETL Integrator provides high-volume extraction and loading of tabular data sets for Java CAPS projects. You can use ETL Integrator to acquire a temporary subset of data for reports or other purposes, or to acquire a more permanent data set for the population of a data mart or data warehouse. You can also use ETL for data type conversions, or to migrate data from one database/platform to another.
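As a toy illustration of the three phases (plain Java, not JCAPS code), an ETL step can be sketched over in-memory rows; a real project would extract from and load into database tables:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class MiniEtl {
    // Toy sketch of the three ETL phases over in-memory rows.
    // Real sources and targets would be database tables or files.
    static <S, T> List<T> run(List<S> extractedRows, Function<S, T> transform) {
        return extractedRows.stream()            // extract: rows already read from the source
                .map(transform)                  // transform: cleanse / reshape each row
                .collect(Collectors.toList());   // load: collect into the target data set
    }
}
```

The transform function stands in for the mapping rules an ETL collaboration would define between source and target OTDs.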

2.2. Where it can be used (Business Scenarios)

The Java CAPS ETL tool can be used when bulk data must be transferred from a source database to a destination database within a comparatively short amount of time.

High-volume file transfer is a universal requirement for every organization, historically satisfied through utilities like FTP and home-grown solutions that provided basic capabilities for sending and receiving files.

Businesses today use file transfer in more sophisticated ways and to satisfy many more business requirements, including control, security, integration, and regulatory compliance. Moreover, organizations are deploying file transfer strategically as the fundamental underpinning for the automation of key business processes.

Typical scenarios include:

- Transfer of weekly or monthly financial data
- Transfer of weekly or monthly business information from a source system to a destination system
- Transfer of annual reports
- Banking, healthcare, and businesses in almost every industry


2.3. How ETL Works

ETL can connect two or more databases. This requires creating OTDs from the source and target database table structures using the Database OTD wizard, then mapping them using the ETL collaboration rule editor.

The Java CAPS ETL tool uses a connection-pooling mechanism to optimize database connections. In our current case study we created OTDs using prepared statements. The advantages of using connection pooling and prepared-statement OTDs are listed below.

Connection pooling

Connection pooling is a technique used to avoid the overhead of making a new database connection every time an application or server object requires access to a database. Without pooling, the application needs to establish a new database connection for each request. The database access itself is not the bottleneck, but setting up a new connection for each request often is. A database connection pool avoids this bottleneck.
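A minimal sketch of the idea in plain Java (the Java CAPS pool is configured rather than hand-written, and PooledConnection here is a hypothetical stand-in for a real database connection):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a real database connection.
class PooledConnection {
    final int id;
    PooledConnection(int id) { this.id = id; }
}

class SimpleConnectionPool {
    private final BlockingQueue<PooledConnection> idle;

    // Pay the connection-setup cost once, up front.
    SimpleConnectionPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(new PooledConnection(i));
        }
    }

    // Hand out an existing connection; returns null if the pool is exhausted.
    PooledConnection borrow() {
        return idle.poll();
    }

    // Return the connection for reuse instead of closing it.
    void release(PooledConnection c) {
        idle.offer(c);
    }

    int available() {
        return idle.size();
    }
}
```

Each request borrows an already-open connection and returns it afterwards, so the per-request setup cost disappears.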

Prepared Statement OTD object

A PreparedStatement is an object that represents a precompiled SQL statement. The SQL statement is compiled once and stored in the PreparedStatement object, which can then be used to execute the statement efficiently multiple times. It is often more convenient to use a PreparedStatement object for sending SQL statements to the database; this special type of statement is derived from the more general class, Statement. Since ETL implements connection pooling and prepared statements, it is much faster than conventional data-fetching techniques.
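A hedged JDBC sketch of the compile-once, execute-many pattern follows. The table name TARGETEMPLOYEE is taken from the paper's later example, but the column names are illustrative assumptions, and conn is assumed to be an already-open JDBC connection:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class EmployeeLoader {
    // Parameterized SQL: compiled once, executed many times.
    // EMP_ID and EMP_NAME are assumed column names for illustration.
    static String insertSql() {
        return "INSERT INTO TARGETEMPLOYEE (EMP_ID, EMP_NAME) VALUES (?, ?)";
    }

    // Sketch of batch loading through a single PreparedStatement.
    static void load(Connection conn, int[] ids, String[] names) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(insertSql())) {
            for (int i = 0; i < ids.length; i++) {
                ps.setInt(1, ids[i]);       // bind parameters per row
                ps.setString(2, names[i]);
                ps.addBatch();              // queue the row, no round-trip yet
            }
            ps.executeBatch();              // one batched execution
        }
    }
}
```

Reusing the precompiled statement across rows is what avoids re-parsing the SQL on every insert.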

2.4. ETL Tool Scope and Constraints

Scope

This section lists in detail what ETL can do:

- It supports bulk data transfer between multiple databases, for example Oracle, SQL Server, DB2/400, flat-file DB, etc.
- Data transformation is supported through a set of built-in operators/methods.

Constraints

This section lists the existing constraints of the current ETL tool:

- It only partially supports data cleansing.


- ETL supports only four data types, namely varchar (default), numeric, time, and timestamp (for more information, please refer to section 2.5). Irrespective of the source or target data type, we have to map through these data types.

2.5. ETL Supported Datatypes

ETL projects can handle many data types; some data types can be transformed, while others are merely passed through without transformation. The list below shows the supported data types for flat-file projects:

- varchar (default)
- numeric
- time
- timestamp

If a flat file is created using the time or timestamp data type, the data must follow one of these formats:

- yyyy-MM-dd HH:mm:ss.SSS
- yyyy-MM-dd HH:mm:ss
- yyyy-MM-dd
- MM-dd-yyyy HH:mm:ss
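A small plain-JDK sketch (not part of the ETL tool) shows how a reader might accept the four formats above, trying the most specific pattern first so that the bare date pattern cannot swallow a longer value's prefix:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

class EtlTimestampFormats {
    // The four flat-file formats listed above, tried most specific first.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd HH:mm:ss.SSS",
        "yyyy-MM-dd HH:mm:ss",
        "MM-dd-yyyy HH:mm:ss",
        "yyyy-MM-dd"
    };

    static Date parse(String value) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern);
            format.setLenient(false); // reject values that only loosely match
            try {
                return format.parse(value);
            } catch (ParseException ignored) {
                // try the next pattern
            }
        }
        throw new IllegalArgumentException("Unsupported timestamp: " + value);
    }
}
```

Non-lenient parsing makes a value that does not exactly match one pattern fall through to the next, instead of being silently misinterpreted.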


3. Sample Integration Solution – Transferring bulk data from Source to Target

3.1. Use Case Scenario 1: Connection between two different DBs

An interface transferring data from one database to another (in this example, from an Oracle DB to a JDE DB) is very easy and fast to build with ETL.

Let us take an example where ETL transfers data from an Oracle DB (table: TARGETEMPLOYEE) to a JDE DB (table: F00XACT), with proper transformation of the data according to the requirement.

Steps to be followed:

Step 1. Build the source and destination system OTDs.
Step 2. Build the ETL collaboration with the source and destination OTDs created in Step 1.
Step 3. Map the corresponding fields in the ETL editor. Data can be transformed while mapping to the destination system fields.
Step 4. Build the business process for the same.
Step 5. Build the connectivity map (CMAP) and the deployment profile.
Step 6. Build and deploy.


The scheduler sends the trigger message, which is received by the business process bpOraToJde; that in turn invokes the eTL service.

The ETL then connects to the outbound Oracle external location and fetches all the records from Oracle (or some of them, based on the condition specified in the ETL).

The ETL then connects to the inbound JDE external system and loads all the records into JDE.

3.2. Use Case Scenario 2: Connection between DB and Batch Local or Batch FTP

As we saw above, connecting two external databases through ETL is very easy, but if we want to transfer data from a DB to a Batch Local location or a Batch FTP server, the process becomes complex. Since ETL can connect only two databases, we have to use a flat-file DB as an intermediate location in order to transfer the data to the FTP location.

Below is an example of how this can be implemented:

3.2.1. Business Process – bpJdeToFTP


3.2.2. Connectivity Map – cmJdeToFTP

The scheduler sends the trigger message, which is received by the business process bpDelvToFlatDB; that in turn invokes the ETL eTLRecvFromJde.

The ETL eTLRecvFromJde then connects to the outbound DB2 location and fetches all the records from JDE (or some of them, based on the condition specified in the ETL).

The ETL then transforms the data accordingly and transfers it into the flat-file DB.

After the ETL completes its job, bpDelvToFlatDB sends a message to the topic tpcDelvEvent, which invokes the service svcDelvFTP; that in turn invokes the Java collaboration jcdDelvFTP.

The collaboration jcdDelvFTP then connects to the Batch Local file system and reads the .CSV file that was stored by the ETL eTLRecvFromJde.

The collaboration then writes the file into the destination batch FTP location.

3.2.3. Technical specifications

Collaboration Name       jcdDelvFTP
Web Service Operation    Seebeyond.eGate.JMS.receive

OTD Name                                     OTD instance
Seebeyond.eWays.BatcheWay.BatchLocalFile     instBatchLocalFile
SeeBeyond.eWays.BatcheWay.BatchFTP           instBatchFTP
Seebeyond.eGate.JMS                          instJMS

3.2.4. Data Transfer Logic / Code

Business Logic

try {
    Subscribe to the incoming trigger message from the JMS topic tpcDelvEvent.
    Connect to the outbound local file location.
    Connect to the inbound FTP location.
    Get the target file name from the Batch Local location.
    Set the target file name for the inbound Batch FTP location.
    Chop the file into different streams.
    Publish (append) the streams into the inbound Batch FTP location.
} catch (Exception) {
    Print the exception message.
}

Code

// Get the target file name from the Batch Local location.
String targetFileName = instBatchLocalFile.getConfiguration().getTargetFileName();

// Set the target file name for the inbound Batch FTP location.
instBatchFTP.getConfiguration().setTargetFileName( "ABC.CSV" );

// Set append to true so successive streams are appended at the target.
instBatchFTP.getConfiguration().setAppend( true );

// Chop the file into streams by handing the FTP client the local file's input-stream adapter.
instBatchFTP.getClient().setInputStreamAdapter( instBatchLocalFile.getClient().getInputStreamAdapter() );

// Put the streams into the inbound FTP location.
instBatchFTP.getClient().put();
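The "chop into streams" step can be sketched in plain Java as a chunked copy between streams, which is roughly what the input-stream adapter achieves; the 8 KB buffer size is an assumption:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class ChunkedCopy {
    // Copy the source stream in fixed-size chunks instead of loading the
    // whole file into memory, appending each chunk to the target stream.
    static long copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[8192]; // assumed chunk size
        long total = 0;
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read); // append the current chunk
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

Streaming in chunks is what keeps memory use flat even for very large .CSV files.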


4. Comparison of ETL Transfers with eGate 5.0.5 and Monk transfer methods

The table below contains the data extract/load times and their comparisons.

                                   Extracted/Loaded   Time taken using
S.No  Source System   # Columns    (Records)          ETL         eGate 5.0.5 (JCD)   eGate 4.5.2 (Monk)
1     DB2/AS400       6            543849             20.58 min   2 hrs               > 6 hrs
2     DB2/AS400       51           111583             22.29 min   4.25 hrs            > 6 hrs

Observations

1. In the ICAN eGate 5.0.5 JCD and eGate 4.5.2 Monk approaches, IQs were used for data persistence, which increases the number of hops for the data transfers.

2. In the ICAN eGate 5.0.5 JCD approach, the project used XML (CME) to marshal and unmarshal data between the receive and delivery services, which also adds a little overhead.

3. In ETL there is a configuration option to truncate data before loading. In eGate we used a DELETE prepared statement before loading. For deleting all records from a table, TRUNCATE is the better option from a performance standpoint.

4. Even though no intermediate XML (CME) is used in the eGate 4.5.2 Monk approach, the data extraction and load times were much higher than with the eGate 5.0.5 JCD and ETL approaches.


5. Known Issues & Workarounds / Solutions

5.1 Data Truncation Problem

When data is fetched from the source database in the format YYYY-MM-DD HH:mm:ss.SSSSSS, it gets truncated to YYYY-MM-DD HH:mm:ss.SSS if the last three fractional digits are zeros. This happens even after the data type has been changed to varchar on the target side.

For example, suppose the data in the source database for a particular field is “2007-07-07 23:12:25.012365”; in this case the exact data is transferred to the target database.

But if the data is “2007-07-07 23:12:25.123000”, it is automatically truncated, and the data sent to the target database is “2007-07-07 23:12:25.123”.

Work-around/Solution:

No resolution was found for this issue, but the business is not affected, as it can be ensured that no valid data is lost.
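Since the truncation only affects values whose six-digit fractional part ends in zeros, a small helper (illustrative, not part of the tool) can flag the rows that will be affected:

```java
class MicrosecondTruncationCheck {
    // Returns true when a timestamp string carries six fractional digits
    // ending in "000" -- exactly the values the ETL tool truncates to
    // millisecond precision.
    static boolean willBeTruncated(String timestamp) {
        int dot = timestamp.lastIndexOf('.');
        if (dot < 0) {
            return false; // no fractional part at all
        }
        String fraction = timestamp.substring(dot + 1);
        return fraction.length() == 6 && fraction.endsWith("000");
    }
}
```

Running such a check against the source data is one way to confirm that only trailing zeros, and no valid digits, are lost.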

5.2 Automap Inactive for Flat File OTD

While mapping source fields to target database fields, the automap facility does not work even if all the field names on the target and source sides match. This issue occurs while working with a flat-file OTD.

Work-around/Solution: Sun SeeBeyond is aware of the problem and a patch is expected soon.

5.3 java.lang.OutOfMemoryError: Java heap space

ETL is used to transfer huge data loads between source and destination. To do that, ETL transfers the records in batches (of 5000, 10000, etc.). In spite of that, the developer may get this OutOfMemory exception, and the domain will move into a hung state.

Work-around/Solution:

Our administrator raised a ticket with the Sun SeeBeyond team for this. They sent a patch to our team; after applying the patch to our environment, the problem was resolved.

If the patch is not available, the user can follow the steps below to avoid this exception:


Log in to http://logicalhost:portnumber with uid "Administrator", go to JVM settings > JVM options, and request the SeeBeyond administrator to change two parameters:

- Xmx value to 1024m
- MaxPermSize to 512m

If the exception is still experienced, the Xmx value needs to be increased further.

Note: Xmx cannot be increased beyond 2048m. After changing the settings, restart the domain and redeploy the project.
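As a sketch (the exact option syntax can vary with the application server version), the two settings above would typically appear in the JVM options list as:

```
-Xmx1024m
-XX:MaxPermSize=512m
```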

5.4 Runtime Output

There is a problem generating runtime output from ETL. The runtime output can produce four arguments:

I. Status
II. Count
III. Start time
IV. End time

Developers have no control over the start time and end time. These times are generated based on the source system time and cannot be changed.

Work-around/Solution:

Raise a ticket with the Sun SeeBeyond team describing the problem.

Note:

One point worth mentioning in this context (although it is not a problem of ETL itself) is the transfer of decimal data through ETL.

In ETL, the decimal data type must be mapped to the numeric data type (there is no decimal data type; see section 2.5 for more information), and developers must set the scale explicitly according to the requirements. For example, if the source DB holds the value 1258.2584, then the scale should be set to 4 while building the flat-file DB for the ETL.
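The scale handling can be illustrated with java.math.BigDecimal (a plain-Java sketch of the idea, not ETL tool code):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class DecimalScale {
    // Pin a decimal value to an explicit scale, as recommended above for
    // values such as 1258.2584 (scale 4). RoundingMode.UNNECESSARY makes an
    // accidental loss of precision fail loudly instead of rounding silently.
    static BigDecimal withScale(String value, int scale) {
        return new BigDecimal(value).setScale(scale, RoundingMode.UNNECESSARY);
    }
}
```

Choosing a scale smaller than the source data's precision would throw an ArithmeticException here, which is exactly the kind of silent truncation the explicit scale setting is meant to prevent.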

6. References

- Wikipedia, the free encyclopedia (Extract, transform, load)
- Sun SeeBeyond eTL(TM) Integrator User's Guide, available at http://docs.sun.com/app/docs/doc/819-6857


7. Acronyms and Glossary

JCAPS: Java Composite Application Platform Suite.

Parallel processing: The simultaneous use of more than one CPU to execute a program. Ideally, parallel processing makes a program run faster because there are more engines (CPUs) running it. In practice, it is often difficult to divide a program in such a way that separate CPUs can execute different portions without interfering with each other. Parallel processing differs from multitasking, in which a single CPU executes several programs at once.

Data cleansing: The act of detecting and correcting (or removing) corrupt or inaccurate records from a record set.