ETL Design Challenges and Solutions Through Informatica

Akshayananda Maiti


Copyright 2007 by Tata Consultancy Services. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means – electronic, mechanical, photocopying, recording, or otherwise – without the permission of Tata Consultancy Services.


Contents

1 Incremental load mechanism
  1.1 Solution
2 Incremental load when Source record has no timestamp
  2.1 Solution
3 Duplicate records from Source
  3.1 Solution
4 Two database technologies in a single mapping
  4.1 Solution
5 How to optimize source read
  5.1 Solution
6 Source fact record does not have a dimension
  6.1 Solution


1 Incremental Load Mechanism

If the mappings run daily, a simple way to create an incremental load would be to:

SELECT <source records>
FROM   <source table>
WHERE  timestamp > SYSDATE - 1

However, this solution has limitations. It only works when:

1) The source system table is not updated during a specific time window, and the mapping is scheduled inside this window.

2) The mappings run every day without fail.

To make the system robust enough that it does not have to adhere to these two constraints, you need a more sophisticated solution.

1.1 Solution

Usually any ETL system has a control table that stores daily run information such as "run start time", "run end time" and "run success flag".

Utilize that table to store the timestamp of the last record extracted from the source, e.g. "30th March 2008 9:00 am". During the next run, extract only the records whose timestamp is greater than that stored value from the previous run.
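As a minimal sketch of this extract (assuming a hypothetical source table SRC_ORDERS with a LAST_UPD_TMSTMP column and a generic control table ETL_RUN_CTRL; the actual control tables we used are described next):

-- Hedged sketch: pull only records changed after the last successful run.
-- SRC_ORDERS, LAST_UPD_TMSTMP and ETL_RUN_CTRL are illustrative names.
SELECT src.*
FROM   SRC_ORDERS src
WHERE  src.LAST_UPD_TMSTMP >
       (SELECT MAX(ctl.LD_END_TMSTMP)      -- end timestamp of the last successful run
        FROM   ETL_RUN_CTRL ctl
        WHERE  ctl.LD_STS_CD = 'S');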

In our case, we created two tables with the same structure, as shown below:

DSR_STG_LD_STS
DSR_LD_STS

Table Name: DSR_LD_STS

Field Name                Data Type     Description
DSR_LD_STS_ID             Number(10)    Number generated by an Oracle sequence object
LD_START_TMSTMP           Date          Timestamp of data load start
LD_END_TMSTMP             Date          Timestamp of data load end
LD_CMPRSN_START_TMSTMP    Date          Starting timestamp of the range for delta load
LD_CMPRSN_END_TMSTMP      Date          Ending timestamp of the range for delta load
LD_STS_CD                 Char(1)       'S' if load completed successfully; 'F' if load failed
DSR_TRNSCTN_TMSTMP        Date          SYSDATE
DSR_TRNSCTN_TYP_CD        Varchar2(1)   'I' if insert; 'U' if update
DSR_TRNSCTN_USER_ID       Varchar2(30)  'DSR_INFORMATICA_USER'
DSR_EFCTV_END_DT          Date          '12-31-9999'


Before data loading into the datamart starts, two columns in the loading status tables (LD_CMPRSN_START_TMSTMP and LD_CMPRSN_END_TMSTMP, described above) are set to identify the time window for source records. Any source record created or updated inside this time window is extracted by the datamart mappings. The following steps give details of the mechanism:

DSR_STG_LD_STS stores the latest workflow run record (Current).

DSR_LD_STS stores one record for each run of the workflow (History + Current).

Mapping m_USD_Load_Status_ins_at_start runs before the datamart load.

Mapping m_USD_Load_Status_upd_at_end runs after the datamart load.

Mapping m_USD_Load_Status_ins_at_start inserts a new record in DSR_LD_STS and inserts the same record in DSR_STG_LD_STS after truncating DSR_STG_LD_STS. After this mapping runs, the single record in DSR_STG_LD_STS looks as follows:

Column Name                Value
DSR_LD_STS_ID              "previous id" + 1
LD_START_TMSTMP            Sysdate
LD_END_TMSTMP              NULL (run end time not known at this point)
LD_CMPRSN_START_TMSTMP     1-1-1900 if this is the first-time load, else the previous run's end timestamp
LD_CMPRSN_END_TMSTMP       Sysdate
LD_STS_CD                  'F' (as the run is not yet complete)
DSR_TRNSCTN_TMSTMP         Sysdate
DSR_TRNSCTN_TYP_CD         'I'
DSR_TRNSCTN_USER_ID        DSR_INFORMATICA_USER
DSR_EFCTV_END_DT           '12-31-9999'

Mapping m_USD_Load_Status_upd_at_end applies the same update (to the columns shown in the table below) on the single record of DSR_STG_LD_STS and the latest record of DSR_LD_STS. After this mapping runs, the single record in DSR_STG_LD_STS looks as follows:


Column Name                Value
DSR_LD_STS_ID              "previous id" + 1
LD_START_TMSTMP            Sysdate
LD_END_TMSTMP              Sysdate
LD_CMPRSN_START_TMSTMP     1-1-1900 if this is the first-time load, else the previous run's end timestamp
LD_CMPRSN_END_TMSTMP       Sysdate
LD_STS_CD                  'S' (as the run is now complete)
DSR_TRNSCTN_TMSTMP         Sysdate
DSR_TRNSCTN_TYP_CD         'I'
DSR_TRNSCTN_USER_ID        DSR_INFORMATICA_USER
DSR_EFCTV_END_DT           '12-31-9999'
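Putting it together, a datamart source qualifier can then restrict its extract to the comparison window held in DSR_STG_LD_STS. A minimal sketch, assuming a hypothetical source table SRC_SHIPMENT with a LAST_UPD_TMSTMP column:

-- Hedged sketch: extract only records created/updated inside the comparison window.
-- SRC_SHIPMENT and LAST_UPD_TMSTMP are illustrative names; the control table is from the text.
SELECT s.*
FROM   SRC_SHIPMENT s,
       DSR_STG_LD_STS c                       -- holds the single current-run record
WHERE  s.LAST_UPD_TMSTMP >  c.LD_CMPRSN_START_TMSTMP
AND    s.LAST_UPD_TMSTMP <= c.LD_CMPRSN_END_TMSTMP;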

2 Incremental Load When Source Record has no Timestamp

To extract and load data incrementally, one needs timestamp information in the source records. However, you may often come across a source system that does not have timestamps. We had two such source systems, and we had to handle the extra complexity described below.

The source system table did not have any record_status_code information to indicate whether a record is active. All inactive records were physically deleted from the source table. In the datamart, however, we followed a better way of handling deletion, i.e. soft delete: we kept a record_status_code column in our tables, which is populated with the value "D" once the record is deleted by the business.

In summary, we had a source that has no timestamps and physically deletes records. From this we had to populate a datamart incrementally while also adhering to a soft-delete policy.

2.1 Solution

In most data warehousing scenarios there is a staging area between the source and the datamart. Utilize that staging area to build timestamp and record status information (the staging area table has the same structure as the source table plus two extra columns, TIME_STMP and RCRD_STS_CD). The only way to achieve this is to compare today's snapshot of the source table with yesterday's snapshot of the same table.


One simple mechanism for comparison would be to maintain a temporary table that stores yesterday's snapshot of the source table. However, this solution involves a lot of overhead, essentially replicating the whole source system database.

There are better solutions; the one we built is described below.

2.1.1 For each staging area table, create two stored procedures to be executed from the Informatica mapping (target pre-load and target post-load). A sketch of both procedures appears after this list. The functionality of each stored procedure is as follows:

2.1.2 The target pre-load procedure sets RCRD_STS_CD to 'T' for every record whose value is not 'D' (for example, procedure sp_upd_pre).

2.1.3 The target post-load procedure sets RCRD_STS_CD to 'D' for every record still marked 'T' and puts SYSDATE in the timestamp column (for example, procedure sp_upd_post).

2.1.4 Let's take a case where more than one mapping loads data into a single table.

2.1.5 The first mapping calls procedure sp_upd_pre through a Stored Procedure transformation, which performs the "target pre-load".

2.1.6 All the mappings either insert new records or update existing ones and set RCRD_STS_CD to 'A'. The timestamp value is set to SYSDATE for a new record, but remains unchanged when an existing record is updated.

2.1.7 The last mapping calls procedure sp_upd_post through a Stored Procedure transformation, which performs the "target post-load".

2.1.8 Once all mappings run successfully, the two columns of the table are updated per the logic below.

Type of source record                                      Rcrd_sts_cd    Rcrd_sts_cd   Timestamp      Timestamp
                                                           before load    after load    before load    after load
New record                                                 (none)         A             (none)         Sysdate
Old record                                                 A              A             Old date       Old date
Old record deleted from source after previous load run     A              D             Old date       Sysdate
Old record deleted from source before previous load run    D              D             Old date       Old date

2.1.9 In the next part of the workflow, the mappings that load data from the staging area to the datamart extract only the 1st and 3rd types of records described in the table above. That is, they extract only new records and records that have been deleted recently (after the previous load run).
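A minimal sketch of the two procedures referred to above, assuming an Oracle staging table named STG_ORDER (an illustrative name; TIME_STMP and RCRD_STS_CD are the extra columns described earlier):

-- Hedged sketch of the target pre-load procedure: flag every non-deleted record as 'T'.
CREATE OR REPLACE PROCEDURE sp_upd_pre AS
BEGIN
  UPDATE STG_ORDER
  SET    RCRD_STS_CD = 'T'
  WHERE  RCRD_STS_CD <> 'D';
END;
/

-- Hedged sketch of the target post-load procedure: a record still marked 'T' was not
-- re-sent by the source in this run, so it has been physically deleted there.
CREATE OR REPLACE PROCEDURE sp_upd_post AS
BEGIN
  UPDATE STG_ORDER
  SET    RCRD_STS_CD = 'D',
         TIME_STMP   = SYSDATE
  WHERE  RCRD_STS_CD = 'T';
END;
/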


3 Duplicate Records from Source

A number of times you will face data quality issues such as duplicate data coming from the source. Various solutions are available, depending on the client-agreed treatment of duplicate records. In our case the treatment agreed with the client was to "pick any one record from the set of duplicate records".

A simple solution in this scenario would be to use DISTINCT in the source qualifier query. However, this solution has two limitations:

1) Using DISTINCT in a query heavily degrades query performance.
2) If the source is not an RDBMS, you cannot use a source qualifier query.

In our case we designed a better solution.

3.1 Solution

We can use a Sorter transformation (with the Distinct option checked) to remove duplicate records. Note in the figure below that the "Distinct" option must be checked to get distinct records.

[Figure: Sorter transformation properties, with the Distinct option checked]

There could be a more complex scenario where "duplicate" means more than one record exists for a given set of columns (the business key, which is supposed to be unique), while the remaining columns need not have the same values.

In this case, use an Aggregator transformation with group by on the business key columns.


Check the Group By check box to make Informatica select the last record from the set of records having the same delivery number (business key). The rest of the records are not processed further.
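The effect is roughly equivalent to the SQL below (a hedged sketch; DELIVERIES and DELIVERY_NUMBER are illustrative names, and the ordering inside each group is arbitrary because any one record per key is acceptable):

-- Hedged SQL equivalent of "keep one record per business key".
SELECT *
FROM  (SELECT d.*,
              ROW_NUMBER() OVER (PARTITION BY d.DELIVERY_NUMBER
                                 ORDER BY d.DELIVERY_NUMBER) AS rn
       FROM   DELIVERIES d)
WHERE  rn = 1;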

4 Two Database Technologies in a Single Mapping

In our case one database is Oracle and the other is SQL Server. The challenge was to extract incremental records from SQL Server while the control information for the incremental mechanism is stored in an Oracle table.

4.1 Solution

We can build this mechanism using Informatica mapping variables and two threads (pipelines) inside one mapping.

The first thread extracts the comparison start timestamp of the last load from the Oracle table and sets a variable (say $$STARTTIME) to that timestamp.
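A minimal sketch of the first thread's Oracle-side query (the table and column names are those of the control table described in section 1):

-- Hedged sketch: read the comparison start timestamp from the Oracle control table.
SELECT LD_CMPRSN_START_TMSTMP
FROM   DSR_STG_LD_STS;          -- single current-run record

Downstream in that thread, an Expression transformation can assign the value to the mapping variable, for example with SETVARIABLE($$STARTTIME, ...); how the value is formatted depends on what the SQL Server source qualifier expects.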

The second thread extracts all records from the worldlink SQL Server tables whose timestamp is greater than $$STARTTIME. A sample source qualifier query using the variable follows.


SELECT
    Packages.PackageId,
    PackageDetails.ItemNumber,
    PackageDetails.Quantity,
    PackageDetails.UOM,
    PackageDetails.POLineNumber,
    PackageDetails.DeleteFlag,
    Sites.CrossReference
FROM
    Packages, PackageDetails, Sites, Shipments
WHERE
    PackageDetails.PackageId = Packages.Id
AND Shipments.SiteId = Sites.Id
AND Packages.ShipmentId = Shipments.Id
AND PackageDetails.UpdateDateTime >= Convert(datetime, '$$STARTTIME')

Note: Ensure that you put quotes (') around the variable ($$STARTTIME) when you use it in the SQL query, even if it is a datetime-type variable.

5 How to Optimize Source Read

When two mappings take different information from the same very large source table, we can optimize the total load time by merging the two mappings so that the source table is read only once and both target tables are loaded.

Here we discuss an example we handled that had added complexity: the two target tables are a fact table and its related dimension table, where the fact table has a foreign key to the dimension table.

5.1 Solution

Simultaneously loading the dimension and fact tables is not technically feasible if the foreign key constraint on the fact table is enabled. To overcome this challenge we used two threads inside the mapping.

1st Thread
After the source qualifier, the record flow splits into two branches. The 1st branch loads the target dimension table; the 2nd branch handles fact records.


2nd branch of 1st Thread (for fact records)
The 2nd branch splits again into a 3rd and a 4th branch. Here we introduce a temporary fact table (say fact_tmp) that has no referential integrity constraints. In the 3rd branch, all new fact records populate fact_tmp instead of being inserted into the fact table. In the 4th branch, all old fact records are updated directly in the fact table.
Note: All the branches described above are inside the 1st thread and share a single source qualifier.

2nd Thread
Once the 1st thread is complete (more specifically, once loading into the dimension table is complete), all the records from the fact_tmp table are inserted into the fact table. The fact_tmp table is then truncated.
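A minimal sketch of the 2nd thread's work, assuming hypothetical table names FACT_SHIPMENT and FACT_SHIPMENT_TMP (the fact_tmp table from the text):

-- Hedged sketch: the dimension load is complete, so the deferred new fact rows
-- can now satisfy the foreign key to the dimension table.
INSERT INTO FACT_SHIPMENT
SELECT * FROM FACT_SHIPMENT_TMP;

-- Empty the temporary table for the next run.
TRUNCATE TABLE FACT_SHIPMENT_TMP;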

The performance gain comes from using a single source qualifier (the very large source table is read only once instead of twice) to load both target tables (dimension and fact). The attached document, "Diagram for loading dim & fact in same mapping.pdf", shows one such example pictorially.

6 Source Fact Record Does Not Have a Dimension

Normally a source system provides all the information needed to build every dimension key for a fact record. However, data quality issues sometimes arise where a few fact records have missing dimension information (NULL values in the related source columns). Due to the referential integrity of the fact table, a direct insert of such records into the fact table is not feasible.

6.1 Solution

To handle this we create "dummy" dimension keys, as described below.

The relevant dimension (the one for which a few source fact records have NULL values) is populated with one extra entry called "dummy" with the id -9999 (that is what we did in our case).

If a fact record coming from the source does not have that specific dimension information, the record is still inserted into the fact table, with dimension_key = -9999.
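A minimal sketch of the idea in SQL, assuming hypothetical tables DIM_SITE, FCT_SHIPMENT and STG_SHIPMENT; in the Informatica mapping the same default is typically applied in an expression after the dimension lookup (replacing a NULL lookup result with -9999):

-- Hedged sketch: seed the dimension with a single "dummy" member.
INSERT INTO DIM_SITE (SITE_KEY, SITE_NAME)
VALUES (-9999, 'DUMMY');

-- Fact rows with no matching dimension fall back to the dummy key,
-- so they still satisfy the foreign key constraint on the fact table.
INSERT INTO FCT_SHIPMENT (SHIPMENT_ID, SITE_KEY, QUANTITY)
SELECT s.SHIPMENT_ID,
       NVL(d.SITE_KEY, -9999),
       s.QUANTITY
FROM   STG_SHIPMENT s
       LEFT OUTER JOIN DIM_SITE d
         ON d.SITE_NAME = s.SITE_NAME;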
