Best Informatica Interview Questions


Best Informatica Interview Questions & Answers

Deleting duplicate rows using Informatica

Q1. Suppose we have duplicate records in the source system and we want to load only the unique records into the target system, eliminating the duplicate rows. What will be the approach?

Ans.

Let us assume that the source system is a relational database and the source table has duplicate rows. To eliminate the duplicate records, we can check the Select Distinct option in the Source Qualifier of the source table and load the target accordingly.
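For illustration, checking the Select Distinct option makes the Integration Service generate a query equivalent to the following (the table and column names here are hypothetical, just to show the idea):

SELECT DISTINCT CUSTOMER_ID, CUSTOMER_NAME, CITY
FROM CUSTOMER_SRC;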

Informatica Join Vs Database Join

Which is faster? Informatica or Oracle?

In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle performs sorting 14% faster than Informatica. This time we will look into the JOIN operation, not only because JOIN is the single most important data set operation but also because the performance of JOIN gives a developer crucial data for implementing proper pushdown optimization manually.

Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. On the other hand, the Oracle database is arguably the most successful and powerful RDBMS, trusted since the 1980s in all sorts of business domains and across all major platforms. Both of these systems are best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between these systems. This article will help them make an informed decision.

Which JOINs data faster? Oracle or Informatica?

As an application developer, you have the choice of either using join syntax at the database level to join your data or using a Joiner transformation in Informatica to achieve the same outcome. The question is: which system performs this faster?

Test Preparation

We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million records in the detail table and 0.1 million in the master table. Subsequently we will test with 2 million, 4 million and 6 million detail table data volumes and 0.2 million, 0.4 million and 0.6 million master table data volumes. Here are the details of the setup we will use:

1. Oracle 10g database as relational source and target
2. Informatica PowerCentre 8.5 as ETL tool
3. Database and Informatica setup on different physical servers using HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in Oracle shared pool before the same is read
6. There is no session level partition in Informatica PowerCentre
7. There is no parallel hint provided in extraction SQL query
8. Informatica JOINER has enough cache size

We have used two Informatica PowerCentre mappings created in the PowerCentre Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the source qualifier to join data at the database level. The second mapping, m_Infa_side_join, uses an Informatica Joiner transformation to join data at the Informatica level. We have executed these mappings with different data points and logged the results.

Further to the above test, we will execute the m_db_side_join mapping once again, this time with proper database side indexes and statistics, and log the results.
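The article does not publish the actual test tables, so purely as an illustration (the table and column names below are hypothetical), the source qualifier override in m_db_side_join would be of this form:

SELECT D.DETAIL_ID, D.MASTER_ID, D.DETAIL_AMOUNT, M.MASTER_DESC
FROM DETAIL_TBL D
INNER JOIN MASTER_TBL M
ON D.MASTER_ID = M.MASTER_ID;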

Result

The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join the data. The average time is plotted along the vertical axis and the data points are plotted along the horizontal axis.

Data Points Master Table Record Count Detail Table Record Count

1 0.1 M 1 M

2 0.2 M 2 M

3 0.4 M 4 M

4 0.6 M 6 M


Verdict

In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without an index, and 42% faster with a database index.

Assumptions

1. Average server load remains the same during all the experiments

2. Average network speed remains the same during all the experiments

Note

1. This data can only be used for performance comparison but cannot be used for performance

benchmarking.

2. This data is only indicative and may vary in different testing conditions.

In this DWBI Concepts original article, we put the Oracle database and Informatica PowerCentre head to head to see which one of them handles the data sorting operation faster. The result gives application developers a crucial input for making informed performance-tuning decisions.

Comparing Performance of SORT operation (Order By) in Informatica and Oracle


Which is faster? Informatica or Oracle?

Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. On the other hand, the Oracle database is arguably the most successful and powerful RDBMS, trusted since the 1980s in all sorts of business domains and across all major platforms. Both of these systems are best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between these systems.

Think about a typical ETL operation often used in enterprise-level data integration. A lot of data processing can be redirected either to the database or to the ETL tool. In general, both the database and the ETL tool are reasonably capable of doing such operations with almost the same efficiency and capability. But in order to achieve optimal performance, a developer must carefully consider and decide which system to trust with each individual processing task.

In this article, we will take a basic database operation - sorting - and put these two systems to the test in order to determine which does it faster than the other, if at all.

Which sorts data faster? Oracle or Informatica?

As an application developer, you have the choice of either using ORDER BY at the database level to sort your data or using a Sorter transformation in Informatica to achieve the same outcome. The question is: which system performs this faster?

Test Preparation

We will perform the same test with different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:

1. Oracle 10g database as relational source and target 

2. Informatica PowerCentre 8.5 as ETL tool

3. Database and Informatica setup on different physical servers using HP UNIX

4. Source database table has no constraint, no index, no database statistics and no partition

5. Source database table is not available in Oracle shared pool before the same is read

6. There is no session level partition in Informatica PowerCentre

7. There is no parallel hint provided in extraction SQL query

8. The source table has 10 columns and first 8 columns will be used for sorting

9. Informatica sorter has enough cache size 

We have used two Informatica PowerCentre mappings created in the PowerCentre Designer. The first mapping, m_db_side_sort, uses an ORDER BY clause in the source qualifier to sort data at the database level. The second mapping, m_Infa_side_sort, uses an Informatica Sorter transformation to sort data at the Informatica level. We have executed these mappings with different data points and logged the results.
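Again as an illustration only (the actual test table is not shown in the article, so the names below are hypothetical), the source qualifier override in m_db_side_sort would be of this form, sorting on the first 8 of the 10 columns:

SELECT COL1, COL2, COL3, COL4, COL5, COL6, COL7, COL8, COL9, COL10
FROM SORT_TEST_SRC
ORDER BY COL1, COL2, COL3, COL4, COL5, COL6, COL7, COL8;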

Result

The following graph shows the performance of Informatica and the database in terms of the time taken by each system to sort the data. The time is plotted along the vertical axis and the data volume is plotted along the horizontal axis.


Verdict

The above experiment demonstrates that the Oracle database is faster at the SORT operation than Informatica by an average of 14%.

Assumptions

1. Average server load remains the same during all the experiments

2. Average network speed remains the same during all the experiments

Note

This data can only be used for performance comparison but cannot be used for performance benchmarking.




Informatica Reject File - How to Identify rejection reason

When we run a session, the Integration Service may create a reject file for each target instance in the mapping to store the target reject records. With the help of the session log and the reject file we can identify the cause of data rejection in the session. Eliminating the cause of rejection will lead to rejection-free loads in subsequent session runs. If the Informatica Writer or the target database rejects data for any valid reason, the Integration Service logs the rejected records into the reject file. Every time we run the session, the Integration Service appends the rejected records to the reject file.

Working with Informatica Bad Files or Reject Files

By default the Integration Service creates the reject files, or bad files, in the $PMBadFileDir process variable directory. It writes the entire rejected row in the bad file, although the problem may be in any one of the columns. The reject files have a default naming convention of [target_instance_name].bad. If we open the reject file in an editor we will see comma-separated values consisting of some tags/indicators and some data values. We will see two types of indicators in the reject file: one is the Row Indicator and the other is the Column Indicator.

For reading the bad file, the best method is to copy its contents and save them as a CSV (Comma Separated Value) file. Opening the CSV file gives a spreadsheet-like look and feel. The first column in the reject file is the Row Indicator, which determines whether the row was destined for insert, update, delete or reject. It is basically a flag that determines the Update Strategy for the data row. When the Commit Type of the session is configured as User-defined, the row indicator indicates whether the transaction was rolled back due to a non-fatal error, or whether the committed transaction was in a failed target connection group.

List of Values of Row Indicators:

Row Indicator Indicator Significance Rejected By

0 Insert Writer or target

1 Update Writer or target

2 Delete Writer or target

3 Reject Writer


4 Rolled-back insert Writer

5 Rolled-back update Writer

6 Rolled-back delete Writer

7 Committed insert Writer

8 Committed update Writer

9 Committed delete Writer

Next come the column data values, each followed by its Column Indicator, which determines the data quality of the corresponding column.

List of Values of Column Indicators:

Column Indicator - Type of Data - Writer Treats As

D - Valid data or good data. Writer passes it to the target database. The target accepts it unless a database error occurs, such as finding a duplicate key while inserting.

O - Overflowed numeric data. Numeric data exceeded the specified precision or scale for the column. Bad data, if you configured the mapping target to reject overflow or truncated data.

N - Null value. The column contains a null value. Good data. Writer passes it to the target, which rejects it if the target database does not accept null values.

T - Truncated string data. String data exceeded a specified precision for the column, so the Integration Service truncated it. Bad data, if you configured the mapping target to reject overflow or truncated data.


Also note that the second column contains the column indicator flag value 'D', which signifies that the Row Indicator is valid.

Now let us see what the data in a bad file looks like:
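The original article showed a screenshot at this point. Purely as an illustration (the values and columns are made up), a couple of reject-file rows following the layout described above - the row indicator first, then each data value followed by its column indicator - might look like this:

0,D,1234,D,SMITH,D,,N,2500,D
1,D,5678,D,JONES,D,99999999999,O,3100,D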

Implementing Informatica Incremental Aggregation

Using incremental aggregation, we apply captured changes in the source data (the CDC part) to aggregate calculations in a session. If the source changes incrementally and we can capture those changes, then we can configure the session to process them. This allows the Integration Service to update the target incrementally, rather than forcing it to delete previously loaded data, process the entire source data and recalculate the same data each time the session runs.

Using Informatica Normalizer Transformation

Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements.

Learn how to use the Normalizer effectively here.

Using Normalizer Transformation

A Normalizer is an active transformation that returns multiple rows from a source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. The Normalizer can be used to transpose data in columns to rows.

 

Normalizer effectively does the opposite of what Aggregator does!

Example of Data Transpose using Normalizer

Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, as below.

The following source rows contain four quarters of sales by store:

Source Table

Store Quarter1 Quarter2 Quarter3 Quarter4

Store1 100 300 500 700


Store2 250 450 650 850

The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:

Target Table

Store Sales Quarter

Store 1 100 1

Store 1 300 2

Store 1 500 3

Store 1 700 4

Store 2 250 1

Store 2 450 2

Store 2 650 3

Store 2 850 4
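For comparison only, the same transpose could be written in SQL as a UNION ALL (a sketch, assuming a source table named SALES_BY_QTR with the columns shown above):

SELECT STORE, QUARTER1 AS SALES, 1 AS QUARTER FROM SALES_BY_QTR
UNION ALL
SELECT STORE, QUARTER2, 2 FROM SALES_BY_QTR
UNION ALL
SELECT STORE, QUARTER3, 3 FROM SALES_BY_QTR
UNION ALL
SELECT STORE, QUARTER4, 4 FROM SALES_BY_QTR;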

 

How Informatica Normalizer Works

Suppose we have the following data in source:

Name Month Transportation House Rent Food

Sam Jan 200 1500 500

John Jan 300 1200 300

Tom Jan 300 1350 350

Sam Feb 300 1550 450


John Feb 350 1200 290

Tom Feb 350 1400 350

and we need to transform the source data and populate the target table as below:

Name Month Expense Type Expense

Sam Jan Transport 200

Sam Jan House rent 1500

Sam Jan Food 500

John Jan Transport 300

John Jan House rent 1200

John Jan Food 300

Tom Jan Transport 300

Tom Jan House rent 1350

Tom Jan Food 350

.. like this.

Below is the screenshot of a complete mapping which shows how to achieve this result using Informatica PowerCenter Designer. Image: Normalization Mapping Example 1

 

I will explain the mapping further below.

Setting Up Normalizer Transformation Property

First we need to set the number of occurrences property of the Expense head to 3 in the Normalizer tab of the Normalizer transformation, since we have Food, House Rent and Transportation.

This in turn creates the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.


In the Ports tab of the Normalizer, the ports will be created automatically as configured in the Normalizer tab. Interestingly, we will observe two new columns, namely:

GK_EXPENSEHEAD

GCID_EXPENSEHEAD

The GK field generates a sequence number starting from the value defined in the Sequence field, while the GCID holds the value of the occurrence field, i.e. the column number of the input Expense head.


Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.

 

Now the GCID tells us which expense corresponds to which field while converting columns to rows. Below is the screenshot of the expression that handles this GCID:

Image: Expression to handle GCID
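The screenshot itself is not reproduced here; a minimal sketch of such an expression (the port name is illustrative) could simply decode the GCID into the expense type:

EXPENSE_TYPE = DECODE(GCID_EXPENSEHEAD, 1, 'Food', 2, 'House rent', 3, 'Transport')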

This is how we will accomplish our task!

Informatica Dynamic Lookup Cache

A lookup cache does not change once built. But what if the underlying lookup table changes its data after the lookup cache is created? Is there a way to keep the cache up to date even if the underlying table changes?

Dynamic Lookup Cache

Let's think about this scenario. You are loading your target table through a mapping. Inside the mapping you have a Lookup and in the Lookup, you are actually looking up the same target table you are loading. You may ask me, "So? What's the big deal? We all do it quite often...". And yes, you are right. There is no "big deal" because Informatica (generally) caches the lookup table at the very beginning of the mapping, so whatever records get inserted into the target table through the mapping will have no effect on the lookup cache. The lookup will still hold the previously cached data, even if the underlying target table is changing.

But what if you want your Lookup cache to get updated as and when the target table is changing? What if

you want your lookup cache to always show the exact snapshot of the data in your target table at that point

in time? Clearly this requirement will not be fulfilled if you use a static cache. You will need a dynamic

cache to handle this.

But why would anyone need a dynamic cache?

To understand this, let's first understand a static cache scenario.

Informatica Dynamic Lookup Cache - What is Static Cache

STATIC CACHE SCENARIO

Let's suppose you run a retail business and maintain all your customer information in a customer master table (an RDBMS table). Every night, all the customers from your customer master table are loaded into a Customer Dimension table in your data warehouse. Your source customer table is a transaction system table, probably in 3rd normal form, and does not store history. Meaning, if a customer changes his address, the old address is updated with the new address. But your data warehouse table stores the history (maybe in the form of SCD Type-II). There is a mapping that loads your data warehouse table from the source table.

Typically you do a Lookup on the target (static cache) and check every incoming customer record to determine whether the customer already exists in the target. If the customer does not already exist in the target, you conclude the customer is new and INSERT the record, whereas if the customer already exists, you may want to update the target record with this new record (if the record has changed). This is illustrated below; you don't need a dynamic lookup cache for this.

 

Image: A static Lookup Cache to determine if a source record is new or updatable


Informatica Dynamic Lookup Cache - What is Dynamic Cache

DYNAMIC LOOKUP CACHE SCENARIO

Notice that in the previous example I mentioned that your source table is an RDBMS table. This ensures that your source table does not have any duplicate records.

But what if you had a flat file as source with many duplicate records? Would the scenario be the same? No; see the illustration below.

 

Image: A Scenario illustrating the use of dynamic lookup cache

Here are some more examples of when you may consider using a dynamic lookup:

Updating a master customer table with both new and updated customer information coming

together as shown above

Loading data into a slowly changing dimension table and a fact table at the same time. Remember,

you typically lookup the dimension while loading to fact. So you load dimension table before loading fact

table. But using dynamic lookup, you can load both simultaneously.

Loading data from a file with many duplicate records and eliminating the duplicates in the target by updating the duplicate row, i.e. keeping the most recent row or the initial row

Loading the same data from multiple sources using a single mapping. Just consider the previous retail business example: if you have more than one shop and Linda has visited two of your shops for the first time, the customer record for Linda will come twice during the same load.

 

Informatica Dynamic Lookup Cache - How does dynamic cache work

So, How does dynamic lookup work?


When the Integration Service reads a row from the source, it updates the lookup cache by performing one of

the following actions:

 

Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service

inserts the row in the cache based on input ports or generated Sequence-ID. The Integration Service flags

the row as insert.

Updates the row in the cache: If the row exists in the cache, the Integration Service updates the

row in the cache based on the input ports. The Integration Service flags the row as update.

Makes no change to the cache: This happens when the row exists in the cache and the lookup is configured to insert new rows only; or the row is not in the cache and the lookup is configured to update existing rows only; or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.

 

Notice that the Integration Service actually flags the rows based on the above three conditions.

And that's a great thing, because if you know the flag you can reroute the row to achieve different logic. This flag port is called NewLookupRow.

Using the value of this port, the rows can be routed for insert, for update, or to do nothing. You just need to use a Router or Filter transformation followed by an Update Strategy.

Oh, I forgot to tell you the actual values that you can expect in the NewLookupRow port:

0 = Integration Service does not update or insert the row in the cache.

1 = Integration Service inserts the row into the cache.

2 = Integration Service updates the row in the cache.

When the Integration Service reads a row, it changes the lookup cache depending on the results of the

lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the

NewLookupRow port to indicate if it inserts or updates the row in the cache, or makes no change.
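A minimal sketch of how this routing could be wired up (the group names are illustrative, and the use of the standard DD_INSERT/DD_UPDATE constants in the Update Strategy transformations is an assumption, not taken from the article's screenshots):

Router group INSERT_NEW, filter condition:      NewLookupRow = 1
Router group UPDATE_EXISTING, filter condition: NewLookupRow = 2
Update Strategy after INSERT_NEW:               DD_INSERT
Update Strategy after UPDATE_EXISTING:          DD_UPDATE

Rows with NewLookupRow = 0 can be left to the default group and simply not connected to any target.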

Informatica Dynamic Lookup Cache - Dynamic Lookup Mapping Example

Example of Dynamic Lookup Implementation

OK, I have designed a mapping for you to show the dynamic lookup implementation, and I have given a full screenshot of the mapping. Since the screenshot is slightly big, I link it below.


And here I provide the screenshots of the lookup. The Lookup Ports tab first. Image: Dynamic Lookup Ports Tab

And here is the Dynamic Lookup Properties tab.

 


If you check the mapping screenshot, you will see that I have used a Router to reroute the INSERT group and the UPDATE group. The router screenshot is also given below. New records are routed to the INSERT group and existing records are routed to the UPDATE group.

 

Router Transformation Groups Tab

Informatica Dynamic Lookup Cache - Dynamic Lookup Sequence ID

While using a dynamic lookup cache, we must associate each lookup/output port with an input/output port or

a sequence ID. The Integration Service uses the data in the associated port to insert or update rows in the

lookup cache. The Designer associates the input/output ports with the lookup/output ports used in the

lookup condition.

When we select Sequence-ID in the Associated Port column, the Integration Service generates a sequence

ID for each row it inserts into the lookup cache. 

When the Integration Service creates the dynamic lookup cache, it tracks the range of values in the cache associated with any port using a sequence ID, and when it inserts a new row of data into the cache it generates a key for the port by incrementing the greatest existing sequence ID value by one.


When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at

one and increments each sequence ID by one until it reaches the smallest existing value minus one. If the

Integration Service runs out of unique sequence ID numbers, the session fails.

Informatica Dynamic Lookup Cache - Dynamic Lookup Ports

About the Dynamic Lookup Output Port

The lookup/output port output value depends on whether we choose to output old or new values when the

Integration Service updates a row:

Output old values on update: The Integration Service outputs the value that existed in the cache

before it updated the row.

Output new values on update: The Integration Service outputs the updated value that it writes in

the cache. The lookup/output port value matches the input/output port value.

Note: We can configure to output old or new values using the Output Old Value On Update transformation

property.

Informatica Dynamic Lookup Cache - NULL handling in LookUp

Handling NULL in dynamic LookUp

If the input value is NULL and we select the Ignore Null inputs for Update property for the associated input

port, the input value does not equal the lookup value or the value out of the input/output port. When you

select the Ignore Null property, the lookup cache and the target table might become unsynchronized if you

pass null values to the target. You must verify that you do not pass null values to the target.

When you update a dynamic lookup cache and target table, the source data might contain some null values.

The Integration Service can handle the null values in the following ways:

Insert null values: The Integration Service uses null values from the source and updates the

lookup cache and target table using all values from the source.

Ignore Null inputs for Update property : The Integration Service ignores the null values in the

source and updates the lookup cache and target table using only the not null values from the source.

If we know the source data contains null values, and we do not want the Integration Service to update the

lookup cache or target with null values, then we need to check the Ignore Null property for the corresponding

lookup/output port.

When we choose to ignore NULLs, we must verify that we output the same values to the target that the Integration Service writes to the lookup cache. We can configure the mapping based on the value we want the Integration Service to output from the lookup/output ports when it updates a row in the cache, so that the lookup cache and the target table do not become unsynchronized:

New values. Connect only lookup/output ports from the Lookup transformation to the target.


Old values. Add an Expression transformation after the Lookup transformation and before the

Filter or Router transformation. Add output ports in the Expression transformation for each port in the target

table and create expressions to ensure that we do not output null input values to the target.

Informatica Dynamic Lookup Cache - Other Details

When we run a session that uses a dynamic lookup cache, the Integration Service compares the values in

all lookup ports with the values in their associated input ports by default.

It compares the values to determine whether or not to update the row in the lookup cache. When a value in

an input port differs from the value in the lookup port, the Integration Service updates the row in the cache.

 

But what if we don't want to compare all ports? We can choose the ports we want the Integration Service to

ignore when it compares ports. The Designer only enables this property for lookup/output ports when the

port is not used in the lookup condition. We can improve performance by ignoring some ports during

comparison.

 

We might want to do this when the source data includes a column that indicates whether or not the row

contains data we need to update. Select the Ignore in Comparison property for all lookup ports except

the port that indicates whether or not to update the row in the cache and target table. 

Note: We must configure the Lookup transformation to compare at least one port, otherwise the Integration Service fails the session when we ignore all ports.


Pushdown Optimization In Informatica

Pushdown optimization, which is a new concept in Informatica PowerCentre, allows developers to balance the data transformation load among servers. This article describes pushdown techniques.

What is Pushdown Optimization?

Pushdown optimization is a way of load-balancing among servers in order to achieve optimal performance. Veteran ETL developers often come across issues when they need to determine the appropriate place to perform ETL logic. Suppose some ETL logic needs to filter out data based on a condition. One can either do it in the database by using a WHERE condition in the SQL query, or inside Informatica by using an Informatica Filter transformation. Sometimes, we can even "push" some transformation logic to the target database instead of doing it on the source side (especially in the case of ELT rather than ETL). Such optimization is crucial for overall ETL performance.


How does Push-Down Optimization work?

One can push transformation logic to the source or target database using pushdown optimization. The

Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the

source or the target database which executes the SQL queries to process the transformations. The amount

of transformation logic one can push to the database depends on the database, transformation logic, and

mapping and session configuration. The Integration Service analyzes the transformation logic it can push to

the database and executes the SQL statement generated against the source or target tables, and it

processes any transformation logic that it cannot push to the database.

Pushdown Optimization In Informatica - Using Pushdown Optimization

Using Pushdown Optimization

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the

Integration Service can push to the source or target database. You can also use the Pushdown Optimization

Viewer to view the messages related to pushdown optimization.

Let us take an example: Image: Pushdown Optimization Example 1 

Filter Condition used in this mapping is: DEPTNO>40

Suppose a mapping contains a Filter transformation that filters out all employees except those with a

DEPTNO greater than 40. The Integration Service can push the transformation logic to the database. It

generates the following SQL statement to process the transformation logic:

INSERT INTO EMP_TGT(EMPNO, ENAME, SAL, COMM, DEPTNO)

SELECT

EMP_SRC.EMPNO,

EMP_SRC.ENAME,

EMP_SRC.SAL,

EMP_SRC.COMM,

EMP_SRC.DEPTNO

FROM EMP_SRC

WHERE (EMP_SRC.DEPTNO >40)

The Integration Service generates an INSERT SELECT statement and it filters the data using a WHERE

clause. The Integration Service does not extract data from the database at this time.

We can configure pushdown optimization in the following ways:

Using source-side pushdown optimization:


The Integration Service pushes as much transformation logic as possible to the source database. The

Integration Service analyzes the mapping from the source to the target or until it reaches a downstream

transformation it cannot push to the source database and executes the corresponding SELECT statement.

Using target-side pushdown optimization:

The Integration Service pushes as much transformation logic as possible to the target database. The

Integration Service analyzes the mapping from the target to the source or until it reaches an upstream

transformation it cannot push to the target database. It generates an INSERT, DELETE, or UPDATE

statement based on the transformation logic for each transformation it can push to the database and

executes the DML.

Using full pushdown optimization:

The Integration Service pushes as much transformation logic as possible to both source and target

databases. If you configure a session for full pushdown optimization, and the Integration Service cannot

push all the transformation logic to the database, it performs source-side or target-side pushdown

optimization instead. Also the source and target must be on the same database. The Integration Service

analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it

analyzes the target. When it can push all transformation logic to the database, it generates an INSERT

SELECT statement to run on the database. The statement incorporates transformation logic from all the

transformations in the mapping. If the Integration Service can push only part of the transformation logic to the database, it does not fail the session; it pushes as much transformation logic as possible to the source and target databases and then processes the remaining transformation logic.

For example, a mapping contains the following transformations:

SourceDefn -> SourceQualifier -> Aggregator -> Rank -> Expression -> TargetDefn

SUM(SAL), SUM(COMM) Group by DEPTNO

RANK PORT on SAL

TOTAL = SAL+COMM

Image: Pushdown Optimization Example 2

The Rank transformation cannot be pushed to the database. If the session is configured for full pushdown

optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator

transformation to the source, processes the Rank transformation, and pushes the Expression transformation

and target to the target database.
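As an illustration only (this is a sketch of the idea, not the exact SQL the Integration Service would emit), the source-side portion pushed down in this example could translate to something like the following, reusing the EMP_SRC table from the earlier example:

SELECT DEPTNO, SUM(SAL), SUM(COMM)
FROM EMP_SRC
GROUP BY DEPTNO;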

When we use pushdown optimization, the Integration Service converts the expression in the transformation

or in the workflow link by determining equivalent operators, variables, and functions in the database. If there

is no equivalent operator, variable, or function, the Integration Service itself processes the transformation

logic. The Integration Service logs a message in the workflow log and the Pushdown Optimization Viewer

when it cannot push an expression to the database. Use the message to determine the reason why it could

not push the expression to the database.


Pushdown Optimization In Informatica - Pushdown Optimization in Integration Service

How does Integration Service handle Push Down Optimization?

To push transformation logic to a database, the Integration Service might create temporary objects in the

database. The Integration Service creates a temporary sequence object in the database to push Sequence

Generator transformation logic to the database. The Integration Service creates temporary views in the database while pushing a Source Qualifier transformation or a Lookup transformation with an SQL override, an unconnected relational lookup, or a filtered lookup to the database.

1. To push Sequence Generator transformation logic to a database, we must configure the session

for pushdown optimization with Sequence.

2. To enable the Integration Service to create the view objects in the database, we must configure the session for pushdown optimization with View.

3. After the database transaction completes, the Integration Service drops the sequence and view objects created for pushdown optimization.

Pushdown Optimization In Informatica - Configuring Pushdown Optimization

Configuring Parameters for Pushdown Optimization

Depending on the database workload, we might want to use source-side, target-side, or full pushdown optimization at different times, and for that we can use the $$PushdownConfig mapping parameter. The settings in the $$PushdownConfig parameter override the pushdown optimization settings in the session properties. Create the $$PushdownConfig parameter in the Mapping Designer, select $$PushdownConfig for the Pushdown Optimization attribute in the session properties, and define the parameter in the parameter file. The possible values are:

1. None, i.e. the Integration Service itself processes all the transformations

2. Source [Seq View]

3. Target [Seq View]

4. Full [Seq View]
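A parameter file entry for this could then look something like the sketch below, using one of the values listed above (the folder, workflow and session names are placeholders, not from the article):

[MyFolder.WF:wf_load_sales.ST:s_m_load_sales]
$$PushdownConfig=Source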

Pushdown Optimization In Informatica - Using Pushdown Optimization Viewer

Pushdown Optimization Viewer


Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database.

Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the

corresponding SQL statement that is generated for the specified selections. When we select a pushdown

option or pushdown group, we do not change the pushdown configuration. To change the configuration, we

must update the pushdown option in the session properties.

 

Database that supports Informatica Pushdown Optimization

We can configure sessions for pushdown optimization with any of the following databases: Oracle, IBM DB2, Teradata, Microsoft SQL Server, Sybase ASE, or databases that use ODBC drivers.

When we use native drivers, the Integration Service generates SQL statements using native database SQL.

When we use ODBC drivers, the Integration Service generates SQL statements using ANSI SQL. The

Integration Service can generate more functions when it generates SQL statements using native language

instead of ANSI SQL.

Pushdown Optimization In Informatica - Pushdown Optimization Error Handling

Handling Error when Pushdown Optimization is enabled

When the Integration Service pushes transformation logic to the database, it cannot track errors that occur in

the database.

When the Integration Service runs a session configured for full pushdown optimization and an error occurs,

the database handles the errors. When the database handles errors, the Integration Service does not write

reject rows to the reject file.

If we configure a session for full pushdown optimization and the session fails, the Integration Service cannot

perform incremental recovery because the database processes the transformations. Instead, the database

rolls back the transactions. If the database server fails, it rolls back transactions when it restarts. If the

Integration Service fails, the database server rolls back the transaction.


Informatica Tuning - Step by Step Approach

This is the first of a number of articles in the series on Data Warehouse application performance tuning, scheduled to come every week. This one is on Informatica performance tuning.

Please note that this article is intended to be a quick guide. A more detailed Informatica performance tuning guide can be found here: Informatica Performance Tuning Complete Guide

Source Query/ General Query Tuning


1.1 Calculate the original query cost
1.2 Can the query be re-written to reduce cost?
- Can an IN clause be changed to EXISTS? (see the example after this list)
- Can a UNION be replaced with UNION ALL if we are not using any DISTINCT clause in the query?
- Is there a redundant table join that can be avoided?
- Can we include an additional WHERE clause to further limit data volume?
- Is there a redundant column used in GROUP BY that can be removed?
- Is there a redundant column selected in the query but not used anywhere in the mapping?
1.3 Check if all the major joining columns are indexed
1.4 Check if all the major filter conditions (WHERE clause) are indexed
- Can a function-based index improve performance further?
1.5 Check if any exclusive query hint reduces query cost
- Check if a parallel hint improves performance and reduces cost
1.6 Recalculate the query cost
- If the query cost is reduced, use the changed query
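As an example of the IN-to-EXISTS rewrite mentioned in 1.2 (the DEPT_SRC table and REGION column are hypothetical; only EMP_SRC appears elsewhere in this document):

-- Original
SELECT *
FROM EMP_SRC
WHERE DEPTNO IN (SELECT DEPTNO FROM DEPT_SRC WHERE REGION = 'EAST');

-- Rewritten with EXISTS
SELECT *
FROM EMP_SRC E
WHERE EXISTS (SELECT 1 FROM DEPT_SRC D
              WHERE D.DEPTNO = E.DEPTNO AND D.REGION = 'EAST');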

Tuning Informatica LookUp

2.1 Redundant Lookup transformation
- Is there a lookup which is no longer used in the mapping?
- If there are consecutive lookups, can those be replaced inside a single lookup override?
2.2 LookUp conditions
- Are all the lookup conditions indexed in the database? (uncached lookup only)
- An unequal condition should always be mentioned after an equal condition
2.3 LookUp override query
- Should follow all the guidelines from the Source Query section above
2.4 Ensure there is no unnecessary column selected in the lookup (to reduce cache size)
2.5 Cached/Uncached
- Carefully consider whether the lookup should be cached or uncached
- General guidelines:
- Generally don't use a cached lookup if the lookup table size is > 300 MB
- Generally don't use a cached lookup if the lookup table row count is > 20,000,00
- Generally don't use a cached lookup if the driving table (source table) row count is < 1000
2.6 Persistent Cache
- If the same lookup is cached and used in different mappings, consider a persistent cache
2.7 Lookup cache building
- Consider "Additional Concurrent Pipeline" in the session properties to build the cache concurrently
- "Prebuild Lookup Cache" should be enabled only if the lookup is surely called in the mapping

Tuning Informatica Joiner

3.1 Unless unavoidable, join database tables in the database only (homogeneous join) and don't use a Joiner
3.2 If an Informatica Joiner is used, always use sorted input and try to sort the data in the SQ query itself using ORDER BY (if a Sorter transformation is used, make sure the Sorter has enough cache to perform a 1-pass sort)
3.3 The smaller of the two joining tables should be the master

Tuning Informatica Aggregator

4.1 When possible, sort the input for the Aggregator at the database end (ORDER BY clause)
4.2 If the input is not already sorted, use a Sorter. If possible, use the SQ query to sort the records.

Tuning Informatica Filter


5.1 Unless unavoidable, filter in the source query in the Source Qualifier
5.2 Apply the filter as near to the source as possible

Tuning Informatica Sequence Generator

6.1 Cache the sequence generator 

Setting Correct Informatica Session Level Properties

7.1 Disable "High Precision" if not required (High Precision allows decimals up to 28 digits of precision)
7.2 Use "Terse" mode for the tracing level
7.3 Enable pipeline partitioning (thumb rule: maximum no. of partitions = no. of CPUs / 1.2; also remember that increasing partitions will multiply the cache memory requirement accordingly)

Tuning Informatica Expression

8.1 Use variable ports to reduce redundant calculations
8.2 Remove the default value "ERROR('transformation error')" for output columns
8.3 Try to reduce code complexity, such as nested IFs
8.4 Try to reduce unnecessary type conversions in calculations

Implementing Informatica Partitions

Why use Informatica Pipeline Partition? 

Identification and elimination of performance bottlenecks will obviously optimize session performance. After

tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number

of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the

system hardware while processing the session.

PowerCenter Informatica Pipeline Partition

Different Types of Informatica Partitions

We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key

range, Pass-through, Round-robin.

Informatica Pipeline Partitioning Explained

Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the

transformations and the target. When the Integration Service runs the session, it can achieve higher

performance by partitioning the pipeline and performing the extract, transformation, and load for each

partition in parallel.

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number

of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration


Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we

can configure multiple partitions for a single pipeline stage.

Setting partition attributes includes partition points, the number of partitions, and the partition types. In the

session properties we can add or edit partition points. When we change partition points we can define the

partition type and add or delete partitions (number of partitions).

We can set the following attributes to partition a pipeline:

Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a

section of a pipeline between any two partition points. The Integration Service redistributes rows of data at

partition points. When we add a partition point, we increase the number of pipeline stages by one.

Increasing the number of partitions or partition points increases the number of threads. We cannot create

partition points at Source instances or at Sequence Generator transformations.

Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the

Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we

increase the number of processing threads, which can improve session performance. We can define up to

64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at

any partition point, the Workflow Manager increases or decreases the number of partitions at all partition

points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration

Service runs the partition threads concurrently.

Partition types: The Integration Service creates a default partition type at each partition point. If we have

the Partitioning option, we can change the partition type. The partition type controls how the Integration

Service distributes data among partitions at partition points. We can define the following partition types:

Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.

Database partitioning: The Integration Service queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database.

Pass-through: The Integration Service processes data without redistributing rows among partitions. All

rows in a single partition stay in the partition after crossing a pass-through partition point. Choose pass-

through partitioning when we want to create an additional pipeline stage to improve performance, but do not

want to change the distribution of data across partitions.

Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin

partitioning where we want each partition to process approximately the same number of rows, i.e. load

balancing.

Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The

Integration Service groups the data based on a partition key. The Integration Service uses all grouped or

sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter,

and unsorted Aggregator transformations.

Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We

define the number of ports to generate the partition key.

Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as

the partition key. For each port, we define a range of values. The Integration Service uses the key and


ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in

the pipeline are partitioned by key range.

We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning.

Add, delete, or edit partition points on the Partitions view on the Mapping tab of session properties of a

session in Workflow Manager.

The PowerCenter® Partitioning Option increases the performance of PowerCenter through parallel data

processing. This option provides a thread-based architecture and automatic data partitioning that optimizes

parallel processing on multiprocessor and grid-based hardware environments.

Implementing Informatica Persistent Cache

You must have noticed that the time Informatica takes to build the lookup cache can sometimes be too long, depending on the lookup table size/volume. Using a persistent cache, you may save a lot of this time. This article describes how to do it.

What is Persistent Cache?

Lookups are cached by default in Informatica. This means that, by default, Informatica brings the entire data of the lookup table from the database server to the Informatica server as part of the lookup cache building activity during the session run. If the lookup table is huge, this can take quite some time. Now consider this scenario: what if you are looking up the same table several times using different lookups in different mappings? Do you want to spend the time of building the lookup cache again and again for each lookup? Of course not! Just use the persistent cache option.

Yes, Lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes

lookup cache files after a successful session run based on whether the Lookup cache is checked as

persistent or not. 

Where and when shall we use persistent cache?

Suppose we have a lookup with the same lookup condition and return/output ports, and the lookup table is used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to populate the surrogate key in the fact tables based on their source system keys. Now if we cache the same Customer Dimension table multiple times in multiple mappings, that would definitely affect the SLA loading timeline.

There can be some functional reasons also for selecting to use persistent cache. Please read the

article Advantage and Disadvantage of Persistent Cache Lookup to know how persistent cache can be used

to ensure data integrity in long running ETL sessions where underlying tables are also changing.

So the solution is to use Named Persistent Cache.

In the first mapping we will create the Named Persistent Cache file by setting three properties in the

Properties tab of Lookup transformation. 


 

Lookup cache persistent: to be checked, i.e. a Named Persistent Cache will be used.

Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that will be used in all the other mappings using the same lookup table. Enter the prefix name only; do not enter .idx or .dat.

Re-cache from lookup source: to be checked, i.e. the Named Persistent Cache file will be rebuilt or refreshed with the current data of the lookup table.
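
As a concrete illustration (the prefix below is an example name, and the default cache directory is referenced by the $PMCacheDir variable): if the Cache File Name Prefix is set to cust_dim_lkp, then after the first successful session run the Integration Service leaves the reusable cache files in the cache directory:

$PMCacheDir/cust_dim_lkp.idx    (index file)
$PMCacheDir/cust_dim_lkp.dat    (data file)

Every other lookup that points to the same prefix reuses these files instead of querying and re-caching the lookup table.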

Next, in all the mappings where we want to reuse the already built Named Persistent Cache, we need to set two properties in the Properties tab of the Lookup transformation.


Lookup cache persistent: to be checked, i.e. the lookup will use a Named Persistent Cache that is already saved in the Cache Directory. If the cache file is not there, the session will not fail; it will simply create the cache file instead.

Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that was defined in the mapping where the persistent cache file was created.

Note:

If there is any Lookup SQL Override, then the SQL statement in all the lookups must match exactly; even an extra blank space will fail the session that uses the already built persistent cache file.

So if the incoming source data volume is high, the volume of lookup data that needs to be cached is also high, and the same lookup table is used in many mappings, then the best way to handle the situation is a one-time build of a Named Persistent Cache that is then reused everywhere else.


Aggregation without Informatica Aggregator

Since Informatica processes data row by row, it is often possible to handle a data aggregation operation even without an Aggregator Transformation. In certain cases, you may get a huge performance gain using this technique!

General Idea of Aggregation without Aggregator Transformation

Let us take an example: Suppose we want to find the SUM of SALARY for Each Department of the

Employee Table. The SQL query for this would be:

SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;

If we need to implement this in Informatica, it would be very easy as we would obviously go for an

Aggregator Transformation. By marking the DEPTNO port as GROUP BY and defining one output port as SUM(SALARY), the problem can be solved easily.

Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We will rely on the ability of an Expression transformation, through its variable ports, to hold a value from the previous row while processing the current one.

But wait... why would we do this? Aren't we complicating the thing here?

Yes, we are. But, as it turns out, in many cases this can bring a performance benefit (especially if the input is already sorted, or when you know the input data will not violate the order, e.g. you are loading daily data and want to aggregate it by day). Remember that Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also breaks the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and preserve row-by-row processing. The mapping below shows how to do this.

Image: Aggregation with Expression and Sorter 1

Sorter (SRT_SAL) Ports Tab 


Now, I am showing a Sorter here just to illustrate the concept. If the data already arrives sorted from the source, you do not need this Sorter, which further increases the performance benefit.

Expression (EXP_SAL) Ports Tab

Image: Expression Ports Tab Properties

Sorter (SRT_SAL1) Ports Tab 


Expression (EXP_SAL2) Ports Tab 

 

Filter (FIL_SAL) Properties Tab 


This is how we can implement aggregation without using Informatica aggregator transformation. Hope you

liked it!
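
Since the original screenshots are not reproduced here, below is a minimal sketch of how the ports in EXP_SAL, SRT_SAL1, EXP_SAL2 and FIL_SAL could be wired, assuming rows reach EXP_SAL sorted by DEPTNO. The port names, expressions and sort keys are illustrative assumptions, not a copy of the original mapping.

Expression (EXP_SAL)  -- builds a running total per department; variable ports evaluate top to bottom
    DEPTNO          (input/output)
    SALARY          (input)
    V_SUM_SAL       (variable) = IIF(DEPTNO = V_PREV_DEPTNO, V_SUM_SAL + SALARY, SALARY)
    V_PREV_DEPTNO   (variable) = DEPTNO
    O_SUM_SALARY    (output)   = V_SUM_SAL
    -- O_SUM_SALARY carries the complete departmental total only on the last row of each DEPTNO

Sorter (SRT_SAL1)
    -- sort key: DEPTNO ascending, O_SUM_SALARY descending,
    -- so the row carrying the full departmental total comes first within each DEPTNO

Expression (EXP_SAL2)  -- flags the first row of each DEPTNO group after the re-sort
    DEPTNO          (input/output)
    O_SUM_SALARY    (input/output)
    V_FLAG          (variable) = IIF(DEPTNO = V_PREV_DEPTNO2, 0, 1)
    V_PREV_DEPTNO2  (variable) = DEPTNO
    O_FLAG          (output)   = V_FLAG

Filter (FIL_SAL)
    -- filter condition: O_FLAG = 1
    -- only one row per DEPTNO survives, carrying SUM(SALARY) for that department

The same pattern can be adapted for other aggregates (for example a running count) by changing the V_SUM_SAL expression.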

What are the differences between Connected and Unconnected Lookup?

Connected Lookup vs Unconnected Lookup:

A connected lookup participates in the data flow and receives input directly from the pipeline, whereas an unconnected lookup receives input values from the result of a :LKP expression in another transformation.

A connected lookup can use both a dynamic and a static cache, whereas an unconnected lookup cache can NOT be dynamic.

A connected lookup can return more than one column value (multiple output ports), whereas an unconnected lookup can return only one column value, i.e. the return port.

A connected lookup caches all lookup columns, whereas an unconnected lookup caches only the ports used in the lookup condition and the return port.

A connected lookup supports user-defined default values (i.e. the value to return when the lookup condition is not satisfied), whereas an unconnected lookup does not support user-defined default values.
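
A brief illustration of the unconnected style (the lookup and port names are made up for this example): the lookup is invoked from an expression through the :LKP reference qualifier, and because an unconnected lookup has no user-defined default value, the no-match case is handled in the calling expression:

V_CUST_SK   (variable) = :LKP.LKP_CUSTOMER_DIM(IN_CUSTOMER_ID)
O_CUST_SK   (output)   = IIF(ISNULL(V_CUST_SK), -1, V_CUST_SK)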

What is the difference between Router and Filter?

Router vs Filter:

A Router transformation divides the incoming records into multiple groups based on conditions, and such groups can be mutually inclusive (different groups may contain the same record). A Filter transformation restricts or blocks the incoming record set based on one given condition.

A Router transformation itself does not block any record; if a record does not match any of the routing conditions, it is routed to the default group. A Filter transformation has no default group; if a record does not match the filter condition, the record is blocked (dropped).

A Router acts like a CASE .. WHEN statement in SQL (or a switch() .. case statement in C), whereas a Filter acts like a WHERE condition in SQL.
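
To make the contrast concrete (the group names and conditions below are illustrative): a Router can define overlapping groups plus a default group, while a Filter has exactly one condition:

Router groups:
    HIGH_SAL : SALARY > 50000
    DEPT_10  : DEPTNO = 10        -- a row may satisfy both conditions and is routed to both groups
    DEFAULT group receives rows matching neither condition

Filter condition:
    SALARY > 50000                -- rows that fail the condition are simply dropped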

What can we do to improve the performance of Informatica Aggregator Transformation?

Aggregator performance improves dramatically if the records are sorted before being passed to the Aggregator and the "Sorted Input" option under the Aggregator properties is checked. The record set should be sorted on the columns that are used as Group By ports.

It is often a good idea to sort the record set at the database level, e.g. inside a Source Qualifier transformation (as we saw earlier, the database often sorts faster than Informatica, and it takes the sorting load off the Integration Service), unless there is a chance that the already sorted records from the Source Qualifier can become unsorted again before reaching the Aggregator.
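
For example (a sketch using the EMP_SRC table from the aggregation example above; the exact override text is an assumption): the sort can be pushed to the database either by setting the Source Qualifier property "Number Of Sorted Ports" or by overriding the generated SQL, e.g.:

SELECT DEPTNO, SALARY FROM EMP_SRC ORDER BY DEPTNO

The ORDER BY columns must match the Aggregator's Group By ports (here DEPTNO), and the order must not be disturbed between the Source Qualifier and the Aggregator, for the "Sorted Input" option to behave correctly.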

What are the different types of lookup cache?

Lookups can be cached or uncached (no cache). A cached lookup can be either static or dynamic. A static cache is one which is not modified once it is built; it remains the same during the session run. On the other hand, a dynamic cache is refreshed during the session run by inserting or updating records in the cache based on the incoming source data.

A lookup cache can also be classified as persistent or non-persistent, based on whether Informatica retains the cache even after the session run is complete or not, respectively.
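
As a point of reference (a sketch, not part of the original text): a dynamic lookup cache adds a NewLookupRow port to the Lookup transformation (0 = cache unchanged, 1 = row inserted into the cache, 2 = row updated in the cache), which downstream logic, e.g. an Update Strategy expression, can test:

IIF(NewLookupRow = 1, DD_INSERT, IIF(NewLookupRow = 2, DD_UPDATE, DD_REJECT))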


How can we update a record in target table without using Update strategy?

A target table can be updated without using an 'Update Strategy' transformation. For this, we need to define the key of the target table at the Informatica level and then connect the key and the field we want to update in the mapping target. At the session level, we should set "Treat source rows as" to Update and, in the target properties, check "Update as Update".

Let's assume we have a target table "Customer" with the fields "Customer ID", "Customer Name" and "Customer Address". Suppose we want to update "Customer Address" without an Update Strategy. Then we have to define "Customer ID" as the primary key at the Informatica level and connect the Customer ID and Customer Address fields in the mapping target. If the session properties are set correctly as described above, the mapping will update only the Customer Address field for all matching Customer IDs.
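
As a rough sketch of the effect (column names are simplified from the example above; the actual statement generated by the Integration Service may differ), each incoming row effectively results in an update keyed on the ID:

UPDATE CUSTOMER SET CUSTOMER_ADDRESS = ? WHERE CUSTOMER_ID = ?
-- the ? placeholders are bound to the incoming Customer Address and Customer ID values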

Deleting duplicate row using Informatica

Q1. Suppose we have Duplicate records in Source System and we want to load only the unique records in

the Target System eliminating the duplicate rows. What will be the approach?

Ans.

Let us assume that the source system is a Relational Database. The source table has duplicate rows. Now, to eliminate the duplicate records, we can check the Distinct option of the Source Qualifier of the source table and load the target accordingly.

Source Qualifier Transformation DISTINCT clause
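
For illustration only (the table and column names are assumptions): with the "Select Distinct" option checked, the Source Qualifier makes the Integration Service issue a distinct query against the relational source, along the lines of:

SELECT DISTINCT CUSTOMER_ID, CUSTOMER_NAME, CUSTOMER_ADDRESS FROM CUSTOMER_SRC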

But what if the source is a flat file? How can we remove the duplicates from flat file source?


To know the answer to this question and similar high-frequency Informatica questions, please continue to,