Best Informatica Practices


Best Practices: Table of Contents

Best Practices BP-2

Configuration Management BP-2

Database Sizing BP-2

Migration Procedures BP-5

Development Techniques BP-36

Data Cleansing BP-36

Data Connectivity using PowerCenter Connect for BW Integration Server BP-42

Data Connectivity using PowerCenter Connect for MQSeries BP-47

Data Connectivity using PowerCenter Connect for SAP BP-52

Data Profiling BP-60

Data Quality Mapping Rules BP-63

Deployment Groups BP-69

Designing Analytic Data Architectures BP-72

Developing an Integration Competency Center BP-82

Development FAQs BP-90

Key Management in Data Warehousing Solutions BP-99

Mapping Design BP-103

Mapping Templates BP-107

Naming Conventions BP-110

Performing Incremental Loads BP-120

Real-Time Integration with PowerCenter BP-126

Session and Data Partitioning BP-136

Using Parameters, Variables and Parameter Files BP-141

Using PowerCenter Labels BP-157

Using PowerCenter Metadata Reporter and Metadata Exchange Views for Quality Assurance BP-162

Using PowerCenter with UDB BP-164

Using Shortcut Keys in PowerCenter Designer BP-170


Web Services BP-178

Working with PowerCenter Connect for MQSeries BP-183

Error Handling BP-190

A Mapping Approach to Trapping Data Errors BP-190

Error Handling Strategies BP-194

Error Handling Techniques using PowerCenter 7 (PC7) and PowerCenter Metadata Reporter (PCMR)

BP-205

Error Management in a Data Warehouse Environment BP-212

Error Management Process Flow BP-220

Metadata and Object Management BP-223

Creating Inventories of Reusable Objects & Mappings BP-223

Metadata Reporting and Sharing BP-227

Repository Tables & Metadata Management BP-239

Using Metadata Extensions BP-247

Operations BP-250

Daily Operations BP-250

Data Integration Load Traceability BP-252

Event Based Scheduling BP-259

High Availability BP-262

Load Validation BP-265

Repository Administration BP-270

SuperGlue Repository Administration BP-273

Third Party Scheduler BP-278

Updating Repository Statistics BP-282

PowerAnalyzer Configuration and Performance Tuning BP-288

Deploying PowerAnalyzer Objects BP-288

Installing PowerAnalyzer BP-299

PowerAnalyzer Security BP-306

Tuning and Configuring PowerAnalyzer and PowerAnalyzer Reports BP-314

Upgrading PowerAnalyzer BP-332

PowerCenter Configuration and Performance Tuning BP-334

Advanced Client Configuration Options BP-334

Advanced Server Configuration Options BP-339

Causes and Analysis of UNIX Core Files BP-344

Determining Bottlenecks BP-347

Managing Repository Size BP-352

Organizing and Maintaining Parameter Files & Variables BP-354

Performance Tuning Databases (Oracle) BP-359

Performance Tuning Databases (SQL Server) BP-371


Performance Tuning Databases (Teradata) BP-377

Performance Tuning UNIX Systems BP-380

Performance Tuning Windows NT/2000 Systems BP-387

Platform Sizing BP-390

Recommended Performance Tuning Procedures BP-393

Tuning Mappings for Better Performance BP-396

Tuning Sessions for Better Performance BP-409

Tuning SQL Overrides and Environment for Better Performance BP-417

Understanding and Setting UNIX Resources for PowerCenter Installations BP-431

Upgrading PowerCenter BP-435

Project Management BP-441

Assessing the Business Case BP-441

Defining and Prioritizing Requirements BP-444

Developing a Work Breakdown Structure (WBS) BP-447

Developing and Maintaining the Project Plan BP-449

Developing the Business Case BP-451

Managing the Project Lifecycle BP-453

Using Interviews to Determine Corporate Analytics Requirements BP-456

PWX Configuration & Tuning BP-463

PowerExchange Installation (for Mainframe) BP-463

Recovery BP-469

Running Sessions in Recovery Mode BP-469

Security BP-476

Configuring Security BP-476

SuperGlue BP-491

Custom XConnect Implementation BP-491

Customizing the SuperGlue Interface BP-496

Estimating SuperGlue Volume Requirements BP-503

SuperGlue Metadata Load Validation BP-506

SuperGlue Performance & Tuning BP-510

Using SuperGlue Console to Tune the XConnects BP-510


Database Sizing

Challenge

Database sizing involves estimating the types and sizes of the components of a data architecture. This is important for determining the optimal configuration for your database servers in order to support your operational workloads. Individuals involved in a sizing exercise may be data architects, database administrators, and/or business analysts.

Description

The first step in database sizing is to review system requirements to define such things as:

• expected data architecture elements (will there be staging areas? operational data stores? centralized data warehouse and/or master data? data marts?)

• expected source data volume

• data granularity and periodicity

• load frequency and method (full refresh? incremental updates?)

• estimated growth rates over time and retained history

Determining Growth Projections

One way to estimate projections of data growth over time is to use scenario analysis. As an example, for scenario analysis of a sales tracking data mart you can use the number of sales transactions to be stored as the basis for the sizing estimate. In the first year, 10 million sales transactions are expected; this equates to 10 million fact table records.

Next, use the sales growth forecasts for the upcoming years for database growth calculations. That is, an annual sales growth rate of 10 percent translates into 11 million fact table records for the next year. At the end of five years, the fact table is likely to contain about 60 million records. You may want to calculate other estimates based on five-percent annual sales growth (case 1) and 20-percent annual sales growth (case 2). Multiple projections for best and worst case scenarios can be very helpful.
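
To make the scenario arithmetic concrete, the short sketch below projects the fact table row counts for the three growth cases. It is purely illustrative; the starting volume and growth rates are the assumptions from the example above, so substitute your own figures.

# Hypothetical scenario analysis for fact table growth (illustrative sketch only).
# Assumptions: 10 million transactions in year 1; growth cases of 5%, 10%, 20%.

YEAR1_ROWS = 10_000_000
YEARS = 5

def project_rows(start_rows: int, annual_growth: float, years: int) -> list[int]:
    """Return the row count added in each year, compounding the growth rate."""
    rows_per_year = []
    current = start_rows
    for _ in range(years):
        rows_per_year.append(round(current))
        current *= (1 + annual_growth)
    return rows_per_year

for rate in (0.05, 0.10, 0.20):  # case 1, expected, and case 2 sales growth
    yearly = project_rows(YEAR1_ROWS, rate, YEARS)
    print(f"{rate:.0%} growth: yearly adds {yearly}, 5-year total {sum(yearly):,} rows")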

Baseline Volumetric


Next, use the physical data models for the sources and the target architecture to develop a baseline sizing estimate. The administration guides for most DBMSs contain sizing guidelines for the various database structures such as tables, indexes, sort space, data files, log files, and database cache.

Develop a detailed sizing using a worksheet inventory of the tables and indexes from the physical data model, along with field data types and field sizes. Various database products use different storage methods for data types, so be sure to use the database manuals to determine the size of each data type. Add up the field sizes to determine the row size, then use the data volume projections to determine the number of rows; multiplying the row size by the number of rows gives the table size.

The default estimate for index size is to assume the same size as the table. Also estimate the temporary space for sort operations. For data warehouse applications where summarizations are common, plan on large temporary spaces; the temporary space can be as much as 1.5 times larger than the largest table in the database.

Another approach that is sometimes useful is to load the data architecture with representative data and determine the resulting database sizes. This test load can be a fraction of the actual data and is used only to gather basic sizing statistics. You will then need to apply growth projections to these statistics. For example, after loading ten thousand sample records to the fact table, you determine the size to be 10MB. Based on the scenario analysis, you can expect this fact table to contain 60 million records after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB * (60,000,000/10,000)]. Don't forget to add indexes and summary tables to the calculations.
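
If it helps to see the worksheet arithmetic end to end, the following sketch combines the pieces for a single hypothetical fact table. The field sizes are examples only, and the simplifying assumptions above (index space roughly equal to table space, temporary space at 1.5 times the largest table) are built in; exact per-type storage sizes must still come from your DBMS manual.

# Hypothetical sizing worksheet for one fact table (illustrative sketch only).
# Field sizes in bytes are examples; check your DBMS manual for the real
# storage size of each data type and any per-row overhead.

field_sizes = {
    "sale_id": 8,        # integer surrogate key
    "sale_date": 7,      # date
    "customer_key": 8,
    "product_key": 8,
    "quantity": 4,
    "amount": 8,
}

row_size = sum(field_sizes.values())   # bytes per row
projected_rows = 60_000_000            # from the five-year scenario analysis
table_bytes = row_size * projected_rows
index_bytes = table_bytes              # default assumption: index ~= table size
temp_bytes = 1.5 * table_bytes         # sort/temp space, 1.5 x the largest table

gb = 1024 ** 3
print(f"Row size:   {row_size} bytes")
print(f"Table size: {table_bytes / gb:.1f} GB")
print(f"Index size: {index_bytes / gb:.1f} GB")
print(f"Temp space: {temp_bytes / gb:.1f} GB")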

Guesstimating

When there is not enough information to calculate an estimate as described above, use educated guesses and “rules of thumb” to develop as reasonable an estimate as possible.

• If you don’t have the source data model, use what you do know of the source data to estimate average field size and average number of fields in a row to determine table size. Based on your understanding of transaction volume over time, determine your growth metrics for each type of data and calculate your source data volume (SDV) from table size and growth metrics.

• If your target data architecture is not complete enough to determine table sizes, base your estimates on multiples of the SDV (see the sketch after this list):

o If it includes staging areas: add another SDV for any source subject area that you will stage multiplied by the number of loads you’ll retain in staging.

o If you intend to consolidate data into an operational data store, add the SDV multiplied by the number of loads to be retained in the ODS for historical purposes (e.g., keeping 1 year’s worth of monthly loads = 12 x SDV)

o If your architecture includes a data warehouse: based on the periodicity and granularity of the DW, this may be another SDV + (.3n x SDV, where n = number of time periods loaded in the warehouse over time)

o If your data architecture includes aggregates, add a percentage of the warehouse volumetrics based on how much of the warehouse data will be aggregated and to what level (e.g., if the rollup level represents 10 percent of the dimensions at the detail level, use 10 percent).

o Similarly, for data marts add a percentage of the data warehouse based on how much of the warehouse data is moved into the data mart.

o Be sure to consider the growth projections over time and the history to be retained in all of your calculations.

And finally, remember that there is always much more data than you expect so you may want to add a reasonable fudge-factor to the calculations for a margin of safety.
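
A rough calculator for these rules of thumb might look like the sketch below. Every multiplier and the starting source data volume are placeholder assumptions; plug in your own retention periods, aggregate and data mart percentages, and fudge factor.

# Hypothetical back-of-the-envelope sizing from a source data volume (SDV).
# Every number below is a placeholder assumption for illustration.

sdv_gb = 2000              # estimated source data volume, in GB
staging_loads_kept = 3     # loads retained in the staging area
ods_loads_kept = 12        # e.g., one year of monthly loads in the ODS
dw_periods = 36            # time periods retained in the warehouse
aggregate_pct = 0.10       # rollups ~10% of the detail-level warehouse data
mart_pct = 0.30            # ~30% of warehouse data flows into data marts
fudge_factor = 1.25        # margin of safety

staging_gb = sdv_gb * staging_loads_kept
ods_gb = sdv_gb * ods_loads_kept
dw_gb = sdv_gb + (0.3 * dw_periods * sdv_gb)   # the SDV + (.3n x SDV) rule of thumb
aggregate_gb = dw_gb * aggregate_pct
mart_gb = dw_gb * mart_pct

total_gb = (staging_gb + ods_gb + dw_gb + aggregate_gb + mart_gb) * fudge_factor
print(f"Estimated total: {total_gb / 1024:.1f} TB")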


Migration Procedures

Challenge

Develop a migration strategy that ensures clean migration between development, test, QA, and production environments, thereby protecting the integrity of each of these environments as the system evolves.

Description

Ensuring that an application has a smooth migration process between development, quality assurance (QA), and production environments is essential for the deployment of an application. Deciding which migration strategy works best for a project depends on several factors.

1. How is the PowerCenter repository environment designed? Are there individual repositories for development, QA, and production, or are there just one or two environments that share one or all of these phases?

2. How has the folder architecture been defined?

Each of these factors plays a role in determining the migration procedure that is most beneficial to the project.

Informatica PowerCenter offers flexible migration options that can be adapted to fit the need of each application. PowerCenter migration options include repository migration, folder migration, object migration, and XML import/export. In versioned PowerCenter repositories, users can also use static or dynamic deployment groups for migration, which provides the capability to migrate any combination of objects within the repository with a single command.

This Best Practice document is intended to help the development team decide which technique is most appropriate for the project. The following sections discuss various options that are available, based on the environment and architecture selected. Each section describes the major advantages of its use as well as its disadvantages.

REPOSITORY ENVIRONMENTS

The following section outlines the migration procedures for standalone and distributed repository environments. The distributed environment section touches on several migration architectures, outlining the pros and cons of each. Also, please note that any methods described in the Standalone section may also be used in a Distributed environment.

STANDALONE REPOSITORY ENVIRONMENT

In a standalone environment, all work is performed in a single PowerCenter repository that serves as the metadata store. Separate folders are used to represent the development, QA, and production workspaces and segregate work. This type of architecture within a single repository ensures seamless migration from development to QA, and from QA to production.

The following example shows a typical architecture. In this example, the company has chosen to create separate development folders for each of the individual developers for development and unit test purposes. A single shared or common development folder, SHARED_MARKETING_DEV, holds all of the common objects, such as sources, targets, and reusable mapplets. In addition, two test folders are created for QA purposes. The first contains all of the unit-tested mappings from the development folder. The second is a common or shared folder that contains all of the tested shared objects. Eventually, as the following paragraphs explain, two production folders will also be built.

Proposed Migration Process – Single Repository

DEV to TEST – Object Level Migration

Now that we've described the repository architecture for this organization, let's discuss how it will migrate mappings to test, and then eventually to production.

After all mappings have completed their unit testing, the process for migration to test can begin. The first step in this process is to copy all of the shared or common objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder. This can be done using one of two methods:

• The first, and most common method, is object migration via an object copy. In this case, a user opens the SHARED_MARKETING_TEST folder and drags the object from the SHARED_MARKETING_DEV into the appropriate workspace (i.e. Source Analyzer, Warehouse Designer, etc.). This is similar to dragging a file from one folder to another using Windows Explorer.


• The second approach is object migration via object XML import/export. A user can export each of the objects in the SHARED_MARKETING_DEV folder to XML, and then re-import each object into the SHARED_MARKETING_TEST via XML import. With the XML import/export, the XML files can be uploaded to a third party versioning tool, if your organization has standardized on such a tool. Otherwise, versioning can be enabled in PowerCenter. Migrations with versioned PowerCenter repositories will be covered later in this document.

After you've copied all common or shared objects, the next step is to copy the individual mappings from each development folder into the MARKETING_TEST folder. Again, you can use either of the two object-level migration methods described above to copy the mappings to the folder, although the XML import/export method is the most intuitive method for resolving shared object conflicts. However, the migration method is slightly different here when you're copying the mappings because you must ensure that the shortcuts in the mapping are associated with the SHARED_MARKETING_TEST folder. Designer will prompt you to choose the correct shortcut folder that you created in the previous example, which points to SHARED_MARKETING_TEST (see image below). You can then continue the migration process until all mappings have been successfully migrated. In PowerCenter 7, you can export multiple objects into a single XML file, and then also import them at the same time.

The final step in the process is to migrate the workflows that use those mappings. Again, the object-level migration can be completed either through drag-and-drop or by using XML import/export. In either case, this process is very similar to the steps described above for migrating mappings, but differs in that the Workflow Manager provides a Workflow Copy Wizard to step you through the process. The following steps outline the full process for successfully copying a workflow and all of its associated tasks.

1. The wizard prompts for the name of the new workflow. If a workflow with the same name exists in the destination folder, the wizard prompts you to rename it or replace it. If no such workflow exists, a default name will be used. Then click "Next" to continue the copy process.

2. Next, the wizard checks whether each task in the workflow already exists in the destination folder (as shown below). If a task is present, you can rename or replace it; if it does not exist, the default name is used. Then click "Next."

3. Next, the wizard prompts you to select the mapping associated with each session task in the workflow. Select the mapping and continue by clicking "Next."

4. If connections exist in the target repository, the wizard will prompt you to select the connection to use for the source and target. If no connections exist, the default settings will be used. When this step is completed, click Finish and save the work.

Initial Migration – New Folders Created


The initial move to production is very different from subsequent changes to mappings and workflows. Since the repository only contains folders for development and test, we need to create two new folders to house the production-ready objects. These folders are created after testing of the objects in SHARED_MARKETING_TEST and MARKETING_TEST has been approved.

The following steps outline the creation of the production folders and, at the same time, address the initial test to production migration.

1. Open the PowerCenter Repository Manager client tool and log into the repository.

2. To make a shared folder for the production environment, highlight the SHARED_MARKETING_TEST folder, drag it, and drop it on the repository name.

The Copy Folder Wizard will appear and step you through the copying process.

The first wizard screen asks whether you want to use the typical folder copy options or the advanced options. In this example, you will be using the advanced options.


The second wizard screen prompts the user to enter a folder name. By default, the folder name that appears on this screen is the folder name followed by the date. In this case, enter the name as “SHARED_MARKETING_PROD.”

The third wizard screen prompts the user to select a folder to override. Because this is the first time you are transporting the folder, you won’t need to select anything.


The final screen begins the actual copy process. Click Finish when it is complete.

Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST folder as the original to copy and associate the shared objects with the SHARED_MARKETING_PROD folder that was just created.

At the end of the migration, you should have two additional folders in the repository environment for production: SHARED_MARKETING_PROD and MARKETING_PROD (as shown below). These folders contain the initially migrated objects. Before you can actually run the workflow in these production folders, you need to modify the session source and target connections to point to the production environment.

Incremental Migration – Object Copy Example

Now that the initial production migration is complete, let's take a look at how future changes will be migrated into the folder.


Any time an object is modified, it must be re-tested and migrated into production for the actual change to occur. These types of changes in production take place on a case-by-case or periodically scheduled basis. The following steps outline the process of moving these objects individually.

1. Log into PowerCenter Designer. Open the destination folder and expand the source folder. Click on the object to copy and drag-and-drop it into the appropriate workspace window.

2. Because this is a modification to an object that already exists in the destination folder, Designer will prompt you to choose whether to Rename or Replace the object (as shown below). Choose the option to replace the object.

3. Beginning with PowerCenter 7, you can choose to compare conflicts whenever migrating any object in Designer or Workflow Manager. By comparing the objects, you can ensure that the changes that you are making are what you intended. See below for an example of the mapping compare window.


4. After the object has been successfully copied, save the folder so the changes can take place.

5. The newly copied mapping is now tied to any sessions the replaced mapping was tied to.

6. Log into Workflow Manager and make the appropriate changes to the session or workflow so that it picks up the changes.

Standalone Repository Example

In this example, we look at moving development work to QA and then from QA to production, using a separate development folder for each developer, with the test and production folders divided by the data mart they represent. We focus solely on the MARKETING_DEV data mart, first explaining how to move objects and mappings from each individual folder to the test folder, and then how to move tasks, worklets, and workflows to the new area.

Follow these steps to copy a mapping from Development to QA:

1. If using shortcuts, first follow these sub-steps; if not using shortcuts, skip to step 2:

o Copy the tested objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder.

o Drag all of the newly copied objects from the SHARED_MARKETING_TEST folder to MARKETING_TEST.

o Save your changes.

2. Copy the mapping from Development into Test.

o In the PowerCenter Designer, open the MARKETING_TEST folder, and drag and drop the mapping from each development folder into the MARKETING_TEST folder.

o When copying each mapping, PowerCenter Designer will prompt you to Replace the object, Rename the object, Reuse the object, or Skip for each reusable object, such as source and target definitions. Choose to Reuse the object for all shared objects in the mappings copied into the MARKETING_TEST folder.

o Save your changes.

3. If a reusable session task is being used, follow these sub-steps. Otherwise, skip to step 4.

o In the PowerCenter Workflow Manager, open the MARKETING_TEST folder, and drag and drop each reusable session from the developers' folders into the MARKETING_TEST folder. A Copy Session Wizard will step you through the copying process.

o Open each newly copied session and click on the Source tab. Change the source to point to the source database for the Test environment.

o Click the Target tab. Change each connection to point to the target database for the Test environment. Be sure to double-check the workspace from within the Target tab to ensure that the load options are correct.

o Save your changes.

4. While the MARKETING_TEST folder is still open, copy each workflow from Development to Test.

o Drag each workflow from the development folders into the MARKETING_TEST folder. The Copy Workflow Wizard will appear. Follow the same steps listed above to copy the workflow to the new folder.

o As mentioned above, the copy wizard now allows conflicts to be compared from within Workflow Manager to ensure that the correct migrations are being made.

o Save your changes.

5. Implement the appropriate security.

o In Development, the owner of the folders should be a user in the development group.

o In Test, change the owner of the Test folder to a user in the Test group.

o In Production, change the owner of the folders to a user in the Production group.

o Revoke all rights to Public other than Read for the Production folders.

Disadvantages of a Single Repository Environment

The most significant disadvantage of a single repository environment is performance. Having a development, QA, and production environment within a single repository can cause degradation in production performance as the production environment shares CPU and memory resources with the development and test environments. Although these environments are stored in separate folders, they all reside within the same database table space and on the same server.

For example, if development or test loads are running simultaneously with production loads, the server machine may reach 100 percent utilization and production performance will suffer.


A single repository structure also can create more confusion as the same users and groups exist in all environments and the number of folders could exponentially increase.

DISTRIBUTED REPOSITORY ENVIRONMENT

A distributed repository environment maintains separate, independent repositories, hardware, and software for development, test, and production environments. Separating repository environments is preferable for handling development to production migrations. Because the environments are segregated from one another, work performed in development cannot impact QA or production.

With a fully distributed approach, separate repositories function much like the separate folders in a standalone environment. Each repository has a similar name, like the folders in the standalone environment. For instance, in our Marketing example we would have three repositories, INFADEV, INFATEST, and INFAPROD. In the following example, we discuss a distributed repository architecture.

There are four techniques for migrating from development to production in a distributed repository architecture, with each involving some advantages and disadvantages. In the following pages, we discuss each of the migration options:

• Repository Copy

• Folder Copy

• Object Copy

• Deployment Groups

Repository Copy

So far, this document has covered object-level migrations and folder migrations through drag-and-drop object copying and through object XML import/export. This section of the document will cover migrations in a distributed repository environment through repository copies.


The main advantages of this approach are:

• The ability to copy all objects (mappings, workflows, mapplets, reusable transformations, etc.) at once from one environment to another.

• The ability to automate this process using pmrep commands. This eliminates much of the manual processes that users typically perform.

• Everything can be moved without breaking or corrupting any of the objects.

This approach also involves a few disadvantages.

• The first is that everything is moved at once (which is also an advantage). The problem with this is that everything is moved, ready or not. For example, we may have 50 mappings in QA, but only 40 of them are production-ready. The 10 untested mappings are moved into production along with the 40 production-ready mappings.

• This leads to the second disadvantage, the maintenance required to remove any unwanted or excess objects.

• Another disadvantage is the need to adjust server variables, sequences, parameters/variables, database connections, etc. Everything must be set up correctly before the actual production runs can take place.

• Lastly, the repository copy process requires that the existing Production repository be deleted, and then the Test repository can be copied. This results in a loss of production environment operational metadata such as load statuses, session run times, etc. High performance organizations leverage the value of operational metadata to track trends over time related to load success/failure and duration. This metadata can be a competitive advantage for organizations that use this information to plan for future growth.

Now that we've discussed the advantages and disadvantages, we will look at three ways to accomplish the Repository Copy method:

• Copying the Repository

• Repository Backup and Restore

• PMREP

Copying the Repository

Copying the Test repository to Production through the GUI client tools is the easiest of all the migration methods. The task is very simple. First, ensure that all users are logged out of the destination repository, then open the PowerCenter Repository Administration Console client tool (as shown below).


1. If the Production repository already exists, you must delete the repository before you can copy the Test repository. Before you can delete the repository, you must stop it. Right-click the Production Repository and choose “Stop.” You can delete the Production repository by selecting it and choosing “Delete” from the context menu. You will want to actually delete the repository, not just remove it from the server cache.

2. Now, create the Production repository connection by highlighting the Repositories folder in the Navigator view and choosing "New Repository." Enter the connection information for the Production repository. Make sure to choose the "Do not create any content" option.


3. Right-click the Production Repository and choose “All Tasks” -> “Copy from.”

4. In the dialog window, choose the name of the Test repository from the drop down menu. Enter the username and password of the Test repository.


5. Click OK and the copy process will begin.

6. When you've successfully copied the repository to the new location, exit from the Repository Server Administration Console.

7. In the Repository Manager, double-click on the newly copied repository and log in with a valid username and password.

8. Verify connectivity, then highlight each folder individually and rename them. For example, rename the MARKETING_TEST folder to MARKETING_PROD, and the SHARED_MARKETING_TEST folder to SHARED_MARKETING_PROD.

9. Be sure to remove all objects that are not pertinent to the Production environment from the folders before beginning the actual testing process.

10. When this cleanup is finished, you can log into the repository through the Workflow Manager. Modify the server information and all connections so they are updated to point to the new Production locations for all existing tasks and workflows.

Repository Backup and Restore

Backup and Restore Repository is another simple method of copying an entire repository. This process backs up the repository to a binary file that can be restored to any new location. This method is preferable to the repository copy process because the backup file persists on the repository server; if any type of error occurs during the migration, the repository can simply be restored again from that file.

The following steps outline the process of backing up and restoring the repository for migration.

1. Launch the PowerCenter Repository Server Administration Console client, connect to the repository server, and highlight the Test repository.

2. Select Action -> All Tasks -> Backup from the menu. A screen will appear and prompt you to supply a name for the backup file as well as the Administrator username and password. The file will be saved to the Backup directory within the repository server’s home directory.


3. After you've selected the location and file name, click the OK button and the backup process will begin.

The backup process will create a .rep file containing all repository information. Stay logged into the Manage Repositories screen. When the backup is complete, select the repository connection to which the backup will be restored (the Production repository), or create the connection if it does not already exist. Follow these steps to complete the repository restore:

1. Right-click the destination repository and choose "All Tasks" -> "Restore."


2. The system will prompt you to supply a username, password, and the name of the file to be restored. Enter the appropriate information and click the Restore button.

When the restoration process is complete, you must repeat the steps listed in the copy repository option in order to delete all of the unused objects and rename the folders.

PMREP

Using the PMREP commands is essentially the same as the Backup and Restore Repository method except that it is run from the command line rather than through the GUI client tools. PMREP utilities can be used from the Informatica Server or from any client machine connected to the server.

Refer to the Repository Manager Guide for a list of PMREP commands.

The following is a sample of the command syntax used within a Windows batch file to connect to and backup a repository. Using the code example below as a model, you can write scripts to be run on a daily basis to perform functions such as connect, backup, restore, etc:

backupproduction.bat

@echo off
REM This batch file uses pmrep to connect to and back up the repository Production on the server Central
echo Connecting to Production repository...
"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep" connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
echo Backing up Production repository...
"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep" backup -o c:\backup\Production_backup.rep

Post-Repository Migration Cleanup

After you have used one of the repository migration procedures described above to migrate into Production, follow these steps to convert the repository to Production:

1. Disable workflows that are not ready for Production or simply delete the mappings, tasks, and workflows.

o Disable the workflows not being used in the Workflow Manager by opening the workflow properties, and then checking the Disabled checkbox under the General tab.

o Delete the tasks not being used in the Workflow Manager and the mappings in the Designer.

2. Modify the database connection strings to point to the production sources and targets.

o In the Workflow Manager, select Relational connections from the Connections menu.

o Edit each relational connection by changing the connect string to point to the production sources and targets.

o If using lookup transformations in the mappings and the connect string is anything other than $SOURCE or $TARGET, then the connect string will need to be modified appropriately.

3. Modify the pre- and post-session commands and SQL as necessary.

o In the Workflow Manager, open the session task properties, and from the Components tab make the required changes to the pre- and post-session scripts.

4. Implement appropriate security, such as:

o In Development, ensure that the owner of the folders is a user in the development group.

o In Test, change the owner of the test folders to a user in the test group.

o In Production, change the owner of the folders to a user in the production group.

o Revoke all rights to Public other than Read for the Production folders.

FOLDER COPY

Although deployment groups are becoming a very popular migration method, the folder copy method has historically been the most popular way to migrate in a distributed environment. Copying an entire folder allows you to quickly promote all of the objects located within that folder. All source and target objects, reusable transformations, mapplets, mappings, tasks, worklets and workflows are promoted at once. Because of this, however, everything in the folder must be ready to migrate forward. If some mappings or workflows are not valid, then developers (or the Repository Administrator) must manually delete these mappings or workflows from the new folder after the folder is copied.

The following examples step through a sample folder copy process using three separate repositories (one each for Development, Test, and Production) and using two repositories (one for development and test, one for production).

The three advantages of using the folder copy method are:

• The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an entire folder and all the objects located within it.

• If the project uses a common or shared folder and this folder is copied first, then all shortcut relationships are automatically converted to point to this newly copied common or shared folder.

• All connections, sequences, mapping variables, and workflow variables are copied automatically.

The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is being performed. Therefore, it is necessary to schedule this migration task during a time when the repository is least utilized. Please keep in mind that a locked repository means that NO jobs can be launched during this process. This can be a serious consideration in real-time or near real-time environments.

The following example steps through the process of copying folders from each of the different environments. The first example uses three separate repositories for development, test, and production.

1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:

o Open the Repository Manager client tool.

o Connect to both the Development and Test repositories.

o Highlight the folder to copy and drag it to the Test repository.

o The Copy Folder Wizard will appear and step you through the copy process.

o When the folder copy process is complete, open the newly copied folder in both the Repository Manager and Designer to ensure that the objects were copied properly.

2. Copy the Development folder to Test.

• If you skipped step 1, follow these sub-steps:

o Open the Repository Manager client tool.

o Connect to both the Development and Test repositories.

• Highlight the folder to copy and drag it to the Test repository.

• The Copy Folder Wizard will appear.

• Follow these steps to ensure that all shortcuts are reconnected:

o Use the advanced options when copying the folder across.

o Select Next to use the default name of the folder.

• If the folder already exists in the destination repository, choose to replace the folder.

• The following screen will appear prompting you to select the folder where the new shortcuts are located.

• If the folder names do not match, a folder compare will take place. The Copy Folder Wizard will then complete the folder copy process. Rename the folder to the appropriate name and implement the security.

3. When testing is complete, repeat the steps above to migrate to the Production repository.

When the folder copy process is complete, log onto the Workflow Manager and change the connections to point to the appropriate target location. Ensure that all tasks updated correctly and that folder and repository security is modified for test and production.

Object Copy

Copying mappings into the next stage in a networked environment involves many of the same advantages and disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the networked environment. For additional information, see the earlier description of Object Copy for the standalone environment.

One advantage of Object Copy in a distributed environment is that it provides more granular control over objects.

Two distinct disadvantages of Object Copy in a distributed environment are:

• Much more work to deploy an entire group of objects

• Shortcuts must exist prior to importing/copying mappings

Below are the steps to complete an object copy in a distributed repository environment:

1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:

o In each of the distributed repositories, create a common folder with the exact same name and case.

o Copy the shortcuts into the common folder in Production, making sure the shortcut has the exact same name.

2. Copy the mapping from the Test environment into Production.

o In the Designer, connect to both the Test and Production repositories and open the appropriate folders in each.

o Drag-and-drop the mapping from Test into Production.

o During the mapping copy, PowerCenter 7 allows a comparison of this mapping to an existing copy of the mapping already in Production. Also, in PowerCenter 7, the ability to compare objects is not limited to mappings, but is available for all repository objects including workflows, sessions, and tasks.

3. Create or copy a workflow with the corresponding session task in the Workflow Manager to run the mapping (first ensure that the mapping exists in the current repository).

o If copying the workflow, follow the Copy Wizard.

o If creating the workflow, add a session task that points to the mapping and enter all the appropriate information.

4. Implement appropriate security.

o In Development, ensure the owner of the folders is a user in the development group.

o In Test, change the owner of the test folders to a user in the test group.

o In Production, change the owner of the folders to a user in the production group.

o Revoke all rights to Public other than Read for the Production folders.

Deployment Groups

For versioned repositories, the use of Deployment Groups for migrations between distributed environments allows the most flexibility and convenience. With Deployment Groups, the user has the flexibility of migrating individual objects, as in an object copy migration, but also the convenience of a repository- or folder-level migration, since all objects are deployed at once. The objects included in a deployment group have no restrictions and can come from one or multiple folders. Additionally, a user can set up a dynamic deployment group, which allows the objects in the deployment group to be defined by a repository query rather than being added to the deployment group manually. Lastly, since deployment groups are available on versioned repositories, a deployment can be rolled back when necessary, reverting to the previous versions of the objects.

Creating a Deployment Group

Below are the steps to create a deployment group:

1. Launch the Repository Manager client tool and log in to the source repository.

2. Expand the repository, right-click on "Deployment Groups" and choose "New Group."


3. In the dialog window, give the deployment group a name, and choose whether it should be static or dynamic. In this example, we are creating a static deployment group. Choose “OK.”

Adding Objects to a Static Deployment Group

Below are the steps to add objects to a static deployment group:


1. In Designer, Workflow Manager, or Repository Manager, right-click an object that you want to add to the deployment group and choose "Versioning" -> "View History." The "View History" window will be displayed.

2. In the “View History” window, right-click the object and choose “Add to Deployment Group.”


3. In the Deployment Group dialog window, choose the deployment group that you want to add the object to, and choose “OK.”

4. In the final dialog window, choose whether you want to add dependent objects. In most cases, you will want to add dependent objects to the deployment group so that they will be migrated as well. Choose “OK.”


NOTE: The "All Dependencies" option should be used for any new code that is migrating forward. However, it can cause issues when moving existing code forward, because "All Dependencies" also flags Shortcuts. During the deployment, Informatica will try to re-insert or replace the Shortcuts; this does not work and will cause your deployment to fail.

5. The object will be added to the deployment group at this time.

Although the deployment group allows the most flexibility at this time, the task of adding each object to the deployment group is similar to the effort required for an object copy migration. To make deployment groups easier to use, PowerCenter allows the capability to create dynamic deployment groups.

Adding Objects to a Dynamic Deployment Group

Dynamic deployment groups are similar to static deployment groups in their function, but differ in how objects are added to the deployment group. In a static deployment group, objects are manually added to the deployment group one by one. In a dynamic deployment group, the contents of the deployment group are defined by a repository query. Don't worry about the complexity of writing a repository query; it is quite simple and aided by the PowerCenter GUI interface.

Below are the steps to add objects to a dynamic deployment group:

1. First, create a deployment group, just as you did for a static deployment group, but in this case, choose the dynamic option. Also, select the “Queries” button.


2. The “Query Browser” window is displayed. Choose “New” to create a query for the dynamic deployment group.

3. In the Query Editor window, provide a name and query type (Shared). Define criteria for the objects that should be migrated. The drop down list of parameters allows a user to choose from 23 predefined metadata categories. In this case, the developers have assigned the “RELEASE_20050130” label to all objects that need to be migrated, so the query is defined as “Label Is Equal To ‘RELEASE_20050130’”. The creation and application of labels are discussed in a separate Velocity Best Practice.


4. Save the Query and exit the Query Editor. Choose "OK" on the Query Browser window, and choose "OK" on the Deployment Group editor window.

Executing a Deployment Group Migration

A Deployment Group migration can be executed through the Repository Manager client tool, or through the pmrep command line utility. In the client tool, a user simply drags the deployment group from the source repository and drops it on the destination repository. This prompts the Copy Deployment Group wizard which will walk a user through the step-by-step options for executing the deployment group.

Rolling back a Deployment

In order to roll back a deployment, locate the deployment via the target repository's menu bar: Deployments -> History -> View History -> Rollback button.

Automated Deployments

For the optimal migration method, users can set up a UNIX shell or Windows batch script that calls the pmrep DeployDeploymentGroup command, which executes a deployment group migration without human interaction. This is ideal because the deployment group offers the most flexibility and convenience, and the script can be scheduled to run overnight, with minimal impact on developers and the PowerCenter administrator. You can also use the pmrep utility to automate importing objects via XML.
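
As an illustration of what such a script might look like, the sketch below shells out to pmrep to connect to the source (Test) repository and execute a deployment group copy into Production. The executable path, connection details, group and control file names, and especially the DeployDeploymentGroup option letters shown here are assumptions; verify the exact syntax against the pmrep command line documentation for your PowerCenter version.

# Illustrative sketch of an automated deployment group migration.
# The pmrep path, repository names, deployment group name, and option
# letters are assumptions; confirm them against the pmrep reference
# for your PowerCenter release before relying on this.
import subprocess

PMREP = r"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep.exe"

def run(args: list[str]) -> None:
    """Run one pmrep command and stop the script if it fails."""
    subprocess.run([PMREP, *args], check=True)

# Connect to the versioned source repository, as in the backup example earlier.
run(["connect", "-r", "INFATEST", "-n", "Administrator", "-x", "Adminpwd",
     "-h", "infarepserver", "-o", "7001"])

# Deploy the group into the target repository. The -p (deployment group name),
# -c (deployment control file), and -r (target repository) options are the
# assumed syntax; the group and control file names are hypothetical.
run(["deploydeploymentgroup", "-p", "RELEASE_20050130_GROUP",
     "-c", r"c:\deploy\release_20050130_control.xml", "-r", "INFAPROD"])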

Recommendations


Informatica recommends using the following process when running in a three-tiered environment with Development, Test, and Production servers.

Non-Versioned Repositories

For migrating from Development into Test, Informatica recommends using the Object Copy method. This method gives you total granular control over the objects that are being moved. It also ensures that the latest Development mappings can be moved over manually as they are completed. For recommendations on performing this copy procedure correctly, see the steps listed in the Object Copy section.

Versioned Repositories

For versioned repositories, Informatica recommends using the Deployment Groups method for repository migration in a distributed repository environment. This method provides the greatest level of flexibility in that you can promote any object from within a Development repository (even across folders) into any destination repository. Also, by using labels, dynamic deployment groups, and the enhanced pmrep command line utility, the deployment group migration method can result in automated migrations that execute without manual intervention.

THIRD PARTY VERSIONING

Some organizations have standardized on a third party version control software package. In these cases, PowerCenter’s XML import/export functionality offers integration with those software packages and provides a means to migrate objects. This method is most useful in a distributed environment because objects can be exported into an XML file from one repository and imported into the destination repository.

The XML Object Copy Process allows you to copy nearly all repository objects, including sources, targets, reusable transformations, mappings, mapplets, workflows, worklets, and tasks. Beginning with PowerCenter 7, the export/import functionality was enhanced to allow the export/import of multiple objects to a single XML file. This can significantly cut down on the work associated with object level XML import/export.

The following steps outline the process of exporting the objects from source repository and importing them into the destination repository:

EXPORTING

1. From Designer or Workflow Manager, login to the source repository. Open the folder and highlight the object to be exported.

2. Select Repository -> Export Objects.

3. The system will prompt you to select a directory location on the local workstation. Choose the directory to save the file. Using the default name for the XML file is generally recommended.

4. Open Windows Explorer and go to the C:\Program Files\Informatica PowerCenter 7.x\Client directory. (This may vary depending on where you installed the client tools.)

5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the directory where you saved the XML file.

6. Together, these files are now ready to be added to the version control software.

IMPORTING

1. In Designer or Workflow Manager, log in to the destination repository. Open the folder where the object is to be imported.

2. Select Repository -> Import Objects.

3. The system will prompt you to select a directory location and file to import into the repository.

4. The following screen will appear with the steps for importing the object.

• Select the mapping and add it to the Objects to Import list.

• Click Next, and then click the Import button. Since the shortcuts have been added to the folder, the mapping will now point to the new shortcuts and their parent folder.

It is important to note that the pmrep command line utility has been greatly enhanced in PowerCenter 7 such that the activities associated with XML import/export can be automated through pmrep.
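
For example, a scheduled job could export an object to XML from the Test repository and import it into Production without opening the client tools. The sketch below assumes the ObjectExport/ObjectImport commands and option letters as documented for PowerCenter 7; the object, folder, and control file names are hypothetical, so verify the syntax against the pmrep reference for your version.

# Illustrative sketch of automating XML export/import with pmrep.
# All repository, folder, object, and file names, and the command option
# letters, are assumptions; check the pmrep reference for your version.
import subprocess

PMREP = r"C:\Program Files\Informatica PowerCenter 7.1.1\RepositoryServer\bin\pmrep.exe"

def pmrep(*args: str) -> None:
    """Run one pmrep command and stop the script if it fails."""
    subprocess.run([PMREP, *args], check=True)

# Export one mapping from the Test repository to an XML file.
pmrep("connect", "-r", "INFATEST", "-n", "Administrator", "-x", "Adminpwd",
      "-h", "infarepserver", "-o", "7001")
pmrep("objectexport", "-n", "m_load_customer_dim", "-o", "mapping",
      "-f", "MARKETING_TEST", "-u", r"c:\export\m_load_customer_dim.xml")

# Import it into the Production repository, driven by an import control file
# that maps folders and resolves replace/reuse conflicts.
pmrep("connect", "-r", "INFAPROD", "-n", "Administrator", "-x", "Adminpwd",
      "-h", "infarepserver", "-o", "7001")
pmrep("objectimport", "-i", r"c:\export\m_load_customer_dim.xml",
      "-c", r"c:\export\import_control.xml")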


Data Cleansing

Challenge

Accuracy is one of the biggest obstacles to the success of many data warehousing projects. If users discover data inconsistencies, they may lose faith in the entire warehouse's data. However, it is not unusual to discover that as many as half the records in a database contain some type of information that is incomplete, inconsistent, or incorrect. The challenge is, therefore, to cleanse data online, at the point of entry into the data warehouse or operational data store (ODS), to ensure that the warehouse/ODS provides consistent and accurate data for business decision-making.

A significant portion of time in the development process should be set aside for setting up the data quality assurance process and implementing whatever data cleansing is needed. In a production environment, data quality reports should be generated after each data warehouse implementation or when new source systems are added to the integrated environment. There should also be provision for rolling back if data quality testing indicates that the data is unacceptable.

Description

Informatica has several partners in the data-cleansing arena. Rapid implementation, tight integration, and a fast learning curve are the key differentiators for picking the right data-cleansing tool for your project.

Informatica's data-quality partners provide quick-start templates for standardizing, address correcting, and matching records for best-build logic, which can be tuned to your business rules. Matching and consolidation are the most crucial steps in guaranteeing a single view of a subject area (e.g., Customers) so everyone in the enterprise can make better business decisions.

Concepts

Following is a list of steps to organize and implement a good data quality strategy. These data quality concepts provide a foundation that helps to develop a clear picture of the subject data, which can improve both efficiency and effectiveness.

Parsing – the process of extracting individual elements within the records, files, or data entry forms to check the structure and content of each field. For example, name, title, company name, phone number, and SSN.


Correction – the process of correcting data using sophisticated algorithms and secondary data sources to check and validate information. For example, validating addresses against postal directories.

Standardize – arranging information in a consistent manner and preferred format. Examples include removal of dashes from phone numbers or SSN.

Enhancement – adding useful, but optional, information to supplement existing data or complete data. Examples may include sales volume, number of employees for a given business.

Matching – once high-quality records exist, eliminating any redundancies. Use match standards and specific business rules to identify records that may refer to the same customer.

Consolidation – using the data found during matching to combine all of the similar data into a single consolidated view. Examples include building a best record or master record, or householding.

Partners

Following is a list of data quality partners and their respective tools:

DataMentors - Provides tools that are run before the data extraction and load process to clean source data. Available tools are:

• DMDataFuseTM - a data cleansing and house-holding system with the power to accurately standardize and match data.

• DMValiDataTM - an effective data analysis system that profiles and identifies inconsistencies between data and metadata.

• DMUtils - a powerful non-compiled scripting language that operates on flat ASCII or delimited files. It is primarily used as a query and reporting tool. Additionally, it also provides a way to reformat and summarize files.

FirstLogic - FirstLogic offers direct interfaces to PowerCenter during the extract and load process, as well as pre-data-extraction data cleansing tools such as DataRight, ACE (address correction and enhancement), and Match and Consolidate (formerly Merge/Purge). The data cleansing interfaces are implemented as transformation components, using PowerCenter External Procedure or Advanced External Procedure calls. Thus, these transformations can be dragged and dropped seamlessly into a PowerCenter mapping for parsing, standardization, cleansing, enhancement, and matching of the name, business, and address information during the PowerCenter ETL process of building a data mart or data warehouse.

Paladyne - The flagship product, Datagration, is an open, flexible data quality system that can repair any type of data (in addition to name and address data) by incorporating custom business rules and logic. Datagration's Data Discovery Message Gateway feature assesses data cleansing requirements using automated data discovery tools that identify data patterns. Data Discovery enables Datagration to search through a field of free-form data and re-arrange the tokens (i.e., words, data elements) into a logical order. Datagration supports relational database systems and flat files as data sources and any application that runs in batch mode.

Trillium - Trillium's eQuality customer information components (a web enabled tool) are integrated with the PowerCenter Transformation Exchange modules and reside on the same server as the PowerCenter transformation engine. As a result, Informatica users can invoke Trillium's four data quality components through an easy-to-use graphical desktop object. The four components are:

• Converter: data analysis and investigation module for discovering word patterns and phrases within free-form text.

• Parser: processing engine for data cleansing, elementizing, and standardizing customer data.

• Geocoder: an internationally-certified postal and census module for address verification and standardization.

• Matcher: a module designed for relationship matching and record linking.

Innovative Systems - The i/Lytics™ solution operates within PowerMart and PowerCenter version 6.x to provide smooth, seamless project flow. Using its unique knowledgebase of more than three million words and word patterns, i/Lytics cleanses, standardizes, links, and households customer data to create a complete and accurate customer profile each time a record is added or updated.

iORMYX International Inc. - iORMYX's df Informatica Adapter™ allows you to use Dataflux data quality capabilities directly within PowerCenter and PowerMart using Advanced External Procedures. Within PowerCenter, you can easily drag Dataflux transformations into the workflows. Additionally, by utilizing the data profiling capabilities of Dataflux, you can design ETL workflows that are successful from the start and build targeted, data-specific business rules that enhance data quality from within PowerCenter. The integrated solution significantly improves the accuracy and effectiveness of business intelligence and enterprise systems by providing standardized and accurate data.

Integration Examples

The following sections describe how to integrate two of the tools with PowerCenter.

FirstLogic - ACE

The following graphic illustrates a high level flow diagram of the data cleansing process.


Use the Informatica Advanced External Transformation process to interface with the FirstLogic module by creating a "Matching Link" transformation. That process uses the Informatica Transformation Developer to create a new Advanced External Transformation, which incorporates the properties of the FirstLogic Matching Link files. Once a Matching Link transformation has been created in the Transformation Developer, users can incorporate that transformation into any of their project mappings; it's reusable from the repository.

When a PowerCenter session starts, the transformation is initialized. The initialization sets up the address processing options, allocates memory, and opens the files for processing. This operation is only performed once. As each record is passed into the transformation, it is parsed and standardized. Any output components are created and passed to the next transformation. When the session ends, the transformation is terminated. The memory is once again available and the directory files are closed.

The available functions / processes are as follows.

ACE Processing

There are four ACE transformations to choose from. Three base transformations parse, standardize, and append address components using FirstLogic's ACE Library. The transformation choice depends on the input record layout. The fourth transformation can provide optional components. This transformation must be attached to one of the three base transformations.

The four transformations are:

1. ACE_discrete - where the input address data is presented in discrete fields.
2. ACE_multiline - where the input address data is presented in multiple lines (1-6).
3. ACE_mixed - where the input data is presented with discrete city/state/zip and multiple address lines (1-6).
4. Optional transform - which is attached to one of the three base transforms and outputs the additional components of ACE for enhancement.

All records input into the ACE transformation are returned as output. ACE returns Error/Status Code information during the processing of each address. This allows the end user to invoke additional rules before the final load has completed.

TrueName Process

TrueName mirrors the ACE base transformations with discrete, multi-line, and mixed transformations. A fourth and optional transformation available in this process can be attached to one of the three base transformations to provide genderization and match standards enhancements. TrueName generates error and status codes. Similar to ACE, all records entered as input into the TrueName transformation can be used as output.

Matching Process

The matching process works through one transformation within the Informatica architecture. The input data is read into the PowerCenter data flow similar to a batch file. All records are read, the break groups are created and, in the last step, matches are identified. Users set up their own matching transformation through the PowerCenter Designer by creating an advanced external procedure transformation. Users can select which records are output from the matching transformations by editing the initialization properties of the transformation.

All matching routines are predefined and, if necessary, the configuration files can be accessed for additional tuning. The five predefined matching scenarios are: individual, family, household (the only difference between household and family is that household doesn't match on last name), firm individual, and firm. Keep in mind that the matching does not do any data parsing; this must be accomplished prior to using this transformation. As with ACE and TrueName, error and status codes are reported.

Trillium

Integration to Trillium's data cleansing software is achieved through the Informatica Trillium Advanced External Procedures (AEP) interface.

The AEP modules incorporate the following Trillium functional components.

• Trillium Converter - The Trillium Converter facilitates data conversion such as EBCDIC to ASCII, integer to character, character length modification, literal constant, and increasing values. It can also be used to create unique record identifiers, omit unwanted punctuation, or translate strings based on actual data or mask values. A user-customizable parameter file drives the conversion process. The Trillium Converter is a separate transformation that can be used standalone or in conjunction with the Trillium Parser module.

• Trillium Parser - The Trillium Parser identifies and/or verifies the components of free-floating or fixed field name and address data. The primary function of the Parser is to partition the input address records into manageable components in preparation for postal and census geocoding. The parsing process is highly table driven to allow for customization of name and address identification to specific requirements.

• Trillium Postal Geocoder - The Trillium Postal Geocoder matches an address database to the ZIP+4 database of the U.S. Postal Service (USPS).

• Trillium Census Geocoder - The Trillium Census Geocoder matches the address database to U.S. Census Bureau information.

Each record that passes through the Trillium Parser external module is first parsed then, optionally postal geocoded and census geocoded. The level of geocoding performed is determined by a user-definable initialization property.

• Trillium Window Matcher - The Trillium Window Matcher allows the PowerCenter Server to invoke Trillium's de-duplication and house holding functionality. The Window Matcher is a flexible tool designed to compare records to determine the level of similarity between them. The result of the comparisons is considered a passed, a suspect, or a failed match depending upon the likeness of data elements in each record, as well as a scoring of their exceptions.

Input to the Trillium Window Matcher transformation is typically the sorted output of the Trillium Parser transformation. Another method to obtain sorted information is to use the Sorter transformation, which became available in the PowerCenter 6.0 release. Other options for sorting include:

• Using the Informatica Aggregator transformation as a sort engine.
• Separating the mappings whenever a sort is required. The sort can be run as a pre- or post-session command between mappings (see the example command after this list). Pre- and post-session commands are configured in the Workflow Manager.
• Building a custom AEP transformation to include in the mapping.
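As a minimal sketch of the second option above, a pre- or post-session command can call the operating system's sort utility on the flat file produced by the first mapping; the file path, delimiter, and key position below are placeholders rather than values from this Best Practice:

sort -t ',' -k 1,1 /staging/parsed_names.dat -o /staging/parsed_names_sorted.dat

The sorted file then becomes the source for the mapping that feeds the Trillium Window Matcher transformation.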


Data Connectivity using PowerCenter Connect for BW Integration Server

Challenge

Understanding how to use PowerCenter Connect for SAP BW to load data into the SAP BW (Business Information Warehouse).

Description

The PowerCenter Connect for SAP BW supports the SAP Business Information Warehouse as both a source and target.

Extracting Data from BW

PowerCenter Connect for SAP BW lets you extract data from SAP BW to use as a source in a PowerCenter session. PowerCenter Connect for SAP BW integrates with the Open Hub Service (OHS), SAP’s framework for extracting data from BW. OHS uses data from multiple BW data sources, including SAP's InfoSources and InfoCubes. The OHS framework includes InfoSpoke programs, which extract data from BW and write the output to SAP transparent tables.

Loading Data into BW

PowerCenter Connect for SAP BW lets you import BW target definitions into the Designer and use the target in a mapping to load data into BW. PowerCenter Connect for SAP BW uses the Business Application Programming Interface (BAPI) to exchange metadata and load data into BW.

PowerCenter can use SAP’s business content framework to provide a high-volume data warehousing solution, or it can use BAPI, SAP’s strategic technology for linking components into the Business Framework, to exchange metadata with BW.

PowerCenter extracts and transforms data from multiple sources and uses SAP’s high-speed bulk BAPIs to load the data into BW, where it is integrated with industry-specific models for analysis through the SAP Business Explorer tool.

Using PowerCenter with PowerCenter Connect to Populate BW


The following paragraphs summarize some of the key differences in using PowerCenter with the PowerCenter Connect to populate a SAP BW rather than working with standard RDBMS sources and targets.

• BW uses a pull model. The BW must request data from a source system before the source system can send data to the BW. PowerCenter must first register with the BW using SAP’s Remote Function Call (RFC) protocol.

• The native interface to communicate with BW is the Staging BAPI, an API published and supported by SAP. Three components of the PowerCenter product suite use this API. The PowerCenter Designer uses the Staging BAPI to import metadata for the target transfer structures. The PowerCenter Integration Server for BW uses the Staging BAPI to register with BW and receive requests to run sessions. The PowerCenter Server uses the Staging BAPI to perform metadata verification and load data into BW.

• Programs communicating with BW use the SAP standard saprfc.ini file to communicate with BW. The saprfc.ini file is similar to the tnsnames file in Oracle or the interfaces file in Sybase. The PowerCenter Designer reads metadata from BW and the PowerCenter Server writes data to BW.

• BW requires that all metadata extensions be defined in the BW Administrator Workbench. The definition must be imported to Designer. An active structure is the target for PowerCenter mappings loading BW.

• Because of the pull model, BW must control all scheduling. BW invokes the PowerCenter session when the InfoPackage is scheduled to run in BW.

• BW only supports insertion of data into BW. There is no concept of update or deletes through the staging BAPI.

Steps for Extracting Data from BW

The process of extracting data from SAP BW is quite similar to extracting data from SAP. Similar transports are used on the SAP side, and data type support is the same as that supported for SAP PowerCenter Connect.

The steps required for extracting data are:

1. Create an InfoSpoke. Create an InfoSpoke in the BW to extract the data from the BW database and write it to either a database table or a file output target.

2. Import the ABAP program. Import the ABAP program Informatica provides that calls the workflow created in the Workflow Manager.

3. Create a mapping. Create a mapping in the Designer that uses the database table or file output target as a source.

4. Create a workflow to extract data from BW. Create a workflow and session task to automate data extraction from BW.

5. Create a Process Chain. A BW Process Chain links programs together to run in sequence. Create a Process Chain to link the InfoSpoke and ABAP programs together.

6. Schedule the data extraction from BW. Set up a schedule in BW to automate data extraction.

Steps To Load Data into BW

1. Install and Configure PowerCenter Components.


The installation of the PowerCenter Connect for BW includes a client and a server component. The Connect server must be installed in the same directory as the PowerCenter Server. Informatica recommends installing Connect client tools in the same directory as the PowerCenter Client. For more details on installation and configuration refer to the PowerCenter and the PowerCenter Connect installation guides.

2. Build the BW Components.

To load data into BW, you must build components in both BW and PowerCenter. You must first build the BW components in the Administrator Workbench:

• Define PowerCenter as a source system to BW. BW requires an external source definition for all non-R/3 sources.

• Create an InfoSource. The InfoSource represents a provider structure. Create the InfoSource in the BW Administrator Workbench and import the definition into the PowerCenter Warehouse Designer.

• Assign the InfoSource to the PowerCenter source system. After you create an InfoSource, assign it to the PowerCenter source system.

• Activate the InfoSource. When you activate the InfoSource, you activate the InfoObjects and the transfer rules.

3. Configure the saprfc.ini file.

This file is required for PowerCenter and the Connect for BW server to connect to BW.

PowerCenter uses two types of entries to connect to BW through the saprfc.ini file:

• Type A. Used by PowerCenter Client and PowerCenter Server. Specifies the BW application server.

• Type R. Used by the PowerCenter Connect for BW. Specifies the external program, which is registered at the SAP gateway.

Do not use Notepad to edit the saprfc.ini file because Notepad can corrupt the file. Set the RFC_INI environment variable on all Windows NT, Windows 2000, and Windows 95/98 machines that use an saprfc.ini file; RFC_INI is used to locate the saprfc.ini file.
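A minimal sketch of the two entry types described above is shown below; the destination names, host names, program ID, and system number are placeholders, so confirm the exact keys and values against the PowerCenter Connect for SAP BW installation guide and your BW administrator.

Type A entry (PowerCenter Client and Server):

DEST=BWSERVER
TYPE=A
ASHOST=bw_app_server
SYSNR=00

Type R entry (Connect for BW server):

DEST=PMBW
TYPE=R
PROGID=PID_PMBW
GWHOST=bw_gateway_host
GWSERV=sapgw00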

4. Start the Connect for BW server

Start the Connect for BW server only after you start the PowerCenter Server and before you create an InfoPackage in BW.

5. Build mappings

Import the InfoSource into the PowerCenter repository and build a mapping using the InfoSource as a target.

The following restrictions apply to building mappings with BW InfoSource target:


• You cannot use BW as a lookup table.
• You can use only one transfer structure for each mapping.
• You cannot execute stored procedures in a BW target.
• You cannot partition pipelines with a BW target.
• You cannot copy fields that are prefaced with /BIC/ from the InfoSource definition into other transformations.
• You cannot build an update strategy in a mapping. BW supports only inserts; it does not support updates or deletes. You can use an Update Strategy transformation in a mapping, but the Connect for BW Server attempts to insert all records, even those marked for update or delete.

6. Load data

To load data into BW from PowerCenter, both PowerCenter and the BW system must be configured.

Use the following steps to load data into BW:

• Configure a workflow to load data into BW. Create a session in a workflow that uses a mapping with an InfoSource target definition.

• Create and schedule an InfoPackage. The InfoPackage associates the PowerCenter session with the InfoSource.

When the Connect for BW Server starts, it communicates with the BW to register itself as a server. The Connect for BW Server waits for a request from the BW to start the workflow. When the InfoPackage starts, the BW communicates with the registered Connect for BW Server and sends the workflow name to be scheduled with the PowerCenter Server. The Connect for BW Server reads information about the workflow and sends a request to the PowerCenter Server to run the workflow.

The PowerCenter Server validates the workflow name in the repository and the workflow name in the InfoPackage. The PowerCenter Server executes the session and loads the data into BW. You must start the Connect for BW Server after you restart the PowerCenter Server.

Supported Datatypes

The PowerCenter Server transforms data based on the Informatica transformation datatypes. BW can only receive data in packets of 250 bytes. The PowerCenter Server converts all data to a CHAR datatype and puts it into packets of 250 bytes, plus one byte for a continuation flag.

BW receives data until it reads the continuation flag set to zero. Within the transfer structure, BW then converts the data to the BW datatype. Currently, BW only supports the following datatypes in transfer structures assigned to BAPI source systems (PowerCenter): CHAR, CUKY, CURR, DATS, NUMC, TIMS, UNIT.

All other datatypes result in the following error in BW:

Invalid data type (data type name) for source system of type BAPI.


Date/Time Datatypes

The transformation date/time datatype supports dates with precision to the second. If you import a date/time value that includes milliseconds, the PowerCenter Server truncates to seconds. If you write a date/time value to a target column that supports milliseconds, the PowerCenter Server inserts zeros for the millisecond portion of the date.

Binary Datatypes

BW does not allow you to build a transfer structure with binary datatypes. Therefore, you cannot load binary data from PowerCenter into BW.

Numeric Datatypes

PowerCenter does not support the INT1 datatype.

Performance Enhancement for Loading into SAP BW

If you see a performance slowdown for sessions that load into SAP BW, set the default buffer block size to 15-20MB to enhance performance. You can put 5,000-10,000 rows per block, so you can calculate the buffer block size needed with the following formula:

Row size x Rows per block = Default Buffer Block size

For example, if your target row size is 2 KB: 2 KB x 10,000 = 20 MB.


Data Connectivity using PowerCenter Connect for MQSeries

Challenge

Understanding how to use MQSeries Applications in PowerCenter mappings.

Description

MQSeries applications communicate by sending each other messages rather than by calling each other directly. Applications can also request data using a "request message" on a message queue. Because no open connections are needed between systems, they can run independently of one another. MQSeries enforces no structure on the content or format of the message; this is defined by the application.

The following features and functions are not available to PowerCenter when using MQSeries:

• Lookup transformations can be used in an MQSeries mapping, but lookups on MQSeries sources are not allowed.

• Certain considerations are necessary when using AEPs, aggregators, custom transformations, joiners, sorters, rank, or transaction control transformations because they can only be performed on one queue, as opposed to a full data set.

MQSeries Architecture

IBM MQSeries is a messaging and queuing application that permits programs to communicate with one another across heterogeneous platforms and network protocols using a consistent application-programming interface.

PowerCenter Connect for MQSeries architecture has three parts:

• Queue Manager, which provides administrative functions for queues and messages.

• Message Queue, which is a destination to which messages can be sent.
• MQSeries Message, which incorporates a header and a data component.

Queue Manager


• Informatica connects to the Queue Manager to send and receive messages.
• Every message queue belongs to a Queue Manager.
• The Queue Manager administers queues, creates queues, and controls queue operation.

MQSeries Message

• MQSeries header contains data about the queue. Message header data includes the message identification number, message format, and other message descriptor data. In PowerCenterRT, MQSeries sources and dynamic MQSeries targets automatically incorporate MQSeries message header fields.

• MQSeries data component contains the application data or the "message body." The content and format of the message data is defined by the application that uses the message queue.

Extraction from a Queue

In order for PowerCenter to extract from a queue, the message must be in COBOL, XML, flat file, or binary format. When extracting from a queue, you need to use one of two source qualifiers: the MQ Source Qualifier (MQ SQ) or an associated Source Qualifier (SQ).

You must use an MQ SQ to read data from an MQ source, but you cannot use an MQ SQ to join two MQ sources. The MQ SQ is predefined and comes with 29 message header fields; MSGID is the primary key. After extracting from a queue, you can use a Midstream XML Parser transformation to parse XML in a pipeline.

MQ SQ can perform the following tasks:

• Select Associated Source Qualifier - this is necessary if the file is not binary.
• Set Tracing Level - verbose, normal, etc.
• Set Message Data Size - default 64,000; used for binary.
• Filter Data - set filter conditions to filter messages using message header ports, control end of file, control incremental extraction, and control syncpoint queue clean up.
• Use mapping parameters and variables.

In addition, you can enable message recovery for sessions that fail when reading messages from an MQSeries source, as well as use the Destructive Read attribute to both remove messages from the source queue at synchronization points and evaluate filter conditions when enabling message recovery.

Either an associated SQ (XML, flat file) or a Normalizer (COBOL) is required if the data is not binary. If you use an associated SQ, be sure to design the mapping as if it were not using MQSeries, and then add the MQ source and MQ Source Qualifier after testing the mapping logic, joining them to the associated source qualifier. When the code is working correctly, test by actually pulling data from the queue.

Loading to a Queue


Two types of MQ Targets can be used in a mapping: Static MQ Targets and Dynamic MQ Targets. However, you can use only one type of MQ Target in a single mapping. You can also use a Midstream XML Generator transformation to create XML inside a pipeline.

• Static MQ Targets - Used for loading message data (instead of header data) to the target. A static target does not load data to the message header fields. Use the target definition specific to the format of the message data (i.e., flat file, XML, COBOL). Design the mapping as if it were not using MQ Series, then configure the target connection to point to a MQ message queue in the session when using MQSeries.

• Dynamic - Used for binary targets only, and when loading data to a message header. Note that certain message headers in an MQSeries message require a predefined set of values assigned by IBM.

Creating and Configuring MQSeries Sessions

After you create mappings in the Designer, you can create and configure sessions in the Workflow Manager.

Configuring MQSeries Sources

The MQSeries source definition represents the metadata for the MQSeries source in the repository. Unlike other source definitions, you do not create an MQSeries source definition by importing the metadata from the MQSeries source. Since all MQSeries messages contain the same message header and message data fields, the Designer provides an MQSeries source definition with predefined column names.

MQSeries Mappings

MQSeries mappings cannot be partitioned if an associated source qualifier is used.

For MQ Series sources, set the Source Type to the following:

• Heterogeneous when there is an associated source definition in the mapping. This indicates that the source data is coming from an MQ source, and the message data is in flat file, COBOL or XML format.

• Message Queue when there is no associated source definition in the mapping.

Note that there are two pages on the Source Options dialog: XML and MQSeries. You can alternate between the two pages to set configurations for each.

Configuring MQSeries Targets

For Static MQSeries Targets, select File Target type from the list. When the target is an XML file or XML message data for a target message queue, the target type is automatically set to XML.

1. If you load data to a dynamic MQ target, the target type is automatically set to Message Queue.

2. On the MQSeries page, select the MQ connection to use for the source message queue, and click OK.


3. Be sure to select the MQ checkbox in Target Options for the Associated file type. Then click Edit Object Properties and enter:

o the connection name of the target message queue.
o the format of the message data in the target queue (e.g., MQSTR).
o the number of rows per message (only applies to flat file MQ targets).

TIP

Sessions can be run in real time by using the ForcedEOQ(n) function (or similar functions such as Idle(n) and FlushLatency(n)) in a filter condition and configuring the workflow to run continuously.

When the ForcedEOQ(n) function is used, the PowerCenter Server stops reading messages from the source at the end of the ForcedEOQ(n) period; because the workflow is set to run continuously, the session will automatically be restarted. If the session needs to be run without stopping, then use the following filter condition:

Idle(100000) && FlushLatency(3)

Appendix Information

PowerCenter uses the following datatypes in MQSeries mappings:

• IBM MQSeries datatypes. IBM MQSeries datatypes appear in the MQSeries source and target definitions in a mapping.

• Native datatypes. Flat file, XML, or COBOL datatypes associated with an MQSeries message data. Native datatypes appear in flat file, XML and COBOL source definitions. Native datatypes also appear in flat file and XML target definitions in the mapping.

• Transformation datatypes. Transformation datatypes are generic datatypes that PowerCenter uses during the transformation process. They appear in all the transformations in the mapping.

IBM MQSeries Datatypes

MQSeries Datatype - Transformation Datatype

MQBYTE - BINARY
MQCHAR - STRING
MQLONG - INTEGER
MQHEX

Values for Message Header Fields in MQSeries Target Messages

Message Header Field - Description

StrucId - Structure identifier
Version - Structure version number
Report - Options for report messages
MsgType - Message type
Expiry - Message lifetime
Feedback - Feedback or reason code
Encoding - Data encoding
CodedCharSetId - Coded character set identifier
Format - Format name
Priority - Message priority
Persistence - Message persistence
MsgId - Message identifier
CorrelId - Correlation identifier
BackoutCount - Backout counter
ReplyToQ - Name of reply queue
ReplyToQMgr - Name of reply queue manager
UserIdentifier - Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is null.
AccountingToken - Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is MQACT_NONE.
ApplIdentityData - Application data relating to identity. The value for ApplIdentityData is null.
PutApplType - Type of application that put the message on the queue. Defined by the environment.
PutApplName - Name of application that put the message on the queue. Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is null.
PutDate - Date when the message arrives in the queue.
PutTime - Time when the message arrives in the queue.
ApplOriginData - Application data relating to origin. The value for ApplOriginData is null.
GroupId - Group identifier
MsgSeqNumber - Sequence number of logical messages within group
Offset - Offset of data in physical message from start of logical message
MsgFlags - Message flags
OriginalLength - Length of original message


Data Connectivity using PowerCenter Connect for SAP

Challenge

Understanding how to install PowerCenter Connect for SAP R/3, extract data from SAP R/3, build mappings, run sessions to load SAP R/3 data and load data to SAP R/3.

Description

SAP R/3 is a software system that integrates multiple business applications, such as financial accounting, materials management, sales and distribution, and human resources. The R/3 system is programmed in Advanced Business Application Programming-Fourth Generation (ABAP/4, or ABAP), a language proprietary to SAP.

PowerCenter Connect for SAP R/3 provides the ability to integrate SAP R/3 data into data warehouses, analytic applications, and other applications. All of this is accomplished without writing complex ABAP code. PowerCenter Connect for SAP R/3 generates ABAP programs on the SAP R/3 server. PowerCenter Connect for SAP R/3 extracts data from transparent tables, pool tables, cluster tables, hierarchies (Uniform & Non Uniform), SAP IDocs and ABAP function modules.

When integrated with R/3 using ALE (Application Link Enabling), PowerCenter Connect for SAP R/3 can also extract data from R/3 using outbound IDocs (Intermediate Documents) in real time. The ALE concept available in R/3 Release 3.0 supports the construction and operation of distributed applications. It incorporates the controlled exchange of business data messages while ensuring data consistency across loosely coupled SAP applications. The integration of various applications is achieved by using synchronous and asynchronous communication, rather than by means of a central database. PowerCenter Connect for SAP R/3 can change data in R/3, as well as load new data into R/3 using direct RFC/BAPI function calls. It can also load data into SAP R/3 using inbound IDocs.

The database server stores the physical tables in the R/3 system, while the application server stores the logical tables. A transparent table definition on the application server is represented by a single physical table on the database server. Pool and cluster tables are logical definitions on the application server that do not have a one-to-one relationship with a physical table on the database server.

Communication Interfaces


TCP/IP is the native communication interface between PowerCenter and SAP R/3. Other interfaces between the two include:

Common Program Interface-Communications (CPI-C). CPI-C communication protocol enables online data exchange and data conversion between R/3 and PowerCenter. To initialize CPI-C communication with PowerCenter, SAP R/3 requires information such as the host name of the application server and the SAP gateway. This information is stored on the PowerCenter Server in a configuration file named sideinfo. The PowerCenter Server uses parameters in the sideinfo file to connect to the R/3 system when running stream mode sessions.

Remote Function Call (RFC). RFC is the remote communication protocol used by SAP and is based on RPC (Remote Procedure Call). To execute remote calls from PowerCenter, SAP R/3 requires information such as the connection type and the service name and gateway on the application server. This information is stored on the PowerCenter Client and PowerCenter Server in a configuration file named saprfc.ini. PowerCenter makes remote function calls when importing source definitions, installing ABAP programs, and running file mode sessions.

Transport system. The transport system in SAP is a mechanism to transfer objects developed on one system to another system. There are two situations when the transport system is needed:

• PowerCenter Connect for SAP R/3 installation.
• Transporting ABAP programs from development to production.

Note: if the ABAP programs are installed in the $TMP class, they cannot be transported from development to production.

Security

You must have proper authorizations on the R/3 system to perform integration tasks. The R/3 administrator needs to create authorizations, profiles, and users for PowerCenter users.

Integration Feature: Authorization Object (Activity)

• Import Definitions, Install Programs: S_DEVELOP (all activities; also set Development Object ID to PROG)
• Extract Data: S_TABU_DIS (READ)
• Run File Mode Sessions: S_DATASET (WRITE)
• Submit Background Job: S_PROGRAM (BTCSUBMIT, SUBMIT)
• Release Background Job: S_BTCH_JOB (DELE, LIST, PLAN, SHOW; also set Job Operation to RELE)
• Run Stream Mode Sessions: S_CPIC (all activities)
• Authorize RFC privileges: S_RFC (all activities)


You also need access to the SAP GUI, as described in the following list of SAP GUI parameters:

• User ID ($SAP_USERID): the username that connects to the SAP GUI and is authorized for read-only access to transactions SE12, SE15, SE16, and SPRO.
• Password ($SAP_PASSWORD): the password for the above user.
• System Number ($SAP_SYSTEM_NUMBER): the SAP system number.
• Client Number ($SAP_CLIENT_NUMBER): the SAP client number.
• Server ($SAP_SERVER): the server on which this instance of SAP is running.

Key Capabilities of PowerCenter Connect for SAP R/3

Some key capabilities of PowerCenter Connect for SAP R/3 include:

• Extract data from R/3 systems using ABAP, SAP's proprietary 4GL.
• Extract data from R/3 using outbound IDocs, or write data to R/3 using inbound IDocs, through integration with R/3 using ALE. You can extract data from R/3 using outbound IDocs in real time.

• Extract data from R/3 and load new data into R/3 using direct RFC/BAPI function calls.

• Migrate data from any source into R/3. You can migrate data from legacy applications, other ERP systems, or any number of other sources into SAP R/3.

• Extract data from R/3 and write it to a target data warehouse. PowerCenter Connect for SAP R/3 can interface directly with SAP to extract internal data from SAP R/3 and write it to a target data warehouse. You can then use the data warehouse to meet mission critical analysis and reporting needs.

• Support for calling BAPI as well as RFC functions dynamically from PowerCenter for data integration. PowerCenter Connect for SAP R/3 can make BAPI as well as RFC function calls dynamically from mappings to extract data from an R/3 source, transform data in the R/3 system, or load data into an R/3 system.

• Support for data integration using ALE. PowerCenter Connect for SAP R/3 can capture changes to the master and transactional data in SAP R/3 using ALE. PowerCenter Connect for SAP R/3 can receive outbound IDocs from SAP R/3 in real time and load into SAP R/3 using inbound IDocs. To receive IDocs in real time using ALE, install PowerCenter Connect for SAP R/3 on PowerCenterRT.

• Analytic Business Components for SAP R/3 (ABC). ABC is a set of business content that enables rapid and easy development of the data warehouse based on R/3 data. ABC business content includes mappings, mapplets, source objects, targets, and transformations.

• Metadata Exchange. PowerCenter Connect for SAP R/3 Metadata Exchange extracts metadata from leading data modeling tools and imports it into PowerCenter repositories through MX SDK.
• Import SAP functions in the Source Analyzer.
• Import IDocs. PowerCenter Connect for SAP R/3 can create a transformation to process outbound IDocs and generate inbound IDocs. You can edit the transformation to modify the IDoc segments you want to include, and reorder and validate inbound IDocs before writing them to the SAP R/3 system. You can set partition points in a pipeline for outbound and inbound IDoc sessions, and sessions that fail when reading outbound IDocs from an SAP R/3 source can be configured for recovery. You can also receive data from outbound IDoc files and write data to inbound IDoc files.
• Insert an ABAP code block to add more functionality to the ABAP program flow.
• Use of outer joins when two or more sources are joined in the ERP source qualifier.
• Use of static filters to reduce returned rows (e.g., MARA = MARA-MATNR = 189).
• Customization of the ABAP program flow with joins, filters, SAP functions, and code blocks. For example: qualifying table = table1-field1 = table2-field2, where the qualifying table is the last table in the condition based on the join order.

• Creation of ABAP program variables to represent SAP R/3 structures, structure fields, or values in the ABAP program

• Removal of ABAP program information from SAP R/3 and the repository when a folder is deleted.

• Enhanced platform support. PowerCenter Connect for SAP R/3 can run on 64-bit AIX and HP-UX (Itanium). You can install PowerCenter Connect for SAP R/3 for the PowerCenter Server and Repository Server on SuSE Linux and Red Hat Linux.

• PowerCenter Connect for SAP R/3 can be connected with SAP's business content framework to provide a high-volume data warehousing solution.

Installation and Configuration Steps

PowerCenter Connect for SAP R/3 setup programs install components for PowerCenter Server, Client, and repository server. These programs install drivers, connection files, and a repository plug-in XML file that enables integration between PowerCenter and SAP R/3. Setup programs can also install PowerCenter Connect for SAP R/3 Analytic Business Components, and PowerCenter Connect for SAP R/3 Metadata Exchange.

The PowerCenter Connect for SAP R/3 repository plug-in is called sapplg.xml. After the plug-in is installed, it needs to be registered in the PowerCenter repository.

For SAP R/3

Informatica provides a group of customized objects required for R/3 integration. These objects include tables, programs, structures, and functions that PowerCenter Connect for SAP exports to data files. The R/3 system administrator must use the transport control program, tp import, to transport these object files to the R/3 system. The transport process creates a development class called ZERP. The SAPTRANS directory contains “data” and “co” files. The “data” files are the actual transport objects. The “co” files are control files containing information about the transport request.

The R/3 system needs development objects and user profiles established to communicate with PowerCenter. Preparing R/3 for integration involves the following tasks:

• Transport the development objects on the PowerCenter CD to R/3. PowerCenter calls these objects each time it makes a request to the R/3 system.

• Run the transport program that generates unique IDs.
• Establish profiles in the R/3 system for PowerCenter users.
• Create a development class for the ABAP programs that PowerCenter installs on the SAP R/3 system.

For PowerCenter

The PowerCenter server and client need drivers and connection files to communicate with SAP R/3. Preparing PowerCenter for integration involves the following tasks:

• Run installation programs on the PowerCenter Server and Client machines.
• Configure the connection files:

o The sideinfo file on the PowerCenter Server allows PowerCenter to initiate CPI-C with the R/3 system. Following are the required parameters for sideinfo:

DEST logical name of the R/3 system

TYPE set to A to indicate connection to specific R/3 system.

ASHOST host name of the SAP R/3 application server.

SYSNR system number of the SAP R/3 application server.

o The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter to connect to the R/3 system as an RFC client. The required parameters for saprfc.ini are listed below (sample entries for both files follow this list):

DEST logical name of the R/3 system

LU host name of the SAP application server machine

TP set to sapdp<system number>

GWHOST host name of the SAP gateway machine.

GWSERV set to sapgw<system number>

PROTOCOL set to I for TCP/IP connection.
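As a rough sketch of what these two files might contain, using only the parameters listed above (the destination name, host names, and system number are placeholders; confirm the exact format against the PowerCenter Connect for SAP R/3 installation guide):

Sample sideinfo entry:

DEST=SAPR3
TYPE=A
ASHOST=r3_app_server
SYSNR=00

Sample saprfc.ini entry:

DEST=SAPR3
LU=r3_app_server
TP=sapdp00
GWHOST=r3_gateway_host
GWSERV=sapgw00
PROTOCOL=I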

Following is the summary of required steps:


1. Install PowerCenter Connect for SAP R/3 on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini file.
4. Set the RFC_INI environment variable.
5. Configure an application connection for SAP R/3 sources in the Workflow Manager.
6. Configure an SAP/ALE IDoc connection in the Workflow Manager to receive IDocs generated by the SAP R/3 system.
7. Configure the FTP connection to access staging files through FTP.
8. Install the repository plug-in in the PowerCenter repository.

Configuring the Services File

Windows

If SAPGUI is not installed, you must make entries in the Services file to run stream mode sessions. This is found in the \WINNT\SYSTEM32\drivers\etc directory. Entries are made similar to the following:

sapdp<system number> <port number of dispatcher service>/tcp

sapgw<system number> <port number of gateway service>/tcp

SAPGUI is not technically required, but experience has shown that evaluators typically want to log into the R/3 system to use the ABAP workbench and to view table contents.

Unix

The services file is located in /etc. Add entries similar to the following:

• sapdp<system number> <port# of dispatcher service>/tcp
• sapgw<system number> <port# of gateway service>/tcp

The system number and port numbers are provided by the BASIS administrator.
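For illustration only, with a hypothetical system number of 00 (SAP installations conventionally use ports 32NN for the dispatcher and 33NN for the gateway, but confirm the actual values with your BASIS administrator):

sapdp00 3200/tcp
sapgw00 3300/tcp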

Configure Connections to Run Sessions

Informatica supports two methods of communication between the SAP R/3 system and the PowerCenter Server.

• Streaming Mode does not create any intermediate files on the R/3 system. This method is faster, but it does use more CPU cycles on the R/3 system.

• File Mode creates an intermediate file on the SAP R/3 system, which is then transferred to the machine running the PowerCenter Server.

If you want to run file mode sessions, you must provide either FTP access or NFS access from the machine running the PowerCenter Server to the machine running SAP R/3. This, of course, assumes that PowerCenter and SAP R/3 are not running on the same machine; it is possible to run PowerCenter and R/3 on the same system, but highly unlikely.


If you want to use File mode sessions and your R/3 system is on a UNIX system, you need to do one of the following:

• Provide the login and password for the UNIX account used to run the SAP R/3 system.

• Provide a login and password for a UNIX account belonging to the same group as the UNIX account used to run the SAP R/3 system.

• Create a directory on the machine running SAP R/3, and run “chmod g+s” on that directory, as shown in the example after this list. Provide the login and password for the account used to create this directory.
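A minimal sketch of the third option, using a placeholder directory path:

mkdir /staging/sap_files
chmod g+s /staging/sap_files

Setting the setgid bit on the directory causes files created in it to inherit the directory's group, so the account used by the PowerCenter Server can read the staging files written by the R/3 account.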

Configure database connections in the Server Manager to access the SAP R/3 system when running a session, then configure an FTP connection to access staging files through FTP.

Extraction Process

R/3 source definitions can be imported from the logical tables using RFC protocol. Extracting data from R/3 is a four-step process:

Import source definitions. The PowerCenter Designer connects to the R/3 application server using RFC. The Designer calls a function in the R/3 system to import source definitions.

Note: If you plan to join two or more tables in SAP, be sure the join conditions are optimized. Make sure you have identified your driving table (e.g., if you plan to extract data from the bkpf and bseg accounting tables, be sure to drive your extract from the bkpf table). Properly defined joins make a significant difference in performance.

Create a mapping. When creating a mapping using an R/3 source definition, you must use an ERP source qualifier. In the ERP source qualifier, you can customize properties of the ABAP program that the R/3 server uses to extract source data. You can also use joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to customize the ABAP program.

Generate and install ABAP program. You can install two types of ABAP programs for each mapping:

• File mode. Extract data to file. The PowerCenter Server accesses the file through FTP or NFS mount.

• Stream Mode. Extract data to buffers. The PowerCenter Server accesses the buffers through CPI-C, the SAP protocol for program-to-program communication.

You can modify the ABAP program block and customize according to your requirements (e.g., if you want to get data incrementally, create a mapping variable/parameter and use it in the ABAP program).

Create session and run workflow


• Stream Mode. In stream mode, the installed ABAP program creates buffers on the application server. The program extracts source data and loads it into the buffers. When a buffer fills, the program streams the data to the PowerCenter Server using CPI-C. With this method, the PowerCenter Server can process data when it is received.

• File Mode. When running a session in file mode, the session must be configured to access the file through NFS mount or FTP. When the session runs, the installed ABAP program creates a file on the application server. The program extracts source data and loads it into the file. When the file is complete, the PowerCenter Server accesses the file through FTP or NFS mount and continues processing the session.

Data Integration Using RFC/BAPI Functions

PowerCenter Connect for SAP R/3 can generate RFC/BAPI function mappings in the Designer to extract data from SAP R/3, change data in R/3, or load data into R/3. When it uses an RFC/BAPI function mapping in a workflow, the PowerCenter Server makes the RFC function calls on R/3 directly to process the R/3 data. It doesn’t have to generate and install the ABAP program for data extraction.

Data Integration Using ALE

PowerCenter Connect for SAP R/3 can integrate PowerCenter with SAP R/3 using ALE. With PowerCenter Connect for SAP R/3, PowerCenter can generate mappings in the Designer to receive outbound IDocs from SAP R/3 in real time. It can also generate mappings to send inbound IDocs to SAP for data integration. When PowerCenter uses an inbound or outbound mapping in a workflow to process data in SAP R/3 using ALE, it doesn’t have to generate and install the ABAP program for data extraction.

Analytical Business Components

Analytic Business Components for SAP R/3 (ABC) allows you to use predefined business logic to extract and transform R/3 data. It works in conjunction with PowerCenter and PowerCenter Connect for SAP R/3 to extract master data, perform lookups, provide documents, and other fact and dimension data from the following R/3 modules:

• Financial Accounting
• Controlling
• Materials Management
• Personnel Administration and Payroll Accounting
• Personnel Planning and Development
• Sales and Distribution

Refer to the ABC Guide for complete installation and configuration information.


Data Profiling

Challenge

Data profiling is an option in PowerCenter version 7.0 and above that leverages existing PowerCenter functionality and a data profiling GUI front-end to provide a wizard-driven approach to creating data profiling mappings, sessions, and workflows. This Best Practice provides an introduction to its usage for new users.

Description

Creating a Custom or Auto Profile

The data profiling option provides visibility into the data contained in source systems and enables users to measure changes in the source data over time. This information can help to improve the quality of the source data.

An auto profile is particularly valuable when you are data profiling a source for the first time, since auto profiling offers a good overall perspective of a source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source level, and domain inference, distinct value and null value count, and min, max, and average (if numeric) at the column level. Creating and running an auto profile is quick and helps to gain a reasonably thorough understanding of a source in a short amount of time.

A custom data profile is useful when there is a specific question about a source, for example, when you have a business rule that you want to validate or when you want to test whether data matches a particular pattern.

Setting Up the Profile Wizard

To customize the profile wizard for your preferences:

• Open the Profile Manager and choose Tools > Options.
• If you are profiling data using a database user that is not the owner of the tables to be sourced, check the “Use source owner name during profile mapping generation” option.
• If you are in the analysis phase of your project, choose “Always run profile interactively” since most of your data-profiling tasks will be interactive. (In later phases of the project, uncheck this option since more permanent data profiles are useful in these phases.)

Running and Monitoring Profiles

Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode by checking or unchecking “Configure Session” on the "Function-Level Operations” tab of the wizard.

• Use Interactive to create quick, single-use data profiles. The sessions will be created with default configuration parameters.

• For data-profiling tasks that will be reused on a regular basis, create the sessions manually in Workflow Manager and configure and schedule them appropriately.

Generating And Viewing Profile Reports in PowerCenter/PowerAnalyzer

Use Profile Manager to view profile reports. Right-click on a profile and choose View Report.

For greater flexibility, you can also use PowerAnalyzer to view reports. Each PowerCenter client includes a PowerAnalyzer schema and reports xml file. The xml files can be found in the \Extensions\DataProfile\IPAReports subdirectory of the client installation.

You can create additional metrics, attributes, and reports in PowerAnalyzer to meet specific business requirements. You can also schedule PowerAnalyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.

Sampling Techniques

Four types of sampling techniques are available with the PowerCenter data profiling option:

• No sampling - uses all source data. Use for relatively small data sources.
• Automatic random sampling - PowerCenter determines the appropriate percentage to sample, then samples random rows. Use for larger data sources where you want a statistically significant data analysis.
• Manual random sampling - PowerCenter samples random rows of the source data based on a user-specified percentage. Use to sample more or fewer rows than the automatic option chooses.
• Sample first N rows - samples the number of user-selected rows. Provides a quick readout of a source (e.g., the first 200 rows).

Profile Warehouse Administration


Updating Data Profiling Repository Statistics

The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To ensure that queries run optimally, be sure to keep database statistics up to date. Run the appropriate query below for your database type, then capture the script that is generated and run it.

ORACLE

select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';

select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where index_name like 'DP%';
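One way to capture and run the generated statements, shown here for Oracle using SQL*Plus (the spool file name is a placeholder; the other databases have analogous script-capture facilities):

set heading off
set feedback off
set pagesize 0
spool update_dp_stats.sql
select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';
spool off
@update_dp_stats.sql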

Microsoft SQL Server

select 'update statistics ' + name from sysobjects where name like 'PMDP%'

SYBASE

select 'update statistics ' + name from sysobjects where name like 'PMDP%'

INFORMIX

select 'update statistics low for table ', tabname, ' ; ' from systables where tabname like 'PMDP%'

IBM DB2

select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all;' from syscat.tables where tabname like 'PMDP%'

TERADATA

select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where tablename like 'PMDP%' and databasename = 'database_name'

where database_name is the name of the repository database.

Purging Old Data Profiles

Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose Target Warehouse>Connect and connect to the profiling warehouse. Choose Target Warehouse>Purge to open the purging tool.


Data Quality Mapping Rules

Challenge

Use PowerCenter to create data quality mapping rules to enhance the usability of the data within your system.

Description

This Best Practice focuses on techniques for use with PowerCenter and third-party or add-on software. Comments that are specific to the use of PowerCenter are enclosed in brackets.

Basic Methodology

The issue of poor data quality is one that frequently hinders the success of data integration projects. It can produce inconsistent or faulty results and ruin the credibility of the system with the business users. The data quality problems often arise from a breakdown in overall process rather than a specific issue that can be resolved by a single software package.

Some of the principles applied to data quality improvements are borrowed from manufacturing where they were initially designed to reduce the costs of manufacturing processes. A number of methodologies evolved from these principles, all centered around the same general process: Define, Discover, Analyze, Improve, and Combine. Reporting is a crucial part of each process step, helping to guide the users through the process. Together, these steps offer businesses an iterative approach to improving data quality.

• Define – This is the first step of any data quality exercise, and also the first step to data profiling. Users must first define the goals of the exercise. Some questions that should arise may include: 1) what are the troublesome data types and in what domains do they reside? 2) what data elements are of concern? 3) where do those data elements exist? 4) how are correctness and consistency measured? and 5) are metadata definitions complete and consistent? This step is often supplemented by a metadata solution that allows knowledgeable users to see specific data elements across the enterprise. It also addresses the question of where the data should be fixed, and how to ensure that the data is fixed at the correct place. This step also helps to define the rules that users subsequently employ to create data profiles.


• Discover – Since data profiling is a collection of statistics about existing data, the next step in the process is to use the information gathered in the first step to evaluate the actual data. This process should quantify how correct the data is with regards to the predefined rules. These rules can be stored so that the process can be performed iteratively and provide feedback on whether the data quality is improving. Refer to the Best Practice on Data Profiling for a complete description of how to use PowerCenter’s built-in data profiling capabilities.

• Analyze – The Analyze step takes the results of the Discover step and attempts to identify the root causes of any data problems. Depending on the project, this step may need to incorporate knowledge users from many different teams. This step may also take more man hours than the other steps since much of the work needs to be done by Business Analysts and/or Subject Matter Experts. The issues should be prioritized so that the project team can address small chunks of poor data quality at a time to ensure success. Since this process can be repeated, there is no need to try and tackle the whole data quality problem in one big bang.

• Improve – After the root causes of the data problems have been determined, steps should be taken to scrub and clean the data. This step can be facilitated through the use of specialized software packages that automate the clean-up process. This includes standardizing names, addresses, and formats. Data cleansing software often uses other standard data sets provided by the software vendor to match real addresses with people or companies, instead of relying on the original values. Formats can also be cleaned up to remove any inconsistencies introduced by multiple data entry clerks, such as those often found in product IDs, telephone numbers, and other generic IDs or codes. Consistency rules can be defined within the software package. It is sometimes advisable to profile the data after the cleansing is complete to ensure that the software package has effectively resolved the quality issues.

• Combine – Many enterprises have embarked on methods to identify a customer or client identifier throughout the company. Since the data has been profiled and cleansed at this point, the enterprise is now ready to start linking data elements across source systems in order to reduce redundancy and increase consistency. The rules that were defined in the Define step and leveraged in the Analyze step play a big role here as well because master records are needed when removing duplications.

Common Questions to Consider

Data integration/warehousing projects often encounter general data problems outside the scope of a full-blown data quality project, but which also need to be addressed. The remainder of this document discusses some methods to ensure a base level of data quality; much of the content discusses specific strategies to use with PowerCenter.

The quality of data is important in all types of projects, whether it be data warehousing, data synchronization, or data migration. Certain questions need to be considered for all of these projects, with the answers driven by the project’s requirements and the business users that are being serviced. Ideally, these questions should be addressed during the Design and Analyze phases of the project because they can require a significant amount of re-coding if identified later.

Some of the areas to consider are:


Text formatting

The most common hurdle here is capitalization and trimming of spaces. Often, users want to see data in its “raw” format without any capitalization, trimming, or formatting applied to it. This is easily achievable as it is the default behavior, but there is danger in taking this requirement literally since it can lead to duplicate records when some of these fields are used to identify uniqueness and the system is combining data from various source systems.

One solution to this issue is to create additional fields that act as a unique key to a given table, but which are formatted in a standard way. Since the “raw” data is stored in the table, users can still see it in this format, but the additional columns mitigate the risk of duplication.
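A minimal sketch of this approach in a PowerCenter Expression transformation follows; the port names (in_CUSTOMER_NAME, o_CUSTOMER_NAME_STD) are hypothetical, and the exact fields that make up your match key will vary by project:

-- Output port o_CUSTOMER_NAME_STD: standardized version of the raw value,
-- used for matching/uniqueness while the raw column is stored unchanged
UPPER(LTRIM(RTRIM(in_CUSTOMER_NAME)))

Several such standardized ports can be concatenated to form the additional key columns described above.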

Another possibility is to explain to the users that “raw” data in unique, identifying fields is not as clean and consistent as data in a common format. In other words, push back on this requirement.

This issue can be particularly troublesome in data migration projects where matching the source data is a high priority. Failing to trim leading/trailing spaces from data can often lead to mismatched results since the spaces are stored as part of the data value. The project team must understand how spaces are handled by the source systems to determine the amount of coding required to correct this. (When using PowerCenter and sourcing flat files, the options provided while configuring the File Properties may be sufficient.) Remember that certain RDBMS products use the data type CHAR, which then stores the data with trailing blanks. These blanks need to be trimmed before matching can occur. It is usually only advisable to use CHAR for one-character flag fields.


Note that many fixed-width files pad fields with spaces rather than nulls. Therefore, developers must enter one space beside the Text radio button and also indicate that the character repeats to fill out the rest of the precision of the column. The strip trailing blanks facility then strips off any remaining spaces from the end of the data value. [In PowerCenter, avoid embedding database text-manipulation functions in lookup transformations: the resulting SQL override forces the developer to cache the lookup table, and on very large tables caching is not always realistic or feasible.]

Datatype conversions

It is advisable to use explicit tool functions when converting the data type of a particular data value.

[In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is performed, and 15 digits will be carried forward, even when they are not needed or desired. PowerCenter can handle some conversions without function calls (these are detailed in the product documentation), but this may cause subsequent support or maintenance headaches.]
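As a hedged illustration, the following Expression-transformation ports show explicit conversions; the port names and the scale of 2 are hypothetical:

-- Numeric to string: avoids the implicit 15-digit conversion described above
TO_CHAR(in_ORDER_ID)

-- String to decimal with an explicit scale, trimming any padding first
TO_DECIMAL(LTRIM(RTRIM(in_AMOUNT_TXT)), 2)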

Dates

Dates can cause many problems when moving and transforming data from one place to another because an assumption must be made that all data values are in a designated format.

[Informatica recommends first checking a piece of data to ensure it is in the proper format before trying to convert it to a Date data type. If the check is not performed first, then a developer increases the risk of transformation errors, which can cause data to be lost].

An example piece of code would be: IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)

If the majority of the dates coming from a source system arrive in the same format, then it is often wise to create a reusable expression that handles dates, so that the proper checks are made. It is also advisable to determine if any default dates should be defined, such as a low date or high date. These should then be used throughout the system for consistency. However, do not fall into the trap of always using default dates as some are meant to be NULL until the appropriate time (e.g., birth date or death date).

The NULL in the example above could be changed to one of the standard default dates described here.
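For example, a variation of the expression above that substitutes an assumed standard low date (the 19000101 value is hypothetical; use whatever default your project agrees on) might look like:

IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE('19000101', 'YYYYMMDD'))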

Decimal precision

With numeric data columns, developers must determine the expected or required precisions of the columns. [By default (to increase performance), PowerCenter treats all numeric columns as 15-digit floating point decimals, regardless of how they are defined in the transformations. The maximum numeric precision in PowerCenter is 28 digits.]

If it is determined that a column realistically needs a higher precision, then the Enable Decimal Arithmetic option in the Session Properties needs to be checked. However, be aware that enabling this option can slow performance by as much as 15 percent. The Enable Decimal Arithmetic option must also be enabled when comparing two numbers for equality.

Trapping Poor Data Quality Techniques

The most important technique for ensuring good data quality is to prevent incorrect, inconsistent, or incomplete data from ever reaching the target system. This goal may be difficult to achieve in a data synchronization or data migration project, but it is very relevant for data warehouses and operational data stores (ODSs). This section discusses techniques that you can use to prevent bad data from reaching the system.

Checking data for completeness before loading

When requesting a data feed from an upstream system, be sure to request an audit file or report that contains a summary of what to expect within the feed. Common requests here are record counts or summaries of numeric data fields. Assuming that this can be obtained from the source system, it is advisable to then create a pre-process step that ensures your input source matches the audit file. If the values do not match, stop the overall process from loading into your target system. The source system can then be alerted to verify where the problem exists in its feed.
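One way to sketch such a pre-process check in a PowerCenter Expression transformation is a port that compares the two counts and aborts the session when they disagree. The port names are hypothetical and assume the source count and the audit-file count have already been aggregated and joined into a single row:

-- Stop the load if the counts disagree; otherwise pass the count through
IIF(in_SOURCE_ROW_COUNT != in_AUDIT_ROW_COUNT, ABORT('Source row count does not match audit file'), in_SOURCE_ROW_COUNT)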

Enforcing rules during mapping

Another method of filtering bad data is to build a set of clearly defined data rules into the load job. The records are then evaluated against these rules and routed to an Error (or Bad) table for later reprocessing. An example of this is to check all incoming Country Codes against a Valid Values table. If the code is not found, the record is flagged as an Error record and written to the Error table.
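A hedged sketch of such a check, assuming an unconnected Lookup transformation named lkp_VALID_COUNTRY that returns the code when it exists in the Valid Values table (both the transformation and port names are hypothetical):

-- Output port o_COUNTRY_STATUS: drives a downstream Router transformation;
-- 'ERROR' rows are routed to the Error table, 'VALID' rows continue to the target
IIF(ISNULL(:LKP.lkp_VALID_COUNTRY(in_COUNTRY_CODE)), 'ERROR', 'VALID')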

A pitfall of this method is that you must determine what happens to the record once it has been loaded to the Error table. If the record is pushed back to the source system to be fixed, then a delay may occur until the record can be successfully loaded to the target system. In fact, if the proper governance is not in place, the source system may refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the data manually and risk not matching with the source system; or 2) relax the business rule to allow the record to be loaded.

Oftentimes, in the absence of an enterprise data steward, it is a good idea to assign a team member the role of data steward. It is this person’s responsibility to patrol these tables and push back to the appropriate systems as necessary, as well as to help make decisions about fixing or filtering bad data. A data steward should have a good command of the metadata, and he/she should also understand the consequences to the user community of data decisions.


Another solution, applicable in cases with a small number of code values, is to try to anticipate any mistyped codes and translate them back to the correct codes. The cross-reference translation data can be accumulated over time: each time an error is corrected, both the incorrect and correct values should be put into the table and used to correct future errors automatically.
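A sketch of this translation, assuming a hypothetical unconnected lookup lkp_CODE_XREF against the accumulated cross-reference table that returns the corrected code (or NULL when no correction exists):

-- Variable port v_CORRECTED_CODE: look up a possible correction once
:LKP.lkp_CODE_XREF(in_COUNTRY_CODE)

-- Output port o_COUNTRY_CODE: use the correction when one exists, otherwise pass the original value through
IIF(ISNULL(v_CORRECTED_CODE), in_COUNTRY_CODE, v_CORRECTED_CODE)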

Dimension not found while loading fact

The majority of current data warehouses are built using a dimensional model. A dimensional model relies on dimension records existing before the fact tables are loaded. This can usually be accomplished by loading the dimension tables before loading the fact tables. However, there are some cases where a corresponding dimension record is not present at the time of the fact load. When this occurs, consistent rules are needed to handle the situation so that data is not improperly exposed to, or hidden from, the users.

One solution is to continue to load the data to the fact table, but assign the foreign key a value that represents Not Found or Not Available in the dimension. These keys must also exist in the dimension tables to satisfy referential integrity, but they provide a clear and easy way to identify records that may need to be reprocessed at a later date.
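A minimal sketch of this rule, assuming a connected lookup that returns lkp_CUSTOMER_KEY and a reserved key of -1 for the Not Found dimension row (both the port name and the -1 convention are hypothetical):

-- Output port o_CUSTOMER_KEY feeding the fact table foreign key
IIF(ISNULL(lkp_CUSTOMER_KEY), -1, lkp_CUSTOMER_KEY)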

Another solution is to filter the record from processing since it may no longer be relevant to the fact table. The team will most likely want to flag the row through the use of either error tables or process codes so that it can be reprocessed at a later time.

A third solution is to use dynamic caches and load the dimensions when a record is not found there, even while loading the fact table. This should be done very carefully as it may add unwanted or junk values to the dimension table. One occasion when this may be advisable is in cases where dimensions are simply made up of the distinct combination values in a data set. Thus, this dimension may require a new record if a new combination occurs.

It is imperative that all of these solutions be discussed with the users before making any decisions, as they eventually will be the ones making decisions based on the reports.


Deployment Groups

Challenge

Deployment groups are a versatile feature offering an improved method of migrating work completed in one repository to another repository. This Best Practice describes ways deployment groups can be used to simplify migrations.

Description

Deployment Groups are containers that hold references to objects that need to be migrated. This includes objects such as mappings, mapplets, reusable transformations, sources, targets, workflows, sessions and tasks, as well as the object holders (i.e. the repository folders). Deployment groups are faster and more flexible than folder moves for incremental changes. In addition, they allow for migration “rollbacks” if necessary. Migrating a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. Copying a deployment group allows you to specify individual objects to copy, rather than the entire contents of a folder.

There are two types of deployment groups: static and dynamic.

• Static deployment groups contain direct references to versions of objects that need to be moved. Users explicitly add the version of the object to be migrated to the deployment group.

• Dynamic deployment groups contain a query that is executed at the time of deployment. The results of the query (i.e. object versions in the repository) are then selected and copied to the target repository.

Dynamic deployment groups are generated from a query. While any available criteria can be used, it is advisable to have developers use labels to simplify the query. See the Strategies for Labels section of the Best Practice on Using PowerCenter Labels for further information. When generating a query for a deployment group whose mappings and mapplets contain non-reusable objects, one additional condition must be used along with any specific selection criteria: the query must include a reusable-status condition with the qualifier “one of” and both Reusable and Non-reusable selected. Without this condition, the deployment may encounter errors if non-reusable objects are held within a mapping or mapplet.
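As a sketch (the label name is hypothetical, and the exact parameter wording varies slightly between PowerCenter versions), the query conditions for such a dynamic deployment group typically combine a label with the reusable-status condition, for example:

Label            is equal to    DEPLOY_REL_2_1
Reusable Status  is one of      Reusable, Non-reusable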


A deployment group exists in a specific repository. It can be used to move items to any other accessible repository. A deployment group maintains a history of all migrations it has performed. It tracks which versions of objects were moved from which folders in which source repositories, and into which folders in which target repositories those versions were copied (i.e., it provides a complete audit trail of all migrations performed). Because the deployment group knows what it moved and where, an administrator can, if necessary, have the deployment group "undo" the most recent deployment, reverting the target repository to its pre-deployment state. Using labels (as described in the Labels Best Practice) allows objects in the subsequent repository to be tracked back to a specific deployment.

It is important to note that the deployment group only migrates the objects it contains to the target repository. It does not, itself, move to the target repository. It still resides in the source repository.

Deploying via the GUI

Migrations can be performed via the GUI or the command line (pmrep). To migrate objects via the GUI, a user simply drags a deployment group from the repository it resides in, onto the target repository where the objects it references are to be moved. The Deployment Wizard appears, stepping the user through the deployment process. The user can match folders in the source and target repositories so objects are moved into the proper target folders, reset sequence generator values, etc. Once the wizard is complete, the migration occurs, and the deployment history is created.

Deploying via the Command Line

The PowerCenter pmrep command can be used to automate both Folder Level deployments (e.g. in a non-versioned repository) and deployments using Deployment Groups. The commands DeployFolder and DeployDeploymentGroup in pmrep are used respectively for these purposes. Whereas deployment via the GUI requires the user to step through a wizard to answer the various questions to deploy, command-line deployment requires the user to provide an XML control file, containing the same information that is required by the wizard. This file must be present before the deployment is executed.

Further Considerations for Deployment and Deployment Groups

Simultaneous Multi-Phase Projects

If there are multiple phases of a project being developed simultaneously in separate folders, it is possible to consolidate them by mapping folders appropriately through the deployment group migration wizard. When migrating with deployment groups in this way, the override buttons in the migration wizard are used to select specific folder mapping.

Rolling Back a Deployment

Deployment groups help to ensure that you have a back-out methodology. You can roll back the latest version of a deployment. To do this:


In the target repository (where the objects were migrated to), go to Versioning>>Deployment>>History>>View History>>Rollback.

The rollback purges all objects (of the latest version) that were in the deployment group. You can initiate a rollback on a deployment as long as you roll back only the latest versions of the objects. The rollback verifies that the check-in time for the repository objects is the same as the deployment time.

Managing Repository Size

As you check in objects and deploy objects to target repositories, the number of object versions in those repositories increases, and thus, the size of the repositories also increases.

In order to manage repository size, use a combination of Check-in Date and Latest Status (both are query parameters) to purge the desired versions from the repository and retain only the very latest version. You could also choose to purge all the deleted versions of the objects, which reduces the size of the repository.
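As a sketch only (the exact parameter names and operators depend on the PowerCenter version, and the cutoff date is a placeholder), such a purge query conceptually combines the two parameters along the lines of:

Latest Status   is one of        Older
Check-in Date   is earlier than  <agreed retention cutoff date>

Objects returned by the query can then be purged, leaving only the latest checked-in versions.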

If you want to keep more than the latest version, you can also include labels in your query. These labels are ones that you have applied to the repository for the specific purpose of identifying objects for purging.

Off-Shore On-Shore Migration

When work must be migrated from an off-shore development environment to an on-shore environment, other aspects of the computing environment may make it desirable to generate a dynamic deployment group. Instead of migrating the group itself to the next repository, you can use a query to select the objects for migration and save them to a single XML file, which can then be transmitted to the on-shore environment through alternative methods. If the on-shore repository is versioned, importing the file activates the import wizard as if a deployment group were being received.

Migrating to a Non Versioned Repository

In some instances, it may be desirable to migrate to a non-versioned repository from a versioned repository. It should be noted that this changes the wizards used when migrating in this manner, and that the export from the versioned repository has to take place using XML export. Note that certain repository objects (e.g. connections) cannot be automatically migrated, which may invalidate objects such as sessions. These objects (i.e. connections) should be set up first in the receiving repository. The XML import wizard will advise of any invalidations that occur.


Designing Analytic Data Architectures

Challenge

Develop a sound data architecture that can serve as a foundation for an analytic solution that may evolve over many years.

Description

Historically, organizations have approached the development of a "data warehouse" or "data mart" as a departmental effort, without considering an enterprise perspective. The result has been silos of corporate data and analysis, which very often conflict with each other in terms of both detailed data and the business conclusions implied by it.

Taking an enterprise-wide, architectural stance in developing analytic solutions provides many advantages, including:

• A sound architectural foundation ensures the solution can evolve and scale with the business over time.

• Proper architecture can isolate the application component (business context) of the analytic solution from the technology.

• Lastly, architectures allow for reuse - reuse of skills, design objects, and knowledge.

As the evolution of analytic solutions (and the corresponding nomenclature) has progressed, the necessity of building these solutions on a solid architectural framework has become more and more clear. To understand why, a brief review of the history of analytic solutions and their predecessors is warranted.

Historical Perspective

Online Transaction Processing Systems (OLTPs) have always provided a very detailed, transaction-oriented view of an organization's data. While this view was indispensable for the day-to-day operation of a business, its ability to provide a "big picture" view of the operation, critical for management decision-making, was severely limited. Initial attempts to address this problem took several directions:

• Reporting directly against the production system. This approach minimized the effort associated with developing management reports, but introduced a number of significant issues:

o The nature of OLTP data is, by definition, "point-in-time." Thus, reports run at different times of the year, month, or even the day, were inconsistent with each other.

o Ad hoc queries against the production database introduced uncontrolled performance issues, resulting in slow reporting results and degradation of OLTP system performance.

o Trending and aggregate analysis was difficult (or impossible) with the detailed data available in the OLTP systems.

• Mirroring the production system in a reporting database. While this approach alleviated the performance degradation of the OLTP system, it did nothing to address the other issues noted above.

• Reporting databases. To address the fundamental issues associated with reporting against the OLTP schema, organizations began to move toward dedicated reporting databases. These databases were optimized for the types of queries typically run by analysts, rather than those used by systems supporting data entry clerks or customer service representatives. These databases may or may not have included pre-aggregated data, and took several forms, including traditional RDBMS as well as newer technology Online Analytical Processing (OLAP) solutions.

The initial attempts at reporting solutions were typically point solutions; they were developed internally to provide very targeted data to a particular department within the enterprise. For example, the Marketing department might extract sales and demographic data in order to infer customer purchasing habits. Concurrently, the Sales department was also extracting sales data for the purpose of awarding commissions to the sales force. Over time, these isolated silos of information became irreconcilable, since the extracts and business rules applied to the data during the extract process differed for the different departments.

The result of this evolution was that the Sales and Marketing departments might report completely different sales figures to executive management, resulting in a lack of confidence in both departments' "data marts." From a technical perspective, the uncoordinated extracts of the same data from the source systems multiple times placed undue strain on system resources.

The solution seemed to be the "centralized" or "galactic" data warehouse. This warehouse would be supported by a single set of periodic extracts of all relevant data into the data warehouse (or Operational Data Store), with the data being cleansed and made consistent as part of the extract process. The problem with this solution was its enormous complexity, typically resulting in project failure. The scale of these failures led many organizations to abandon the concept of the enterprise data warehouse in favor of the isolated, "stovepipe" data marts described earlier. While these solutions still had all of the issues discussed previously, they had the clear advantage of providing individual departments with the data they needed without the unmanageability of the enterprise solution.

As individual departments pursued their own data and analytical needs, they not only created data stovepipes, they also created technical islands. The approaches to populating the data marts and performing the analytical tasks varied widely, resulting in a single enterprise evaluating, purchasing, and being trained on multiple tools and adopting multiple methods for performing these tasks. If, at any point, the organization did attempt to undertake an enterprise effort, it was likely to face the daunting challenge of integrating the disparate data as well as the widely varying technologies. To deal with these issues, organizations began developing approaches that considered the enterprise-level requirements of an analytical solution.

Centralized Data Warehouse

The first approach to gain popularity was the centralized data warehouse, designed to solve the decision support needs of the entire enterprise at one time, with one effort. The data integration process extracts the data directly from the operational systems, transforms it according to the business rules, and loads it into a single target database serving as the enterprise-wide data warehouse.

Advantages

The centralized model offers a number of benefits to the overall architecture, including:

• Centralized control. Since a single project drives the entire process, there is centralized control over everything occurring in the data warehouse. This makes it easier to manage a production system while concurrently integrating new components of the warehouse.

• Consistent metadata. Because the warehouse environment is contained in a single database and the metadata is stored in a single repository, the entire enterprise can be queried whether you are looking at data from Finance, Customers, or Human Resources.

• Enterprise view. Developing the entire project at one time provides a global view of how data from one workgroup coordinates with data from others. Since the warehouse is highly integrated, different workgroups often share common tables such as customer, employee, and item lists.


• High data integrity. A single, integrated data repository for the entire enterprise would naturally avoid all data integrity issues that result from duplicate copies and versions of the same business data.

Disadvantages

Of course, the centralized data warehouse also involves a number of drawbacks, including:

• Lengthy implementation cycle. With the complete warehouse environment developed simultaneously, many components of the warehouse become daunting tasks, such as analyzing all of the source systems and developing the target data model. Even minor tasks, such as defining how to measure profit and establishing naming conventions, snowball into major issues.

• Substantial up-front costs. Many analysts who have studied the costs of this approach agree that this type of effort nearly always runs into the millions. While this level of investment is often justified, the problem lies in the delay between the investment and the delivery of value back to the business.

• Scope too broad. The centralized data warehouse requires a single database to satisfy the needs of the entire organization. Attempts to develop an enterprise-wide warehouse using this approach have rarely succeeded, since the goal is simply too ambitious. As a result, this wide scope has been a strong contributor to project failure.

• Impact on the operational systems. Different tables within the warehouse often read data from the same source tables, but manipulate it differently before loading it into the targets. Since the centralized approach extracts data directly from the operational systems, a source table that feeds into three different target tables is queried three times to load the appropriate target tables in the warehouse. When combined with all the other loads for the warehouse, this can create an unacceptable performance hit on the operational systems.

Independent Data Mart

The second warehousing approach is the independent data mart, which gained popularity in 1996 when DBMS magazine ran a cover story featuring this strategy. This architecture is based on the same principles as the centralized approach, but it scales down the scope from solving the warehousing needs of the entire company to the needs of a single department or workgroup.

Much like the centralized data warehouse, an independent data mart extracts data directly from the operational sources, manipulates the data according to the business rules, and loads a single target database serving as the independent data mart. In some cases, the operational data may be staged in an Operational Data Store (ODS) and then moved to the mart.


Advantages

The independent data mart is the logical opposite of the centralized data warehouse. The disadvantages of the centralized approach are the strengths of the independent data mart:

• Impact on operational databases localized. Because the independent data mart is trying to solve the DSS needs of a single department or workgroup, only the few operational databases containing the information required need to be analyzed.

• Reduced scope of the data model. The target data modeling effort is vastly reduced since it only needs to serve a single department or workgroup, rather than the entire company.

• Lower up-front costs. The data mart is serving only a single department or workgroup; thus hardware and software costs are reduced.

• Fast implementation. The project can be completed in months, not years. The process of defining business terms and naming conventions is simplified since "players from the same team" are working on the project.

Disadvantages

Of course, independent data marts also have some significant disadvantages:

• Lack of centralized control. Because several independent data marts are needed to solve the decision support needs of an organization, there is no centralized control. Each data mart or project controls itself, but there is no central control from a single location.

• Redundant data. After several data marts are in production throughout the organization, all of the problems associated with data redundancy surface, such as inconsistent definitions of the same data object or timing differences that make reconciliation impossible.

• Metadata integration. Due to their independence, the opportunity to share metadata - for example, the definition and business rules associated with the Invoice data object - is lost. Subsequent projects must repeat the development and deployment of common data objects.

• Manageability. The independent data marts control their own scheduling routines and therefore store and report their metadata differently, with a negative impact on the manageability of the data warehouse. There is no centralized scheduler to coordinate the individual loads appropriately or metadata browser to maintain the global metadata and share development work among related projects.

Dependent Data Marts (Federated Data Warehouses)

The third warehouse architecture is the dependent data mart approach supported by the hub-and-spoke architecture of PowerCenter and PowerMart. After studying more than one hundred different warehousing projects, Informatica introduced this approach in 1998, leveraging the benefits of the centralized data warehouse and independent data mart.

The more general term being adopted to describe this approach is the "federated data warehouse." Industry analysts have recognized that, in many cases, there is no "one size fits all" solution. Although the goal of true enterprise architecture, with conformed dimensions and strict standards, is laudable, it is often impractical, particularly for early efforts. Thus, the concept of the federated data warehouse was born. It allows for the relatively independent development of data marts, but leverages a centralized PowerCenter repository for sharing transformations, source and target objects, business rules, etc.

Recent literature describes the federated architecture approach as a way to get closer to the goal of a truly centralized architecture while allowing for the practical realities of most organizations. The centralized warehouse concept is sacrificed in favor of a more pragmatic approach, whereby the organization can develop semi-autonomous data marts, so long as they subscribe to a common view of the business. This common business model is the fundamental, underlying basis of the federated architecture, since it ensures consistent use of business terms and meanings throughout the enterprise.

With the exception of the rare case of a truly independent data mart, where no future growth is planned or anticipated, and where no opportunities for integration with other business areas exist, the federated data warehouse architecture provides the best framework for building an analytic solution.

Informatica's PowerCenter and PowerMart products provide an essential capability for supporting the federated architecture: the shared Global Repository. When used in conjunction with one or more Local Repositories, the Global Repository serves as a sort of "federal" governing body, providing a common understanding of core business concepts that can be shared across the semi-autonomous data marts. These data marts each have their own Local Repository, which typically include a combination of purely local metadata and shared metadata by way of links to the Global Repository.


This environment allows for relatively independent development of individual data marts, but also supports metadata sharing without obstacles. The common business model and names described above can be captured in metadata terms and stored in the Global Repository. The data marts use the common business model as a basis, but extend the model by developing departmental metadata and storing it locally.

A typical characteristic of the federated architecture is the existence of an Operational Data Store (ODS). Although this component is optional, it can be found in many implementations that extract data from multiple source systems and load multiple targets. The ODS was originally designed to extract and hold operational data that would be sent to a centralized data warehouse, working as a time-variant database to support end-user reporting directly from operational systems. A typical ODS had to be organized by data subject area because it did not retain the data model from the operational system.

Informatica's approach to the ODS, by contrast, has virtually no change in data model from the operational system, so it need not be organized by subject area. The ODS does not permit direct end-user reporting, and its refresh policies are more closely aligned with the refresh schedules of the enterprise data marts it may be feeding. It can also perform more sophisticated consolidation functions than a traditional ODS.

Advantages

The Federated architecture brings together the best features of the centralized data warehouse and independent data mart:

• Room for expansion. While the architecture is designed to quickly deploy the initial data mart, it is also easy to share project deliverables across subsequent data marts by migrating local metadata to the Global Repository. Reuse is built in.


• Centralized control. A single platform controls the environment from development to test to production. Mechanisms to control and monitor the data movement from operational databases into the analytic environment are applied across the data marts, easing the system management task.

• Consistent metadata. A Global Repository spans all the data marts, providing a consistent view of metadata.

• Enterprise view. Viewing all the metadata from a central location also provides an enterprise view, easing the maintenance burden for the warehouse administrators. Business users can also access the entire environment when necessary (assuming that security privileges are granted).

• High data integrity. Using a set of integrated metadata repositories for the entire enterprise removes data integrity issues that result from duplicate copies of data.

• Minimized impact on operational systems. Frequently accessed source data, such as customer, product, or invoice records, is moved into the decision support environment once, leaving the operational systems unaffected by the number of target data marts.

Disadvantages

Disadvantages of the federated approach include:

• Data propagation. This approach moves data twice: to the ODS, then into the individual data mart. This requires extra database space to store the staged data as well as extra time to move the data. However, the disadvantage can be mitigated by not saving the data permanently in the ODS. After the warehouse is refreshed, the ODS can be truncated, or a rolling three months of data can be saved.

• Increased development effort during initial installations. For each table in the target, there needs to be one load developed from the ODS to the target, in addition to all the loads from the source to the targets.

Operational Data Store

Using a staging area or ODS differs from a centralized data warehouse approach since the ODS is not organized by subject area and is not customized for viewing by end users or even for reporting. The primary focus of the ODS is in providing a clean, consistent set of operational data for creating and refreshing data marts. Separating out this function allows the ODS to provide more reliable and flexible support.

Data from the various operational sources is staged for subsequent extraction by target systems in the ODS. In the ODS, data is cleaned and remains normalized, tables from different databases are joined, and a refresh policy is carried out (a change/capture facility may be used to schedule ODS refreshes, for instance).

The ODS and the data marts may reside in a single database or be distributed across several physical databases and servers.

Characteristics of the Operational Data Store are:

• Normalized
• Detailed (not summarized)
• Integrated
• Cleansed
• Consistent

Within an enterprise data mart, the ODS can consolidate data from disparate systems in a number of ways:

• Normalizes data where necessary (such as non-relational mainframe data), preparing it for storage in a relational system.

• Cleans data by enforcing commonalties in dates, names and other data types that appear across multiple systems.

• Maintains reference data to help standardize other formats; references might range from zip codes and currency conversion rates to product-code-to-product-name translations. The ODS may apply fundamental transformations to some database tables in order to reconcile common definitions, but the ODS is not intended to be a transformation processor for end-user reporting requirements.

Its role is to consolidate detailed data within common formats. This enables users to create wide varieties of analytical reports, with confidence that those reports will be based on the same detailed data, using common definitions and formats.

The following table compares the key differences in the three architectures:

Architecture            Centralized Data Warehouse   Independent Data Mart   Federated Data Warehouse
Centralized Control     Yes                          No                      Yes
Consistent Metadata     Yes                          No                      Yes
Cost effective          No                           Yes                     Yes
Enterprise View         Yes                          No                      Yes
Fast Implementation     No                           Yes                     Yes
High Data Integrity     Yes                          No                      Yes
Immediate ROI           No                           Yes                     Yes
Repeatable Process      No                           Yes                     Yes

The Role of Enterprise Architecture

The federated architecture approach allows for the planning and implementation of an enterprise architecture framework that addresses not only short-term departmental needs, but also the long-term enterprise requirements of the business. This does not mean that the entire architectural investment must be made in advance of any application development. However, it does mean that development is approached within the guidelines of the framework, allowing for future growth without significant technological change. The remainder of this chapter will focus on the process of designing and developing an analytic solution architecture using PowerCenter as the platform.

Fitting Into the Corporate Architecture

Very few organizations have the luxury of creating a "green field" architecture to support their decision support needs. Rather, the architecture must fit within an existing set of corporate guidelines regarding preferred hardware, operating systems, databases, and other software. The Technical Architect, if not already an employee of the organization, should ensure that he/she has a thorough understanding of the existing (and future vision of) technical infrastructure. Doing so will eliminate the possibility of developing an elegant technical solution that will never be implemented because it defies corporate standards.


Developing an Integration Competency Center

Challenge

With increased pressure on IT productivity, many companies are rethinking the “independence” of data integration projects that has resulted in an inefficient, piecemeal, silo-based approach to each new project. Furthermore, as each group within a business attempts to integrate its data, it unknowingly duplicates effort the company has already invested: not just in the data integration itself, but also in the effort spent on developing practices, processes, code, and personnel expertise.

An alternative to this expensive redundancy is to create some type of “integration competency center” (ICC). An ICC is an IT approach that provides teams throughout an organization with best practices in integration skills, processes, and technology so that they can complete data integration projects consistently, rapidly, and cost-efficiently.

What types of services should your ICC offer? This Best Practice provides an overview of offerings to help you consider the appropriate structure for your ICC.

Description

Objectives

Typical ICC objectives include:

• Promoting data integration as a formal discipline
• Developing a set of experts with data integration skills and processes, and leveraging their knowledge across the organization
• Building and developing skills, capabilities, and best practices for integration processes and operations
• Monitoring, assessing, and selecting integration technology and tools
• Managing integration pilots
• Leading and supporting integration projects with the cooperation of subject matter experts
• Reusing development work such as source definitions, application interfaces, and codified business rules

Benefits

Page 85: Best Informatica Practices26064718

INFORMATICA CONFIDENTIAL BEST PRACTICES PAGE BP-85

Although a successful project that shares its lessons learned with other teams can be a great way to begin developing organizational awareness of the value of an ICC, setting up a more formal ICC requires upper-management buy-in and funding. Here are some of the typical benefits that can be realized from doing so:

• Rapid development of in-house expertise through coordinated training and shared knowledge

• Leverage of shared resources and “best practice” methods and solutions
• More rapid project deployments
• Higher quality / reduced risk data integration projects
• Reduced costs of project development and maintenance

When examining the move toward an ICC model that optimizes and in certain situations centralizes integration functions, consider two things: the problems, costs and risks associated with a project silo-based approach, and the potential benefits of an ICC environment.

What Services should be in Your ICC?

The common services provided by ICCs can be divided into 4 major categories:

• Knowledge Management
• Environment
• Development Support
• Production Support

Detailed Services Listings by Category

Knowledge Management

• Training

o Standards Training (Training Coordinator)

Training of best practices, including but not limited to, naming conventions, unit test plans, configuration management strategy, and project methodology.

o Product Training (Training Coordinator) Co-ordination of vendor-offered or internally-sponsored training of specific technology products.

• Standards

o Standards Development (Knowledge Coordinator)

Creating best practices, including but not limited to, naming conventions, unit test plans, and coding standards.

o Standards Enforcement (Knowledge Coordinator) Ensuring that development teams follow documented best practices through formal development reviews, metadata reports, project audits, or other means.

o Methodology (Knowledge Coordinator) Creating methodologies to support development initiatives. Examples include methodologies for rolling out data warehouses and data integration projects. Typical topics in a methodology include, but are not limited to:
§ Project Management
§ Project Estimation
§ Development Standards
§ Operational Support

o Mapping Patterns (Knowledge Coordinator) Developing and maintaining mapping patterns (templates) to speed up development time and promote mapping standards across projects.

• Technology

o Emerging Technologies (Technology Leader)

Assessing emerging technologies, determining if/where they fit in the organization, and defining policies around their adoption and use.

o Benchmarking (Technology Leader) Conducting and documenting tests on hardware and software in the organization to establish performance benchmarks.

• Metadata

o Metadata Standards (Metadata Administrator)

Creating standards for capturing and maintaining metadata. (Example: database column descriptions will be captured in ErWin and pushed to PowerCenter via Metadata Exchange)

o Metadata Enforcement (Metadata Administrator)

Ensuring that development teams conform to documented metadata standards.

o Data Integration Catalog (Metadata Administrator) Tracking the list of systems involved in data integration efforts, the integration between systems, and the use of/subscription to data integration feeds. This information is critical to managing the interconnections in the environment in order to avoid duplication of integration efforts. The Catalog will also assist in understanding when particular integration feeds are no longer needed.

Environment

• Hardware


o Vendor Selection and Management (Vendor Manager) Selecting vendors for the hardware tools needed for integration efforts that may span Servers, Storage and network facilities

o Hardware Procurement (Vendor Manager) Responsible for the purchasing process for hardware items that may include receiving and cataloging the physical hardware items.

o Hardware Architecture (Technical Architect) Developing and maintaining the physical layout and details of the hardware used to support the Integration Competency Center

o Hardware Installation (Product Specialist) Setting up and activating new hardware as it becomes part of the physical architecture supporting the Integration Competency Center

o Hardware Upgrades (Product Specialist) Managing the upgrade of hardware including operating system patches, additional cpu/memory upgrades, replacing old technology etc.

• Software

o Vendor Selection and Management (Vendor Manager)

Selecting vendors for the software tools needed for integration efforts. Activities may include formal RFPs, vendor presentation reviews, software selection criteria, maintenance renewal negotiations, and all activities related to managing the software vendor relationship.

o Software Procurement (Vendor Manager) Responsible for the purchasing process for software packages and licenses

o Software Architecture (Technical Architect) Developing and maintaining the architecture of the software package(s) used in the competency center. This may include flowcharts and decision trees of what software to select for specific tasks.

o Software Installation (Product Specialist) Setting up and installing new software as it becomes part of the physical architecture supporting the Integration Competency Center

o Software Upgrade (Product Specialist) Managing the upgrade of software including patches and new releases. Depending on the nature of the upgrade, significant planning and rollout efforts may be required during upgrades. (Training, testing, physical installation on client machines etc)

o Compliance (Licensing) (Vendor Manager) Monitoring and ensuring proper licensing compliance across development teams. Formal audits or reviews may be scheduled. Physical documentation should be kept matching installed software with purchased licenses.

• Professional Services


o Vendor Selection and Management (Vendor Manager) Selecting vendors for professional services efforts related to integration efforts. Activities may include managing vendor rates and bulk discount negotiations, payment of vendors, reviewing past vendor work efforts, managing list of ‘preferred’ vendors etc.

o Vendor Qualification (Vendor Manager) Conducting formal vendor interviews as consultants/contractors are proposed for projects, checking vendor references and certifications, and formally qualifying selected vendors for specific work tasks (i.e., Vendor A is qualified for Java development while Vendor B is qualified for ETL and EAI work).

• Security

o Security Administration (Security Administrator)

Providing access to the tools and technology needed to complete data integration development efforts, including software user IDs, source system user IDs/passwords, and overall data security of the integration efforts. Ensures enterprise security processes are followed.

o Disaster Recovery (Technical Architect) Performing risk analysis in order to develop and execute a plan for disaster recovery, including repository backups, off-site backups, failover hardware, notification procedures, and other tasks related to a catastrophic failure (e.g., a server room fire destroys the development and production servers).

• Financial

o Budget (ICC Manager)

Yearly budget management for the Integration Competency Center. Responsible for managing outlays for services, support, hardware, software and other costs.

o Departmental Cost Allocation (ICC Manager) For clients where shared-services costs are to be spread across departments/business units for cost purposes. Activities include defining the metrics used for cost allocation, reporting on the metrics, and applying cost factors for billing on a weekly, monthly, or quarterly basis as dictated.

• Scalability/Availability

o High Availability (Technical Architect)

Designing and implementing hardware, software and procedures to ensure high availability of the data integration environment.

o Capacity Planning (Technical Architect) Designing and planning for additional integration capacity to address the organization's future growth in the size and volume of data integration.

Development Support


• Performance

o Performance and Tuning (Product Specialist)

Providing targeted performance and tuning assistance for integration efforts. Providing on-going assessments of load windows and schedules to ensure service level agreements are being met.

• Shared Objects o Shared Object Quality Assurance (Quality Assurance)

Providing quality assurance services for shared objects so that objects conform to standards and do not adversely affect the various projects that may be using them.

o Shared Object Change Management (Change Control Coordinator): Managing the migration to production of shared objects which may impact multiple project teams. Activities include defining the schedule for production moves, notifying teams of changes, and coordinating the migration of the object to production.

o Shared Object Acceptance (Change Control Coordinator): Defining and documenting the criteria for a shared object and officially certifying an object as one that will be shared across project teams.

o Shared Object Documentation (Change Control Coordinator): Defining the standards for documentation of shared objects and maintaining a catalog of all shared objects and their functions.

• Project Support

o Development Helpdesk (Data Integration Developer): Providing a helpdesk of expert product personnel to support project teams. This gives project teams new to developing data integration routines a place to turn for experienced guidance.

o Software/Method Selection (Technical Architect): Providing a workflow or decision tree to use when deciding which data integration technology to use for a given request.

o Requirements Definition (Business/Technical Analyst): Developing the process to gather and document integration requirements. Depending on the level of service, this activity may include assisting with, or even fully gathering, the requirements for the project.

o Project Estimation (Project Manager): Developing project estimation models and providing estimation assistance for data integration efforts.

o Project Management (Project Manager): Providing full-time management resources experienced in data integration to ensure successful projects.

o Project Architecture Review (Data Integration Architect): Providing project-level architecture review as part of the design process for data integration projects. Helping ensure standards are met and the project architecture fits within the enterprise architecture vision.

o Detailed Design Review (Data Integration Developer): Reviewing design specifications in detail to ensure conformance to standards and identifying any issues before development work begins.

o Development Resources (Data Integration Developer): Providing product-skilled resources for completion of the development efforts.

o Data Profiling (Data Integration Developer): Providing data profiling services to identify data quality issues and developing plans for addressing the issues found.

o Data Quality (Data Integration Developer): Defining and meeting data quality levels and thresholds for data integration efforts.

• Testing

o Unit Testing (Quality Assurance): Defining and executing unit testing of data integration processes. Deliverables include documented test plans, test cases, and verification against end-user acceptance criteria.

o System Testing (Quality Assurance): Defining and performing system testing to ensure that data integration efforts work seamlessly across multiple projects and teams.

• Cross Project Integration

o Schedule Management/Planning (Data Integration Developer): Providing a single point for managing load schedules across the physical architecture to make best use of available resources and appropriately handle integration dependencies.

o Impact Analysis (Data Integration Developer): Providing impact analysis on proposed and scheduled changes that may affect the integration environment. Changes include, but are not limited to, system enhancements, new systems, retirement of old systems, data volume changes, shared object changes, hardware migration, and system outages.

Production Support

• Issue Resolution

o Operations Helpdesk (Production Operator): First line of support for operations issues, providing high-level issue resolution. The helpdesk fields support cases and issues related to scheduled jobs, system availability, and other production support tasks.

o Data Validation (Quality Assurance): Providing data validation on integration load tasks. Data may be ‘held’ from end-user access until some level of data validation has been performed. Validation can range from manual review of load statistics to automated review of record counts, including grand-total comparisons, expected size thresholds, or any other metric an organization may define to catch potential data inconsistencies before they reach end users (see the SQL sketch at the end of this list).

• Production Monitoring

o Schedule Monitoring (Production Operator): Nightly/daily monitoring of the data integration load jobs. Ensuring jobs are properly initiated, are not being delayed, and complete successfully. May provide first-level support for the load schedule while escalating issues to the appropriate support teams.

o Operations Metadata Delivery (Production Operator): Responsible for providing metadata to system owners and end users regarding the production load process, including load times, completion status, known issues, and other pertinent information regarding the current state of the integration job stream.

• Change Management

o Object Migration (Change Control Coordinator): Coordinating movement of development objects and processes to production. May even physically control migration such that all migration is scheduled, managed, and performed by the ICC.

o Change Control Review (Change Control Coordinator): Conducting formal and informal reviews of production changes before migration is approved. At this time, standards may be enforced, system tuning reviewed, production schedules updated, and formal sign-off on production changes issued.

o Process Definition (Change Control Coordinator): Developing and documenting the change management process such that development objects are efficiently and flawlessly migrated into the production environment. This may include notification rules, schedule migration plans, emergency fix procedures, etc.
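As referenced under Data Validation above, the record-count and grand-total checks are often just SQL comparisons run after a load completes. The following is a minimal, hedged sketch; the STG_ORDERS/DW_ORDERS table names and the SALES_AMT column are hypothetical placeholders, and the single-row FROM construct will vary by database.

-- Hypothetical post-load validation: compare a row count and a grand total
-- between the staging table and the warehouse table loaded from it.
SELECT 'ROW COUNT' AS check_name,
       (SELECT COUNT(*) FROM STG_ORDERS) AS source_value,
       (SELECT COUNT(*) FROM DW_ORDERS)  AS target_value
FROM DUAL            -- DUAL assumes Oracle; substitute your database's equivalent
UNION ALL
SELECT 'SALES GRAND TOTAL',
       (SELECT SUM(SALES_AMT) FROM STG_ORDERS),
       (SELECT SUM(SALES_AMT) FROM DW_ORDERS)
FROM DUAL;
-- Any row where source_value <> target_value flags a load that should be held
-- back from end users until the discrepancy is explained.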


Development FAQs

Challenge

Using the PowerCenter product suite to effectively develop, name, and document components of the analytic solution. While the most effective use of PowerCenter depends on the specific situation, this Best Practice addresses some questions that are commonly raised by project teams. It provides answers in a number of areas, including Scheduling, Backup Strategies, Server Administration, and Metadata. Refer to the product guides supplied with PowerCenter for additional information.

Description

The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.

Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)

In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (i.e. what is the impact of having multiple targets populated by a single map?)

With PowerCenter, it is possible to design a mapping with multiple targets. You can then load the targets in a specific order using Target Load Ordering. The recommendation is to limit the amount of complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to multiple disks or file systems simultaneously. This minimizes disk seeks and applies to a session writing to multiple targets, and to multiple sessions running simultaneously.

Q: What are some considerations for determining how many objects and transformations to include in a single mapping?


There are several items to consider when building a mapping. The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement. The most expensive use of the DTM is passing unnecessary data through the mapping. It is best to use filters as early as possible in the mapping to remove rows of data that are not needed; this is the SQL equivalent of the WHERE clause. Using the filter condition in the Source Qualifier to filter out rows at the database level is a good way to increase the performance of the mapping.
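To illustrate, the filter condition entered on the Source Qualifier is appended to the WHERE clause of the SQL that PowerCenter generates, so unneeded rows never leave the database. A hedged sketch, assuming a hypothetical ORDERS source with an ORDER_STATUS column:

-- Generated SQL with no Source Qualifier filter: every row is read into the mapping.
SELECT ORDER_ID, CUSTOMER_ID, ORDER_STATUS, ORDER_AMT
FROM ORDERS;

-- With the Source Qualifier filter condition ORDER_STATUS = 'OPEN',
-- the generated SQL restricts the rows at the database level:
SELECT ORDER_ID, CUSTOMER_ID, ORDER_STATUS, ORDER_AMT
FROM ORDERS
WHERE ORDER_STATUS = 'OPEN';

The same result can be achieved further downstream with a Filter transformation, but the rows would then be read and passed through the DTM before being discarded.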

Log File Organization

Q: Where is the best place to maintain Session Logs?

One often-recommended location is the default "SessLogs" folder in the PowerCenter directory, keeping all log files in the same directory.

Q: What documentation is available for the error codes that appear within the error log files?

Log file errors and descriptions appear in Appendix C of the PowerCenter Troubleshooting Guide. Error information also appears in the PowerCenter Help file within the PowerCenter client applications. For database-specific errors, consult your database user guide.

Scheduling Techniques

Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session?

Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the targets.

Workflows can be created to run sequentially or concurrently, or have tasks in different paths doing either.

• A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2 when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next session only if the previous session was successful, or to stop on errors, etc.

• A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' Symmetric Multi-Processing (SMP) architecture.

Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse. At the same time, this capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.

Q: Assuming a workflow failure, does PowerCenter allow restart from the point of failure?

No. When a workflow fails, you can choose to start a workflow from a particular task but not from the point of failure. It is possible, however, to create tasks and flows based on error handling assumptions.

Q: What guidelines exist regarding the execution of multiple concurrent sessions / workflows within or across applications?

Workflow Execution needs to be planned around two main constraints:

• Available system resources
• Memory and processors

The number of sessions that can run at one time depends on the number of processors available on the server. The load manager is always running as a process. As a general rule, a session will be compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation intensive, so the DTM always runs. Also, some sessions require more I/O, so they use less processor time. Generally, a session needs about 120 percent of a processor for the DTM, reader, and writer in total.

For concurrent sessions:

One session per processor is about right; you can run more, but that requires a "trial and error" approach to determine what number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server.

The sessions should run at "off-peak" hours to have as many available resources as possible.

Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory usage is more difficult than the processors calculation; it tends to vary according to system load and number of PowerCenter sessions running.

The first step is to estimate memory usage, accounting for:

• Operating system kernel and miscellaneous processes
• Database engine
• Informatica Load Manager

The DTM process creates threads to initialize the session, read, write and transform data, and handle pre- and post-session operations.

• More memory is allocated for lookups, aggregates, ranks, sorters and heterogeneous joins in addition to the shared memory segment.


At this point, you should have a good idea of what is left for concurrent sessions. It is important to arrange the production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may be able to run only one large session, or several small sessions concurrently.

Load Order Dependencies are also an important consideration because they often create additional constraints. For example, load the dimensions first, then facts. Also, some sources may only be available at specific times, some network links may become saturated if overloaded, and some target tables may need to be available to end users earlier than others.

Q: Is it possible to perform two "levels" of event notification? At the application level and the PowerCenter Server level to notify the Server Administrator?

The application level of event notification can be accomplished through post-session email. Post-session email allows you to create two different messages; one to be sent upon successful completion of the session, the other to be sent if the session fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics about the session. You can use the following variables in the text of your post-session email:

Email Variable   Description
%s               Session name
%l               Total records loaded
%r               Total records rejected
%e               Session status
%t               Table details, including read throughput in bytes/second and write throughput in rows/second
%b               Session start time
%c               Session completion time
%i               Session elapsed time (session completion time - session start time)
%g               Attaches the session log to the message
%m               Name and version of the mapping used in the session
%d               Name of the folder containing the session
%n               Name of the repository containing the session
%a<filename>     Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>

On Windows NT, you can attach a file of any type. On UNIX, you can only attach text files. If you attach a non-text file, the send may fail.

Note: The filename cannot include the Greater Than character (>) or a line break.
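As an illustration only (the exact wording is up to the project), a success message could be configured using these variables so that operations staff see the essentials at a glance:

Session %s completed with status %e.
Rows loaded: %l   Rows rejected: %r
Started: %b   Finished: %c   Elapsed: %i
%t
%g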


The PowerCenter Server on UNIX uses rmail to send post-session email. The repository user who starts the PowerCenter server must have the rmail tool installed in the path in order to send email.

To verify the rmail tool is accessible:

1. Log in to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session email.

The output should look like the following:

Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0

Completed Rows Loaded | Rows Rejected | ReadThroughput (bytes/sec) | WriteThroughput (rows/sec) | Table Name | Status
1 | 0 | 30 | 1 | t_Q3_sales |

No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts email.

Backup Strategy Recommendation

Q: Can individual objects within a repository be restored from the backup or from a prior version?

At the present time, individual objects cannot be restored from a backup using the PowerCenter Repository Manager (i.e., you can only restore the entire repository). But, it is possible to restore the backup repository into a different database and then manually copy the individual objects back into the main repository.

Another option is to export individual objects to XML files. This allows for the granular re-importation of individual objects, mappings, tasks, workflows, etc.


Refer to Migration Procedures for details on promoting new or changed objects between development, test, QA, and production environments.

Server Administration

Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs?

The Repository Server can be used to send messages notifying users that the server will be shut down. Additionally, the Repository Server can be used to send notification messages about repository objects that are created, modified, or deleted by another user. Notification messages are received through the PowerCenter Client tools.

Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?

The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes.

Pmprocs is a script that combines the ps and ipcs commands. It is available through Informatica Technical Support. The utility provides the following information:

• CPID - Creator PID (process ID)
• LPID - Last PID that accessed the resource
• Semaphores - used to sync the reader and writer
• 0 or 1 - shows slot in LM shared memory

(See Chapter 16 in the PowerCenter Repository Guide for additional details.)

A variety of UNIX and Windows NT commands and utilities are also available. Consult your UNIX and/or Windows NT documentation.

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?

If the UNIX server crashes, you should first check to see if the repository database is able to come back up successfully. If this is the case, then you should try to start the PowerCenter server. Use the pmserver.err log to check if the server has started correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.

Metadata

Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that might be extracted from the PowerCenter repository and used in others?

With PowerCenter, you can enter description information for all repository objects (sources, targets, transformations, etc.), but the amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, datatypes, and primary keys is stored in the repository.

The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it is also very time-consuming to do so. Therefore, this decision should be made on the basis of how much metadata will be required by the systems that use it.

There are some time-saving tools available to better manage a metadata strategy and its content, such as third-party metadata software and, for sources and targets, data modeling tools.

Q: What procedures exist for extracting metadata from the repository?

Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store, retrieve, and manage their metadata in Informatica's PowerCenter repository. The motivation behind the original Metadata Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository.

Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.

Informatica strongly discourages accessing the repository tables directly, even for SELECT access, because some releases of PowerCenter change the structure of the repository tables, leaving you with a maintenance task. Instead, views have been created to provide access to the metadata stored in the repository.
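As a hedged illustration, a typical MX query lists mapping metadata from the views rather than from the underlying OPB_ repository tables. The view and column names below (REP_ALL_MAPPINGS and its columns) are assumed here and should be verified against the MX views installed with your PowerCenter release:

-- Query the MX views, not the physical repository tables, for mapping metadata.
SELECT SUBJECT_AREA,          -- repository folder
       MAPPING_NAME,
       MAPPING_LAST_SAVED     -- when the mapping was last saved
FROM   REP_ALL_MAPPINGS
ORDER BY SUBJECT_AREA, MAPPING_NAME;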

Additional products, such as Informatica's Metadata Reporter and PowerAnalyzer, allow for more robust reporting against the repository database and are able to present reports to the end-user and/or management.

Q: How can I keep multiple copies of the same object within PowerCenter?

A: With PowerCenter 7.x, you can use version control to maintain previous copies of every changed object.

You can enable version control after you create a repository. Version control allows you to maintain multiple versions of an object, control development of the object, and track changes. You can configure a repository for versioning when you create it, or you can upgrade an existing repository to support versioned objects.

When you enable version control for a repository, the repository assigns all versioned objects version number 1 and each object has an active status.

You can perform the following tasks when you work with a versioned object:


• View object version properties. Each versioned object has a set of version properties and a status. You can also configure the status of a folder to freeze all objects it contains or make them active for editing.

• Track changes to an object. You can view a history that includes all versions of a given object, and compare any version of the object in the history to any other version. This allows you to determine changes made to an object over time.

• Check the object version in and out. You can check out an object to reserve it while you edit the object. When you check in an object, the repository saves a new version of the object and allows you to add comments to the version. You can also find objects checked out by yourself and other users.

• Delete or purge the object version. You can delete an object from view and continue to store it in the repository. You can recover, or undelete, deleted objects. If you want to permanently remove an object version, you can purge it from the repository.

Q: Is there a way to migrate only the changed objects from Development to Production without having to spend too much time on making a list of all changed/affected objects?

A: Yes there is.

You can create deployment groups that allow you to group versioned objects for migration to a different repository.

You can create the following types of deployment groups:

• Static. You populate the deployment group by manually selecting objects.
• Dynamic. You use the result set from an object query to populate the deployment group.

To make a smooth transition/migration to Production, you need to have a query associated with your Dynamic deployment group. When you associate an object query with the deployment group, the Repository Agent runs the query at the time of deployment. You can associate an object query with a deployment group when you edit or create a deployment group.

If the repository is enabled for versioning, you may also copy the objects in a deployment group from one repository to another. Copying a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects to copy, rather than the entire contents of a folder.

Q: Can I load-balance PowerCenter sessions?

A: The current version, PowerCenter 7, allows you to set up a server grid.

When you create a server grid, you can add PowerCenter Servers to the grid. When you run a workflow against a PowerCenter Server in the grid, that server becomes the master server for the workflow. The master server runs all non-session tasks and assigns session tasks to run on other servers that are defined in the grid. The other servers become worker servers for that workflow run.

You can add servers to a server grid at any time. When a server starts up, it connects to the grid and can run sessions from master servers and distribute sessions to worker servers in the grid. The Workflow Monitor communicates with the master server to monitor progress of workflows, get session statistics, retrieve performance details, and stop or abort the workflow or task instances.

If a PowerCenter Server loses its connection to the grid, it tries to re-establish a connection. You do not need to restart the server for it to connect to the grid. If a PowerCenter Server is not connected to the server grid, the other PowerCenter Servers in the server grid do not send it tasks.

Q: How does Web Services Hub work in version 7 of PowerCenter?

A: The Web Services Hub is a PowerCenter Service gateway for external clients. It exposes PowerCenter functionality through a service-oriented architecture. It receives requests from web service clients and passes them to the PowerCenter Server or the Repository Server. The PowerCenter Server or Repository Server processes the requests and sends a response to the web service client through the Web Services Hub.

The Web Services Hub hosts Batch Web Services, Metadata Web Services, and Real-time Web Services.

Install the Web Services Hub on an application server and configure information such as repository login, session expiry and log buffer sizes.

The Web Services Hub connects to the Repository Server and the PowerCenter Server through TCP/IP. Web service clients log in to the Web Services Hub through HTTP(s). The Web Services Hub authenticates the client based on repository user name and password. You can use the Web Services Hub console to view service information and download Web Services Description Language (WSDL) files necessary for running services and workflows.


Key Management in Data Warehousing Solutions

Challenge

Key management refers to the technique that manages key allocation in a decision support RDBMS to create a single view of reference data from multiple sources. Informatica recommends a concept of key management that ensures loading everything extracted from a source system into the data warehouse.

This Best Practice provides some tips for employing the Informatica-recommended approach to key management, an approach that deviates from many traditional data warehouse solutions applying logical and data warehouse (surrogate) key strategies, in which errors are logged and transactions are rejected because of referential integrity issues.

Description

Key management in a decision support RDBMS comprises three techniques for handling the following common situations:

• Key merging/matching
• Missing keys
• Unknown keys

All three methods are applicable to a Reference Data Store, whereas only the missing and unknown keys are relevant for an Operational Data Store (ODS). Key management should be handled at the data integration level, thereby making it transparent to the Business Intelligence layer.

Key Merging/Matching

When companies source data from more than one transaction system of a similar type, the same object may have different, non-unique legacy keys. Additionally, a single key may have several descriptions or attributes in each of the source systems. The independence of these systems can result in incongruent coding, which poses a greater problem than records being sourced from multiple systems.

A business can resolve this inconsistency by undertaking a complete code standardization initiative (often as part of a larger metadata management effort) or by applying a Universal Reference Data Store (URDS). Standardizing code requires an object to be uniquely represented in the new system. Alternatively, a URDS contains universal codes for common reference values. Most companies adopt this pragmatic approach while embarking on the longer-term solution of code standardization.

The bottom line is that nearly every data warehouse project encounters this issue and needs to find a solution in the short term.

Missing Keys

A problem arises when a transaction is sent through without a value in a column where a foreign key should exist (i.e., a reference to a key in a reference table). This normally occurs during the loading of transactional data, although it can also occur when loading reference data into hierarchy structures. In many older data warehouse solutions, this condition would be identified as an error and the transaction row would be rejected. The row would have to be processed through some other mechanism to find the correct code and loaded at a later date. This is often a slow and cumbersome process that leaves the data warehouse incomplete until the issue is resolved.

The more practical way to resolve this situation is to allocate a special key in place of the missing key, which links it with a dummy 'missing key' row in the related table. This enables the transaction to continue through the loading process and end up in the warehouse without further processing. Furthermore, the row ID of the bad transaction can be recorded in an error log, allowing the addition of the correct key value at a later time.

The major advantage of this approach is that any aggregate values derived from the transaction table will be correct because the transaction exists in the data warehouse rather than being in some external error processing file waiting to be fixed.

Simple Example:

PRODUCT | CUSTOMER | SALES REP | QUANTITY | UNIT PRICE
Audi TT18 | Doe10224 | | 1 | 35,000

In the transaction above, there is no code in the SALES REP column. As this row is processed, a dummy sales rep key (UNKNOWN) is added to the record to link to a record in the SALES REP table. A data warehouse key (8888888) is also added to the transaction.

PRODUCT | CUSTOMER | SALES REP | QUANTITY | UNIT PRICE | DWKEY
Audi TT18 | Doe10224 | 9999999 | 1 | 35,000 | 8888888

The related sales rep record may look like this:

REP CODE | REP NAME | REP MANAGER
1234567 | David Jones | Mark Smith
7654321 | Mark Smith |
9999999 | Missing Rep |

An error log entry to identify the missing key on this transaction may look like:

ERROR CODE | TABLE NAME | KEY NAME | KEY
MSGKEY | ORDERS | SALES REP | 8888888

This type of error reporting is not usually necessary because the transactions with missing keys can be identified using standard end-user reporting tools against the data warehouse.
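In SQL terms, a simplified sketch of what the load logic does follows (table and column names come from the example above; in a PowerCenter mapping the substitution would normally be done with an expression or lookup rather than in the database, and the STG_ORDERS staging table and hard-coded warehouse key are illustrative only):

-- The dummy 'missing key' row exists once in the reference table.
INSERT INTO SALES_REP (REP_CODE, REP_NAME)
VALUES (9999999, 'Missing Rep');

-- During the load, a missing foreign key is replaced with the dummy key so the
-- transaction still loads and aggregates derived from it remain correct.
INSERT INTO ORDERS (DWKEY, PRODUCT, CUSTOMER, SALES_REP, QUANTITY, UNIT_PRICE)
SELECT 8888888, PRODUCT, CUSTOMER,
       COALESCE(SALES_REP, 9999999),    -- substitute the missing-key value
       QUANTITY, UNIT_PRICE
FROM   STG_ORDERS;

-- Afterwards, the affected transactions remain visible to standard reporting:
SELECT * FROM ORDERS WHERE SALES_REP = 9999999;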

Unknown Keys

Unknown keys need to be treated much like missing keys, except that the load process has to add the unknown key value to the referenced table to maintain integrity rather than explicitly allocating a dummy key to the transaction. The process also needs to make two error log entries: the first logs the fact that a new, unknown key has been added to the reference table, and the second records the transaction in which the unknown key was found.

Simple example:

The sales rep reference data record might look like the following:

DWKEY | REP NAME | REP MANAGER
1234567 | David Jones | Mark Smith
7654321 | Mark Smith |
9999999 | Missing Rep |

A transaction comes into ODS with the record below:

PRODUCT | CUSTOMER | SALES REP | QUANTITY | UNIT PRICE
Audi TT18 | Doe10224 | 2424242 | 1 | 35,000

In the transaction above, the code 2424242 appears in the SALES REP column. As this row is processed, a new row has to be added to the Sales Rep reference table. This allows the transaction to be loaded successfully.

DWKEY | REP NAME | REP MANAGER
2424242 | Unknown |

A data warehouse key (8888889) is also added to the transaction.

PRODUCT | CUSTOMER | SALES REP | QUANTITY | UNIT PRICE | DWKEY
Audi TT18 | Doe10224 | 2424242 | 1 | 35,000 | 8888889

Some warehouse administrators like to have an error log entry generated to identify the addition of a new reference table entry. This can be achieved simply by adding the following entries to an error log.

ERROR CODE | TABLE NAME | KEY NAME | KEY
NEWROW | SALES REP | SALES REP | 2424242

A second log entry can be added with the data warehouse key of the transaction in which the unknown key was found.

ERROR CODE | TABLE NAME | KEY NAME | KEY
UNKNKEY | ORDERS | SALES REP | 8888889

As with missing keys, error reporting is not essential because the unknown status is clearly visible through the standard end-user reporting.

Moreover, regardless of the error logging, the system is self-healing because the newly added reference data entry will be updated with full details as soon as these changes appear in a reference data feed.

This would result in the reference data entry looking complete.

DWKEY | REP NAME | REP MANAGER
2424242 | David Digby | Mark Smith
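A hedged SQL sketch of the same sequence follows (names are taken from the example above; the ERROR_LOG table and its columns are hypothetical, and in a PowerCenter mapping these steps would be implemented with lookups, expressions, and update strategies):

-- Step 1: add the unknown key to the reference table as a placeholder so the
-- transaction loads without breaking referential integrity.
INSERT INTO SALES_REP (DWKEY, REP_NAME)
VALUES (2424242, 'Unknown');

-- Step 2: optional error-log entries record the new reference row and the
-- transaction in which the unknown key was found.
INSERT INTO ERROR_LOG (ERROR_CODE, TABLE_NAME, KEY_NAME, KEY_VALUE)
VALUES ('NEWROW', 'SALES REP', 'SALES REP', '2424242');

INSERT INTO ERROR_LOG (ERROR_CODE, TABLE_NAME, KEY_NAME, KEY_VALUE)
VALUES ('UNKNKEY', 'ORDERS', 'SALES REP', '8888889');

-- Step 3: the entry self-heals when full details arrive in a later reference feed.
UPDATE SALES_REP
SET    REP_NAME    = 'David Digby',
       REP_MANAGER = 'Mark Smith'
WHERE  DWKEY = 2424242;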

Employing the Informatica recommended key management strategy produces the following benefits:

• All rows can be loaded into the data warehouse
• All objects are allocated a unique key
• Referential integrity is maintained
• Load dependencies are removed


Mapping Design

Challenge

Optimizing PowerCenter to create an efficient execution environment.

Description

Although PowerCenter environments vary widely, most sessions and/or mappings can benefit from the implementation of common objects and optimization procedures. Follow these procedures and rules of thumb when creating mappings to help ensure optimization.

General Suggestions for Optimizing

1. Reduce the number of transformations. There is always overhead involved in moving data between transformations.

2. Consider more shared memory for a large number of transformations. Session shared memory between 12MB and 40MB should suffice.

3. Calculate once, use many times.
   o Avoid calculating or testing the same value over and over.
   o Calculate it once in an expression, and set a True/False flag.
   o Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation.

4. Only connect what is used.
   o Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier.
   o This is also helpful for maintenance. If a transformation needs to be reconnected, it is best to only have necessary ports set as input and output to reconnect.
   o In lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking. It also makes the generated SQL override as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.

5. Watch the data types.
   o The engine automatically converts compatible types.
   o Sometimes data conversion is excessive. Data types are automatically converted when types are different between connected ports. Minimize data type changes between transformations by planning data flow prior to developing the mapping.

6. Facilitate reuse.
   o Plan for reusable transformations upfront.
   o Use variables. Use both mapping variables and ports that are variables. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform a disconnected lookup call only once instead of multiple times.
   o Use mapplets to encapsulate multiple reusable transformations.
   o Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.

7. Only manipulate data that needs to be moved and transformed.
   o Reduce the number of non-essential records that are passed through the entire mapping.
   o Use active transformations that reduce the number of records as early in the mapping as possible (i.e., place filters and aggregators as close to the source as possible).
   o Select the appropriate driving/master table when using joins. The table with the lesser number of rows should be the driving/master table for a faster join.

8. Utilize single-pass reads.
   o Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads the source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source for each Source Qualifier.
   o Remove or reduce field-level stored procedures. If you use field-level stored procedures, the PowerCenter server has to make a call to that stored procedure for every row, slowing performance.

Lookup Transformation Optimizing Tips

1. When your source is large, cache lookup table columns for those lookup tables of 500,000 rows or less. This typically improves performance by 10 to 20 percent.

2. The rule of thumb is not to cache any table over 500,000 rows. This is only true if the standard row byte count is 1,024 or less; if the row byte count is more than 1,024, the 500K row limit has to be adjusted down as the number of bytes increases (i.e., a 2,048-byte row can drop the cache row count to between 250K and 300K, so the lookup table should not be cached in this case). This is just a general rule, though: try running the session with the large lookup cached and not cached, as caching is often still faster on very large lookup tables.

3. When using a Lookup Table Transformation, improve lookup performance by placing all conditions that use the equality operator (=) first in the list of conditions under the condition tab.

4. Cache lookup tables only if the number of lookup calls is more than 10 to 20 percent of the lookup table rows. For a smaller number of lookup calls, do not cache if the number of lookup table rows is large. For small lookup tables (i.e., fewer than 5,000 rows), cache if there are more than 5 to 10 lookup calls.

5. Replace lookups with DECODE or IIF for small sets of values.

6. If lookups are cached and performance is poor, consider replacing them with an unconnected, uncached lookup.

7. For overly large lookup tables, use dynamic caching along with a persistent cache. Cache the entire table to a persistent file on the first run and enable the update-else-insert option on the dynamic cache; the engine will then never have to go back to the database to read data from this table. You can also partition this persistent cache at run time for further performance gains.

8. Review complex expressions.
   • Examine mappings via Repository Reporting and Dependency Reporting within the mapping.
   • Minimize aggregate function calls.
   • Replace an Aggregator Transformation with an Expression Transformation and an Update Strategy Transformation for certain types of aggregations.
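Related to the caching tips above, the size of a lookup cache is driven largely by the columns the lookup actually selects, so removing unused ports (or trimming the lookup SQL override) is often the cheapest win. A minimal sketch, assuming a hypothetical CUSTOMER_DIM lookup table where only CUSTOMER_ID is used as the condition and only CUSTOMER_KEY is returned:

-- Default lookup SQL tends to select every port defined on the transformation:
SELECT CUSTOMER_KEY, CUSTOMER_ID, CUSTOMER_NAME, ADDRESS_LINE1, ADDRESS_LINE2, PHONE
FROM   CUSTOMER_DIM;

-- Trimmed lookup: only the condition column and the returned key remain,
-- so each cached row holds far fewer bytes and the cache builds faster.
SELECT CUSTOMER_KEY, CUSTOMER_ID
FROM   CUSTOMER_DIM;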

Operations and Expression Optimizing Tips

1. Numeric operations are faster than string operations.
2. Optimize char-varchar comparisons (i.e., trim spaces before comparing).
3. Operators are faster than functions (i.e., || vs. CONCAT).
4. Optimize IIF expressions.
5. Avoid date comparisons in lookups; replace them with string comparisons.
6. Test expression timing by replacing the expression with a constant.
7. Use flat files.

• Flat files located on the server machine load faster than a database located on the server machine.

• Fixed-width files are faster to load than delimited files because delimited files require extra parsing.

• If intricate transformations are required, consider first loading the source flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.

8. If working with sources that cannot return sorted data (e.g., web logs), consider using the Sorter Advanced External Procedure.

9. Use a Router Transformation to separate data flows instead of multiple Filter Transformations.

10. Use a Sorter Transformation or hash-auto keys partitioning before an Aggregator Transformation to optimize the aggregate. With a Sorter Transformation, the Sorted Ports option can be used, even if the original source cannot be ordered.

11. Use a Normalizer Transformation to pivot rows rather than multiple instances of the same target.

12. Rejected rows from an update strategy are logged to the bad file. Consider filtering before the update strategy if retaining these rows is not critical because logging causes extra overhead on the engine. Choose the option in the update strategy to discard rejected rows.

13. When using a Joiner Transformation, be sure to make the source with the smallest amount of data the Master source.

14. If an update override is necessary in a load, consider using a Lookup transformation just in front of the target to retrieve the primary key. The primary key update will be much faster than the non-indexed lookup override.
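As a sketch of the idea in tip 14, the target update override matches rows on whatever columns appear in its WHERE clause, so retrieving the surrogate key with a lookup first lets the override update by the indexed key rather than a non-indexed natural key. The CUSTOMER_DIM table and its columns are illustrative; :TU references the ports on the target transformation:

-- Update override keyed on a non-indexed natural key (slower):
UPDATE CUSTOMER_DIM
SET    CUSTOMER_NAME = :TU.CUSTOMER_NAME,
       PHONE         = :TU.PHONE
WHERE  CUSTOMER_NK   = :TU.CUSTOMER_NK;

-- After a lookup retrieves CUSTOMER_KEY, the override can match on the
-- indexed surrogate key instead (faster):
UPDATE CUSTOMER_DIM
SET    CUSTOMER_NAME = :TU.CUSTOMER_NAME,
       PHONE         = :TU.PHONE
WHERE  CUSTOMER_KEY  = :TU.CUSTOMER_KEY;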

Suggestions for Using Mapplets


A mapplet is a reusable object that represents a set of transformations. It allows you to reuse transformation logic and can contain as many transformations as necessary. Use the Mapplet Designer to create mapplets.

1. Create a mapplet when you want to use a standardized set of transformation logic in several mappings. For example, if you have several fact tables that require a series of dimension keys, you can create a mapplet containing a series of Lookup transformations to find each dimension key. You can then use the mapplet in each fact table mapping, rather than recreate the same lookup logic in each mapping.

2. To create a mapplet, add, connect, and configure transformations to complete the desired transformation logic. After you save a mapplet, you can use it in a mapping to represent the transformations within the mapplet. When you use a mapplet in a mapping, you use an instance of the mapplet. All uses of a mapplet are tied to the parent mapplet. Hence, all changes made to the parent mapplet logic are inherited by every child instance of the mapplet. When the server runs a session using a mapplet, it expands the mapplet. The server then runs the session as it would any other session, passing data through each transformation in the mapplet as designed.

3. A mapplet can be active or passive depending on the transformations in the mapplet. Active mapplets contain at least one active transformation. Passive mapplets only contain passive transformations. Being aware of this property when using mapplets can save time when debugging invalid mappings.

4. Unsupported transformations that should not be used in a mapplet include: COBOL source definitions, normalizer, non-reusable sequence generator, pre- or post-session stored procedures, target definitions, and PowerMart 3.5-style lookup functions.

5. Do not reuse mapplets if you only need one or two transformations of the mapplet while all other calculated ports and transformations are obsolete.

6. Source data for a mapplet can originate from one of two places:

• Sources within the mapplet. Use one or more source definitions connected to a Source Qualifier or ERP Source Qualifier transformation. When you use the mapplet in a mapping, the mapplet provides source data for the mapping and is the first object in the mapping data flow.

• Sources outside the mapplet. Use a mapplet Input transformation to define input ports. When you use the mapplet in a mapping, data passes through the mapplet as part of the mapping data flow.

7. To pass data out of a mapplet, create mapplet output ports. Each port in an Output transformation connected to another transformation in the mapplet becomes a mapplet output port.

• Active mapplets with more than one Output transformation. You need one target in the mapping for each Output transformation in the mapplet. You cannot use only one data flow of the mapplet in a mapping.

• Passive mapplets with more than one Output transformation. Reduce to one Output transformation; otherwise, you need one target in the mapping for each Output transformation in the mapplet. This means you cannot use only one data flow of the mapplet in a mapping.


Mapping Templates

Challenge

Mapping Templates demonstrate proven solutions for tackling challenges that commonly occur during data integration development efforts. Mapping Templates can be used to make the development phase of a project more efficient. Mapping Templates can also serve as a medium to introduce development standards into the mapping development process that developers need to follow.

A wide array of Mapping Template examples can be obtained for the most current PowerCenter version from the Informatica Customer Portal. As "templates," each of the objects in Informatica's Mapping Template Inventory illustrates the transformation logic and steps required to solve specific data integration requirements. These sample templates, however, are meant to be used as examples, not as means to implement development standards.

Description

Reuse Transformation Logic

Templates can be heavily used in a data integration and warehouse environment, when loading information from multiple source providers into the same target structure, or when similar source system structures are employed to load different target instances. Using templates guarantees that any transformation logic that is developed and tested correctly, once, can be successfully applied across multiple mappings as needed. In some instances, the process can be further simplified if the source/target structures have the same attributes, by simply creating multiple instances of the session, each with its own connection/execution attributes, instead of duplicating the mapping.

Implementing Development Techniques

When the process is not simple enough to allow usage based on the need to duplicate transformation logic to load the same target, Mapping Templates can help to reproduce transformation techniques. In this case, the implementation process requires more than just replacing source/target transformations. This scenario is most useful when certain logic (i.e., logical group of transformations) is employed across mappings. In many instances this can be further simplified by making use of mapplets.


Transport mechanism

Once Mapping Templates have been developed, they can be distributed by any of the following procedures:

• Copy the mapping from the development area to the desired repository/folder
• Export the mapping template into XML and import it to the desired repository/folder

Mapping template examples

The following Mapping Templates can be downloaded from the Informatica Customer Portal and are listed by subject area:

Common Data Warehousing Techniques

• Aggregation using Sorted Input
• Tracking Dimension History
• Constraint-Based Loading
• Loading Incremental Updates
• Tracking History and Current
• Inserts or Updates

Transformation Techniques

• Error Handling Strategy
• Flat File Creation with Headers and Footers
• Removing Duplicate Source Records
• Transforming One Record into Multiple Records
• Dynamic Caching
• Sequence Generator Alternative
• Streamline a Mapping with a Mapplet
• Reusable Transformations (Customers)
• Using a Sorter
• Pipeline Partitioning Mapping Template
• Using Update Strategy to Delete Rows
• Loading Heterogeneous Targets
• Load Using External Procedure

Advanced Mapping Concepts

• Aggregation Using Expression Transformation
• Building a Parameter File
• Best Build Logic
• Comparing Values Between Records

Source-Specific Requirements

• Processing VSAM Source Files
• Processing Data from an XML Source
• Joining a Flat File with a Relational Table


Industry-Specific Requirements

• Loading SWIFT 942 Messages
• Loading SWIFT 950 Messages


Naming Conventions

Challenge

Choosing a good naming standard for use in the repository and adhering to it.

Description

Although naming conventions are important for all repository and database objects, the suggestions in this Best Practice focus on the former. Choosing a convention and sticking with it is the key.

Having a good naming convention will facilitate smooth migration and improve readability for anyone reviewing or carrying out maintenance on the repository objects by helping them easily understand the processes being effected. If consistent names and descriptions are not used, more time will be taken to understand the working of mappings and transformation objects. If there is no description, a developer will have to spend considerable time going through an object or mapping to understand its objective.

The following pages offer some suggestions for naming conventions for various repository objects. Whatever convention is chosen, it is important to do this very early in the development cycle and communicate the convention to project staff working on the repository. The policy can be enforced by peer review and at test phases by adding process to check conventions to test plans and test execution documents.

Suggested Naming Conventions

• Application Source Qualifier: ASQ_TransformationName_SourceTable1_SourceTable2. Represents data from an application source.

• Expression Transformation: EXP_Function that leverages the expression and/or a name that describes the processing being done.

• Custom Transformation: CT_TransformationName that describes the processing being done.

• Sequence Generator Transformation: SEQ_Descriptor; if generating keys for a target table entity, refer to that entity.

• Lookup Transformation: LKP_ plus the lookup table name or the item being obtained by the lookup, since there can be different lookups on a single table.

• Source Qualifier Transformation: SQ_{SourceTable1}_{SourceTable2}. Using all source tables can be impractical if there are a lot of tables in a source qualifier, so refer to the type of information being obtained, for example a certain type of product – SQ_SALES_INSURANCE_PRODUCTS.

• Aggregator Transformation: AGG_{function} that leverages the expression or a name that describes the processing being done.

• Filter Transformation: FIL_ or FILT_{function} that leverages the expression or a name that describes the processing being done.

• Update Strategy Transformation: UPD_{TargetTableName(s)} that leverages the expression or a name that describes the processing being done. If the update carries out inserts or updates only, add ins or upd to the name, e.g., UPD_UPDATE_EXISTING_EMPLOYEES.

• MQ Source Qualifier: SQ_MQ_Descriptor that defines the messaging being selected.

• Normalizer Transformation: NRM_{TargetTableName(s)} that leverages the expression or a name that describes the processing being done.

• Union Transformation: UN_Descriptor

• Router Transformation: RTR_{Descriptor}

• XML Generator: XMG_Descriptor that defines the target message.

• XML Parser: XMP_Descriptor that defines the messaging being selected.

• XML Source Qualifier: XMSQ_Descriptor that defines the data being selected.

• Rank Transformation: RNK_{TargetTableName(s)} that leverages the expression or a name that describes the processing being done.

• Stored Procedure Transformation: SP_{StoredProcedureName}

• External Procedure Transformation: EXT_{ProcedureName}

• Joiner Transformation: JNR_{SourceTable/FileName1}_{SourceTable/FileName2}, or use more general descriptions for the content in the data flows, as joiners are not only used to provide pure joins between heterogeneous source tables and files.

• Target: TGT_Target_Name

• Mapplet: mplt_{description}

• Mapping Name: m_{target}_{descriptor}

• Email Object: email_{Descriptor}

Port Names

Ports names should remain the same as the source unless some other action is performed on the port. In that case, the port should be prefixed with the appropriate name.

When the developer brings a source port into a lookup or expression, the port should be prefixed with IN_. This will help the user immediately identify the ports that are being inputted without having to line up the ports with the input checkbox.


Generated output ports can also be prefixed. This helps trace the port value throughout the mapping as it travels through other transformations. If the autolink-by-name feature is to be used, however, outputs may be better left named the same as the target port in the next transformation. For variables inside a transformation, the developer can use the prefix ‘v’, ‘var_’, or ‘v_’ plus a meaningful name.

The following port standards will be applied when creating a transformation object. The exceptions are the Source Definition, the Source Qualifier, the Lookup, and the Target Definition ports, which must not change since the port names are used to retrieve data from the database.

Other transformations that are not applicable to the port standards are:

• Normalizer: The ports created in the Normalizer are automatically formatted when the developer configures it.

• Sequence Generator: The ports are reserved words.

• Router: The output ports are automatically created; therefore prefixing the input ports with an I_ will prefix the output ports with I_ as well. The port names should not have any prefix.

• Sorter, Update Strategy, Transaction Control, and Filter: The ports are always input and output. There is no need to rename them unless they are prefixed. Prefixed port names should be removed.

• Union: The group ports are automatically assigned to the input and output; therefore prefixing with anything is reflected in both the input and output. The port names should not have any prefix.

All other transformation object ports can be prefixed or suffixed with:

• ‘in_’ or ‘i_’ for Input ports
• ‘o_’ or ‘_out’ for Output ports
• ‘io_’ for Input/Output ports
• ‘v’, ‘v_’ or ‘var_’ for variable ports

They can also:

• Have the Source Qualifier port name.
• Be unique.
• Be meaningful.
• Be given the target port name.
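As an illustration only (the transformation and port names below are hypothetical), an Expression transformation that calculates sales tax might name its ports as follows:

• in_ITEM_PRICE - input port passed in from the previous transformation
• v_TAX_RATE - variable port holding the rate used in the calculation
• o_SALES_TAX - output port carrying the calculated tax value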

Transformation Descriptions

This section defines the standards to be used for transformation descriptions in the Designer.

Source qualifier description

The description should include the aim of the source qualifier and the data it is intended to select. It should also indicate if any SQL overrides are used. If so, it should describe


the filters used. Some projects prefer the SQL statement to be included in the description as well.

Lookup transformation description

Describe the lookup along the lines of: “The [lookup attribute] obtained from [lookup table name] to retrieve the [lookup attribute name].”

Where:

• Lookup attribute is the name of the column being passed into the lookup and is used as the lookup criteria.

• Lookup table name is the table on which the lookup is being performed.
• Lookup attribute name is the name of the attribute being returned from the lookup.

If appropriate, specify the condition when the lookup is actually executed.

It is also important to note lookup features such as persistent cache or dynamic lookup.

Expression transformation description

Each Expression transformation description must be in the format:

“This expression … [explanation of what transformation does].”

Expressions can be distinctly different depending on the situation; therefore the explanation should be specific to the actions being performed.

Within each Expression, transformation ports have their own description in the format:

“This port … [explanation of what the port is used for].”

Aggregator transformation descriptions

Each Aggregator transformation description must be in the format:

“This Aggregator … [explanation of what transformation does].”

Aggregators can be distinctly different, depending on the situation; therefore the explanation should be specific to the actions being performed.

Within each Aggregator, transformation ports have their own description in the format:

“This port … [explanation of what the port is used for].”

Sequence generators transformation descriptions

Each Sequence Generator transformation description must be in the format:


“This Sequence Generator provides the next value for the [column name] on the [table name].”

Where:

• table name is the table being populated by the sequence number, and
• column name is the column within that table being populated.

Joiner transformation descriptions

Each Joiner transformation description must be in the format:

“This Joiner uses … [joining field names] from [joining table names].”

Where:

• joining field names are the names of the columns on which the join is done, and
• joining table names are the tables being joined.

Normalizer transformation descriptions

Each Normalizer transformation description must be in the format:

“This Normalizer … [explanation].”

Where explanation is an explanation of what the Normalizer does.

Filter transformation description

Each Filter transformation description must be in the format:

“This Filter processes … [explanation].”

Where explanation is an explanation of what the filter criteria are and what they do.

Stored procedure transformation descriptions

An explanation of the stored procedure’s functionality within the mapping. What does it return in relation to the input ports?

Input transformation descriptions

Describe the input values and their intended use in the mapplet

Output transformation descriptions

Describe the output ports and the subsequent use of those values. As an example, for an exchange rate mapplet, describe what currency the output value will be in. Answer


questions like: Is the currency fixed, or is it based on other data? What kind of rate is used: a fixed inter-company rate, an interbank rate, a business rate, or a tourist rate? Has the conversion gone through an intermediate currency?

Update strategies transformation description

Describe what the Update Strategy does and whether it is fixed in its function or determined by a calculation.

Sorter transformation description

An explanation of the port(s) being sorted and their sort direction.

Router transformation description

An explanation that describes the groups and their function.

Union transformation description

Describe the source inputs and indicate what further processing on those inputs (if any) is expected to take place in later transformations in the mapping.

Transaction control transformation description

Describe the process behind the transaction control and the function of the control to commit or rollback.

Mapping Comments

Describe the source data obtained and the structure (file, table, or facts and dimensions) that it populates. Remember to use business terms along with more technical details such as table names. This will help when maintenance has to be carried out or if issues arise that need to be discussed with business analysts.

Mapplet Comments

An explanation of the process that the mapplet carries out. Also see the notes above on descriptions for the Input and Output transformations.

Shared Objects

Any object within a folder can be shared. These objects are sources, targets, mappings, transformations, and mapplets. To share objects in a folder, the folder must be designated as shared. Once the folder is shared, users are allowed to create shortcuts to objects in the folder.

If the developer has an object that he or she wants to use in several mappings or across multiple folders, like an Expression transformation that calculates sales tax, the developer can place the object in a shared folder. Then use the object in other folders


by creating a shortcut to the object. In this case, the naming convention is the prefix SC_; for instance, SC_mltCREATION_SESSION or SC_DUAL.

Shared Folders

Shared folders are used when objects are needed across folders but the developer wants to maintain them in only one central location. In addition to ease of maintenance, shared folders help reduce the size of the repository since shortcuts are used to link to the original, instead of copies.

Only users with the proper permissions can access these shared folders. It is the responsibility of these users to migrate the folders across the repositories and to maintain the objects within those folders with the help of the developers. For instance, if an object is created by a developer and it is to be shared, the developer will provide details of the object and the level at which the object is to be shared before the Administrator will accept it as a valid entry into the shared folder. The developers, not necessarily the creator, control the maintenance of the object, as they will need to ensure that a change they require will not negatively impact other objects.

Workflow Manager Objects

The suggested naming conventions for Workflow Manager objects are:

• Session Name: s_{MappingName}
• Command Object: cmd_{Descriptor}
• Worklet Name: Wk or Wklt_{Descriptor}
• Workflow Name: Wkf or wf_{Workflow Descriptor}
• Email Task: Email_ or eml_{Email Descriptor}
• Decision Task: dcn_{Condition_Descriptor}
• Assign Task: asgn_{Variable_Descriptor}
• Timer Task: Timer_ or tim_{Descriptor}
• Control Task: ctl_{WorkFlow_Descriptor}. Specify when and how the PowerCenter Server stops or aborts a workflow by using the Control task in the workflow.
• Event Wait Task: Wait_ or evtw_{Event_Descriptor}. The Event-Wait task waits for an event to occur; once the event triggers, the PowerCenter Server continues executing the rest of the workflow.
• Event Raise Task: Raise_ or evtr_{Event_Descriptor}. The Event-Raise task represents a user-defined event. When the PowerCenter Server runs the Event-Raise task, it triggers the event. Use the Event-Raise task with the Event-Wait task to define events.
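As an illustration only (the object names below are hypothetical), a session built for a mapping named m_CUSTOMER_DIM_LOAD could be named s_m_CUSTOMER_DIM_LOAD, run within a workflow named wf_CUSTOMER_DIM_DAILY, with a failure notification task named eml_CUSTOMER_DIM_FAILURE following these conventions.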

ODBC Data Source Names

Be sure to set up all Open Database Connectivity (ODBC) data source names (DSNs) the same way on all client machines. PowerCenter uniquely identifies a source by its Database Data Source (DBDS) and its name. The DBDS is the same name as the ODBC DSN since the PowerCenter Client talks to all databases through ODBC.


Also, set up the ODBC DSNs as system DSNs so that all users of a machine can see the DSN. This reduces the chance of discrepancies creeping in when users work on different (i.e., colleagues') machines and have to recreate a DSN on each one.

If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table using different names. For example, machine 1 has ODBC DSN Name0 that points to database1. TableA gets analyzed on machine 1 and is uniquely identified as Name0.TableA in the repository. Machine 2 has ODBC DSN Name1 that points to database1. TableA gets analyzed on machine 2 and is uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by multiple names, creating confusion for developers, testers, and potentially end users.

Also, refrain from using environment tokens in the ODBC DSN. For example, do not call it dev_db01. When migrating objects from dev, to test, to prod, PowerCenter will wind up with source objects called dev_db01 in the production repository. ODBC database names should clearly describe the database they reference to ensure that users do not incorrectly point sessions to the wrong databases.

Database Connection Information

Security considerations may dictate that the company name of the database or project be used instead of {user}_{database name} except for developer scratch schemas that are not found in test or production environments. Be careful not to include machine names or environment tokens in the database connection name. Database connection names must be very generic to be understandable and ensure a smooth migration.

The convention should be applied across all development, test, and production environments. This allows seamless migration of sessions when migrating between environments. If an administrator uses the Copy Folder function for migration, session information is also copied. If the Database Connection information does not already exist in the folder the administrator is copying to, it is also copied. So, if the developer uses connections with names like Dev_DW in the development repository, they will eventually wind up in the test and even in the production repositories as the folders are migrated. Manual intervention is then necessary to change connection names, user names, passwords, and possibly even connect strings.

Instead, if the developer just has a DW connection in each of the three environments, when the administrator copies a folder from the development environment to the test environment, the sessions will automatically use the existing connection in the test repository. With the right naming convention, you can migrate sessions to the test repository without manual intervention.

Tip: Have the Repository Administrator or DBA setup all connections in all environments based on the issues discussed in this document at the beginning of a project and avoid developers creating their own with different conventions and possibly duplicating connections. These connections can then be protected though permission options so that only certain individuals can modify them.

PowerCenter PowerExchange Application/Relational Connections


Before the PowerCenter Server can access a source or target in a session, you must configure connections in the Workflow Manager. When you create or modify a session that reads from, or writes to, a database, you can select only configured source and target databases. Connections are saved in the repository.

For PowerExchange Client for PowerCenter, you configure relational database and/or application connections. The connection you configure depends on the type of source data you want to extract and the extraction mode.

The following lists each source type and extraction mode with the corresponding connection (application or relational), connection type, and recommended naming convention:

• DB2/390 Bulk Mode: Relational connection, PWX DB2390, PWX_batch_DSNName
• DB2/390 Change Mode: Application connection, PWX DB2390 CDC Change, PWX_CDC_DSNName
• DB2/390 Real Time Mode: Application connection, PWX DB2390 CDC Real Time, PWX_RT_DSNName
• DB2/400 Bulk Mode: Relational connection, PWX DB2400, PWX_batch_DSNName
• DB2/400 Change Mode: Application connection, PWX DB2400 CDC Change, PWX_CDC_DSNName
• DB2/400 Real Time Mode: Application connection, PWX DB2400 CDC Real Time, PWX_RT_DSNName
• IMS Batch Mode: Application connection, PWX NRDB Batch, PWX_NRDB_Recon_Name
• IMS Change Mode: Application connection, PWX NRDB CDC Change, PWX_CDC_Recon_Name
• IMS Real Time: Application connection, PWX NRDB CDC Real Time, PWX_RT_Recon_Name
• VSAM Batch Mode: Application connection, PWX NRDB Batch, PWX_NRDB_Coll_Identifier_Name
• VSAM Change Mode: Application connection, PWX NRDB CDC Change, PWX_CDC_Coll_Identifier_Name
• VSAM Real Time Mode: Application connection, PWX NRDB CDC Real Time, PWX_RT_Coll_Identifier_Name
• Oracle Real Time: Application connection, PWX Oracle CDC Real Time, PWX_RT_Instance_Name

PowerCenter PowerExchange Target Connections

The connection you configure depends on the type of target data you want to load.

For each target type, the connection type and recommended naming convention are:

• DB2/390: PWX DB2390 relational database connection, PWXT_DSNName
• DB2/400: PWX DB2400 relational database connection, PWXT_DSNName


Performing Incremental Loads

Challenge

Data warehousing incorporates very large volumes of data. The process of loading the warehouse without compromising its functionality and in a reasonable timescale is extremely difficult. The goal is to create a load strategy that can minimize downtime for the warehouse and allow quick and robust data management.

Description

As time windows shrink and data volumes increase, it is important to understand the impact of a suitable incremental load strategy. The design should allow data to be incrementally added to the data warehouse with minimal impact on the overall system. This Best Practice describes several possible load strategies.

Incremental Aggregation

Incremental aggregation is useful for applying captured changes in the source to aggregate calculations in a session. If the source changes only incrementally, and you can capture changes, you can configure the session to process only those changes. This allows the PowerCenter Server to update your target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.

If the session performs incremental aggregation, the PowerCenter Server saves index and data cache information to disk when the session finishes. The next time the session runs, the PowerCenter Server uses this historical information to perform the incremental aggregation. Set the “Incremental Aggregation” Session Attribute. For details see Chapter 22 in the Workflow Administration Guide.

Use incremental aggregation under the following conditions:

• Your mapping includes an aggregate function.
• The source changes only incrementally.
• You can capture incremental changes (i.e., by filtering source data by timestamp).
• You get only delta records (i.e., you may have implemented the CDC (Change Data Capture) feature of PowerExchange if the source is on a mainframe).

Do not use incremental aggregation in the following circumstances:


• You cannot capture new source data.
• Processing the incrementally changed source significantly changes the target: if processing the incrementally changed source alters more than half the existing target, the session may not benefit from using incremental aggregation.
• Your mapping contains percentile or median functions.

Conditions that lead to making a decision on an incremental strategy are:

• Error handling and loading and unloading strategies for recovering, reloading, and unloading data.
• History tracking: keeping track of what has been loaded and when.
• Slowly changing dimensions. Informatica Mapping Wizards are a good start to an incremental load strategy; the Wizards generate generic mappings as a starting point (refer to Chapter 14 in the Designer Guide).

Source Analysis

Data sources typically fall into the following possible scenarios:

• Delta records – Records supplied by the source system include only new or changed records. In this scenario, all records are generally inserted or updated into the data warehouse.

• Record indicator or flags – Records that include columns that specify the intention of the record to be populated into the warehouse. Records can be selected based upon this flag for all inserts, updates and deletes.

• Date stamped data – Data is organized by timestamps. Data is loaded into the warehouse based upon the last processing date or the effective date range.

• Key values are present – When only key values are present, data must be checked against what has already been entered into the warehouse. All values must be checked before entering the warehouse.

• No key values present – Surrogate keys are created and all data is inserted into the warehouse based upon validity of the records.

Identify Which Records Need to be Compared

After the sources are identified, you need to determine which records need to be entered into the warehouse and how. Here are some considerations:

• Compare with the target table. When source delta loads are received determine if the record exists in the target table. The timestamps and natural keys of the record are the starting point for identifying whether the record is new, modified or should be archived. If the record does not exist in the target, insert the record as a new row. If it does exist, determine if the record needs to be updated, inserted as a new record, or removed (deleted from target or filtered out and not added to the target).

• Record indicators. Record indicators can be beneficial when lookups into the target are not necessary. Take care to ensure that the record exists for updates or deletes, or that the record can be successfully inserted. More design effort may be needed to manage errors in these situations.


Determine the Method of Comparison

There are three main strategies in mapping design that can be used as a method of comparison:

• Joins of sources to targets - Records are directly joined to the target using Source Qualifier join conditions or using joiner transformations after the source qualifiers (for heterogeneous sources). When using joiner transformations, take care to ensure the data volumes are manageable.

• Lookup on target - Using the lookup transformation, lookup the keys or critical columns in the target relational database. Consider the caches and indexing possibilities.

• Load table log - Generate a log table of records that have already been inserted into the target system. You can use this table for comparison with lookups or joins, depending on the need and volume. For example, store keys in a separate table and compare source records against this log table to determine load strategy. Another example is to store the dates up to which data has already been loaded into a log table.

Source-Based Load Strategies

Complete incremental loads in a single file/table

The simplest method for incremental loads is from flat files or a database in which all records are going to be loaded. This strategy requires bulk loads into the warehouse with no overhead on processing of the sources or sorting the source records.

Data can be loaded directly from the source locations into the data warehouse. There is no additional overhead produced in moving these sources into the warehouse.

Date-stamped data

This method involves data that has been stamped using effective dates or sequences. The incremental load can be determined by dates greater than the previous load date or data that has an effective key greater than the last key processed.

With the use of relational sources, the records can be selected based on this effective date, and only those records past a certain date are loaded into the warehouse. Views can also be created to apply the selection criteria. This way, the processing does not have to be incorporated into the mappings but is kept on the source component. Placing the load strategy into the other mapping components is much more flexible and controllable by the data integration developers and by metadata.

Non-relational data can be filtered as records are loaded based upon the effective dates or sequenced keys. A router transformation or a filter can be placed after the source qualifier to remove old records.


To compare the effective dates, you can use mapping variables to provide the previous date processed. The alternative is to use control tables to store the dates and update the control table after each load.

For detailed instruction on how to select dates, refer to Using Parameters, Variables and Parameter Files in Chapter 8 of the Designer Guide.

Changed data based on keys or record information

Data that is uniquely identified by keys can be selected based upon selection criteria. For example, records that contain key information such as primary keys or alternate keys can be used to determine if they have already been entered into the data warehouse. If they exist, you can also check to see if you need to update these records or discard the source record.

It may be possible to do a join with the target tables in which new data can be selected and loaded into the target. It may also be feasible to lookup in the target to see if the data exists or not.

Target-Based Load Strategies

Load directly into the target

Loading directly into the target is possible when the data is going to be bulk loaded. The mapping will then be responsible for error control, recovery, and update strategy.

Load into flat files and bulk load using an external loader

The mapping will load data directly into flat files. You can then invoke an external loader to bulk load the data into the target. This method reduces the load times (with less downtime for the data warehouse) and also provides a means of maintaining a history of data being loaded into the target. Typically, this method is only used for updates into the warehouse.

Load into a mirror database

The data is loaded into a mirror database to avoid downtime of the active data warehouse. After data has been loaded, the databases are switched, making the mirror the active database and the active the mirror.

Using Mapping Variables and Parameter Files

You can use a mapping variable to perform incremental loading. The mapping variable is used in the source qualifier or join condition to select only the new data that has been entered based on the create_date or the modify_date, whichever date can be used to identify a newly inserted record. However, the source system must have a reliable date to use.

The steps involved in this method are:


Step 1: Create mapping variable

In the Mapping Designer, choose Mappings-Parameters and Variables. Or, to create variables for a mapplet, choose Mapplet-Parameters and Variables in the Mapplet Designer. Click Add and enter the name of the variable. In this case, make your variable a date/time. For the Aggregation option, select MAX.

In the same screen, state your initial value. This is the date at which the load should start. The date can use any one of these formats:

• MM/DD/RR
• MM/DD/RR HH24:MI:SS
• MM/DD/YYYY
• MM/DD/YYYY HH24:MI:SS

Step 2: Use the mapping variable in the source qualifier

The select statement should look like the following:

Select * from table A

where

CREATE_DATE > date('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')

Step 3: Use the mapping variable in an expression

For the purpose of this example, use an expression to work with the variable functions to set and use the mapping variable.

In the expression, create a variable port and use the SETMAXVARIABLE variable function and do the following:

SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)

CREATE_DATE is the date for which you want to store the maximum value.

You can use the variable functions in the following transformations:

• Expression
• Filter
• Router
• Update Strategy

For each row, the variable holds the maximum of the incoming source value and the current variable value. So, if one row comes through with 9/1/2004, the variable gets that value. If all subsequent rows are LESS than that, then 9/1/2004 is preserved.


When the mapping completes, the PERSISTENT value of the mapping variable is stored in the repository for the next run of your session. You can view the value of the mapping variable in the session log file.

The advantage of the mapping variable and incremental loading is that it allows the session to use only the new rows of data. No table is needed to store the max(date) since the variable takes care of it.

After a successful session run, the PowerCenter Server saves the final value of each variable in the repository. So when you run your session the next time, only new data from the source system is captured. If necessary, you can override the value saved in the repository with a value saved in a parameter file.
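As a minimal sketch, assuming a folder named DW_LOAD, a workflow named wf_incremental_load, and a session named s_m_incremental_load (all hypothetical names), the parameter file entry to override the variable might look like:

[DW_LOAD.WF:wf_incremental_load.ST:s_m_incremental_load]
$$INCREMENT_DATE=01/01/2005 00:00:00

When the session starts, the PowerCenter Server uses this value as the start value instead of the value saved in the repository.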


Real-Time Integration with PowerCenter

Challenge

Configure PowerCenter to work with PowerCenter Connect to process real-time data. This Best Practice discusses guidelines for establishing a connection with PowerCenter and setting up a real-time session to work with PowerCenter.

Description

PowerCenter with the real-time option can be used to integrate third-party messaging applications using a specific version of PowerCenter Connect. Each PowerCenter Connect version supports a specific industry-standard messaging application, such as PowerCenter Connect for MQSeries, PowerCenter Connect for JMS, and PowerCenter Connect for TIBCO. IBM MQ Series uses a queue to store and exchange data. Other applications, such as TIBCO and JMS, use a publish/subscribe model. In this case, the message exchange is identified using a topic.

Connection Setup

PowerCenter uses some attribute values in order to correctly connect and identify the third-party messaging application and message itself. Each version of PowerCenter Connect supplies its own connection attributes that need to be configured properly before running a real-time session.

PowerCenter Connect for MQ

1. In the Workflow Manager, connect to a repository and choose Connection -> Queue.
2. The Queue Connection Browser appears. Select New -> Message Queue.
3. The Connection Object Definition dialog box appears.

You need to specify three attributes in the Connection Object Definition dialog box:

• Name - the name for the connection. (Use <queue_name>_<QM_name> to uniquely identify the connection.)

• Queue Manager - the Queue Manager name for the message queue. (In Windows, the default Queue Manager name is QM_<machine name>.)

• Queue Name - the Message Queue name
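For example (the queue and queue manager names are illustrative only), a connection to the queue CUSTOMER_ORDERS on the queue manager QM_SERVER01 could be named CUSTOMER_ORDERS_QM_SERVER01.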


Obtaining the Queue Manager and Message Queue names

• Open the MQ Series Administration Console. The Queue Manager should appear on the left panel.

• Expand the Queue Manager icon. A list of the queues for the queue manager appears on the left panel.

Note that the Queue Manager’s name and Queue Name are case-sensitive.

PowerCenter Connect for JMS

PowerCenter Connect for JMS can be used to read or write messages from various JMS providers, such as IBM MQ Series JMS, BEA Weblogic Server, and IBM Websphere.

There are two types of JMS application connections:

• JNDI Application Connection, which is used to connect to a JNDI server during a session run.

• JMS Application Connection, which is used to connect to a JMS provider during a session run.

JNDI Application Connection Attributes:

• Name
• JNDI Context Factory
• JNDI Provider URL
• JNDI UserName
• JNDI Password

JMS Application Connection Attributes:

• Name
• JMS Destination Type
• JMS Connection Factory Name
• JMS Destination
• JMS UserName
• JMS Password

Configuring the JNDI Connection for IBM MQ Series

The JNDI settings for MQ Series JMS can be configured using a file system service or LDAP (Lightweight Directory Access Protocol).

The JNDI setting is stored in a file named JMSAdmin.config. The file should be installed in the MQSeries Java installation/bin directory.

• If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:


INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory

• Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory

Find the PROVIDER_URL settings.

If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following provider URL setting and provide a value for the JNDI directory.

PROVIDER_URL=file:/<JNDI directory>

<JNDI directory> is the directory where you want JNDI to store the .binding file.

Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the provider URL setting and specify a hostname.

#PROVIDER_URL=ldap://<hostname>/context_name

For example, you could specify:

PROVIDER_URL=ldap://<localhost>/o=infa,c=rc

If you want to provide a user DN and password for connecting to JNDI, you can remove the # from the following settings and enter a user DN and password:

PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test

The JMSAdmin.config settings correspond to the following attributes in the JNDI application connection in the Workflow Manager:

• INITIAL_CONTEXT_FACTORY corresponds to JNDI Context Factory
• PROVIDER_URL corresponds to JNDI Provider URL
• PROVIDER_USERDN corresponds to JNDI UserName
• PROVIDER_PASSWORD corresponds to JNDI Password
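Putting these settings together, a JMSAdmin.config configured for a file system service provider might contain lines such as the following (the JNDI directory shown is only an example):

INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory
PROVIDER_URL=file:/var/mqm/jndi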

Configuring the JMS Connection for IBM MQ Series

The JMS connection is defined using a tool in JMS called jmsadmin that is available in MQ Series Java installation/bin directory. Use this tool to configure the JMS Connection Factory.

The JMS Connection Factory can be a Queue Connection Factory or Topic Connection Factory.


• When Queue Connection Factory is used, define a JMS queue as the destination.
• When Topic Connection Factory is used, define a JMS topic as the destination.

The command to define a queue connection factory (qcf) is:

def qcf(<qcf_name>) qmgr(queue_manager_name) hostname (QM_machine_hostname) port (QM_machine_port)

The command to define JMS queue is:

def q(<JMS_queue_name>) qmgr(queue_manager_name) qu(queue_manager_queue_name)

The command to define JMS topic connection factory (tcf) is:

def tcf(<tcf_name>) qmgr(queue_manager_name) hostname (QM_machine_hostname) port (QM_machine_port)

The command to define the JMS topic is:

def t(<JMS_topic_name>) topic(pub/sub_topic_name)

The topic name must be unique. For example: topic (application/infa)
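As an illustration only (the queue manager, host, port, and object names below are hypothetical), the commands might be entered as follows:

def qcf(infaQCF) qmgr(QM_server01) hostname(mqhost01) port(1414)
def q(infaSourceQueue) qmgr(QM_server01) qu(SOURCE.DATA.QUEUE)
def tcf(infaTCF) qmgr(QM_server01) hostname(mqhost01) port(1414)
def t(infaTopic) topic(application/infa)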

The JMS object types correspond to the following attributes in the JMS application connection in the Workflow Manager:

• QueueConnectionFactory or TopicConnectionFactory corresponds to JMS Connection Factory Name
• JMS Queue Name or JMS Topic Name corresponds to JMS Destination

Configure the JNDI and JMS Connection for IBM Websphere

Configure the JNDI settings for IBM WebSphere to use IBM WebSphere as a provider for JMS sources or targets in a PowerCenterRT session.

JNDI Connection

Add the following option to the file JMSAdmin.bat to configure JMS properly:

-Djava.ext.dirs=<WebSphere Application Server>\bin

For example:

-Djava.ext.dirs=WebSphere\AppServer\bin

The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ Series Java/bin directory.


INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.wsInitialContextFactory

PROVIDER_URL=iiop://<hostname>/

For example:

PROVIDER_URL=iiop://localhost/

PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test

JMS Connection

The JMS configuration is similar to the JMS Connection for IBM MQ Series.

Configure the JNDI and JMS Connection for BEA Weblogic

Configure the JNDI settings for BEA Weblogic to use BEA Weblogic as a provider for JMS sources or targets in a PowerCenterRT session.

PowerCenter Connect for JMS and the JMS hosting WebLogic server do not need to be on the same server. PowerCenter Connect for JMS just needs a URL, as long as the URL points to the right place.

JNDI Connection

The Weblogic Server automatically provides a context factory and URL during the JNDI set-up configuration for WebLogic Server. Enter these values to configure the JNDI connection for JMS sources and targets in the Workflow Manager.

Enter the following value for JNDI Context Factory in the JNDI Application Connection in the Workflow Manager:

weblogic.jndi.WLInitialContextFactory

Enter the following value for JNDI Provider URL in the JNDI Application Connection in the Workflow Manager:

t3://<WebLogic_Server_hostname>:<port>

where WebLogic Server hostname is the hostname or IP address of the WebLogic Server and port is the port number for the WebLogic Server.
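For example (the host name and port are illustrative only):

t3://wlshost01:7001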

JMS Connection

The JMS connection is configured from the BEA WebLogic Server console. Select JMS -> Connection Factory.

The JMS Destination is also configured from the BEA Weblogic Server console.


From the Console pane, select Services > JMS > Servers > <JMS Server name> > Destinations under your domain.

Click Configure a New JMSQueue or Configure a New JMSTopic.

The WebLogic Server JMS object settings correspond to the following attributes in the JMS application connection in the Workflow Manager:

• Connection Factory Settings: JNDIName corresponds to JMS Connection Factory Name
• Destination Settings: JNDIName corresponds to JMS Destination

In addition to the JNDI and JMS settings, BEA Weblogic also offers a function called JMS Store, which can be used for persistent messaging when reading and writing JMS messages. The JMS Stores configuration is available from the Console pane: select Services > JMS > Stores under your domain.

Configuring the JNDI and JMS Connection for TIBCO

TIBCO Rendezvous Server does not adhere to JMS specifications. As a result, PowerCenter Connect for JMS cannot connect directly to the Rendezvous Server. TIBCO Enterprise Server, which is JMS-compliant, acts as a bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server. Configure a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server for PowerCenter Connect for JMS to be able to read messages from and write messages to TIBCO Rendezvous Server.

To create a connection-bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server, follow these steps:

1. Configure PowerCenter Connect for JMS to communicate with TIBCO Enterprise Server.

2. Configure TIBCO Enterprise Server to communicate with TIBCO Rendezvous Server.

Configure the following information in your JNDI application connection:

• JNDI Context Factory: com.tibco.tibjms.naming.TibjmsInitialContextFactory
• Provider URL: tibjmsnaming://<host>:<port>, where host and port are the host name and port number of the Enterprise Server.

To make a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server:

1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter as in the example below, so that TIBCO Enterprise Server can communicate with TIBCO Rendezvous messaging systems:

tibrv_transports = enabled


2. Enter the following transports in the transports.conf file:

[RV]
type = tibrv                          // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE     // only reliable/certified messages can transfer
daemon = tcp:localhost:7500           // default daemon for the Rendezvous server

The transports in the transports.conf configuration file specify the communication protocol between TIBCO Enterprise for JMS and the TIBCO Rendezvous system. The import and export properties on a destination can list one or more transports to use to communicate with the TIBCO Rendezvous system.

3. Optionally, specify the name of one or more transports for reliable and certified message delivery in the export property in the file topics.conf, as in the following example:

topicname export="RV"

The export property allows messages published to a topic by a JMS client to be exported to the external systems with configured transports. Currently, you can configure transports for TIBCO Rendezvous reliable and certified messaging protocols.

PowerCenter Connect for webMethods

When importing webMethods sources into the Designer, be sure the webMethods host file entry doesn’t contain a ‘.’ character. You can’t use fully-qualified names for the connection when importing webMethods sources. You can use fully-qualified names for the connection when importing webMethods targets because PowerCenter doesn’t use the same grouping method for importing sources and targets. To get around this, modify the host file to resolve the name to the IP address.

For example:

Host File:

crpc23232.crp.informatica.com crpc23232

Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when importing webMethods source definition. This step is only required for importing PowerCenter Connect for webMethods sources into the Designer.

If you are using the request/reply model in webMethods, PowerCenter needs to send an appropriate document back to the broker for every document it receives. PowerCenter populates some of the envelope fields of the webMethods target to enable webMethods broker to recognize that the published document is a reply from PowerCenter. The envelope fields ‘destid’ and ‘tag’ are populated for the request/reply model. ‘Destid’ should be populated from the ‘pubid’ of the source document and ‘tag’ should be populated from ‘tag’ of the source document. Use the option ‘Create Default Envelope


Fields’ when importing webMethods sources and targets into the Designer in order to make the envelope fields available in PowerCenter.

Configuring the PowerCenter Connect for webMethods connection

To create or edit PowerCenter Connect for webMethods connection select Connections -> Application -> webMethods Broker from the Workflow Manager.

PowerCenter Connect for webMethods connection attributes:

• Name
• Broker Host
• Broker Name
• Client ID
• Client Group
• Application Name
• Automatic Reconnect
• Preserve Client State

Enter the connection to the Broker Host in the following format: <hostname>:<port>.
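For example (the host name and port are illustrative only): wmbroker01:6849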

If you are using the request/reply method in webMethods, you have to specify a client ID in the connection. Be sure that the client ID used in the request connection is the same as the client ID used in the reply connection. Note that if you are using multiple request/reply document pairs, you need to setup different webMethods connections for each pair because they cannot share a client ID.

Setting Up Real-Time Session in PowerCenter

The PowerCenter real-time option uses a Zero Latency engine to process data from the messaging system. Depending on the messaging systems and the application that sends and receives messages, there may be a period when there are many messages and, conversely, there may be a period when there are no messages. PowerCenter uses the attribute ‘Flush Latency’ to determine how often the messages are being flushed to the target. PowerCenter also provides various attributes to control when the session ends.

The following reader attributes determine when a PowerCenter session should end:

• Message Count - Controls the number of messages the PowerCenter Server reads from the source before the session stops reading from the source.

• Idle Time - Indicates how long the PowerCenter Server waits when no messages arrive before it stops reading from the source.

• Time Slice Mode - Indicates a specific range of time during which the server reads messages from the source. Only PowerCenter Connect for MQSeries uses this option.

• Reader Time Limit - Indicates the number of seconds the PowerCenter Server spends reading messages from the source.

The specific filter conditions and options available to you depend on which PowerCenter Connect you use.


For example: Attributes for JMS Reader

Set the attributes that control the end of the session. One or more attributes can be used to control the end of the session.

For example, set the Message Count attribute to 10. The session will end after it reads 10 messages from the messaging system.

If more than one attribute is selected, the first attribute that satisfies the condition is used to control the end of session.

Note: The real-time attributes can be found in the Reader Properties for PowerCenter Connect for JMS, Tibco, Webmethods, and SAP Idoc. For PowerCenter Connect for MQ Series, the real-time attributes must be specified as a filter condition.

The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how often PowerCenter should flush messages, expressed in seconds.

For example, if the Real-time Flush Latency is set to 2, PowerCenter will flush messages every two seconds. The messages will also be flushed from the reader buffer if the Source Based Commit condition is reached. The Source Based Commit condition is defined in the Properties tab of the session.

The message recovery option can be enabled to make sure no messages are lost if a session fails as a result of unpredictable error, such as power loss. This is especially important for real-time sessions because some messaging applications do not store the messages after the messages are consumed by another application.

Executing a Real-Time Session


A real-time session often has to be up and running continuously to listen to the messaging application and to process messages immediately after the messages arrive. Set the reader attribute Idle Time to -1 and Flush Latency to a specific time interval. This is applicable for all PowerCenter Connect versions except for PowerConnect for MQSeries where the session will continue to run and flush the messages to the target using the specific flush latency interval.

Another scenario is the ability to read data from another source system and send it to a real-time target immediately. For example: Reading data from a relational source and writing it to MQ Series. In this case, set the session to run continuously so that every change in the source system can be immediately reflected in the target.

To set a workflow to run continuously, edit the workflow and select the ‘Scheduler’ tab. Edit the ‘Scheduler’ and select ‘Run Continuously’ from ‘Run Options’. A continuous workflow starts automatically when the Load Manager starts. When the workflow stops, it restarts immediately.

Real-Time Session and Active Transformation

Some of the transformations in PowerCenter are ‘active transformations’, which means that the number of input rows and output rows of the transformation are not the same. In most cases, an active transformation requires all of the input rows to be processed before it passes output rows to the next transformation or target. For a real-time session, the flush latency will be ignored if the DTM needs to wait for all the rows to be processed.

Depending on user needs, active transformations such as Aggregator, Rank, and Sorter can be used in a real-time session by setting the Transaction Scope property in the active transformation to ‘Transaction’. This signals the session to process the data in the transformation per transaction. For example, if a real-time session uses an Aggregator that sums a field of the input, the summation is done per transaction, as opposed to across all rows. The result may or may not be correct depending on the requirement. Use an active transformation with a real-time session if you want to process the data per transaction.

Custom transformations can also be defined to handle data per transaction so that they can be used in a real-time session.


Session and Data Partitioning

Challenge

Improving performance by identifying strategies for partitioning relational tables, XML, COBOL and standard flat files, and by coordinating the interaction between sessions, partitions, and CPUs. These strategies take advantage of the enhanced partitioning capabilities in PowerCenter 6.0 and higher.

Description

On hardware systems that are under-utilized, you may be able to improve performance by processing partitioned data sets in parallel in multiple threads of the same session instance running on the PowerCenter Server engine. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity.

In addition to hardware, consider these other factors when determining if a session is an ideal candidate for partitioning: source and target database setup, target type, mapping design, and certain assumptions that are explained in the following paragraphs. (Use the Workflow Manager client tool to implement session partitioning and see Chapter 13: Pipeline Partitioning in the Workflow Administration Guide for additional information).

Assumptions

The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning. These factors can help to maximize the benefits that can be achieved through partitioning.

• Indexing has been implemented on the partition key when using a relational source.

• Source files are located on the same physical machine as the PowerCenter Server process when partitioning flat files, COBOL, and XML, to reduce network overhead and delay.

• All possible constraints are dropped or disabled on relational targets.
• All possible indexes are dropped or disabled on relational targets.
• Table spaces and database partitions are properly managed on the target system.
• Target files are written to the same physical machine that hosts the PowerCenter Server process, in order to reduce network overhead and delay.
• Oracle External Loaders are utilized whenever possible.


Follow these steps when considering partitioning:

First, determine if you should partition your session. Parallel execution benefits systems that have the following characteristics:

Check Idle Time and Busy Percentage for each thread. This will give the high-level information of the bottleneck point/points. In order to do this, open the session log and look for messages starting with “PETL_” under the “RUN INFO FOR TGT LOAD ORDER GROUP” section. These PETL messages give the following details against the Reader, Transformation, and Writer threads:

• Total Run Time
• Total Idle Time
• Busy Percentage

Under-utilized or intermittently used CPUs. To determine if this is the case, check the CPU usage of your machine:

• UNIX - type VMSTAT 1 10 on the command line. The column ID displays the percentage of CPU idle time during the specified interval without any I/O wait. If there are CPU cycles available (twenty percent or more idle time), then this session's performance may be improved by adding a partition.
• NT - check the Task Manager Performance tab.

Sufficient I/O. To determine the I/O statistics:

• UNIX - type IOSTAT on the command line. The column %IOWAIT displays the percentage of CPU time spent idling while waiting for I/O requests. The column %idle displays the total percentage of the time that the CPU spends idling (i.e., the unused capacity of the CPU.)

• NT - check the task manager performance tab.

Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error. Check to see that you're using as much memory as you can. If the session is paging, increase the memory. To determine if the session is paging:

• UNIX - type VMSTAT 1 10 on the command line. PI displays number of pages swapped in from the page space during the specified interval. PO displays the number of pages swapped out to the page space during the specified interval. If these values indicate that paging is occurring, it may be necessary to allocate more memory, if possible.

• NT - check the task manager performance tab.

If you determine that partitioning is practical, you can begin setting up the partition. The following are selected hints for session setup; see the Workflow Administration Guide for further directions on setting up partitioned sessions.

Partition Types

PowerCenter v6.x and higher provides increased control of the pipeline threads. Session performance can be improved by adding partitions at various pipeline partition points. When you configure the partitioning information for a pipeline, you must specify a


partition type. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:

Round-robin partitioning

The PowerCenter Server distributes data evenly among all partitions. Use round-robin partitioning when you need to distribute rows evenly and do not need to group data among partitions.

In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider a session based on a mapping that reads data from three flat files of different sizes.

• Source file 1: 100,000 rows
• Source file 2: 5,000 rows
• Source file 3: 20,000 rows

In this scenario, the recommended best practice is to set a partition point after the Source Qualifier and set the partition type to round-robin. The PowerCenter Server distributes the data so that each partition processes approximately one third of the data.

Hash partitioning

The PowerCenter Server applies a hash function to a partition key to group data among partitions.

Use hash partitioning when you want to ensure that the PowerCenter Server processes groups of rows with the same partition key in the same partition; for example, when you need to sort items by item ID but do not know how many items have a particular ID number. If you select hash auto-keys, the PowerCenter Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key.

An example of this type of partitioning is when you are using Aggregators and need to ensure that groups of data based on a primary key are processed in the same partition.

Key range partitioning

With this type of partitioning, you specify one or more ports to form a compound partition key for a source or target. The PowerCenter Server then passes data to each partition depending on the ranges you specify for each port.


Use key range partitioning where the sources or targets in the pipeline are partitioned by key range. Refer to Workflow Administration Guide for further directions on setting up Key range partitions.

For example, with key range partitioning set at End range = 2020, the PowerCenter Server will pass in data where values are less than 2020. Similarly, for Start range = 2020, the PowerCenter Server will pass in data where values are equal to or greater than 2020. Null values or values that do not fall in either partition will be passed through the first partition.

Pass-through partitioning

In this type of partitioning, the PowerCenter Server passes all rows at one partition point to the next partition point without redistributing them.

Use pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to (or cannot) change the distribution of data across partitions. Refer to Workflow Administration Guide (Version 6.0) for further directions on setting up pass-through partitions.

The Data Transformation Manager spawns a master thread on each session run, which in itself creates three threads (reader, transformation, and writer threads) by default. Each of these threads can, at the most, process one data set at a time and hence three data sets simultaneously. If there are complex transformations in the mapping, the transformation thread may take a longer time than the other threads, which can slow data throughput.

It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces the overhead of a single transformation thread.

When you have considered all of these factors and selected a partitioning strategy, you can begin the iterative process of adding partitions. Continue adding partitions to the session until you meet the desired performance threshold or observe degradation in performance.

Tips for Efficient Session and Data Partitioning

• Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before adding additional partitions. Refer to the Workflow Administration Guide for more information on restrictions on the number of partitions.

• Set DTM buffer memory. For a session with n partitions, set this value to at least n times the original value for the non-partitioned session.

• Set cached values for sequence generator. For a session with n partitions, there is generally no need to use the Number of Cached Values property of the sequence generator. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the non-partitioned session.

• Partition the source data evenly. The source data should be partitioned into equal sized chunks for each partition.


• Partition tables. A notable increase in performance can also be realized when the actual source and target tables are partitioned. Work with the DBA to discuss the partitioning of source and target tables, and the setup of tablespaces.

• Consider using external loader. As with any session, using an external loader may increase session performance. You can only use Oracle external loaders for partitioning. Refer to the Session and Server Guide for more information on using and setting up the Oracle external loader for partitioning.

• Write throughput. Check the session statistics to see if you have increased the write throughput.

• Paging. Check to see if the session is now causing the system to page. When you partition a session and there are cached lookups, you must make sure that DTM memory is increased to handle the lookup caches. When you partition a source that uses a static lookup cache, the PowerCenter Server creates one memory cache for each partition and one disk cache for each transformation. Thus, memory requirements grow for each partition. If the memory is not bumped up, the system may start paging to disk, causing degradation in performance.

• When you finish partitioning, monitor the session to see if the partition is degrading or improving session performance. If the session performance is improved and the session meets your requirements, add another partition.

Using Parameters, Variables and Parameter Files

Challenge

Understanding how parameters, variables, and parameter files work and using them for maximum efficiency.

Description

Prior to the release of PowerCenter 5, the only variables inherent to the product were those defined within specific transformations and the server variables, which were global in nature. Transformation variables were defined as variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression, Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect the subdirectories for source files, target files, log files, and so forth.

PowerCenter 5 made variables and parameters available across the entire mapping rather than for a specific transformation object. In addition, it provided built-in parameters for use within Server Manager. Using parameter files, these values can change from session run to session run. PowerCenter 6 subsequently built upon this capability by adding several additional features; the discussion that follows is tailored to the functionality available in these releases.

Parameters and Variables

Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or session. A parameter file can be created by using a text editor such as WordPad or Notepad. List the parameters or variables and their values in the parameter file. Parameter files can contain the following types of parameters and variables:

• Workflow variables
• Worklet variables
• Session parameters
• Mapping parameters and variables

When using parameters or variables in a workflow, worklet, mapping, or session, the PowerCenter Server checks the parameter file to determine the start value of the parameter or variable. Use a parameter file to initialize workflow variables, worklet variables, mapping parameters, and mapping variables. If start values are not defined for these parameters and variables, the PowerCenter Server checks for the start value of the parameter or variable in other places.

Session parameters must be defined in a parameter file. Since session parameters do not have default values, when the PowerCenter Server cannot locate the value of a session parameter in the parameter file, it fails to initialize the session. To include parameter or variable information for more than one workflow, worklet, or session in a single parameter file, create separate sections for each object within the parameter file.

Also, create multiple parameter files for a single workflow, worklet, or session and change the file that these tasks use, as necessary. To specify the parameter file that the PowerCenter Server uses with a workflow, worklet, or session, do either of the following:

• Enter the parameter file name and directory in the workflow, worklet, or session properties.

• Start the workflow, worklet, or session using pmcmd and enter the parameter filename and directory in the command line.

If entering a parameter file name and directory in the workflow, worklet, or session properties and in the pmcmd command line, the PowerCenter Server uses the information entered in the pmcmd command line.

Parameter File Format

The format for parameter files changed in version 6 to reflect the improved functionality and nomenclature of the Workflow Manager. When entering values in a parameter file, precede the entries with a heading that identifies the workflow, worklet, or session whose parameters and variables that are to be assigned. Assign individual parameters and variables directly below this heading, entering each parameter or variable on a new line. List parameters and variables in any order for each task.

The following heading formats can be defined:

Workflow variables:

[folder name.WF:workflow name]

Worklet variables:

[folder name.WF:workflow name.WT:worklet name]

Worklet variables in nested worklets:

[folder name.WF:workflow name.WT:worklet name.WT:worklet name...]

Session parameters, plus mapping parameters and variables:

[folder name.WF:workflow name.ST:session name] or

[folder name.session name] or

[session name]

Below each heading, define parameter and variable values as follows:

• parameter name=value
• parameter2 name=value
• variable name=value
• variable2 name=value

For example, a session in the Production folder, s_MonthlyCalculations, uses a string mapping parameter, $$State, that needs to be set to MA, and a datetime mapping variable, $$Time. $$Time already has an initial value of 9/30/2000 00:00:00 saved in the repository, but this value needs to be overridden to 10/1/2000 00:00:00. The session also uses session parameters to connect to source files and target databases, as well as to write the session log to the appropriate session log file.

The following table shows the parameters and variables that will be defined in the parameter file:

Parameters and Variables in Parameter File

Parameter and Variable Type               Parameter and Variable Name   Desired Definition
String Mapping Parameter                  $$State                       MA
Datetime Mapping Variable                 $$Time                        10/1/2000 00:00:00
Source File (Session Parameter)           $InputFile1                   Sales.txt
Database Connection (Session Parameter)   $DBConnection_Target          Sales (database connection)
Session Log File (Session Parameter)      $PMSessionLogFile             d:/session logs/firstrun.txt

The parameter file for the session includes the folder and session name, as well as each parameter and variable:

[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt

The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable. This allows the PowerCenter Server to use the value for the variable that was set in the previous session run.
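For illustration, the edited parameter file for that second run might look like the following sketch (based on the example above; the other entries are unchanged, and $$Time is removed so that the value saved in the repository is used):

[Production.s_MonthlyCalculations]
$$State=MD
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt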

Mapping Variables

Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and Variables. After selecting mapping variables, use the pop-up window to create a variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to creating a port in most transformations.

Variables, by definition, are objects that can change value dynamically. PowerCenter has four functions to affect change to mapping variables:

• SetVariable
• SetMaxVariable
• SetMinVariable
• SetCountVariable

A mapping variable can store the last value from a session run in the repository to be used as the starting value for the next session run.
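As a minimal sketch of how these functions are used (the variable name $$Last_Run_Date and the output port are illustrative, not part of the product), an Expression transformation output port of type date/time could update a variable with the current session start time:

SETVARIABLE($$Last_Run_Date, SESSSTARTTIME)

As with the incremental-load example later in this document, the port carrying the function result must be connected to a downstream transformation for the assignment to take effect.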

Name

The name of the variable should be descriptive and be preceded by $$ (so that it is easily identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.

Aggregation type

This entry creates specific functionality for the variable and determines how it stores data. For example, with an aggregation type of Max, the value stored in the repository at the end of each session run would be the max value across ALL records until the value is deleted.

Initial value

This value is used during the first session run when there is no corresponding and overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value is identified, then a data-type specific default value is used.

Variable values are not stored in the repository when the session:

• Fails to complete.
• Is configured for a test load.
• Is a debug session.
• Runs in debug mode and is configured to discard session output.

Order of evaluation

The start value is the value of the variable at the start of the session. The start value can be a value defined in the parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined initial value for the variable, or the default value based on the variable data type.

The PowerCenter Server looks for the start value in the following order:

1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value

Mapping parameters and variables

Since parameter values do not change over the course of the session run, the value used is based on:

• Value in session parameter file
• Initial value
• Default value

Once defined, mapping parameters and variables can be used in the Expression Editor section of the following transformations:

• Expression
• Filter
• Router
• Update Strategy

Mapping parameters and variables also can be used within the Source Qualifier in the SQL query, user-defined join, and source filter sections, as well as in a SQL override in the lookup transformation.

The lookup SQL override is similar to entering a custom query in a Source Qualifier transformation. When entering a lookup SQL override, enter the entire override, or generate and edit the default SQL statement. When the Designer generates the default SQL statement for the lookup SQL override, it includes the lookup/output ports in the lookup condition and the lookup/return port.

Note: Although you can use mapping parameters and variables when entering a lookup SQL override, the Designer cannot expand mapping parameters and variables in the query override and does not validate the lookup SQL override. When running a session with a mapping parameter or variable in the lookup SQL override, the PowerCenter Server expands mapping parameters and variables and connects to the lookup database to validate the query override.

Also note that Workflow Manager does not recognize variable database connection parameters (such as $DBConnection) for Lookup transformations. At this time, Lookups can use $Source, $Target, or an exact database connection.
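As an illustrative sketch only (the table, columns, and the $$Region_Code parameter below are hypothetical), a lookup SQL override that restricts the cached rows using a mapping parameter might look like:

SELECT CUSTOMER_DIM.CUSTOMER_ID as CUSTOMER_ID, CUSTOMER_DIM.CUSTOMER_NAME as CUSTOMER_NAME FROM CUSTOMER_DIM WHERE CUSTOMER_DIM.REGION_CODE = '$$Region_Code'

Remember that the Designer does not expand or validate the parameter in the override; the expansion happens at run time, as described above.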

Guidelines for Creating Parameter Files

Use the following guidelines when creating parameter files:

• Capitalize folder and session names as necessary. Folder and session names are case-sensitive in the parameter file.

• Enter folder names for non-unique session names. When a session name exists more than once in a repository, enter the folder name to indicate the location of the session.

• Create one or more parameter files. Assign parameter files to workflows, worklets, and sessions individually. Specify the same parameter file for all of these tasks or create several parameter files.

• If including parameter and variable information for more than one session in the file, create a new section for each session as follows. The folder name is optional.

[folder_name.session_name]

parameter_name=value

variable_name=value

mapplet_name.parameter_name=value

[folder2_name.session_name]

parameter_name=value

variable_name=value

mapplet_name.parameter_name=value

• Specify headings in any order. Place headings in any order in the parameter file. However, if defining the same parameter or variable more than once in the file, the PowerCenter Server assigns the parameter or variable value using the first instance of the parameter or variable.

• Specify parameters and variables in any order. Below each heading, the parameters and variables can be specified in any order.

• When defining parameter values, do not use unnecessary line breaks or spaces. The PowerCenter Server may interpret additional spaces as part of the value.

• List all necessary mapping parameters and variables. Values entered for mapping parameters and variables become the start value for parameters and variables in a mapping. Mapping parameter and variable names are not case sensitive.

• List all session parameters. Session parameters do not have default values. An undefined session parameter can cause the session to fail. Session parameter names are not case sensitive.

• Use correct date formats for datetime values. When entering datetime values, use the following date formats:

MM/DD/RR

MM/DD/RR HH24:MI:SS

MM/DD/YYYY

MM/DD/YYYY HH24:MI:SS

• Do not enclose parameters or variables in quotes. The PowerCenter Server interprets everything after the equal sign as part of the value.

• Precede parameters and variables created in mapplets with the mapplet name as follows:

mapplet_name.parameter_name=value

mapplet2_name.variable_name=value

Example: Parameter files and session parameters

Parameter files, along with session parameters, allow you to change certain values between sessions. A commonly used feature is the ability to create user-defined database connection session parameters to reuse sessions for different relational sources or targets. Use session parameters in the session properties, and then define the parameters in a parameter file. To do this, name all database connection session parameters with the prefix $DBConnection, followed by any alphanumeric and underscore characters, as shown in the previous example where $DBConnection_target=sales. The same technique can also be used for source files rather than relational connections. Session parameters and parameter files help reduce the overhead of creating multiple mappings when only certain attributes of a mapping need to be changed, as shown in the examples above.
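Applied to flat-file sources, the same idea might look like the following hedged sketch (the session name, parameter name, and file names are illustrative only): the session defines a source file parameter with the $InputFile prefix, and each parameter file points it at a different file.

Parameter file for the first run:

[Production.s_Load_Customers]
$InputFile_Customers=customers_east.txt

Parameter file for the second run:

[Production.s_Load_Customers]
$InputFile_Customers=customers_west.txt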

Using Parameters in Source Qualifiers

Another commonly used feature is the ability to create parameters in the Source Qualifier, which allows you to reuse the same mapping with different sessions, extracting different data as specified by the parameter file that each session references.

There may also be times when it is necessary to create one mapping that generates a parameter file and a second mapping that uses it. The first mapping builds a flat file whose contents serve as a parameter file for another session. The second mapping then pulls data using a parameter in its Source Qualifier transformation, and the value of that parameter is read from the parameter file created by the first mapping.
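As a sketch of this technique (the folder, session, and parameter names here are hypothetical), the first mapping writes the section heading and the parameter assignment as rows of its flat-file target, producing a file such as:

[Test.s_Extract_Orders]
$$Extract_Start_Date=04/21/2001

The second session then names this file as its parameter file and uses $$Extract_Start_Date in its Source Qualifier source filter or SQL override.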

Note: Server variables cannot be modified by entries in the parameter file. For example, there is no way to set the Workflow log directory in a parameter file. The Workflow Log File Directory can only accept an actual directory or the $PMWorkflowLogDir variable as a valid entry. The $PMWorkflowLogDir variable is a server variable that is set at the server configuration level, not in the Workflow parameter file.

Example: Variables and Parameters in an Incremental Strategy

Variables and parameters can enhance incremental strategies. The following example uses a mapping variable, an expression transformation object, and a parameter file for restarting.

Scenario

Company X wants to start with an initial load of all data, but wants subsequent process runs to select only new information. The source data has an inherent post date, held in a column named Date_Entered, that can be used. The process will run once every twenty-four hours.

Sample Solution

Create a mapping with source and target objects. From the menu create a new mapping variable named $$Post_Date with the following attributes:

• TYPE: Variable
• DATATYPE: Date/Time
• AGGREGATION TYPE: MAX
• INITIAL VALUE: 01/01/1900

Note that there is no need to encapsulate the INITIAL VALUE with quotation marks. However, if this value is used within the Source Qualifier SQL, it is necessary to use the native RDBMS function to convert it (e.g., TO_DATE(--,--)). Within the Source Qualifier transformation, use the following in the Source Filter attribute: DATE_ENTERED > to_Date('$$Post_Date','MM/DD/YYYY HH24:MI:SS')

Also note that the initial value 01/01/1900 will be expanded by the PowerCenter Server to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.
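At run time the PowerCenter Server expands the variable before the SQL is sent to the database, so on the very first run the source filter above resolves to ordinary SQL along these lines:

DATE_ENTERED > to_Date('01/01/1900 00:00:00','MM/DD/YYYY HH24:MI:SS')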

The next step is to connect $$Post_Date and Date_Entered to an Expression transformation. This is where the function for setting the variable will reside. An output port named Post_Date is created with a data type of date/time. In the expression code section, place the following function:

SETMAXVARIABLE($$Post_Date,DATE_ENTERED)

The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be passed forward. For example:

DATE_ENTERED    Resultant POST_DATE
9/1/2000        9/1/2000
10/30/2001      10/30/2001
9/2/2000        10/30/2001

Consider the following with regard to the functionality:

1. In order for the function to assign a value, and ultimately store it in the repository, the port must be connected to a downstream object. It need not go to the target, but it must go to another Expression Transformation. The reason is that the memory will not be instantiated unless it is used in a downstream transformation object.

2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an update-only mapping (i.e., Treat Rows As is set to Update in the session properties) the function will not work. In this case, make the session Data Driven and add an Update Strategy after the transformation containing the SETMAXVARIABLE function, but before the Target.

3. If the intent is to store the original Date_Entered per row and not the evaluated date value, then add an ORDER BY clause to the Source Qualifier. This way, the dates are processed and set in order and data is preserved.
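As a sketch of the third point (the ORDERS table name is assumed for illustration; the filter matches the example above), the Source Qualifier SQL override would simply append the ORDER BY clause:

SELECT * FROM ORDERS WHERE DATE_ENTERED > to_Date('$$Post_Date','MM/DD/YYYY HH24:MI:SS') ORDER BY DATE_ENTERED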

The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900 providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it encounters. Upon successful completion of the session, the variable is updated in the repository for use in the next session run. To view the current value for a particular variable associated with the session, right-click on the session and choose View Persistent Values.

The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this session is run, based on the variable in the Source Qualifier Filter, only sources where Date_Entered > 02/03/1998 will be processed.

Resetting or overriding persistent values

To reset the persistent value to the initial value declared in the mapping, view the persistent value from Server Manager (see graphic above) and press Delete Values. This will delete the stored value from the repository, causing the Order of Evaluation to use the Initial Value declared from the mapping.

If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:

• Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A session may (or may not) have a variable, and the parameter file need not have variables and parameters defined for every session using the parameter file. To override the variable, either change, uncomment, or delete the variable in the parameter file.

• Run PMCMD for that session but declare the specific parameter file within the PMCMD command.

Configuring the parameter file location

Specify the parameter filename and directory in the workflow or session properties. To enter a parameter file in the workflow or session properties:

• Select either the Workflow or Session, choose Edit, and click the Properties tab.
• Enter the parameter directory and name in the Parameter Filename field.
• Enter either a direct path or a server variable directory. Use the appropriate delimiter for the Informatica Server operating system.

The following graphic shows the parameter filename and location specified in the session task.

The next graphic shows the parameter filename and location specified in the Workflow.

In this example, after the initial session is run the parameter file contents may look like:

[Test.s_Incremental]

;$$Post_Date=

By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in the subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a simple Perl script or manual change can update the parameter file to:

[Test.s_Incremental]

$$Post_Date=04/21/2001

Upon running the sessions, the order of evaluation looks to the parameter file first, sees a valid variable and value and uses that value for the session run. After successful completion, run another script to reset the parameter file.

Example: Using session and mapping parameters in multiple database environments

Reusable mappings that can source a common table definition across multiple databases, regardless of differing environmental definitions (e.g., instances, schemas, user/logins), are required in a multiple database environment.

Scenario

Company X maintains five Oracle database instances. All instances have a common table definition for sales orders, but each instance has a unique instance name, schema, and login.

DB Instance   Schema     Table        User    Password
ORC1          aardso     orders       Sam     max
ORC99         environ    orders       Help    me
HALC          hitme      order_done   Hi      Lois
UGLY          snakepit   orders       Punch   Judy
GORF          gmer       orders       Brer    Rabbit

Each sales order table has a different name, but the same definition:

ORDER_ID NUMBER (28) NOT NULL,
DATE_ENTERED DATE NOT NULL,
DATE_PROMISED DATE NOT NULL,
DATE_SHIPPED DATE NOT NULL,
EMPLOYEE_ID NUMBER (28) NOT NULL,
CUSTOMER_ID NUMBER (28) NOT NULL,
SALES_TAX_RATE NUMBER (5,4) NOT NULL,
STORE_ID NUMBER (28) NOT NULL

Sample Solution

Using Workflow Manager, create multiple relational connections. In this example, the strings are named according to the DB Instance name. Using Designer, create the mapping that sources the commonly defined table. Then create a Mapping Parameter named $$Source_Schema_Table with the following attributes:

Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required as this solution will use parameter files.

Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.

Open the Expression Editor and select Generate SQL. The generated SQL statement will show the columns. Override the table names in the SQL statement with the mapping parameter.
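The overridden statement then references the parameter in place of the schema-qualified table name, along the lines of the following sketch (the column list comes from the common definition above):

SELECT ORDER_ID, DATE_ENTERED, DATE_PROMISED, DATE_SHIPPED, EMPLOYEE_ID, CUSTOMER_ID, SALES_TAX_RATE, STORE_ID FROM $$Source_Schema_Table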

Using Workflow Manager, create a session based on this mapping. Within the Source Database connection drop down box, choose the following parameter:

$DBConnection_Source.

Point the target to the corresponding target and finish.

Now create the parameter files. In this example, there will be five separate parameter files.

Parmfile1.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=aardso.orders

$DBConnection_Source=ORC1

Parmfile2.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=environ.orders

$DBConnection_Source=ORC99

Parmfile3.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=hitme.order_done

$DBConnection_Source=HALC

Parmfile4.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=snakepit.orders

$DBConnection_Source=UGLY

Parmfile5.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=gmer.orders

$DBConnection_Source=GORF

Use pmcmd to run the five sessions in parallel. The syntax for pmcmd for starting workflows is as follows:

pmcmd startworkflow -s serveraddress:portno -u Username -p Password s_Incremental

Notes on Using Parameter Files with Startworkflow

When starting a workflow, you can optionally enter the directory and name of a parameter file. The PowerCenter Server runs the workflow using the parameters in the file specified.

For UNIX shell users, enclose the parameter file name in single quotes:

-paramfile '$PMRootDir/myfile.txt'

For Windows command prompt users, the parameter file name cannot have beginning or trailing spaces. If the name includes spaces, enclose the file name in double quotes:

-paramfile "$PMRootDir\my file.txt"

Note: When writing a pmcmd command that includes a parameter file located on another machine, use the backslash (\) with the dollar sign ($). This ensures that the machine where the variable is defined expands the server variable.

pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w wSalesAvg -paramfile '\$PMRootDir/myfile.txt'

In the event that it is necessary to run the same workflow with different parameter files, use the following five separate commands:

pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile1.txt 1 1

pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile2.txt 1 1

pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile3.txt 1 1

pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile4.txt 1 1

pmcmd startworkflow tech_user pwd 127.0.0.1:4001 Test s_Incremental_SOURCE_CHANGES paramfile \$PMRootDir\ParmFiles\Parmfile5.txt 1 1

Alternatively, run the sessions in sequence with one parameter file. In this case, a pre- or post-session script would change the parameter file for the next session.

Using PowerCenter Labels

Challenge

Using labels effectively in a data warehouse or data integration project to assist with administration and migration.

Description

A label is a versioning object that can be associated with any versioned object or group of versioned objects in a repository. Labels provide a way to tag a number of object versions with a name for later identification. Therefore, a label is a named object in the repository, whose purpose is to be a “pointer” or reference to a group of versioned objects. For example, a label called “Project X version X” can be applied to all object versions that are part of that project and release.

Labels can be used for many purposes:

• Track versioned objects during development.
• Improve object query results.
• Create logical groups of objects for future deployment.
• Associate groups of objects for import and export.

Note that labels apply to individual object versions, and not objects as a whole. So if a mapping has ten versions checked in, and a label is applied to version 9, then only version 9 has that label. The other versions of that mapping do not automatically inherit that label. However, multiple labels can point to the same object for greater flexibility.

The “Use Repository Manager” privilege is required in order to create or edit labels. To create a label, choose Versioning >> Labels from the Repository Manager.

When creating a new label, choose a name that is as descriptive as possible. For example, a suggested naming convention for labels is: Project_Version_Action. Include comments for further meaningful description.

Locking the label is also advisable. This prevents anyone from accidentally associating additional objects with the label or removing object references for the label.

Labels, like other global objects such as Queries and Deployment Groups, can have user and group privileges attached to them. This allows an administrator to create a label that can only be used by specific individuals or groups. Only those people working on a specific project should be given read/write/execute permissions for labels that are assigned to that project.

Once a label is created, it should be applied to related objects. To apply the label to objects, invoke the “Apply Label” wizard from the Versioning >> Apply Label menu option from the menu bar in the Repository Manager (as shown in the following figure).

Applying Labels

Labels can be applied to any object and cascaded upwards and downwards to parent and/or child objects. For example, to group dependencies for a workflow, apply a label to all children objects. The Repository Server applies labels to sources, targets, mappings, and tasks associated with the workflow. Use the “Move label” property to point the label to the latest version of the object(s).

Note: Labels can be applied to any object version in the repository except checked-out versions. Execute permission is required for applying labels.

After the label has been applied to related objects, it can be used in queries and deployment groups (see the Best Practice on Deployment Groups). Labels can also be used to manage the size of the repository (i.e., to purge object versions).

Using labels in deployment

An object query can be created using the existing labels (as shown below). Labels can be associated only with a dynamic deployment group. Based on the object query, objects associated with that label can be used in the deployment.

Strategies for Labels

Repository Administrators and other individuals in charge of migrations should develop their own label strategies and naming conventions in the early stages of a data integration project. Be sure that developers are aware of the uses of these labels and when they should apply labels.

For each planned migration between repositories, choose three labels for the development and subsequent repositories:

• The first is to identify the objects that developers can mark as ready for migration.
• The second should apply to migrated objects, thus developing a migration audit trail.
• The third is to apply to objects as they are migrated into the receiving repository, completing the migration audit trail.

When preparing for the migration, use the first label to construct a query to build a dynamic deployment group. The second and third labels in the process are optionally applied by the migration wizard when copying folders between versioned repositories. Developers and administrators do not need to apply the second and third labels manually.

Additional labels can be created, in conjunction with the developers, to track the progress of mappings if desired. For example, when an object is successfully unit-tested by the developer, it can be marked as such. Developers can also label the object with a migration label at a later time if necessary. Using labels in this fashion, along with the query feature, allows complete or incomplete objects to be identified quickly and easily, thereby providing an object-based view of progress.

Using PowerCenter Metadata Reporter and Metadata Exchange Views for Quality Assurance

Challenge

The principal objectives of any QA strategy are to ensure that developed components adhere to standards and to identify defects before incurring overhead during the migration from development to test/production environments. Qualitative, peer-based reviews of PowerCenter objects due for release obviously have their part to play in this process.

Less well-appreciated is the role that the PowerCenter repository can play in an automated QA strategy. This repository is essentially a database about the transformation process and the software developed to implement it; the challenge is to devise a method to exploit this resource for QA purposes.

Description

Before considering the mechanics of an automated QA strategy it is worth emphasizing that quality should be built in from the outset. If the project involves multiple mappings repeating the same basic transformation pattern(s), it is probably worth constructing a virtual production line. This is essentially a template-driven approach to accelerate development and enforce consistency through the use of the following aids:

• A shared template for each type of mapping.
• Checklists to guide the developer through the process of adapting the template to the mapping requirements.
• Macros/scripts to generate productivity aids such as SQL overrides, etc.

It is easier to ensure quality from a standardized base rather than relying on developers to repeat accurately the same basic keystrokes.

Underpinning the exploitation of the repository for QA purposes is the adoption of naming standards which categorize components. By running the appropriate query on the repository, it is possible to identify those components whose attributes differ from those predicted for the category. Thus, it is quite possible to automate some aspects of QA. Clearly, the function of naming conventions is not just to standardize but also to provide logical access paths into the information in the repository; names can be used to identify patterns and/or categories and thus allow assumptions to be made about object attributes. Together with the supported facilities provided to query the repository, such as the Metadata Exchange (MX) Views and the PowerCenter Metadata Reporter, this opens the door to an automated QA strategy.

For example, consider the following situation: it is possible that the EXTRACT mapping/session should always truncate the target table before loading; conversely, the TRANSFORM and LOAD phases should never truncate a target.

Possible code errors in this respect could be identified as follows:

• Define a mapping/session naming standard to indicate EXTRACT, TRANSFORM, or LOAD.

• Develop a query on the repository to search for sessions named EXTRACT, which do not have the truncate target option set.

• Develop a query on the repository to search for sessions named TRANSFORM or LOAD, which do have the truncate target option set.

• Provide a facility to allow developers to run both queries before releasing code to the test environment.

Alternatively, a standard may have been defined to prohibit unconnected output ports from transformations (such as expressions) in a mapping. These can be very easily identified from the MX View REP_MAPPING_UNCONN_PORTS.
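For example, a simple query against this view, run as the repository database user, could flag offending mappings before migration. Treat the WHERE clause as an assumption: the exact column names depend on the MX View definitions for your PowerCenter version, so verify them against the Metadata Exchange documentation.

SELECT * FROM REP_MAPPING_UNCONN_PORTS WHERE SUBJECT_AREA = 'MY_PROJECT_FOLDER'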

The following bullets represent a high level overview of the steps involved in automating QA:

• Review the transformations/mappings/sessions/workflows and allocate to broadly representative categories.

• Identify the key attributes of each category.
• Define naming standards to identify the category for each transformation/mapping/session/workflow.
• Analyze the MX Views to source the key attributes.
• Develop the query to compare actual and expected attributes for each category.

After you have completed these steps, it is possible to develop a utility that compares actual and expected attributes for developers to run before releasing code into any test environment. Such a utility may incorporate the following processing stages:

• Execute a profile to assign environment variables (repository schema user, password etc).

• Select the folder to be reviewed.
• Execute the query to find exceptions.
• Report the exceptions in an accessible format.
• Exit with failure if exceptions are found.

TIP: Remember that any queries on the repository that bypass the MX views will require modification if PowerCenter is subsequently upgraded; such queries are therefore not recommended by Informatica.

Using PowerCenter with UDB

Challenge

Universal Database (UDB) is a database platform that can be used to run PowerCenter repositories and act as source and target databases for PowerCenter mappings. Like any software, it has its own way of doing things. It is important to understand these behaviors so as to configure the environment correctly for implementing PowerCenter and other Informatica products with this database platform. This Best Practice offers a number of tips for using UDB with PowerCenter.

Description

UDB Overview

UDB is used for a variety of purposes and with various environments. UDB servers run on Windows, OS/2, AS/400 and UNIX-based systems like AIX, Solaris, and HP-UX. UDB supports two independent types of parallelism: symmetric multi-processing (SMP) and massively parallel processing (MPP).

Enterprise-Extended Edition (EEE) is the most common UDB edition used in conjunction with the Informatica product suite. UDB EEE introduces a dimension of parallelism that can be scaled to very high performance. A UDB EEE database can be partitioned across multiple machines that are connected by a network or a high-speed switch. Additional machines can be added to an EEE system as application requirements grow. The individual machines participating in an EEE installation can be either uniprocessors or symmetric multiprocessors.

Connection Setup

You must set up a remote database connection to connect to DB2 UDB via PowerCenter. This is necessary because DB2 UDB sets a very small limit on the number of attachments per user to the shared memory segments when the user is using the local (or indirect) connection/protocol. The PowerCenter server runs into this limit when it is acting as the database agent or user. This is especially apparent when the repository is installed on DB2 and the target data source is on the same DB2 database.

The local protocol limit will definitely be reached when using the same connection node for the repository via the PowerCenter Server and for the targets. This occurs when the session is executed and the server sends requests for multiple agents to be launched. Whenever the limit on the number of database agents is reached, the following error occurs:

CMN_1022 [[IBM][CLI Driver] SQL1224N A database agent could not be started to service a request, or was terminated as a result of a database system shutdown or a force command. SQLSTATE=55032]

The following recommendations may resolve this problem:

• Increase the number of connections permitted by DB2.
• Catalog the database as if it were remote. (For information on how to catalog a database with a remote node, refer to Knowledge Base article 14745 in the my.Informatica.com support Knowledge Base; a hedged command-line sketch also follows this list.)
• Be sure to close connections when programming exceptions occur.
• Verify that connections obtained in one method are returned to the pool via close() (the PowerCenter Server is very likely already doing this).
• Verify that your application does not try to access pre-empted connections (i.e., idle connections that are now used by other resources).
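A hedged sketch of cataloging the database through a remote TCP/IP node from the DB2 command line follows; the node name, host name, port, database name, and alias are placeholders to adapt to your environment:

db2 catalog tcpip node pmnode remote db2host.mycompany.com server 50000
db2 catalog database pcrepo as pcrepo_r at node pmnode
db2 terminate

The PowerCenter connections would then reference the remote alias (pcrepo_r in this sketch) rather than the local database name.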

DB2 Timestamp

DB2 has a timestamp data type that is precise to the microsecond and uses a 26-character format, as follows:

YYYY-MM-DD-HH.MI.SS.MICROS (where MICROS, after the last period, represents six decimal places of a second)

The PowerCenter Date/Time datatype only supports precision to the second (using a 19 character format), so under normal circumstances when a timestamp source is read into PowerCenter, the six decimal places after the second are lost. This is sufficient for most data warehousing applications but can cause significant problems where this timestamp is used as part of a key.

If the MICROS need to be retained, this can be accomplished by changing the format of the column from a timestamp data type to a character 26 in the source and target definitions. When the timestamp is read from DB2, the timestamp will be read in and converted to character in the ‘YYYY-MM-DD-HH.MI.SS.MICROS’ format. Likewise, when writing to a timestamp, pass the date as a character in the ‘YYYY-MM-DD-HH.MI.SS.MICROS’ format. If this format is not retained, the records are likely to be rejected due to an invalid date format error.
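For example, a DB2 timestamp read through (or written back to) such a character port must keep this exact 26-character form:

2001-04-21-14.30.25.123456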

It is also possible to maintain the timestamp correctly using the timestamp data type itself. Setting a flag at the PowerCenter Server level does this; the technique is described in Knowledge Base article 10220 at my.Informatica.com.

Importing Sources or Targets

If the value of the DB2 system variable APPLHEAPSZ is too small when you use the Designer to import sources/targets from a DB2 database, the Designer reports an error accessing the repository. The Designer status bar displays the following message:

SQL Error:[IBM][CLI Driver][DB2]SQL0954C: Not enough storage is available in the application heap to process the statement.

If you receive this error, increase the value of the APPLHEAPSZ variable for your DB2 operating system. APPLHEAPSZ is the application heap size (in 4KB pages) for each process using the database.
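The value can be raised from the DB2 command line; the database name and the new size shown here are placeholders only, and the change typically takes effect only after all applications have disconnected from the database:

db2 update database configuration for PCREPO using APPLHEAPSZ 1024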

Unsupported Datatypes

PowerMart and PowerCenter do not support the following DB2 datatypes:

• Dbclob
• Blob
• Clob
• Real

DB2 External Loaders

The DB2 EE and DB2 EEE external loaders can both perform insert and replace operations on targets. Both can also restart or terminate load operations.

• The DB2 EE external loader invokes the db2load executable located in the PowerCenter Server installation directory. The DB2 EE external loader can load data to a DB2 server on a machine that is remote to the PowerCenter Server.

• The DB2 EEE external loader invokes the IBM DB2 Autoloader program to load data. The Autoloader program uses the db2atld executable. The DB2 EEE external loader can partition data and load the partitioned data simultaneously to the corresponding database partitions. When you use the DB2 EEE external loader, the PowerCenter Server and the DB2 EEE server must be on the same machine.

The DB2 external loaders load from a delimited flat file. Be sure that the target table columns are wide enough to store all of the data. If you configure multiple targets in the same pipeline to use DB2 external loaders, each loader must load to a different tablespace on the target database. For information on selecting external loaders, see Configuring External Loading in a Session in the PowerCenter User Guide.

Setting DB2 External Loader Operation Modes

DB2 operation modes specify the type of load the external loader runs. You can configure the DB2 EE or DB2 EEE external loader to run in any one of the following operation modes:

• Insert. Adds loaded data to the table without changing existing table data.
• Replace. Deletes all existing data from the table, and inserts the loaded data. The table and index definitions do not change.

• Restart. Restarts a previously interrupted load operation.
• Terminate. Terminates a previously interrupted load operation and rolls back the operation to the starting point, even if consistency points were passed. The tablespaces return to normal state, and all table objects are made consistent.

Configuring Authorities, Privileges, and Permissions

When you load data to a DB2 database using either the DB2 EE or DB2 EEE external loader, you must have the correct authority levels and privileges to load data into the database tables.

DB2 privileges allow you to create or access database resources. Authority levels provide a method of grouping privileges and higher-level database manager maintenance and utility operations. Together, these functions control access to the database manager and its database objects. You can access only those objects for which you have the required privilege or authority.

To load data into a table, you must have one of the following authorities:

• SYSADM authority
• DBADM authority
• LOAD authority on the database, with INSERT privilege

In addition, you must have proper read access and read/write permissions:

• The database instance owner must have read access to the external loader input files.

• If you run DB2 as a service on Windows, you must configure the service start account with a user account that has read/write permissions to use LAN resources, including drives, directories, and files.

• If you load to DB2 EEE, the database instance owner must have write access to the load dump file and the load temporary file.

Remember, the target file must be delimited when using the DB2 AutoLoader.

Guidelines for Performance Tuning

You can achieve numerous performance improvements by properly configuring the database manager, database, and tablespace container and parameter settings. For example, MAXFILOP is one of the database configuration parameters that you can tune. The default value for MAXFILOP is far too small for most databases. When this value is too small, UDB spends a lot of extra CPU processing time closing and opening files. To resolve this problem, increase MAXFILOP value until UDB stops closing files.

You must also have enough DB2 agents available to process the workload based on the number of users accessing the database. Incrementally increase the value of MAXAGENTS until agents are not stolen from another application. Moreover, sufficient memory allocated to the CATALOGCACHE_SZ database configuration parameter also benefits the database. If the value of catalog cache heap is greater than zero, both DBHEAP and CATALOGCACHE_SZ should be proportionally increased.

In UDB, the LOCKTIMEOUT default value is 1. In a data warehouse database, set this value to 60 seconds. Remember to define TEMPSPACE tablespaces so that they have at least 3 or 4 containers across different disks, and set the PREFETCHSIZE to a multiple of EXTENTSIZE, where the multiplier is equal to the number of containers. Doing so will enable parallel I/O for larger sorts, joins, and other database functions requiring substantial TEMPSPACE space.

In UDB, the default LOGBUFSZ value of 8 is too small; try setting it to 128. Also, set INTRA_PARALLEL to YES for CPU parallelism. The database configuration parameter DFT_DEGREE should be set to a value between ANY and 1, depending on the number of CPUs available and the number of processes that will be running simultaneously. Setting DFT_DEGREE to ANY can monopolize the CPUs, since one process can take up all of the processing power with this setting, while setting it to 1 provides no parallelism at all.

(Note: DFT_DEGREE and INTRA_PARALLEL are applicable only for EEE DB).

Data warehouse databases perform numerous sorts, many of which can be very large. SORTHEAP memory is also used for hash joins, which a surprising number of DB2 users fail to enable. To do so, use the db2set command to set environment variable DB2_HASH_JOIN=ON.

For a data warehouse database, at a minimum, double or triple the SHEAPTHRES (to between 40,000 and 60,000) and set the SORTHEAP size between 4,096 and 8,192. If real memory is available, some clients use even larger values for these configuration parameters.

SQL is very complex in a data warehouse environment and often consumes large quantities of CPU and I/O resources. Therefore, set DFT_QUERYOPT to 7 or 9.

UDB uses NUM_IO_CLEANERS for writing to TEMPSPACE, temporary intermediate tables, index creations, and more. Set NUM_IO_CLEANERS equal to the number of CPUs on the UDB server and focus on your disk layout strategy instead.
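As a consolidated, hedged sketch of the changes discussed above (the database name PCTGT and the exact values are illustrative only and should be validated against available memory and the DB2 documentation for your version; note that the db cfg keyword for the I/O cleaners parameter is spelled NUM_IOCLEANERS):

db2 update dbm cfg using MAXAGENTS 400
db2 update dbm cfg using SHEAPTHRES 60000
db2 update dbm cfg using INTRA_PARALLEL YES
db2 update db cfg for PCTGT using MAXFILOP 256
db2 update db cfg for PCTGT using CATALOGCACHE_SZ 512
db2 update db cfg for PCTGT using DBHEAP 2400
db2 update db cfg for PCTGT using LOCKTIMEOUT 60
db2 update db cfg for PCTGT using LOGBUFSZ 128
db2 update db cfg for PCTGT using SORTHEAP 8192
db2 update db cfg for PCTGT using DFT_QUERYOPT 7
db2 update db cfg for PCTGT using NUM_IOCLEANERS 4
db2set DB2_HASH_JOIN=ON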

Lastly, for RAID devices where several disks appear as one to the operating system, be sure to do the following:

1. db2set DB2_STRIPED_CONTAINERS=YES (do this before creating tablespaces or before a redirected restore)

2. db2set DB2_PARALLEL_IO=* (or use TablespaceID numbers for tablespaces residing on the RAID devices for example DB2_PARALLEL_IO=4,5,6,7,8,10,12,13)

3. Alter the tablespace PREFETCHSIZE for each tablespace residing on RAID devices such that the PREFETCHSIZE is a multiple of the EXTENTSIZE.

Database Locks and Performance Problems

When working in an environment with many users that target a DB2 UDB database, you may experience slow and erratic behavior resulting from the way UDB handles database locks. Out of the box, DB2 UDB database and client connections are configured on the assumption that they will be part of an OLTP system and place several locks on records and tables. Because PowerCenter typically works with OLAP systems, where it is the only process writing to the database and users are primarily reading from the database, this default locking behavior can have a significant impact on performance.

Connections to DB2 UDB databases are set up using the DB2 Client Configuration utility. To minimize problems with the default settings, make the following changes to all remote clients accessing the database for read-only purposes. To help replicate these settings, you can export the settings from one client and then import the resulting file into all the other clients.

• Enable Cursor Hold is the default setting for the Cursor Hold option. Edit the configuration settings and make sure the Enable Cursor Hold option is not checked.

• Connection Mode should be Shared, not Exclusive.
• Isolation Level should be Read Uncommitted (the minimum level) or Read Committed (if updates by other applications are possible and dirty reads must be avoided).

To set the isolation level to dirty read at the PowerCenter Server level, you can set a flag in the PowerCenter configuration file. For details on this process, refer to Knowledge Base article 13575 in the my.Informatica.com support Knowledge Base.

If you're not sure how to adjust these settings, launch the IBM DB2 Client Configuration utility, then highlight the database connection you use and select Properties. In Properties, select Settings and then select Advanced. You will see these options and their settings on the Transaction tab.

To export the settings from the main screen of the IBM DB2 client configuration utility, highlight the database connection you use, then select Export and all. Use the same process to import the settings on another client.

If users run hand-coded queries against the target table using DB2's Command Center, be sure they know to use script mode and avoid interactive mode (by choosing the script tab instead of the interactive tab when writing queries). Interactive mode can lock returned records while script mode merely returns the result and does not hold them.

If your target DB2 table is partitioned and resides across different nodes in DB2, you can use the "DB Partitioning" target partition type in the PowerCenter session properties. When DB partitioning is selected, separate connections are opened directly to each node and the load starts in parallel. This improves performance and scalability.

Using Shortcut Keys in PowerCenter Designer

Challenge

Using shortcuts and work-arounds to work as efficiently as possible in PowerCenter Mapping Designer and Workflow Manager.

Description

After you are familiar with the normal operation of PowerCenter Mapping Designer and Workflow Manager, you can use a variety of shortcuts to speed up their operation.

PowerCenter provides two types of shortcuts: keyboard shortcuts to edit repository objects and maneuver through the Mapping Designer and Workflow Manager as efficiently as possible; and shortcuts that simplify the maintenance of repository objects.

General Suggestions

Maneuvering the Navigator window

Follow these steps to open a folder with workspace open as well:

1. Click the Open folder icon. (Note that double clicking on the folder name only opens the folder if the folder has not yet been opened or connected to.)

2. Alternatively, right click the folder name, then scroll down and click Open.

Working with the toolbar

Using an icon on the toolbar is nearly always faster than selecting a command from a drop-down menu.

• To add more toolbars, select Tools | Customize.
• Select the Toolbar tab to add or remove toolbars.

Follow these steps to use drop-down menus without the mouse:

1. Press and hold the <Alt> key. You will see an underline under one letter of each of the menu titles.

2. Press the letter key of the letter underlined in the drop down menu you want. For instance, press 'r' for the 'Repository' menu. The menu will appear.

3. Press the letter key of the letter underlined in the option you want. For instance, press 'w' for 'Print Preview'.

4. Alternatively, after you have pressed the <Alt> key, use the right/left and up/down arrows to scroll across and down the menus. Press Enter when the desired command is highlighted.

• To use the 'Create Customized Toolbars' feature to tailor a toolbar for the functions you use frequently, press <Alt> <T> then <C>.

• To delete customized icons, select Tools | Customize and select the Tools tab. You can add an icon to an existing toolbar or create a new toolbar, depending on where you "drag and drop" the icon. (Note: adding the 'Arrange' icon can speed up the process of arranging mapping transformations.)

• To rearrange the toolbars, click and drag the double bar that begins each toolbar. You can insert more than one toolbar at the top of the designer tool to avoid having the buttons go off the edge of the screen. Alternatively, you can move toolbars to the bottom, side, or between the workspace and the message windows (which is a handy place to put the transformations toolbar).

• To use a Docking\UnDocking window (e.g., Repository Navigator), double click on the window's title bar. If you have a problem making it dock again, right click somewhere in the white space of the runaway window (not the title bar) and make sure that the "Allow Docking" option is checked. When it is checked, drag the window to its proper place and, when an outline of where the window used to be appears, release the window.

Keyboard Shortcuts

Use the following keyboard shortcuts to perform various operations in Mapping Designer and Workflow Manager.

To:                                                         Press:
Cancel editing in an object                                 Esc
Check and uncheck a check box                               Space Bar
Copy text from an object onto the clipboard                 Ctrl+C
Cut text from an object onto the clipboard                  Ctrl+X
Edit the text of an object                                  F2, then move the cursor to the desired location
Find all combination and list boxes                         Type the first letter of the list
Find tables or fields in the workspace                      Ctrl+F
Move around objects in a dialog box                         Ctrl+directional arrows
Paste copied or cut text from the clipboard into an object  Ctrl+V
Select the text of an object                                F2
Start Help                                                  F1

Mapping Designer

Navigating the Workspace

When using the "drag & drop" approach to create Foreign Key/Primary Key relationships between tables, be sure to start in the Foreign Key table and drag the key/field to the Primary Key table. Set the Key Type value to "NOT A KEY" prior to dragging.

Follow these steps to quickly select multiple transformations:

1. Hold the mouse down and drag to view a box.
2. Be sure the box touches every object you want to select. The selected items will have a distinctive outline around them.
3. If you miss one or have an extra, you can hold down the <Shift> or <Ctrl> key and click the offending transformations one at a time. They will alternate between being selected and deselected each time you click on them.

Follow these steps to copy and link fields between transformations:

1. You can select multiple ports when you are trying to link to the next transformation.

2. When you are linking multiple ports, they are linked in the same order as they are in the source transformation. You need to highlight the fields you want in the source transformation and hold the mouse button over the port name in the target transformation that corresponds to the source transformation port.

3. Use the Autolink function whenever possible. It is located under the Layout menu or accessible by right-clicking somewhere in the background of the Mapping Designer.

4. Autolink can link by name or position. PowerCenter version 6 or above gives you the option of entering prefixes or suffixes (when you click the 'More' button). This is especially helpful when you are trying to autolink to a Router transformation, for instance. Each group created in a Router will have a distinct suffix number added to the port/field name. To autolink, you need to choose the proper Router and Router group in the 'From Transformation' space. You also need to click the 'More' button and enter the appropriate suffix value. You must do both to create a link.

5. Autolink does not work if any of the fields in the 'To' transformation are already linked to another group or another stream. No error appears; the links are just not created.

Sometimes a shared object is very close to, but not exactly, what you need. In this case, you may want to make a copy with some minor alterations to suit your purposes. If you simply click and drag the object, PowerCenter either prompts you to create a shortcut or keeps the copy reusable. Follow these steps to make a non-reusable copy of a reusable object:

1. Open the target folder.
2. Select the object that you want to make a copy of, either in the source or target folder.
3. Drag the object over the workspace.
4. Press and hold the <Ctrl> key (the crosshairs symbol '+' will appear in a white box).
5. Release the mouse button, then release the <Ctrl> key.
6. A copy confirmation window and a copy wizard window will appear. Note that the look and feel of the copy wizard differs between versions 6 and 7.
7. The newly created transformation no longer says that it is reusable and you are free to make changes without affecting the original reusable object.

Editing Tables/Transformation

Follow these steps to move one port in a transformation:

1. Double click the transformation and make sure you are in the "Ports" tab. (You go directly to the Ports tab if you double click a port instead of the colored title bar.)

2. Highlight the port and use the up/down arrow keys with the mouse (see red circle in the figure below).

3. Or, highlight the port and then press <Alt><w> for down or <Alt> <u> for up. (Note: You can hold down the <Alt> and hit the <w> or <u> as often as you need although this may not be practical if you are moving far).

Alternatively, you can accomplish the same thing by following these steps:

1. Highlight the port you want to move by clicking the number beside the port (note the blue arrow in the figure below).

2. Hold down the <Alt> key and grab the port by its number.
3. Drag the port to the desired location (the list of ports scrolls when you reach the end). A red line indicates the new location (note the red arrow in the figure below).

4. When the red line is pointing to the desired location, release the mouse button, then release the <Alt> key.

Note that you cannot move more than one port at a time with this method. See below for instructions on moving more than one port at a time.

If you are using PowerCenter version 6 or 7 and the ports you are moving are adjacent, you can follow these steps to move more than one port at a time:

1. Highlight the ports you want to move by clicking the number beside the port while holding down the <Ctrl> key.

2. Use the up/down arrows (see the red circle above) to move the ports to the desired location.

• To add a new field or port, first highlight an existing field or port, then press <Alt><f> to insert the new field/port below it.

• To validate the Default value, first highlight the port you want to validate, and then press <Alt><v>.

• When adding a new port, just begin typing. There is no need to first press DEL to remove the "NEWFIELD" text, or to click OK when you have finished.

This is also true when you are editing a field, as long as you have highlighted the port so that the entire Port Name cell has a white box around it. The white box is created when you click on the white space of the port name cell. If you click on the words in the Port Name cell, a cursor will appear where you click. At this point, delete the parts of the word you don’t want.

• When moving about in the fields of the Ports tab of the Expression Editor, use the SPACE bar to check or uncheck the port type. Be sure to highlight the port box to check or uncheck the port type.

Follow either of these steps to quickly open the Expression Editor of an OUT/VAR port:

1. Highlight the expression so that there is a box around the cell and press <F2> followed by <F3>.

2. Or, highlight the expression so that there is a cursor somewhere in the expression, then press <F2>.

• To cancel an edit in the grid, press <Esc> so the changes are not saved.
• For all combo/dropdown list boxes, type the first letter on the list to select the item you want. For instance, you can highlight a port's Data type box without displaying the drop-down. To change it to 'binary', type <b>. Then use the arrow keys to go down to the next port. This is very handy if you want to change all fields to string for example because using the up and down arrows and hitting a letter is much faster than opening the drop-down menu and making a choice each time.

• To copy a selected item in the grid, press <Ctrl><c>.
• To paste a selected item from the Clipboard to the grid, press <Ctrl><v>.
• To delete a selected field or port from the grid, press <Alt><c>.
• To copy a selected row from the grid, press <Alt><o>.
• To paste a selected row from the grid, press <Alt><p>.

You can use either of the following methods to delete more than one port at a time.

• You can repeatedly hit the cut button (red circle below); or

• You can highlight several records and then click the cut button. Use <Shift> to highlight many items in a row or <Ctrl> to highlight multiple non-contiguous items. Be sure to click on the number beside the port, not the port name, while you are holding <Shift> or <Ctrl>.

Editing Expressions

Follow either of these steps to expedite validation of a newly created expression:

• Click on the <Validate> button or press <Alt> and <v>. Note that this validates and leaves the Expression Editor up.

• Or, press <OK> to initiate parsing/validating the expression. The system will close the Expression Editor if the validation is successful. If you click OK once again in the "Expression parsed successfully" pop-up, the Expression Editor remains open.

There is little need to type in the Expression Editor. The tabs list all functions, ports, and variables that are currently available. If you want an item to appear in the Formula box, just double click on it in the appropriate list on the left. This helps to avoid typographical errors and mistakes such as including an output-only port name in an expression.

In version 6.0 and above, if you change a port name, PowerCenter automatically updates any expression that uses that port with the new name.

Be careful about changing data types. Any expression using the port with the new data type may remain valid, but not perform as expected. If the change invalidates the expression, it will be detected when the object is saved or if the Expression Editor is active for that expression.

The following table summarizes additional shortcut keys that are applicable only when working with Mapping Designer:

Repository Object Shortcuts

A repository object defined in a shared folder can be reused across folders by creating a shortcut (i.e., a dynamic link to the referenced object).

Whenever possible, reuse source definitions, target definitions, reusable transformations, mapplets, and mappings. Reusing objects allows you to share complex mappings, mapplets, or reusable transformations across folders, saves space in the repository, and reduces maintenance.

Follow these steps to create a repository object shortcut:

1. Expand the “Shared Folder”.
2. Click and drag the object definition into the mapping that is open in the workspace.
3. As the cursor enters the workspace, the object icon will appear along with a small curve.
4. A dialog box will appear. Confirm that you want to create a shortcut.

If you want to copy an object from a shared folder instead of creating a shortcut, hold down the <Ctrl> key before dropping the object into the workspace.

Workflow Manager

Navigating the Workspace

When editing a repository object or maneuvering around the Workflow Manager, use the following shortcuts to speed up the operation you are performing:

To                                                      Press
Add a new field or port                                 Alt + F
Copy a row                                              Alt + O
Cut a row                                               Alt + C
Move current row down                                   Alt + W
Move current row up                                     Alt + U
Paste a row                                             Alt + P
Validate the default value in a transformation          Alt + V
Open the Expression Editor from the expression field    F2, then press F3
Start the debugger                                      F9

Repository Object Shortcuts

Mappings that reside in a “shared folder” can be reused within workflows by creating shortcut mappings.

A set of workflow logic can be reused within workflows by creating a reusable worklet.

To                                              Press
Create links                                    Ctrl+F2 to select the first task you want to link; Tab to select the rest of the tasks you want to link; Ctrl+F2 again to link all the tasks you selected
Edit a task name in the workspace               F2
Expand a selected node and all its children     SHIFT + * (use the asterisk on the numeric keypad)
Move across to select tasks in the workspace    Tab
Select multiple tasks                           Ctrl + Mouse click

Web Services

Challenge

Understanding PowerCenter Connect for Web Services and configuring PowerCenter to access a secure web service.

Description

PowerCenter Connect for Web Services (aka WebServices Consumer) allows PowerCenter to act as a web services client to consume external web services. PowerCenter Connect for Web Services uses the Simple Object Access Protocol (SOAP) to communicate with the external web service provider. An external web service can be invoked from PowerCenter in three ways:

• Web Service source
• Web Service transformation
• Web Service target

Web Service Source Usage

PowerCenter supports a request-response type of operation using Web Services source. You can use the web service as a source if the input in the SOAP request remains fairly constant since input values for a web service source can only be provided at the source transformation level.

Note: If a SOAP fault occurs, it is treated as a fatal error, logged in the session log, and the session is terminated.

The following steps serve as an example for invoking a temperature web service to retrieve the current temperature for a given zip code (a stand-alone sketch of the equivalent SOAP call follows the steps):

1. In Source Analyzer, click Import from WSDL(Consumer).
2. Specify URL http://www.xmethods.net/sd/2001/TemperatureService.wsdl and pick operation getTemp.
3. Open the Web Services Consumer Properties tab and click Populate SOAP request and populate the desired zip code value.
4. Connect the output port of the web services source to the target.
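
For comparison, the request-response pattern that the Web Service source implements looks like the following when written as a stand-alone SOAP client. This is a sketch only: it uses the third-party Python zeep library, the xmethods temperature service named above has long been retired, and the zipcode parameter name is an assumption based on that WSDL.

    # Stand-alone SOAP request-response sketch (not PowerCenter code).
    # Assumes the third-party 'zeep' library and that the WSDL above is still reachable.
    from zeep import Client

    client = Client("http://www.xmethods.net/sd/2001/TemperatureService.wsdl")
    temperature = client.service.getTemp(zipcode="10001")   # one request, one response
    print(temperature)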

Web Service Transformation Usage

PowerCenter also supports a request-response type of operation using Web Services transformation. You can use the web service as a transformation if your input data is available midstream and you want to capture the response values from the web service. If a SOAP fault occurs, it is treated as a row error and logged in the session log.

The following steps serve as an example for invoking a Stock Quote web service to learn the price for each of the ticker symbols available in a flat file (a sketch of the per-row call pattern follows the steps):

1. In transformation developer, create a web service consumer transformation.
2. Specify URL http://services.xmethods.net/soap/urn:xmethods-delayed-quotes.wsdl and pick operation getQuote.
3. Connect the input port of this transformation to the field containing the ticker symbols.
4. To invoke the web service for each input row, change to source-based commit and the interval to 1. Also change the Transaction Scope to Transaction in the web services consumer transformation.
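
Conceptually, the transformation issues one request per input row. A rough stand-alone sketch of that per-row pattern is shown below; the zeep library, the tickers.txt file name, and the symbol parameter name are all assumptions for illustration.

    # Per-row invocation sketch: one getQuote call per ticker symbol (not PowerCenter code).
    from zeep import Client

    client = Client("http://services.xmethods.net/soap/urn:xmethods-delayed-quotes.wsdl")
    with open("tickers.txt") as f:             # assumed flat file, one symbol per line
        for line in f:
            symbol = line.strip()
            if symbol:
                price = client.service.getQuote(symbol=symbol)   # one SOAP call per row
                print(symbol, price)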

Web Service Target Usage

PowerCenter supports a one-way type of operation using Web Services target. You can use the web service as a target if you only need to send a message (i.e., you do not need a response). PowerCenter only waits for the web server to start processing the message; it does not wait for the web server to finish processing the web service operation. If a SOAP fault occurs, it is treated as a row error and logged in the session log.

The following provides an example for invoking a sendmail web service:

1. In Warehouse Designer, click Import from WSDL(Consumer).
2. Specify URL http://webservices.matlus.com/scripts/emailwebservice.dll/wsdl/IEmailService and pick operation SendMail.
3. In the mapping, connect the input ports of the web services target to the ports containing appropriate values.

PowerCenter Connect for Web Services and Web Services Provider

Informatica also offers a product called Web Services Provider which differs from PowerCenter Connect for Web Services.

• In Web Services Provider, PowerCenter acts as a Service Provider and exposes many key functionalities as web services.

• In PowerCenter Connect for Web Services, PowerCenter acts as a web service client and consumes external web services.

• It is not necessary to install or configure Web Services Provider in order to use PowerCenter Connect for Web Services.

Configuring PowerCenter to Invoke a Secure Web Service

Secure Sockets Layer (SSL) is used to provide such security features as authentication and encryption to web services applications. The authentication certificates follow the Public Key Infrastructure (PKI) standard, a system of digital certificates provided by certificate authorities to verify and authenticate parties of Internet communications or transactions. These certificates are managed in the following two keystore files:

• Truststore. Truststore holds the public keys for the entities it can trust. PowerCenter uses the entries in the Truststore file to authenticate the external web services servers.

• Keystore (Clientstore). Clientstore holds both the entity’s public and private keys. PowerCenter sends the entries in the Clientstore file to the web services server so that the web services server can authenticate the PowerCenter server.

By default, the keystore files jssecacerts and cacerts in the $(JAVA_HOME)/lib/security directory are used for Truststores. You can also create new keystore files and configure the TrustStore and ClientStore parameters in the PowerCenter Server setup to point to these files. Keystore files can contain multiple certificates and are managed using utilities like keytool.

SSL authentication can be performed in three ways:

• Server authentication
• Client authentication
• Mutual authentication

Server authentication:

When establishing an SSL session in server authentication, the web services server sends its certificate to PowerCenter and PowerCenter verifies whether the server certificate can be trusted. Only the truststore file needs to be configured in this case.

Assumptions:

• The web services server certificate is stored in the file server.cer (a short sketch for capturing this file follows the steps below).
• The PowerCenter Server (client) public/private key pair is available in the keystore client.jks.

Steps:

1. Import the server’s certificate into the PowerCenter Server’s truststore file. You can use either of the default keystores (jssecacerts, cacerts) or create your own keystore file.
2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit
3. At the prompt for trusting this certificate, type “yes”.
4. Configure PowerCenter to use this truststore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Truststore, give the full path and name of the keystore file (e.g., c:\trust.jks).
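
If the server.cer file assumed above is not already available, one way to capture the server's certificate is sketched below. The host name and port are placeholders for your web service endpoint; the file is written in PEM (Base64) encoding, which keytool -import accepts.

    # Capture the web service server's certificate for import with keytool.
    # 'ws.example.com' and port 443 are placeholders; adjust for the real endpoint.
    import ssl

    pem_cert = ssl.get_server_certificate(("ws.example.com", 443))
    with open("server.cer", "w") as f:
        f.write(pem_cert)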

Client authentication:

When establishing an SSL session in client authentication, PowerCenter sends its certificate to the web services server. The web services server then verifies whether the PowerCenter Server can be trusted. In this case, you need only the clientstore file.

Steps:

1. The keystore containing the private/public key pair is called client.jks. Be sure the client private key password and the keystore password are the same (e.g., “changeit”).

2. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Clientstore, type the full path and name of the keystore file (e.g., c:\client.jks).

3. Add an additional JVM parameter in the PowerCenter Server setup and give the value as -Djavax.net.ssl.keyStorePassword=changeit

Mutual authentication:

When establishing an SSL session in mutual authentication, both PowerCenter Server and the Web Services server send their certificates to each other and both verify if the other one can be trusted. You need to configure both the clientstore and the truststore files.

Steps:

1. Import the server’s certificate into the PowerCenter Server’s truststore file.
2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit
3. Configure PowerCenter to use this truststore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Truststore, type the full path and name of the keystore file (e.g., c:\trust.jks).
4. The keystore containing the client public/private key pair is called client.jks. Be sure the client private key password and the keystore password are the same (e.g., “changeit”).
5. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Clientstore, type the full path and name of the keystore file (e.g., c:\client.jks).
6. Add an additional JVM parameter in the PowerCenter Server setup and type the value as -Djavax.net.ssl.keyStorePassword=changeit

Note: If your client private key is not already present in the keystore file, you cannot use the keytool command to import it. Keytool can only generate a private key; it cannot import a private key into a keystore. In this case, use an external Java utility such as utils.ImportPrivateKey (WebLogic) or KeystoreMove (to convert PKCS#12 format to JKS) to move the key into the JKS keystore.

Converting Other Formats of Certificate Files

There are a number of formats of certificate files available: DER format (.cer and .der extensions); PEM format (.pem extension); and PKCS#12 format (.pfx or .P12 extension). You can convert from one format of certificate to another using openssl. Refer to the openssl documentation for complete information on such conversion. A few examples are given below, followed by a scripted alternative:

To convert from PEM to DER: assuming that you have a PEM file called server.pem

• openssl x509 -in server.pem -inform PEM -out server.der -outform DER

To convert a PKCS12 file, you must first convert to PEM, and then from PEM to DER:

Assuming that your PKCS12 file is called server.pfx, the two commands are:

• openssl pkcs12 -in server.pfx -out server.pem
• openssl x509 -in server.pem -inform PEM -out server.der -outform DER
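
If openssl is not available, the same PEM-to-DER conversion can be scripted. The sketch below is an illustration only; it assumes a recent version of the third-party Python cryptography package and the file names used in the openssl examples above.

    # PEM-to-DER certificate conversion without openssl.
    # Assumes the third-party 'cryptography' package (pip install cryptography).
    from cryptography import x509
    from cryptography.hazmat.primitives.serialization import Encoding

    with open("server.pem", "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    with open("server.der", "wb") as f:
        f.write(cert.public_bytes(Encoding.DER))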

Working with PowerCenter Connect for MQSeries

Challenge

Understanding how to use IBM MQSeries applications in PowerCenter mappings.

Description

MQSeries applications communicate by sending messages asynchronously rather than by calling each other directly. Applications can also request data using a "request message" on a message queue. Because no open connection is required between systems, they can run independently of one another. MQSeries enforces no structure on the content or format of the message; this is defined by the application.

With more and more requirements for “on-demand” or real-time analytics, as well as the development of Enterprise Application Integration (EAI) capabilities, MQ Series has become an important vehicle for providing information to data warehouses in a real-time mode.

PowerCenter provides data integration for transactional data generated by continuously running online messaging systems (such as MQSeries). For these types of messaging systems, PowerCenter’s Zero Latency (ZL) Engine provides immediate processing of trickle-feed data, allowing the processing of real-time data flows in both a uni-directional and a bi-directional manner.

TIP: In order to enable PowerCenter’s ZL engine to process MQ messages in real-time, the workflow must be configured to run continuously and a real-time MQ filter needs to be applied to the MQ source qualifier (such as idle time, reader time limit, or message count).

MQSeries Architecture

IBM MQSeries is a messaging and queuing application that permits programs to communicate with one another across heterogeneous platforms and network protocols using a consistent application-programming interface.

MQSeries architecture has three parts:

1. Queue Manager
2. Message Queue, which is a destination to which messages can be sent
3. MQSeries Message, which incorporates a header and a data component

Queue Manager

• PowerCenter connects to Queue Manager to send and receive messages.
• A Queue Manager may publish one or more MQ queues.
• Every message queue belongs to a Queue Manager.
• Queue Manager administers queues, creates queues, and controls queue operation.

Message Queue

• PowerCenter connects to Queue Manager to send and receive messages to one or more message queues.

• PowerCenter is responsible for deleting the message from the queue after processing it.

TIP: There are several ways to maintain transactional consistency (i.e., clean up the queue after reading). Refer to the Informatica Webzine article on Transactional Consistency for details on the various ways to delete messages from the queue.

MQSeries Message

An MQSeries message is composed of two distinct sections:

• MQSeries header. This section contains data about the queue message itself. Message header data includes the message identification number, message format, and other message descriptor data. In PowerCenter, MQSeries sources and dynamic MQSeries targets automatically incorporate MQSeries message header fields.

• MQSeries message data block. A single data element that contains the application data (sometimes referred to as the "message body"). The content and format of the message data is defined by the application that puts the message on the queue (see the sketch following this list).
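
To make the two sections concrete, the following is a minimal sketch (outside of PowerCenter) that reads one message with the third-party pymqi client: the message descriptor carries the header fields, and the returned bytes are the data block. The queue manager, channel, host, and queue names are placeholders, not values from this Best Practice.

    # Illustration of an MQSeries message's two sections: header (MQMD) vs. data block.
    # Assumes the third-party 'pymqi' client; all connection and queue names are placeholders.
    import pymqi

    qmgr = pymqi.connect("QM1", "DEV.APP.SVRCONN", "mqhost(1414)")
    queue = pymqi.Queue(qmgr, "ORDERS.IN")
    md = pymqi.MD()                            # message descriptor = the MQSeries header
    body = queue.get(None, md, pymqi.GMO())    # returned bytes = the message data block
    print(md.MsgId, md.Format, md.PutDate)     # a few of the header fields
    queue.close()
    qmgr.disconnect()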

Extracting Data from a Queue

Reading Messages from a Queue

In order for PowerCenter to extract from the message data block, the source system must define the data in one of the following formats:

• Flat file (fixed width or delimited)
• XML
• COBOL
• Binary

When reading a message from a queue, the PowerCenter mapping must contain an MQ Source Qualifier (MQSQ). If the mapping also needs to read the message data block, then an Associated Source Qualifier (ASQ) is also needed. When developing an MQSeries mapping, the MESSAGE_DATA block is re-defined by the ASQ. Based on the format of the source data, PowerCenter will generate the appropriate transformation for parsing the MESSAGE_DATA. Once associated, the MSG_ID field is linked within the associated source qualifier transformation.

Applying Filters to Limit Messages Returned

Filters can be applied to the MQ Source Qualifier to reduce the number of messages read.

Filters can also be added to control the length of time PowerCenter reads the MQ queue.

If no filters are applied, PowerCenter reads all messages in the queue and then stops reading.

Example:

PutDate >= “20040901” && PutDate <= “20040930”

TIP: In order to leverage reading a single MQ queue to process multiple record types, have the source application populate an MQ header field and then filter on the value set in this field (Example: ApplIdentityData = ‘TRM’).

Using MQ Functions

PowerCenter provides built-in functions that can also be used to filter message data.

• Functions can be used to control the end-of-file of the MQSeries queue.
• Functions can be used to enable PowerCenter real-time data extraction.

Available Functions:

Function            Description
Idle(n)             Time RT remains idle before stopping.
MsgCount(n)         Number of messages read from the queue before stopping.
StartTime(time)     GMT time when RT begins reading queue.
EndTime(time)       GMT time when RT stops reading queue.
FlushLatency(n)     Time period RT waits before committing messages read from the queue.
ForcedEOQ(n)        Time period RT reads messages from the queue before stopping.
RemoveMsg(TRUE)     Removes messages from the queue.

TIP: In order to enable real-time message processing, use the FlushLatency() or ForcedEOQ() MQ functions as part of the filter expression in the MQSQ.

Loading Messages to a Queue

PowerCenter supports two types of MQ targeting: Static and Dynamic.

• Static MQ Targets. Used for loading message data (instead of header data) to the target. A Static target does not load data to the message header fields. Use the target definition specific to the format of the message data (i.e., flat file, XML, or COBOL). Design the mapping as if it were not using MQSeries, then configure the target connection to point to an MQ message queue in the session when using MQSeries.

• Dynamic. Used for binary targets only, and when loading data to a message header. Note that certain message headers in an MQSeries message require a predefined set of values assigned by IBM.

Dynamic MQSeries Targets

Use this type of target if message header fields need to be populated from the ETL pipeline.

MESSAGE_DATA field data type is binary only.

Certain fields cannot be populated by the pipeline (i.e., they are set by the target MQ environment):

• UserIdentifier
• AccountingToken
• ApplIdentityData
• PutApplType
• PutApplName
• PutDate
• PutTime
• ApplOriginData

Static MQSeries Targets

Unlike dynamic targets, where an MQ target transformation exists in the mapping, static targets use existing target transformations.

• Flat file
• XML
• COBOL
• RT can only write to one MQ queue per target definition.
• XML targets with multiple hierarchies can generate one or more MQ messages (configurable).

Creating and Configuring MQSeries Sessions

After you create mappings in the Designer, you can create and configure sessions in the Workflow Manager.

Configuring MQSeries Sources

The MQSeries source definition represents the metadata for the MQSeries source in the repository. Unlike other source definitions, you do not create an MQSeries source definition by importing the metadata from the MQSeries source. Since all MQSeries messages contain the same message header and message data fields, the Designer provides an MQSeries source definition with predefined column names.

MQSeries Mappings

MQSeries mappings cannot be partitioned if an associated source qualifier is used.

For MQ Series sources, set the Source Type to the following:

• Heterogeneous - when there is an associated source definition in the mapping. This indicates that the source data is coming from an MQ source, and the message data is in flat file, COBOL or XML format.

• Message Queue - when there is no associated source definition in the mapping.

Note that there are two pages on the Source Options dialog: XML and MQSeries. You can alternate between the two pages to set configurations for each.

Configuring MQSeries Targets

For Static MQSeries targets, select File Target type from the list. When the target is an XML file or XML message data for a target message queue, the target type is automatically set to XML.

• If you load data to a dynamic MQ target, the target type is automatically set to Message Queue.

• On the MQSeries page, select the MQ connection to use for the source message queue, and click OK.

• Be sure to select the MQ checkbox in Target Options for the Associated file type. Then click Edit Object Properties and type:

o the connection name of the target message queue.
o the format of the message data in the target queue (ex. MQSTR).
o the number of rows per message (only applies to flat file MQ targets).

Considerations when Working with MQSeries

The following features and functions are not available to PowerCenter when using MQSeries:

• Lookup transformations can be used in an MQSeries mapping, but lookups on MQSeries sources are not allowed.

• No Debug "Sessions". You must run an actual session to debug a queue mapping.
• Certain considerations are necessary when using AEPs, Aggregators, Joiners, Sorters, Rank, or Transaction Control transformations because they can only be performed on one queue, as opposed to a full data set.

• The MQSeries mapping cannot contain a flat file target definition if you are trying to target an MQSeries queue.

• PowerCenter version 6 and earlier performs a browse of the MQ queue. PowerCenter version 7 provides the ability to perform a destructive read of the MQ queue (instead of a browse).

• PowerCenter version 7 also provides support for active transformations (i.e., Aggregators) in an MQ source mapping.

• PowerCenter version 7 provides MQ message recovery on restart of failed sessions.
• PowerCenter version 7 offers enhanced XML capabilities for mid-stream XML parsing.

Appendix Information

PowerCenter uses the following datatypes in MQSeries mappings:

• IBM MQSeries datatypes. IBM MQSeries datatypes appear in the MQSeries source and target definitions in a mapping.

• Native datatypes. Flat file, XML, or COBOL datatypes associated with an MQSeries message data. Native datatypes appear in flat file, XML and COBOL source definitions. Native datatypes also appear in flat file and XML target definitions in the mapping.

• Transformation datatypes. Transformation datatypes are generic datatypes that PowerCenter uses during the transformation process. They appear in all the transformations in the mapping.

IBM MQSeries Datatypes

MQSeries Datatypes    Transformation Datatypes
MQBYTE                BINARY
MQCHAR                STRING
MQLONG                INTEGER
MQHEX

Values for Message Header Fields in MQSeries Target Messages

MQSeries Message Header    Description
StrucId                    Structure identifier
Version                    Structure version number
Report                     Options for report messages
MsgType                    Message type
Expiry                     Message lifetime
Feedback                   Feedback or reason code
Encoding                   Data encoding
CodedCharSetId             Coded character set identifier
Format                     Format name
Priority                   Message priority
Persistence                Message persistence
MsgId                      Message identifier
CorrelId                   Correlation identifier
BackoutCount               Backout counter
ReplytoQ                   Name of reply queue
ReplytoQMgr                Name of reply queue manager
UserIdentifier             Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is null.
AccountingToken            Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is MQACT_NONE.
ApplIdentityData           Application data relating to identity. The value for ApplIdentityData is null.
PutApplType                Type of application that put the message on the queue. Defined by the environment.
PutApplName                Name of application that put the message on the queue. Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is null.
PutDate                    Date when the message arrives in the queue.
PutTime                    Time when the message arrives in the queue.
ApplOriginData             Application data relating to origin. The value for ApplOriginData is null.
GroupId                    Group identifier
MsgSeqNumber               Sequence number of logical messages within group.
Offset                     Offset of data in physical message from start of logical message.
MsgFlags                   Message flags
OriginalLength             Length of original message

A Mapping Approach to Trapping Data Errors

Challenge

To address data content errors within mappings by re-routing erroneous rows to a target other than the original target table.

Description

Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. In the production environment, data must be checked and validated prior to entry into the data warehouse. One strategy for handling errors is to maintain database constraints. Another approach is to use mappings to trap data errors.

The first step in using mappings to trap errors is to understand and identify the error handling requirements.

Consider the following questions:

• What types of errors are likely to be encountered?
• Of these errors, which ones should be captured?
• What process can capture the possible errors?
• Should errors be captured before they have a chance to be written to the target database?
• Should bad files be used?
• Will any of these errors need to be reloaded or corrected?
• How will the users know if errors are encountered?
• How will the errors be stored?
• Should descriptions be assigned for individual errors?
• Can a table be designed to store captured errors and the error descriptions?

Capturing data errors within a mapping and re-routing these errors to an error table allows for easy analysis by the end users and improves performance. One practical application of the mapping approach is to capture foreign key constraint errors. This can be accomplished by creating a lookup into a dimension table prior to loading the fact table. Referential integrity is assured by including this functionality in a mapping. The database still enforces the foreign key constraints, but erroneous data will not be written to the target table. Also, if constraint errors are captured within the mapping, the PowerCenter server will not have to write the error to the session log and the reject/bad file.

Data content errors can also be captured in a mapping. Mapping logic can identify data content errors and attach descriptions to the errors. This approach can be effective for many types of data content errors, including: date conversion, null values intended for not null target fields, and incorrect data formats or data types.

Error Handling Example

In the following example, we want to capture null values before they enter into target fields that do not allow nulls.

Once we’ve identified the null values, the next step is to separate these errors from the data flow. Use the Router Transformation to create a stream of data that will be the error route. Any row containing an error (or errors) will be separated from the valid data and uniquely identified with a composite key consisting of a MAPPING_ID and a ROW_ID. The MAPPING_ID refers to the mapping name and the ROW_ID is generated by a Sequence Generator. The composite key allows developers to trace rows written to the error tables.

Error tables are important to an error handling strategy because they store the information useful to error identification and troubleshooting. In this example, the two error tables are ERR_DESC_TBL and TARGET_NAME_ERR.

The ERR_DESC_TBL table will hold information about the error, such as the mapping name, the ROW_ID, and a description of the error. This table is designed to hold all error descriptions for all mappings within the repository for reporting purposes.

The TARGET_NAME_ERR table will be an exact replica of the target table with two additional columns: ROW_ID and MAPPING_ID. These two columns allow the TARGET_NAME_ERR and the ERR_DESC_TBL to be linked. The TARGET_NAME_ERR table provides the user with the entire row that was rejected, enabling the user to trace the error rows back to the source. These two tables might look like the following:

The error handling functionality assigns a unique description for each error in the rejected row. In this example, any null value intended for a not null target field will generate an error message such as ‘Column1 is NULL’ or ‘Column2 is NULL’. This step can be done in an Expression Transformation.

After the field descriptions are assigned, we need to break the error row into several rows, with each containing the same content except for a different error description. You can use the Normalizer Transformation to break one row of data into many rows. After a single row of data is separated based on the number of possible errors on it, we need to filter the columns within the row that are actually errors. One record of data may have zero to multiple errors. In this example, the record has three errors. We need to generate three error rows with the different error descriptions (ERROR_DESC) to table ERR_DESC_TBL.

When the error records are written to ERR_DESC_TBL, we can link those records to the one record in table TARGET_NAME_ERR using the ROW_ID and MAPPING_ID. The following chart shows how the two error tables can be linked; focus on the ROW_ID and MAPPING_ID values in both tables. A short sketch of the same routing logic follows the tables.

TARGET_NAME_ERR

Column1    Column2    Column3    ROW_ID    MAPPING_ID
NULL       NULL       NULL       1         DIM_LOAD

ERR_DESC_TBL

FOLDER_NAME    MAPPING_ID    ROW_ID    ERROR_DESC          LOAD_DATE    SOURCE    Target
CUST           DIM_LOAD      1         Column 1 is NULL    SYSDATE      DIM       FACT
CUST           DIM_LOAD      1         Column 2 is NULL    SYSDATE      DIM       FACT
CUST           DIM_LOAD      1         Column 3 is NULL    SYSDATE      DIM       FACT
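
The route-and-describe logic implemented by the Router, Expression, Sequence Generator, and Normalizer transformations can be summarized in the following sketch (not PowerCenter code); the column names and the DIM_LOAD mapping name are taken from the example above.

    # Sketch of the error-routing logic described above: one TARGET_NAME_ERR row per
    # rejected record, plus one ERR_DESC_TBL row per individual error on that record.
    NOT_NULL_COLUMNS = ["Column1", "Column2", "Column3"]

    def route_row(row, row_id, mapping_id="DIM_LOAD"):
        descriptions = ["%s is NULL" % col
                        for col in NOT_NULL_COLUMNS if row.get(col) is None]
        if not descriptions:
            return row, None, []                          # valid row: load the real target
        err_row = dict(row, ROW_ID=row_id, MAPPING_ID=mapping_id)         # TARGET_NAME_ERR
        err_descs = [{"MAPPING_ID": mapping_id, "ROW_ID": row_id, "ERROR_DESC": d}
                     for d in descriptions]                               # ERR_DESC_TBL rows
        return None, err_row, err_descs

    # A row with all three columns NULL yields one error row and three error descriptions.
    valid, err_row, err_descs = route_row({"Column1": None, "Column2": None, "Column3": None}, 1)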

The solution example would look like the following in a mapping:

The mapping approach is effective because it takes advantage of reusable objects, thereby using the same logic repeatedly within a mapplet. This makes error detection easy to implement and manage in a variety of mappings.

By adding another layer of complexity within the mappings, errors can be flagged as ‘soft’ or ‘hard’.

• A ‘hard’ error can be defined as one that would fail when being written to the database, such as a constraint error.

• A ‘soft’ error can be defined as a data content error.

A record flagged as a ‘hard’ error is written to the error route, while a record flagged as a ‘soft’ error can be written to both the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data imperfections while still allowing the records to be processed for end-user reporting.

Ultimately, business organizations need to decide if the analysts should fix the data in the reject table or in the source systems. The advantage of the mapping approach is that all errors are identified as either data errors or constraint errors and can be properly addressed. The mapping approach also reports errors based on projects or categories by identifying the mappings that contain errors. The most important aspect of the mapping approach however, is its flexibility. Once an error type is identified, the error handling logic can be placed anywhere within a mapping. By using the mapping approach to capture identified errors, data warehouse operators can effectively communicate data quality issues to the business users.

Error Handling Strategies

Challenge

The challenge is to accurately and efficiently load data into the target data architecture. This Best Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for handling data errors, and alternatives for addressing the most common types of problems. For the most part, these strategies are relevant whether your data integration project is loading an operational data structure (as with data migrations, consolidations, or loading various sorts of operational data stores) or loading a data warehousing structure.

Description

Regardless of target data structure, your loading process must validate that the data conforms to known rules of the business. When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two conflicting goals:

• The need for accurate information
• The ability to analyze or process the most complete information available with the understanding that errors can exist.

Data Integration Process Validation

In general, there are three methods for handling data errors detected in the loading process:

• Reject All. This is the simplest to implement since all errors are rejected from entering the target when they are detected. This provides a very reliable target that the users can count on as being correct, although it may not be complete. Both dimensional and factual data can be rejected when any errors are encountered. Reports indicate what the errors are and how they affect the completeness of the data. Dimensional or Master Data errors can cause valid factual data to be rejected because a foreign key relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a subsequent load. Once the corrected rows have been loaded, the factual data will be reprocessed and loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction since the users need to take into account that the data they are looking at may not be a complete picture of the operational systems until the errors are fixed. For an operational system, this delay may affect downstream transactions. The development effort required to fix a Reject All scenario is minimal, since the rejected data can be processed through existing mappings once it has been fixed. Minimal additional code may need to be written since the data will only enter the target if it is correct, and it would then be loaded into the data mart using the normal process.

• Reject None. This approach gives users a complete picture of the available data without having to consider data that was not available due to it being rejected during the load process. The problem is that the data may not be complete or accurate. All of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty transactions. With Reject None, the complete set of data is loaded but the data may not support correct transactions or aggregations. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total numbers that are correct, but incorrect detail numbers. After the data is fixed, reports may change, with detail information being redistributed along different hierarchies. The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct all of the target data structures, which can be a time-consuming effort based on the delay between an error being detected and fixed. The development strategy may include removing information from the target, restoring backup tapes for each night’s load, and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or data marts.

• Reject Critical. This method provides a balance between missing information and incorrect information. This approach involves examining each row of data, and determining the particular data elements to be rejected. All changes that are valid are processed into the target to allow for the most complete picture. Rejected elements are reported as errors so that they can be fixed in the source systems and loaded on a subsequent run of the ETL process. This approach requires categorizing the data in two ways: 1) as Key Elements or Attributes, and 2) as Inserts or Updates. Key elements are required fields that maintain the data integrity of the target and allow for hierarchies to be summarized at different levels in the organization. Attributes provide additional descriptive information per key element. Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the dimension data row in order to load properly. Updates do not affect the data integrity as much because the factual data can usually be loaded with the existing dimensional data unless the update is to a Key Element. The development effort for this method is more extensive than Reject All since it involves classifying fields as critical or non-critical, and developing logic to update the target and flag the fields that are in error. The effort also incorporates some tasks from the Reject None approach in that processes must be developed to fix incorrect data in the entire target data architecture.

Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target. By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the target on each run of the ETL process, while at the same time screening out the unverifiable data fields. However, business management needs to understand that some information may be held out of the target, and also that some of the information in the target data structures may be at least temporarily allocated to the wrong hierarchies.

Using Profiles

Profiles are tables used to track history changes to the source data. As the source systems change, Profile records are created with date stamps that indicate when the change took place. This allows power users to review the target data using either current (As-Is) or past (As-Was) views of the data.

Profiles should occur once per change in the source systems. Problems occur when two fields change in the source system and one of those fields produces an error. When the second field is fixed, it is difficult for the ETL process to produce a reflection of data changes since there is now a question whether to update a previous Profile or create a new one. The first value passes validation, which produces a new Profile record, while the second value is rejected and is not included in the new Profile. When this error is fixed, it would be desirable to update the existing Profile rather than creating a new one, but the logic needed to perform this UPDATE instead of an INSERT is complicated.

If a third field is changed before the second field is fixed, the correction process cannot be automated. The following hypothetical example represents three field values in a source system. The first row on 1/1/2000 shows the original values. On 1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is invalid. On 1/10/2000 Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.

Date         Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000     Closed Sunday    Black            Open 9 – 5
1/5/2000     Open Sunday      BRed             Open 9 – 5
1/10/2000    Open Sunday      BRed             Open 24hrs
1/15/2000    Open Sunday      Red              Open 24hrs

Three methods exist for handling the creation and update of Profiles:

1. The first method produces a new Profile record each time a change is detected in the source. If a field value was invalid, then the original field value is maintained.

Date         Profile Date    Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000     1/1/2000        Closed Sunday    Black            Open 9 – 5
1/5/2000     1/5/2000        Open Sunday      Black            Open 9 – 5
1/10/2000    1/10/2000       Open Sunday      Black            Open 24hrs
1/15/2000    1/15/2000       Open Sunday      Red              Open 24hrs

By applying all corrections as new Profiles in this method, we simplify the process by applying all changes in the source system directly to the target. Each change, regardless of whether it is a fix to a previous error, is applied as a new change that creates a new Profile. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a mistake was entered on the first change and should be reflected in the first Profile. The second Profile should not have been created.

2. The second method updates the first Profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the Profile record for the change to Field 3.

Date         Profile Date          Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000     1/1/2000              Closed Sunday    Black            Open 9 – 5
1/5/2000     1/5/2000              Open Sunday      Black            Open 9 – 5
1/10/2000    1/5/2000 (Update)     Open Sunday      Black            Open 24hrs
1/15/2000    1/5/2000 (Update)     Open Sunday      Red              Open 24hrs

If we try to apply changes to the existing Profile, as in this method, we run the risk of losing Profile information. If the third field changes before the second field is fixed, we show the third field changed at the same time as the first. When the second field was fixed it would also be added to the existing Profile, which incorrectly reflects the changes in the source system.

3. The third method creates only two new Profiles, but then causes an update to the Profile records on 1/15/2000 to fix the Field 2 value in both.

Date         Profile Date          Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000     1/1/2000              Closed Sunday    Black            Open 9 – 5
1/5/2000     1/5/2000              Open Sunday      Black            Open 9 – 5
1/10/2000    1/10/2000             Open Sunday      Black            Open 24hrs
1/15/2000    1/5/2000 (Update)     Open Sunday      Red              Open 9 – 5
1/15/2000    1/10/2000 (Update)    Open Sunday      Red              Open 24hrs

If we try to implement a method that updates old Profiles when errors are fixed, as in this option, we need to create complex algorithms that handle the process correctly. It involves being able to determine when an error occurred and examining all Profiles generated since then and updating them appropriately. And, even if we create the algorithms to handle these methods, we still have an issue of determining if a value is a correction or a new value. If an error is never fixed in the source system, but a new value is entered, we would identify it as a previous error, causing an automated process to update old Profile records, when in reality a new Profile record should have been entered.

Recommended Method

A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters a new, correct value it flags it as part of the load strategy as a potential fix that should be applied to old Profile records. In this way, the corrected data enters the target as a new Profile record, but the process of fixing old Profile records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another process examines the existing Profile records and corrects them as necessary. This method only delays the As-Was analysis of the data until the correction method is determined because the current information is reflected in the new Profile.

Data Quality Edits

Quality indicators can be used to record definitive statements regarding the quality of the data received and stored in the target. The indicators can be appended to existing data tables or stored in a separate table linked by the primary key. Quality indicators can be used to:

• show the record and field level quality associated with a given record at the time of extract

• identify data sources and errors encountered in specific records
• support the resolution of specific record error types via an update and resubmission process.

Quality indicators may be used to record several types of errors – e.g., fatal errors (missing primary key value), missing data in a required field, wrong data type/format, or invalid data value. If a record contains even one error, data quality (DQ) fields will be appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors were encountered. Records containing a fatal error are stored in a Rejected Record Table and associated to the original file name and record number. These records cannot be loaded to the target because they lack a primary key field to be used as a unique record identifier in the target.

The following types of errors cannot be processed:

• A source record does not contain a valid key. This record would be sent to a reject queue. Metadata will be saved and used to generate a notice to the sending system indicating that x number of invalid records were received and could not be processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has been replaced or not.

• The source file or record is illegible. The file or record would be sent to a reject queue. Metadata indicating that x number of invalid records were received and could not be processed may or may not be available for a general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is possible to determine whether the invalid record has been replaced or not. If the file or record is illegible, it is likely that individual unique records within the file are not identifiable. While information can be provided to the source system site indicating there are file errors for x number of records, specific problems may not be identifiable on a record-by-record basis.

In these error types, the records can be processed, but they contain errors:

• A required (non-key) field is missing.
• The value in a numeric or date field is non-numeric.
• The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is used for this validation.

When an error is detected during ingest and cleansing, the identified error type is recorded.

Quality Indicators (Quality Code Table)

The requirement to validate virtually every data element received from the source data systems mandates the development, implementation, capture and maintenance of quality indicators. These are used to indicate the quality of incoming data at an elemental level. Aggregated and analyzed over time, these indicators provide the information necessary to identify acute data quality problems, systemic issues, business process problems and information technology breakdowns.

The quality indicators ("0"-No Error, "1"-Fatal Error, "2"-Missing Data from a Required Field, "3"-Wrong Data Type/Format, "4"-Invalid Data Value, and "5"-Outdated Reference Table in Use) provide a concise indication of the quality of the data within specific fields for every data type. These indicators give operations staff, data quality analysts, and users the opportunity to readily identify issues potentially impacting the quality of the data. At the same time, they provide the level of detail necessary for acute quality problems to be remedied in a timely manner.
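The logic behind these indicators can be illustrated with a small sketch (shown here in Python purely for illustration; in practice this would be built into the cleansing mappings). The field names, check functions, and 'DQ_' column prefix are assumptions, not part of the standard:

# Quality indicator codes as defined above (0-5).
DQ_NO_ERROR = "0"
DQ_FATAL = "1"
DQ_MISSING_REQUIRED = "2"
DQ_WRONG_TYPE_FORMAT = "3"
DQ_INVALID_VALUE = "4"
DQ_OUTDATED_REFERENCE = "5"

def append_dq_indicators(record, checks):
    """Return the record plus one DQ indicator field per original field.

    'record' is a dict of field name -> value; 'checks' maps each field
    name to a function returning one of the DQ codes above. Both are
    hypothetical structures used only for illustration.
    """
    dq_fields = {}
    for field, value in record.items():
        check = checks.get(field)
        dq_fields["DQ_" + field] = check(value) if check else DQ_NO_ERROR
    out = dict(record)
    out.update(dq_fields)
    return out

# Example usage with assumed fields: a missing required CUSTOMER_NAME
# is flagged with code "2" while the other field passes.
checks = {
    "CUSTOMER_ID": lambda v: DQ_FATAL if v is None else DQ_NO_ERROR,
    "CUSTOMER_NAME": lambda v: DQ_MISSING_REQUIRED if not v else DQ_NO_ERROR,
}
row = {"CUSTOMER_ID": 101, "CUSTOMER_NAME": ""}
print(append_dq_indicators(row, checks))
# {'CUSTOMER_ID': 101, 'CUSTOMER_NAME': '', 'DQ_CUSTOMER_ID': '0', 'DQ_CUSTOMER_NAME': '2'}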

Handling Data Errors

The need to periodically correct data in the target is inevitable. But how often should these corrections be performed?

The correction process can be as simple as updating field information to reflect actual values, or as complex as deleting data from the target, restoring previous loads from tape, and then reloading the information correctly. Although we try to avoid performing a complete database restore and reload from a previous point in time, we cannot rule this out as a possible solution.

Reject Tables vs. Source System

As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data and the related error messages indicating the causes of error. The business needs to decide whether analysts should be allowed to fix data in the reject tables, or whether data fixes will be restricted to source systems. If errors are fixed in the reject tables, the target will not be synchronized with the source systems. This can present credibility problems when trying to track the history of changes in the target data architecture. If all fixes occur in the source systems, then these fixes must be applied correctly to the target data.

Attribute Errors and Default Values

Attributes provide additional descriptive information about a dimension concept. Attributes include things like the color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the target data; the attributes are most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find specific patterns for market research). Attribute errors can be fixed by waiting for the source system to be corrected and the corrected data to be reapplied to the target.

When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter the target. Some rules that have been proposed for handling defaults are as follows:

Value Types        Description                             Default
Reference Values   Attributes that are foreign keys        Unknown
                   to other tables
Small Value Sets   Y/N indicator fields                    No
Other              Any other type of attribute             Null or Business-provided value

Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not translate into a reference table value, we use the ‘Unknown’ value. (All reference tables contain a value of ‘Unknown’ for this purpose.)

The business should provide default values for each identified attribute. Fields that are restricted to a limited domain of values (e.g., On/Off or Yes/No indicators) are referred to as small value sets. When errors are encountered in translating these values, we use the value that represents off or ‘No’ as the default. Other values, like numbers, are handled on a case-by-case basis. In many cases, the data integration process is set to populate ‘Null’ into these fields, which means “undefined” in the target. After a source system value is corrected and passes validation, it is corrected in the target.
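A minimal sketch of how such defaults might be applied, assuming a simple classification of each attribute into the value types above; the classification keys and validation function are illustrative only, not a prescribed implementation:

# Default values per value type, following the table above.
DEFAULTS = {
    "reference": "Unknown",      # foreign keys to reference tables
    "small_value_set": "No",     # Y/N style indicator fields
    "other": None,               # Null, or a business-provided value
}

def apply_default(value, value_type, is_valid):
    """Return the source value if it passes validation, otherwise the
    default for its value type. 'value_type' and 'is_valid' are
    hypothetical inputs; real validation would use reference and
    translation tables."""
    if is_valid(value):
        return value
    return DEFAULTS[value_type]

# Example: an unrecognised store type falls back to 'Unknown'.
valid_types = {"OFFICE", "STORE", "WAREHSE"}
print(apply_default("KIOSK", "reference", lambda v: v in valid_types))  # Unknown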

Primary Key Errors

The business also needs to decide how to handle new dimensional values such as locations. Problems occur when the new key is actually an update to an old key in the source system. For example, a location number is assigned and the new location is transferred to the target using the normal process; then the location number is changed due to some source business rule such as: all Warehouses should be in the 5000 range. The process assumes that the change in the primary key is actually a new warehouse and that the old warehouse was deleted. This type of error causes a separation of fact data, with some data being attributed to the old primary key and some to the new. An analyst would be unable to get a complete picture.


Fixing this type of error involves integrating the two records in the target data, along with the related facts. Integrating the two rows involves combining the Profile information, taking care to coordinate the effective dates of the Profiles to sequence properly. If two Profile records exist for the same day, then a manual decision is required as to which is correct. If facts were loaded using both primary keys, then the related fact rows must be added together and the originals deleted in order to correct the data.

The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same target data ID really represent two different IDs). In this case, it is necessary to restore the source information for both dimensions and facts from the point in time at which the error was introduced, deleting affected records from the target and reloading from the restore to correct the errors.

DM Facts Calculated from EDW Dimensions

If information is captured as dimensional data from the source, but used as measures residing on the fact records in the target, we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until the value is corrected. If we load the facts with the incorrect data, the process to fix the target can be time consuming and difficult to implement.

If we let the facts enter downstream target structures, we need to create processes that update them after the dimensional data is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes simpler. After the errors are fixed, the affected rows can simply be loaded and applied to the target data.

Fact Errors

If the only business rules that reject fact records are relationship errors to dimensional data, then when we encounter errors that would cause a fact to be rejected, we save those rows to a reject table for reprocessing the following night. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.

Data Stewards

Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables enable the target data architecture to maintain consistent descriptions across multiple source systems, regardless of how the source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different data for the same dimensional entity.

Reference Tables

The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source systems into the target structures.

The translation tables contain one or more rows for each source value and map the value to a matching row in the reference table. For example, the SOURCE column in FILE X on System X can contain ‘O’, ‘S’ or ‘W’. The data steward would be responsible for entering in the Translation table the following values:

Source Value   Code Translation
O              OFFICE
S              STORE
W              WAREHSE

These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar field may use a two-letter abbreviation like ‘OF’, ‘ST’ and ‘WH’. The data steward would make the following entries into the translation table to maintain consistency across systems:

Source Value   Code Translation
OF             OFFICE
ST             STORE
WH             WAREHSE

The data stewards are also responsible for maintaining the Reference table that translates the Codes into descriptions. The ETL process uses the Reference table to populate the following values into the target:

Code Translation   Code Description
OFFICE             Office
STORE              Retail Store
WAREHSE            Distribution Warehouse
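The combined translation and reference lookup can be sketched as follows (in Python for illustration; in PowerCenter this would normally be implemented with Lookup transformations). The second source system name is hypothetical:

# Translation table: (source system, source value) -> code.
TRANSLATION = {
    ("SystemX", "O"): "OFFICE",  ("SystemX", "S"): "STORE",  ("SystemX", "W"): "WAREHSE",
    ("SystemY", "OF"): "OFFICE", ("SystemY", "ST"): "STORE", ("SystemY", "WH"): "WAREHSE",
}

# Reference table: code -> description.
REFERENCE = {
    "OFFICE": "Office",
    "STORE": "Retail Store",
    "WAREHSE": "Distribution Warehouse",
}

def translate(system, source_value):
    """Map a raw source value to the standard code and description.
    Unmatched values fall back to the 'Unknown' reference row."""
    code = TRANSLATION.get((system, source_value), "Unknown")
    return code, REFERENCE.get(code, "Unknown")

print(translate("SystemX", "S"))   # ('STORE', 'Retail Store')
print(translate("SystemY", "XX"))  # ('Unknown', 'Unknown')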

Error handling is required when the data steward enters incorrect information for these mappings and needs to correct it after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload source data from the first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the entire target data architecture.

Dimensional Data

New entities in dimensional data present a more complex issue. New entities in the target may include Locations and Products, at a minimum. Dimensional data uses the same concept of translation as Reference tables. These translation tables map the source system value to the target value. For location, this is straightforward, but over time, products may have multiple source system values that map to the same product in the target. (Other similar translation issues may also exist, but Products serves as a good example for error handling.)

There are two possible methods for loading new dimensional entities. Either require the data steward to enter the translation data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the data steward to review it. The first option requires the data steward to create the translation for new entities, while the second lets the ETL process create the translation, but marks the record as ‘Pending Verification’ until the data steward reviews it and changes the status to ‘Verified’ before any facts that reference it can be loaded.

When the dimensional value is left as ‘Pending Verification’ however, facts may be rejected or allocated to dummy values. This requires the data stewards to review the status of new values on a daily basis. A potential solution to this issue is to generate an e-mail each night if there are any translation table entries pending verification. The data steward then opens a report that lists them.

A problem specific to Product occurs when a ‘new’ product is really just a changed SKU number for an existing product. This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to be merged, requiring manual intervention.

The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same product, but really represent two different products). In this case, it is necessary to restore the source information for all loads since the error was introduced. Affected records from the target should be deleted and then reloaded from the restore to correctly split the data. Facts should be split to allocate the information correctly and dimensions split to generate correct Profile information.

Manual Updates

Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be established for manually entering fixed data and applying it correctly to the entire target data architecture, including beginning and ending effective dates. These dates are useful for both Profile and Date Event fixes. Further, a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.

Multiple Sources

The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain subsets of the required information. For example, one system may contain Warehouse and Store information while another contains Store and Hub information. Because they share Store information, it is difficult to decide which source contains the correct information.

When this happens, both sources have the ability to update the same row in the target. If both sources are allowed to update the shared information, data accuracy and Profile problems are likely to occur. If we update the shared information on only one source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a new Profile indicating the information changed. When the second system is loaded, it compares its old unchanged value to the new Profile, assumes a change occurred and creates another new Profile with the old, unchanged value. If the two systems remain different, the process causes two Profiles to be loaded every day until the two source systems are synchronized with the same information.

To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source that should be considered primary for the field. Then, only if the field changes on the primary source would it be changed. While this sounds simple, it requires complex logic when creating Profiles, because multiple sources can provide information toward the one Profile record created for that day.

One solution to this problem is to develop a system of record for all sources. This allows developers to pull the information from the system of record, knowing that there are no conflicts for multiple sources. Another solution is to indicate, at the field level, a primary source where information can be shared from multiple sources. Developers can use the field level information to update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the correct source fields as primary and by the data integration team to customize the load process.
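A minimal sketch of the field-level primary-source idea, assuming a metadata structure that records the primary source per field; the system and field names are illustrative:

# For each shared field, the source system designated as primary.
PRIMARY_SOURCE = {
    "store_name": "SystemA",
    "store_manager": "SystemB",
}

def merge_shared_fields(current_profile, incoming, source_system):
    """Apply only those incoming fields for which 'source_system' is the
    designated primary; other fields are ignored so that secondary
    sources cannot create spurious new Profiles."""
    updated = dict(current_profile)
    changed = False
    for field, value in incoming.items():
        if PRIMARY_SOURCE.get(field) == source_system and updated.get(field) != value:
            updated[field] = value
            changed = True
    return updated, changed  # create a new Profile only when changed is True

profile = {"store_name": "Main St", "store_manager": "Jones"}
print(merge_shared_fields(profile, {"store_manager": "Smith"}, "SystemA"))
# ({'store_name': 'Main St', 'store_manager': 'Jones'}, False) -- SystemA is not primary for store_manager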


Error Handling Techniques using PowerCenter 7 (PC7) and PowerCenter Metadata Reporter (PCMR)

Challenge

Implementing an efficient strategy to identify different types of errors in the ETL process, correct the errors, and reprocess the corrected data.

Description

Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. The errors in an ETL process can be broadly categorized into two types: data errors in the load process, which are defined by the standards of acceptable data quality; and process errors, which are driven by the stability of the process itself.

The first step in implementing an error handling strategy is to understand and define the error handling requirement. Consider the following questions:

• What tools and methods can help in detecting all the possible errors?
• What tools and methods can help in correcting the errors?
• What is the best way to reconcile data across multiple systems?
• Where and how will the errors be stored? (i.e., relational tables or flat files)

A robust error handling strategy can be implemented using PowerCenter’s built-in error handling capabilities along with the PowerCenter Metadata Reporter (PCMR) as follows:

• Process Errors: Configure an email task to notify the PowerCenter Administrator immediately of any process failures.

• Data Errors: Set up the ETL process to:
o Use the Row Error Logging feature in PowerCenter to capture data errors in the PowerCenter error tables for analysis, correction, and reprocessing.
o Set up PCMR alerts to notify the PowerCenter Administrator in the event of any rejected rows.
o Set up customized PCMR reports and dashboards at the project level to provide information on failed sessions, sessions with failed rows, load time, etc.

Configuring an Email Task to Handle Process Failures


Configure all workflows to send an email to the PowerCenter Administrator, or any other designated recipient, in the event of a session failure. Create a reusable email task and use it in the “On Failure Email” property settings in the Components tab of the session.

When you configure the subject and body of a post-session email, use email variables to include information about the session run, such as session name, mapping name, status, total number of records loaded, and total number of records rejected. The following table lists all the available email variables:

Email Variables for Post-Session Email

Email Variable   Description
%s               Session name.
%e               Session status.
%b               Session start time.
%c               Session completion time.
%i               Session elapsed time (session completion time - session start time).
%l               Total rows loaded.
%r               Total rows rejected.
%t               Source and target table details, including read throughput in bytes per
                 second and write throughput in rows per second. The PowerCenter Server
                 includes all information displayed in the session detail dialog box.
%m               Name of the mapping used in the session.
%n               Name of the folder containing the session.
%d               Name of the repository containing the session.
%g               Attach the session log to the message.
%a<filename>     Attach the named file. The file must be local to the PowerCenter Server.
                 The following are valid file names: %a<c:\data\sales.txt> or
                 %a</users/john/data/sales.txt>.
                 Note: The file name cannot include the greater than character (>) or a
                 line break.

Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the email subject. Include these variables in the email message only.
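For example, a failure notification might combine several of these variables (an illustrative template only, not a required format):

Subject: Session %s completed with status %e
Body:    Folder %n, mapping %m: %l rows loaded, %r rows rejected.
         Started %b, completed %c (elapsed %i). %g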

Configuring Row Error Logging in PowerCenter

PowerCenter provides you with a set of four centralized error tables into which all data errors can be logged. Using these tables to capture data errors greatly reduces the time and effort required to implement an error handling strategy when compared with a custom error handling solution.

When you configure a session, you can choose to log row errors in this central location. When a row error occurs, the PowerCenter Server logs error information that allows you to determine the cause and source of the error. The PowerCenter Server logs information such as source name, row ID, current row data, transformation, timestamp, error code, error message, repository name, folder name, session name, and mapping information. This error metadata is logged for all row level errors, including database errors, transformation errors, and errors raised through the ERROR() function, such as business rule violations.

Logging row errors into relational tables rather than flat files enables you to report on and fix the errors easily. When you enable error logging and choose the ‘Relational Database’ Error Log Type, the PowerCenter Server offers you the following features:

• Generates the following tables to help you track row errors:
o PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source row.
o PMERR_MSG. Stores metadata about an error and the error message.
o PMERR_SESS. Stores metadata about the session.
o PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype, when a transformation error occurs.
• Appends error data to the same tables cumulatively, if they already exist, for further runs of the session.
• Allows you to specify a prefix for the error tables. For instance, if you want all your EDW session errors to go to one set of error tables, you can specify the prefix as ‘EDW_’.
• Allows you to collect row errors from multiple sessions in a centralized set of four error tables. To do this, you specify the same error log table name prefix for all sessions.


Example:

In the following example, the session ‘s_m_Load_Customer’ loads Customer data into the EDW Customer table. The Customer table in EDW has the following structure:

CUSTOMER_ID       NOT NULL   NUMBER (PRIMARY KEY)
CUSTOMER_NAME     NULL       VARCHAR2(30)
CUSTOMER_STATUS   NULL       VARCHAR2(10)

There is a primary key constraint on the column CUSTOMER_ID.

To take advantage of PowerCenter’s built-in error handling features, you would set the session properties as shown below:

The session property ‘Error Log Type’ is set to ‘Relational Database’, and ‘Error Log DB Connection’ and ‘Table name Prefix’ values are given accordingly.

When the PowerCenter server detects any rejected rows because of Primary Key Constraint violation, it writes information into the Error Tables as shown below:

EDW_PMERR_DATA:


REPOSITORY_GID                        WORKFLOW_RUN_ID  WORKLET_RUN_ID  SESS_INST_ID  TRANS_MAPPLET_INST  TRANS_NAME      TRANS_GROUP
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  8                0               3             N/A                 Customer_Table  Input
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  8                0               3             N/A                 Customer_Table  Input
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  8                0               3             N/A                 Customer_Table  Input

EDW_PMERR_MSG:

REPOSITORY_GID                        WORKFLOW_RUN_ID  WORKLET_RUN_ID  SESS_INST_ID  SESS_START_TIME  SES_UTC_TIME     REPOSITORY_NAME  FOLDER_NAME
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  6                0               3             9/15/2004 18:31  9/15/2004 18:31  pc711            Folder1
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  7                0               3             9/15/2004 18:33  9/15/2004 18:33  pc711            Folder1
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  8                0               3             9/15/2004 18:34  9/15/2004 18:34  pc711            Folder1

EDW_PMERR_SESS:

REPOSITORY_GID                        WORKFLOW_RUN_ID  WORKLET_RUN_ID  SESS_INST_ID  SESS_START_TIME  SES_UTC_TIME     REPOSITORY_NAME  FOLDER_NAME
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  6                0               3             9/15/2004 18:31  9/15/2004 18:31  pc711            Folder1
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  7                0               3             9/15/2004 18:33  9/15/2004 18:33  pc711            Folder1
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  8                0               3             9/15/2004 18:34  9/15/2004 18:34  pc711            Folder1


EDW_PMERR_TRANS:

REPOSITORY_GID                        WORKFLOW_RUN_ID  WORKLET_RUN_ID  SESS_INST_ID  TRANS_MAPPLET_INST  TRANS_NAME      TRANS_GROUP  TRANS_ATTR
37379c74-f4b5-4dc7-a927-3b38c9ec09ca  8                0               3             N/A                 Customer_Table  Input        Customer_Id:3,Customer_Name:12,Customer_Status:12

By looking at the workflow run id and other fields, you can easily analyze the errors and reprocess them after fixing the errors.
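As an illustration of that analysis, the error tables can also be queried directly. The sketch below assumes a Python DB-API style connection (sqlite3 is used only as a stand-in) to the database holding the error tables, and it uses only the columns shown in the samples above; the actual tables contain further columns, such as the error message text in PMERR_MSG:

import sqlite3  # stand-in for any DB-API connection to the error-table database

def rejected_row_summary(conn, table_prefix="EDW_"):
    """Count logged row errors per folder, workflow run, and transformation.
    Table and column names follow the samples above; adjust the prefix to
    match the session's 'Table Name Prefix' setting."""
    sql = (
        "SELECT s.FOLDER_NAME, d.WORKFLOW_RUN_ID, d.TRANS_NAME, COUNT(*) AS error_rows "
        "FROM {p}PMERR_DATA d "
        "JOIN {p}PMERR_SESS s "
        "  ON s.WORKFLOW_RUN_ID = d.WORKFLOW_RUN_ID AND s.SESS_INST_ID = d.SESS_INST_ID "
        "GROUP BY s.FOLDER_NAME, d.WORKFLOW_RUN_ID, d.TRANS_NAME"
    ).format(p=table_prefix)
    return conn.execute(sql).fetchall()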

Error Detection and Notification using PCMR

Informatica provides PowerCenter Metadata Reporter (PCMR) with every PowerCenter license. The PCMR uses Informatica’s powerful business intelligence tool, PowerAnalyzer, to provide insight into the PowerCenter repository metadata.

You can use the Operations Dashboard of the PCMR as one central location to gain insight into production environment ETL activities. In addition, the following capabilities of the PCMR are recommended best practices:

• Configure PCMR alerts to send an email or a pager message to the PowerCenter Administrator whenever there is an entry made into the error tables PMERR_DATA or PMERR_TRANS.

• Configure reports and dashboards using the PCMR to provide detailed session run information grouped by projects/PowerCenter folders for easy analysis.

• Configure reports in PCMR to provide detailed information of the row level errors for each session. This can be accomplished by using the four error tables as sources of data for the reports.

Error Correction and Reprocessing

The method of error correction depends on the type of error that occurred. Here are a few things that you should consider during error correction:

• The ‘owner’ of the data should always fix the data errors. For example, if the source data is coming from an external system, then you should send the errors back to the source system to be fixed.

• In some situations, a simple re-execution of the session will reprocess the data. You may be able to modify the SQL or some other session property to make sure that no duplicate data is processed during the re-run of the session and that all data is processed correctly.

• In some situations, partial data that has been loaded into the target systems should be backed out in order to avoid duplicate processing of rows.

o Having a field in every target table, such as a BATCH_ID field, to identify each unique run of the session can help greatly in the process of backing out partial loads, but sometimes you may need to design a special mapping to achieve this (see the sketch following this list).


• Lastly, errors can also be corrected through a manual SQL load of the data. If the volume of errors is low, the rejected data can easily be exported from the PCMR error reports to Microsoft Excel or CSV format and corrected in a spreadsheet. Then the corrected data can be manually inserted into the target table using a SQL statement.
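A minimal sketch of the BATCH_ID back-out approach mentioned above, assuming every target row carries the BATCH_ID of the run that loaded it; the table and column names are illustrative, and the connection is any DB-API-style connection (sqlite3 shown as a stand-in):

import sqlite3  # stand-in for a connection to the target database

def back_out_partial_load(conn, target_table, batch_id):
    """Remove rows loaded by a failed run so the corrected data can be
    reprocessed without creating duplicates. The target table name is
    assumed to come from trusted metadata, not user input."""
    deleted = conn.execute(
        "DELETE FROM {t} WHERE BATCH_ID = ?".format(t=target_table), (batch_id,)
    ).rowcount
    conn.commit()
    return deleted

# Example: back out batch 20040915 from a customer table before re-running.
# back_out_partial_load(conn, "EDW_CUSTOMER", 20040915)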

Any approach to correct erroneous data should be precisely documented and followed as a standard.

If the data errors occur frequently, then the reprocessing process can be automated by designing a special mapping or session to correct the errors and load the corrected data into the ODS or staging area.

Data Reconciliation using PowerAnalyzer

Business users often like to see certain metrics matching from one system to another (e.g., source system to ODS, ODS to targets, etc.) to ascertain that the data has been processed accurately. This is frequently accomplished by writing tedious queries, comparing two separately produced reports, or using constructs such as DBLinks.

By upgrading the PCMR from a limited-use license that can source the PowerCenter repository metadata only to a full-use PowerAnalyzer license that can source your company’s data (e.g., source systems, staging areas, ODS, data warehouse, and data marts), PowerAnalyzer provides a reliable and reusable way to accomplish data reconciliation. Using PowerAnalyzer’s reporting capabilities, you can select data from various data sources such as ODS, data marts and data warehouses to compare key reconciliation metrics and numbers through aggregate reports. You can further schedule the reports to run automatically every time the relevant PowerCenter sessions complete, and set up alerts to notify the appropriate business or technical users in case of any discrepancies.

For example, a report can be created to ensure that the same number of customers exist in the ODS as in the data warehouse and/or any downstream data marts. The reconciliation reports should be relevant to a business user, comparing key metrics (e.g., customer counts, aggregated financial metrics, etc.) across data silos. Such reconciliation reports can be run automatically after PowerCenter loads the data, or they can be run by technical or business users on demand. This process allows users to verify the accuracy of data and build confidence in the data warehouse solution.
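A scripted equivalent of such a reconciliation check might look like the sketch below (a PowerAnalyzer report would normally fill this role). The table name and the single count metric are assumptions, and the connections are assumed to support .execute() directly, as sqlite3 does:

def reconcile_counts(ods_conn, dw_conn, metric_sql="SELECT COUNT(*) FROM CUSTOMERS"):
    """Compare a key metric (a row count by default) between the ODS and
    the data warehouse and report any discrepancy. Real checks would
    compare several business-relevant metrics, not just one count."""
    ods_value = ods_conn.execute(metric_sql).fetchone()[0]
    dw_value = dw_conn.execute(metric_sql).fetchone()[0]
    if ods_value != dw_value:
        print("Reconciliation FAILED: ODS=%s, DW=%s" % (ods_value, dw_value))
    else:
        print("Reconciliation OK: %s rows in both systems" % ods_value)
    return ods_value == dw_value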


Error Management in a Data Warehouse Environment

Challenge

A key requirement for any successful data warehouse or data integration project is that it attain credibility within the user community. At the same time, it is imperative that the warehouse be as up-to-date as possible: the more recent the information derived from it, the more relevant it is to the business operations of the organization, providing the best opportunity to gain an advantage over the competition.

Transactional systems can manage to function even with a certain amount of error since the impact of an individual transaction (in error) has a limited effect on the business figures as a whole, and corrections can be applied to erroneous data after the event (i.e., after the error has been identified). In data warehouse systems, however, any systematic error (e.g., for a particular load instance) not only affects a larger number of data items, but may potentially distort key reporting metrics. Such data cannot be left in the warehouse "until someone notices" because business decisions may be driven by such information.

Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors occur, it is equally important either to prevent them from getting to the warehouse at all, or to remove them from the warehouse immediately (i.e., before the business tries to use the information in error).

The types of error to consider include:

• Source data structures
• Sources presented out-of-sequence
• ‘Old’ sources represented in error
• Incomplete source files
• Data-type errors for individual fields
• Unrealistic values (e.g., impossible dates)
• Business rule breaches
• Missing mandatory data
• O/S errors
• RDBMS errors

These cover both high-level (i.e., related to the process or a load as a whole) and low-level (i.e., field or column-related errors) concerns.


Description

In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you can be sure that every source element was populated correctly, with meaningful values, never missing a value, and fulfilling all relational constraints. At the same time, source data sets always have a fixed structure, are always available on time (and in the correct order), and are never corrupted during transfer to the data warehouse. In addition, the OS and RDBMS never run out of resources, or have permissions and privileges change.

Realistically, however, the operational applications are rarely able to cope with every possible business scenario or combination of events; and operational systems crash, networks fall over, and users may not use the transactional systems in quite the way they were designed. The operational systems also typically need to allow some flexibility to allow non-fixed data to be stored (typically as free-text comments). In every case, there is a risk that the source data does not match what the data warehouse expects.

Because of the credibility issue, in-error data cannot be allowed to get to the metrics and measures used by the business managers. If such data does reach the warehouse, it must be identified as such, and removed immediately (before the current version of the warehouse can be published). Even better, however, is for such data to be identified during the load process and prevented from reaching the warehouse at all. Best of all is for erroneous source data to be identified before a load even begins, so that no resources are wasted trying to load it.

The principle to follow for correction of errors should be to ensure that the data is corrected at the source. As soon as any attempt is made to correct errors within the warehouse, there is a risk that the lineage and provenance of data will be lost. From that point on, it becomes impossible to guarantee that a metric or data item came from a specific source via a specific chain of processes. As a by-product, such a principle also helps to tie both the end-users and those responsible for the source data into the warehouse process; source data staff understand that their professionalism directly affects the quality of the reports, and end-users become owners of their data.

As a final consideration, error management complements and overlaps load management, data quality and key management, and operational processes and procedures. Load management processes record at a high-level if a load is unsuccessful; error management records the details of why the failure occurred. Quality management defines the criteria whereby data can be identified as in error; and error management identifies the specific error(s), thereby allowing the source data to be corrected. Operational reporting shows a picture of loads over time, and error management allows analysis to identify systematic errors, perhaps indicating a failure in operational procedure.

A key tool for all of these systems is the effective creation, and use of metadata. Such metadata encompasses operational, field-level, loading process, business rule, and relational areas and is integral to a proactively-managed data warehouse.

Error Management Considerations


High-Level Issues

From previous discussion of load management, a number of checks can be performed before any attempt is made to load a source data set. Without load management in place, it is unlikely that the warehouse process will be robust enough to satisfy any end-user requirements, and error correction processing becomes moot (insofar as nearly all maintenance and development resources will be working full time to manually correct bad data in the warehouse). The following assumes that you have implemented load management processes similar to Informatica’s best practices.

• Process Dependency checks in the load management can identify when a source data set is missing, duplicates a previous version, or has been presented out of sequence, and where the previous load failed but has not yet been corrected.

• Load management prevents this source data from being loaded. At the same time, error management processes should record the details of the failed load; noting the source instance, the load affected, and when and why the load was aborted.

• Source file structures can be compared to expected structures stored as metadata, either from header information or by attempting to read the first data row.

• Source table structures can be compared to expectations; typically this can be done by interrogating the RDBMS catalogue directly (and comparing to the expected structure held in metadata), or by simply running a ‘describe’ command against the table (again comparing to a pre-stored version in metadata).

• Control file totals (for file sources) and row number counts (table sources) are also used to determine if files have been corrupted or truncated during transfer, or if tables have no new data in them (suggesting a fault in an operational application). A sketch of these structure and control-total checks follows this list.

• In every case, information should be recorded to identify where and when an error occurred, what sort of error it was, and any other relevant process-level details.
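A minimal sketch of two of these pre-load checks (header structure and control totals), assuming a delimited source file and a simple metadata dictionary of expected columns and control counts; the file name, layout, and totals are illustrative only:

import csv

# Expected structure and control totals, as would be held in metadata.
EXPECTED = {
    "cust_file.csv": {
        "columns": ["CUSTOMER_ID", "CUSTOMER_NAME", "CUSTOMER_STATUS"],
        "control_rows": 1250,   # row count supplied by the source system
    },
}

def pre_load_checks(path):
    """Return a list of high-level errors found for the given source file;
    an empty list means the load can proceed."""
    errors = []
    meta = EXPECTED[path]
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        if header != meta["columns"]:
            errors.append("structure mismatch: got %s" % header)
        row_count = sum(1 for _ in reader)
    if row_count != meta["control_rows"]:
        errors.append("row count %d does not match control total %d"
                      % (row_count, meta["control_rows"]))
    return errors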

Low-Level Issues

Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load to abort), further error management processes need to be applied to the individual source rows and fields.

• Individual source fields can be compared to expected data-types against standard metadata within the repository, or against additional information added by the development team. In some instances, this will be enough to abort the rest of the load, since if the field structure is incorrect, it is much more likely that the source data set as a whole either cannot be processed at all, or (more worrisome) will be processed unpredictably.

• Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-in error handling can be used to spot failed date conversions, conversions of string to numbers, missing required data. In rare cases, stored procedures can be called if a specific conversion fails; however this cannot be generally recommended because of the potentially crushing impact on performance if a particularly error-filled load occurs.

• Business rule breaches will then be picked up. It is possible to define allowable values, or acceptable value ranges within PowerCenter mappings (if the rules are few, and it is clear from the mapping metadata that the business rules are included in the mapping itself). A more flexible approach is to use external tables to codify the business rules. In this way, only the rules tables need to be amended if a new business rule needs to be applied. Informatica has suggested methods to implement such a process.

• Missing Key/Unknown Key issues have already been defined in their own best practice document, Key Management in Data Warehousing Solutions, with suggested management techniques for identifying and handling them. However, from an error handling perspective, such errors must still be identified and recorded, even when key management techniques do not formally fail source rows with key errors. Unless a record is kept of the frequency with which particular source data fails, it is difficult to realize when there is a systematic problem in the source systems.

• Inter-row errors may also have to be considered. These may occur when a business process expects a certain hierarchy of events (e.g., a customer query, followed by a booking request, followed by a confirmation, followed by a payment). If the events arrive from the source system in the wrong order, or where key events are missing, it may indicate a major problem with the source system, or the way in which the source system is being used.

• An important principle is to try to identify all of the errors on a particular row before halting processing, rather than rejecting the row at the first error, as sketched below. This seems to break the rule of not wasting resources trying to load a source data set that we already know is in error; however, since the row will need to be corrected at source and then reprocessed, it is sensible to identify all the corrections that need to be made before reloading, rather than fixing the first error, re-running, and then identifying a second error (which halts the load for a second time).
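The ‘collect every error before rejecting’ principle can be sketched as follows; the field names and rules are assumptions, and in PowerCenter this logic would normally live in the mapping rather than in external code:

from datetime import datetime

def validate_row(row):
    """Run every check and return the full list of problems, rather than
    stopping at the first failure. Field names and rules are illustrative."""
    errors = []
    if not row.get("ORDER_ID"):
        errors.append("ORDER_ID: missing required key")
    try:
        datetime.strptime(row.get("ORDER_DATE", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("ORDER_DATE: not a valid date")
    if row.get("QUANTITY") is not None and row["QUANTITY"] < 0:
        errors.append("QUANTITY: outside acceptable range")
    return errors

print(validate_row({"ORDER_ID": "", "ORDER_DATE": "2004-13-45", "QUANTITY": -2}))
# all three problems are reported in a single pass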

OS and RDBMS Issues

Since best practice means that referential integrity (RI) issues are proactively managed within the loads, instances where the RDBMS rejects data for referential reasons should be very rare (i.e., the load should already have identified that reference information is missing).

However, there is little that can be done to anticipate more generic RDBMS problems: changes to schema permissions, running out of temporary disk space, dropping of tables and schemas, invalid indexes, no further table space extents available, missing partitions, and the like.

Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space, command syntax, and authentication may occur outside of the data warehouse. Often such changes are driven by Systems Administrators who, from an operational perspective, are not aware that there will be an impact on the data warehouse, or are not aware that the data warehouse managers need to be kept up to speed.

In both of the instances above, the nature of the errors may be such that not only will they cause a load to fail, but it may be impossible to record the nature of the error at that point in time. For example, if RDBMS user ids are revoked, it may be impossible to write a row to an error table if the error process depends on the revoked id; if disk space runs out during a write to a target table, this may affect all other tables (including the error tables); if file permissions on a UNIX host are amended, bad files themselves (or even the log files) may not be able to be written to.


Most of these types of issues can be managed by a proper load management process, however. Since setting the status of a load to ‘complete’ should be absolutely the last step in a given process, any failure before, or including, that point leaves the load in an ‘incomplete’ state. Subsequent runs will note this, and enforce correction of the last load before beginning the new one.

The best practice to manage such OS and RDBMS errors is, therefore, to ensure that the Operational Administrators and DBAs have proper and working communication with the data warehouse management to allow proactive control of changes. Administrators and DBAs should also be available to the data warehouse operators to rapidly explain and resolve such errors if they occur.

Auto-Correction vs. Manual Correction

Load management and key management best practices (Key Management in Data Warehousing Solutions) have already defined auto-correcting processes; the former to allow loads themselves to launch, rollback, and reload without manual intervention, and the latter to allow RI errors to be managed so that the quantitative quality of the warehouse data is preserved, and incorrect key values are corrected as soon as the source system provides the missing data.

We cannot conclude from these two specific techniques, however, that the warehouse should attempt to change source data as a general principle. Even if this were possible (which is debatable), such functionality would mean that the absolute link between the source data and its eventual incorporation into the data warehouse would be lost. As soon as one of the warehouse metrics was identified as incorrect, unpicking the error would be impossible, potentially requiring a whole section of the warehouse to be reloaded entirely from scratch.

In addition, such automatic correction of data might hide the fact that one or other of the source systems had a generic fault, or more importantly, had acquired a fault because of on-going development of the transactional applications, or a failure in user training.

The principle to apply here is to identify the errors in the load, and then alert the source system users that data should be corrected in the source system itself, ready for the next load to pick up the right data. This maintains the data lineage, allows source system errors to be identified and ameliorated in good time, and permits extra training needs to be identified and managed.

Error Management Techniques

Simple Error Handling Structure


This simple example defines three main sets of information:

• The Error_Definition table simply stores descriptions for the various types of errors, including process-level (e.g., incorrect source file, load started out-of-sequence), row-level (e.g., missing foreign key, incorrect data-type, conversion errors), and reconciliation (e.g., incorrect row numbers, incorrect file total etc.).

• The Error_Header provides a high-level view on the process, allowing a quick identification of the frequency of error for particular loads and of the distribution of error types. It is linked to the load management processes via the Src_Inst_ID and Proc_Inst_ID, from which other process-level information can be gathered.

• The Error_Detail stores information about actual rows with errors, including how to identify the specific row that was in error (using the source natural keys and row number) together with a string of field identifier/value pairs concatenated together. It is NOT expected that this information will be deconstructed as part of an automatic correction load, but if necessary this can be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for subsequent reporting.
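A minimal sketch of how an Error_Detail record might be assembled, following the structure described above; the delimiter characters and column names are assumptions:

def build_error_detail(src_inst_id, proc_inst_id, row_number, natural_keys, field_errors):
    """Assemble an Error_Detail record: the natural keys and row number
    identify the source row, and the offending field/value pairs are
    concatenated into a single string for later pivoting if required."""
    pairs = ";".join("%s=%s" % (field, value) for field, value in field_errors.items())
    return {
        "SRC_INST_ID": src_inst_id,
        "PROC_INST_ID": proc_inst_id,
        "ROW_NUMBER": row_number,
        "NATURAL_KEYS": "|".join(str(k) for k in natural_keys),
        "FIELD_VALUE_PAIRS": pairs,
    }

print(build_error_detail(12, 345, 1019, ["CUST-0042"],
                         {"CUSTOMER_STATUS": "UNKNWN", "ORDER_DATE": "2004-13-45"}))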


Error Management Process Flow

Error management must fit into the load process as a whole, although the implementation depends on the particular data warehouse. Typically, mapping templates are created with the necessary objects to interact with the load management and error management control tables; these are then added to or adapted with the specific transformations to fulfil each load requirement. In many instances common transformations are created to perform error description lookups, business rule validation, and metadata queries; these are then referenced as and when a given data item within a transformation requires them.

In any case, error management, load management, metadata, and the load itself are intimately connected; it is the integration of all these approaches that provides the robust system that is needed to successfully generate the data warehouse. The following diagram illustrates the integrated process.


Error Management Process Flow

Challenge

Error management must fit into the load process as a whole. The specific implementation depends on the particular data warehouse requirements.

Error management involves the following three steps:

• Error identification
• Error retrieval
• Error correction

This Best Practice focuses on the process for implementing each of these steps in a PowerCenter architecture.

Description

A typical error management process leverages the best-of-breed error management technology available in PowerCenter, such as relational database error logging, email notification of workflow failures, session error thresholds, PowerCenter Metadata Reporter (PCMR) reporting capabilities, and data profiling, and integrates these with the load process and metadata to provide a seamless load process.

Error Identification

The first step to error management is error identification. Error identification is most often achieved through enabling referential integrity constraints at the database level and enabling relational error logging in PowerCenter. This approach ensures that all row-level, referential integrity errors are identified by the database and captured in the relational error handling tables in the PowerCenter repository. By enabling relational error logging, all row-level errors can automatically be written to a centralized set of four error handling tables.

These four tables store information such as error messages, error data, and source row data. These tables include PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS. Examples of row-level errors include database errors, transformation errors, and business rule exceptions for which the ERROR() function has been called within the mapping.


Error Retrieval

The second step to error management is error retrieval. After errors have been captured in the PowerCenter repository, it is important to make the retrieval of these errors simple and automated in order to make the error management process as efficient as possible. The PCMR should be customized to create error retrieval reports to extract this information from the PowerCenter repository. A typical error report prompts a user for the folder and workflow name, and returns a report with information such as the session, error message, and data that caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved through a PCMR report, or an email alert that identifies a user when a certain threshold is crossed in a report (such as “number of errors is greater than zero”).

Error Correction

The final step in error management is error correction. Since PowerCenter automates the process of error identification, and PCMR simplifies error retrieval, the error correction step is also simple. After retrieving an error through the PCMR, the error report (which contains information such as workflow name, session name, error date, error message, error data, and source row data) can be easily exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of an error, the error report can be extracted into a supported format and emailed to a developer or DBA to resolve the issue, or it can be entered into a defect management tracking tool. The PCMR interface supports emailing a report directly through the web-based interface to make the process even easier.

For further automation, a report broadcasting rule that emails the error report to a developer’s email inbox can be set up to run on a pre-defined schedule. After the developer or DBA identifies the condition that caused the error, a fix for the error can be implemented. Depending on the type and cause of the error, a fix can be as simple as a re-execution of the mapping, or as complex as a data repair. The exact method of data correction depends on various factors such as the number of records with errors, data availability requirements per SLA, and the level of data criticality to the business unit(s).

Data Profiling Option

For organizations that want to identify data irregularities post-load but don’t want to reject such rows at load time, the PowerCenter Data Profiling option can be an important part of the error management solution. The PowerCenter Data Profiling option enables users to create data profiles through a wizard-driven GUI that provides profile reporting such as orphan record identification, business rule violation, and data irregularity identification (such as NULL or default values). Just as with the PCMR, the PowerCenter Data Profiling option comes with a license to use PowerAnalyzer reports that source the data profile warehouse to deliver data profiling information through an intuitive BI tool. This is a recommended best practice since error handling reports and data profile reports can be delivered to users through the same easy-to-use BI tool.

Integrating Error Management, Load Management, and Metadata

Error management, load management, metadata, and the load itself are intimately connected; it is the integration of all these approaches that provides the robust system needed to successfully generate the data warehouse. The following diagram illustrates this integration process.


Creating Inventories of Reusable Objects & Mappings

Challenge

Successfully creating inventories of reusable objects and mappings, including identifying potential economies of scale in loading multiple sources to the same target.

Description

Reusable Objects

The first step in creating an inventory of reusable objects is to review the business requirements and look for any common routines/modules that may appear in more than one data movement. These common routines are excellent candidates for reusable objects. In PowerCenter, reusable objects can be single transformations (lookups, filters, etc.), single tasks (command, email, and session), a set of tasks that allow you to reuse a set of workflow logic in several workflows (worklets), or even a string of transformations (mapplets).

Evaluate potential reusable objects by two criteria:

• Is there enough usage and complexity to warrant the development of a common object?

• Are the data types of the information passing through the reusable object the same from case to case or is it simply the same high-level steps with different fields and data?

Common objects are sometimes created just for the sake of creating common components when in reality, creating and testing the object does not save development time or future maintenance. For example, if there is a simple calculation like subtracting a current rate from a budget rate that will be used for two different mappings, carefully consider whether the effort to create, test, and document the common object is worthwhile. Often, it is simpler to add the calculation to both mappings. However, if the calculation were to be performed in a number of mappings, if it was very difficult, and if all occurrences would be updated following any change or fix – then this would be an ideal case for a reusable object. When you add instances of a reusable transformation to mappings, you must be careful that changes you make to the transformation do not invalidate the mapping or generate unexpected data. The Designer stores each reusable transformation as metadata, separate from any mapping that uses the transformation.


The second criterion for a reusable object concerns the data that will pass through the reusable object. Many times developers see a situation where they may perform a certain type of high-level process (e.g., filter, expression, or update strategy) in two or more mappings. For example, if you have several fact tables that require a series of dimension keys, you can create a mapplet containing a series of lookup transformations to find each dimension key. You can then use the mapplet in each fact table mapping, rather than recreate the same lookup logic in each mapping. This seems like a great candidate for a mapplet. However, after performing half of the mapplet work, the developers may realize that the actual data or ports passing through the high-level logic are totally different from case to case, thus making the use of a mapplet impractical. Consider whether there is a practical way to generalize the common logic so that it can be successfully applied to multiple cases. Remember, when creating a reusable object, the actual object will be replicated in one to many mappings. Thus, in each mapping using the mapplet or reusable transformation object, the same size and number of ports must pass into and out of the mapping/reusable object.

Document the list of the reusable objects that pass this criteria test, providing a high-level description of what each object will accomplish. The detailed design will occur in a future subtask, but at this point the intent is to identify the number and functionality of reusable objects that will be built for the project. Keep in mind that it will be impossible to identify one hundred percent of the reusable objects at this point; the goal here is to create an inventory of as many as possible, and hopefully the most difficult ones. The remainder will be discovered while building the data integration processes.

Mappings

A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. In a simple world, a single source table would populate a single target table. However, in practice, this is usually not the case. Sometimes multiple sources of data need to be combined to create a target table, and sometimes a single source of data creates many target tables. The latter is especially true for mainframe data sources where COBOL OCCURS statements litter the landscape. In a typical warehouse or data mart model, each OCCURS statement decomposes to a separate table.

The goal here is to create an inventory of the mappings needed for the project. For this exercise, the challenge is to think in individual components of data movement. While the business may consider a fact table and its three related dimensions as a single ‘object’ in the data mart or warehouse, five mappings may be needed to populate the corresponding star schema with data (i.e., one for each of the dimension tables and two for the fact table, each from a different source system).

Typically, when creating an inventory of mappings, the focus is on the target tables, with an assumption that each target table has its own mapping, or sometimes multiple mappings. While this is often true, if a single source of data populates multiple tables, this approach yields multiple mappings where one might do. Efficiencies can sometimes be realized by loading multiple tables from a single source; by focusing solely on the target tables, these efficiencies can be overlooked.

A more comprehensive approach to creating the inventory of mappings is to create a spreadsheet listing all of the target tables. Create a column with a number next to each target table. For each of the target tables, in another column, list the source file or table that will be used to populate the table. In the case of multiple source tables per target, create two rows for the target, each with the same number, and list the additional source(s) of data.

The table would look similar to the following:

Number   Target Table    Source
1        Customers       Cust_File
2        Products        Items
3        Customer_Type   Cust_File
4        Orders_Item     Tickets
4        Orders_Item     Ticket_Items

When completed, the spreadsheet can be sorted either by target table or source table. Sorting by source table can help determine potential mappings that create multiple targets.

When using a single source to populate multiple tables at once for efficiency, be sure to keep restartability and reloadability in mind. The mapping will always load two or more target tables from the source, so there will be no easy way to rerun a single table. In this example, the Customers and Customer_Type tables could potentially be loaded in the same mapping.

When merging targets into one mapping in this manner, give both targets the same number. Then, re-sort the spreadsheet by number. For the mappings with multiple sources or targets, merge the data back into a single row to generate the inventory of mappings, with each number representing a separate mapping.

The resulting inventory would look similar to the following:

Number   Target Table(s)             Source(s)
1        Customers, Customer_Type    Cust_File
2        Products                    Items
4        Orders_Item                 Tickets, Ticket_Items

At this point, it is often helpful to record some additional information about each mapping to help with planning and maintenance.

First, give each mapping a name. Apply the naming standards generated in 2.2 Design Development Architecture. These names can then be used to distinguish mappings from one another and can also be put on the project plan as individual tasks.

Next, determine project-wide thresholds for a high, medium, or low number of target rows. For example, in a warehouse where dimension tables are likely to number in the thousands and fact tables in the hundreds of thousands, the following thresholds might apply:

Low – 1 to 10,000 rows

Medium – 10,000 to 100,000 rows

High – 100,000 rows +

Assign a likely row volume (high, medium or low) to each of the mappings based on the expected volume of data to pass through the mapping. These high level estimates will help to determine how many mappings are of ‘high’ volume; these mappings will be the first candidates for performance tuning.

Add any other columns of information that might be useful to capture about each mapping, such as a high-level description of the mapping functionality, resource (developer) assigned, initial estimate, actual completion time, or complexity rating.

Metadata Reporting and Sharing

Challenge

Using Informatica's suite of metadata tools effectively in the design of the end-user analysis application.

Description

The Informatica tool suite can capture extensive levels of metadata, but the amount of metadata that is entered depends on the metadata strategy. Detailed information or metadata comments can be entered for all repository objects (e.g., mappings, sources, targets, transformations, ports, etc.). All information about column size and scale, data types, and primary keys is also stored in the repository. The decision on how much metadata to create is often driven by project timelines; while it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., doing so requires extra time and effort. Once that information is in the Informatica repository, however, it can be retrieved at any time using the Metadata Reporter. Several out-of-the-box reports are available, and customized reports can also be created to view that information. There are several options for exporting these reports (e.g., Excel spreadsheet, Adobe .pdf file, etc.). Informatica offers two ways to access the repository metadata:

• Metadata Reporter, which is a web-based application that allows you to run reports against the repository metadata. This is a very comprehensive tool that is powered by the functionality of Informatica’s BI reporting tool, PowerAnalyzer. It is included on the PowerCenter CD.

• Metadata Exchange (MX) views. Because Informatica does not support or recommend direct reporting access to the repository tables, even for select-only queries, the second way to report on repository metadata is through the views provided by Metadata Exchange (MX), which are written against the repository for exactly this purpose. (A sample MX query follows.)
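
As a simple illustration of the MX approach, the following query counts the mappings in each folder. This is only a sketch: it assumes the MX view REP_ALL_MAPPINGS (with a SUBJECT_AREA column) is available in your PowerCenter version and that the reporting user has been granted access to the MX views; names can vary slightly between releases.

-- Sketch: count mappings per folder using an MX view.
-- Assumes the REP_ALL_MAPPINGS view and its SUBJECT_AREA column exist
-- in your repository release.
SELECT   subject_area AS folder_name,
         COUNT(*)     AS mapping_count
FROM     rep_all_mappings
GROUP BY subject_area
ORDER BY subject_area;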

Metadata Reporter

The need for the Informatica Metadata Reporter arose from the number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter is based on the PowerAnalyzer and PowerCenter products. It provides PowerAnalyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations, reports that provide access to every Informatica object stored in the repository, and even reports on objects in the PowerAnalyzer repository. The architecture of the Metadata Reporter is web-based, with an Internet browser front end.

Because Metadata Reporter runs on PowerAnalyzer, you must have PowerAnalyzer installed and running before you proceed with Metadata Reporter setup.

Metadata Reporter setup requires importing the following .XML files from the PowerCenter CD, in the same sequence as they are listed below:

• Schemas.xml
• Schedule.xml
• GlobalVariables_Oracle.xml (This file is database specific; Informatica provides GlobalVariable files for DB2, SQLServer, Sybase and Teradata. You need to select the appropriate file based on your PowerCenter repository environment.)
• Reports.xml
• Dashboards.xml

Note: If you have set up a new instance of PowerAnalyzer exclusively for the Metadata Reporter, you should have no problem importing these files. However, if you are using an existing instance of PowerAnalyzer that currently serves some other reporting purpose, be careful while importing these files: some of the objects (e.g., global variables, schedules, etc.) may already exist with the same names. You can rename the conflicting objects.

The following are the folders that are created in PowerAnalyzer when you import the above-listed files:

• PowerAnalyzer Metadata Reporting - contains reports for the PowerAnalyzer repository itself (e.g., Todays Logins, Reports Accessed by Users Today).

• PowerCenter Metadata Reports - contains reports for the PowerCenter repository. To better organize the reports based on their functionality, they are further grouped into the following subfolders:

• Configuration Management - contains a set of reports that provide detailed information on configuration management, including deployment and label details. This folder contains the following subfolders:

o Deployment
o Label
o Object Version

• Operations - contains a set of reports that enable users to analyze operational statistics, including server load, connection usage, run times, load times, number of runtime errors, etc. for workflows, worklets, and sessions. This folder contains the following subfolders:

o Session Execution
o Workflow Execution

• PowerCenter Objects - contains a set of reports that enable users to identify all types of PowerCenter objects, their properties, and their interdependencies on other objects within the repository. This folder contains the following subfolders:

o Mappings
o Mapplets
o Metadata Extension
o Server Grids
o Sessions
o Sources
o Target
o Transformations
o Workflows
o Worklets

• Security - contains a set of reports that provide detailed information on the users, groups, and their association within the repository.

Informatica recommends retaining this folder organization, adding new folders if necessary.

The Metadata Reporter provides 44 standard reports which can be customized with the use of parameters and wildcards. Metadata Reporter is accessible from any computer with a browser that has access to the web server where the Metadata Reporter is installed, even without the other Informatica client tools being installed on that computer. The Metadata Reporter connects to the PowerCenter repository using JDBC drivers. Be sure the proper JDBC drivers are installed for your database platform.

Note: You can also use the JDBC-ODBC bridge to connect to the repository (e.g., jdbc:odbc:<data_source_name>).

• Metadata Reporter is comprehensive. You can run reports on any repository. The reports provide information about all types of metadata objects.

• Metadata Reporter is easily accessible. Because the Metadata Reporter is web-based, you can generate reports from any machine that has access to the web server.

• The reports in the Metadata Reporter are customizable. The Metadata Reporter allows you to set parameters for the metadata objects to include in the report.

• The Metadata Reporter allows you to go easily from one report to another. The name of any metadata object that displays on a report links to an associated report. As you view a report, you can generate reports for objects on which you need more information.

The following lists show the reports provided by the Metadata Reporter, along with their locations and a brief description of each:

Reports for the PowerCenter repository:

1. Deployment Group (Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group): Displays deployment groups by repository.
2. Deployment Group History (Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group History): Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates. This is a primary report in an analytic workflow.
3. Labels (Public Folders>PowerCenter Metadata Reports>Configuration Management>Labels>Labels): Displays labels created in the repository for any versioned object by repository.
4. All Object Version History (Public Folders>PowerCenter Metadata Reports>Configuration Management>Object Version>All Object Version History): Displays all versions of an object by the date the object is saved in the repository. This is a standalone report.
5. Server Load by Day of Week (Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Server Load by Day of Week): Displays the total number of sessions that ran, and the total session run duration, for any day of week in any given month of the year by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays.
6. Session Run Details (Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Session Run Details): Displays session run details for any start date by repository by folder. This is a primary report in an analytic workflow.
7. Target Table Load Analysis (Last Month) (Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Target Table Load Analysis (Last Month)): Displays the load statistics for each table for last month by repository by folder. This is a primary report in an analytic workflow.
8. Workflow Run Details (Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Workflow Run Details): Displays the run statistics of all workflows by repository by folder. This is a primary report in an analytic workflow.
9. Worklet Run Details (Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Worklet Run Details): Displays the run statistics of all worklets by repository by folder. This is a primary report in an analytic workflow.
10. Mapping List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping List): Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets. This is a primary report in an analytic workflow.
11. Mapping Lookup Transformations (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Lookup Transformations): Displays Lookup transformations used in a mapping by repository and folder. This report is a standalone report and also the first node in the analytic workflow associated with the Mapping List primary report.
12. Mapping Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Shortcuts): Displays mappings defined as a shortcut by repository and folder.
13. Source to Target Dependency (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Source to Target Dependency): Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.
14. Mapplet List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet List): Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets. This is a primary report in an analytic workflow.
15. Mapplet Lookup Transformations (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Lookup Transformations): Displays all Lookup transformations used in a mapplet by folder and repository. This report is a standalone report and also the first node in the analytic workflow associated with the Mapplet List primary report.
16. Mapplet Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Shortcuts): Displays mapplets defined as a shortcut by repository and folder.
17. Unused Mapplets in Mappings (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Unused Mapplets in Mappings): Displays mapplets defined in a folder but not used in any mapping in that folder.
18. Metadata Extensions Usage (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Metadata Extensions>Metadata Extensions Usage): Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.
19. Server Grid List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Server Grid>Server Grid List): Displays all server grids and servers associated with each grid. Information includes host name, port number, and internet protocol address of the servers.
20. Session List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sessions>Session List): Displays all sessions and their properties by repository by folder. This is a primary report in an analytic workflow.
21. Source List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source List): Displays relational and non-relational sources by repository and folder. It also shows the source properties. This report is a primary report in an analytic workflow.
22. Source Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source Shortcuts): Displays sources that are defined as shortcuts by repository and folder.
23. Target List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target List): Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in an analytic workflow.
24. Target Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target Shortcuts): Displays targets that are defined as shortcuts by repository and folder.
25. Transformation List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation List): Displays transformations defined by repository and folder. This is a primary report in an analytic workflow.
26. Transformation Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation Shortcuts): Displays transformations that are defined as shortcuts by repository and folder.
27. Scheduler (Reusable) List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Scheduler (Reusable) List): Displays all the reusable schedulers defined in the repository and their description and properties by repository by folder. This is a primary report in an analytic workflow.
28. Workflow List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Workflow List): Displays workflows and workflow properties by repository by folder. This report is a primary report in an analytic workflow.
29. Worklet List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Worklets>Worklet List): Displays worklets and worklet properties by repository by folder. This is a primary report in an analytic workflow.
30. Users By Group (Public Folders>PowerCenter Metadata Reports>Security>Users By Group): Displays users by repository and group.

Reports for the PowerAnalyzer repository:

1. Bottom 10 Least Accessed Reports this Year (Public Folders>PowerAnalyzer Metadata Reporting>Bottom 10 Least Accessed Reports this Year): Displays the ten least accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
2. Report Activity Details (Public Folders>PowerAnalyzer Metadata Reporting>Report Activity Details): Part of the analytic workflows "Top 10 Most Accessed Reports This Year", "Bottom 10 Least Accessed Reports this Year", and "Usage by Login (Month To Date)".
3. Report Activity Details for Current Month (Public Folders>PowerAnalyzer Metadata Reporting>Report Activity Details for Current Month): Provides information about reports accessed in the current month, up to the current date.
4. Report Refresh Schedule (Public Folders>PowerAnalyzer Metadata Reporting>Report Refresh Schedule): Provides information about the next scheduled update for scheduled reports. It can be used to decide schedule timing for various reports for optimum system performance.
5. Reports Accessed by Users Today (Public Folders>PowerAnalyzer Metadata Reporting>Reports Accessed by Users Today): Part of the analytic workflow for "Todays Logins". It provides detailed information on the reports accessed by users today. This can be used independently to get comprehensive information about today's report activity details.
6. Todays Logins (Public Folders>PowerAnalyzer Metadata Reporting>Todays Logins): Provides the login count and average login duration for users who logged in today.
7. Todays Report Usage by Hour (Public Folders>PowerAnalyzer Metadata Reporting>Todays Report Usage by Hour): Provides information about the number of reports accessed today for each hour. The analytic workflow attached to it provides more details on the reports accessed and the users who accessed them during the selected hour.
8. Top 10 Most Accessed Reports this Year (Public Folders>PowerAnalyzer Metadata Reporting>Top 10 Most Accessed Reports this Year): Shows the ten most accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
9. Top 5 Logins (Month To Date) (Public Folders>PowerAnalyzer Metadata Reporting>Top 5 Logins (Month To Date)): Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
10. Top 5 Longest Running On-Demand Reports (Month To Date) (Public Folders>PowerAnalyzer Metadata Reporting>Top 5 Longest Running On-Demand Reports (Month To Date)): Shows the five longest running on-demand reports for the current month to date. It displays the average total response time, average DB response time, and average PowerAnalyzer response time (all in seconds) for each report shown.
11. Top 5 Longest Running Scheduled Reports (Month To Date) (Public Folders>PowerAnalyzer Metadata Reporting>Top 5 Longest Running Scheduled Reports (Month To Date)): Shows the five longest running scheduled reports for the current month to date. It displays the average response time (in seconds) for each report shown.
12. Total Schedule Errors for Today (Public Folders>PowerAnalyzer Metadata Reporting>Total Schedule Errors for Today): Provides the number of errors encountered during execution of reports attached to schedules. The analytic workflow "Scheduled Report Error Details for Today" is attached to it.
13. User Logins (Month To Date) (Public Folders>PowerAnalyzer Metadata Reporting>User Logins (Month To Date)): Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
14. Users Who Have Never Logged On (Public Folders>PowerAnalyzer Metadata Reporting>Users Who Have Never Logged On): Provides information about users who exist in the repository but have never logged in. This information can be used to make administrative decisions about disabling accounts.

Customizing a Report or Creating New Reports

Once you select a report, you can customize it by setting the parameter values and/or creating new attributes or metrics. PowerAnalyzer includes simple steps to create new reports or modify existing ones. Adding or modifying filters offers tremendous reporting flexibility. Additionally, you can set up report templates and export them as Excel files, which can be refreshed as necessary. For more information on the attributes, metrics, and schemas included with the Metadata Reporter, consult the product documentation.

Wildcards

The Metadata Reporter supports two wildcard characters:

• Percent symbol (%) - represents any number of characters and spaces.
• Underscore (_) - represents one character or space.

You can use wildcards in any number and combination in the same parameter. Leaving a parameter blank returns all values and is the same as using %. The following examples show how you can use the wildcards to set parameters.

Suppose you have the following values available to select:

items, items_in_promotions, order_items, promotions

The following list shows the return values for some wildcard combinations you can use:

Wildcard Combination    Return Values
%                       items, items_in_promotions, order_items, promotions
<blank>                 items, items_in_promotions, order_items, promotions
%items                  items, order_items
item_                   items
item%                   items, items_in_promotions
___m%                   items, items_in_promotions, promotions
%pr_mo%                 items_in_promotions, promotions

A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce such a printout, arrange the mapping in Designer so the full mapping appears on the screen, and then use Alt+PrtSc to copy the active window to the clipboard. Use Ctrl+V to paste the copy into a Word document.

For a detailed description of how to run these reports, consult the Metadata Reporter Guide included in the PowerCenter documentation.

Security Awareness for Metadata Reporter

Metadata Reporter uses PowerAnalyzer for reporting out of the PowerCenter/PowerAnalyzer repository. PowerAnalyzer has a robust security mechanism that is inherited by Metadata Reporter. You can establish groups, roles, and/or privileges for users based on their profiles. Since the information in the PowerCenter repository does not change often after it goes to production, the Administrator can create some reports and export them to files that can be distributed to the user community. If the number of users for Metadata Reporter is limited, you can implement security using report filters or the data restriction feature. For example, if a user in the PowerCenter repository has access to certain folders, you can create a filter for those folders and apply it to the user's profile. For more information on the ways in which you can implement security in PowerAnalyzer, refer to the PowerAnalyzer documentation.

Metadata Exchange: the Second Generation (MX2)

The MX architecture was intended primarily for BI vendors who wanted to create a PowerCenter-based data warehouse and display the warehouse metadata through their own products. The result was a set of relational views that encapsulated the underlying repository tables while exposing the metadata in several categories that were more suitable for external parties. Today, Informatica and several key vendors, including Brio, Business Objects, Cognos, and MicroStrategy are effectively using the MX views to report and query the Informatica metadata.

Informatica currently supports the second generation of Metadata Exchange called MX2. Although the overall motivation for creating the second generation of MX remains consistent with the original intent, the requirements and objectives of MX2 supersede those of MX.

The primary requirements and features of MX2 are:

Incorporation of object technology in a COM-based API. Although SQL provides a powerful mechanism for accessing and manipulating records of data in a relational paradigm, it is not suitable for procedural programming tasks that can be achieved by C, C++, Java, or Visual Basic. Furthermore, the increasing popularity and use of object-oriented software tools require interfaces that can fully take advantage of the object technology. MX2 is implemented in C++ and offers an advanced object-based API for accessing and manipulating the PowerCenter Repository from various programming languages.

Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are part of the repository database and thus can be used independent of any of the Informatica software products. The same requirement also holds for MX2, thus leading to the development of a self-contained API Software Development Kit that can be used independently of the client or server products.

Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and upstream data warehouse modeling tools require complex multidimensional metadata, such as hierarchies, levels, and various relationships. This type of metadata was specifically designed and implemented in the repository to accommodate the needs of the Informatica partners by means of the new MX2 interfaces.

Ability to write (push) metadata into the repository. Because of the limitations associated with relational views, MX could not be used for writing or updating metadata in the Informatica repository. As a result, such tasks could only be accomplished by directly manipulating the repository's relational tables. The MX2 interfaces provide metadata write capabilities along with the appropriate verification and validation features to ensure the integrity of the metadata in the repository.

Complete encapsulation of the underlying repository organization by means of an API. One of the main challenges with MX views and the interfaces that access the repository tables is that they are directly exposed to any schema changes of the underlying repository database. As a result, maintaining the MX views and direct interfaces requires a major effort with every major upgrade of the repository. MX2 alleviates this problem by offering a set of object-based APIs that are abstracted away from the details of the underlying relational tables, thus providing an easier mechanism for managing schema evolution.

Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more sophisticated procedural programs that can tightly integrate the repository with the third-party data warehouse modeling and query/reporting tools.

Synchronization of metadata based on changes from up-stream and down-stream tools. Given that metadata is likely to reside in various databases and files in a distributed software environment, synchronizing changes and updates ensures the validity and integrity of the metadata. The object-based technology used in MX2 provides the infrastructure needed to implement automatic metadata synchronization and change propagation across different tools that access the PowerCenter Repository.

Interoperability with other COM-based programs and repository interfaces. MX2 interfaces comply with Microsoft's Component Object Model (COM) interoperability protocol. Therefore, any existing or future program that is COM-compliant can seamlessly interface with the PowerCenter Repository by means of MX2.

Repository Tables & Metadata Management

Challenge

Maintaining the repository for regular backup, quick response, and querying metadata for metadata reports.

Description

Regular actions such as taking backups, testing backup and restore procedures, and deleting unwanted information keep the repository performing well.

Managing Repository

The PowerCenter Administrator plays a vital role in managing and maintaining the repository and metadata. The role involves tasks such as securing the repository, managing the users and roles, maintaining backups, and managing the repository through such activities as removing unwanted metadata, analyzing tables, and updating statistics.

Repository backup

Repository backups can be performed using the Repository Server Administration Console client tool or the pmrep command line program. Backups using pmrep can be automated and scheduled to run regularly.

Figure 1 Shell Script to backup repository

This shell script can be scheduled to run as a cron job for regular backups. Alternatively, it can be called from PowerCenter via a command task; the command task can be placed in a workflow and scheduled to run daily.

Figure 2 Repository Backup workflow

The following paragraphs describe some useful practices for maintaining backups:

Frequency: Backup frequency depends on the activity in the repository. For production repositories, a backup is recommended once a month or prior to a major release. For development repositories, a backup is recommended once a week or once a day, depending on the team size.

Backup file sizes: Because backup files can be very large, Informatica recommends compressing them using a utility such as winzip or gzip.

Storage: For security reasons, Informatica recommends maintaining backups on a different physical device than the repository itself.

Move backups offline: Review the backups on a regular basis to determine how long they need to remain online. Any that are not required online should be moved offline, to tape, as soon as possible.

Restore repository

Although the repository restore function is used primarily as part of disaster recovery, it can also be useful for testing the validity of the backup files and for testing the recovery process on a regular basis. Informatica recommends testing the backup files and recovery process at least once each quarter. The repository can be restored using the client tool, Repository Server Administration Console, or the command line program pmrepagent.

Restore folders

There is no easy way to restore only one particular folder from a backup. First, the backup repository has to be restored into a new repository; then you can use the Repository Manager client tool to copy the entire folder from the restored repository into the target repository.

Remove older versions

Use the purge command to remove older versions of objects from the repository. To purge a specific version of an object, view the history of the object, select the version, and purge it.

Finding deleted objects and removing them from repository

If a PowerCenter repository is enabled for versioning through the Team Based Development option, objects that have been deleted from the repository are no longer visible in the client tools. To list or view deleted objects, use the find checkouts command in the client tools, a query generated in the Repository Manager, or a specific query.

Figure 3 Query to list DELETED objects

After an object has been deleted from the repository, you cannot create another object with the same name unless the deleted object has been completely removed from the repository. Use the purge command to completely remove deleted objects from the repository. Keep in mind, however, that you must remove all versions of a deleted object to completely remove it from the repository.

Truncating Logs

You can truncate the log information (for sessions and workflows) stored in the repository using either the Repository Manager or the pmrep command line program. Logs can be truncated for the entire repository or for a particular folder.

Options allow truncating all log entries or selected entries based on date and time.

Figure 4 Truncate Log for entire repository

Figure 5 Truncate Log - for a specific folder

Repository Performance

Analyzing the repository tables (i.e., updating their optimizer statistics) can help to improve repository performance. Because this process should be carried out for all tables in the repository, a script offers the most efficient means. You can then schedule the script to run using either an external scheduler or a PowerCenter workflow with a command task that calls the script.
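
The sketch below shows one way such a script might look on Oracle. The schema name PC_REPO is a placeholder for your repository schema owner, and the OPB_ prefix covers most (but not necessarily all) repository tables; adjust both to your environment.

-- Sketch: refresh optimizer statistics for the repository schema (Oracle).
-- PC_REPO is a placeholder for the repository schema owner.
-- Option 1: gather statistics for the whole schema in a single call.
BEGIN
   DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'PC_REPO', cascade => TRUE);
END;
/

-- Option 2: generate ANALYZE statements table by table and spool them to a
-- script that can be reviewed and executed.
SELECT 'ANALYZE TABLE ' || table_name || ' COMPUTE STATISTICS;'
FROM   all_tables
WHERE  owner = 'PC_REPO'
AND    table_name LIKE 'OPB_%';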

Repository Agent and Repository Server performance

Factors such as team size, network, number of objects involved in a specific operation, number of old locks (on repository objects), etc. may reduce the efficiency of the repository server (or agent). In such cases, the various causes should be analyzed and the repository server (or agent) configuration file modified to improve performance.

Managing Metadata

The following paragraphs list the queries that are most often used to report on PowerCenter metadata. The queries are written for PowerCenter repositories on Oracle and are based on PowerCenter 6 and PowerCenter 7. Minor changes in the queries may be required for PowerCenter repositories residing on other databases.

Failed Sessions

The following query lists the failed sessions in the last day. To make it work for the last ‘n’ days, replace SYSDATE-1 with SYSDATE - n

SELECT Subject_Area AS Folder,
       Session_Name,
       Last_Error AS Error_Message,
       DECODE(Run_Status_Code, 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Status,
       Actual_Start AS Start_Time,
       Session_TimeStamp
FROM   rep_sess_log
WHERE  run_status_code != 1
AND    TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)

Long running Sessions

The following query lists sessions that ran for more than 10 minutes in the last day (the 10/(24*60) term in the final predicate is the threshold, expressed in days). To make it work for the last 'n' days, replace SYSDATE-1 with SYSDATE - n

SELECT   Subject_Area AS Folder,
         Session_Name,
         Successful_Source_Rows AS Source_Rows,
         Successful_Rows AS Target_Rows,
         Actual_Start AS Start_Time,
         Session_TimeStamp
FROM     rep_sess_log
WHERE    run_status_code = 1
AND      TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
AND      (Session_TimeStamp - Actual_Start) > (10/(24*60))
ORDER BY Session_TimeStamp

Invalid Tasks

The following query lists the folder name, task type, task name, version number, and last-saved date for all invalid tasks.

SELECT   SUBJECT_AREA AS FOLDER_NAME,
         DECODE(IS_REUSABLE, 1, 'Reusable', ' ') || ' ' || TASK_TYPE_NAME AS TASK_TYPE,
         TASK_NAME AS OBJECT_NAME,
         VERSION_NUMBER,              -- comment out for V6
         LAST_SAVED
FROM     REP_ALL_TASKS
WHERE    IS_VALID = 0
AND      IS_ENABLED = 1
--AND    CHECKOUT_USER_ID = 0         -- Comment out for V6
--AND    IS_VISIBLE = 1               -- Comment out for V6
ORDER BY SUBJECT_AREA, TASK_NAME

Load Counts

The following query lists the load counts (number of rows loaded and rejected) and the session status for sessions run in the last day.

SELECT   subject_area,
         workflow_name,
         session_name,
         DECODE(Run_Status_Code, 1, 'Succeeded', 3, 'Failed', 4, 'Stopped', 5, 'Aborted') AS Session_Status,
         successful_rows,
         failed_rows,
         actual_start
FROM     REP_SESS_LOG
WHERE    TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE - 1) AND TRUNC(SYSDATE)
ORDER BY subject_area,
         workflow_name,
         session_name,
         Session_Status

Using Metadata Extensions

Challenge

To provide for efficient documentation and achieve extended metadata reporting through the use of metadata extensions in repository objects.

Description

Metadata Extensions, as the name implies, help you to extend the metadata stored in the repository by associating information with individual objects in the repository.

Informatica Client applications can contain two types of metadata extensions: vendor-defined and user-defined.

• Vendor-defined. Third-party application vendors create vendor-defined metadata extensions. You can view and change the values of vendor-defined metadata extensions, but you cannot create, delete, or redefine them.

• User-defined. You create user-defined metadata extensions using PowerCenter clients. You can create, edit, delete, and view user-defined metadata extensions. You can also change the values of user-defined extensions.

You can create reusable or non-reusable metadata extensions. You associate reusable metadata extensions with all repository objects of a certain type. So, when you create a reusable extension for a mapping, it is available for all mappings. Vendor-defined metadata extensions are always reusable.

Non-reusable extensions are associated with a single repository object. Therefore, if you edit a target and create a non-reusable extension for it, that extension is available only for the target you edit. It is not available for other targets. You can promote a non-reusable metadata extension to reusable, but you cannot change a reusable metadata extension to non-reusable.

Metadata extensions can be created for the following repository objects:

• Source definitions
• Target definitions
• Transformations (Expressions, Filters, etc.)
• Mappings
• Mapplets
• Sessions
• Tasks
• Workflows
• Worklets

Metadata extensions offer a very easy and efficient method of documenting important information associated with repository objects. For example, when you create a mapping, you can store the mapping owner's name and contact information with the mapping; or, when you create a source definition, you can enter the name of the person who created or imported the source.

The power of metadata extensions is most evident in the reusable type. When you create a reusable metadata extension for any type of repository object, that metadata extension becomes part of the properties of that type of object. For example, suppose you create a reusable metadata extension for source definitions called SourceCreator. When you create or edit any source definition in the Designer, the SourceCreator extension appears on the Metadata Extensions tab. Anyone who creates or edits a source can enter the name of the person that created the source into this field.

You can create, edit, and delete non-reusable metadata extensions for sources, targets, transformations, mappings, and mapplets in the Designer. You can create, edit, and delete non-reusable metadata extensions for sessions, workflows, and worklets in the Workflow Manager. You can also promote non-reusable metadata extensions to reusable extensions using the Designer or the Workflow Manager. You can also create reusable metadata extensions in the Workflow Manager or Designer.

You can create, edit, and delete reusable metadata extensions for all types of repository objects using the Repository Manager. If you want to create, edit, or delete metadata extensions for multiple objects at one time, use the Repository Manager. When you edit a reusable metadata extension, you can modify the properties Default Value, Permissions and Description.

Note: You cannot create non-reusable metadata extensions in the Repository Manager. All metadata extensions created in the Repository Manager are reusable. Reusable metadata extensions are repository wide.

You can also migrate Metadata Extensions from one environment to another. When you do a copy folder operation, the Copy Folder Wizard copies the metadata extension values associated with those objects to the target repository. A non-reusable metadata extension will be copied as a non-reusable metadata extension in the target repository. A reusable metadata extension is copied as reusable in the target repository, and the object retains the individual values. You can edit and delete those extensions, as well as modify the values.

Metadata Extensions provide for extended metadata reporting capabilities. Using Informatica MX2 API, you can create useful reports on metadata extensions. For example, you can create and view a report on all the mappings owned by a specific team member. You can use various programming environments such as Visual Basic, Visual C++, C++ and Java SDK to write API modules. The Informatica Metadata Exchange SDK 6.0 installation CD includes sample Visual Basic and Visual C++ applications.

Additionally, metadata extensions can be populated via data modeling tools such as ERWin, Oracle Designer, and PowerDesigner through Informatica Metadata Exchange for Data Models. With Informatica Metadata Exchange for Data Models, the Informatica Repository interface can retrieve and update the extended properties of source and target definitions in PowerCenter repositories. Extended properties are the descriptive, user-defined, and other properties derived from your data modeling tool; you can map any of these properties to the metadata extensions that are already defined in the source or target object in the Informatica repository.

Daily Operations

Challenge

Once the data warehouse has been moved to production, the most important task is keeping the system running and available for the end users.

Description

In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support team. This team is typically involved with the support of other systems and has expertise in database systems and various operating systems. The Data Warehouse Development team becomes, in effect, a customer of the Production Support team. To that end, the Production Support team needs two documents, a Service Level Agreement and an Operations Manual, to help in the support of the production data warehouse.

Service Level Agreement

The Service Level Agreement outlines how the overall data warehouse system is to be maintained. This is a high-level document that discusses system maintenance and the components of the system, and identifies the groups responsible for monitoring the various components. At a minimum, it should contain the following information:

• Times when the system should be available to users.
• Scheduled maintenance window.
• Who is expected to monitor the operating system.
• Who is expected to monitor the database.
• Who is expected to monitor the PowerCenter sessions.
• How quickly the support team is expected to respond to notifications of system failures.
• Escalation procedures that include data warehouse team contacts in the event that the support team cannot resolve the system failure.

Operations Manual

The Operations Manual is crucial to the Production Support team because it provides the information needed to perform the data warehouse system maintenance. This manual should be self-contained, providing all of the information necessary for a production support operator to maintain the system and resolve most problems that may arise. This manual should contain information on how to maintain all data warehouse system components. At a minimum, the Operations Manual should contain:

• Information on how to stop and re-start the various components of the system.
• IDs and passwords (or how to obtain passwords) for the system components.
• Information on how to re-start failed PowerCenter sessions and recovery procedures.
• A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and the average run times.
• Error handling strategies.
• Who to call in the event of a component failure that cannot be resolved by the Production Support team.

Data Integration Load Traceability

Challenge

Load management is one of the major difficulties facing a data integration or data warehouse operations team. This Best Practice tries to answer the following questions:

• How can the team keep track of what has been loaded?
• What order should the data be loaded in?
• What happens when there is a load failure?
• How can bad data be removed and replaced?
• How can the source of data be identified?
• When was it loaded?

Description

Load management provides an architecture to allow all of the above questions to be answered with minimal operational effort.

Benefits of a Load Management Architecture

Data Lineage

The term Data Lineage is used to describe the ability to track data from its final resting place in the target back to its original source. This requires the tagging of every row of data in the target with an ID from the load management metadata model. This serves as a direct link between the actual data in the target and the original source data.

To give an example of the usefulness of this ID, a data warehouse or integration competency center operations team, or possibly end users, can, on inspection of any row of data in the target schema, link back to see when it was loaded, where it came from, any other metadata about the set it was loaded with, validation check results, number of other rows loaded at the same time, and so forth.

It is also possible to use this ID to link one row of data with all of the other rows loaded at the same time. This can be useful when a data issue is detected in one row and the operations team needs to see if the same error exists in all of the other rows. More importantly, the ID makes it easy to identify the source data for a specific row in the target, enabling the operations team to quickly identify where a data issue may lie.
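
A minimal sketch of such a trace-back query is shown below. The table and column names (FACT_SALES, PROCESS_INSTANCE, SOURCE_INSTANCE, PROCESS_INSTANCE_ID, SOURCE_INSTANCE_ID) and the key value are purely illustrative placeholders for whatever names your own load management model uses.

-- Sketch: trace one target row back to the process and source that loaded it.
-- All table names, column names, and key values are illustrative placeholders.
SELECT f.process_instance_id,
       p.start_time,
       p.end_time,
       p.status,
       s.file_name,
       s.extract_date
FROM   fact_sales f,
       process_instance p,
       source_instance s
WHERE  f.fact_key = 12345                               -- the row under investigation
AND    p.process_instance_id = f.process_instance_id
AND    s.source_instance_id  = p.source_instance_id;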

It is often assumed that data issues are produced by the transformation processes executed as part of the target schema load. Using the source ID to link back the source data makes it easy to identify whether the issues were in the source data when it was first encountered by the target schema load processes or if those load processes caused the issue. This ability can save a huge amount of time, expense, and frustration -- particularly in the initial launch of any new subject areas.

Process Lineage

Tracking the order that data was actually processed in is often the key to resolving processing and data issues. Because choices are often made during the processing of data based on business rules and logic, the order and path of processing differs from one run to the next. Only by actually tracking these processes as they act upon the data can issue resolution be simplified.

Process Dependency Management

Having a metadata structure in place provides an environment to facilitate the application and maintenance of business dependency rules. Once a structure is in place that identifies every process, it becomes very simple to add the necessary metadata and validation processes required to ensure enforcement of the dependencies among processes. Such enforcement resolves many of the scheduling issues that operations teams typically faces.

Process dependency metadata needs to exist because it is often not possible to rely on the source systems to deliver the correct data at the correct time. Moreover, in some cases, transactions are split across multiple systems and must be loaded into the target schema in a specific order. This is usually difficult to manage because the various source systems have no way of coordinating the release of data to the target schema.
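
As a sketch of how such enforcement might look, the query below lists any prerequisite processes that have not yet completed successfully today; a non-empty result means the dependent load should wait. The PROCESS_DEPENDENCY and PROCESS_INSTANCE tables, their columns, and the literal IDs are hypothetical, not part of a prescribed model.

-- Sketch: pre-load dependency check. All names and IDs are illustrative placeholders.
SELECT d.prerequisite_process_id
FROM   process_dependency d
WHERE  d.process_id = 42                                -- the process about to run
AND    NOT EXISTS
       (SELECT 1
        FROM   process_instance p
        WHERE  p.process_id      = d.prerequisite_process_id
        AND    p.status          = 'SUCCESS'
        AND    TRUNC(p.end_time) = TRUNC(SYSDATE));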

Robustness

Using load management metadata to control the loading process also offers two other big advantages, both of which fall under the heading of robustness because they allow for a degree of resilience to load failure.

Load Ordering

Load ordering is a set of processes that use the load management metadata to identify the order in which the source data should be loaded. This can be as simple as making sure the data is loaded in the sequence it arrives, or as complex as having a pre-defined load sequence planned in the metadata.

There are a number of techniques used to manage these processes. The most common is an automated process that generates a PowerCenter load list from flat files in a directory, then archives the files in that list after the load is complete. This process can use embedded data in file names or can read header records to identify the correct ordering of the data. Alternatively the correct order can be pre-defined in the load management metadata using load calendars.

Either way, load ordering should be employed in any data integration or data warehousing implementation because it allows the load process to be automatically paused when there is a load failure, and ensures that the data that has been put on hold is loaded in the correct order as soon as possible after a failure.

The essential part of the load management process is that it operates without human intervention, helping to make the system self healing!

Rollback

If there is a loading failure or a data issue in normal daily load operations, it is usually preferable to remove all of the data that was loaded as one set. Load management metadata allows the operations team to selectively roll back a specific set of source data, the data processed by a specific process, or a combination of both. This can be done manually or through an automated rollback feature.
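
A rollback sketch is shown below. It assumes the target rows are tagged with a PROCESS_INSTANCE_ID and that each PROCESS_INSTANCE row carries the SOURCE_INSTANCE_ID of the data set it processed; all names and ID values are illustrative placeholders.

-- Sketch: remove every target row loaded from one bad set of source data.
-- Names and ID values are illustrative placeholders.
DELETE FROM fact_sales
WHERE  process_instance_id IN
       (SELECT p.process_instance_id
        FROM   process_instance p
        WHERE  p.source_instance_id = 9876);

-- Flag the affected process instances so the rollback is visible in the metadata.
UPDATE process_instance
SET    status = 'ROLLED_BACK'
WHERE  source_instance_id = 9876;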

Simple Load Management Metadata Model

As you can see from the simple load management metadata model above, there are two sets of data linked to every transaction in the target tables. These represent the two major types of load management metadata:

• Source tracking
• Process tracking

Source Tracking

Source tracking looks at how the target schema validates and controls the loading of source data. The aim is to automate as much of the load processing as possible and track every load from the source through to the target schema.

Source Definitions

Most data integration projects use batch load operations for the majority of data loading. The sources for these come in a variety of forms, including flat file formats (ASCII, XML etc), relational databases, ERP systems, and legacy mainframe systems.

The first control point for the target schema is to maintain a definition of how each source is structured, as well as other validation parameters.

These definitions should be held in a Source Master table like the one shown in the data model above.

These definitions can and should be used to validate that the structure of the source data has not changed. A great example of this practice is the use of DTD files in the validation of XML feeds.

In the case of flat files, it is usual to hold details like:

• Header information (if any)
• How many columns
• Data types for each column
• Expected number of rows

For RDBMS sources, the Source Master record might hold the definition of the source tables or store the structure of the SQL statement used to extract the data (i.e., the SELECT, FROM and ORDER BY clauses).

These definitions can be used to manage and understand the initial validation of the source data structures. Quite simply, if the system is validating the source against a definition, there is an inherent control point at which problem notifications and recovery processes can be implemented. It’s better to catch a bad data structure than to start loading bad data.

Source Instances

A Source Instance table (as shown in the load management metadata model) is designed to hold one record for each separate set of data of a specific source type being loaded. It should have a direct key link back to the Source Master table which defines its type.

The various source types may need slightly different source instance metadata to enable optimal control over each individual load.

Unlike the source definitions, this metadata will change every time a new extract and load is performed. In the case of flat files, this would be a new file name and possibly date/time information from its header record. In the case of relational data, it would be the selection criteria (i.e., the SQL WHERE clause) used for each specific extract, and the date and time it was executed.

This metadata needs to be stored in the source tracking tables so that the operations team can identify a specific set of source data if the need arises. This need may arise if the data needs to be removed and reloaded after an error has been spotted in the target schema.
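
A minimal DDL sketch of the two source tracking tables follows. The names, columns, and data types are illustrative only; real implementations usually carry more attributes (expected row counts, validation status, archive location, and so on).

-- Sketch: source tracking tables. All names and columns are illustrative.
CREATE TABLE source_master (
   source_id       NUMBER         PRIMARY KEY,  -- one row per source definition
   source_name     VARCHAR2(100),
   source_type     VARCHAR2(30),                -- e.g., FLAT_FILE, RDBMS, XML
   structure_def   VARCHAR2(4000)               -- column list, DTD name, extract SQL, ...
);

CREATE TABLE source_instance (
   source_instance_id  NUMBER     PRIMARY KEY,  -- one row per extract/load of a source
   source_id           NUMBER     REFERENCES source_master (source_id),
   file_name           VARCHAR2(255),           -- or the WHERE clause used for the extract
   extract_date        DATE,
   row_count           NUMBER
);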

Process Tracking

Process tracking describes the use of load management metadata to track and control the loading processes rather than the specific data sets themselves. There can often be many load processes acting upon a single source instance set of data.

While it is not always necessary to be able to identify when each individual process completes, it is very beneficial to know when a set of sessions that move data from one stage to the next has completed. Not all sessions are tracked this way because, in most cases, the individual processes are simply storing data into temporary tables that will be flushed at a later date. Since load management process IDs are intended to track back from a record in the target schema to the process used to load it, it only makes sense to generate a new process ID if the data is being stored permanently in one of the major staging areas.

Process Definition

Process definition metadata is held in the Process Master table (as shown in the load management metadata model). This, in its basic form, holds a description of the process and its overall status. It can also be extended, with the introduction of other tables, to reflect any dependencies among processes, as well as processing holidays.

Process Instances

A process instance is represented by an individual row in the load management metadata Process Instance table; one row is created for each instance of a load process that is actually run. The row holds metadata about when the process started and stopped, as well as its current status. Most importantly, this table allocates a unique ID to each instance.

The unique ID allocated in the process instance table is used to tag every row of source data. This ID is then stored with each row of data in the target table.

Integrating Source and Process Tracking

Integrating source and process tracking can produce an extremely powerful investigative and control tool for the administrators of data warehouses and integrated schemas. This is achieved by simply linking every process ID with the source instance ID of the source it is processing. This requires that a write-back facility be built into every process to update its process instance record with the ID of the source instance being processed.


The effect is a one-to-many relationship between the source instance table and the process instance table, with the process instance table containing several rows for each set of source data loaded into a target schema. For example, in a data warehousing project there might be a row for loading the extract into a staging area, a row for the move from the staging area to an ODS, and a final row for the move from the ODS to the warehouse.
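Continuing the hypothetical tables sketched earlier, the write-back described above and a selective rollback might look like the following; the target table TARGET_FACT and the bind variables are invented for the example.

-- Each load process writes back the source instance it handled
update process_instance
   set src_instance_id = :current_src_instance_id,
       end_time        = sysdate,
       run_status      = 'COMPLETED'
 where process_instance_id = :current_process_instance_id;

-- A selective rollback of one bad source data set then becomes a delete driven by the metadata
delete from target_fact
 where process_instance_id in
       (select process_instance_id
          from process_instance
         where src_instance_id = :bad_src_instance_id);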

Integrated Load Management Flow Diagram

Tracking Transactions

This is the simplest data to track since it is loaded incrementally and not updated. This means that the process and source tracking discussed earlier in this document can be applied as is.


Tracking Reference Data

This task is complicated by the fact that reference data, by its nature, is not static. This means that if you simply update the data in a row any time there is a change, there is no way that the change can be backed out using the load management practice described earlier. Instead, Informatica recommends always using slowly changing dimension processing on every reference data and dimension table to accomplish source and process tracking. Updating the reference data as a ‘slowly changing table’ retains the previous versions of updated records, thus allowing any changes to be backed out.
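One way to make such a backout mechanical is to stamp each version row with both the process that created it and the process that later closed it. The sketch below assumes a hypothetical DIM_PRODUCT table with an END_PROCESS_INSTANCE_ID column; that column is a design choice assumed for this example, not something mandated by the model above.

-- Remove the versions added by the bad load
delete from dim_product
 where process_instance_id = :bad_process_instance_id;

-- Re-open the versions that the bad load had closed
update dim_product
   set current_flag = 'Y',
       end_date     = null,
       end_process_instance_id = null
 where end_process_instance_id = :bad_process_instance_id;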

Tracking Aggregations

Aggregation also causes additional complexity for load management because the resulting aggregate row very often contains the aggregation across many source data sets. As with reference data, this means that the aggregated row cannot be backed out in the same way as transactions.

This problem is managed by treating the source of the aggregate as if it was an original source. This means that rather than trying to track the original source, the load management metadata only tracks back to the transactions in the target that have been aggregated. So, the mechanism is the same as used for transactions but the resulting load management metadata only tracks back from the aggregate to the fact table in the target schema.
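The sketch below shows the idea with invented table names: the aggregate rows are tagged with their own process instance ID, and a backout simply removes them and rebuilds the aggregate from the fact table in the target schema.

-- Back out the aggregate rows produced by the bad aggregation run
delete from agg_sales_monthly
 where process_instance_id = :bad_agg_process_instance_id;

-- Rebuild the aggregate from the (already corrected) fact table, tagged with a new process instance ID
insert into agg_sales_monthly (month_key, product_key, sales_amount, process_instance_id)
select month_key, product_key, sum(sales_amount), :new_agg_process_instance_id
  from fact_sales
 group by month_key, product_key;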


Event Based Scheduling

Challenge

In an operational environment, the beginning of a task often needs to be triggered by some event, either internal or external to the Informatica environment. In versions of PowerCenter prior to 6.0, this was achieved through the use of indicator files. In PowerCenter 6.0 and later, it is achieved through the use of the Event-Raise and Event-Wait workflow and worklet tasks, as well as indicator files.

Description

Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved through the use of indicator files. Users specified the indicator file configuration in the session configuration under advanced options. When the session started, the PowerCenter Server looked for the specified file name; if it wasn’t there, it waited until it appeared, then deleted it, and triggered the session.

In PowerCenter 6.0 and above, event-based scheduling is triggered by Event-Wait and Event-Raise tasks. These tasks can be used to define task execution order within a workflow or worklet. They can even be used to control sessions across workflows.

• An Event-Raise task represents a user-defined event.
• An Event-Wait task waits for an event to occur within a workflow. After the event triggers, the PowerCenter Server continues executing the workflow from the Event-Wait task forward.

The following paragraphs describe the events that an Event-Wait task can wait for.

Waiting for Pre-Defined Events

To use a pre-defined event, you need a session, shell command, script, or batch file to create an indicator file. You must create the file locally or send it to a directory local to the PowerCenter Server. The file can be any format recognized by the PowerCenter Server operating system. You can choose to have the PowerCenter Server delete the indicator file after it detects the file, or you can manually delete the indicator file. The PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot delete the indicator file.


When you specify the indicator file in the Event-Wait task, specify the directory in which the file will appear and the name of the indicator file. Do not use either a source or target file name as the indicator file name. You must also provide the absolute path for the file, and the directory must be local to the PowerCenter Server. If you only specify the file name, and not the directory, Workflow Manager looks for the indicator file in the system directory. For example, on Windows NT, the system directory is C:\winnt\system32. You can enter the actual name of the file or use server variables to specify the location of the files. The PowerCenter Server writes the time the file appears in the workflow log.

Follow these steps to set up a pre-defined event in the workflow:

1. Create an Event-Wait task and double-click it to open the Edit Tasks dialog box.
2. In the Events tab of the Edit Tasks dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.

Pre-defined Event

A pre-defined event is a file-watch event. For pre-defined events, use an Event-Wait task to instruct the PowerCenter Server to wait for the specified indicator file to appear before continuing with the rest of the workflow. When the PowerCenter Server locates the indicator file, it starts the task downstream of the Event-Wait.

User-defined Event

A user-defined event is defined at the workflow or worklet level and the Event-Raise task triggers the event at one point of the workflow/worklet. If an Event-Wait task is configured in the same workflow/worklet to listen for that event, then execution will continue from the Event-Wait task forward.

The following is an example of using user-defined events:

Assume that you have four sessions that you want to execute in a workflow. You want P1_session and P2_session to execute concurrently to save time. You also want to execute Q3_session after P1_session completes. You want to execute Q4_session only when P1_session, P2_session, and Q3_session complete. Follow these steps:

1. Link P1_session and P2_session concurrently.
2. Add Q3_session after P1_session.
3. Declare an event called P1Q3_Complete in the Events tab of the workflow properties.
4. In the workspace, add an Event-Raise task after Q3_session.
5. Specify the P1Q3_Complete event in the Event-Raise task properties. This allows the Event-Raise task to trigger the event when P1_session and Q3_session complete.
6. Add an Event-Wait task after P2_session.
7. Specify the P1Q3_Complete event for the Event-Wait task.


8. Add Q4_session after the Event-Wait task. When the PowerCenter Server processes the Event-Wait task, it waits until the Event-Raise task triggers P1Q3_Complete before it executes Q4_session.

The PowerCenter Server executes the workflow in the following order:

1. The PowerCenter Server executes P1_session and P2_session concurrently.
2. When P1_session completes, the PowerCenter Server executes Q3_session.
3. The PowerCenter Server finishes executing P2_session.
4. The Event-Wait task waits for the Event-Raise task to trigger the event.
5. The PowerCenter Server completes Q3_session.
6. The Event-Raise task triggers the event, P1Q3_Complete.
7. The PowerCenter Server executes Q4_session because the event, P1Q3_Complete, has been triggered.

Be sure to take care in setting the links, though. If they are left as the default and Q3_session fails, the Event-Raise will never happen. The Event-Wait will then wait forever and the workflow will run until it is stopped. To avoid this, check the workflow option ‘suspend on error’. With this option, if a session fails, the whole workflow goes into suspended mode and can send an email to notify developers.


High Availability

Challenge

Availability of the environment that processes data is key to all organizations. When processing systems are unavailable, companies are not able to meet their service level agreements and service their internal and external customers.

High availability within the PowerCenter architecture is related to making sure the necessary processing resources are available to meet these business needs.

In a highly available PowerCenter environment, load schedules cannot be allowed to be impacted by the failure of physical hardware. The PowerCenter server must be running at all times. If the machine hosting the PowerCenter server goes down, another machine must recognize this, start another server, and take over responsibility for running the sessions and batches.

Processes also need to be designed for restartability and to handle switching between servers, making all processes server independent.

These architecture and process considerations support ‘High Availability’ in a PowerCenter environment.

Description

In PowerCenter terms, ‘high availability’ is best accomplished in a clustered environment.

Example

While there are many types of hardware and many ways to configure a clustered environment, this example is based on the following hardware and software characteristics:

• Two Sun 4500s, running Solaris OS
• Sun High-Availability Clustering Software
• External EMC storage, with each server owning specific disks
• PowerCenter installed on a separate disk that is accessible by both servers in the cluster, but only by one server at a time


One of the Sun 4500s serves as the primary data integration server, while the other server in the cluster is the secondary server. Under normal operations, the PowerCenter server ‘thinks’ it is physically hosted by the primary server and uses the resources of the primary server, although the software itself resides on its own shared disk.

When the primary server goes down, the Sun high-availability software automatically starts the PowerCenter server on the secondary server using the basic auto start/stop scripts that are used in many UNIX environments to automatically start the PowerCenter server whenever a host is rebooted. In addition, the Sun high-availability software changes the ownership of the disk where the PowerCenter server is installed from the primary server to the secondary server. To facilitate this, a logical IP address can be created specifically for the PowerCenter server. This logical IP address is specified in the pmserver.cfg file instead of the physical IP addresses of the servers. Thus, only one pmserver.cfg file is needed.

Note: The pmserver.cfg file is located with the pmserver code, typically at: {informatica_home}/{version label}/pmserver.

Process

A high-availability environment can handle a variety of situations such as hardware failures or core dumps. However, such an environment can also generate problems when a process that uses signal files, surrogate keys, or other intermediate results fails mid-stream via an ‘Abort’.

When an abort occurs on the non-Informatica side, any intermediate files created by UNIX scripts need to be taken into account in the restart procedures. However, if an abort or system failure occurs on the Informatica side, any write-back to the repository will not be executed. For example, if a sequence generator is being used for a surrogate key, the final surrogate key value will not be written to the repository. This problem needs to be addressed as part of the restart logic by caching sequence generator values or designing code that can handle this situation.

An example of the consequences of not addressing this problem could include incorrect handling of surrogate keys. A surrogate key is a key that has no business meaning; it is generated as part of a process. Informatica sequence generators are frequently used to hold the next key value to use for a new key. If a hardware failure occurs, the current value of the sequence generator will not be written to the repository. Therefore, without handling this situation, the next time a new row is written it would use an old key value and update an incorrect row of data. This would be a catastrophic data problem and must be prevented.

It is recommended to design processes that can restart after any failure, including this example, without any manual cleanup required. For the surrogate key problem above, there are two solutions:

• Every time you get a sequence value, cache the number of values that will be needed before the next commit of the database. While this will prevent the catastrophic data problem, it also could waste a large number of key values that were never used.
• An alternative approach is to look up the maximum key value each time this process runs, then use the sequence generator ‘reset’ feature and always start at 1, incrementing the value for each new row of data (see the sketch after this list). This allows simple and risk-free restarts and does not waste any key values. This is the recommended approach.
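A minimal sketch of the maximum-key lookup, assuming a hypothetical dimension table DIM_CUSTOMER with surrogate key CUSTOMER_KEY:

select nvl(max(customer_key), 0) as max_key
  from dim_customer;

Because the sequence generator is reset to start at 1 on every run, each new key becomes max_key plus the generated value, and a failed run can simply be rerun without manual cleanup.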

The previous example is just one of many potential restart problems. Developers need to design carefully and extend these principles to other objects such as variable values, run details, and any other details written to the repository at the completion of a session or task. These problems are most significant when the repository is used to hold process data or when temporary results are stored on the server rather than having processes handle these situations.

Designing a high-availability system

In developing a high-availability system or developing processes in a high-availability environment, it is advisable to address the following process issues:

Issue: Are signal files used?
Solution: As part of the restart process, check for the existence of signal files and clean up files as appropriate on all servers.

Issue: Are sequence generators used?
Solution: If sequence generators are used, write audit or operational processes to evaluate whether a sequence generator is out of sync and update it as appropriate.

Issue: Are there nested processes within a workflow?
Solution: Ensure the workflows are written in such a way that they can either be restarted at the beginning of the workflow with no ill effects, or that the individual sessions can be restarted without causing error handling to fail because other sessions were not run during the current execution.

Issue: Are there batch controls that utilize components from the previous issues?
Solution: Validate that batch controls can handle a mid-stream restart.

These situations should be resolved before running high availability in production. If the high availability environment is already in production, restart procedures should be modified to handle these situations.

When an environment has high availability in place, all development should be designed for restartability and address the considerations listed in the previous examples.

Summary

High Availability in PowerCenter is composed of two sets of tasks, architectural and procedural. It is critical that both are considered when creating a High Availability solution. Companies must both implement a clustered environment to handle hardware failures and develop processes which can be easily restarted regardless of the type of failure or the server they are executing on.


Load Validation

Challenge

Knowing that all data for the current load cycle has loaded correctly is essential for good data warehouse management. However, the need for load validation varies, depending on the extent of error checking, data validation, and/or data cleansing functionalities inherent in your mappings. For large data integration projects, with thousands of mappings, the task of reporting load statuses becomes overwhelming without a well-planned load validation process.

Description

Methods for validating the load process range from simple to complex. Use the following steps to plan a load validation process:

1. Determine what information you need for load validation (e.g., workflow names, session names, session start times, session completion times, successful rows and failed rows).

2. Determine the source of this information. All this information is stored as metadata in the PowerCenter repository, but you must have a means of extracting this information.

3. Determine how you want this information presented to you. Should the information be delivered in a report? Do you want it emailed to you? Do you want it available in a relational table, so that history is easily preserved? Do you want it stored as a flat file?

All of these factors come into play in finding the correct solution for you.

The following paragraphs describe five possible solutions for load validation, beginning with a fairly simple solution and moving toward the more complex:

1. Post-session Emails on Either Success or Failure

One practical application of the post-session email functionality is the situation in which a key business user waits for completion of a session to run a report. You can configure email to this user, notifying him or her that the session was successful and the report can run. Another practical application is the situation in which a production support analyst needs to be notified immediately of any failures. You can configure the session to send an email to the analyst for a failure. For around-the-clock support, a pager number can be used in place of an email address.

Post-session e-mail is configured in the session, under the General tab and ‘Session Commands’.

A number of variables are available to simplify the text of the e-mail:

• %s Session name
• %e Session status
• %b Session start time
• %c Session completion time
• %i Session elapsed time
• %l Total records loaded
• %r Total records rejected
• %t Target table details
• %m Name of the mapping used in the session
• %n Name of the folder containing the session
• %d Name of the repository containing the session
• %g Attach the session log to the message
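For example, an e-mail body built from these variables (the wording here is only a suggestion) gives an analyst the essentials at a glance:

Session %s finished with status %e.
Started: %b   Completed: %c   Elapsed: %i
Rows loaded: %l   Rows rejected: %r
Target table details: %t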

2. Other Workflow Manager Features

Besides post session emails, there are other features available in the Workflow Manager to help validate loads. Control, Decision, Event, and Timer tasks are some of the features you can use to place multiple controls on the behavior of your loads. Another feature is to place conditions in your links. Links are used to connect tasks within a workflow or worklet. You can use the pre-defined or user-defined variables in the link conditions. In the example below, upon the ‘Successful’ completion of both sessions A and B, the PowerCenter Server will execute session C.

3. PowerCenter Metadata Reporter (PCMR) Reports

The PowerCenter Metadata Reporter (PCMR) is a web-based business intelligence (BI) tool that is included with every Informatica PowerCenter license to give visibility into metadata stored in the PowerCenter repository in a manner that is easy to comprehend and distribute. The PCMR includes more than 130 pre-packaged metadata reports and dashboards delivered through PowerAnalyzer, Informatica’s BI offering. These pre-packaged reports enable PowerCenter customers to extract extensive business and technical metadata through easy-to-read reports including:

• Load statistics and operational metadata that enable load validation.
• Table dependencies and impact analysis that enable change management.
• PowerCenter object statistics to aid in development assistance.
• Historical load statistics that enable planning for growth.

In addition to the 130 pre-packaged reports and dashboards that come standard with PCMR, you can develop additional custom reports and dashboards based on the PCMR limited use license that allows you to source reports from the PowerCenter repository. Examples of custom components that can be created include:

• Repository-wide reports and/or dashboards with indicators of daily load success/failure.

• Customized project-based dashboard with visual indicators of daily load success/failure.

• Detailed daily load statistics report for each project that can be exported to Microsoft Excel or PDF.

• Error handling reports that deliver error messages and source data for row level errors that may have occurred during a load.


Below is an example of a custom dashboard that gives instant insight into the load validation across an entire repository through four custom indicators.
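One way to drive such an indicator is with a simple query against the MX views described in the next section. The sketch below assumes an Oracle repository and assumes that the REP_SESS_LOG view exposes a FAILED_ROWS column in your PowerCenter version; verify the view definition before relying on it.

select subject_area,
       count(*)             as sessions_run,
       sum(successful_rows) as rows_loaded,
       sum(failed_rows)     as rows_rejected   -- assumes a FAILED_ROWS column is available
  from rep_sess_log
 where session_timestamp >= trunc(sysdate)      -- today's runs only (Oracle syntax)
 group by subject_area
 order by subject_area;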

4. Query Informatica Metadata Exchange (MX) Views

Informatica Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the PowerCenter repository. The Repository Manager generates these views when you create or upgrade a repository. Almost any query can be put together to retrieve metadata related to the load execution from the repository. The MX view, REP_SESS_LOG, is a great place to start. This view is likely to contain all the information you need. The following sample query shows how to extract folder name, session name, session end time, successful rows, and session duration:

select subject_area, session_name, session_timestamp, successful_rows,
       (session_timestamp - actual_start) * 24 * 60 * 60
  from rep_sess_log a
 where session_timestamp = (select max(session_timestamp)
                              from rep_sess_log
                             where session_name = a.session_name)
 order by subject_area, session_name

The sample output would look like this:


TIP

Informatica strongly advises against querying the repository tables directly. Because future versions of PowerCenter will most likely alter the underlying repository tables, only queries against the MX views, not the repository tables, are supported.

5. Mapping Approach

A more complex approach, and the most customizable, is to create a PowerCenter mapping to populate a table or flat file with desired information. You can do this by sourcing the MX view REP_SESS_LOG and then performing lookups to other repository tables or views for additional information.

The following graphic illustrates a sample mapping:

This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the absolute minimum and maximum run times for that particular session. This enables you to compare the current execution time with the minimum and maximum durations.
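The lookups in such a mapping could be driven by a query along the following lines; it uses only the REP_SESS_LOG columns shown in the earlier example and the same Oracle date arithmetic, and is a sketch rather than the exact logic of the illustrated mapping.

select subject_area,
       session_name,
       min((session_timestamp - actual_start) * 24 * 60 * 60) as min_duration_secs,
       max((session_timestamp - actual_start) * 24 * 60 * 60) as max_duration_secs
  from rep_sess_log
 group by subject_area, session_name
 order by subject_area, session_name;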

Please note that unless you have acquired additional licensing, a customized metadata data mart cannot be a source for a PCMR report. However, you can use a business intelligence tool of your choice instead.


Repository Administration

Challenge

Defining the role of the PowerCenter repository administrator and describing the tasks required to properly manage the repository.

Description

The PowerCenter repository administrator has many responsibilities. In addition to regularly backing up the repository, truncating logs, and updating the database statistics, he or she also typically performs the following functions:

• Determine metadata strategy
• Install/configure client/server software
• Migrate development to test and production
• Maintain PowerCenter Servers
• Upgrade software
• Administer security and folder organization
• Monitor and tune environment

NOTE: The Repository Administrator is also typically responsible for maintaining repository passwords; changing them on a regular basis and keeping a record of them in a secure place.

Determine Metadata Strategy

The Repository Administrator is responsible for developing the structure and standard for metadata in the PowerCenter Repository. This includes developing naming conventions for all objects in the repository, creating a folder organization, and maintaining the repository. The Administrator is also responsible for modifying the metadata strategies to suit changing business needs or to fit the needs of a particular project. Such changes may include new folder names and/or different security setup.

Install/Configure Client/Server Software

This responsibility includes installing and configuring the application servers in all applicable environments (e.g., development, QA, production, etc.). The Administrator must have a thorough understanding of the working environment, along with access to resources such as an NT or UNIX administrator and a DBA.


The Administrator is also responsible for installing and configuring the client tools. Although end users can generally install the client software, the configuration of the client tool connections benefits from being consistent throughout the repository environment. The Administrator, therefore, needs to enforce this consistency in order to maintain an organized repository.

Migrate Development to Test and Production

When the time comes for content in the development environment to be moved to test and production environments, it is the responsibility of the Administrator to schedule, track, and copy folder changes. Also, it is crucial to keep track of the changes that have taken place. It is the role of the Administrator to track these changes through a change control process. The Administrator should be the only individual able to physically move folders from one environment to another.

If a versioned repository is used, the Administrator should set up labels and instruct the developers on the labels that they must apply to their repository objects (i.e., reusable transformations, mappings, workflows and sessions). This task also requires close communication with project staff to review the status of items of work to ensure, for example, that only tested or approved work is migrated.

Maintain PowerCenter Servers

The Administrator must also be able to understand and troubleshoot the server environment. He or she should have a good understanding of how the server operates under various situations and be fully aware of all connections to the server. The Administrator should also understand what the server does when a session is running and be able to identify those processes. Additionally, certain mappings may produce files in addition to the standard session and workflow logs. The Administrator should be familiar with these files and know how and where to maintain them.

Upgrade Software

If and when the time comes to upgrade software, the Administrator is responsible for overseeing the installation and upgrade process.

Security and Folder Administration

Security administration consists of creating, maintaining, and updating all users within the repository, including creating and assigning groups based on new and changing projects and defining which folders are to be shared, and at what level. Folder administration involves creating and maintaining the security of all folders. The Administrator should be the only user with privileges to edit folder properties.

Tune Environment

The Administrator should have sole responsibility for implementing performance changes to the server environment. He or she should observe server performance throughout development so as to identify any bottlenecks in the system. In the production environment, the Repository Administrator should monitor the jobs and any growth (e.g., increases in data or throughput time) and communicate such change to other staff as appropriate to address bottlenecks, accommodate growth, and ensure that the required data is loaded within the prescribed load window.


SuperGlue Repository Administration

Challenge

The task of administering the SuperGlue repository involves taking care of both the integration repository and the SuperGlue warehouse. This requires knowledge of both PowerCenter administrative features (i.e., for the integration repository used in SuperGlue) and SuperGlue administration features.

Description

A SuperGlue administrator needs to be involved in the following areas to ensure that the SuperGlue metadata warehouse is fulfilling the end-user needs:

• Migration of SuperGlue objects created in the Development environment to QA or the Production environment

• Creation and maintenance of access and privileges of SuperGlue objects
• Repository backups
• Job monitoring
• Metamodel creation

Migration from Development to QA or Production

In cases where a client has modified out-of-the-box objects provided in SuperGlue or created a custom metamodel for custom metadata, the objects must be tested in the Development environment prior to being migrated to the QA or Production environments. The SuperGlue Administrator needs to do the following to ensure that the objects are in sync between the two environments:

• Install a new SuperGlue instance for the QA/Production environment. This involves creating a new integration repository and SuperGlue warehouse.
• Export the metamodel from the Development environment and import it to QA or Production via the XML Import/Export functionality (in the SuperGlue Administration tab) or via the SGCmd command line utility.


• Export the custom or modified reports created or configured in the Development environment and import them to QA or Production via the XML Import/Export functionality in the SuperGlue Administration tab. This functionality is identical to the function in PowerAnalyzer; refer to the PowerAnalyzer Administration Guide for details on the import/export function.

Providing Access and Privileges

Users can perform a variety of SuperGlue tasks based on their privileges. The SuperGlue Administrator can assign privileges to users by assigning them roles. Each role has a set of privileges that allow the associated users to perform specific tasks. The Administrator can also create groups of users so that all users in a particular group have the same functions. When an Administrator assigns a role to a group, all users of that group receive the privileges assigned to the role. For more information about privileges, users, and groups, see the PowerAnalyzer Administrator Guide.

The SuperGlue Administrator can assign privileges to users to enable them to perform any of the following tasks in SuperGlue:

• Configure reports. Users can view particular reports, create reports, and/or modify the reporting schema.

• Configure the SuperGlue Warehouse. Users can add, edit, and delete repository objects using SuperGlue.

• Configure metamodels. Users can add, edit, and delete metamodels.

SuperGlue also allows the Administrator to create access permissions on specific source repository objects for specific users. Users can be restricted to reading, writing, or deleting source repository objects that appear in SuperGlue.

Similarly, the Administrator can establish access permissions for source repository objects in the SuperGlue warehouse. Access permissions determine the tasks that users can perform on specific objects. When the Administrator sets access permissions, he or she determines which users have access to the source repository objects that appear in SuperGlue. The Administrator can assign the following types of access permissions to objects:

• Read - Grants permission to view the details of an object and the names of any objects it contains.


• Write - Grants permission to edit an object and create new repository objects in the SuperGlue warehouse.

• Delete - Grants permission to delete an object from a repository.
• Change permission - Grants permission to change the access permissions for an object.

When a repository is first loaded into the SuperGlue warehouse, SuperGlue provides all permissions to users with the System Administrator role. All other users receive read permissions. The Administrator can then set inclusive and exclusive access permissions.

Metamodel Creation


In cases where a client needs to create custom metamodels for sourcing custom metadata, the SuperGlue Administrator needs to create new packages, originators, repository types and class associations. For details on how to create new metamodels for custom metadata loading and rendering in SuperGlue, refer to the SuperGlue Installation and Administration Guide.

Job Monitoring

When SuperGlue XConnects are running in the Production environment, Informatica recommends monitoring loads through the SuperGlue console. The Configuration Console Activity Log in the SuperGlue console can identify the total time it takes for an XConnect to complete. The console maintains a history of all runs of an XConnect, enabling a SuperGlue Administrator to ensure that load times are meeting the SLA agreed upon with end users and that load times are not increasing inordinately as data in the SuperGlue warehouse grows.

The Activity Log provides the following details about each repository load:

• Repository Name - name of the source repository defined in SuperGlue
• Run Start Date - day of week and date the XConnect run began
• Start Time - time the XConnect run started
• End Time - time the XConnect run completed
• Duration - number of seconds the XConnect run took to complete
• Ran From - machine hosting the source repository
• Last Refresh Status - status of the XConnect run, and whether it completed successfully or failed

Repository Backups


When SuperGlue is running in either the Production or QA environment, Informatica recommends taking periodic backups of the following areas:

• Database backups of the SuperGlue warehouse
• The integration repository; Informatica recommends either of two methods for this backup:
  o The PowerCenter Repository Server Administration Console or the pmrep command line utility
  o The traditional, native database backup method

The native PowerCenter backup is required but Informatica recommends using both methods because, if database corruption occurs, the native PowerCenter backup provides a clean backup that can be restored to a new database.


Third Party Scheduler

Challenge

Successfully integrate a third-party scheduler with PowerCenter. This Best Practice describes the various levels at which a third-party scheduler can be integrated.

Description

Tasks such as getting server and session properties, session status, or starting or stopping a workflow or a task can be performed either through the Workflow Monitor or by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be integrated with PowerCenter at any of several levels. The level of integration depends on the complexity of the workflow/schedule and the skill sets of production support personnel.

Many companies want to automate the scheduling process by using scripts or third-party schedulers. In some cases, they are using a standard scheduler and want to continue using it to drive the scheduling process.

A third-party scheduler can start or stop a workflow or task, obtain session statistics, and get server details using the pmcmd commands. pmcmd is a program used to communicate with the PowerCenter server. PowerCenter 7 greatly enhances pmcmd functionality, providing commands to support the concept of workflows and workflow monitoring while retaining compatibility with old syntax.

Third Party Scheduler Integration Levels

In general, there are three levels of integration between a third-party scheduler and PowerCenter: Low, Medium, and High.

Low Level

Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter workflow. This process subsequently kicks off the rest of the tasks or sessions. The PowerCenter scheduler handles all processes and dependencies after the third-party scheduler has kicked off the initial workflow. In this level of integration, nearly all control lies with the PowerCenter scheduler.


This type of integration is very simple to implement because the third-party scheduler kicks off only one process. It is often used simply to satisfy a corporate mandate to use a standard scheduler. This type of integration also takes advantage of the robust functionality offered by the Workflow Monitor.

Low-level integration requires production support personnel to have a thorough knowledge of PowerCenter. Because Production Support personnel in many companies are only knowledgeable about the company’s standard scheduler, one of the main disadvantages of this level of integration is that if a batch fails at some point, the Production Support personnel may not be able to determine the exact breakpoint. Thus, the majority of the production support burden falls back on the Project Development team.

Medium Level

With Medium-level integration, a third-party scheduler kicks off some, but not all, workflows or tasks. Within the tasks, many sessions may be defined with dependencies. PowerCenter controls the dependencies within the tasks.

With this level of integration, control is shared between PowerCenter and the third-party scheduler, which requires more integration between the third-party scheduler and PowerCenter. Medium-level integration requires Production Support personnel to have a fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not have in-depth knowledge about the tool, they may be unable to fix problems that arise, so the production support burden is shared between the Project Development team and the Production Support team.

High Level

With High-level integration, the third-party scheduler has full control of scheduling and kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible for controlling all dependencies among the sessions. This type of integration is the most complex to implement because there are many more interactions between the third-party scheduler and PowerCenter.

Production Support personnel may have limited knowledge of PowerCenter but must have thorough knowledge of the scheduling tool. Because Production Support personnel in many companies are knowledgeable only about the company’s standard scheduler, one of the main advantages of this level of integration is that if the batch fails at some point, the Production Support personnel are usually able to determine the exact breakpoint. Thus, the production support burden lies with the Production Support team.

Sample Scheduler Script

There are many independent scheduling tools on the market. The following is an example of an AutoSys script that can be used to start tasks; it is included here simply as an illustration of how a scheduler can be implemented in the PowerCenter environment. This script can also capture the return codes and abort on error, returning a success or failure (with associated return codes) to the command line or the AutoSys GUI monitor.


# Name: jobname.job
# Author: Author Name
# Date: 01/03/2005
# Description:
# Schedule: Daily
#
# Modification History
# When          Who          Why
#
#------------------------------------------------------------------

. jobstart $0 $*

# set variables
ERR_DIR=/tmp

# A temporary file is created to store all the error information.
# The file format is TDDHHMISS<PROCESS-ID>.lst
curDayTime=`date +%d%H%M%S`
FName=T$curDayTime$$.lst

if [ $STEP -le 1 ]
then
    echo "Step 1: RUNNING wf_stg_tmp_product_xref_table..."
    cd /dbvol03/vendor/informatica/pmserver/

    #pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 wf_stg_tmp_product_xref_table
    #pmcmd starttask -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG -w WF_STG_TMP_PRODUCT_XREF_TABLE -wait s_M_STG_TMP_PRODUCT_XREF_TABLE
    # The above lines need to be edited to include the name of the workflow or the task that you are attempting to start.

    # Checking whether to abort the current process or not
    RetVal=$?
    echo "Status = $RetVal"
    if [ $RetVal -ge 1 ]
    then
        jobend abnormal "Step 1: Failed wf_stg_tmp_product_xref_table...\n"
        exit 1
    fi
    echo "Step 1: Successful"
fi

jobend normal

exit 0


Updating Repository Statistics

Challenge

The PowerCenter repository has more than 170 tables, and most have one or more indexes to speed up queries. Most databases use column distribution statistics to determine which index to use to optimize performance. It can be important, especially in large or high-use repositories, to update these statistics regularly to avoid performance degradation.

Description

For PowerCenter 7 and later, statistics are updated during copy, backup, or restore operations. In addition, the pmrep command has an option to update statistics that can be scheduled as part of a regularly-run script.

For PowerCenter 6 and earlier, specific strategies for Oracle, Sybase, SQL Server, DB2, and Informix are discussed below. Each example shows how to extract the list of repository tables and indexes from the PowerCenter repository and generate a script that updates their statistics.

Features in PowerCenter version 7 and later

Copy, Backup and Restore Repositories

PowerCenter 7 automatically identifies and updates all statistics of all repository tables and indexes when a repository is copied, backed-up, or restored. If you follow a strategy of regular repository back-ups, the statistics will also be updated.

PMREP Command

PowerCenter 7 also has a command line option to update statistics in the database; the command can be included in a Windows batch file or UNIX shell script. The format of the command is: pmrep updatestatistics {-s filelistfile}

The -s option allows you to specify a file listing the tables whose statistics you do not want to update.

Example of Automating the Process


One approach to automating this is to use a UNIX shell script that calls the pmrep updatestatistics command, incorporate that script into a special workflow in PowerCenter, and run the workflow on a scheduled basis. Note: Workflow Manager supports command tasks as well as scheduling.

Listed below is an example of the command line object.

In addition, this workflow can be scheduled to run on a daily, weekly, or monthly basis. This allows the statistics to be updated regularly so performance is not degraded.

Tuning Strategies for PowerCenter version 6 and earlier

The following are strategies for generating scripts to update distribution statistics. Note that all PowerCenter repository tables and index names begin with "OPB_" or "REP_".

Oracle

Run the following queries:

select 'analyze table ', table_name, ' compute statistics;' from user_tables where table_name like 'OPB_%'


select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes where INDEX_NAME like 'OPB_%'

This will produce output like:

'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'

analyze table OPB_ANALYZE_DEP compute statistics;

analyze table OPB_ATTR compute statistics;

analyze table OPB_BATCH_OBJECT compute statistics;

.

.

.

'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'

analyze index OPB_DBD_IDX compute statistics;

analyze index OPB_DIM_LEVEL compute statistics;

analyze index OPB_EXPR_IDX compute statistics;

.

.

Save the output to a file. Then, edit the file and remove all the headers. (i.e., the lines that look like:

'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'

Run this as a SQL script. This updates statistics for the repository tables.

MS SQL Server

Run the following query:

select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like:

name


update statistics OPB_ANALYZE_DEP

update statistics OPB_ATTR

update statistics OPB_BATCH_OBJECT

.

.

Save the output to a file, then edit the file and remove the header information (i.e., the top two lines) and add a 'go' at the end of the file.

Run this as a SQL script. This updates statistics for the repository tables.

Sybase

Run the following query:

select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like:

name

update statistics OPB_ANALYZE_DEP

update statistics OPB_ATTR

update statistics OPB_BATCH_OBJECT

.

.

.

Save the output to a file, then remove the header information (i.e., the top two lines), and add a 'go' at the end of the file.

Run this as a SQL script. This updates statistics for the repository tables.

Informix

Run the following query:


select 'update statistics low for table ', tabname, ' ;' from systables where tabname like 'opb_%' or tabname like 'OPB_%';

This will produce output like:

(constant) tabname (constant)

update statistics low for table OPB_ANALYZE_DEP ;

update statistics low for table OPB_ATTR ;

update statistics low for table OPB_BATCH_OBJECT ;

.

.

.

Save the output to a file, then edit the file and remove the header information (i.e., the top line that looks like:

(constant) tabname (constant)

Run this as a SQL script. This updates statistics for the repository tables.

DB2

Run the following query:

select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;'
from sysstat.tables where tabname like 'OPB_%'

This will produce output like:

runstats on table PARTH.OPB_ANALYZE_DEP

and indexes all;

runstats on table PARTH.OPB_ATTR

and indexes all;

runstats on table PARTH.OPB_BATCH_OBJECT

and indexes all;


.

.

.

Save the output to a file.

Run this as a SQL script to update statistics for the repository tables.


Deploying PowerAnalyzer Objects

Challenge

Understand the methods for deploying PowerAnalyzer objects between repositories, and the limitations of those methods.

Description

PowerAnalyzer repository objects can be exported to and imported from Extensible Markup Language (XML) files. Export/import facilitates archiving the PowerAnalyzer repository and deploying PowerAnalyzer dashboards and reports from development to production.

The following repository objects in PowerAnalyzer can be exported and imported:

• Schemas
• Reports
• Time Dimensions
• Global Variables
• Dashboards
• Security profiles
• Schedules
• Users
• Groups
• Roles

It is advisable not to modify the XML file created after exporting objects. Any change might invalidate the XML file and result in a failure when importing the objects into a PowerAnalyzer repository.

For more information on exporting objects from the PowerAnalyzer repository, refer to Chapter 13 in PowerAnalyzer Administration Guide.

EXPORTING SCHEMA(S):

To export the definition of a star schema or an operational schema, you need to select a metric or folder from the Metrics system folder in the Schema Directory. When you export a folder, you export the schema associated with the definitions of the metrics in that folder and its subfolders. If the folder you select for export does not contain any objects, PowerAnalyzer does not export any schema definition and displays the following message:

There is no content to be exported.

There are two ways to export metrics or folders containing metrics. First, you can select the “Export Metric Definitions and All Associated Schema Table and Attribute Definitions” option. If you select to export a metric and its associated schema objects, PowerAnalyzer exports the definitions of the metric and the schema objects associated with that metric. If you select to export an entire metric folder and its associated objects, PowerAnalyzer exports the definitions of all metrics in the folder, as well as schema objects associated with every metric in the folder.

The other way to export metrics or folders containing metrics is to select the “Export Metric Definitions Only” option. When you choose to export only the definition of the selected metric, PowerAnalyzer does not export the definition of the schema table from which the metric is derived or any other associated schema object.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click the Administration tab » XML Export/Import » Export Schemas.
3. All the metric folders in the schema directory are displayed. Click the “Refresh Schema” option to display the latest list of folders and metrics in the schema directory.
4. Select the check box for the folder or metric to be exported and click the “Export as XML” option.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

EXPORTING REPORT(S):

To export the definitions of more than one report, select multiple reports or folders. PowerAnalyzer exports only report definitions; it does not export the data or the schedule for cached reports. As part of the report definition export, PowerAnalyzer exports the report table, report chart, filters, indicators (gauge, chart, and table indicators), custom metrics, links to similar reports, and all reports in an analytic workflow, including links to similar reports.

Reports might have public or personal indicators associated with them. By default, PowerAnalyzer exports only public indicators associated with a report. To export the personal indicators as well, select the Export Personal Indicators check box.

To export an analytic workflow, you need to export only the originating report. When you export the originating report of an analytic workflow, PowerAnalyzer exports the definitions of all the workflow reports. If a report in the analytic workflow has similar reports associated with it, PowerAnalyzer exports the links to the similar reports.


PowerAnalyzer does not export alerts, schedules, or global variables associated with the report. Although PowerAnalyzer does not export global variables, it lists all global variables it finds in the report filter. You can export these global variables separately.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Reports.
3. Select the folder or report to be exported.
4. Click the “Export as XML” option.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

EXPORTING GLOBAL VARIABLE(S):

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Global Variables.
3. Select the global variable to be exported.
4. Click the “Export as XML” option.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

EXPORTING A DASHBOARD:

When a dashboard is exported, PowerAnalyzer exports Reports, Indicators, Shared Documents, and Gauges associated with the dashboard. PowerAnalyzer does not export Alerts, Access Permissions, Attributes and Metrics in the Report(s), or Real-time Objects. You can export any of the public dashboards defined in the repository and you can export more than one dashboard at one time.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Dashboards.
3. Select the dashboard to be exported.
4. Click the “Export as XML” option.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

EXPORTING A USER SECURITY PROFILE:

PowerAnalyzer keeps a security profile for each user or group in the repository. A security profile consists of the access permissions and data restrictions that the system administrator sets for a user or group.


When exporting a security profile, PowerAnalyzer exports access permissions for objects under the Schema Directory, which include folders, metrics, and attributes. PowerAnalyzer does not export access permissions for filtersets, reports, or shared documents.

PowerAnalyzer allows you to export only one security profile at a time. If a user or group security profile you export does not have any access permissions or data restrictions, PowerAnalyzer does not export any object definitions and displays the following message:

There is no content to be exported.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Security Profile.
3. Click "Export from users" and select the user whose security profile is to be exported.
4. Click the "Export as XML" option.
5. Enter the XML filename and click "Save" to save the XML file.
6. The XML file will be stored locally on the client machine.

EXPORTING A SCHEDULE:

You can export a time-based or event-based schedule to an XML file. PowerAnalyzer runs a report with a time-based schedule on a configured schedule. PowerAnalyzer runs a report with an event-based schedule when a PowerCenter session completes. When you export a schedule, PowerAnalyzer does not export the history of the schedule.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Schedules.
3. Select the schedule to be exported.
4. Click the "Export as XML" option.
5. Enter the XML filename and click "Save" to save the XML file.
6. The XML file will be stored locally on the client machine.

EXPORTING A USER/GROUP/ROLE:

Exporting Users

You can export the definition of any user you define in the repository. However, you cannot export the definitions of system users defined by PowerAnalyzer. If you have over a thousand users defined in the repository, PowerAnalyzer allows you to search for the users that you want to export. You can use the asterisk (*) or the percent symbol (%) as wildcard characters to search for users to export.

You can export the definitions of more than one user, including the following information:

• Login name
• Description
• First, middle, and last name
• Title
• Password
• Change password privilege
• Password never expires indicator
• Account status
• Groups to which the user belongs
• Roles assigned to the user
• Query governing settings

PowerAnalyzer does not export the email address, reply-to address, department, or color scheme assignment associated with the exported user.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export User/Group/Role.
3. Click the "Export Users/Group(s)/Role(s)" option.
4. Select the user to be exported.
5. Click the "Export as XML" option.
6. Enter the XML filename and click "Save" to save the XML file.
7. The XML file will be stored locally on the client machine.

Exporting Groups

You can export any group defined in the repository, and you can export the definitions of more than one group at a time. You can also export the definitions of all the users within a selected group. You can use the asterisk (*) or the percent symbol (%) as wildcard characters to search for groups to export. Each group definition includes the following information:

• Name
• Description
• Department
• Color scheme assignment
• Group hierarchy
• Roles assigned to the group
• Users assigned to the group
• Query governing settings

PowerAnalyzer does not export the color scheme associated with an exported group.


Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export User/Group/Role.
3. Click the "Export Users/Group(s)/Role(s)" option.
4. Select the group to be exported.
5. Click the "Export as XML" option.
6. Enter the XML filename and click "Save" to save the XML file.
7. The XML file will be stored locally on the client machine.

Exporting Roles

You can export the definitions of the custom roles that you define in the repository. You cannot export the definitions of system roles defined by PowerAnalyzer. You can export the definitions of more than one role. Each role definition includes the name and description of the role and the permissions assigned to each role.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export User/Group/Role.
3. Click the "Export Users/Group(s)/Role(s)" option.
4. Select the role to be exported.
5. Click the "Export as XML" option.
6. Enter the XML filename and click "Save" to save the XML file.
7. The XML file will be stored locally on the client machine.

IMPORTING OBJECTS

You can import objects into the same repository or a different repository. If you import objects that already exist in the repository, you can choose to overwrite the existing objects. However, you can import only global variables that do not already exist in the repository.

When you import objects, you can validate the XML file against the DTD provided by PowerAnalyzer. Informatica recommends that you do not modify the XML files after you export from PowerAnalyzer. Ordinarily, you do not need to validate an XML file that you create by exporting from PowerAnalyzer. However, if you are not sure of the validity of an XML file, you can validate it against the PowerAnalyzer DTD file when you start the import process.
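If you prefer to check a file by hand before importing, a generic command-line XML validator can also be used. The sketch below is purely illustrative; both file names, including the DTD location, are assumptions rather than paths shipped with PowerAnalyzer:

# Hypothetical file names - adjust to the actual export file and DTD location.
xmllint --noout --dtdvalid powanalyzer_export.dtd exported_reports.xml

A zero exit code from xmllint indicates the file conforms to the DTD; any validation errors are printed to the console.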

To import repository objects, you must have the System Administrator role or the Access XML Export/Import privilege.

When you import a repository object, you become the owner of the object as if you created it. However, other system administrators can also access imported repository objects. You can limit access to reports for users who are not system administrators. If you select to publish imported reports to everyone, all users in PowerAnalyzer have read and write access to them. You can change the access permissions to reports after you import them.

IMPORTING SCHEMAS

When importing schemas, if the XML file contains only the metric definition, you must make sure that the fact table for the metric exists in the target repository. You can import a metric only if its associated fact table exists in the target repository or the definition of its associated fact table is also in the XML file.

When you import a schema, PowerAnalyzer displays a list of all the definitions contained in the XML file. It then displays a list of all the object definitions in the XML file that already exist in the repository. You can choose to overwrite objects in the repository. If you import a schema that contains time keys, you must import or create a time dimension.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Schema.
3. Click "Browse" to choose an XML file to import.
4. Select the "Validate XML against DTD" option.
5. Click the "Import XML" option.
6. Verify all attributes on the summary page, and choose "Continue".

IMPORTING REPORTS

A valid XML file of exported report objects can contain definitions of cached or on-demand reports, including prompted reports. When you import a report, you must make sure that all the metrics and attributes used in the report are defined in the target repository. If you import a report that contains attributes and metrics not defined in the target repository, you can cancel the import process. If you choose to continue the import process, you might not be able to run the report correctly. To run the report, you must import or add the attribute and metric definitions to the target repository.

You are the owner of all the reports you import, including the personal or public indicators associated with the reports. You can publish the imported reports to all PowerAnalyzer users. If you publish reports to everyone, PowerAnalyzer provides read access to the reports to all users. However, it does not provide access to the folder that contains the imported reports. If you want another user to access an imported report, you can put the imported report in a public folder and have the user save or move the imported report to the user’s personal folder. Any public indicator associated with the report also becomes accessible to the user.

If you import a report and its corresponding analytic workflow, the XML file contains all workflow reports. If you choose to overwrite the report, PowerAnalyzer also overwrites the workflow reports. Also, when importing multiple workflows, note that PowerAnalyzer does not import analytic workflows containing the same workflow report names. Thus, ensure that all imported analytic workflows have unique report names prior to being imported.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Report.
3. Click "Browse" to choose an XML file to import.
4. Select the "Validate XML against DTD" option.
5. Click the "Import XML" option.
6. Verify all attributes on the summary page, and choose "Continue".

IMPORTING GLOBAL VARIABLES

You can import global variables that are not defined in the target repository. If the XML file contains global variables already in the repository, you can cancel the process. If you continue the import process, PowerAnalyzer imports only the global variables not in the target repository.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Global Variables.
3. Click "Browse" to choose an XML file to import.
4. Select the "Validate XML against DTD" option.
5. Click the "Import XML" option.
6. Verify all attributes on the summary page, and choose "Continue".

IMPORTING DASHBOARDS

Dashboards display links to reports, shared documents, alerts, and indicators. When you import a dashboard, PowerAnalyzer imports the following objects associated with the dashboard:

• Reports
• Indicators
• Shared documents
• Gauges

PowerAnalyzer does not import the following objects associated with the dashboard:

• Alerts
• Access permissions
• Attributes and metrics in the report
• Real-time objects


If an object already exists in the repository, PowerAnalyzer provides an option to overwrite the object. PowerAnalyzer does not import the attributes and metrics in the reports associated with the dashboard. If the attributes or metrics in a report associated with the dashboard do not exist, the report does not display on the imported dashboard.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Dashboard.
3. Click "Browse" to choose an XML file to import.
4. Select the "Validate XML against DTD" option.
5. Click the "Import XML" option.
6. Verify all attributes on the summary page, and choose "Continue".

IMPORTING SECURITY PROFILE(S):

When you import a security profile, you must first select the user or group to which you want to assign the security profile. You can assign the same security profile to more than one user or group.

When you import a security profile and associate it with a user or group, you can either overwrite the current security profile or add to it. When you overwrite a security profile, you assign the user or group only the access permissions and data restrictions found in the new security profile. PowerAnalyzer removes the old restrictions associated with the user or group. When you append a security profile, you assign the user or group the new access permissions and data restrictions in addition to the old permissions and restrictions.

When exporting a security profile, PowerAnalyzer exports the security profile for objects in Schema Directory, including folders, attributes, and metrics. However, it does not include the security profile for filtersets.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Security Profile.
3. Click "Import to Users".
4. Select the user with which you want to associate the security profile you import.

o To associate the imported security profiles with all the users on the page, select the check box under Users at the top of the list.

o To associate the imported security profiles with all the users in the repository, select "Import to All."

o To overwrite the selected user's current security profile with the imported security profile, select "Overwrite."

o To append the imported security profile to the selected user's current security profile, select "Append."

5. Click "Browse" to choose an XML file to import.
6. Select the "Validate XML against DTD" option.
7. Click the "Import XML" option.
8. Verify all attributes on the summary page, and choose "Continue".

IMPORTING SCHEDULE(S):

A time-based schedule runs reports based on a configured schedule. An event-based schedule runs reports when a PowerCenter session completes. You can import time-based or event-based schedules from an XML file. When you import a schedule, PowerAnalyzer does not attach the schedule to any reports.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Schedule.
3. Click "Browse" to choose an XML file to import.
4. Select the "Validate XML against DTD" option.
5. Click the "Import XML" option.
6. Verify all attributes on the summary page, and choose "Continue".

IMPORTING USER(S)/GROUP(S)/ROLE(S):

When you import a user, group, or role, you import all the information associated with each user, group, or role. The XML file includes definitions of roles assigned to users or groups, and definitions of users within groups. For this reason, you can import the definition of a user, group, or role in the same import process.

When you import a user, you import the definitions of roles assigned to the user and the groups to which the user belongs. When you import a user or group, you import the user or group definitions only. The XML file does not contain the color scheme assignments, access permissions, or data restrictions for the user or group. To import the access permissions and data restrictions, you must import the security profile for the user or group.

Steps:

1. Log in to PowerAnalyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import User/Group/Role.
3. Click "Browse" to choose an XML file to import.
4. Select the "Validate XML against DTD" option.
5. Click the "Import XML" option.
6. Verify all attributes on the summary page, and choose "Continue".

Tips for Importing/Exporting


Schedule importing/exporting of repository objects at a time of minimal PowerAnalyzer activity, when most users are not accessing the PowerAnalyzer repository. This reduces the likelihood of users experiencing timeout errors or degraded response time. Only the System Administrator should perform the export/import operation.

Take a backup of the PowerAnalyzer repository before performing the export/import operation. This backup should be completed using the Repository Backup Utility provided with PowerAnalyzer.

Manually add user/group permissions for the report. Access permissions are not exported as part of exporting reports and must be added manually after the report is imported on the target server.

Use a version control tool. Prior to importing objects into a new environment, it is advisable to check the XML documents into a version control tool such as Microsoft Visual Source Safe, or PVCS. This will facilitate the versioning of repository objects and provide a means to rollback to a prior version of an object, if necessary.

PowerAnalyzer does not import the schedule with a cached report. When you import cached reports, you must attach them to schedules in the target repository. You can attach multiple imported reports to schedules in the target repository in one process immediately after you import them.

If you import a report that uses global variables in the attribute filter, ensure that the global variables already exist in the target repository. If they are not in the target repository, you must either import the global variables from the source repository or recreate them in the target repository.

You must add indicators to the dashboard manually. When you import a dashboard, PowerAnalyzer imports all indicators for the originating report and workflow reports in a workflow. However, indicators for workflow reports do not display on the dashboard after you import it until added manually.

Check with your system administrator to understand what level of LDAP integration has been configured, if any. Users, groups, and roles will need to be exported and imported during deployment when using repository authentication. If PowerAnalyzer has been integrated with an LDAP (Lightweight Directory Access Protocol) tool, then users, groups, and/or roles may not require deployment.

When you import users into a Microsoft SQL Server or IBM DB2 repository, PowerAnalyzer blocks all user authentication requests until the import process is complete.


Installing PowerAnalyzer

Challenge

Installing PowerAnalyzer on new or existing hardware, either as a dedicated application on a physical machine (as Informatica recommends) or co-existing with other applications on the same physical server or with other Web applications on the same application server.

Description

Consider the following questions when determining what type of hardware to use for PowerAnalyzer:

If the hardware already exists:

1. Are the processor, operating system, and database software supported by PowerAnalyzer?
2. Are the necessary operating system and database patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the PowerAnalyzer application?
5. Will PowerAnalyzer run alone or share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:

1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported (e.g., Solaris 9, Windows 2003, AIX 5.2, HP-UX 11i, Red Hat AS 3.0, SuSE 8)?
3. What database and version is preferred and supported for the PowerAnalyzer repository?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the reporting response time requirements for PowerAnalyzer. The following questions should be answered in order to estimate the size of a PowerAnalyzer server:

1. How many users are predicted for concurrent access?
2. On average, how many rows will be returned in each report?
3. On average, how many charts will there be for each report?
4. Do the business requirements mandate an SSL Web server?

The hardware requirements for the PowerAnalyzer environment depend on the number of concurrent users, types of reports being used (interactive vs. static), average number of records in a report, application server and operating system used, among other factors. The following table should be used as a general guide for hardware recommendations for a PowerAnalyzer installation. Actual results may vary depending upon exact hardware configuration and user volume. For exact sizing recommendations, please contact Informatica Professional Services for a PowerAnalyzer Sizing and Baseline Architecture engagement.

Windows 2000

# of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (PowerAnalyzer alone) | Estimated # of App Servers (Clustered)
50  | 1000  | 2  | 2     | 1 GB   | 1
100 | 1000  | 2  | 3     | 2 GB   | 1 - 2
200 | 1000  | 2  | 6     | 3.5 GB | 3
400 | 1000  | 2  | 12    | 6.5 GB | 6
100 | 1000  | 2  | 3     | 2 GB   | 1 - 2
100 | 2000  | 2  | 3     | 2.5 GB | 1 - 2
100 | 5000  | 2  | 4     | 3 GB   | 2
100 | 10000 | 2  | 5     | 4 GB   | 2 - 3
100 | 1000  | 2  | 3     | 2 GB   | 1 - 2
100 | 1000  | 5  | 3     | 2 GB   | 1 - 2
100 | 1000  | 7  | 3     | 2.5 GB | 1 - 2
100 | 1000  | 10 | 3 - 4 | 3 GB   | 1 - 2

Notes:

1. This estimating guide is based on certain experiments conducted in the Informatica lab.

2. The sizing estimates are based on PowerAnalyzer 5 running BEA WebLogic 8.1 SP3, Windows 2000, on a 4 CPU 2.5 GHz Xeon Processor. This estimate may not be accurate for other, different environments.

3. The number of concurrent users under peak volume can be estimated by using the number of total users multiplied by the percentage of concurrent users. In practice, typically 10% of the user base is concurrent. However, this percentage can be as high as 50% or as low as 5% in some organizations.

4. For every 2 CPUs on the server, Informatica recommends 1 managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.

5. Add 30 - 50% overhead for an SSL Web server architecture, depending on the strength of encryption.

6. CPU utilization can be reduced by 10 - 25% by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.

7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to span multiple machines if the server has four or more CPUs.)

8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.
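As a simple worked illustration of note 3 (the figures below are hypothetical and not drawn from the tables): an organization with 2,000 total users and an assumed 10% concurrency rate would plan for roughly 200 concurrent users.

# Illustrative only - user counts and concurrency rate are placeholders.
TOTAL_USERS=2000
CONCURRENT_PCT=10
echo $(( TOTAL_USERS * CONCURRENT_PCT / 100 ))   # 200 concurrent users to size for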

IBM AIX 5.2

# of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (PowerAnalyzer alone) | Estimated # of App Servers (Clustered)
50  | 1000  | 2  | 2      | 1 GB   | 1
100 | 1000  | 2  | 2 - 3  | 2 GB   | 1
200 | 1000  | 2  | 4 - 5  | 3.5 GB | 2 - 3
400 | 1000  | 2  | 9 - 10 | 6 GB   | 4 - 5
100 | 1000  | 2  | 2 - 3  | 2 GB   | 1
100 | 2000  | 2  | 2 - 3  | 2 GB   | 1 - 2
100 | 5000  | 2  | 2 - 3  | 3 GB   | 1 - 2
100 | 10000 | 2  | 4      | 4 GB   | 2
100 | 1000  | 2  | 2 - 3  | 2 GB   | 1
100 | 1000  | 5  | 2 - 3  | 2 GB   | 1
100 | 1000  | 7  | 2 - 3  | 2 GB   | 1 - 2
100 | 1000  | 10 | 2 - 3  | 2.5 GB | 1 - 2

Notes:

1. This estimating guide is based on certain experiments conducted in the Informatica lab.

2. The sizing estimates are based on PowerAnalyzer 5 running IBM WebSphere 5.1.1.1 and AIX 5.2.02 on a 4 CPU 2.4 GHz IBM p630. This estimate may not be accurate for other, different environments.

3. The number of concurrent users under peak volume can be estimated by using the number of total users multiplied by the percentage of concurrent users. In practice, typically 10% of the user base is concurrent. However, this percentage can be as high as 50% or as low as 5% in some organizations.

4. For every 2 CPUs on the server, Informatica recommends 1 managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.

5. Add 30 - 50% overhead for an SSL Web server architecture, depending on the strength of encryption.

6. CPU utilization can be reduced by 10 - 25% by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.

7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to span multiple machines if the server has four or more CPUs.)

8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.


PowerAnalyzer Installation

There are two main components of the PowerAnalyzer installation process: the PowerAnalyzer Repository and the PowerAnalyzer Server, which is an application deployed on an application server. A Web server is necessary to support these components and is included with the installation of the application servers. This section discusses the installation process for BEA WebLogic and IBM WebSphere. The installation tips apply to both Windows and UNIX environments. This section is intended to serve as a supplement to the PowerAnalyzer Installation Guide.

Before installing PowerAnalyzer, please complete the following steps:

• Verify that the hardware meets the minimum system requirements for PowerAnalyzer.
• Ensure that the combination of hardware, operating system, application server, repository database, and, optionally, authentication software is supported by PowerAnalyzer.
• Ensure that sufficient space has been allocated to the PowerAnalyzer repository.
• Apply all necessary patches to the operating system and database software.
• Verify connectivity to the data warehouse database (or other reporting source) and repository database.
• If LDAP or NT Domain is used for PowerAnalyzer authentication, verify connectivity to the LDAP directory server or the NT primary domain controller.
• Ensure the PowerAnalyzer license file has been obtained from [email protected].
• On UNIX/Linux installations, the OS user that is installing PowerAnalyzer must have execute privileges on all PowerAnalyzer installation executables.

In addition to the standard PowerAnalyzer components that are installed by default, other components of PowerAnalyzer that can be installed include:

• PowerCenter Integration Utility
• PowerAnalyzer SDK
• PowerAnalyzer Portal Integration Kit
• PowerAnalyzer Metadata Reporter (PAMR)

Please see the PowerAnalyzer documentation for more detailed installation instructions for these components.

Installation Steps – BEA WebLogic

The following are the basic installation steps for PowerAnalyzer on BEA WebLogic:

1. Set up the PowerAnalyzer repository database. The PowerAnalyzer Server installation process will create the repository tables, but an empty database schema needs to exist and be reachable via JDBC prior to installation.
2. Install BEA WebLogic and apply the BEA license.
3. Install PowerAnalyzer.
4. Apply the PowerAnalyzer license key.
5. Install the PowerAnalyzer Online Help.


TIP

When creating a repository in an Oracle database, make sure the storage parameters specified for the tablespace that contains the repository are not set too large. Since many target tablespaces are initially set for very large INITIAL and NEXT values, large storage parameters cause the repository to use excessive amounts of space. Also verify that the default tablespace for the user that owns the repository tables is set correctly.

The following example shows how to set the recommended storage parameters, assuming the repository is stored in the “REPOSITORY” tablespace:

ALTER TABLESPACE "REPOSITORY" DEFAULT STORAGE (
    INITIAL 10K
    NEXT 10K
    MAXEXTENTS UNLIMITED
    PCTINCREASE 50
);
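To address the last point in the tip above (verifying the default tablespace of the repository owner), a statement along the following lines can be used; the user name is a placeholder, not an account created by PowerAnalyzer:

-- "pa_repo_owner" is a hypothetical account that owns the repository tables.
ALTER USER pa_repo_owner DEFAULT TABLESPACE "REPOSITORY";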

Installation Tips – BEA WebLogic

The following are the basic installation tips for PowerAnalyzer on BEA WebLogic:

• Beginning with PowerAnalyzer 5, multiple PowerAnalyzer instances can be installed on a single instance of WebLogic. Also, other applications can co-exist with PowerAnalyzer on a single instance of WebLogic. Although this architecture should be factored in during hardware sizing estimates, it allows greater flexibility during installation.

• For WebLogic installations on UNIX, the BEA WebLogic Server installation program requires an X-Windows server. If BEA WebLogic Server is installed on a machine where an X-Windows server is not installed, an X-Windows server must be installed on another machine in order to render graphics for the GUI-based installation program. For more information on installing on UNIX, please see the “UNIX Servers” section of the installation and configuration tips below.

• If the PowerAnalyzer installation files are transferred to the PowerAnalyzer Server, they must be FTP'd in binary format.

• To view additional debugging information during UNIX installations, the LAX_DEBUG environment variable can be set to “true” (LAX_DEBUG=true).

• During the PowerAnalyzer installation process, the user will be prompted to choose an authentication method for PowerAnalyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, it is best to have the configuration parameters available during installation as the installer will configure all properties files at installation.

• The PowerAnalyzer license file and BEA WebLogic license must be applied prior to starting PowerAnalyzer.

Installation Steps – IBM WebSphere

The following are the basic installation steps for PowerAnalyzer on IBM WebSphere:

1. Set up the PowerAnalyzer repository database. The PowerAnalyzer Server installation process will create the repository tables, but the empty database schema needs to exist and be reachable via JDBC prior to installation.

2. Install IBM WebSphere and apply all WebSphere patches. WebSphere can be installed in its "Base" configuration, or in its "Network Deployment" configuration if clustering will be utilized. In both cases, patchsets will need to be applied.

3. Install PowerAnalyzer.
4. Apply the PowerAnalyzer license key.
5. Install the PowerAnalyzer Online Help.

Installation Tips – IBM WebSphere

• Starting in PowerAnalyzer 5, multiple PowerAnalyzer instances can be installed on a single instance of WebSphere. Also, other applications can co-exist with PowerAnalyzer on a single instance of WebSphere. Although this architecture should be considered during sizing estimates, it allows greater flexibility during installation.

• For WebSphere installations on UNIX, the IBM WebSphere installation program requires an X-Windows server. If IBM WebSphere is installed on a machine where an X-Windows server is not installed, an X-Windows server must be installed on another machine in order to render graphics for the GUI based installation program. For more information on installing on UNIX, please see the “UNIX Servers” section of the installation and configuration tips below.

• For WebSphere on UNIX installations, PowerAnalyzer must be installed using the root user or system administrator account. Two groups (mqm and mqbrkrs) must be created prior to the installation and the root account should be added to both of these groups.

• For WebSphere on Windows installations, ensure that PowerAnalyzer is installed under the “padaemon” local Windows user ID that is in the Administrative group and has the advanced user rights "Act as part of the operating system" and "Log on as a service." During the installation, the padaemon account will need to be added to the mqm group.

• If the PowerAnalyzer installation files are transferred to the PowerAnalyzer Server, they must be FTP’d in binary format.

• To view additional debugging information during UNIX installations, the LAX_DEBUG environment variable can be set to “true” (LAX_DEBUG=true).

• During the WebSphere installation process, the user will be prompted to enter a directory for the application server and the HTTP (web) server. In both instances, it is best to keep the default installation directory. Directory names for the application server and HTTP server that include spaces may result in errors.

• During the PowerAnalyzer installation process, the user will be prompted to choose an authentication method for PowerAnalyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is utilized, it is best to have the configuration parameters available during installation as the installer will configure all properties files at installation.

• The PowerAnalyzer license file (and any required application server license) must be applied prior to starting PowerAnalyzer.

Installation and Configuration Tips - UNIX Servers


A graphics display server is required for a PowerAnalyzer installation on UNIX. On UNIX, the graphics display server is typically an X-Windows server, although an X-Window Virtual Frame Buffer (XVFB) or personal computer X-Windows software such as WRQ Reflection-X can also be used. In any case, the X-Windows server does not need to exist on the local machine where PowerAnalyzer is being installed, but does need to be accessible. A remote X-Windows, XVFB, or PC-X Server can be used by setting the DISPLAY to the appropriate IP address, as discussed below.

If the X-Windows server is not installed on the machine where PowerAnalyzer will be installed, PowerAnalyzer can be installed using an X-Windows server installed on another machine. Simply redirect the DISPLAY variable to use the X-Windows server on another UNIX machine.

To redirect the host output, define the environment variable DISPLAY. On the command line, type the following command and press Enter:

C shell:

setenv DISPLAY <TCP/IP node of X-Windows server>:0

Bourne/Korn shell:

DISPLAY="<TCP/IP node of X-Windows server>:0"; export DISPLAY

Configuration

• PowerAnalyzer requires a means to render graphics for charting and indicators. When graphics rendering is not configured properly, charts and indicators will not be displayed properly on dashboards or reports. For PowerAnalyzer installations using an application server with JDK 1.4 and greater, the “java.awt.headless=true” setting can be set in the application server startup scripts to facilitate graphics rendering for PowerAnalyzer. If the application server does not use JDK 1.4 or later, use an X-Windows server or XVFB to render graphics. The DISPLAY environment variable should be set to the IP address of the X-Windows or XVFB server prior to starting PowerAnalyzer.

• The application server heap size is the memory allocation for the JVM. The recommended heap size greatly depends on the memory available on the machine hosting the application server and server load, but the recommended starting point is 512MB. This setting is the first setting that should be examined when tuning a PowerAnalyzer instance.
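The fragment below is a minimal sketch of how these two settings might be applied on UNIX; it assumes a WebLogic-style startWebLogic.sh that passes a JAVA_OPTIONS variable to the JVM, and the values shown are starting points only, not mandated settings:

# Illustrative only - the variable name and values are assumptions.
JAVA_OPTIONS="$JAVA_OPTIONS -Djava.awt.headless=true"   # headless graphics rendering (JDK 1.4+)
JAVA_OPTIONS="$JAVA_OPTIONS -Xms512m -Xmx512m"          # recommended 512MB starting heap
export JAVA_OPTIONS

# For a JDK older than 1.4, point DISPLAY at an X-Windows or XVFB host instead:
# DISPLAY="<X-server-host>:0"; export DISPLAY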


PowerAnalyzer Security

Challenge

Using PowerAnalyzer's sophisticated security architecture to establish a robust security system to safeguard valuable business information against a full range of technologies and security models. Ensuring that PowerAnalyzer security provides appropriate mechanisms to support and augment the security infrastructure of a Business Intelligence environment at every level.

Description

Four main architectural layers must be completely secure: user layer, transmission layer, application layer and data layer.


User layer

Users must be authenticated and authorized to access data. PowerAnalyzer integrates seamlessly with the following LDAP-compliant directory servers:

• SunOne/iPlanet Directory Server 4.1
• Sun Java System Directory Server 5.2
• Novell eDirectory Server 8.7
• IBM SecureWay Directory 3.2
• IBM SecureWay Directory 4.1
• IBM Tivoli Directory Server 5.2
• Microsoft Active Directory 2000
• Microsoft Active Directory 2003

In addition to the directory server, PowerAnalyzer supports Netegrity SiteMinder for centralizing authentication and access control for the various web applications in the organization.

Transmission layer

Data transmission must be secured and hacker-proof. PowerAnalyzer supports the standard security protocol Secure Sockets Layer (SSL) to provide a secure environment.

Application layer

Only appropriate application functionality should be provided to users with associated privileges. PowerAnalyzer provides three basic types of application-level security:

• Report, Folder & Dashboard Security – restricts users and groups to specific reports or folders and dashboards that they can access.

• Column-level Security – restricts users and groups to particular metric and attribute columns.

• Row-level Security – restricts users to specific attribute values within an attribute column of a table.

Components for Managing Application Layer Security

PowerAnalyzer users can perform different tasks based on the privileges that you grant them. PowerAnalyzer provides the following components for managing application layer security:

• Roles: A role can consist of one or more privileges. You can use system roles or create custom roles. You can grant roles to groups and/or individual users. When you edit a custom role, all groups and users with the role automatically inherit the change.

• Groups: A group can consist of users and/or groups. You can assign one or more roles to a group. Groups are created to organize logical sets of users and roles. After you create groups, you can assign users to the groups. You can also assign groups to other groups to organize privileges for related users. When you edit a group, all users and groups within the edited group inherit the change.

• Users: A user has a user name and password. Each person accessing PowerAnalyzer must have a unique user name. To set the tasks a user can perform, you can assign roles to the user or assign the user to a group with predefined roles.

Types of Roles

• System roles: PowerAnalyzer provides a set of predefined roles when the repository is created. Each role has a set of privileges assigned to it.

• Custom roles: The end user can create these roles and assign privileges to them.

Managing Groups

Groups allow you to classify users according to a particular function. You may organize users into groups based on their departments or management level. When you assign roles to a group, you grant the same privileges to all members of the group. When you change the roles assigned to a group, all users in the group inherit the changes. If a user belongs to more than one group, the user has the privileges from all groups. To organize related users into related groups, you can create group hierarchies. With hierarchical groups, each subgroup automatically receives the roles assigned to the group it belongs to. When you edit a group, all subgroups contained within it inherit the changes.

For example, you may create a Lead group and assign it the Advanced Consumer role. Within the Lead group, you create a Manager group with a custom role Manage PowerAnalyzer. Because the Manager group is a subgroup of the Lead group, it has both the Manage PowerAnalyzer and Advanced Consumer role privileges.

Belonging to multiple groups has an inclusive effect. For example – if group 1 has access to something but group 2 is excluded from that object, a user belonging to both groups 1 and 2 will have access to the object.


Managing Users

Each user must have a unique user name to access PowerAnalyzer. To perform PowerAnalyzer tasks, a user must have the appropriate privileges. You can assign privileges to a user with roles or groups.

PowerAnalyzer creates a system administrator user account when you create the repository. The default user name for the system administrator user account is admin. The system daemon, ias_scheduler, runs the updates for all time-based schedules. System daemons must have a unique user name and password in order to perform PowerAnalyzer system functions and tasks. You can change the password for a system daemon, but you cannot change the system daemon user name via the GUI. PowerAnalyzer permanently assigns the Daemon role to system daemons. You cannot assign new roles to system daemons or assign them to groups.

To change the password for a system daemon, you must complete the following steps:

1. Change the password in the Administration tab in PowerAnalyzer.
2. Change the password in the web.xml file in the PowerAnalyzer folder.
3. Restart PowerAnalyzer.

Customizing User Access

You can customize PowerAnalyzer user access with the following security options:

• Access permissions: Restrict user and/or group access to folders, reports, dashboards, attributes, metrics, template dimensions, or schedules. Use access permissions to restrict access to a particular folder or object in the repository.

• Data restrictions: Restrict user and/or group access to information in fact and dimension tables and operational schemas. Use data restrictions to prevent certain users or groups from accessing specific values when they create reports.

• Password restrictions: Restrict users from changing their passwords. Use password restrictions when you do not want users to alter their passwords.


When you create an object in the repository, every user has default read and write permissions on that object. By customizing access permissions on an object, you determine which users and/or groups can read, write, delete, or change access permissions on that object.

When you set data restrictions, you determine which users and groups can view particular attribute values. If a user with a data restriction runs a report, PowerAnalyzer does not display the restricted data to that user.

Types of Access Permissions

Access permissions determine the tasks you can perform for a specific repository object. When you set access permissions, you determine which users and groups have access to the folders and repository objects. You can assign the following types of access permissions to repository objects:

• Read: Allows you to view a folder or object.
• Write: Allows you to edit an object. Also allows you to create and edit folders and objects within a folder.
• Delete: Allows you to delete a folder or an object from the repository.
• Change permission: Allows you to change the access permissions on a folder or object.

By default, PowerAnalyzer grants read and write access permissions to every user in the repository. You can use the General Permissions area to modify default access permissions for an object, or turn off default access permissions.

Data Restrictions

You can restrict access to data based on the values of related attributes. Data restrictions are set to keep sensitive data from appearing in reports. For example, you may want to keep outside vendors from seeing data related to the performance of a new store. You can set a data restriction that excludes the store ID from their reports.

You can set data restrictions using one of the following methods:

• Set data restrictions by object. Restrict access to attribute values in a fact table, operational schema, real-time connector, and real-time message stream. You can apply the data restriction to users and groups in the repository. Use this method to apply the same data restrictions to more than one user or group.

• Set data restrictions for one user at a time. Edit a user account or group to restrict user or group access to specified data. You can set one or more data restrictions for each user or group. Use this method to set custom data restrictions for different users or groups

Two Types of Data Restrictions

You can set two kinds of data restrictions:


• Inclusive: Use the IN option to allow users to access data related to the attributes you select. For example, to allow users to view only data from the year 2001, create an “IN 2001” rule.

• Exclusive: Use the NOT IN option to restrict users from accessing data related to the attributes you select. For example, to allow users to view all data except from the year 2001, create a “NOT IN 2001” rule.

Restricting Data Access by User or Group

You can edit a user or group profile to restrict the data the user or group can access in reports. When you edit a user profile, you can set data restrictions for any schema in the repository, including operational schemas and fact tables.

You can set a data restriction to limit user or group access to data in a single schema based on the attributes you select. If the attributes apply to more than one schema in the repository, you can also restrict the user or group access from related data across all schemas in the repository. For example, you have a Sales fact table and Salary fact table. Both tables use the Region attribute. You can set one data restriction that applies to both the sales and salary fact tables based on the region you select.

To set data restrictions for a user or group, you need the following role or privilege:

• System Administrator role
• Access Management privilege

When PowerAnalyzer runs scheduled reports that have provider-based security, it runs reports against the data restrictions for the report owner. However, if the reports have consumer-based security then the PowerAnalyzer Server will create a separate report for each unique security profile.

The following steps apply to changing the admin user for WebLogic only.

To change the PowerAnalyzer default users (admin and ias_scheduler):

1. Back up the repository.
2. Go to the WebLogic library directory: .\bea\wlserver6.1\lib
3. Open the file ias.jar and locate the file entry called InfChangeSystemUserNames.class
4. Extract the file "InfChangeSystemUserNames.class" into a temporary directory (example: d:\temp)
5. This will extract the file as 'd:\temp\Repository Utils\Refresh\InfChangeSystemUserNames.class'
6. Create a batch file (change_sys_user.bat) with the following commands in the directory D:\Temp\Repository Utils\Refresh\

REM To change the system user name and password

REM *******************************************

REM Change the BEA home here


REM ************************

set JAVA_HOME=E:\bea\wlserver6.1\jdk131_06

set WL_HOME=E:\bea\wlserver6.1

set CLASSPATH=%WL_HOME%\sql

set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\jconn2.jar

set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\classes12.zip

set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\weblogic.jar

set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias.jar

set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias_securityadapter.jar

set CLASSPATH=%CLASSPATH%;%WL_HOME%\infalicense

REM Change the DB information here and also

REM the -Dias_scheduler and -Dadmin users to values of your choice

REM *************************************************************

%JAVA_HOME%\bin\java -Ddriver=com.informatica.jdbc.sqlserver.SQLServerDriver -Durl=jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name -Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler -Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames

REM END OF BATCH FILE

7. Make changes in the batch file as directed in the remarks [REM lines].
8. Save the file, open a command prompt window, and navigate to D:\Temp\Repository Utils\Refresh\
9. At the prompt, type change_sys_user.bat and press Enter.

The users "ias_scheduler" and "admin" will be changed to "pa_scheduler" and "paadmin", respectively.

10. Modify web.xml and weblogic.xml (located at .\bea\wlserver6.1\config\informatica\applications\ias\WEB-INF) by replacing ias_scheduler with 'pa_scheduler'

11. Replace ias_scheduler with pa_scheduler in the xml file weblogic-ejb-jar.xml

This file is in the iasEjb.jar file located in the directory .\bea\wlserver6.1\config\informatica\applications\


To edit the file:

Make a copy of iasEjb.jar, then:

a. mkdir \tmp
b. cd \tmp
c. jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
d. cd META-INF
e. Update META-INF/weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler
f. cd \
g. jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .

NOTE: There is a trailing period at the end of the command above.

12. Restart the server.

TIP

To use the PowerAnalyzer user login name as a filter for a report:

PowerAnalyzer Version 4.0 provides a System Variable USER_LOGIN that functions as a Global Variable for the username; it is used whenever the user submits SQL to the RDBMS.

1. Build a cross-reference table in the database that includes USER_ID and specifies those tables that the user can access.
2. Use the System Variable USER_LOGIN that sources from that cross-reference table. It can be referenced both:
   a. in the filters area (by overriding the SQL and inserting $USER_ID$), or
   b. in the WHERE clause box under "Dimension tables" (again specifying $USER_ID$).
3. The system parses $USER_ID$ when it sees it in the report SQL and populates it with the value from the USER_LOGIN System Variable.
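A minimal sketch of this approach follows; the table and column names are hypothetical, not objects created by PowerAnalyzer:

-- Hypothetical cross-reference table mapping a login to the stores it may see.
CREATE TABLE USER_STORE_XREF (
    USER_ID   VARCHAR2(50),
    STORE_ID  NUMBER
);

-- Illustrative WHERE-clause fragment for the "Dimension tables" box; at run time
-- PowerAnalyzer substitutes $USER_ID$ with the value of the USER_LOGIN variable:
-- STORE.STORE_ID IN (SELECT STORE_ID FROM USER_STORE_XREF WHERE USER_ID = $USER_ID$)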


Tuning and Configuring PowerAnalyzer and PowerAnalyzer Reports

Challenge

A PowerAnalyzer report that is slow to return data means lag time to a manager or business analyst. It can be a crucial point of failure in the acceptance of a data warehouse. This Best Practice offers some suggestions for tuning PowerAnalyzer and PowerAnalyzer reports.

Description

Performance tuning reports occurs both at the environment level and the reporting level. Often report performance can be enhanced by looking closely at the objective of the report rather than the suggested appearance. The following guidelines should help with tuning the environment and the report itself.

1. Perform Benchmarking. Benchmark the reports to determine an expected rate of return. Perform benchmarks at various points throughout the day and evening hours to account for inconsistencies in network traffic. This will provide a baseline to measure changes against.

2. Review Report. Confirm that all data elements are required in report. Eliminate any unnecessary data elements, filters and calculations. Also be sure to remove any extraneous charts or graphs. Consider if the report can be broken into multiple reports or presented at a higher level. These are often ways to create more visually appealing reports and allow for linked detail reports or drill down to detail level.

3. Scheduling of Reports. If the report is on-demand but can be changed to a scheduled report, schedule the report to run during hours when the system use is minimized. Consider scheduling large numbers of reports to run overnight. If mid-day updates are required, test the performance at lunch hours and consider scheduling for that time period. Reports that require filters by users can often be copied and filters pre-created to allow for scheduling of the report.

4. Evaluate Database. Database tuning occurs on multiple levels. Begin by reviewing the tables used in the report. Ensure that indexes have been created on dimension keys. If filters are used on attributes, test the creation of secondary indices to improve the efficiency of the query (an illustrative index example appears after this list). Next, execute reports while a DBA monitors the database environment. This gives the DBA the opportunity to tune the database for querying. Finally, look into changes in database settings. Increasing the database memory in the initialization file often improves PowerAnalyzer performance significantly.

5. Investigate Network. Reports are simply database queries, which can be found by clicking the "View SQL" button on the report. Run the query from the report, against the database using a client tool on the server the database resides on. One caveat to this is that even the database tool on the server may contact the outside network. Work with the DBA during this test to use a local database connection, (e.g., Bequeath / IPC Oracle’s local database communication protocol) and monitor the database throughout this process. This test will pinpoint if the bottleneck is occurring on the network or in the database. If for instance, the query performs similarly regardless of where it is executed, but the report continues to be slow, this indicates a web server bottleneck. Common locations for network bottlenecks include router tables, web server demand, and server input/output. Informatica does recommend installing PowerAnalyzer on a dedicated web server.

6. Tune the Schema. Having tuned the environment and minimized the report requirements, the final level of tuning involves changes to the database tables. Review the underperforming reports.

Can any of these be generated off of aggregate tables instead of base tables? PowerAnalyzer makes efficient use of linked aggregate tables by determining on a report-by-report basis if the report can utilize an aggregate table. By studying the existing reports and future requirements, you can determine what key aggregates can be created in the ETL tool and stored in the database.

Calculated metrics can also be created in an ETL tool and stored in the database instead of created in PowerAnalyzer. Each time a calculation must be done in PowerAnalyzer, it is being performed as part of the query process. To determine if a query can be improved by building these elements in the database, try removing them from the report and comparing report performance. Consider if these elements are appearing in a multitude of reports or simply a few.

7. Database Queries. As a last resort for under-performing reports, you may want to edit the actual report query. To determine if the query is the bottleneck, select the View SQL button on the report. Next, copy the SQL into a query utility and execute. (DBA assistance maybe beneficial here.) If the query appears to be the bottleneck, revisit Steps 2 and 6 above to ensure that no additional report changes are possible. Once you have confirmed that the report is as required, work to edit the query while continuing to re-test it in a query utility. Additional options include utilizing database views to cache data prior to report generation. Reports are then built based on the view.

WARNING: editing the report query requires query editing for each report change and may require editing during migrations. Be aware that this is a time-consuming process and a difficult-to-maintain method of performance tuning.
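As referenced in step 4 above, the following illustrates indexing a dimension key and a frequently filtered attribute; all table and column names are placeholders for your own warehouse objects:

-- Hypothetical objects - substitute your own fact and dimension tables.
CREATE INDEX IDX_SALES_FACT_STORE_KEY ON SALES_FACT (STORE_KEY);
CREATE INDEX IDX_STORE_DIM_REGION ON STORE_DIM (REGION);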

The PowerAnalyzer repository database should be tuned for an OLTP workload.


Tuning JVM

JVM Layout

The JVM heap is the repository for all live objects, dead objects, and free memory. The JVM has the following primary jobs:

• Execute code
• Manage memory
• Remove garbage objects

The size of the JVM heap determines how often and how long garbage collection runs.

For WebLogic, JVM parameters can be set in "startWebLogic.cmd" or "startWebLogic.sh".

Parameters of JVM

1. The -Xms and -Xmx parameters define the minimum and maximum heap size; for large applications, the values should be set equal to each other.

2. Start with -Xms512m -Xmx512m; as needed, increase the heap by 128m or 256m to reduce garbage collection.

3. The permanent generation holds the JVM's class and method objects; the -XX:MaxPermSize command-line parameter controls the permanent generation's size.

4. The "NewSize" and "MaxNewSize" parameters control the new generation's minimum and maximum size.

5. -XX:NewRatio=5 divides the heap old-to-new in the ratio 5:1 (i.e., the old generation occupies 5/6 of the heap while the new generation occupies 1/6 of the heap).

o When the new generation fills up, it triggers a minor collection, in which surviving objects are moved to the old generation.

o When the old generation fills up, it triggers a major collection, which involves the entire object heap. This is more expensive in terms of resources than a minor collection.

6. If you increase the new generation size, the old generation size decreases. Minor collections occur less often, but the frequency of major collections increases.

7. If you decrease the new generation size, the old generation size increases. Minor collections occur more often, but the frequency of major collections decreases.

8. As a general rule, keep the new generation smaller than half the heap size (i.e., 1/4 or 1/3 of the heap size).

9. Enable additional JVMs if you expect a large number of users. Informatica typically recommends two to three CPUs per JVM.
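Putting several of these parameters together, a WebLogic startup script might pass options along the following lines; this is a sketch only, and the heap and generation-sizing values are examples to be tuned for the specific environment:

# Illustrative fragment for startWebLogic.sh - values are examples only.
JAVA_OPTIONS="$JAVA_OPTIONS -Xms512m -Xmx512m -XX:NewRatio=5 -XX:MaxPermSize=128m"
export JAVA_OPTIONS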



Other areas to tune

Execute Threads

• Threads available to process simultaneous operations in WebLogic.
• Too few threads means CPUs are under-utilized and jobs wait for threads to become available.
• Too many threads means the system wastes resources managing threads, and the OS performs unnecessary context switches.
• The default is 15 threads. Informatica recommends using the default value, but you may need to experiment to determine the optimal value for your environment.

Connection Pooling

The application borrows a connection from the pool, uses it, and then returns it to the pool by closing it.

• Initial capacity = 15
• Maximum capacity = 15
• The sum of connections across all pools should equal the number of execute threads.

Setting the initial and maximum pool size to the same value avoids the overhead of growing and shrinking the pool dynamically.

Performance packs use platform-optimized (i.e., native) sockets to improve server performance. They are available on Windows NT/2000 (installed by default), Solaris 2.6/2.7, AIX 4.3, HP/UX, and Linux.

• Check Enable Native I/O on the server attribute tab.
• This adds <NativeIOEnabled> to config.xml with a value of true (see the sketch below).
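For illustration only, the resulting entry in config.xml would resemble the following. The server name is a placeholder, and the other Server attributes (omitted here) are left as generated by WebLogic:

<Server Name="myserver" NativeIOEnabled="true"/>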

For WebSphere, use the Performance Tuner to modify the configurable parameters.

For an optimal configuration, host the application server, the data warehouse, and the repository on separate dedicated machines.

Application Server-Specific Tuning Details

JBoss Application Server

Web Container. Tune the web container by modifying the following configuration file so that it accepts a reasonable number of HTTP requests, as required by the PowerAnalyzer installation. Ensure that the web container has an optimal number of threads available so that it can accept and process more HTTP requests.

<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/META-INF/jboss-service.xml

The following is a typical configuration:

<!-- A HTTP/1.1 Connector on port 8080 -->
<Connector className="org.apache.coyote.tomcat4.CoyoteConnector"
    port="8080" minProcessors="10" maxProcessors="100"
    enableLookups="true" acceptCount="20" debug="0"
    tcpNoDelay="true" bufferSize="2048" connectionLinger="-1"
    connectionTimeout="20000" />

The following parameters may need tuning:

• minProcessors. Number of threads created initially in the pool.
• maxProcessors. Maximum number of threads that can ever be created in the pool.
• acceptCount. Controls the length of the queue of waiting requests when no more threads are available from the pool to process the request.
• connectionTimeout. Amount of time to wait before a URI is received from the stream. The default is 20 seconds. This avoids problems where a client opens a connection but does not send any data.

• tcpNoDelay. Set to true when data should be sent to the client without waiting for the buffer to be full. This reduces latency at the cost of more packets being sent over the network. The default is true.

• enableLookups. Whether to perform a reverse DNS lookup to prevent spoofing. Reverse lookups can cause problems when a DNS server is misbehaving; the enableLookups parameter can be turned off when you implicitly trust all clients.

• connectionLinger. How long connections should linger after they are closed. Informatica recommends using the default value: -1 (no linger).

In the PowerAnalyzer application, each web page can potentially generate more than one request to the application server. Hence, maxProcessors should always be greater than the actual number of concurrent users. For an installation with 20 concurrent users, a minProcessors of 5 and a maxProcessors of 100 are suitable values.
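As a sketch of that 20-concurrent-user example only, the Connector element shown earlier would change in just the two thread-pool attributes; everything else is carried over unchanged:

<Connector className="org.apache.coyote.tomcat4.CoyoteConnector"
    port="8080" minProcessors="5" maxProcessors="100"
    enableLookups="true" acceptCount="20" debug="0"
    tcpNoDelay="true" bufferSize="2048" connectionLinger="-1"
    connectionTimeout="20000" />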

If the number of threads is too low, the following message may appear in the log files:

ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads

JSP Optimization. To avoid having the application server compile JSP scripts when they are executed for the first time, Informatica ships PowerAnalyzer with pre-compiled JSPs.

The following is a typical configuration:

<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/web.xml


<servlet>
    <servlet-name>jsp</servlet-name>
    <servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
    <init-param>
        <param-name>logVerbosityLevel</param-name>
        <param-value>WARNING</param-value>
        <param-name>development</param-name>
        <param-value>false</param-value>
    </init-param>
    <load-on-startup>3</load-on-startup>
</servlet>

The following parameter may need tuning:

• Set the development parameter to false in a production installation.

Database Connection Pool. PowerAnalyzer accesses the repository database to retrieve metadata information. When it runs reports, it accesses the data sources to get the report information. PowerAnalyzer keeps a pool of database connections for the repository. It also keeps a separate database connection pool for each data source. To optimize PowerAnalyzer database connections, you can tune the database connection pools.

Repository Database Connection Pool. To optimize the repository database connection pool, modify the JBoss configuration file:

<JBOSS_HOME>/server/informatica/deploy/<DB_Type>_ds.xml

The name of the file includes the database type. <DB_Type> can be Oracle, DB2, or other databases. For example, for an Oracle repository, the configuration file name is oracle_ds.xml.

The following is a typical configuration:

<datasources>
    <local-tx-datasource>
        <jndi-name>jdbc/IASDataSource</jndi-name>
        <connection-url>jdbc:informatica:oracle://aries:1521;SID=prfbase8</connection-url>
        <driver-class>com.informatica.jdbc.oracle.OracleDriver</driver-class>
        <user-name>powera</user-name>
        <password>powera</password>
        <exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter</exception-sorter-class-name>
        <min-pool-size>5</min-pool-size>
        <max-pool-size>50</max-pool-size>
        <blocking-timeout-millis>5000</blocking-timeout-millis>
        <idle-timeout-minutes>1500</idle-timeout-minutes>
    </local-tx-datasource>
</datasources>


The following parameters may need tuning:

• min-pool-size. The minimum number of connections in the pool. (The pool is lazily constructed, i.e. it will be empty until it is first accessed. Once used, it will always have at least the min-pool-size connections.)

• max-pool-size. The strict maximum size of the connection pool.
• blocking-timeout-millis. The maximum time in milliseconds that a caller waits to get a connection when no more free connections are available in the pool.
• idle-timeout-minutes. The length of time an idle connection remains in the pool before it is closed and removed.

The max-pool-size value should be at least five more than the maximum number of concurrent users, because several scheduled reports may be running in the background and each of them needs a database connection.

A higher value is recommended for idle-timeout-minutes. Since PowerAnalyzer accesses the repository very frequently, it is inefficient to spend resources on checking for idle connections and cleaning them out. Checking for idle connections may block other threads that require new connections.
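As an illustration, for an assumed installation with 50 concurrent users the pool elements in the <DB_Type>_ds.xml file might be adjusted as follows (a sketch only; the remaining elements of the datasource definition stay as shown above, and the exact sizes should be confirmed by testing):

<min-pool-size>10</min-pool-size>
<max-pool-size>55</max-pool-size>
<blocking-timeout-millis>5000</blocking-timeout-millis>
<idle-timeout-minutes>1500</idle-timeout-minutes>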

Data Source Database Connection Pool. Similar to the repository database connection pools, the data source also has a pool of connections that PowerAnalyzer dynamically creates as soon as the first client requests a connection.

The tuning parameters for these dynamic pools are present in following file:

<JBOSS_HOME>/bin/IAS.properties

The following is a typical configuration:

#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20

The following JBoss-specific parameters may need tuning:

• dynapool.initialCapacity. The minimum number of initial connections in the data source pool.

• dynapool.maxCapacity. The maximum number of connections that the data source pool may grow to.

• dynapool.poolNamePrefix. This parameter is a prefix added to the dynamic JDBC pool name for identification purposes.


• dynapool.waitSec. The maximum amount of time (in seconds) a client will wait to grab a connection from the pool if none is readily available.

• dynapool.refreshTestMinutes. This parameter determines the frequency at which a health check is performed on the idle connections in the pool. This should not be performed too frequently because it locks up the connection pool and may prevent other clients from grabbing connections from the pool.

• dynapool.shrinkPeriodMins. This parameter determines the amount of time (in minutes) an idle connection is allowed to be in the pool. After this period, the number of connections in the pool shrinks back to the value of its initialCapacity parameter. This is done only if the allowShrinking parameter is set to true.

EJB Container

PowerAnalyzer uses EJBs extensively. It has more than 50 stateless session beans (SLSB) and more than 60 entity beans (EB). In addition, there are six message-driven beans (MDBs) that are used for the scheduling and real-time functionalities.

Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is the EJB pool. You can tune the EJB pool parameters in the following file:

<JBOSS_HOME>/server/Informatica/conf/standardjboss.xml.

The following is a typical configuration:

<container-configuration> <container-name> Standard Stateless SessionBean</container-name> <call-logging>false</call-logging> <invoker-proxy-binding-name> stateless-rmi-invoker</invoker-proxy-binding-name> <container-interceptors> <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor </interceptor> <interceptor> org.jboss.ejb.plugins.LogInterceptor</interceptor> <interceptor> org.jboss.ejb.plugins.SecurityInterceptor</interceptor> <!-- CMT --> <interceptor transaction="Container"> org.jboss.ejb.plugins.TxInterceptorCMT</interceptor> <interceptor transaction="Container" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor transaction="Container"> org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor </interceptor> <!-- BMT --> <interceptor transaction="Bean"> org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor </interceptor> <interceptor transaction="Bean"> org.jboss.ejb.plugins.TxInterceptorBMT</interceptor> <interceptor transaction="Bean" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor>


org.jboss.resource.connectionmanager.CachedConnectionInterceptor </interceptor> </container-interceptors> <instance-pool> org.jboss.ejb.plugins.StatelessSessionInstancePool</instance-pool> <instance-cache></instance-cache> <persistence-manager></persistence-manager> <container-pool-conf> <MaximumSize>100</MaximumSize> </container-pool-conf> </container-configuration>

The following parameter may need tuning:

• MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the number of objects that will be created. If <strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are requests for more objects. However, only the <MaximumSize> number of objects will be returned to the pool.

Additionally, there are two other parameters that you can set to fine-tune the EJB pool (see the sketch after this list). These two parameters are not set by default in PowerAnalyzer. They can be tuned, after proper iterative testing in PowerAnalyzer, to increase throughput for high-concurrency installations.

• strictMaximumSize. When the value is set to true, the <strictMaximumSize> enforces a rule that only <MaximumSize> number of objects will be active. Any subsequent requests will wait for an object to be returned to the pool.

• strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests will wait for an object to be made available in the pool.
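A minimal sketch of the pool section with these optional elements added follows. The values are illustrative assumptions (and the strictTimeout value here assumes milliseconds); tune them only after the iterative testing noted above:

<container-pool-conf>
    <MaximumSize>100</MaximumSize>
    <strictMaximumSize>true</strictMaximumSize>
    <strictTimeout>30000</strictTimeout>
</container-pool-conf>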

Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless bean tuning parameters. The main difference is that MDBs are not invoked by clients. Instead, the messaging system delivers messages to the MDB when they are available.

To tune the MDB parameters, modify the following configuration file:

<JBOSS_HOME>/server/informatica/conf/standardjboss.xml

The following is a typical configuration:

<container-configuration> <container-name>Standard Message Driven Bean</container-name> <call-logging>false</call-logging> <invoker-proxy-binding-name>message-driven-bean </invoker-proxy-binding-name> <container-interceptors> <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor> <interceptor>org.jboss.ejb.plugins.RunAsSecurityInterceptor </interceptor>


<!-- CMT --> <interceptor transaction="Container"> org.jboss.ejb.plugins.TxInterceptorCMT</interceptor> <interceptor transaction="Container" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor </interceptor> <interceptor transaction="Container"> org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor </interceptor> <!-- BMT --> <interceptor transaction="Bean"> org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor </interceptor> <interceptor transaction="Bean"> org.jboss.ejb.plugins.MessageDrivenTxInterceptorBMT </interceptor> <interceptor transaction="Bean" metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor> org.jboss.resource.connectionmanager.CachedConnectionInterceptor </interceptor> </container-interceptors> <instance-pool>org.jboss.ejb.plugins.MessageDrivenInstancePool </instance-pool> <instance-cache></instance-cache> <persistence-manager></persistence-manager> <container-pool-conf> <MaximumSize>100</MaximumSize> </container-pool-conf> </container-configuration>

The following parameter may need tuning:

MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit on the number of objects that will be created. Otherwise, if <strictMaximumSize> is set to false, the number of active objects can exceed <MaximumSize> if there are requests for more objects. However, only <MaximumSize> objects will be returned to the pool.

Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are not set by default in PowerAnalyzer. They can be tuned after you have done proper iterative testing in PowerAnalyzer to increase the throughput for high-concurrency installations.

• strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that only <MaximumSize> number of objects will be active. Any subsequent requests will wait for an object to be returned to the pool.

• strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests will wait for an object to be made available in the pool.


Entity Beans (EB). PowerAnalyzer entity beans use BMP (bean-managed persistence) as opposed to CMP (container-managed persistence). The entity bean tuning parameters are very similar to the stateless session bean tuning parameters.

The EJB tuning parameters are in the following configuration file:

<JBOSS_HOME>/server/informatica/conf/standardjboss.xml.

The following is a typical configuration:

<container-configuration> <container-name>Standard BMP EntityBean</container-name> <call-logging>false</call-logging> <invoker-proxy-binding-name>entity-rmi-invoker </invoker-proxy-binding-name> <sync-on-commit-only>false</sync-on-commit-only> <container-interceptors> <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor> <interceptor>org.jboss.ejb.plugins.SecurityInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.TxInterceptorCMT </interceptor> <interceptor metricsEnabled="true"> org.jboss.ejb.plugins.MetricsInterceptor</interceptor> <interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.EntityLockInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor </interceptor> <interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor </interceptor> <interceptor> org.jboss.resource.connectionmanager.CachedConnectionInterceptor </interceptor> <interceptor> org.jboss.ejb.plugins.EntitySynchronizationInterceptor </interceptor> </container-interceptors> <instance-pool>org.jboss.ejb.plugins.EntityInstancePool </instance-pool> <instance-cache>org.jboss.ejb.plugins.EntityInstanceCache </instance-cache> <persistence-manager>org.jboss.ejb.plugins.BMPPersistenceManager </persistence-manager> <locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock </locking-policy> <container-cache-conf> <cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy </cache-policy>


<cache-policy-conf> <min-capacity>50</min-capacity> <max-capacity>1000000</max-capacity> <overager-period>300</overager-period> <max-bean-age>600</max-bean-age> <resizer-period>400</resizer-period> <max-cache-miss-period>60</max-cache-miss-period> <min-cache-miss-period>1</min-cache-miss-period> <cache-load-factor>0.75</cache-load-factor> </cache-policy-conf> </container-cache-conf> <container-pool-conf> <MaximumSize>100</MaximumSize> </container-pool-conf> <commit-option>A</commit-option> </container-configuration>

The following parameter may need tuning:

MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit on the number of objects that will be created. Otherwise, if <strictMaximumSize> is set to false, the number of active objects can exceed <MaximumSize> if there are requests for more objects. However, only <MaximumSize> objects will be returned to the pool.

Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are not set by default in PowerAnalyzer. They can be tuned after you have done proper iterative testing in PowerAnalyzer to increase the throughput for high-concurrency installations.

• strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that only <MaximumSize> number of objects will be active. Any subsequent requests will wait for an object to be returned to the pool.

• strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests will wait for an object to be made available in the pool.

RMI Pool

The JBoss Application Server can be configured to have a pool of threads to accept connections from clients for remote method invocation (RMI). If you use the Java RMI protocol to access the PowerAnalyzer API from other custom applications, you can optimize the RMI thread pool parameters.

To optimize the RMI pool, modify the following configuration file:

<JBOSS_HOME>/server/informatica/conf/jboss-service.xml

The following is a typical configuration:


<mbeancode="org.jboss.invocation.pooled.server.PooledInvoker"name="jboss:service=invoker,type=pooled"> <attribute name="NumAcceptThreads">1</attribute> <attribute name="MaxPoolSize">300</attribute> <attribute name="ClientMaxPoolSize">300</attribute> <attribute name="SocketTimeout">60000</attribute> <attribute name="ServerBindAddress"></attribute> <attribute name="ServerBindPort">0</attribute> <attribute name="ClientConnectAddress"></attribute> <attribute name="ClientConnectPort">0</attribute> <attribute name="EnableTcpNoDelay">false</attribute> <depends optional-attribute-name="TransactionManagerService"> jboss:service=TransactionManager </depends> </mbean>

The following parameters may need tuning:

• NumAcceptThreads. The number of threads used to accept connections from the client.

• MaxPoolSize. A strict maximum size for the pool of threads to service requests on the server.

• ClientMaxPoolSize. A strict maximum size for the pool of threads to service requests on the client.

• Backlog. The number of requests in the queue when all the processing threads are in use.

• EnableTcpNoDelay. Indicates whether data should be sent before the buffer is full. Setting it to true may increase network traffic because more packets are sent across the network.

WebSphere Application Server 5.1. The Tivoli Performance Viewer can be used to observe the behavior of some of these parameters and arrive at good settings.

Web Container

Navigate to “Application Servers > [your_server_instance] > Web Container > Thread Pool” to tune the following parameters.

• Minimum Size: Specifies the minimum number of threads to allow in the pool. The default value of 10 is appropriate.

• Maximum Size: Specifies the maximum number of threads to allow in the pool. For a highly concurrent usage scenario (with a 3-VM load-balanced configuration), a value of 50 to 60 has been determined to be optimal.

• Thread Inactivity Timeout: Specifies the number of milliseconds of inactivity that should elapse before a thread is reclaimed. The default of 3500ms is considered optimal.

• Is Growable: Specifies whether the number of threads can increase beyond the maximum size configured for the thread pool. Be sure to leave this option unchecked. Also the maximum threads should be hard-limited to the value given in the “Maximum Size”.


Note: In a load-balanced environment, there will be more than one server instance that may possibly be spread across multiple machines. In such a scenario, be sure that the changes have been properly propagated to all the server instances.

Transaction Services

• Total transaction lifetime timeout: In certain circumstances (e.g. import of large XML files) the default value of 120 seconds may not be sufficient and should be increased. This parameter can be modified during runtime also.

Diagnostic Trace Services

• Disable the trace in a production environment.
• Navigate to “Application Servers > [your_server_instance] > Administration Services > Diagnostic Trace Service” and make sure “Enable Tracing” is not checked.

Debugging Services

Ensure that tracing is disabled in a production environment.

Navigate to “Application Servers > [your_server_instance] > Logging and Tracing > Diagnostic Trace Service > Debugging Service “ and make sure “Startup” is not checked.

Performance Monitoring Services

This set of parameters is for monitoring the health of the Application Server. This monitoring service tries to ping the application server after a certain interval; if the server is found to be dead, then it tries to restart the server.

Navigate to “Application Servers > [your_server_instance] > Process Definition > MonitoringPolicy “ and tune the parameters according to a policy determined for each PowerAnalyzer installation.

Note: The parameter “Ping Timeout” determines the time after which a no-response from the server implies that it is faulty. Then the monitoring service attempts to kill the server and restart it if “Automatic restart” is checked. Take care that “Ping Timeout” is not set to too small a value.

Process Definitions (JVM Parameters)

For PowerAnalyzer with high number of concurrent users, Informatica recommends that the minimum and the maximum heap size be set to the same values. This avoids the heap allocation-reallocation expense during a high-concurrency scenario. Also, for a high-concurrency scenario, Informatica recommends setting the values of minimum heap and maximum heap size to at least 1000MB. Further tuning of this heap-size is recommended after carefully studying the garbage collection behavior by turning on the verbosegc option.


The following is a list of java parameters (for IBM JVM 1.4.1) that should NOT be modified from the default values for PowerAnalyzer installation:

• -Xnocompactgc: This parameter switches off heap compaction altogether. Switching off heap compaction results in heap fragmentation. Since PowerAnalyzer frequently allocates large objects, heap fragmentation can result in OutOfMemory exceptions.

• -Xcompactgc: Using this parameter leads to each garbage collection cycle carrying out compaction, regardless of whether it's useful.

• -Xgcthreads: This controls the number of garbage collection helper threads created by the JVM during startup. The default is N-1 threads for an N-processor machine. These threads provide the parallelism in parallel mark and parallel sweep modes, which reduces the pause time during garbage collection.

• -Xclassnogc: This disables collection of class objects.
• -Xinitsh: This sets the initial size of the application-class system heap. The system heap is expanded as needed and is never garbage collected.

You may want to alter the following parameters after carefully examining the application server processes:

• Navigate to “Application Servers > [your_server_instance] > Process Definition > Java Virtual Machine"

• Verbose garbage collection: This option can be checked to turn on verbose garbage collection. This can help in understanding the behavior of the garbage collection for the application. It has a very low overhead on performance and can be turned on even in the production environment.

• Initial heap size: This is the -ms value. Only the numeric value (without MB) needs to be specified. For concurrent usage, the initial heap size should start at 1000 and, depending on the garbage collection behavior, can be increased up to 2000 (see the sketch after this list). A value beyond 2000 may actually reduce throughput because the garbage collection cycles take more time to go through the large heap, even though the cycles may occur less frequently.

• Maximum heap size: This is the –mx value. It should be equal to the “Initial heap size” value.

• RunHProf: This should remain unchecked in production mode, because it slows down the VM considerably.

• Debug Mode: This should remain unchecked in production mode, because it slows down the VM considerably.

• Disable JIT: This should remain unchecked (i.e., JIT should never be disabled).
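For orientation only, the console settings above correspond to JVM flags along the following lines; in WebSphere these values are entered in the console fields rather than passed on a command line, and the 1000MB figure is simply the assumed starting point discussed above:

-Xms1000m -Xmx1000m -verbose:gc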

Performance Monitoring Services

Be sure that performance monitoring services are not enabled in a production environment.

Navigate to “Application Servers > [your_server_instance] > Performance Monitoring Services“ and be sure “Startup” is not checked.

Database Connection Pool


The repository database connection pool can be configured by navigating to “JDBC Providers > User-defined JDBC Provider > Data Sources > IASDataSource > Connection Pools”

The various parameters that may need tuning are:

• Connection Timeout: The default value of 180 seconds should be adequate. This means that after 180 seconds, a request to grab a connection from the pool times out, and PowerAnalyzer throws an exception. In that case, the pool size may need to be increased.

• Max Connections: The maximum number of connections in the pool. Informatica recommends a value of 50 for this.

• Min Connections: The minimum number of connections in the pool. Informatica recommends a value of 10 for this.

• Reap Time: This specifies the interval, in seconds, between runs of the pool maintenance thread. Pool maintenance should not run too frequently because, while the maintenance thread is running, it blocks the whole pool and no process can grab a new connection from the pool. If the database and the network are reliable, set this to a very high value (e.g., 1000).

• Unused Timeout: This specifies the time in seconds after which an unused connection is discarded, until the pool size reaches the minimum size. In a highly concurrent installation, this should be a high value. The default of 1800 seconds should be fine.

• Aged Timeout: Specifies the interval in seconds before a physical connection is discarded. If the database and the network are stable, there should not be a reason for age timeout. The default is 0 (i.e., connections do not age). If the database or the network connection to the repository database frequently comes down (compared to the life of the AppServer), this may be used to age out the stale connections.

Much like the repository database connection pools, the data source or data warehouse databases also have a pool of connections that are created dynamically by PowerAnalyzer as soon as the first client makes a request.

The tuning parameters for these dynamic pools are present in <WebSphere_Home>/AppServer/IAS.properties file.

The following is a typical configuration:

#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2


dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20

The various parameters that may need tuning are:

• dynapool.initialCapacity: the minimum number of initial connections in the data-source pool.

• dynapool.maxCapacity: the maximum number of connections that the data-source pool may grow up to.

• dynapool.poolNamePrefix: this is just a prefix added to the dynamic JDBC pool name for identification purposes.

• dynapool.waitSec: the maximum amount of time (in seconds) that a client will wait to grab a connection from the pool if none is readily available.

• dynapool.refreshTestMinutes: this determines the frequency at which a health check on the idle connections in the pool is performed. Such checks should not be performed too frequently because they lock up the connection pool and may prevent other clients from grabbing connections from the pool.

• dynapool.shrinkPeriodMins: this determines the amount of time (in minutes) an idle connection is allowed to be in the pool. After this period, the number of connections in the pool decreases (to its initialCapacity). This is done only if allowShrinking is set to true.

Message Listeners Services

To process scheduled reports, PowerAnalyzer uses message-driven beans. It is possible to run multiple reports within one schedule in parallel by increasing the number of instances of the MDB that serves the Scheduler (InfScheduleMDB). Take care, however, not to raise this to an arbitrarily high value: each report consumes considerable resources (e.g., database connections and CPU processing at both the application server and the database server), so setting it very high may actually be detrimental to the whole system.

Navigate to “Application Servers > [your_server_instance] > Message Listener Service > Listener Ports > IAS_ScheduleMDB_ListenerPort” .

The parameters that can be tuned are:

• Maximum sessions: The default value is 1. On a highly-concurrent user scenario, Informatica does not recommend going beyond 5.


• Maximum messages: This should remain as 1. This implies that each report in a schedule will be executed in a separate transaction instead of a batch. Setting it to more than 1 may have unwanted effects like transaction timeouts, and the failure of one report may cause all the reports in the batch to fail.

Plug-in Retry Intervals and Connect Timeouts

When PowerAnalyzer is set up in a clustered WebSphere environment, a plug-in is normally used to perform the load-balancing between each server in the cluster. The proxy http-server sends the request to the plug-in and the plug-in then routes the request to the proper application-server.

The plug-in file can be generated automatically by navigating to “Environment > Update web server plugin configuration”.

The default plug-in file contains ConnectTimeOut=0, which means that it relies on the TCP timeout setting of the server. It is possible to have different timeout settings for different servers in the cluster. The timeout setting means that if a server does not respond within the given number of seconds, it is marked as down and the request is sent to the next available member of the cluster.

The RetryInterval parameter allows you to specify how long to wait before retrying a server that is marked as down. The default value is 10 seconds. This means if a cluster member is marked as down, the server will not try to send a request to the same member for 10 seconds.
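As a sketch only, the two settings appear in the generated plug-in file roughly as follows; the cluster and server names are placeholders and the surrounding transport definitions are omitted:

<ServerCluster Name="PowerAnalyzerCluster" RetryInterval="10">
    <Server Name="server1" ConnectTimeout="5">
        <!-- Transport elements omitted -->
    </Server>
</ServerCluster>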


Upgrading PowerAnalyzer

Challenge

Seamlessly upgrade PowerAnalyzer from one release to another while safeguarding the repository. This Best Practice describes the upgrade process from version 4.1.1 to version 5.0, but the same general steps apply to any PowerAnalyzer upgrade.

Description

Upgrading PowerAnalyzer involves two steps:

1. Upgrading the PowerAnalyzer application.
2. Upgrading the PowerAnalyzer repository.

Steps Before The Upgrade

1. Back up the repository. To ensure a clean backup, shut down PowerAnalyzer and create the backup, following the steps in the PowerAnalyzer manual.

2. Restore the backed up repository into an empty database or a new schema. This will ensure that you have a hot backup of the repository if, for some reason, the upgrade fails.

Steps for upgrading PowerAnalyzer application

The upgrade process varies depending on the application server that hosts PowerAnalyzer.

For WebLogic:

1. Install WebLogic 8.1 without uninstalling the existing application server (WebLogic 6.1).

2. Install the PowerAnalyzer application on the new WebLogic 8.1 application server, making sure to use a different port than the one used in the old installation. When prompted for a repository, choose the “existing repository” option and give the connection details of the database that hosts the backed up PowerAnalyzer 4.1.1 repository.

3. When the installation is complete, use the Upgrade utility to connect to the database that hosts the PowerAnalyzer 4.1.1 backed up repository and perform the upgrade.


For JBoss and WebSphere:

1. Uninstall PowerAnalyzer 4.1.1.
2. Install PowerAnalyzer 5.0.
3. When prompted for a repository, choose the “existing repository” option and give the connection details of the database that hosts the backed up PowerAnalyzer 4.1.1 repository.
4. Use the Upgrade utility to connect to the database that hosts the backed up PowerAnalyzer 4.1.1 repository and perform the upgrade.

When the repository upgrade is complete, start PowerAnalyzer 5.0 and perform a simple acceptance test.

You can use the following test case (or a subset of it) as an acceptance test.

1. Open a simple report.
2. Open a cached report.
3. Open a report with filtersets.
4. Open a sectional report.
5. Open a workflow and also its nodes.
6. Open a report and drill through it.

When all the reports open without problems, the upgrade can be considered complete.

Once the upgrade is complete, repeat the above process on the actual repository.

Note: This upgrade process creates two instances of PowerAnalyzer. So when the upgrade is successful, uninstall the older version, following the steps in the PowerAnalyzer manual.


Advanced Client Configuration Options

Challenge

Setting the Registry to ensure consistent client installations, resolve potential missing or invalid license key issues, and change the Server Manager Session Log Editor to your preferred editor.

Description

Ensuring Consistent Data Source Names

To ensure the use of consistent data source names for the same data sources across the domain, the Administrator can create a single "official" set of data sources, then use the Repository Manager to export that connection information to a file. You can then distribute this file and import the connection information for each client machine.

Solution:

• From Repository Manager, choose Export Registry from the Tools drop down menu.

• For all subsequent client installs, simply choose Import Registry from the Tools drop down menu.

Resolving the Missing or Invalid License Key Issue

The “missing or invalid license key” error occurs when attempting to install PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than ‘Administrator.’

This problem also occurs when the client software tools are installed under the Administrator account, and subsequently a user with a non-administrator ID attempts to run the tools. The user who attempts to log in using the normal ‘non-administrator’ userid will be unable to start the PowerCenter Client tools. Instead, the software will display the message indicating that the license key is missing or invalid.

Solution:


• While logged in as the installation user with administrator authority, use regedt32 to edit the registry.

• Under HKEY_LOCAL_MACHINE open Software/Informatica/PowerMart Client Tools/. From the menu bar, select Security/Permissions, and grant read access to the users that should be permitted to use the PowerMart Client. (Note that the registry entries for both PowerMart and PowerCenter Server and client tools are stored as PowerMart Server and PowerMart Client tools.)

Changing the Session Log Editor

In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to Wordpad within the workflow monitor client tool. To choose a different editor, just select Tools>Options in the workflow monitor. On the ‘general’ tab, browse for the editor that you want.

For PowerCenter versions earlier than 6.0, the editor does not default to Wordpad unless the wordpad.exe can be found in the path statement. Instead, a window appears the first time a session log is viewed from the PowerCenter Server Manager, prompting the user to enter the full path name of the editor to be used to view the logs. Users often set this parameter incorrectly and must access the registry to change it.

Solution:

• While logged in as the installation user with administrator authority, use regedt32 to go into the registry.

• Move to the registry path HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files. From the menu bar, select View Tree and Data.
• Select the Log File Editor entry by double-clicking on it.
• Replace the entry with the appropriate editor entry, typically WordPad.exe or Write.exe.
• Select Registry --> Exit from the menu bar to save the entry.

For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow Monitor. See fig 1 below.

Fig 1: Workflow Monitor Options Dialog Box used for setting the editor for workflow and session logs.


Customize to Add a New Command Under a Tools Menu

Other tools are often needed during development and testing in addition to the PowerCenter client tools. For example, a tool to query the database such as Enterprise manager (SQL Server) or Toad (Oracle) is often needed. It is possible to add shortcuts to executable programs from any client tool’s ‘Tools’ dropdown menu. This allows for quick access to these programs.

Solution:

Just choose ‘Customize’ under the Tools menu and then add a new item. Once it is added, browse to find the executable it will call.


After this is done once, you can easily call another program from your PowerCenter client tools.

In the following example, TOAD can be called quickly from the Repository Manager tool.

Target Load Type

In PowerCenter versions 6.0 and earlier, every time a session was created, it defaulted to be of type ‘bulk’. This was not necessarily what was desired and the session might fail under certain conditions if it was not changed. In version 7.0 and above, there is a property that can be set in the workflow manager to choose your default load type to be bulk or normal.

Solution:

• In the workflow manager tool, choose Tools > Options and go to the Miscellaneous tab.

• Click the button for normal or bulk, as desired.
• Click the ‘OK’ button, then close and reopen the Workflow Manager tool.


After this, every time a session is created, the target load type for all relational targets will default to your choice.

Undocked Explorer Window

The Repository Navigator window sometimes becomes undocked. Docking it again can be frustrating because double clicking on the window header does not put it back in place.

Solution:

To get it docked again, right click in the white space of the Navigator window and make sure that ‘Allow Docking’ option is checked. If it is checked, just double click on the title bar of the navigator window.


Advanced Server Configuration Options

Challenge

Configuring the Throttle Reader and File Debugging options, adjusting semaphore settings in the UNIX environment, and configuring server variables.

Description

Configuring the Throttle Reader

If problems occur when running sessions, some adjustments at the server level can help to alleviate issues or isolate problems.

One technique that often helps resolve “hanging” sessions is to limit the number of reader buffers that use throttle reader. This is particularly effective if your mapping contains many target tables, or if the session employs constraint-based loading. This parameter closely manages buffer blocks in memory by restricting the number of blocks that can be utilized by the reader.

Note for PowerCenter 5.x and above ONLY: If a session is hanging and it is partitioned, it is best to remove the partitions before adjusting the throttle reader. When a session is partitioned, the server makes separate connections to the source and target for every partition. This can cause the server to manage many buffer blocks. If the session still hangs, try adjusting the throttle reader.

Solution: To limit the number of reader buffers using throttle reader in NT/2000:

• Open the registry key hkey_local_machine\system\currentcontrolset\services\powermart\parameters\miscinfo.

• Create a new string value with a value name of 'ThrottleReader' and value data of '10' (see the sketch below).
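A minimal sketch of the same change expressed as a .reg file that can be imported with regedit follows; the key path is taken from the step above, and the header line assumes the Windows 2000-or-later registry editor format:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PowerMart\Parameters\MiscInfo]
"ThrottleReader"="10"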

To do the same thing in UNIX:

• Add this line to the server's .cfg file (pmserver.cfg):
ThrottleReader=10


Configuring File Debugging Options

If problems occur when running sessions or if the PowerCenter Server has a stability issue, help technical support to resolve the issue by supplying them with debug files.

To set the debug options on for NT/2000:

1. Select Start, Run, and type “regedit”.
2. Go to hkey_local_machine, system, current_control_set, services, powermart, miscInfo.
3. Select Edit, then Add Value.
4. Enter "DebugScrubber" as the value name, then click OK.
5. Enter "4" as the value data.
6. Repeat steps 4 and 5, but use "DebugWriter", "DebugReader", and "DebugDTM", with all three set to "1".

To do the same in UNIX:

Insert the following entries in the pmserver.cfg file:

• DebugScrubber=4
• DebugWriter=1
• DebugReader=1
• DebugDTM=1

Adjusting Semaphore Settings

When the PowerCenter Server runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server.

The number of semaphores required to run a session is 7. Most installations require between 64 and 128 available semaphores, depending on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as database servers.
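For example (an illustrative calculation only, ignoring semaphores consumed by other software): a server that runs 15 sessions concurrently needs roughly 15 x 7 = 105 semaphores for PowerCenter alone, so a limit of 128 is a reasonable starting point once the database server's requirements are added on top.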

The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and system. The method used to change the parameter depends on the operating system:

• HP/UX: Use sam (1M) to change the parameters.
• Solaris: Use admintool or edit /etc/system to change the parameters.
• AIX: Use smit to change the parameters.

Setting Shared Memory and Semaphore Parameters

Informatica recommends setting the following parameters as high as possible for the Solaris operating system. However, if you set these parameters too high, the machine may not boot. Refer to the operating system documentation for parameter limits. Note that different UNIX operating systems set these variables in different ways, or may be self-tuning:

• SHMMAX (recommended value for Solaris: 4294967295). Maximum size in bytes of a shared memory segment.

• SHMMIN (1). Minimum size in bytes of a shared memory segment.

• SHMMNI (100). Number of shared memory identifiers.

• SHMSEG (10). Maximum number of shared memory segments that can be attached by a process.

• SEMMNS (200). Number of semaphores in the system.

• SEMMNI (70). Number of semaphore set identifiers in the system. SEMMNI determines the number of semaphores that can be created at any one time.

• SEMMSL (equal to or greater than the value of the PROCESSES initialization parameter). Maximum number of semaphores in one semaphore set. Must be equal to the maximum number of processes.

For example, you might add the following lines to the Solaris /etc/system file to configure the UNIX kernel:

set shmsys:shminfo_shmmax = 4294967295

set shmsys:shminfo_shmmin = 1

set shmsys:shminfo_shmmni = 100

set shmsys:shminfo_shmseg = 10

set semsys:seminfo_semmns = 200

set semsys:seminfo_semmni = 70

Always reboot the system after configuring the UNIX kernel.

Configuring Server Variables

One configuration best practice is to properly configure and leverage server variables. The benefits of using server variables include:

• Ease of deployment from development environment to production environment.


• Ease of switching sessions from one server machine to another without manually editing all the sessions to change directory paths.

• Most of the variables are related to directory paths used by the server.

Approach

The Workflow Manager and pmrep can be used to edit the server configuration to set or change the variables.

Each registered server has its own set of variables. The list is fixed, not user-extensible.

Server Variable            Value
$PMRootDir                 (no default – user must insert a path)
$PMSessionLogDir           $PMRootDir/SessLogs
$PMBadFileDir              $PMRootDir/BadFiles
$PMCacheDir                $PMRootDir/Cache
$PMTargetFileDir           $PMRootDir/TargetFiles
$PMSourceFileDir           $PMRootDir/SourceFiles
$PMExtProcDir              $PMRootDir/ExtProc
$PMTempDir                 $PMRootDir/Temp
$PMSuccessEmailUser        (no default – user must insert a value)
$PMFailureEmailUser        (no default – user must insert a value)
$PMSessionLogCount         0
$PMSessionErrorThreshold   0
$PMWorkflowLogDir          $PMRootDir/WorkflowLogs
$PMWorkflowLogCount        0
$PMLookupFileDir           $PMRootDir/LkpFiles

You can define server variables for each PowerCenter Server you register. Some server variables define the path and directories for workflow output files and caches. By default, the PowerCenter Server places output files in these directories when you run a workflow. Other server variables define session/workflow attributes such as log file count, email user, and error threshold.

The installation process creates directories (SessLogs, BadFiles, Cache, TargetFiles, etc.) in the location where you install the PowerCenter Server. To use these directories as the default location for the session output files, you must first set the server variable $PMRootDir to define the path to the directories.

By using server variables, you simplify the process of changing the PowerCenter Server that runs a workflow. If each workflow in a folder uses server variables, then when you copy the folder to a production repository, the PowerCenter Server in production can run the workflow using the server variables defined in the production repository. It is not necessary to change the workflow/session properties in production again. To ensure a workflow completes successfully, relocate any necessary file source or incremental aggregation file to the default directories of the new PowerCenter Server.


Causes and Analysis of UNIX Core Files

Challenge

This Best Practice explains what UNIX core files are and why they are created, and offers some tips on analyzing them.

Description

Fatal run-time errors in UNIX programs usually result in the termination of the UNIX process by the operating system. Usually, when the operating system terminates a process, a ‘core dump’ file is also created, which can be used to analyze the reason for the abnormal termination.

What is a ‘Core’ File and What Causes it to be Created?

UNIX operating systems may terminate a process before its normal, expected exit for several reasons. These terminations are typically a response to bad behavior by the program, including attempts to execute illegal or incorrect machine instructions, attempts to allocate memory outside the memory space allocated to the program, attempts to write to memory marked read-only by the operating system, and other similar incorrect low-level operations. Most of these behaviors are caused by errors in the program's logic.

UNIX may also terminate a process for some reasons that are not caused by programming errors. The main examples of this type of termination are when a process exceeds its CPU time limit, and when a process exceeds its memory limit.

When UNIX terminates a process in this way, it normally writes an image of the processes memory to disk in a single file. These files are called ‘core files’, and are intended to be used by a programmer to help determine the cause of the failure. Depending on the UNIX version, the name of the file will be ‘core’, or in more recent UNIX versions, it is ‘core.nnnn’ where nnnn is the UNIX process ID of the process that was terminated.

Core files are not created for ‘normal’ runtime errors such as incorrect file permissions, lack of disk space, inability to open a file or network connection, and other errors that a program is expected to detect and handle. However, under certain error conditions a program may not handle the error conditions correctly and may follow a path of execution that causes the OS to terminate it and cause a core dump.


Mixing incompatible versions of UNIX, vendor, and database libraries can often trigger behavior that causes unexpected core dumps. For example, using an odbc driver library from one vendor and an odbc driver manager from another vendor may result in a core dump if the libraries are not compatible. A similar situation can occur if a process is using libraries from different versions of a database client, such as a mixed installation of Oracle 8i and 9i. An installation like this should not exist, but if it does, core dumps are often the result.

Core File Locations and Size Limits

A core file is written to the current working directory of the process that was terminated. For PowerCenter, this is always the directory the server was started from. For other applications, this may not be true.

UNIX also implements a per-user resource limit on the maximum size of core files. This is controlled by the ulimit command. If the limit is 0, core files are not created. If the limit is less than the total memory size of the process, a partial core file is written. Refer to the Best Practice on UNIX resource limits.
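For example, in sh, ksh, or bash you can check and raise the core file size limit before starting the server (csh-family shells use a different limit command, so adjust the syntax for your shell):

ulimit -c            # display the current core file size limit
ulimit -c unlimited  # allow full-size core files for this shell and its children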

Analyzing Core Files

There is little information in a core file that is relevant to an end user; most of the contents of a core file are only relevant to a developer, or someone who understands the internals of the program that generated the core file. However, there are a few things that an end user can do with a core file in the way of initial analysis.

The first step is to use the UNIX ‘file’ command on the core, which will show which program generated the core file:

file core.27431
core.27431: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'dd'

Core files can be generated by both the PowerCenter executables (i.e., pmserver, pmrepserver, and pmdtm) and other UNIX commands executed by the server, typically from command tasks and pre- or post-session commands. If a PowerCenter process is terminated by the OS and a core is generated, the session or server log typically shows ‘Process terminating on Signal/Exception’ as its last entry.

Using the pmstack Utility

Informatica provides a ‘pmstack’ utility, which can automatically analyze a core file. If the core file is from PowerCenter, it generates a complete stack trace from the core file, which can be sent to Informatica Customer Support for further analysis. The trace contains everything necessary to further diagnose the problem. Core files themselves are normally not useful on a system other than the one where they were generated.

The pmstack utility can be downloaded from the Informatica Support knowledge base as article 13652, and from the support ftp server at tsftp.informatica.com. Once downloaded, run pmstack with the –c option, followed by the name of the core file:


$ pmstack -c core.21896
=================================
SSG pmstack ver 2.0 073004
=================================
Core info : -rw------- 1 pr_pc_d pr_pc_d 58806272 Mar 29 16:28 core.21896
core.21896: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'pmdtm'
Process name used for analyzing the core : pmdtm
Generating stack trace, please wait..
Pmstack completed successfully
Please send file core.21896.trace to Informatica Technical Support

You can then look at the generated trace file or send it to support.

Pmstack also supports a –p option, which can be used to extract a stack trace from a running process. This is sometimes useful if the process appears to be hung, to determine what the process is doing.


Determining Bottlenecks

Challenge

Because there are many variables involved in identifying and rectifying performance bottlenecks, an efficient method for determining where bottlenecks exist is crucial to good data warehouse management.

Description

The first step in performance tuning is to identify performance bottlenecks. Carefully consider the following five areas to determine where bottlenecks exist; use a process of elimination, investigating each area in the order indicated:

1. Target
2. Source
3. Mapping
4. Session
5. System

Attempt to isolate performance problems by running test sessions. You should be able to compare the original session's performance with that of the tuned session.

The swap method is very useful for determining the most common bottlenecks. It involves the following five steps:

1. Make a temporary copy of the mapping, session and/or workflow that is to be tuned, then tune the copy before making changes to the original.

2. Implement only one change at a time and test for any performance improvements to gauge which tuning methods work most effectively in the environment.

3. Document the change made to the mapping, session and/or workflow and the performance metrics achieved as a result of the change. The actual execution time may be used as a performance metric.

4. Delete the temporary mapping, session and/or workflow upon completion of performance tuning.

5. Make appropriate tuning changes to mappings, sessions and/or workflows.

Target Bottlenecks

Relational targets

The most common performance bottleneck occurs when the PowerCenter Server writes to a target database. This type of bottleneck can easily be identified with the following procedure:

1. Make a copy of the original workflow.
2. Configure the session in the test workflow to write to a flat file.

If session performance increases significantly when writing to a flat file, you have a write bottleneck. Consider performing the following tasks to improve performance:

• Drop indexes and key constraints.
• Increase checkpoint intervals.
• Use bulk loading.
• Use external loading.
• Increase database network packet size.
• Optimize target databases.

Flat file targets

If the session targets a flat file, you probably do not have a write bottleneck. You can optimize session performance by writing to a flat file target local to the PowerCenter Server. If the local flat file is very large, you can optimize the write process by dividing it among several physical drives.

Source Bottlenecks

Relational sources

If the session reads from a relational source, you can use a filter transformation, a read test mapping, or a database query to identify source bottlenecks.

Using a Filter Transformation. Add a filter transformation in the mapping after each source qualifier. Set the filter condition to false so that no data is processed past the filter transformation. If the time it takes to run the new session remains about the same, then you have a source bottleneck.

Using a Read Test Session. You can create a read test mapping to identify source bottlenecks. A read test mapping isolates the read query by removing the transformation in the mapping. Use the following steps to create a read test mapping:

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.

Use the read test mapping in a test session. If the test session performance is similar to the original session, you have a source bottleneck.

Using a Database Query. You can identify source bottlenecks by executing a read query directly against the source database. To do so, follow these steps:

1. Copy the read query directly from the session log.
2. Run the query against the source database with a query tool such as SQL*Plus.
3. Measure the query execution time and the time it takes for the query to return the first row.

If there is a long delay between the two time measurements, you have a source bottleneck.

If your session reads from a relational source, review the following suggestions for improving performance:

• Optimize the query.
• Create tempdb as in-memory database.
• Use conditional filters.
• Increase database network packet size.
• Connect to Oracle databases using IPC protocol.

Flat file sources

If your session reads from a flat file source, you probably do not have a read bottleneck. Tuning the Line Sequential Buffer Length to a size large enough to hold approximately four to eight rows of data at a time (for flat files) may help when reading flat file sources. Ensure the flat file source is local to the PowerCenter Server.

Mapping Bottlenecks

If you have eliminated the reading and writing of data as bottlenecks, you may have a mapping bottleneck. Use the swap method to determine if the bottleneck is in the mapping.

Add a Filter transformation in the mapping before each target definition. Set the filter condition to false so that no data is loaded into the target tables. If the time it takes to run the new session is the same as the original session, you have a mapping bottleneck. You can also use the performance details to identify mapping bottlenecks.

High Rowsinlookupcache and High Errorrows counters indicate mapping bottlenecks. Follow these steps to identify mapping bottlenecks:

Using a test mapping without transformations

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.

High Rowsinlookupcache counters. Multiple lookups can slow the session. You may improve session performance by locating the largest lookup tables and tuning those lookup expressions.

High Errorrows counters. Transformation errors affect session performance. If a session has large numbers in any of the Transformation_errorrows counters, you may improve performance by eliminating the errors.

For further details on eliminating mapping bottlenecks, refer to the Best Practice: Tuning Mappings for Better Performance.

Session Bottlenecks

Session performance details can be used to flag other problem areas. Create performance details by selecting Collect Performance Data in the session properties before running the session.

View the performance details through the Workflow Monitor as the session runs, or view the resulting file. The performance details provide counters about each source qualifier, target definition, and individual transformation to help you understand session and mapping efficiency.

To watch the performance details during the session run:

• Right-click the session in the Workflow Monitor.
• Choose Properties.
• Click the Properties tab in the details dialog box.
• To view the file, look for the file session_name.perf in the same directory as the session log and open the file in any text editor.

All transformations have basic counters that indicate the number of input rows, output rows, and error rows. Source qualifiers, normalizers, and targets have additional counters indicating the efficiency of data moving into and out of buffers. Some transformations have counters specific to their functionality. When reading performance details, the first column displays the transformation name as it appears in the mapping, the second column contains the counter name, and the third column holds the resulting number or efficiency percentage.

Low buffer input and buffer output counters

If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all sources and targets, increasing the session DTM buffer pool size may improve performance.

Aggregator, rank, and joiner readfromdisk and writetodisk counters

If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes.

If the session performs incremental aggregation, the Aggregator_readfromdisk and writetodisk counters display a number other than zero because the PowerCenter Server reads historical aggregate data from the local disk during the session and writes to disk when saving historical data. Evaluate the Aggregator_readfromdisk and writetodisk counters during the session. If the counters show any numbers other than zero during the session run, you can increase performance by tuning the index and data cache sizes.

PowerCenter versions 6.x and above include the ability to assign memory allocation per object. In versions earlier than 6.x, memory for aggregators, ranks, and joiners was assigned at a global/session level.

For further details on eliminating session bottlenecks, refer to the Best Practice: Tuning Sessions for Better Performance and Tuning SQL Overrides and Environment for Better Performance.

System Bottlenecks

After tuning the source, target, mapping, and session, you may also consider tuning the system hosting the PowerCenter Server.

The PowerCenter Server uses system resources to process transformations, run sessions, and read and write data. The PowerCenter Server also uses system memory for other data such as aggregate, joiner, rank, and cached lookup tables. You can use system performance monitoring tools to monitor the amount of system resources the Server uses and identify system bottlenecks.

Windows NT/2000

Use system tools such as the Performance and Processes tab in the Task Manager to view CPU usage and total memory usage. You can also view more detailed performance information by using the Performance Monitor in the Administrative Tools on Windows.

UNIX

On UNIX, you can use system tools to monitor system performance. Use lsattr -E -l sys0 to view current system settings; iostat to monitor the load on every disk attached to the database server; vmstat or sar -w to monitor disk swapping activity; and sar -u to monitor CPU load.
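
The following commands illustrate this kind of monitoring on AIX; the sampling intervals are arbitrary, and the exact options available for iostat and sar vary by UNIX flavor:

# Current system settings (AIX)
lsattr -E -l sys0
# Disk I/O statistics, sampled every 5 seconds
iostat 5
# Memory and swapping activity
vmstat 5
sar -w 5 5
# CPU utilization, five 5-second samples
sar -u 5 5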

For further information regarding system tuning, refer to the Best Practices: Performance Tuning UNIX Systems and Performance Tuning Windows NT/2000 Systems.

Managing Repository Size

Challenge

The PowerCenter repository is expected to grow over time as new development and production runs occur. Over time, the repository can be expected to grow to a size that may start slowing performance of the repository or make backups increasingly difficult. This Best Practice discusses methods to manage the size of the repository.

The release of PowerCenter version 7.x added several features that aid in managing the repository size. Although the repository is slightly larger with version 7.x than it was with previous versions, the client tools have increased functionality that reduces the impact of repository size. PowerCenter versions earlier than 7.x require more administration to keep repository sizes manageable.

Description

Why should we manage the size of the repository?

The repository size affects the following:

• DB backups and restores. If database backups are being performed, the size required for the backup can be reduced. If PowerCenter backups are being used, you can limit what gets backed up.

• Overall query time against the repository, which slows repository performance over time. Analyzing the repository tables on a regular basis can help maintain their performance.

• Migrations (i.e., copying from one repository to the next). Limit data transfer between repositories to avoid locking up the repository for a lengthy period of time. Some options are available to avoid transferring all run statistics when migrating.

A typical repository starts off small (i.e., 50-60MB for an empty repository) and grows over time, to upwards of 1GB for a large repository. The type of information stored in the repository includes:

o Versions
o Objects
o Run statistics
o Scheduling information
o Variables

Tips for Managing the Size of the Repository

Versions and Objects

Delete old versions or purged objects from the repository. Use the repository query feature in the client tools to build reusable queries that identify out-of-date versions and objects for removal.

Old versions and objects not only increase the size of the repository, but also make it more difficult to manage further into the development cycle. Cleaning up the folders makes it easier to determine what is valid and what is not.

Folders

Remove folders and objects that are no longer used or referenced. Unnecessary folders increase the size of the repository backups. These folders should not be a part of production but they may be found in development or test repositories.

Run Statistics

Remove old run statistics from the repository if you no longer need them. History is important for determining trending, scaling, and performance tuning needs, but you can always generate reports with the PowerCenter Metadata Reporter and save the reports containing the data you need. To remove the run statistics, go to the Repository Manager and truncate the logs based on dates.

Recommendations

Informatica strongly recommends upgrading to the latest version of PowerCenter since the latest release includes such features as backup without run statistics, copying only objects with no history, repository queries in the client tools, and so forth. The repository size in version 7.x and above is larger than the previous versions of PowerCenter but the added size does not dramatically affect the performance of the repository. It is still advisable to analyze the tables or run statistics to optimize the tables.

Informatica recommends against direct access to the repository tables or performing deletes on them. Use the client tools unless otherwise advised by Informatica.

Organizing and Maintaining Parameter Files & Variables

Challenge

Organizing variables and parameters in Parameter files and maintaining Parameter files for ease of use.

Description

Parameter files are a means of providing run time values for parameters and variables defined in a workflow, worklet, session, mapplet or mapping. A parameter file can have values for more than one workflow, session, and mapping, and can be created using a text editor such as Notepad or vi.

Variable values are stored in the repository and can be changed within mappings. However, variable values specified in parameter files supersede values stored in the repository. The values stored in the repository can be cleared or reset using the Workflow Manager.

Parameter File Contents

A Parameter File contains the values for variables and parameters. Although a parameter file can contain values for more than one workflow (or session), it is advisable to build a parameter file to contain values for a single workflow or a logical group of workflows, for ease of administration. When using the command line mode to execute workflows, multiple parameter files can also be configured and used for a single workflow if the same workflow needs to be run with different parameters.
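
For illustration, a parameter file serving a single workflow might look like the following; the folder, workflow, session, and parameter names are placeholders rather than values from any particular project:

[PROJ_DP.WF:wf_load_customers]
$$LoadStartDate=2004-01-01
$InputFile_1=/app/data/input/customers.dat
[PROJ_DP.WF:wf_load_customers.ST:s_m_load_customers]
$DBConnection_Target=DW_TARGET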

Parameter File Name

Name the Parameter File the same as the workflow name with a suffix of “.par”. This helps in identifying and linking the parameter file to a workflow.

Parameter File – Order Of Precedence

While it is possible to assign Parameter Files to a session and a workflow, it is important to note that a file specified at the workflow level will always supersede files specified at session levels.

Parameter File Location

Place the parameter files in a directory that can be accessed using the server variable. This makes it easy to move the sessions and workflows to a different server without modifying workflow or session properties. You can override the location and name of the parameter file specified in the session or workflow when executing workflows via the pmcmd command.

The following points apply to both parameter and variable files; however, they are more relevant to parameters and parameter files, and are therefore detailed accordingly.

Multiple Parameter Files for a workflow

To run a workflow with different sets of parameter values during every run:

a. create multiple parameter files with unique names.

b. change the parameter file name (to match the parameter file name defined in Session or workflow properties). This can be done manually or by using a pre-session shell (or batch script).

c. run the workflow.

Alternatively, run the workflow using pmcmd with the -paramfile option in place of steps b and c.
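
As a sketch of the pmcmd alternative, the server address, port, user, folder, and file names shown here are placeholders, and the exact connection options vary somewhat across PowerCenter versions:

# Start the workflow with an explicit parameter file (placeholder names throughout)
pmcmd startworkflow -s infa_server:4001 -u pmuser -p pmpassword -f PROJ_DP -paramfile /app/parmfiles/wf_monthly_load_special.par wf_monthly_load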

Generating Parameter files

Based on requirements, you can obtain the values for certain parameters from relational tables or generate them programmatically. In such cases, the parameter files can be generated dynamically using shell (or batch scripts) or using Informatica mappings and sessions.

Consider a case where a session has to be executed only on specific dates (e.g., the last working day of every month), which are listed in a table. You can create the parameter file containing the next run date (extracted from the table) in more than one way.

Method 1:

1. The workflow is configured to use a parameter file.
2. The workflow has a decision task before running the session, comparing the current system date against the date in the parameter file. See Figure 1.
3. Use a shell (or batch) script to create a parameter file. Use an SQL query to extract a single date, which is greater than the system date (today), from the table and write it to a file in the required format.
4. The shell script uses pmcmd to run the workflow.
5. The shell script is scheduled using cron or an external scheduler to run daily. See Figure 2.

Figure 1. Shell script to generate parameter file

Figure 2 Generated parameter file
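
Since the exact script depends on the environment, the following is only a minimal sketch along the lines of Figure 1; the calendar table, query, connection string, directory, and workflow names are all assumptions:

#!/bin/sh
# Extract the next run date from the (assumed) calendar table and build the parameter file.
PARMFILE=/app/parmfiles/wf_monthly_load.par
NEXT_DATE=`sqlplus -s pmuser/pmuser@infadb <<EOF
set heading off feedback off pagesize 0
SELECT TO_CHAR(MIN(run_date),'MM/DD/YYYY') FROM run_calendar WHERE run_date > SYSDATE;
EOF`

# Write the parameter file in the format expected by the workflow.
echo "[PROJ_DP.WF:wf_monthly_load]"  > $PARMFILE
echo "\$\$NextRunDate=$NEXT_DATE"   >> $PARMFILE

# Start the workflow; its decision task compares the current date to $$NextRunDate.
pmcmd startworkflow -s infa_server:4001 -u pmuser -p pmpassword -f PROJ_DP -paramfile $PARMFILE wf_monthly_load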

Method 2:

1. The workflow is configured to use a parameter file.
2. The initial value for the date parameter is the first date on which the workflow is to run.
3. The workflow has a decision task before running the session, comparing the current system date against the date in the parameter file.
4. The last task in the workflow generates the parameter file for the next run of the workflow, using either a command task calling a shell script or a session task that uses a mapping. This task extracts a date that is greater than the system date (today) from the table and writes it into the parameter file in the required format.
5. Schedule the workflow using the Informatica Scheduler to run daily.

Figure 3 Workflow and parameter definition

Parameter file templates

In some other cases, the parameter values change between runs, but the change can be incorporated into the parameter files programmatically. There is no need to maintain separate parameter files for each run.

Consider, for example, a service provider who gets the source data for each client from flat files located in client-specific directories and writes processed data into a global database. The source data structure, target data structure, and processing logic are all the same. The log file for each client run has to be preserved in a client-specific directory. The directory names have the client id as part of the directory structure (e.g., /app/data/Client_ID/).

You can complete the work for all clients using a set of mappings, sessions, and a workflow, with one parameter file per client. However, the number of parameter files may become cumbersome to manage when the number of clients increases.

In such cases, a parameter file template (i.e., a parameter file containing values for some parameters and placeholders for others) may prove useful. Use a shell (or batch) script at run time to create the actual parameter file (for a specific client), replacing the placeholders with actual values, and then execute the workflow using pmcmd. See Figure 4.

[PROJ_DP.WF:Client_Data]

$InputFile_1=/app/data/Client_ID/input/client_info.dat

$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log

Figure 4 Parameter File Template

Using a script, replace “Client_ID” and “curdate” with actual values before executing the workflow.
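
A minimal sketch of such a script follows; the directory layout, template name, and pmcmd connection options are assumptions that would need to match your environment:

#!/bin/sh
# Build a client-specific parameter file from the template and run the workflow.
CLIENT_ID=$1
CURDATE=`date +%Y%m%d`
TEMPLATE=/app/parmfiles/wf_client_data.template
PARMFILE=/app/parmfiles/wf_client_data_${CLIENT_ID}.par

# Replace the Client_ID and curdate placeholders with actual values.
sed -e "s/Client_ID/${CLIENT_ID}/g" -e "s/curdate/${CURDATE}/g" $TEMPLATE > $PARMFILE

pmcmd startworkflow -s infa_server:4001 -u pmuser -p pmpassword -f PROJ_DP -paramfile $PARMFILE wf_Client_Data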

Performance Tuning Databases (Oracle)

Challenge

Database tuning can result in tremendous improvement in loading performance. This Best Practice covers tips on tuning Oracle.

Description

Performance Tuning Tools

Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar with these tools, so we’ve included only a short description of some of the major ones here.

V$ Views

V$ views are dynamic performance views that provide real-time information on database activity, enabling the DBA to draw conclusions about database performance. Because SYS is the owner of these views, only SYS can query them. Keep in mind that querying these views impacts database performance; each query has an immediate cost. With this in mind, carefully consider which users should be granted the privilege to query these views. You can grant viewing privileges with either the ‘SELECT’ privilege, which allows a user to view individual V$ views, or the ‘SELECT ANY TABLE’ privilege, which allows the user to view all V$ views. Using the SELECT ANY TABLE option requires the ‘O7_DICTIONARY_ACCESSIBILITY’ parameter to be set to ‘TRUE’, which allows the ‘ANY’ keyword to apply to SYS-owned objects.

Explain Plan

Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and developing a strategy to avoid them.

Explain Plan allows the DBA or developer to determine the execution path of a block of SQL code. The SQL in a source qualifier or in a lookup that is running for a long time should be generated and copied to SQL*Plus or another SQL tool and tested to avoid inefficient execution of these statements. Review the PowerCenter session log for long initialization time (an indicator that the source qualifier may need tuning) and the time it takes to build a lookup cache to determine if the SQL for these transformations should be tested.

SQL Trace

SQL Trace extends the functionality of Explain Plan by providing statistical information about the SQL statements executed in a session that has tracing enabled. This utility is run for a session with the ‘ALTER SESSION SET SQL_TRACE = TRUE’ statement.

TKPROF

The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF formats this dump file into a more understandable report.

UTLBSTAT & UTLESTAT

Executing ‘UTLBSTAT’ creates tables to store dynamic performance statistics and begins the statistics collection process. Run this utility after the database has been up and running (for hours or days). Accumulating statistics may take time, so you need to run this utility for a long while and through several operations (i.e., both loading and querying).

‘UTLESTAT’ ends the statistics collection process and generates an output file called ‘report.txt.’ This report should give the DBA a fairly complete idea about the level of usage the database experiences and reveal areas that should be addressed.

Disk I/O

Disk I/O at the database level provides the highest level of performance gain in most systems. Database files should be separated and identified. Rollback files should be separated onto their own disks because they have significant disk I/O. Co-locate tables that are heavily used with tables that are rarely used to help minimize disk contention. Separate indexes so that when queries run indexes and tables, they are not fighting for the same resource. Also be sure to implement disk striping; this, or RAID technology can help immensely in reducing disk contention. While this type of planning is time consuming, the payoff is well worth the effort in terms of performance gains.

Memory and Processing

Memory and processing configuration is done in the init.ora file. Because each database is different and requires an experienced DBA to analyze and tune it for optimal performance, a standard set of parameters to optimize PowerCenter is not practical and will probably never exist.

TIP

Changes made in the init.ora file take effect only after a restart of the instance. Use svrmgr to issue the commands “shutdown” and “startup” (or “shutdown immediate”) to the instance. Note that svrmgr is no longer available as of Oracle 9i, and Oracle is moving to a web-based Server Manager in Oracle 10g. If you are on Oracle 9i, install the Oracle client tools and log onto Oracle Enterprise Manager. Some other tools, such as DBArtisan, also expose the initialization parameters.
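
On Oracle 9i and later, the bounce can also be scripted from SQL*Plus connected as SYSDBA; this is a generic sketch that assumes operating-system authentication is set up for the DBA account:

# Restart the instance so that init.ora changes take effect
sqlplus "/ as sysdba" <<EOF
shutdown immediate
startup
EOF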

The settings presented here are those used in a 4-CPU AIX server running Oracle 7.3.4 set to make use of the parallel query option to facilitate parallel processing of queries and indexes. We’ve also included the descriptions and documentation from Oracle for each setting to help DBAs of other (non-Oracle) systems to determine what the commands do in the Oracle environment to facilitate setting their native database commands and settings in a similar fashion.

HASH_AREA_SIZE = 16777216

• Default value: 2 times the value of SORT_AREA_SIZE
• Range of values: any integer
• This parameter specifies the maximum amount of memory, in bytes, to be used for the hash join. If this parameter is not set, its value defaults to twice the value of the SORT_AREA_SIZE parameter.

• The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. (Note: ALTER SESSION refers to the Database Administration command issued at the svrmgr command prompt).

• HASH_JOIN_ENABLED
  o In Oracle 7 and Oracle 8 the hash_join_enabled parameter must be set to true.
  o In Oracle 8i and above hash_join_enabled=true is the default value.
• HASH_MULTIBLOCK_IO_COUNT
  o Allows multiblock reads against the TEMP tablespace.
  o It is advisable to set the NEXT extent size to greater than the value for hash_multiblock_io_count to reduce disk I/O.
  o This is the same behavior seen when setting the db_file_multiblock_read_count parameter for data tablespaces, except this one applies only to multiblock access of segments of the TEMP tablespace.
• STAR_TRANSFORMATION_ENABLED
  o Determines whether a cost-based query transformation will be applied to star queries.
  o When set to TRUE, the optimizer will consider performing a cost-based query transformation on the n-way join table.
• OPTIMIZER_INDEX_COST_ADJ
  o Numeric parameter set between 0 and 1000 (default 1000).
  o This parameter lets you tune the optimizer behavior for access path selection to be more or less index friendly.

Optimizer_percent_parallel=33

This parameter defines the amount of parallelism that the optimizer uses in its cost functions. The default of 0 means that the optimizer chooses the best serial plan. A value of 100 means that the optimizer uses each object's degree of parallelism in computing the cost of a full table scan operation.

The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. Low values favor indexes, while high values favor table scans.

Cost-based optimization is always used for queries that reference an object with a nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of OPTIMIZER_PERCENT_PARALLEL.

parallel_max_servers=40

• Used to enable parallel query.
• Initially not set on install.
• Maximum number of query servers or parallel recovery processes for an instance.

Parallel_min_servers=8

• Used to enable parallel query.
• Initially not set on install.
• Minimum number of query server processes for an instance. This is also the number of query server processes Oracle creates when the instance is started.

SORT_AREA_SIZE=8388608

• Default value: Operating system-dependent
• Minimum value: the value equivalent to two database blocks
• This parameter specifies the maximum amount, in bytes, of Program Global Area (PGA) memory to use for a sort. After the sort is complete and all that remains to do is to fetch the rows out, the memory is released down to the size specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out, all memory is freed. The memory is released back to the PGA, not to the operating system.

• Increasing SORT_AREA_SIZE improves the efficiency of large sorts. Multiple allocations never exist; there is only one memory area of SORT_AREA_SIZE for each user process at any time.

• The default is usually adequate for most database operations. However, if very large indexes are created, this parameter may need to be adjusted. For example, if one process is doing all database access, as in a full database import, then an increased value for this parameter may speed the import, particularly the CREATE INDEX statements.
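
As noted for several of the parameters above, the sort and hash areas can be changed for the current session without bouncing the instance. This is a simple illustration using the connection from the examples later in this section; the values shown are just the example figures above, not recommendations:

sqlplus -s pmuser/pmuser@infadb <<EOF
ALTER SESSION SET SORT_AREA_SIZE = 8388608;
ALTER SESSION SET HASH_AREA_SIZE = 16777216;
EOF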

IPC as an Alternative to TCP/IP on UNIX

On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same box), using an IPC connection can significantly reduce the time it takes to build a lookup cache. In one case, a fact mapping that was using a lookup to get five columns (including a foreign key) and about 500,000 rows from a table was taking 19 minutes. Changing the connection type to IPC reduced this to 45 seconds. In another mapping, the total time decreased from 24 minutes to 8 minutes for ~120-130 bytes/row, 500,000 row write (array inserts), primary key with unique index in place. Performance went from about 2Mb/min (280 rows/sec) to about 10Mb/min (1360 rows/sec).

A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:

DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP) (HOST = armafix) (PORT = 1526))
    )
    (CONNECT_DATA = (SID = DW))
  )

Make a new entry in the tnsnames like this, and use it for connection to the local Oracle instance:

DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = ipc) (KEY = DW))
    (CONNECT_DATA = (SID = DW))
  )

Improving Data Load Performance

Alternative to Dropping and Reloading Indexes

Dropping and reloading indexes during very large loads to a data warehouse is often recommended but there is seldom any easy way to do this. For example, writing a SQL statement to drop each index, then writing another SQL statement to rebuild it can be a very tedious process.

Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by allowing you to disable and re-enable existing indexes. Oracle stores the name of each index in a table that can be queried. With this in mind, it is an easy matter to write a SQL statement that queries this table, then generates SQL statements as output to disable and enable these indexes.

Run the following to generate output to disable the foreign keys in the data warehouse:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'

FROM USER_CONSTRAINTS

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

AND CONSTRAINT_TYPE = 'R'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011077 ;

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011075 ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060 ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011059 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011134 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011131 ;

Dropping or disabling primary keys will also speed loads. Run the results of this SQL statement after disabling the foreign key constraints:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'

FROM USER_CONSTRAINTS

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

AND CONSTRAINT_TYPE = 'P'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;

Finally, disable any unique constraints with the following:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'

FROM USER_CONSTRAINTS

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

AND CONSTRAINT_TYPE = 'U'

This produces output that looks like:

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011071 ;

Save the results in a single file and name it something like ‘DISABLE.SQL’

To re-enable the indexes, rerun these queries after replacing ‘DISABLE’ with ‘ENABLE.’ Save the results in another file with a name such as ‘ENABLE.SQL’ and run it as a post-session command.

Re-enable constraints in the reverse order that you disabled them. Re-enable the unique constraints first, and re-enable primary keys before foreign keys.

TIP

Dropping or disabling foreign keys will often boost loading, but this also slows queries (such as lookups) and updates. If you do not use lookups or updates on your target tables you should get a boost by using this SQL statement to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that will be used for the lookup from your script. You may want to experiment to determine which method is faster.

Optimizing Query Performance

Oracle Bitmap Indexing

With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree index. A b-tree index can greatly improve query performance on data that has high cardinality or contains mostly unique values, but is not much help for low cardinality/highly duplicated data and may even increase query time. A typical example of a low cardinality field is gender – it is either male or female (or possibly unknown). This kind of data is an excellent candidate for a bitmap index, and can significantly improve query performance.

Keep in mind, however, that b-tree indexing is still the Oracle default. If you don’t specify an index type when creating an index, Oracle will default to b-tree. Also note that for certain columns, bitmaps will be smaller and faster to create than a b-tree index on the same column.

Bitmap indexes are suited to data warehousing because of their performance, size, and ability to create and drop very quickly. Since most dimension tables in a warehouse have nearly every column indexed, the space savings is dramatic. But it is important to note that when a bitmap-indexed column is updated, every row associated with that bitmap entry is locked, making bitmap indexing a poor choice for OLTP database tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after each DML statement (e.g., inserts and updates), which can make loads very slow. For this reason, it is a good idea to drop or disable bitmap indexes prior to the load and re-create or re-enable them after the load.

The relationship between Fact and Dimension keys is another example of low cardinality. With a b-tree index on the Fact table, a query processes by joining all the Dimension tables in a Cartesian product based on the WHERE clause, then joins back to the Fact table. With a bitmapped index on the Fact table, a ‘star query’ may be created that accesses the Fact table first followed by the Dimension table joins, avoiding a Cartesian product of all possible Dimension attributes. This ‘star query’ access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is equal to TRUE in the init.ora file and if there are single column bitmapped indexes on the fact table foreign keys. Creating bitmap indexes is similar to creating b-tree indexes. To specify a bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’. All other syntax is identical.

Bitmap indexes

drop index emp_active_bit;

drop index emp_gender_bit;

create bitmap index emp_active_bit on emp (active_flag);

create bitmap index emp_gender_bit on emp (gender);

B-tree indexes

drop index emp_active;

drop index emp_gender;

create index emp_active on emp (active_flag);

create index emp_gender on emp (gender);

Information for bitmap indexes is stored in the data dictionary in dba_indexes, all_indexes, and user_indexes with the word ‘BITMAP’ in the Uniqueness column rather than the word ‘UNIQUE.’ Bitmap indexes cannot be unique.

To enable bitmap indexes, you must set the following items in the instance initialization file:

• compatible = 7.3.2.0.0 # or higher
• event = "10111 trace name context forever"
• event = "10112 trace name context forever"
• event = "10114 trace name context forever"

Also note that the parallel query option must be installed in order to create bitmap indexes. If you try to create bitmap indexes without the parallel query option, a syntax error will appear in your SQL statement; the keyword ‘bitmap’ won't be recognized.

TIP To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is installed, the word ‘parallel’ appears in the banner text.

Index Statistics

Table method

Index statistics are used by Oracle to determine the best method to access tables and should be updated periodically as normal DBA procedures. The following will improve query results on Fact and Dimension tables (including appending and updating records) by updating the table and index statistics for the data warehouse:

The following SQL statement can be used to analyze the tables in the database:

SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'

FROM USER_TABLES

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;

ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;

ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;

The following SQL statement can be used to analyze the indexes in the database:

SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'

FROM USER_INDEXES

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;

ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;

ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;

Save these results as a SQL script to be executed before or after a load.

Schema method

Another way to update index statistics is to compute indexes by schema rather than by table. If data warehouse indexes are the only indexes located in a single schema, then you can use the following command to update the statistics:

EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');

In this example, BDB is the schema for which the statistics should be updated. Note that the DBA must grant the execution privilege for dbms_utility to the database user executing this command.

TIP

These SQL statements can be very resource intensive, especially for very large tables. For this reason, we recommend running them at off-peak times when no other process is using the database. If you find the exact computation of the statistics consumes too much time, it is often acceptable to estimate the statistics rather than compute them. Use ‘estimate’ instead of ‘compute’ in the above examples.

Parallelism

Parallel execution can be implemented at the SQL statement, database object, or instance level for many SQL operations. The degree of parallelism should be identified based on the number of processors and disk drives on the server, with the number of processors being the minimum degree.

SQL Level Parallelism

Hints are used to define parallelism at the SQL statement level. The following examples demonstrate how to utilize four processors:

SELECT /*+ PARALLEL(order_fact,4) */ …;

SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ …;

TIP When using a table alias in the SQL Statement, be sure to use this alias in the hint. Otherwise, the hint will not be used, and you will not receive an error message.

Example of improper use of alias:

SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME FROM EMP A

Here, the parallel hint will not be used because the alias “A” is used for table EMP. The correct way is:

SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME FROM EMP A

Table Level Parallelism

Parallelism can also be defined at the table and index level. The following example demonstrates how to set a table’s degree of parallelism to four for all eligible SQL statements on this table:

ALTER TABLE order_fact PARALLEL 4;

Ensure that Oracle is not contending with other processes for these resources or you may end up with degraded performance due to resource contention.

Additional Tips

Executing Oracle SQL scripts as pre and post session commands on UNIX

You can execute queries as both pre- and post-session commands. For a UNIX environment, the format of the command is:

sqlplus -s user_id/password@database @ script_name.sql

For example, to execute the ENABLE.SQL file created earlier (assuming the data warehouse is on a database named ‘infadb’), you would execute the following as a post-session command:

sqlplus -s pmuser/pmuser@infadb @ /informatica/powercenter/Scripts/ENABLE.SQL

In some environments, this may be a security issue since both username and password are hard-coded and unencrypted. To avoid this, use the operating system’s authentication to log onto the database instance.

In the following example, the Informatica id “pmuser” is used to log onto the Oracle database. Create the Oracle user “pmuser” with the following SQL statement:

CREATE USER PMUSER IDENTIFIED EXTERNALLY DEFAULT TABLESPACE . . . TEMPORARY TABLESPACE . . .

In the following pre-session command, “pmuser” (the id Informatica is logged onto the operating system as) is automatically passed from the operating system to the database and used to execute the script:

Page 372: Best Informatica Practices26064718

PAGE BP-372 BEST PRACTICES INFORMATICA CONFIDENTIAL

sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL

You may want to use the init.ora parameter “os_authent_prefix” to distinguish between normal Oracle users and externally-identified ones.

DRIVING_SITE ‘Hint’

If the source and target are on separate instances, the Source Qualifier transformation should be executed on the target instance.

For example, you want to join two source tables (A and B) together, which may reduce the number of selected rows. However, Oracle fetches all of the data from both tables, moves the data across the network to the target instance, then processes everything on the target instance. If either data source is large, this causes a great deal of network traffic. To force the Oracle optimizer to process the join on the source instance, use the ‘Generate SQL’ option in the source qualifier and include the ‘driving_site’ hint in the SQL statement as:

SELECT /*+ DRIVING_SITE */ …;

Performance Tuning Databases (SQL Server)

Challenge

Database tuning can result in tremendous improvement in loading performance. This Best Practice covers tips on tuning SQL Server.

Description

Proper tuning of the source and target database is a very important consideration to the scalability and usability of a business analytical environment. Managing performance on an SQL Server encompasses the following points.

• Manage system memory usage (RAM caching)
• Create and maintain good indexes
• Partition large data sets and indexes
• Monitor disk I/O subsystem performance
• Tune applications and queries
• Optimize active data

Manage RAM Caching

Managing random access memory (RAM) buffer cache is a major consideration in any database server environment. Accessing data in RAM cache is much faster than accessing the same Information from disk. If database I/O (input/output operations to the physical disk subsystem) can be reduced to the minimal required set of data and index pages, these pages will stay in RAM longer. Too much unneeded data and index information flowing into buffer cache quickly pushes out valuable pages. The primary goal of performance tuning is to reduce I/O so that buffer cache is best utilized.

Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM usage:

• Max async I/O is used to specify the number of simultaneous disk I/O operations that SQL Server can submit to the operating system. Note that this setting is automated in SQL Server 2000.
• SQL Server allows several selectable models for database recovery; these include:
  o Full Recovery
  o Bulk-Logged Recovery
  o Simple Recovery

Create and maintain good indexes

A key factor in maintaining minimum I/O for all database queries is ensuring that good indexes are created and maintained.

Partition large data sets and indexes

To reduce overall I/O contention and improve parallel operations, consider partitioning table data and indexes. Multiple techniques for achieving and managing partitions using SQL Server 2000 are addressed in this chapter.

Tune applications and queries

This becomes especially important when a database server will be servicing requests from hundreds or thousands of connections through a given application. Because applications typically determine the SQL queries that will be executed on a database server, it is very important for application developers to understand SQL Server architectural basics and how to take full advantage of SQL Server indexes to minimize I/O.

Partitioning for Performance

The simplest technique for creating disk I/O parallelism is to use hardware partitioning and create a single "pool of drives" that serves all SQL Server database files except transaction log files, which should always be stored on physically separate disk drives dedicated to log files only. See Microsoft Documentation for installation procedures.

Objects For Partitioning Consideration

The following areas of SQL Server activity can be separated across different hard drives, RAID controllers, and PCI channels (or combinations of the three):

• Transaction logs
• Tempdb
• Database
• Tables
• Nonclustered indexes

Note: In SQL Server 2000, Microsoft introduced enhancements to distributed partitioned views that enable the creation of federated databases (commonly referred to as scale-out), which spread resource load and I/O activity across multiple servers. Federated databases are appropriate for some high-end online transaction processing (OLTP) applications, but this approach is not recommended for addressing the needs of a data warehouse.

Segregating the Transaction Log

Transaction log files should be maintained on a storage device physically separate from devices that contain data files. Depending on your database recovery model setting, most update activity generates both data device activity and log activity. If both are set up to share the same device, the operations to be performed will compete for the same limited resources. Most installations benefit from separating these competing I/O activities.

Segregating tempdb

SQL Server creates a database, tempdb, on every server instance to be used by the server as a shared working area for various activities, including temporary tables, sorting, processing subqueries, building aggregates to support GROUP BY or ORDER BY clauses, queries using DISTINCT (temporary worktables have to be created to remove duplicate rows), cursors, and hash joins.

To move the tempdb database, use the ALTER DATABASE command to change the physical file location of the SQL Server logical file name associated with tempdb. For example, to move tempdb and its associated log to the new file locations E:\mssql7 and C:\temp, use the following commands:

ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME = 'e:\mssql7\tempnew_location.mdf')
ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME = 'c:\temp\tempnew_loglocation.mdf')

The master database, msdb, and model databases are not used much during production compared to user databases, so it is typically not necessary to consider them in I/O performance tuning considerations. The master database is usually used only for adding new logins, databases, devices, and other system objects.

Database Partitioning

Databases can be partitioned using files and/or filegroups. A filegroup is simply a named collection of individual files grouped together for administration purposes. A file cannot be a member of more than one filegroup. Tables, indexes, text, ntext, and image data can all be associated with a specific filegroup. This means that all their pages are allocated from the files in that filegroup. The three types of filegroups are described below.

Primary filegroup

This filegroup contains the primary data file and any other files not placed into another filegroup. All pages for the system tables are allocated from the primary filegroup.

User-defined filegroup

This filegroup is any filegroup specified using the FILEGROUP keyword in a CREATE DATABASE or ALTER DATABASE statement, or on the Properties dialog box within SQL Server Enterprise Manager.

Default filegroup

The default filegroup contains the pages for all tables and indexes that do not have a filegroup specified when they are created. In each database, only one filegroup at a time can be the default filegroup. If no default filegroup is specified, the default is the primary filegroup.

Files and filegroups are useful for controlling the placement of data and indexes and to eliminate device contention. Quite a few installations also leverage files and filegroups as a mechanism that is more granular than a database in order to exercise more control over their database backup/recovery strategy.

Horizontal Partitioning (Table)

Horizontal partitioning segments a table into multiple tables, each containing the same number of columns but fewer rows. Determining how to partition the tables horizontally depends on how data is analyzed. A general rule of thumb is to partition tables so queries reference as few tables as possible. Otherwise, excessive UNION queries, used to merge the tables logically at query time, can impair performance.

When you partition data across multiple tables or multiple servers, queries accessing only a fraction of the data can run faster because there is less data to scan. If the tables are located on different servers, or on a computer with multiple processors, each table involved in the query can also be scanned in parallel, thereby improving query performance. Additionally, maintenance tasks, such as rebuilding indexes or backing up a table, can execute more quickly.

By using a partitioned view, the data still appears as a single table and can be queried as such without having to reference the correct underlying table manually.

Cost Threshold for Parallelism Option

Use this option to specify the threshold where SQL Server creates and executes parallel plans. SQL Server creates and executes a parallel plan for a query only when the estimated cost to execute a serial plan for the same query is higher than the value set in cost threshold for parallelism. The cost refers to an estimated elapsed time in seconds required to execute the serial plan on a specific hardware configuration. Only set cost threshold for parallelism on symmetric multiprocessors (SMP).

Max Degree of Parallelism Option

Use this option to limit the number of processors (a max of 32) to use in parallel plan execution. The default value is 0, which uses the actual number of available CPUs. Set this option to 1 to suppress parallel plan generation. Set the value to a number greater than 1 to restrict the maximum number of processors used by a single query execution.
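
A rough sketch of adjusting these settings from the command line with osql follows; the server name and the values are placeholders, and the same statements can be run from Query Analyzer instead:

rem Enable advanced options, then set the parallelism thresholds (example values only)
osql -E -S mysqlserver -Q "exec sp_configure 'show advanced options', 1; RECONFIGURE"
osql -E -S mysqlserver -Q "exec sp_configure 'cost threshold for parallelism', 5; RECONFIGURE"
osql -E -S mysqlserver -Q "exec sp_configure 'max degree of parallelism', 4; RECONFIGURE"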

Priority Boost Option

Use this option to specify whether SQL Server should run at a higher scheduling priority than other processes on the same computer. If you set this option to 1, SQL Server runs at a priority base of 13. The default is 0, which is a priority base of 7.

Set Working Set Size Option

Use this option to reserve physical memory space for SQL Server that is equal to the server memory setting. The server memory setting is configured automatically by SQL Server based on workload and available resources. It will vary dynamically between min server memory and max server memory. Setting ‘set working set size’ means the operating system will not attempt to swap out SQL Server pages even if they can be used more readily by another process when SQL Server is idle.

Optimizing Disk I/O Performance

When configuring a SQL Server that will contain only a few gigabytes of data and not sustain heavy read or write activity, you need not be particularly concerned with the subject of disk I/O and balancing of SQL Server I/O activity across hard drives for maximum performance. To build larger SQL Server databases however, which will contain hundreds of gigabytes or even terabytes of data and/or that can sustain heavy read/write activity (as in a DSS application), it is necessary to drive configuration around maximizing SQL Server disk I/O performance by load-balancing across multiple hard drives.

Partitioning for Performance

For SQL Server databases that are stored on multiple disk drives, performance can be improved by partitioning the data to increase the amount of disk I/O parallelism.

Partitioning can be done using a variety of techniques. Methods for creating and managing partitions include configuring your storage subsystem (i.e., disk, RAID partitioning) and applying various data configuration mechanisms in SQL Server such as files, file groups, tables and views. Some possible candidates for partitioning include:

• Transaction log
• Tempdb
• Database
• Tables
• Non-clustered indexes

Using bcp and BULK INSERT

Two mechanisms exist inside SQL Server to address the need for bulk movement of data. The first mechanism is the bcp utility. The second is the BULK INSERT statement.

• Bcp is a command prompt utility that copies data into or out of SQL Server.
• BULK INSERT is a Transact-SQL statement that can be executed from within the database environment. Unlike bcp, BULK INSERT can only pull data into SQL Server. An advantage of using BULK INSERT is that it can copy data into instances of SQL Server using a Transact-SQL statement, rather than having to shell out to the command prompt.

TIP

Both of these mechanisms enable you to exercise control over the batch size. Unless you are working with small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none is specified, SQL Server commits all rows to be loaded as a single batch. For example, you attempt to load 1,000,000 rows of new data into a table. The server suddenly loses power just as it finishes processing row number 999,999. When the server recovers, those 999,999 rows will need to be rolled back out of the database before you attempt to reload the data. By specifying a batch size of 10,000, you could have saved significant recovery time, because SQL Server would have only had to roll back 9,999 rows (those in the uncommitted batch) instead of 999,999.
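
For example, a batch size can be supplied on the bcp command line with -b, or in the BULK INSERT statement with the BATCHSIZE option. The sketch below is illustrative only; the database, table, file path, and delimiter are hypothetical:

   bcp SalesDW.dbo.stg_orders in D:\loads\orders.dat -c -t "|" -b 10000 -S DWSERVER -T

   osql -E -d SalesDW -Q "BULK INSERT dbo.stg_orders FROM 'D:\loads\orders.dat' WITH (FIELDTERMINATOR = '|', BATCHSIZE = 10000, TABLOCK)"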

General Guidelines for Initial Data Loads

While loading data:

• Remove indexes
• Use BULK INSERT or bcp
• Parallel load using partitioned data files into partitioned tables
• Run one load stream for each available CPU
• Set Bulk-Logged or Simple Recovery model
• Use the TABLOCK option

After loading data:

• Create indexes
• Switch to the appropriate recovery model
• Perform backups

General Guidelines for Incremental Data Loads

• Load data with indexes in place.

• Let performance and concurrency requirements determine locking granularity (sp_indexoption).

• Change from Full to Bulk-Logged Recovery model unless there is an overriding need to preserve point-in-time recovery, such as online users modifying the database during bulk loads. Read operations should not affect bulk loads.
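
As a sketch of the recovery-model switch described above (the database name and backup path are hypothetical), the load window might be bracketed as follows:

   osql -E -Q "ALTER DATABASE SalesDW SET RECOVERY BULK_LOGGED"
   REM -- run the incremental bulk load here
   osql -E -Q "ALTER DATABASE SalesDW SET RECOVERY FULL"
   osql -E -Q "BACKUP LOG SalesDW TO DISK = 'D:\backup\SalesDW_log.bak'"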


Performance Tuning Databases (Teradata)

Challenge

Database tuning can result in tremendous improvement in loading performance. This Best Practice covers tips on tuning Teradata.

Description

Teradata offers several bulk load utilities including FastLoad, MultiLoad, and TPump. FastLoad is used for loading inserts into an empty table. One of TPump’s advantages is that it does not lock the table that is being loaded. MultiLoad supports inserts, updates, deletes, and “upserts” to any table. This best practice will focus on MultiLoad since PowerCenter 5.x can auto-generate MultiLoad scripts and invoke the MultiLoad utility per PowerCenter target.

Tuning MultiLoad

There are many aspects to tuning a Teradata database. With PowerCenter 5.x several aspects of tuning can be controlled by setting MultiLoad parameters to maximize write throughput. Other areas to analyze when performing a MultiLoad job include estimating space requirements and monitoring MultiLoad performance.

Note: In PowerCenter 5.1, the Informatica server transfers data via a UNIX named pipe to MultiLoad, whereas in PowerCenter 5.0, the data is first written to file.

MultiLoad parameters

With PowerCenter 5.x, you can auto-generate MultiLoad scripts. This not only enhances development, but also allows you to set performance options. Here are the MultiLoad-specific parameters that are available in PowerCenter:

• TDPID. A client-based operand that is part of the logon string.

• Date Format. Ensure that the date format used in your target flat file is equivalent to the date format parameter in your MultiLoad script. Also validate that your date format is compatible with the date format specified in the Teradata database.

• Checkpoint. A checkpoint interval is similar to a commit interval for other databases. When you set the checkpoint value to less than 60, it represents the interval in minutes between checkpoint operations. If the checkpoint is set to a value greater than 60, it represents the number of records to write before performing a checkpoint operation. To maximize write speed to the database, try to limit the number of checkpoint operations that are performed.

• Tenacity. Interval in hours between MultiLoad attempts to log on to the database when the maximum number of sessions are already running.

• Load Mode. Available load methods include Insert, Update, Delete, and Upsert. Consider creating separate external loader connections for each method, selecting the one that will be most efficient for each target table.

• Drop Error Tables. Allows you to specify whether to drop or retain the three error tables for a MultiLoad session. Set this parameter to 1 to drop error tables or 0 to retain error tables.

• Max Sessions. Available only in PowerCenter 5.1, this parameter specifies the maximum number of sessions that are allowed to log on to the database. This value should not exceed one per working AMP (Access Module Processor).

• Sleep. Available only in PowerCenter 5.1, this parameter specifies the number of minutes that MultiLoad waits before retrying a logon operation.

Estimating Space Requirements for MultiLoad Jobs

Always estimate the final size of your MultiLoad target tables and make sure the destination has enough space to complete your MultiLoad job. In addition to the space that may be required by target tables, each MultiLoad job needs permanent space for:

• Work tables
• Error tables
• Restart Log table

Note: Spool space cannot be used for MultiLoad work tables, error tables, or the restart log table. Spool space is freed at each restart. By using permanent space for the MultiLoad tables, data is preserved for restart operations after a system failure. Work tables, in particular, require a lot of extra permanent space. Also remember to account for the size of error tables since error tables are generated for each target table.

Use the following formula to prepare the preliminary space estimate for one target table, assuming no fallback protection, no journals, and no non-unique secondary indexes:

PERM = (using data size + 38) x (number of rows processed) x (number of apply conditions satisfied) x (number of Teradata SQL statements within the applied DML)
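
For example (using purely hypothetical numbers): with a USING data size of 122 bytes, 1,000,000 rows processed, one apply condition, and one DML statement, the preliminary estimate would be PERM = (122 + 38) x 1,000,000 x 1 x 1 = 160,000,000 bytes, or roughly 153 MB of permanent space for that one target table.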

Make adjustments to your preliminary space estimates according to the requirements and expectations of your MultiLoad job.

Monitoring MultiLoad Performance

Here are some tips for analyzing MultiLoad performance:

1. Determine which phase of the MultiLoad job is causing poor performance.


• If the performance bottleneck is during the acquisition phase, as data is acquired from the client system, then the issue may be with the client system. If it is during the application phase, as data is applied to the target tables, then the issue is not likely to be with the client system.

• The MultiLoad job output lists the job phases and other useful information. Save these listings for evaluation.

2. Use the Teradata RDBMS Query Session utility to monitor the progress of the MultiLoad job.

3. Check for locks on the MultiLoad target tables and error tables.

4. Check the DBC.Resusage table for problem areas, such as data bus or CPU capacities at or near 100 percent for one or more processors.

5. Determine whether the target tables have non-unique secondary indexes (NUSIs). NUSIs degrade MultiLoad performance because the utility builds a separate NUSI change row to be applied to each NUSI sub-table after all of the rows have been applied to the primary table.

6. Check the size of the error tables. Write operations to the fallback error tables are performed at normal SQL speed, which is much slower than normal MultiLoad tasks.

7. Verify that the primary index is unique. Non-unique primary indexes can cause severe MultiLoad performance problems.


Performance Tuning UNIX Systems

Challenge

Identify opportunities for performance improvement within the complexities of the UNIX operating environment.

Description

This section provides an overview of the subject area, followed by discussion of detailed usage of specific tools.

Overview

All system performance issues are basically resource contention issues. In any computer system, there are four fundamental resources: CPU, memory, disk I/O, and network I/O. From this standpoint, performance tuning for PowerCenter means ensuring that the PowerCenter Server and its sub-processes get adequate resources to execute in a timely and efficient manner.

Each resource has its own particular set of problems. Resource problems are complicated because all resources interact with one another. Performance tuning is about identifying bottlenecks and making trade-offs to improve the situation. Your best approach is to initially take a baseline measurement and characterize the system so that you have a good understanding of how it behaves. Then evaluate the bottlenecks that appear on each system resource during your load window, and determine which resource contention, if removed, offers the greatest opportunity for performance enhancement.

Here is a summary of each system resource area and the problems it can have.

CPU

• On any multiprocessing and multiuser system many processes want to use the CPUs at the same time. The UNIX kernel is responsible for allocation of a finite number of CPU cycles across all running processes. If the total demand on the CPU exceeds its finite capacity, then all processing will reflect a negative impact on performance; the system scheduler will put each process in a queue to wait for CPU availability.


• An average of the count of active processes in the system for the last 1, 5, and 15 minutes is reported as the load average when you execute the command uptime. The load average provides a basic indicator of the number of contenders for CPU time. Likewise, the vmstat command provides the average usage of all the CPUs along with the number of processes contending for CPU (the value under the r column).

• On SMP (symmetric multiprocessing) architecture servers, watch for even utilization of all the CPUs. How well all the CPUs are utilized depends on how well an application can be parallelized. If a process is incurring a high degree of involuntary context switches by the kernel, binding the process to a specific CPU might improve performance.
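
A minimal sketch of the checks described above (exact options and output columns vary by UNIX flavor):

   uptime              # load average for the last 1, 5, and 15 minutes
   vmstat 5 5          # CPU usage per interval; the r column shows processes waiting for a CPU
   mpstat 5 5          # per-CPU utilization on SMP servers (Solaris/Linux), to check for uneven usage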

Memory

• Memory contention arises when the memory requirements of the active processes exceed the physical memory available on the system; at this point, the system is out of memory. To handle this lack of memory the system starts paging, or moving portions of active processes to disk in order to reclaim physical memory. At this point, performance decreases dramatically. Paging is distinguished from swapping, which means moving entire processes to disk and reclaiming their space. Paging and excessive swapping indicate that the system can't provide enough memory for the processes that are currently running.

• Commands such as vmstat and pstat show whether the system is paging; ps, prstat and sar can report the memory requirements of each process.

Disk IO

• The I/O subsystem is a common source of resource contention problems. A finite amount of I/O bandwidth must be shared by all the programs (including the UNIX kernel) that currently run. The system's I/O buses can transfer only so many megabytes per second; individual devices are even more limited. Each kind of device has its own peculiarities and, therefore, its own problems.

• There are tools for evaluating specific parts of the subsystem:

o iostat can give you information about the transfer rates for each disk drive

o ps and vmstat can give some information about how many processes are blocked waiting for I/O

o sar can provide voluminous information about I/O efficiency

o sadp can give detailed information about disk access patterns

Network IO

• It is very likely the source data, the target data or both are connected through an Ethernet channel to the system where PowerCenter is residing. Take into consideration the number of Ethernet channels and bandwidth available to avoid congestion.

o netstat shows packet activity on a network; watch for a high collision rate for output packets on each interface.


o nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not from the NFS server); watch for a high timeout rate relative to total calls and for "not responding" messages.

Given that these issues all boil down to access to some computing resource, mitigation of each issue consists of making some adjustment to the environment to provide more (or preferential) access to the resource; for instance:

• Adjusting execution schedules to leverage low-usage times may improve the availability of memory, disk, network bandwidth, CPU cycles, etc.

• Migrating other applications to other hardware will reduce demand on the hardware hosting PowerCenter

• For CPU intensive sessions, raising CPU priority (or lowering priority for competing processes) provides more CPU time to the PowerCenter sessions

• Adding hardware resources, such as more memory, will make more resources available to all processes

• Re-configuring existing resources may provide for more efficient usage, such as assigning different disk devices for input and output, striping disk devices, or adjusting network packet sizes

Detailed Usage

The following tips have proven useful in performance tuning UNIX-based machines. While some of these tips will be more helpful than others in a particular environment, all are worthy of consideration.

Availability, syntax and format of each will vary across UNIX versions.

Running ps -axu

Run ps -axu to check for the following items:

• Are there any processes waiting for disk access or for paging? If so check the I/O and memory subsystems.

• What processes are using most of the CPU? This may help you distribute the workload better.

• What processes are using most of the memory? This may help you distribute the workload better.

• Does ps show that your system is running many memory-intensive jobs? Look for jobs with a large resident set size (RSS) or a high storage integral.
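
For example, a quick way to surface the heaviest consumers (BSD-style ps output, where %CPU and %MEM are the third and fourth columns):

   ps -axu | sort -rn -k3 | head       # processes using the most CPU
   ps -axu | sort -rn -k4 | head       # processes using the most memory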

Identifying and Resolving Memory Issues

Use vmstat or sar to check for paging/swapping actions. Check the system to ensure that excessive paging/swapping does not occur at any time during the session processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of paging/swapping. If paging or excessive swapping does occur at any time, increase memory to prevent it. Paging/swapping, on any database system, causes a major performance decrease and increased I/O. On a memory-starved and I/O-bound server, this can effectively shut down the PowerCenter process and any databases running on the server.

Some swapping may occur normally regardless of the tuning settings. This occurs because some processes use the swap space by their design. To check swap space availability, use pstat and swap. If the swap space is too small for the intended applications, it should be increased.

Run vmstat 5 (sar -wpgr for SunOS, or vmstat -S 5) to detect and confirm memory problems and check for the following:

• Are page-outs occurring consistently? If so, you are short of memory.

• Are there a high number of address translation faults? (System V only) This suggests a memory shortage.

• Are swap-outs occurring consistently? If so, you are extremely short of memory. Occasional swap-outs are normal; BSD systems swap out inactive jobs. Long bursts of swap-outs mean that active jobs are probably falling victim and indicate extreme memory shortage. If you don't have vmstat -S, look at the w and de fields of vmstat. These should ALWAYS be zero.
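
A sketch of the memory checks above, assuming a Solaris-style toolset (flags differ on other UNIX variants):

   vmstat -S 5 5       # report swap-ins and swap-outs instead of the paging columns
   sar -wpgr 5 5       # swapping, paging, and free-memory activity at 5-second intervals
   swap -l             # list swap devices and the free space remaining on each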

If memory seems to be the bottleneck of the system, try the following remedial steps:

• Reduce the size of the buffer cache, if your system has one, by decreasing BUFPAGES.

• If you have statically allocated STREAMS buffers, reduce the number of large (2048- and 4096-byte) buffers. This may reduce network performance, but netstat -m should give you an idea of how many buffers you really need.

• Reduce the size of your kernel's tables. This may limit the system's capacity (number of files, number of processes, etc.).

• Try running jobs requiring a lot of memory at night. This may not help the memory problems, but you may not care about them as much.

• Try running jobs requiring a lot of memory in a batch queue. If only one memory-intensive job is running at a time, your system may perform satisfactorily.

• Try to limit the time spent running sendmail, which is a memory hog.

• If you don't see any significant improvement, add more memory.

Identifying and Resolving Disk I/O Issues

Use iostat to check I/O load and utilization, as well as CPU load. iostat can be used to monitor the I/O load on the disks on the UNIX server, and permits monitoring the load on specific disks. Take notice of how fairly disk activity is distributed among the system disks. If it is not, are the most active disks also the fastest disks?

Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area of the disk (good), spread evenly across the disk (tolerable), or in two well-defined peaks at opposite ends (bad)?
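
For example, the following commands (options vary by platform) report per-device load so you can see whether activity is spread evenly across disks:

   iostat -x 5 5       # extended per-device statistics: busy percentage, service time, queue length
   sar -d 5 5          # block-device activity from the system activity reporter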


• Reorganize your file systems and disks to distribute I/O activity as evenly as possible.

• Using symbolic links helps to keep the directory structure the same throughout while still moving the data files that are causing I/O contention.

• Use your fastest disk drive and controller for your root file system; this will almost certainly have the heaviest activity. Alternatively, if single-file throughput is important, put performance-critical files into one file system and use the fastest drive for that file system.

• Put performance-critical files on a file system with a large block size: 16KB or 32KB (BSD).

• Increase the size of the buffer cache by increasing BUFPAGES (BSD). This may hurt your system's memory performance.

• Rebuild your file systems periodically to eliminate fragmentation (backup, build a new file system, and restore).

• If you are using NFS and using remote files, look at your network situation; in that case you don't have local disk I/O problems.

• Check memory statistics again by running vmstat 5 (sar -rwpg). If your system is paging or swapping consistently, you have memory problems; fix the memory problem first. Swapping makes performance worse.

If your system has disk capacity problem and is constantly running out of disk space, try the following actions:

• Write a find script that detects old core dumps, editor backup and auto-save files, and other trash and deletes it automatically. Run the script through cron.

• Use the disk quota system (if your system has one) to prevent individual users from gathering too much storage.

• Use a smaller block size on file systems that are mostly small files (e.g., source code files, object modules, and small data files).

Identifying and Resolving CPU Overload Issues

Use uptime or sar -u to check for CPU loading. sar provides more detail, including %usr (user), %sys (system), %wio (waiting on I/O), and %idle (percent of idle time). A target goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10. If %wio is higher, the disk and I/O contention should be investigated to eliminate the I/O bottleneck on the UNIX server. If the system shows a heavy %sys load and a high %idle, this is indicative of memory contention and swapping/paging problems. In this case, it is necessary to make memory changes to reduce the load on the server.

When you run iostat 5 above, also observe for CPU idle time. Is the idle time always 0, without letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the time, work must be piling up somewhere. This points to CPU overload.
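
A brief sketch of the CPU checks described above (the ps column layout is platform-dependent):

   sar -u 5 10                                  # %usr, %sys, %wio, and %idle at 5-second intervals
   uptime                                       # confirm the load-average trend
   ps -eo pid,pcpu,args | sort -rn -k2 | head   # list the processes consuming the most CPU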

• Eliminate unnecessary daemon processes. rwhod and routed are particularly likely to be performance problems, but any savings will help.

• Get users to run jobs at night with at or any queuing system that's available. You may not care if the CPU (or the memory or I/O system) is overloaded at night, provided the work is done in the morning.

• Using nice to lower the priority of CPU-bound jobs will improve interactive performance. Likewise, using nice to raise the priority of CPU-bound jobs will expedite them but will hurt interactive performance. In general though, using nice is really only a temporary solution. If your workload grows, it will soon become insufficient. Consider upgrading your system, replacing it, or buying another system to share the load.

Identifying and Resolving Network IO Issues

You can suspect problems with network capacity or with data integrity if users experience slow performance when they are using rlogin or when they are accessing files via NFS.

Look at netstat -i. If the number of collisions is large, suspect an overloaded network. If the number of input or output errors is large, suspect hardware problems. A large number of input errors indicates problems somewhere on the network. A large number of output errors suggests problems with your system and its interface to the network.

If collisions and network hardware are not a problem, figure out which system appears to be slow. Use spray to send a large burst of packets to the slow system. If the number of dropped packets is large, the remote system most likely cannot respond to incoming data fast enough. Look to see if there are CPU, memory or disk I/O problems on the remote system. If not, the system may just not be able to tolerate heavy network workloads. Try to reorganize the network so that this system isn’t a file server.

A large number of dropped packets may also indicate data corruption. Run netstat -s on the remote system, then spray the remote system from the local system and run netstat -s again. If the increase in UDP socket full drops (as indicated by netstat) is equal to or greater than the number of dropped packets that spray reports, the remote system is a slow network server. If the increase in socket full drops is less than the number of dropped packets, look for network errors.

Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent of calls, the network or an NFS server is overloaded. If timeout is high, at least one NFS server is overloaded, the network may be faulty, or one or more servers may have crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If timeout and retrans are high, but badxid is low, some part of the network between the NFS client and server is overloaded and dropping packets.
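
The commands referenced above can be run roughly as follows (remotehost is a placeholder for the system being tested):

   netstat -i          # per-interface packet counts, errors, and collisions
   netstat -s          # protocol statistics, including UDP socket full drops
   spray remotehost    # send a burst of packets to gauge how many are dropped
   nfsstat -c          # client-side RPC statistics: retrans, timeout, and badxid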

Try to prevent users from running I/O-intensive programs across the network. The grep utility is a good example of an I/O-intensive program. Instead, have users log into the remote system to do their work.

Reorganize the computers and disks on your network so that as many users as possible can do as much work as possible on a local system.

Use systems with good network performance as file servers.

lsattr -E -l sys0 is used to determine some current settings on some UNIX environments (for example, AIX); on Solaris, the equivalent parameter (maxuprc) is set in /etc/system. Of particular interest is maxuproc, the setting that determines the maximum number of processes a user can run. On most UNIX environments, this defaults to 40, but it should be increased to 250 on most systems.
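
For example, on AIX the setting can be checked and raised with the following commands (the value 250 reflects the general guideline above, not a universal recommendation):

   lsattr -E -l sys0 -a maxuproc       # display the current per-user process limit
   chdev -l sys0 -a maxuproc=250       # raise the limit to 250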

Choose a File System

Be sure to check the database vendor documentation to determine the best file system for the specific machine. Typical choices include: s5, the UNIX System V file system; ufs, the UNIX file system derived from Berkeley (BSD); vxfs, the Veritas file system; and lastly raw devices, which in reality are not a file system at all.

Use the pmprocs utility (a PowerCenter utility) to view the current Informatica processes. For example:

harmon 125: pmprocs
<------------ Current PowerMart processes --------------->
UID      PID   PPID  C  STIME    TTY  TIME  CMD
powermar 2711  1421  16 18:13:11 ?    0:07  dtm pmserver.cfg 0 202 -289406976
powermar 2713  2711  11 18:13:17 ?    0:05  dtm pmserver.cfg 0 202 -289406976
powermar 1421  1     1  08:39:19 ?    1:30  pmserver
powermar 2712  2711  17 18:13:17 ?    0:08  dtm pmserver.cfg 0 202 -289406976
powermar 2714  1421  11 18:13:20 ?    0:04  dtm pmserver.cfg 1 202 -289406976
powermar 2721  2714  12 18:13:27 ?    0:04  dtm pmserver.cfg 1 202 -289406976
powermar 2722  2714  8  18:13:27 ?    0:02  dtm pmserver.cfg 1 202 -289406976
<------------ Current Shared Memory Resources --------------->
IPC status from <running system> as of Tue Feb 16 18:13:55 1999
T  ID    KEY        MODE        OWNER    GROUP  SEGSZ    CPID LPID
Shared Memory:
m  0     0x094e64a5 --rw-rw---- oracle   dba    20979712 1254 1273
m  1     0x0927e9b2 --rw-rw---- oradba   dba    21749760 1331 2478
m  202   00000000   --rw------- powermar pm4    5000000  1421 2714
m  8003  00000000   --rw------- powermar pm4    25000000 2711 2711
m  4     00000000   --rw------- powermar pm4    25000000 2714 2714
<------------ Current Semaphore Resources --------------->

There are 19 Semaphores held by PowerMart processes

A few points about the Pmprocs utility:

• pmprocs is a script that combines the ps and ipcs commands
• Only available for UNIX
• CPID - Creator PID
• LPID - Last PID that accessed the resource
• Semaphores - used to sync the reader and writer
• 0 or 1 - shows slot in LM shared memory


Performance Tuning Windows NT/2000 Systems

Challenge

The Microsoft Windows NT/2000 environment is easier to tune than UNIX environments but offers limited performance options. NT is considered a “self-tuning” operating system because it attempts to configure and tune memory to the best of its ability. However, this does not mean that the NT System Administrator is entirely free from performance improvement responsibilities.

Note: Tuning is essentially the same for both NT and 2000 based systems, with differences for Windows 2000 noted in the last section.

Description

The following tips have proven useful in performance tuning NT-based machines. While some are likely to be more helpful than others in any particular environment, all are worthy of consideration.

The two places to begin tuning an NT server are:

• Performance Monitor.
• Performance tab (hit Ctrl+Alt+Del, choose Task Manager, and click on the Performance tab).

Although the Performance Monitor can be tracked in real-time, creating a result-set representative of a full day is more likely to render an accurate view of system performance.

Resolving Typical NT Problems

The following paragraphs describe some common performance problems in an NT environment and suggest tuning solutions.

Load reasonableness. Assume that some software will not be well coded, and that some background processes (e.g., a mail server or web server) running on a single machine can potentially starve the machine's CPUs. In this situation, off-loading the CPU hogs may be the only recourse.


Device Drivers. The device drivers for some types of hardware are notorious for inefficient CPU clock cycles. Be sure to obtain the latest drivers from the hardware vendor to minimize this problem.

Memory and services. Although adding memory to NT is always a good solution, it is also expensive and usually must be planned to support the BANK system for EISA and PCI architectures. Before adding memory, check the Services in Control Panel because many background applications do not uninstall the old service when installing a new version. Thus, both the unused old service and the new service may be using valuable CPU memory resources.

I/O Optimization. This is, by far, the best tuning option for database applications in the NT environment. If necessary, level the load across the disk devices by moving files. In situations where there are multiple controllers, be sure to level the load across the controllers too.

Using electrostatic devices and fast-wide SCSI can also help to increase performance. Further, fragmentation can usually be eliminated by using a Windows NT/2000 disk defragmentation product, regardless of whether the disk is formatted for FAT or NTFS.

Finally, on NT servers, be sure to implement disk striping to split single data files across multiple disk drives and take advantage of RAID (Redundant Arrays of Inexpensive Disks) technology. Also increase the priority of the disk devices on the NT server. NT, by default, sets the disk device priority low. Change the disk priority setting in the Registry at service\lanman\server\parameters and add a key for ThreadPriority of type DWORD with a value of 2.

Monitoring System Performance in Windows 2000

In Windows 2000, the PowerCenter Server uses system resources to process transformations, run sessions, and read and write data. The PowerCenter Server also uses system memory for other data such as aggregate, joiner, rank, and cached lookup tables. With Windows 2000, you can use the System Monitor in the Performance Console of the administrative tools, or the system tools in the Task Manager, to monitor the amount of system resources used by the PowerCenter Server and to identify system bottlenecks.

Windows 2000 provides the following tools (accessible under the Control Panel/Administration Tools/Performance) for monitoring resource usage on your computer:

• System Monitor
• Performance Logs and Alerts

These Windows 2000 monitoring tools enable you to analyze usage and detect bottlenecks at the disk, memory, processor, and network level.

System Monitor

The System Monitor displays a graph that is flexible and configurable. You can copy counter paths and settings from the System Monitor display to the Clipboard, and paste counter paths from Web pages or other sources into the System Monitor display. Because the System Monitor is portable, it is useful in monitoring other systems that require administration.

Note: Typing perfmon.exe at the command prompt causes the system to start System Monitor, not Performance Monitor.

Performance Logs and Alerts

The Performance Logs and Alerts tool provides two types of performance-related logs—counter logs and trace logs—and an alerting function.

Counter logs record sampled data about hardware resources and system services based on performance objects and counters, in the same manner as System Monitor. They can, therefore, be viewed in System Monitor. Data in counter logs can be saved as comma-separated or tab-separated files that are easily viewed with Excel.

Trace logs collect event traces that measure performance statistics associated with events such as disk and file I/O, page faults, or thread activity.

The alerting function allows you to define a counter value that will trigger actions such as sending a network message, running a program, or starting a log. Alerts are useful if you are not actively monitoring a particular counter threshold value, but want to be notified when it exceeds or falls below a specified value so that you can investigate and determine the cause of the change. You may want to set alerts based on established performance baseline values for your system.

Note: You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log_Queries.)

The predefined log settings under Counter Logs (i.e., System Overview) are configured to create a binary log that, after manual start-up, updates every 15 seconds and logs continuously until it achieves a maximum size. If you start logging with the default settings, data is saved to the Perflogs folder on the root directory and includes the counters: Memory\Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and Processor(_Total)\% Processor Time.

If you want to create your own log setting, right-click one of the log types.


Platform Sizing

Challenge

Determining the appropriate platform size to support the PowerCenter environment based on customer environments and requirements.

Description

The required platform size to support PowerCenter depends on each customer’s unique environment and processing requirements. The PowerCenter engine allocates resources for individual extraction, transformation, and load (ETL) jobs or sessions. Each session has its own resource requirements. The resources required for the PowerCenter engine depend on the number of sessions, what each session does while moving data, and how many sessions run concurrently. This Best Practice outlines the relevant questions pertinent to estimating the platform requirements.

TIP

An important concept regarding platform sizing is not to size your environment too soon in the project lifecycle. Too often, clients size their machines before any ETL is designed or developed, and in many cases these platforms are too small for the resultant system. Thus, it is better to analyze sizing requirements after the data transformation processes have been well defined during the design and development phases.

Environment Questions

When considering a platform size, you should consider the following questions regarding your environment:

• What sources do you plan to access?
• How do you currently access those sources?
• Have you decided on the target environment (database/hardware/operating system)? If so, what is it?
• Have you decided on the PowerCenter server environment (hardware/operating system)?
• Is it possible for the PowerCenter server to be on the same machine as the target?
• How do you plan to access your information (cube, ad-hoc query tool), and what tools will you use to do this?
• What other applications or services, if any, run on the PowerCenter server?
• What are the latency requirements for the PowerCenter loads?

Engine Sizing Questions

When considering the engine size, you should consider the following questions:

• Is the overall ETL task currently being done? If so, how do you do it, and how long does it take?

• What is the total volume of data to move?
• What is the largest table (bytes and rows)? Is there any key on this table that could be used to partition load sessions, if needed?
• How often will the refresh occur?
• Will refresh be scheduled at a certain time, or driven by external events?
• Is there a "modified" timestamp on the source table rows?
• What is the batch window available for the load?
• Are you doing a load of detail data, aggregations, or both?
• If you are doing aggregations, what is the ratio of source/target rows for the largest result set? How large is the result set (bytes and rows)?

The answers to these questions offer an approximate guide to the factors that affect PowerCenter's resource requirements. To simplify the analysis, you can focus on large jobs that drive the resource requirement.

Engine Resource Consumption

The following sections summarize some recommendations on the PowerCenter engine's resource consumption.

Processor

1-1.5 CPUs per concurrent non-partitioned session or transformation job.

Memory

• 20 to 30MB of memory for the main engine for session coordination.

• 20 to 30MB of memory per session, if there are no aggregations, lookups, or heterogeneous data joins. Note that 32-bit systems have an operating system limitation of 3GB per session.

• Caches for aggregation, lookups, or joins use additional memory:

• Lookup tables are cached in full; the memory consumed depends on the size of the tables.

• Aggregate caches store the individual groups; more memory is used if there are more groups.

• Sorting the input to aggregations greatly reduces the need for memory.

• Joins cache the master table in a join; the memory consumed depends on the size of the master.

Disk space


Disk space is not a factor if the machine is used only as the PowerCenter engine, unless you have the following conditions:

• Data is staged to flat files on the PowerCenter server machine.

• Data is stored in incremental aggregation files for adding data to aggregates. The space consumed is about the size of the data aggregated.

• Temporary space is needed for paging for transformations that require large caches that cannot be entirely cached by system memory.

Sizing analysis

The basic goal is to size the machine so that all jobs can complete within the specified load window. You should consider the answers to the questions in the "Environment" and "Engine Sizing" sections to estimate the required number of sessions, the volume of data that each session moves, and its lookup table, aggregation, and heterogeneous join caching requirements. Use these estimates with the recommendations in the "Engine Resource Consumption" section to determine the required number of processors, memory, and disk space to achieve the required performance to meet the load window.

Note that the deployment environment often creates performance constraints that hardware capacity cannot overcome. The engine throughput is usually constrained by one or more of the environmental factors addressed by the questions in the "Environment" section. For example, if the data sources and target are both remote from the PowerCenter server, the network is often the constraining factor. At some point, additional sessions, processors, and memory might not yield faster execution because the network (not the PowerCenter server) imposes the performance limit. The hardware sizing analysis is highly dependent on the environment in which the server is deployed. You need to understand the performance characteristics of the environment before making any sizing conclusions.

It is also vitally important to remember that it is likely that other applications in addition to PowerCenter may use the platform. It is very common for PowerCenter to run on a server with a database engine and query/analysis tools. In fact, in an environment where PowerCenter, the target database, and query/analysis tools all run on the same machine, the query/analysis tool often drives the hardware requirements. However, if the loading is performed after business hours, the query/analysis tools requirements may not be a sizing limitation.


Recommended Performance Tuning Procedures

Challenge

Sometimes it is necessary to employ a series of performance tuning procedures in order to optimize PowerCenter load times.

Description

When a PowerCenter session or workflow is not performing at the expected or desired speed, there is a methodology that can be followed to help diagnose any problems that might be adversely affecting components of the data integration architecture. While PowerCenter has its own performance settings that can be tuned, the entire data integration architecture, including the UNIX/Windows servers, network, disk array, and the source and target databases, must also be considered. More often than not, it is an issue external to PowerCenter that is the cause of the performance problem. In order to correctly and scientifically determine the most likely cause of the performance problem, it is necessary to execute the performance tuning steps in a specific order. This allows you to methodically rule out individual pieces and narrow down the specific areas on which to focus your tuning efforts.

1. Perform Benchmarking

You should always have a baseline of your current load times for a given workflow or session with a similar record count. Perhaps you are not meeting your required load window, or you simply think your processes could run more efficiently because other similar tasks currently run faster than the problem process. Use this benchmark to estimate what your desired performance goal should be, and tune toward this goal. Start with the problem mapping you have created, along with a session and workflow that use all default settings. This allows you to systematically see exactly which of the changes you make have a positive impact on performance.

2. Identify The Performance Bottleneck Area

This step will help greatly in narrowing down the areas in which to begin focusing. There are five areas to focus on when performing the bottleneck diagnosis. The areas in order of focus are:

• Target
• Source
• Mapping
• Session/Workflow
• System

The methodology steps you through a series of proven tests using PowerCenter to identify trends that point to where you should focus next. Remember to go through these tests in a scientific manner, running them multiple times before drawing a conclusion, and also realize that identifying and fixing one bottleneck area may create a different bottleneck. For more information, see Determining Bottlenecks.

3. Optimize "Inside" or "Outside" PowerCenter

Depending on the results of the bottleneck tests, optimize “inside” or “outside” PowerCenter. Be sure to perform the bottleneck test in the order prescribed in Determining Bottlenecks, since this is also the order in which you will make any performance changes.

Problems “outside” PowerCenter refers to anything you find that indicates that the source of the performance problem is outside of the PowerCenter mapping design or workflow/session settings. This usually means a source/target database problem, network bottleneck, or a server operating system problem. These are the most common performance problems.

• For Source database related bottlenecks, refer to the Tuning SQL Overrides and Environment for Better Performance

• For Target database related problems, refer to Performance Tuning Databases - Oracle, SQL Server or Teradata

• For operating system problems, refer to the Performance Tuning UNIX Systems or Performance Tuning Windows NT/2000 Systems for more information.

Problems “inside” PowerCenter refers to anything that PowerCenter controls, such as actual transformation logic, and PowerCenter Workflow/Session settings. The session settings contain quite a few memory settings and partitioning options that can greatly increase performance. Refer to the Tuning Sessions for Better Performance for more information.

There are certain procedures to look at to optimize mappings; however be careful, because in most cases, the mapping design is dictated by business logic. This means that while there may be a more efficient way to perform the business logic within the mapping, the actual necessary business logic cannot be ignored simply to increase performance. Refer to Tuning Mappings for Better Performance for more information.

4. Re-Execute the Problem Workflow or Session

Re-execute the problem workflow or session, then benchmark the load performance against the baseline. This step is iterative, and should be performed after any performance-based setting is changed. You are trying to answer the question, “Did your performance change make a positive impact?” If so, move on to the next bottleneck. Be sure to make detailed documentation at every step along the way so you have a clear path as to what has and hasn’t been tried.


After the recommended steps have been taken for each relevant performance bottleneck, re-run the problem workflow or session and compare the results to the benchmark. Hopefully, you have met your initial performance goal and made a significant performance impact. While it may seem like there are an enormous number of areas where a performance problem can arise, if you follow the steps for finding your bottleneck and apply some of the tuning techniques specific to it, you will achieve your desired performance gain.


Tuning Mappings for Better Performance

Challenge

In general, mapping-level optimization takes time to implement, but can significantly boost performance. Sometimes the mapping is the biggest bottleneck in the load process because business rules determine the number and complexity of transformations in a mapping.

Before deciding on the best route to optimize the mapping architecture, you need to resolve some basic issues. Tuning mappings is a grouped approach. The first group can be of assistance almost universally, bringing about a performance increase in all scenarios. The second group of tuning processes may yield only a small performance increase, or can be of significant value, depending on the situation.

Some factors to consider when choosing tuning processes at the mapping level include the specific environment, software/ hardware limitations, and the number of rows going through a mapping. This Best Practice offers some guidelines for tuning mappings.

Description

Analyze mappings for tuning only after you have tuned the target and source for peak performance. To optimize mappings, you generally reduce the number of transformations in the mapping and delete unnecessary links between transformations.

For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations), limit connected input/output or output ports. Doing so can reduce the amount of data the transformations store in the data cache. Having too many Lookups and Aggregators can encumber performance because each requires index cache and data cache. Since both are fighting for memory space, decreasing the number of these transformations in a mapping can help improve speed. Splitting them up into different mappings is another option.

Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O activity on the cache directory. Unless the seek/access time is fast on the directory itself, having too many Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes contention of disk and memory, which can lead to thrashing, leaving insufficient memory to run a mapping efficiently.

Consider Single-Pass Reading


If several mappings use the same data source, consider a single-pass reading. Consolidate separate mappings into one mapping with either a single Source Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the separate data flows.

Similarly, if a function is used in several mappings, a single-pass reading reduces the number of times that function is called in the session.

Optimize SQL Overrides

When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in the update override of a target object, be sure the SQL statement is tuned. The extent to which and how SQL can be tuned depends on the underlying source or target database system. See the section Tuning SQL Overrides and Environment for Better Performance for more information.

Scrutinize Datatype Conversions

PowerCenter Server automatically makes conversions between compatible datatypes. When these conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from an integer port to a decimal port, then back to an integer port, the conversion may be unnecessary.

In some instances however, datatype conversions can help improve performance. This is especially true when integer values are used in place of other datatypes for performing comparisons using Lookup and Filter transformations.

Eliminate Transformation Errors

Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During transformation errors, the PowerCenter Server engine pauses to determine the cause of the error, removes the row causing the error from the data flow, and logs the error in the session log.

Transformation errors can be caused by many things including: conversion errors, conflicting mapping logic, any condition that is specifically set up as an error, and so on. The session log can help point out the cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for these transformations. Any source of errors should be traced and eliminated.

Optimize Lookup Transformations

There are a number of ways to optimize lookup transformations that are setup in a mapping.

When to cache lookups

When caching is enabled, the PowerCenter Server caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the PowerCenter Server queries the lookup table on a row-by-row basis. NOTE: All the tuning options mentioned in this Best Practice assume that memory and cache sizing for lookups are sufficient to ensure that caches will not page to disk. Information regarding memory and cache sizing for Lookup transformations is covered in the Best Practice: Tuning Sessions for Better Performance.

A better rule of thumb than memory size is to determine the size of the potential lookup cache with regard to the number of rows expected to be processed. Consider the following example.

In Mapping X, the source and lookup contain the following number of records:

ITEMS (source): 5000 records

MANUFACTURER: 200 records

DIM_ITEMS: 100000 records

Number of Disk Reads

                          Cached Lookup    Un-cached Lookup
LKP_Manufacturer
  Build Cache                       200                   0
  Read Source Records              5000                5000
  Execute Lookup                      0                5000
  Total # of Disk Reads            5200               10000

LKP_DIM_ITEMS
  Build Cache                    100000                   0
  Read Source Records              5000                5000
  Execute Lookup                      0                5000
  Total # of Disk Reads          105000               10000

Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a total of 5,200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it will take a total of 10,000 disk reads to execute the lookup. In this case, the number of records in the lookup table is small in comparison with the number of times the lookup is executed. So this lookup should be cached. This is the more likely scenario.

Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in 105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk reads would total 10,000. In this case the number of records in the lookup table is not small in comparison with the number of times the lookup will be executed. Thus, the lookup should not be cached.

Use the following eight-step method to determine if a lookup should be cached:

1. Code the lookup into the mapping.

2. Select a standard set of data from the source. For example, add a WHERE clause on a relational source to load a sample 10,000 rows.

3. Run the mapping with caching turned off and save the log.

4. Run the mapping with caching turned on and save the log to a different name than the log created in step 3.

5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.

6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND = NRS.

7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.

8. Use the following formula to find the breakeven row point: (LS*NRS*CRS)/(CRS-NRS) = X, where X is the breakeven point. If your expected source records is less than X, it is better not to cache the lookup. If your expected source records is more than X, it is better to cache the lookup. For example: Assume the lookup takes 166 seconds to cache (LS=166). Assume with a cached lookup the load is 232 rows per second (CRS=232). Assume with a non-cached lookup the load is 147 rows per second (NRS=147). The formula would result in: (166*147*232)/(232-147) = 66,603. Thus, if the source has fewer than 66,603 records, the lookup should not be cached. If it has more than 66,603 records, then the lookup should be cached.

Sharing lookup caches

There are a number of methods for sharing lookup caches:

• Within a specific session run for a mapping, if the same lookup is used multiple times in a mapping, the PowerCenter Server will re-use the cache for the multiple instances of the lookup. Using the same lookup multiple times in the mapping will be more resource intensive with each successive instance. If multiple cached lookups are from the same table but are expected to return different columns of data, it may be better to setup the multiple lookups to bring back the same columns even though not all return ports are used in all lookups. Bringing back a common set of columns may reduce the number of disk reads.

• Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a persistent cache is set in the lookup properties, the memory cache created for the lookup during the initial run is saved to the PowerCenter Server. This can improve performance because the Server builds the memory cache from cache files instead of the database. This feature should only be used when the lookup table is not expected to change between session runs.


• Across different mappings and sessions, the use of a named persistent cache allows sharing of an existing cache file.

Reducing the number of cached rows

There is an option to use a SQL override in the creation of a lookup cache. Options can be added to the WHERE clause to reduce the set of records included in the resulting cache.

NOTE: If you use a SQL override in a lookup, the lookup must be cached.

Optimizing the lookup condition

In the case where a lookup uses more than one lookup condition, set the conditions with an equal sign first in order to optimize lookup performance.

Indexing the lookup table

The PowerCenter Server must query, sort, and compare values in the lookup condition columns. As a result, indexes on the database table should include every column used in a lookup condition. This can improve performance for both cached and un-cached lookups.

• In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement used to create the cache. Columns used in the ORDER BY condition should be indexed. The session log will contain the ORDER BY statement.

• In the case of an un-cached lookup, since a SQL statement is created for each row passing into the lookup transformation, performance can be helped by indexing columns in the lookup condition.

Optimize Filter and Router Transformations

Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve performance.

Avoid complex expressions when creating the filter condition. Filter transformations are most effective when a simple integer or TRUE/FALSE expression is used in the filter condition.

Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if rejected rows do not need to be saved.

Replace multiple filter transformations with a router transformation. This reduces the number of transformations in the mapping and makes the mapping easier to follow.

Optimize aggregator transformations


Aggregator Transformations often slow performance because they must group data before processing it.

Use simple columns in the group by condition to make the Aggregator Transformation more efficient. When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex expressions in the Aggregator expressions, especially in GROUP BY ports.

Use the Sorted Input option in the Aggregator. This option requires that data sent to the Aggregator be sorted in the order in which the ports are used in the Aggregator's group by. The Sorted Input option decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is sorted by group and, as a group is passed through an Aggregator, calculations can be performed and information passed on to the next transformation. Without sorted input, the Server must wait for all rows of data before processing aggregate calculations. Use of the Sorted Inputs option is usually accompanied by a Source Qualifier which uses the Number of Sorted Ports option.

Use an Expression and Update Strategy instead of an Aggregator Transformation. This technique can only be used if the source data can be sorted. Further, using this option assumes that a mapping is using an Aggregator with Sorted Input option. In the Expression Transformation, the use of variable ports is required to hold data from the previous row of data processed. The premise is to use the previous row of data to determine whether the current row is a part of the current group or is the beginning of a new group. Thus, if the row is a part of the current group, then its data would be used to continue calculating the current group function. An Update Strategy Transformation would follow the Expression Transformation and set the first row of a new group to insert and the following rows to update.

Joiner Transformation

Joining data from the same source

You can join data from the same source in the following ways:

• Join two branches of the same pipeline.

• Create two instances of the same source and join pipelines from these source instances.

You may want to join data from the same source if you want to perform a calculation on part of the data and join the transformed data with the original data. When you join the data using this method, you can maintain the original data and transform parts of that data within one mapping.

When you join data from the same source, you can create two branches of the pipeline. When you branch a pipeline, you must add a transformation between the Source Qualifier and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input.

If you want to join unsorted data, you must create two instances of the same source and join the pipelines.


For example, you have a source with the following ports:

• Employee

• Department

• Total Sales

In the target table, you want to view the employees who generated sales that were greater than the average sales for their respective departments. To accomplish this, you create a mapping with the following transformations:

• Sorter transformation. Sort the data.

• Sorted Aggregator transformation. Average the sales data and group by department. When you perform this aggregation, you lose the data for individual employees. To maintain employee data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch with the same data to the Joiner transformation to maintain the original data. When you join both branches of the pipeline, you join the aggregated data with the original data.

• Sorted Joiner transformation. Use a sorted Joiner transformation to join the sorted aggregated data with the original data.

• Filter transformation. Compare the average sales data against the sales data for each employee and filter out employees whose sales are not above the department average.

The following figure illustrates joining two branches of the same pipeline:


Note: You can also join data from output groups of the same transformation, such as the Custom transformation or XML Source Qualifier transformation. Place a Sorter transformation between each output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.

Joining two branches can affect performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch. This can slow processing.

You can also join same source data by creating a second instance of the source. After you create the second source instance, you can join the pipelines from the two source instances.

The following figure shows two instances of the same source joined using a Joiner transformation:

Note: When you join data using this method, the PowerCenter Server reads the source data for each source instance, so performance can be slower than joining two branches of a pipeline.

Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a source:

• Join two branches of a pipeline when you have a large source or if you can read the source data only once. For example, you can only read source data from a message queue once.

• Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.

• Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.

• Join two instances of a source if one pipeline may process much more slowly than the other pipeline.

Performance Tips

Use the database to do the join when sourcing data from the same database schema. Database systems usually can perform the join more quickly than the PowerCenter Server, so a SQL override or a join condition should be used when joining multiple tables from the same database schema.
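
For instance, instead of joining two tables from the same schema with a Joiner transformation, a Source Qualifier SQL override such as the following sketch (ORDERS and CUSTOMERS are hypothetical tables) pushes the join to the database:

SELECT O.ORDER_ID, O.ORDER_DATE, C.CUST_NAME
FROM ORDERS O, CUSTOMERS C
WHERE O.CUST_ID = C.CUST_ID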

Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of data is also smaller.

Join sorted data when possible. You can improve session performance by configuring the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the PowerCenter Server improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets.


For an unsorted Joiner transformation, designate the source with fewer rows as the master source. For optimal performance and disk storage, the master source should be the source with fewer rows. During a session, the Joiner transformation compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.

For a sorted Joiner transformation, designate the source with fewer duplicate key values as the master source. For optimal performance and disk storage, the master source should be the source with fewer duplicate key values. When the PowerCenter Server processes a sorted Joiner transformation, it caches rows for one hundred keys at a time. If the master source contains many rows with the same key value, the PowerCenter Server must cache more rows, and performance can be slowed.

Optimizing Sorted Joiner Transformations with Partitions

When you use partitions with a sorted Joiner transformation, you may optimize performance by grouping data and using n:n partitions.

Add a hash auto-keys partition upstream of the sort origin

To obtain expected results and get best performance when partitioning a sorted Joiner transformation, you must group and sort data. To group data, ensure that rows with the same key value are routed to the same partition. The best way to ensure that data is grouped and distributed evenly among partitions is to add a hash auto-keys or key-range partition point before the sort origin. Placing the partition point before you sort the data ensures that you maintain grouping and sort the data within each group.

Use n:n partitions

You may be able to improve performance for a sorted Joiner transformation by using n:n partitions. When you use n:n partitions, the Joiner transformation reads master and detail rows concurrently and does not need to cache all of the master data. This reduces memory usage and speeds processing. When you use 1:n partitions, the Joiner transformation caches all the data from the master pipeline and writes the cache to disk if the memory cache fills. When the Joiner transformation receives the data from the detail pipeline, it must then read the data from disk to compare the master and detail pipelines.

Optimize Sequence Generator Transformations

Sequence Generator transformations must determine the next available sequence number; increasing the Number of Cached Values property can therefore improve performance. This property determines the number of values the PowerCenter Server caches at one time. If it is set to cache no values, the PowerCenter Server must query the repository each time to determine the next number to be used. Note that any cached values not used in the course of a session are lost, because the Sequence Generator value stored in the repository is advanced past the cached block so that the next call receives a fresh set of cached values.

Avoid External Procedure Transformations


For the most part, making calls to external procedures slows a session. If possible, avoid the use of these Transformations, which include Stored Procedures, External Procedures, and Advanced External Procedures.

Field-Level Transformation Optimization

As a final step in the tuning process, you can tune expressions used in transformations. When examining expressions, focus on complex expressions and try to simplify them when possible.

To help isolate slow expressions, do the following:

1. Time the session with the original expression.

2. Copy the mapping and replace half the complex expressions with a constant.

3. Run and time the edited session.

4. Make another copy of the mapping and replace the other half of the complex expressions with a constant.

5. Run and time the edited session.

Processing field level transformations takes time. If the transformation expressions are complex, then processing is even slower. It’s often possible to get a 10 to 20 percent performance improvement by optimizing complex field level transformations. Use the target table mapping reports or the Metadata Reporter to examine the transformations. Likely candidates for optimization are the fields with the most complex expressions. Keep in mind that there may be more than one field causing performance problems.

Factoring out common logic

This can reduce the number of times a mapping performs the same logic. If the same logic is performed multiple times in a mapping, moving the task upstream may allow the logic to be done just once. For example, a mapping has five target tables and each target requires a Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup to a position before the data flow splits.

Minimize function calls

Anytime a function is called it takes resources to process. There are several common examples where function calls can be reduced or eliminated.

Aggregate function calls can sometimes be reduced. For each aggregate function call, the PowerCenter Server must search and group the data.

Thus the following expression:

SUM(Column A) + SUM(Column B)

Can be optimized to:

SUM(Column A + Column B)


In general, operators are faster than functions, so operators should be used whenever possible.

For example if you have an expression which involves a CONCAT function such as:

CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)

It can be optimized to:

FIRST_NAME || ' ' || LAST_NAME

Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical statements to be written in a more compact fashion.

For example:

IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,

IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,

IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,

IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,

IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,

IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,

IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,

IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))

Can be optimized to:

IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)

The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized expression results in 3 IIFs, 3 comparisons and two additions.

Be creative in making expressions more efficient. The following example reworks an expression to reduce three comparisons to one (the rewrite is valid only when X is known to fall in a range, such as 1 through 12, where the two tests agree):

For example:

IIF(X=1 OR X=5 OR X=9, 'yes', 'no')

Can be optimized to:

IIF(MOD(X, 4) = 1, 'yes', 'no')


Calculate once, use many times

Avoid calculating or testing the same value multiple times. If the same sub-expression is used several times in a transformation, consider making the sub-expression a local variable. The local variable can be used only within the transformation in which it was created. By calculating the variable only once and then referencing the variable in following sub-expressions, performance will be increased.

Choose numeric versus string operations

The PowerCenter Server processes numeric operations faster than string operations. For example, if a lookup is done on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.

Optimizing char-char and char-varchar comparisons

When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows each time it finds trailing blank spaces in the row. To resolve this, set the Treat CHAR as CHAR On Read option in the PowerCenter Server setup so that the server does not trim trailing spaces from the end of CHAR source fields.

Use DECODE instead of LOOKUP

When a LOOKUP function is used, the PowerCenter Server must look up a table in the database. When a DECODE function is used, the lookup values are incorporated into the expression itself, so the server does not need to query a separate table. Thus, when looking up a small set of unchanging values, using DECODE may improve performance.

Reduce the number of transformations in a mapping

Because there is always overhead involved in moving data between transformations, try, whenever possible, to reduce the number of transformations. Also, remove unnecessary links between transformations to minimize the amount of data moved. This is especially important with data being pulled from the Source Qualifier Transformation.

Use pre- and post-session SQL commands

You can specify pre- and post-session SQL commands in the Properties tab of the Source Qualifier transformation and in the Properties tab of the target instance in a mapping. To increase the load speed, use these commands to drop indexes on the target before the session runs, then recreate them when the session completes.
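
A minimal sketch of this technique, assuming a hypothetical target table SALES_FACT with a single index IDX_SALES_FACT_DATE:

Pre-session SQL on the target:

DROP INDEX IDX_SALES_FACT_DATE

Post-session SQL on the target:

CREATE INDEX IDX_SALES_FACT_DATE ON SALES_FACT (SALE_DATE)

Dropping the index avoids index maintenance on every inserted row during the load; rebuilding it afterward restores query performance for downstream reads. The exact DROP INDEX syntax varies by database.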

Apply the following guidelines when using SQL statements:

• You can use any command that is valid for the database type. However, the PowerCenter Server does not allow nested comments, even though the database may.


• You can use mapping parameters and variables in SQL executed against the source, but not against the target.

• Use a semi-colon (;) to separate multiple statements.

• The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within /* ... */.

• If you need to use a semi-colon outside of quotes or comments, you can escape it with a back slash (\).

• The Workflow Manager does not validate the SQL.

Use environmental SQL

For relational databases, you can execute SQL commands in the database environment when connecting to the database. You can use this for source, target, lookup, and stored procedure connections. For instance, you can set isolation levels on the source and target systems to avoid deadlocks. Follow the guidelines listed above for using the SQL statements.
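
For example, an environment SQL command for an Oracle connection might set the isolation level as sketched below; the exact statement depends on the database and on the isolation behavior required:

ALTER SESSION SET ISOLATION_LEVEL = READ COMMITTED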


Tuning Sessions for Better Performance

Challenge

Running sessions is where the pedal hits the metal. A common misconception is that this is the area where most tuning should occur. While it is true that various specific session options can be modified to improve performance, this should not be the major or only area of focus when implementing performance tuning.

Description

The greatest area for improvement at the session level usually involves tweaking memory cache settings. The Aggregator (without sorted ports), Joiner, Rank, Sorter and Lookup Transformations (with caching enabled) use caches. Review the memory cache settings for sessions where the mappings contain any of these transformations.

The PowerCenter Server uses the index and data caches for each of these transformations. If the allocated data or index cache is not large enough to store the data, the PowerCenter Server stores the data in a temporary disk file as it processes the session data. Each time the PowerCenter Server pages to the temporary file, performance slows.

You can see when the PowerCenter Server pages to the temporary file by examining the performance details. The Transformation_readfromdisk or Transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or Joiner transformation indicate the number of times the PowerCenter Server must page to disk to process the transformation. Since the data cache is typically larger than the index cache, you should increase the data cache more than the index cache.

The PowerCenter Server creates the index and data cache files by default in the PowerCenter Server variable directory, $PMCacheDir. The naming convention used by the PowerCenter Server for these files is PM [type of transformation] [generated session instance id number] _ [transformation instance id number] _ [partition index].dat or .idx. For example, an aggregate data cache file would be named PMAGG31_19.dat. The cache directory may be changed, however, if disk space is a constraint. Informatica recommends that the cache directory be local to the PowerCenter Server. You may encounter performance or reliability problems when you cache large quantities of data on a mapped or mounted drive.


If the PowerCenter Server requires more memory than the configured cache size, it stores the overflow values in these cache files. Since paging to disk can slow session performance, try to configure the index and data cache sizes to store the appropriate amount of data in memory. Refer to Session Caches in the Workflow Administration Guide for detailed information on determining cache sizes.

The PowerCenter Server writes to the index and data cache files during a session in the following cases:

• The mapping contains one or more Aggregator transformations, and the session is configured for incremental aggregation.

• The mapping contains a Lookup transformation that is configured to use a persistent lookup cache, and the PowerCenter Server runs the session for the first time.

• The mapping contains a Lookup transformation that is configured to initialize the persistent lookup cache.

• The Data Transformation Manager (DTM) process in a session runs out of cache memory and pages to the local cache files. The DTM may create multiple files when processing large amounts of data. The session fails if the local directory runs out of disk space.

When a session is running, the PowerCenter Server writes a message in the session log indicating the cache file name and the transformation name. When a session completes, the DTM generally deletes the overflow index and data cache files. However, index and data files may exist in the cache directory if the session is configured for either incremental aggregation or to use a persistent lookup cache. Cache files may also remain if the session does not complete successfully.

If a cache file handles more than two gigabytes of data, the PowerCenter Server creates multiple index and data files. When creating these files, the PowerCenter Server appends a number to the end of the filename, such as PMAGG*.idx1 and PMAGG*.idx2. The number of index and data files is limited only by the amount of disk space available in the cache directory.

Aggregator Caches

Keep the following items in mind when configuring the aggregate memory cache sizes:

• Allocate enough space to hold at least one row in each aggregate group.

• Remember that you only need to configure cache memory for an Aggregator transformation that does not use sorted ports. The PowerCenter Server uses memory to process an Aggregator transformation with sorted ports, not cache memory.

• Incremental aggregation can improve session performance. When it is used, the PowerCenter Server saves index and data cache information to disk at the end of the session. The next time the session runs, the PowerCenter Server uses this historical information to perform the incremental aggregation. The PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and saves them to the cache directory. Mappings that have sessions which use incremental aggregation should be set up so that only new detail records are read with each subsequent run.

• When configuring Aggregate data cache size, remember that the data cache holds row data for variable ports and connected output ports only. As a result, the data cache is generally larger than the index cache. To reduce the data cache size, connect only the necessary output ports to subsequent transformations.

Joiner Caches

When a session is run with a Joiner transformation, the PowerCenter Server reads from master and detail sources concurrently and builds index and data caches based on the master rows. The PowerCenter Server then performs the join based on the detail source data and the cache data.

The number of rows the PowerCenter Server stores in the cache depends on the partitioning scheme, the data in the master source, and whether or not you use sorted input.

After the memory caches are built, the PowerCenter Server reads the rows from the detail source and performs the joins. The PowerCenter Server uses the index cache to test the join condition. When it finds source data and cache data that match, it retrieves row values from the data cache.

Lookup Caches

Several options can be explored when dealing with Lookup transformation caches.

• Persistent caches should be used when lookup data is not expected to change often. Lookup cache files are saved after a session which has a lookup that uses a persistent cache is run for the first time. These files are reused for subsequent runs, bypassing the querying of the database for the lookup. If the lookup table changes, you must be sure to set the Recache from Database option to ensure that the lookup cache files are rebuilt.

• Lookup caching should be enabled for relatively small tables. Refer to Best Practice: Tuning Mappings for Better Performance to determine when lookups should be cached. When the Lookup transformation is not configured for caching, the PowerCenter Server queries the lookup table for each input row. The result of the lookup query and processing is the same, regardless of whether the lookup table is cached or not. However, when the transformation is configured to not cache, the PowerCenter Server queries the lookup table instead of the lookup cache. Using a lookup cache can sometimes increase session performance.

• Just like for a joiner, the PowerCenter Server aligns all data for lookup caches on an eight-byte boundary, which helps increase the performance of the lookup.

Allocating buffer memory

When the PowerCenter Server initializes a session, it allocates blocks of memory to hold source and target data. Sessions that use a large number of sources and targets may require additional memory blocks. By default, a session has enough buffer blocks for 83 sources and targets. If you run a session that has more than 83 sources and targets, you can increase the number of available memory blocks by adjusting the following session parameters:

• DTM buffer size - the default setting is 12,000,000 bytes.

• Default buffer block size - the default size is 64,000 bytes.

To configure these settings, first determine the number of memory blocks the PowerCenter Server requires to initialize the session. Then you can calculate the buffer size and/or the buffer block size based on the default settings, to create the required number of session blocks.

If there are XML sources or targets in the mappings, use the number of groups in the XML source or target in the total calculation for the total number of sources and targets.

Increasing the DTM Buffer Pool Size

The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory to create the internal data structures and buffer blocks used to bring data into and out of the server. When the DTM buffer memory is increased, the PowerCenter Server creates more buffer blocks, which can improve performance during momentary slowdowns.

If a session's performance details show low numbers for your source and target BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer pool size may improve performance.

Increasing DTM buffer memory allocation generally causes performance to improve initially and then level off. When the DTM buffer memory allocation is increased, you need to evaluate the total memory available on the PowerCenter Server. If a session is part of a concurrent batch, the combined DTM buffer memory allocated for the sessions or batches must not exceed the total memory for the PowerCenter Server system. You can increase the DTM buffer size in the Performance settings of the Properties tab.

If you don't see a significant performance increase after increasing DTM buffer memory, then it was not a factor in session performance.

Optimizing the Buffer Block Size

Within a session, you can modify the buffer block size by changing it in the advanced section of the Config tab. This specifies the size of a memory block that is used to move data throughout the pipeline. Each source, each transformation, and each target may have a different row size, which results in different numbers of rows that can be fit into one memory block.

Row size is determined in the server, based on number of ports, their data types, and precisions. Ideally, buffer block size should be configured so that it can hold roughly 20 rows at a time. When calculating this, use the source or target with the largest row size. The default is 64K. The buffer block size does not become a factor in session performance until the number of rows falls below 10. Informatica recommends that the size of the shared memory (which determines the number of buffers available to the session) should not be increased at all unless the mapping is complex (i.e., more than 20 transformations).

Running concurrent sessions and workflows

The PowerCenter Server can process multiple sessions in parallel and can also process multiple partitions of a pipeline within a session. If you have a symmetric multi-processing (SMP) platform, you can use multiple CPUs to concurrently process session data or partitions of data. This provides improved performance since true parallelism is achieved. On a single processor platform, these tasks share the CPU, so there is no parallelism.

To achieve better performance, you can create a workflow that runs several sessions in parallel on one PowerCenter Server. This technique should only be employed on servers with multiple CPUs available. Each concurrent session will use a maximum of 1.4 CPUs for the first session, and a maximum of 1 CPU for each additional session. Also, it has been noted that simple mappings (i.e., mappings with only a few transformations) do not make the engine CPU-bound, and therefore use a lot less processing power than a full CPU.

If there are independent sessions that use separate sources and mappings to populate different targets, they can be placed in a single workflow and linked concurrently to run at the same time. Alternatively, these sessions can be placed in different workflows that are run concurrently.

If there is a complex mapping with multiple sources, you can separate it into several simpler mappings with separate sources. This enables you to place concurrent sessions for these mappings in a workflow to be run in parallel.

Partitioning sessions

Performance can be improved by processing data in parallel in a single session by creating multiple partitions of the pipeline. If you use PowerCenter, you can increase the number of partitions in a pipeline to improve session performance. Increasing the number of partitions allows the PowerCenter Server to create multiple connections to sources and process partitions of source data concurrently.

When you create or edit a session, you can change the partitioning information for each pipeline in a mapping. If the mapping contains multiple pipelines, you can specify multiple partitions in some pipelines and single partitions in others. Keep the following attributes in mind when specifying partitioning information for a pipeline:

• Location of partition points: The PowerCenter Server sets partition points at several transformations in a pipeline by default. If you use PowerCenter, you can define other partition points. Select those transformations where you think redistributing the rows in a different way is likely to increase the performance considerably.


• Number of partitions: By default, the PowerCenter Server sets the number of partitions to one. You can generally define up to 64 partitions at any partition point. When you increase the number of partitions, you increase the number of processing threads, which can improve session performance. Increasing the number of partitions or partition points also increases the load on the server. If the server contains ample CPU bandwidth, processing rows of data in a session concurrently can increase session performance. However, if you create a large number of partitions or partition points in a session that processes large amounts of data, you can overload the system.

• Partition types: The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:

1. Round-robin partitioning: PowerCenter distributes rows of data evenly to all partitions. Each partition processes approximately the same number of rows. In a pipeline that reads data from file sources of different sizes, you can use round-robin partitioning to ensure that each partition receives approximately the same number of rows.

2. Hash Keys: the PowerCenter Server uses a hash function to group rows of data among partitions. The PowerCenter Server groups the data based on a partition key. There are two types of hash partitioning:

o Hash auto-keys: The PowerCenter Server uses all grouped or sorted ports as a compound partition key. You can use hash auto-keys partitioning at or before Rank, Sorter, and unsorted Aggregator transformations to ensure that rows are grouped properly before they enter these transformations.

o Hash User Keys: The PowerCenter Server uses a hash function to group rows of data among partitions based on a user-defined partition key. You choose the ports that define the partition key.

3. Key Range: The PowerCenter Server distributes rows of data based on a port or set of ports that you specify as the partition key. For each port, you define a range of values. The PowerCenter Server uses the key and ranges to send rows to the appropriate partition. Choose key range partitioning where the sources or targets in the pipeline are partitioned by key range.

4. Pass-through partitioning: The PowerCenter Server processes data without redistributing rows among partitions. Therefore, all rows in a single partition stay in that partition after crossing a pass-through partition point.

5. Database Partitioning partition: You can optimize session performance by using the database partitioning partition type instead of the pass-through partition type for IBM DB2 targets.

If you find that your system is under-utilized after you have tuned the application, databases, and system for maximum single-partition performance, you can reconfigure your session to have two or more partitions to make your session utilize more of the hardware. Use the following tips when you add partitions to a session:

• Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before you add each partition.


• Set DTM buffer memory. For a session with n partitions, this value should be at least n times the value for the session with one partition.

• Set cached values for Sequence Generator. For a session with n partitions, there should be no need to use the number of cached values property of the Sequence Generator transformation. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the session with one partition.

• Partition the source data evenly. Configure each partition to extract the same number of rows.

• Monitor the system while running the session. If there are CPU cycles available (twenty percent or more idle time) then performance may improve for this session by adding a partition.

• Monitor the system after adding a partition. If the CPU utilization does not go up, the wait for I/O time goes up, or the total data transformation rate goes down, then there is probably a hardware or software bottleneck. If the wait for I/O time goes up a significant amount, then check the system for hardware bottlenecks. Otherwise, check the database configuration.

• Tune databases and system. Make sure that your databases are tuned properly for parallel ETL and that your system has no bottlenecks.

Increasing the target commit interval

One method of resolving target database bottlenecks is to increase the commit interval. Each time the PowerCenter Server commits, performance slows. Therefore, the smaller the commit interval, the more often the PowerCenter Server writes to the target database and the slower the overall performance. If you increase the commit interval, the number of times the PowerCenter Server commits decreases and performance may improve.

When increasing the commit interval at the session level, you must remember to increase the size of the database rollback segments to accommodate the larger number of rows. One of the major reasons that Informatica has set the default commit interval to 10,000 is to accommodate the default rollback segment / extent size of most databases. If you increase both the commit interval and the database rollback segments, you should see an increase in performance. In some cases though, just increasing the commit interval without making the appropriate database changes may cause the session to fail part way through (i.e., you may get a database error like "unable to extend rollback segments" in Oracle).

Disabling high precision

If a session runs with high precision enabled, disabling high precision may improve session performance.

The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high-precision Decimal datatype in a session, you must configure it so that the PowerCenter Server recognizes this datatype by selecting Enable high precision in the session property sheet. However, since reading and manipulating high-precision Decimal data can slow the PowerCenter Server, session performance may be improved by disabling decimal arithmetic. When you disable high precision, the PowerCenter Server converts data to a double.


Reducing error tracking

If a session contains a large number of transformation errors, you may be able to improve performance by reducing the amount of data the PowerCenter Server writes to the session log.

To reduce the amount of time spent writing to the session log file, set the tracing level to Terse. Terse tracing should only be set if the sessions run without problems and session details are not required. At this tracing level, the PowerCenter Server does not write error messages or row-level information for reject data. However, if terse is not an acceptable level of detail, you may want to consider leaving the tracing level at Normal and focus your efforts on reducing the number of transformation errors. Note that the tracing level must be set to Normal in order to use the reject loading utility.

As an additional debug option (beyond the PowerCenter Debugger), you may set the tracing level to verbose initialization or verbose data.

• Verbose initialization logs initialization details in addition to the normal tracing level, including the names of index and data files used and detailed transformation statistics.

• Verbose data logs each row that passes into the mapping. It also notes where the PowerCenter Server truncates string data to fit the precision of a column and provides detailed transformation statistics. When you configure the tracing level to verbose data, the PowerCenter Server writes row data for all rows in a block when it processes a transformation.

However, the verbose initialization and verbose data logging options significantly affect the session performance. Do not use Verbose tracing options except when testing sessions. Always remember to switch tracing back to Normal after the testing is complete.

The session tracing level overrides any transformation-specific tracing levels within the mapping. Informatica does not recommend reducing error tracing as a long-term response to high levels of transformation errors. Because there are only a handful of reasons why transformation errors occur, it makes sense to fix and prevent any recurring transformation errors. PowerCenter uses the mapping tracing level when the session tracing level is set to none.


Tuning SQL Overrides and Environment for Better Performance

Challenge

Tuning SQL overrides and SQL queries within source qualifier objects can improve performance when selecting data from source database tables, which positively impacts overall session performance. This Best Practice explores ways to optimize a SQL query within the source qualifier object; the tips can be applied to any PowerCenter mapping. The SQL discussed here is written for Oracle 8 (and above) and DB2. The techniques are generally applicable, but specifics for other RDBMS products (e.g., SQL Server, Sybase) are not included.

Description

SQL Queries Performing Data Extractions

Optimizing SQL queries is perhaps the most complex portion of performance tuning. When tuning SQL, the developer must look at the type of execution being forced by hints, the execution plan, the indexes on the tables referenced in the query, the logic of the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of these areas in more detail.

DB2 Coalesce and Oracle NVL

When examining data with NULLs, it is often necessary to substitute a value to make comparisons and joins work. In Oracle, the NVL function is used, while in DB2, the COALESCE function is used.

Here is an example of the Oracle NVL function:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code

FROM exp.exp_bio_result bio, sar.sar_data_load_log log

WHERE bio.update_date BETWEEN log.start_time AND log.end_time

AND NVL(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')


AND log.seq_no = (SELECT MAX(seq_no) FROM sar.sar_data_load_log

WHERE load_status = 'P')

Here is the same query in DB2:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code

FROM bio_result bio, data_load_log log

WHERE bio.update_date BETWEEN log.start_time AND log.end_time

AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')

AND log.seq_no = (SELECT MAX(seq_no) FROM data_load_log

WHERE load_status = 'P')

Surmounting the Single SQL Statement Limitation in Oracle or DB2: In-line Views

In source qualifiers and lookup objects, you are limited to a single SQL statement. There are several ways to get around this limitation.

You can create views in the database and use them as you would tables, either as source tables or in the FROM clause of the SELECT statement. This can simplify the SQL and make it easier to understand, but it also makes it harder to maintain: the logic is now in two places, an Informatica mapping and a database view.
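
A sketch of this approach, using hypothetical names; the view is then referenced in the source qualifier like any other table:

CREATE VIEW V_ACTIVE_CUSTOMERS AS
SELECT CUST_ID, CUST_NAME, REGION_CODE
FROM CUSTOMERS
WHERE STATUS = 'ACTIVE'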

You can use in-line views which are SELECT statements in the FROM or WHERE clause. This can help focus the query to a subset of data in the table and work more efficiently than using a traditional join. Here is an example of an in-line view in the FROM clause:

SELECT N.DOSE_REGIMEN_TEXT as DOSE_REGIMEN_TEXT,

N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,

N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,

N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID

FROM DOSE_REGIMEN N,

(SELECT DISTINCT R.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID

FROM EXPERIMENT_PARAMETER R,

NEW_GROUP_TMP TMP


WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID

AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID

) X

WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID

ORDER BY N.DOSE_REGIMEN_ID

Surmounting the Single SQL Statement Limitation in DB2: Using the Common Table Expression temp tables and the WITH Clause

The Common Table Expression (CTE) stores data in temp tables during the execution of the SQL statement. The WITH clause lets you assign a name to a CTE block. You can then reference the CTE block multiple places in the query by specifying the query name. For example:

WITH maxseq AS (SELECT MAX(seq_no) as seq_no FROM data_load_log WHERE load_status = 'P')

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code

FROM bio_result bio, data_load_log log, maxseq

WHERE bio.update_date BETWEEN log.start_time AND log.end_time

AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')

AND log.seq_no = maxseq.seq_no

Here is another example using a WITH clause that uses recursive SQL:

WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID, LVL) AS

(SELECT PERSON_ID, NAME, PARENT_ID, 1

FROM PARENT_CHILD

WHERE NAME IN ('FRED', 'SALLY', 'JIM')

UNION ALL

SELECT C.PERSON_ID, C.NAME, C.PARENT_ID, RECURS.LVL + 1

FROM PARENT_CHILD C, PERSON_TEMP RECURS

WHERE C.PERSON_ID = RECURS.PARENT_ID

AND RECURS.LVL < 5)

SELECT * FROM PERSON_TEMP

The PARENT_ID in any particular row refers to the PERSON_ID of that row's parent (the example simplifies things by recording only one parent per person). The LVL counter in the recursive part of the query prevents infinite recursion.

CASE (DB2) vs. DECODE (Oracle)

The CASE syntax is allowed in Oracle, but you are much more likely to see DECODE logic, even for a single test, since DECODE was the only way to express this kind of condition in earlier Oracle versions.

DECODE is not allowed in DB2.

In Oracle (DECODE tests only for equality, so the SIGN function is used to turn the range tests into discrete values):

SELECT EMPLOYEE, FNAME, LNAME,

DECODE(SIGN(SALARY - 10000), -1, 'NEED RAISE',

DECODE(SIGN(SALARY - 1000000), 1, 'OVERPAID',

'THE REST OF US')) AS SALARY_COMMENT

FROM EMPLOYEE

In DB2:

SELECT EMPLOYEE, FNAME, LNAME,

CASE

WHEN SALARY < 10000 THEN ‘NEED RAISE’

WHEN SALARY > 1000000 THEN ‘OVERPAID’

ELSE ‘THE REST OF US’

END AS SALARY_COMMENT

FROM EMPLOYEE

Debugging Tip: Obtaining a Sample Subset


It is often useful to get a small sample of the data from a long running query that returns a large set of data. The logic can be commented out or removed after it is put in general use.

DB2 uses the FETCH FIRST n ROWS ONLY clause to do this as follows:

SELECT EMPLOYEE, FNAME, LNAME

FROM EMPLOYEE

WHERE JOB_TITLE = 'WORKERBEE'

FETCH FIRST 12 ROWS ONLY

Oracle does it this way using the ROWNUM variable:

SELECT EMPLOYEE, FNAME, LNAME

FROM EMPLOYEE

WHERE JOB_TITLE = 'WORKERBEE'

AND ROWNUM <= 12

INTERSECT, INTERSECT ALL, UNION, UNION ALL

Remember that both the UNION and INTERSECT operators return distinct rows, while UNION ALL and INTERSECT ALL return all rows.
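
For example, given two hypothetical tables CUSTOMERS and EMPLOYEES that both carry a NAME_ID column, the first query below returns each distinct NAME_ID once, while the second returns every occurrence from both tables and usually runs faster because no duplicate elimination is performed:

SELECT NAME_ID FROM CUSTOMERS
UNION
SELECT NAME_ID FROM EMPLOYEES

SELECT NAME_ID FROM CUSTOMERS
UNION ALL
SELECT NAME_ID FROM EMPLOYEES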

System Dates in Oracle and DB2

Oracle uses the system variable SYSDATE for the current time and date, and allows you to display either the time and/or the date however you want with date functions.

Here is an example that returns yesterday’s date in Oracle (default format as mm/dd/yyyy):

SELECT TRUNC(SYSDATE) - 1 FROM DUAL

DB2 uses system variables, called special registers: CURRENT DATE, CURRENT TIME, and CURRENT TIMESTAMP.

Here is an example for DB2:

SELECT FNAME, LNAME, CURRENT DATE AS TODAY

FROM EMPLOYEE

Oracle: Using Hints


Hints affect the way a query or sub-query is executed and can therefore, provide a significant performance increase in queries. Hints cause the database engine to relinquish control over how a query is executed, thereby giving the developer control over the execution. Hints are always honored unless execution is not possible. Because the database engine does not evaluate whether the hint makes sense, developers must be careful in implementing hints. Oracle has many types of hints: optimizer hints, access method hints, join order hints, join operation hints, and parallel execution hints. Optimizer and access method hints are the most common.

In the latest versions of Oracle, the Cost-based query analysis is built-in and Rule-based analysis is no longer possible. It was in Rule-based Oracle systems that hints mentioning specific indexes were most helpful. In Oracle version 9.2, however, the use of /*+ INDEX */ hints may actually decrease performance significantly in many cases. If you are using older versions of Oracle however, the use of the proper INDEX hints should help performance.

The optimizer hint allows the developer to change the optimizer's goals when creating the execution plan. The table below provides a partial list of optimizer hints and descriptions.

Optimizer hints: Choosing the best join method

Sort/merge and hash joins are in the same group, but nested loop joins are very different. Sort/merge involves two sorts while the nested loop involves no sorts. The hash join also requires memory to build the hash table.

Hash joins are most effective when the amount of data is large and one table is much larger than the other.

Here is an example of a select that performs best as a hash join:

SELECT COUNT(*) FROM CUSTOMERS C, MANAGERS M

WHERE C.CUST_ID = M.MANAGER_ID

Considerations and recommended join type:

• Better throughput: Sort/Merge

• Better response time: Nested loop

• Large subsets of data: Sort/Merge

• Index available to support the join: Nested loop

• Limited memory and CPU available for sorting: Nested loop

• Parallel execution: Sort/Merge or Hash

• Joining all or most of the rows of large tables: Sort/Merge or Hash

• Joining small sub-sets of data with an index available: Nested loop

Optimizer hints and their effects:

• ALL_ROWS: The database engine creates an execution plan that optimizes for throughput. Favors full table scans. The optimizer favors Sort/Merge joins.

• FIRST_ROWS: The database engine creates an execution plan that optimizes for response time. It returns the first row of data as quickly as possible. Favors index lookups. The optimizer favors nested loop joins.

• CHOOSE: The database engine creates an execution plan that uses cost-based execution if statistics have been run on the tables. If statistics have not been run, the engine uses rule-based execution. If statistics have been run on empty tables, the engine still uses cost-based execution, but performance is extremely poor.

• RULE: The database engine creates an execution plan based on a fixed set of rules.

• USE_NL: Use nested loop joins.

• USE_MERGE: Use sort/merge joins.

• HASH: The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered.

Access method hints

Access method hints control how data is accessed. These hints are used to force the database engine to use indexes, hash scans, or row id scans. The following table provides a partial list of access method hints.

• ROWID: The database engine performs a scan of the table based on ROWIDs.

• INDEX: DO NOT USE in Oracle 9.2 and above. The database engine performs an index scan of a specific table, but in 9.2 and above, the optimizer does not use any indexes other than those mentioned.

• USE_CONCAT: The database engine converts a query with an OR condition into two or more queries joined by a UNION ALL statement.

The syntax for using a hint in a SQL statement is as follows:

Select /*+ FIRST_ROWS */ empno, ename

From emp;

Select /*+ USE_CONCAT */ empno, ename

From emp;

SQL Execution and Explain Plan

The simplest change is forcing the SQL to choose either rule-based or cost-based execution. This change can be accomplished without changing the logic of the SQL query. While cost-based execution is typically considered the best SQL execution; it relies upon optimization of the Oracle parameters and updated database statistics. If these statistics are not maintained, cost-based query execution can suffer over time. When that happens, rule-based execution can actually provide better execution time.


The developer can determine which type of execution is being used by running an explain plan on the SQL query in question. Note that the step in the explain plan that is indented the most is the statement that is executed first. The results of that statement are then used as input by the next level statement.
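
One way to generate an explain plan in Oracle is sketched below; DBMS_XPLAN.DISPLAY is available in Oracle 9.2 and later, and the query shown is only a placeholder:

EXPLAIN PLAN FOR
SELECT C.NAME_ID FROM CUSTOMERS C, EMPLOYEES E
WHERE C.NAME_ID = E.NAME_ID;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);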

Typically, the developer should attempt to eliminate any full table scans and index range scans whenever possible. Full table scans cause degradation in performance.

Information provided by the Explain Plan can be enhanced using the SQL Trace Utility, which provides the following additional information:

• The number of executions

• The elapsed time of the statement execution

• The CPU time used to execute the statement

The SQL Trace Utility adds value because it definitively shows the statements that are using the most resources, and can immediately show the change in resource consumption after the statement has been tuned and a new explain plan has been run.
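
A common way to turn the SQL Trace Utility on and off for the current Oracle session is sketched below; the resulting trace file is typically formatted with the tkprof utility. This is a sketch, not a complete tracing procedure:

ALTER SESSION SET SQL_TRACE = TRUE

-- run the statement being tuned, then:

ALTER SESSION SET SQL_TRACE = FALSE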

Using Indexes

The explain plan also shows whether indexes are being used to facilitate execution. The data warehouse team should compare the indexes being used to those available. If necessary, the administrative staff should identify new indexes that are needed to improve execution and ask the database administration team to add them to the appropriate tables. Once implemented, the explain plan should be executed again to ensure that the indexes are being used. If an index is not being used, it is possible to force the query to use it by using an access method hint, as described earlier.

Reviewing SQL Logic

The final step in SQL optimization involves reviewing the SQL logic itself. The purpose of this review is to determine whether the logic is efficiently capturing the data needed for processing. Review of the logic may uncover the need for additional filters to select only certain data, as well as the need to restructure the where clause to use indexes. In extreme cases, the entire SQL statement may need to be re-written to become more efficient.

Reviewing SQL Syntax

SQL Syntax can also have a great impact on query performance. Certain operators can slow performance, for example:

• EXISTS clauses are almost always used in correlated sub-queries. They are executed for each row of the parent query and cannot take advantage of indexes, while the IN clause is executed once and does use indexes, and may be translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN clause. For example:

SELECT * FROM DEPARTMENTS WHERE DEPT_ID IN


(SELECT DISTINCT DEPT_ID FROM MANAGERS) -- Faster

SELECT * FROM DEPARTMENTS D WHERE EXISTS

(SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID)

Guidelines for choosing between EXISTS and IN:

• An index supports the sub-query: EXISTS works well; IN works well.

• No index supports the sub-query: EXISTS does not (it forces a table scan per parent row); IN does (a single table scan).

• The sub-query returns many rows: EXISTS probably does not work well; IN does.

• The sub-query returns one or a few rows: EXISTS works well; IN works well.

• Most of the sub-query rows are eliminated by the parent query: EXISTS does not work well; IN does.

• An index in the parent matches the sub-query columns: EXISTS possibly does not benefit, since EXISTS cannot use the index; IN does, since IN uses the index.

• Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply modifying the query in this way can improve performance by more than 100 percent; a sketch of this rewrite follows this list.

• Where possible, limit the use of outer joins on tables. Remove the outer joins from the query and create lookup objects within the mapping to fill in the optional information.
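
As noted above, here is a sketch of the INTERSECT-to-EXISTS rewrite, using hypothetical CUSTOMERS and EMPLOYEES tables; the two forms return the same rows when NAME_ID is never NULL:

SELECT NAME_ID FROM CUSTOMERS
INTERSECT
SELECT NAME_ID FROM EMPLOYEES

can be rewritten as:

SELECT DISTINCT C.NAME_ID FROM CUSTOMERS C
WHERE EXISTS
(SELECT 1 FROM EMPLOYEES E WHERE E.NAME_ID = C.NAME_ID)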

Choosing the Best Join Order

Place the smallest table first in the join order. This is often a staging table holding the IDs identifying the data in the incremental ETL load.

Always put the small table column on the right side of the join. Use the driving table first in the WHERE clause, and work from it outward. In other words, be consistent and orderly about placing columns in the WHERE clause.
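
A sketch of these guidelines, assuming a small staging table STG_ORDER_IDS drives an incremental load against larger ORDERS and CUSTOMERS tables (all names hypothetical): the smallest table leads the join order, the driving conditions come first in the WHERE clause, and the small table's column sits on the right side of each join condition.

SELECT O.ORDER_ID, O.ORDER_DATE, C.CUST_NAME
FROM STG_ORDER_IDS S, ORDERS O, CUSTOMERS C
WHERE O.ORDER_ID = S.ORDER_ID
AND C.CUST_ID = O.CUST_ID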

Outer joins limit the join order that the optimizer can use. Don’t use them needlessly.

Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER JOIN

• Avoid use of the NOT IN clause. This clause causes the database engine to perform a full table scan. While this may not be a problem on small tables, it can become a performance drain on large tables.

SELECT NAME_ID FROM CUSTOMERS

WHERE NAME_ID NOT IN

(SELECT NAME_ID FROM EMPLOYEES)


• Avoid use of the NOT EXISTS clause. This clause is better than the NOT IN, but still may cause a full table scan.

SELECT C.NAME_ID FROM CUSTOMERS C

WHERE NOT EXISTS

(SELECT * FROM EMPLOYEES E

WHERE C.NAME_ID = E.NAME_ID)

• In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the equivalent EXCEPT operator.

SELECT C.NAME_ID FROM CUSTOMERS C

MINUS

SELECT E.NAME_ID FROM EMPLOYEES E

• Also consider using outer joins with IS NULL conditions for anti-joins.

SELECT C.NAME_ID FROM CUSTOMERS C, EMPLOYEES E

WHERE C.NAME_ID = E.NAME_ID (+)

AND E.NAME_ID IS NULL

Review the database's SQL manuals to determine the costs and benefits of particular SQL clauses, as these can vary by database engine.

• In lookups from large tables, try to limit the rows returned to the set of rows matching the set in the source qualifier. Add the WHERE clause conditions to the lookup. For example, if the source qualifier selects sales orders entered into the system since the previous load of the database, then, in the product information lookup, only select the products that match the distinct product IDs in the incremental sales orders. (A sketch of this technique follows the range-lookup example below.)

• Avoid range lookups: a SELECT that uses a BETWEEN in the WHERE clause with limit values retrieved from another table. Here is an example:

SELECT

R.BATCH_TRACKING_NO,

R.SUPPLIER_DESC,

R.SUPPLIER_REG_NO,

R.SUPPLIER_REF_CODE,

R.GCW_LOAD_DATE

FROM CDS_SUPPLIER R,

(SELECT LOAD_DATE_PREV AS LOAD_DATE_PREV,

L.LOAD_DATE AS LOAD_DATE

FROM ETL_AUDIT_LOG L

WHERE L.LOAD_DATE_PREV IN

(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV

FROM ETL_AUDIT_LOG Y)

) Z

WHERE

R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE

The work-around is to use an in-line view to get the lower range in the FROM clause and join it to the main query that limits the higher date range in its WHERE clause. Use an ORDER BY on the lower limit in the in-line view. This is likely to reduce the throughput time from hours to seconds.

Here is the improved SQL:

SELECT

R.BATCH_TRACKING_NO,

R.SUPPLIER_DESC,

R.SUPPLIER_REG_NO,

R.SUPPLIER_REF_CODE,

R.LOAD_DATE

FROM

/* In-line view for lower limit */

(SELECT

R1.BATCH_TRACKING_NO,

R1.SUPPLIER_DESC,

R1.SUPPLIER_REG_NO,

R1.SUPPLIER_REF_CODE,

R1.LOAD_DATE

FROM CDS_SUPPLIER R1,

(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV

FROM ETL_AUDIT_LOG Y) Z

WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV

ORDER BY R1.LOAD_DATE) R,

/* end in-line view for lower limit */

(SELECT MAX(D.LOAD_DATE) AS LOAD_DATE

FROM ETL_AUDIT_LOG D) A /* upper limit */

WHERE R.LOAD_DATE <= A.LOAD_DATE
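
Returning to the earlier point about limiting lookup rows, here is a minimal sketch of a restricted lookup query. The PRODUCTS and SALES_ORDERS tables and their columns are hypothetical; ETL_AUDIT_LOG is reused from the example above to represent the previous load date:

SELECT P.PRODUCT_ID,
P.PRODUCT_DESC,
P.PRODUCT_CATEGORY
FROM PRODUCTS P
WHERE P.PRODUCT_ID IN
(SELECT DISTINCT S.PRODUCT_ID
FROM SALES_ORDERS S
WHERE S.ENTRY_DATE >
(SELECT MAX(L.LOAD_DATE) FROM ETL_AUDIT_LOG L))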

Tuning System Architecture

Use the following steps to improve the performance of any system:

1. Establish performance boundaries (baseline).
2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If objectives are met, consider reducing the number of measurements because performance monitoring itself uses system resources. Otherwise continue with Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can bear additional load.
8. Adjust the configuration of the system. If it is feasible to change more than one tuning option, implement one at a time. If there are no options left at any level, this indicates that the system has reached its limits and hardware upgrades may be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.

System Resources

The PowerCenter Server uses the following system resources:

• CPU
• Load Manager shared memory
• DTM buffer memory
• Cache memory

When tuning the system, evaluate the following considerations during the implementation process.

• Determine if the network is running at an optimal speed. Recommended best practice is to minimize the number of network hops between the PowerCenter Server and the databases.

• Use multiple PowerCenter Servers on separate systems to potentially improve session performance.

• When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC, configure the PowerCenter Server for ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to store each character. In Unicode mode, the PowerCenter Server uses two bytes for each character, which can potentially slow session performance.

• Check hard disks on related machines. Slow disk access on source and target databases, source and target file systems, as well as the PowerCenter Server and repository machines can slow session performance.

• When an operating system runs out of physical memory, it starts paging to disk to free physical memory. Configure the physical memory for the PowerCenter Server machine to minimize paging to disk. Increase system memory when sessions use large cached lookups or sessions have many partitions.

• In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources. Use processor binding to control processor usage by the PowerCenter Server.

• In a Sun Solaris environment, use the psrset command to create and manage a processor set. After creating a processor set, use the pbind command to bind the PowerCenter Server to the processor set so that the processor set only runs the PowerCenter Server. For details, see project system administrator and Sun Solaris documentation.

• In an HP-UX environment, use the Process Resource Manager utility to control CPU usage in the system. The Process Resource Manager allocates minimum system resources and uses a maximum cap of resources. For details, see project system administrator and HP-UX documentation.

• In an AIX environment, use the Workload Manager in AIX 5L to manage system resources during peak demands. The Workload Manager can allocate resources and manage CPU, memory, and disk I/O bandwidth. For details, see project system administrator and AIX documentation.

Database Performance Features

Nearly everything is a trade-off in the physical database implementation. Work with the DBA in determining which of the many available alternatives is the best implementation choice for the particular database. The project team must have a thorough understanding of the data, database, and desired use of the database by the end-user community prior to beginning the physical implementation process. Evaluate the following considerations during the implementation process.

• Denormalization. The DBA can use denormalization to improve performance by eliminating the constraints and primary key to foreign key relationships, and also eliminating join tables.

• Indexes. Proper indexing can significantly improve query response time. The trade-off of heavy indexing is a degradation of the time required to load data rows into the target tables. Carefully written pre-session scripts are recommended to drop indexes before the load, with post-session scripts rebuilding them after the load (see the sketch following this list).

• Constraints. Avoid constraints if possible; instead, enforce integrity by incorporating that additional logic in the mappings.

• Rollback and Temporary Segments. Rollback and temporary segments are primarily used to store data for queries (temporary) and INSERTs and UPDATES (rollback). The rollback area must be large enough to hold all the data prior to a COMMIT. Proper sizing can be crucial to ensuring successful completion of load sessions, particularly on initial loads.

• OS Priority. The priority of background processes is an often-overlooked problem that can be difficult to determine after the fact. DBAs must work with the System Administrator to ensure all the database processes have the same priority.

• Striping. Database performance can be increased significantly by implementing either RAID 0 (striping) or RAID 5 (pooled disk sharing) to improve disk I/O throughput.

• Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by separating the disk controllers.
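
As referenced in the Indexes bullet above, the following is a minimal sketch of the pre- and post-session index scripts. The SALES_FACT table, its columns, and the index names are hypothetical and shown only to illustrate the pattern:

/* Pre-session script: drop indexes before the load */
DROP INDEX SALES_FACT_CUST_IDX;
DROP INDEX SALES_FACT_PROD_IDX;

/* Post-session script: rebuild indexes after the load */
CREATE INDEX SALES_FACT_CUST_IDX ON SALES_FACT (CUSTOMER_KEY);
CREATE INDEX SALES_FACT_PROD_IDX ON SALES_FACT (PRODUCT_KEY);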

Understanding and Setting UNIX Resources for PowerCenter Installations

Challenge

This Best Practice explains what UNIX resource limits are, and how to control and manage them.

Description

UNIX systems impose per-process limits on resources such as processor usage, memory, and file handles. Understanding and setting these resources correctly is essential for PowerCenter installations.

Understanding UNIX Resource Limits

UNIX systems impose limits on several different resources. The resources that can be limited depend on the actual operating system (e.g., Solaris, AIX, Linux, or HPUX) and the version of the operating system. In general, all UNIX systems implement per-process limits on the following resources. There may be additional resource limits depending on the operating system.

Resource                 Description
Processor time           The maximum amount of processor time that can be used by a process, usually in seconds.
Maximum file size        The size of the largest single file a process can create. Usually specified in blocks of 512 bytes.
Process data             The maximum amount of data memory a process can allocate. Usually specified in KB.
Process stack            The maximum amount of stack memory a process can allocate. Usually specified in KB.
Number of open files     The maximum number of files that can be open simultaneously.
Total virtual memory     The maximum amount of memory a process can use, including stack, instructions, and data. Usually specified in KB.
Core file size           The maximum size of a core dump file. Usually specified in blocks of 512 bytes.

These limits are implemented on an individual process basis. The limits are also ‘inherited’ by child processes when they are created.

In practice, this means that the resource limits are typically set at logon time, and apply to all processes started from the login shell. In the case of PowerCenter, any limits in effect before the pmserver is started will also apply to all sessions (pmdtm) started from that server. Any limits in effect when the repserver is started will also apply to all repagents started from that repserver.

When a process exceeds its resource limit, UNIX will fail the operation that caused the limit to be exceeded. Depending on the limit that is reached, memory allocations will fail, files can’t be opened, and processes will be terminated when they exceed their processor time.

Since PowerCenter sessions often use a large amount of processor time, open many files, and can use large amounts of memory, it is important to set resource limits correctly: high enough that the operating system does not block access to required resources, yet low enough to still guard against runaway processes.

Hard and Soft Limits

Each resource that can be limited actually allows two limits to be specified – a ‘soft’ limit and a ‘hard’ limit. Hard and soft limits can be confusing.

From a practical point of view, the difference between hard and soft limits doesn't matter to PowerCenter or any other process; the lower value is enforced when it is reached, whether it is a hard or soft limit.

The difference between hard and soft limits really only matters when changing resource limits. The hard limits are the absolute maximums set by the system administrator that can only be changed by the system administrator. The soft limits are ‘recommended’ values set by the System Administrator, and can be increased by the user, up to the maximum limits.

UNIX Resource Limit Commands

The standard interface to UNIX resource limits is the ‘ulimit’ shell command. This command displays and sets resource limits. The C shell implements a variation of this command called ‘limit’, which has different syntax but the same functions.

ulimit -a        Displays all soft limits

ulimit -H -a     Displays all hard limits in effect

Recommended ulimit settings for a PowerCenter server:

Resource                 Description
Processor time           Unlimited. This is needed for the pmserver and pmrepserver that run forever.
Maximum file size        Based on what's needed for the specific application. This is an important parameter to keep a session from filling a whole filesystem, but it needs to be large enough not to affect normal production operations.
Process data             1GB to 2GB
Process stack            32MB
Number of open files     At least 256. Each network connection counts as a 'file', so source, target, and repository connections, as well as cache files, all use file handles.
Total virtual memory     The largest expected size of a session. 1GB should be adequate, unless sessions are expected to create large in-memory aggregate and lookup caches that require more memory than this.
Core file size           Unlimited, unless disk space is very tight. The largest core files could be ~2-3GB, but after analysis they should be deleted, and there really shouldn't be multiple core files lying around.

Setting Resource Limits

Resource limits are normally set in the login script, either .profile for the Korn shell or .bash_profile for the bash shell. One ulimit command is required for each resource being set, and usually the soft limit is set. A typical sequence is:

ulimit -S -c unlimited
ulimit -S -d 1232896
ulimit -S -s 32768
ulimit -S -t unlimited
ulimit -S -f 2097152
ulimit -S -n 1024
ulimit -S -v unlimited

After running this, the limits are changed:

% ulimit -S -a
core file size        (blocks, -c) unlimited
data seg size         (kbytes, -d) 1232896
file size             (blocks, -f) 2097152
max memory size       (kbytes, -m) unlimited
open files            (-n) 1024
stack size            (kbytes, -s) 32768
cpu time              (seconds, -t) unlimited
virtual memory        (kbytes, -v) unlimited

Setting or Changing Hard Resource Limits

Setting or changing hard resource limits varies across UNIX types. Most current UNIX systems set the initial hard limits in the file /etc/profile, which must be changed by a System Administrator. In some cases, it is necessary to run a system utility such as smit on AIX to change the global system limits.

Upgrading PowerCenter

Challenge

Upgrading an existing version of PowerCenter to a later one encompasses upgrading the repositories, implementing any necessary modifications, testing, and configuring new features. The challenge here is to tackle the upgrade exercise in a structured fashion and minimize risks to the repository and project work.

Some of the challenges typically encountered during an upgrade are:

• Limiting the development downtime to a minimum
• Ensuring that development work performed during the upgrade is accurately migrated to the upgraded repository.
• Ensuring that all the elements of all the various environments (e.g., Development, Test, and Production) are upgraded.

Description

Some typical reasons for an upgrade include:

• To take advantage of the new features in PowerCenter to enhance development productivity and administration
• To solve more business problems
• To achieve data processing performance gains

Upgrade Team

Assembling a team of knowledgeable individuals to carry out the PowerCenter upgrade is key to completing the process within schedule and budgetary guidelines. Typically, the upgrade team needs the following key players:

• PowerCenter Administrator
• Database Administrator
• System Administrator
• Informatica team - the business and technical users that "own" the various areas in the Informatica environment. These users are necessary for knowledge transfer and to verify results after the upgrade is complete.

Upgrade Paths

The specific upgrade process depends on which of the existing PowerCenter versions you are upgrading from and which version you are moving to. The following bullet items summarize the upgrade paths for the various PowerCenter versions:

• PowerCenter 7.0 (available since December 2003)
  o Direct upgrade for PowerCenter 5.x to 7.x
  o Direct upgrade for PowerCenter 6.x to 7.x
• Other versions:
  o For version 4.6 or earlier - upgrade to 5.x, and then to 7.x (or 6.x)
  o For version 4.7 - upgrade to 5.x or 6.x, and then to 7.x

Upgrade Tips

Some of the following items may seem obvious, but adhering to these tips should help to ensure that the upgrade process goes smoothly. Be sure to have sufficient memory and disk space (database).

• Remember that the version 7.x repository is 10 percent larger than the version 6.x repository and as much as 35 percent larger than the version 5.x repository.

• Always read the upgrade log file.
• Backup Repository Server and PowerCenter Server configuration files prior to beginning the upgrade process.
• Remember that version 7.x uses the registry while version 6.x used win.ini - and plan accordingly for the change.
• Test the AEP/EP (Advanced External Procedure/External Procedure) prior to beginning the upgrade. Recompiling may be necessary.
• If PowerCenter is running on Windows, you will need another Windows-based machine to set up a parallel Development environment since two servers cannot run on the same Windows machine.
• If PowerCenter is running on a UNIX platform, you can set up a parallel Development environment in a different directory, with a different user and modified profile.

• Ensure that all repositories for upgrade are backed up and that they can be restored successfully. Repositories can be restored to the same database in different schemas to allow an upgrade to be carried out in parallel. This is especially useful if PowerCenter test and development environments reside in a single repository.

Upgrading multiple projects

Be sure to consider the following items if the upgrade involves multiple projects:

• All projects sharing a repository must upgrade at the same time (test concurrently).
• Projects using multiple repositories must all upgrade at the same time.
• After the upgrade, each project should undergo full regression testing.

Upgrade project plan

The full upgrade process from version 5.x to 7.x can be extremely time consuming for a large development environment. Informatica strongly recommends developing a project plan to track progress and inform managers and team members of the tasks that need to be completed.

Scheduling the upgrade

When an upgrade is scheduled in conjunction with other development work, it is prudent to have it occur within a separate test environment that mimics production. This reduces the risk of unexpected errors and can decrease the effort spent on the upgrade. It may also allow the development work to continue in parallel with the upgrade effort, depending on the specific site setup.

Upgrade Process

Informatica recommends using the following approach to handle the challenges inherent in an upgrade effort.

Choosing an appropriate environment

It is advisable to have three separate environments: one each for Development, Test, and Production.

The Test environment is generally the best place to start the upgrade process since it is likely to be the most similar to Production. If possible, select a test sandbox that parallels production as closely as possible. This will enable you to carry out data comparisons between PowerCenter versions. And, if you begin the upgrade process in a test environment, development can continue without interruption. Your corporate policies on development, test, and sandbox environments and the work that can or cannot be done in them will determine the precise order for the upgrade and any associated development changes. Note that if changes are required as a result of the upgrade, they will need to be migrated to Production. Use the existing version to backup the PowerCenter repository, then ensure that the backup works by restoring it to a new schema in the repository database.

Alternatively, you can begin the upgrade process in the Development environment or set up a parallel environment in which to start the effort. The decision to use or copy an existing platform depends on the state of project work across all environments. If it is not possible to set up a parallel environment, the upgrade may start in Development, then progress to the Test and Production systems. However, using a parallel environment is likely to minimize development downtime. The important thing is to understand the upgrade process and your own business and technical requirements, then adapt the approaches described in this document to one that suits your particular situation.

Organizing the upgrade effort

Begin by evaluating the entire upgrade effort in terms of resources, time, and environments. This includes training, availability of database, operating system and PowerCenter administrator resources as well as time to do the upgrade and carry out the necessary testing in all environments. Refer to the release notes to help identify mappings and other repository objects that may need changes as a result of the upgrade.

Provide detailed training for the Upgrade team to ensure that everyone directly involved in the upgrade process understands the new version and is capable of using it for their own development work and assisting others with the upgrade process.

Run regression tests for all components on the old version. If possible, store the results so that you can use them for comparison purposes after the upgrade is complete.

Before you begin the upgrade, be sure to backup the repository and server caches, scripts, logs, bad files, parameter files, source and target files, and external procedures. Also be sure to copy backed-up server files to the new directories as the upgrade progresses.

If you are working in a UNIX environment and have to use the same machine for existing and upgrade versions, be sure to use separate users and directories, and ensure that profile path statements do not overlap between the new and old versions of PowerCenter. For additional information, refer to the system manuals for path statements and environment variables for your platform and operating system.

Installing and configuring the software

• Install the new version of the PowerCenter components on the server.
• Ensure that the PowerCenter client is installed on at least one workstation to be used for upgrade testing and that connections to repositories are updated if parallel repositories are being used.
• Re-compile the AEP/EP if needed and test them.
• Configure and start the repository server (ensure that licensing keys are entered in the Repository Server Administration Console for version 7.x.x).
• Upgrade the repository using the Repository Manager, Repository Server Administration Console, or the pmrepagent command, depending on the version you are upgrading to. Note that the Repository Server needs to be started in order to run the upgrade.
• Configure the server details in the Workflow Manager and complete the server configuration on the server (including license keys for version 7.x.x).
• Start the PowerCenter server pmserver on UNIX or the Informatica service on a Microsoft Windows operating system.

• Analyze upgrade activity logs to identify areas where changes may be required, then rerun full regression tests on the upgraded repository.

• Run through test plans. Ensure that there are no failures and all the loads run successfully in the upgraded environment.

• Verify the data to ensure that there are no changes and no additional or missing records.

Implementing changes and testing

If changes are needed, decide where those changes are going to be made. It is generally advisable to migrate work back from test to an upgraded development environment. Complete the necessary changes, then migrate forward through test to production. Assess the changes when the results from the test runs are available. If you decide to deviate from best practice and make changes in test and migrate them forward to production, remember that you'll still need to implement the changes in development. Otherwise, these changes will be re-identified the next time work is migrated to the test environment.

When you are satisfied with the results of testing, upgrade the other environments by backing up and restoring the appropriate repositories. Be sure to closely monitor the Production environment and check the results after the upgrade. Also remember to archive and remove old repositories from the previous version.

After the Upgrade

• Make sure the "Use Repository Manager" privilege is assigned properly.
• Create a team-based environment with deployment groups, labels and/or queries.
• Create a server grid to test performance gains.
• Start measuring data quality by creating a sample data profile.
• If LDAP is in use, associate LDAP users with PowerCenter users.
• Install PowerCenter Metadata Reporter and configure the built-in reports for the PowerCenter repository.

Repository versioning

After upgrading to version 7, you can set the repository to versioned or non-versioned if the Team-Based Management option has been purchased and is enabled by the license.

Once the repository is set to versioned, it cannot be set back to non-versioned.

Upgrading folder versions

After upgrading to version 7.x, you'll need to remember the following:

• There are no more folder versions in version 7.
• The folder with the highest version number becomes the current folder.
• Other versions of the folders will be folder_<folder_version_number>.
• Shortcuts will be created to mappings from the current folder.

Upgrading repository privileges

Version 7 includes a repository privilege called “Use Repository Manager”, which enables users to use new features incorporated in version 7. Users with “Use Designer” and “Use WFM” get this new privilege.

Upgrading Pmrep and Pmcmd scripts

• No more folder versions for pmrep and pmrepagent scripts
• Need to make sure the workflow/session folder names match the upgraded names
• Note that the pmcmd command structure changes significantly after version 5. Version 5 pmcmd commands will still run in version 7 but are not guaranteed to be backwards compatible in future versions.

Advanced external procedure transformations

AEPs are upgraded to Custom Transformations - a non-blocking transformation. To use this feature, the procedure must be recompiled. The old DLL/library can be used when recompilation is not required.

Upgrading XML definitions

• Version 7 supports XML schema.
• The upgrade removes namespaces and prefixes for multiple namespaces
• Circular reference definitions are read-only after the upgrade
• Some datatypes are changed in XML definitions by the upgrade

Upgrading transaction control mappings

Version 7 does not support concatenation of pipelines or branches with transaction control transformations. After the upgrade, fix mappings and re-save.

Assessing the Business Case

Challenge

Developing a solid business case for the project that includes both the tangible and intangible potential benefits of the project.

Description

The Business Case should include both qualitative and quantitative assessments of the project.

The Qualitative Assessment portion of the Business Case is based on the Statement of Problem/Need and the Statement of Project Goals and Objectives (both generated in Subtask 1.1.1) and focuses on discussions with the project beneficiaries of expected benefits in terms of problem alleviation, cost savings or controls, and increased efficiencies and opportunities.

The Quantitative Assessment portion of the Business Case provides specific measurable details of the proposed project, such as the estimated ROI. This may involve the following calculations:

• Cash flow analysis - Projects positive and negative cash flows for the anticipated life of the project. Typically, ROI measurements use the cash flow formula to depict results.

• Net present value - Evaluates cash flow according to the long-term value of current investment. Net present value shows how much capital needs to be invested currently, at an assumed interest rate, in order to create a stream of payments over time. For instance, to generate an income stream of $500 per month over six months at an interest rate of eight percent would require an investment (i.e., a net present value) of $2,311.44 (a worked calculation follows this list).

• Return on investment - Calculates net present value of total incremental cost savings and revenue divided by the net present value of total costs multiplied by 100. This type of ROI calculation is frequently referred to as return of equity or return on capital employed.

• Payback Period - Determines how much time will pass before an initial capital investment is recovered.
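
To make the net present value example above concrete, the following worked calculation reproduces the $2,311.44 figure, assuming the eight percent rate is applied per monthly payment period (an assumption, since the text does not state the compounding period):

PV = P \times \frac{1 - (1 + r)^{-n}}{r} = 500 \times \frac{1 - (1.08)^{-6}}{0.08} \approx 2{,}311.44

where P is the periodic payment ($500), r is the interest rate per period (0.08), and n is the number of periods (6).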

The following are steps to calculate the quantitative business case or ROI:

Step 1. Develop Enterprise Deployment Map. This is a model of the project phases over a timeline, estimating as specifically as possible participants, requirements, and systems involved. A data integration initiative or amendment may require estimating customer participation (e.g., by department and location), subject area and type of information/analysis, numbers of users, numbers and complexity of target data systems (data marts or operational databases, for example) and data sources, types of sources, and size of data set. A data migration project may require customer participation, legacy system migrations, and retirement procedures. The types of estimations vary by project types and goals. It is important to note that the more details you have for estimations, the more precise your phased solutions will be. The scope of the project should also be made known in the deployment map.

Step 2. Analyze Potential Benefits. Discussions with representative managers and users or the Project Sponsor should reveal the tangible and intangible benefits of the project. The most effective format for presenting this analysis is often a "before" and "after" format that compares the current situation to the project expectations. Also include in this step any costs that will be avoided as a result of deploying this project.

Step 3. Calculate Net Present Value for all Benefits. Information gathered in this step should help the customer representatives to understand how the expected benefits will be allocated throughout the organization over time, using the enterprise deployment map as a guide.

Step 4. Define Overall Costs. Customers need specific cost information in order to assess the dollar impact of the project. Cost estimates should address the following fundamental cost components:

• Hardware
• Networks
• RDBMS software
• Back-end tools
• Query/reporting tools
• Internal labor
• External labor
• Ongoing support
• Training

Step 5. Calculate Net Present Value for all Costs. Use either actual cost estimates or percentage-of-cost values (based on cost allocation assumptions) to calculate costs for each cost component, projected over the timeline of the enterprise deployment map. Actual cost estimates are more accurate than percentage-of-cost allocations, but much more time-consuming. The percentage-of-cost allocation process may be valuable for initial ROI snapshots until costs can be more clearly predicted.

Step 6. Assess Risk, Adjust Costs and Benefits Accordingly. Review potential risks to the project and make corresponding adjustments to the costs and/or benefits. Some of the major risks to consider are:

• Scope creep, which can be mitigated by thorough planning and tight project scope
• Integration complexity, which may be reduced by standardizing on vendors with integrated product sets or open architectures
• Architectural strategy that is inappropriate
• Current support infrastructure may not meet the needs of the project
• Conflicting priorities may impact resource availability
• Other miscellaneous risks from management or end users who may withhold project support; from the entanglements of internal politics; and from technologies that don't function as promised
• Unexpected data quality, complexity, or definition issues often are discovered late, during the course of the project, and can adversely affect effort, cost, and schedule. This can be somewhat mitigated by early source analysis.

Step 7. Determine Overall ROI. When all other portions of the business case are complete, calculate the project's "bottom line". Determining the overall ROI is simply a matter of subtracting net present value of total costs from net present value of (total incremental revenue plus cost savings).

Defining and Prioritizing Requirements

Challenge

Defining and prioritizing business and functional requirements is often accomplished through a combination of interviews and facilitated meetings (i.e., workshops) between the Project Sponsor and beneficiaries and the Project Manager and Business Analyst.

Description

The following three steps are key for successfully defining and prioritizing requirements:

Step 1: Discovery

Gathering business requirements is one of the most important stages of any data integration project. Business requirements affect virtually every aspect of the data integration project starting from Project Planning and Management to End-User Application Specification. They are like a hub that sits in the middle and touches the various stages (spokes) of the data integration project. There are two basic techniques for gathering requirements and investigating the underlying operational data: interviews and facilitated sessions.

Interviews

It is important to conduct pre-interview research before starting the requirements gathering process. Interviewees can be categorized into business management and Information Systems (IS) management.

Business Interviewees: Depending on the needs of the project, even though you may be focused on a single primary business area, it is always beneficial to interview horizontally to get a good cross-functional perspective of the enterprise. This also provides insight into how extensible your project is across the enterprise. Before you interview, be sure to develop an interview questionnaire, schedule the interview time and place, and prepare the interviewees by sending a sample agenda. When interviewing business people it is always important to start with the upper echelons of management so as to understand the overall vision, assuming you have the business background, confidence and credibility to converse at those levels. If not adequately prepared, the safer approach is to interview middle management. If you are interviewing across multiple teams, you might want to scramble interviews among teams. This way if you hear different perspectives from finance and marketing, you can resolve the discrepancies with a scrambled interview schedule. A note to keep in mind is that business is sponsoring the data integration project and will be the end-users of the application. They will decide the success criteria of your data integration project and determine future sponsorship. Questioning during these sessions should include the following:

• What are the target business functions, roles, and responsibilities?
• What are the key relevant business strategies, decisions, and processes (in brief)?
• What information is important to drive, support, and measure success for those strategies/processes? What key metrics? What dimensions for those metrics?
• What current reporting and analysis is applicable? Who provides it? How is it presented? How is it used? How can it be improved?

IS interviewees: The IS interviewees have a different flavor than the business user community. Interviewing the IS team is generally very beneficial because it is composed of data gurus who deal with the data on a daily basis. They can provide great insight into data quality issues, help in the systematic exploration of legacy source systems, and help in understanding business user needs around critical reports. If you are developing a prototype, they can help get things done quickly and address important business reports. Questioning during these sessions should include the following:

• Request an overview of existing legacy source systems. How does data currently flow from these systems to the users?
• What day-to-day maintenance issues does the operations team encounter with these systems?
• Ask for their insight into data quality issues.
• What business users do they support? What reports are generated on a daily, weekly, or monthly basis? What are the current service level agreements for these reports?
• How can the DI project support the IS department's needs?

Facilitated Sessions

The biggest advantage of facilitated sessions is that they provide quick feedback by gathering all the people from the various teams into a meeting and initiating the requirements process. You need a facilitator in these meetings to ensure that all the participants get a chance to speak and provide feedback. During individual (or small group) interviews with high-level management, there is often focus and clarity of vision that may be hindered in large meetings.

The biggest challenge to facilitated sessions is matching everyone’s busy schedules and actually getting them into a meeting room. However, this part of the process must be focused and brief or it can become unwieldy with too much time expended just trying to coordinate calendars among worthy forum participants. Set a time period and target list of participants with the Project Sponsor, but avoid lengthening the process if some participants aren't available. The questions asked during facilitated sessions are similar to the questions asked to business and IS interviewees.

Step 2: Validation and Prioritization

The Business Analyst, with the help of the Project Architect, documents the findings of the discovery process after interviewing the business and IS management. The next step is to define the business requirements specification. The resulting Business Requirements Specification includes a matrix linking the specific business requirements to their functional requirements. Defining the business requirements is a time consuming process and should be facilitated by forming a working group team. A working group team usually consists of business users, business analysts, project manager, and other individuals who can help to define the business requirements. The working group should meet weekly to define and finalize business requirements. The working group helps to:

• Design the current state and future state • Identify supply format and transport mechanism • Identify required message types • Develop Service Level Agreement(s), including timings • Identify supply management and control requirements • Identify common verifications, validations, business validations and transformation

rules • Identify common reference data requirements • Identify common exceptions • Produce the physical message specification

At this time also, the Architect develops the Information Requirements Specification to clearly represent the structure of the information requirements. This document, based on the business requirements findings, will facilitate discussion of informational details and provide the starting point for the target model definition.

The detailed business requirements and information requirements should be reviewed with the project beneficiaries and prioritized based on business need and the stated project objectives and scope.

Step 3: The Incremental Roadmap

Concurrent with the validation of the business requirements, the Architect begins the Functional Requirements Specification providing details on the technical requirements for the project.

As general technical feasibility is compared to the prioritization from Step 2, the Project Manager, Business Analyst, and Architect develop consensus on a project "phasing" approach. Items of secondary priority and those with poor near-term feasibility are relegated to subsequent phases of the project. Thus, they develop a phased, or incremental, "roadmap" for the project (Project Roadmap).

This is presented to the Project Sponsor for approval and becomes the first "Increment" or starting point for the Project Plan.

Developing a Work Breakdown Structure (WBS)

Challenge

Developing a comprehensive work breakdown structure (WBS) that clearly depicts all of the various tasks and subtasks required to complete a project. Because project time and resource estimates are typically based on the WBS, it is critical to develop a thorough, accurate WBS.

Description

The WBS is a "divide and conquer" approach to project management. It is a hierarchical tree that allows a large task to be visualized as a group of related smaller, more manageable sub-tasks. These tasks can be more easily monitored and communicated; they also make identifying accountability a more direct and clear process. The WBS serves as a starting point for both the project estimate and the project plan.

One challenge in developing a thorough WBS is obtaining the correct balance between enough detail and too much detail. The WBS shouldn't be a 'grocery list' of every minor detail in the project, but it does need to break the tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of at least a day. Also, when naming these tasks, take care that all organizations that will be participating in the project understand how tasks are broken down. If department A typically breaks a certain task up among three groups and department B assigns it to one, there can be potential issues when tasks are assigned.

It is also important to remember that the WBS is not necessarily a sequential document. Tasks in the hierarchy are often completed in parallel. At this stage of project planning, the goal is to list every task that must be completed; it is not necessary to determine the critical path for completing these tasks. For example, you may have multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3). So, although subtasks 4.3.1 through 4.3.4 may have sequential requirements that force you to complete them in order, subtasks 4.3.5 through 4.3.7 can - and should - be completed in parallel if they do not have sequential requirements. However, it is important to remember that a task is not complete until all of its corresponding subtasks are completed - whether sequentially or in parallel. For example, the Build phase is not complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin for the Deploy phase long before the Build phase is complete.

The Project Plan provides a starting point for further development of the project WBS. This sample is a Microsoft Project file that has been "pre-loaded" with the phases, tasks, and subtasks that make up the Informatica Methodology. The Project Manager can use this WBS as a starting point, but should review it carefully to ensure that it corresponds to the specific development effort, removing any steps that aren't relevant or adding steps as necessary. Many projects require the addition of detailed steps to accurately represent the development effort.

If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown Structure is available. The phases, tasks, and subtasks can be exported from Excel into many other project management tools, simplifying the effort of developing the WBS.

After the WBS has been loaded into the selected project management tool and refined for the specific project needs, the Project Manager can begin to estimate the level of effort involved in completing each of the steps. When the estimate is complete, individual resources can be assigned and scheduled. The end result is the Project Plan. Refer to Developing and Maintaining the Project Plan for further information about the project plan.

Developing and Maintaining the Project Plan

Challenge

Developing the first-pass of a project plan that incorporates all of the necessary components but which is sufficiently flexible to accept the inevitable changes.

Description

Use the following steps as a guide for developing the initial project plan:

1. Define the project's major milestones based on the Project Scope.
2. Break the milestones down into major tasks and activities. The Project Plan should be helpful as a starting point or for recommending tasks for inclusion.

3. Continue the detail breakdown, if possible, to a level at which tasks are of about one to three days' duration. This level provides satisfactory detail to facilitate estimation and tracking. If the detail tasks are too broad in scope, estimates are much less likely to be accurate.

4. Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if applicable).

5. Establish the dependencies among tasks, where one task cannot be started until another is completed (or must start or complete concurrently with another).

6. Define the resources based on the role definitions and estimated number of resources needed for each role.

7. Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.

At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities.

The initial definition of tasks and effort and the resulting schedule should be an exercise in pragmatic feasibility unfettered by concerns about ideal completion dates. In other words, be as realistic as possible in your initial estimations, even if the resulting scheduling is likely to be a hard sell to company management.

This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for opportunities for parallel activities, perhaps adding resources, if necessary, to improve the schedule.

When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions, dependencies, assignments, milestone dates, and such. Expect to modify the plan as a result of this review.

Reviewing and Revising the Project Plan

Once the Project Sponsor and company managers agree to the initial plan, it becomes the basis for assigning tasks to individuals on the project team and for setting expectations regarding delivery dates. The planning activity then shifts to tracking tasks against the schedule and updating the plan based on status and changes to assumptions.

One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If company and project management do not require tracking against a baseline, simply maintain the plan through updates without a baseline.

Regular status reporting should include any changes to the schedule, beginning with team members' notification that dates for task completions are likely to change or have already been exceeded. These status report updates should trigger a regular plan update so that project management can track the effect on the overall schedule and budget.

Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change Assessment Sample Deliverable), or changes in priority or approach, as they arise to determine if they impact the plan. It may be necessary to modify the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add new tasks or postpone existing ones.

Developing the Business Case

Challenge

Identifying the departments and individuals that are likely to benefit directly from the project implementation. Understanding these individuals, and their business information requirements, is key to defining and scoping the project.

Description

The following four steps summarize business case development and lay a good foundation for proceeding into detailed business requirements for the project.

1. One of the first steps in establishing the business scope is identifying the project beneficiaries and understanding their business roles and project participation. In many cases, the Project Sponsor can help to identify the beneficiaries and the various departments they represent. This information can then be summarized in an organization chart that is useful for ensuring that all project team members understand the corporate/business organization.

• Activity - Interview project sponsor to identify beneficiaries, define their business roles and project participation.

• Deliverable - Organization chart of corporate beneficiaries and participants.

2. The next step in establishing the business scope is to understand the business problem or need that the project addresses. This information should be clearly defined in a Problem/Needs Statement, using business terms to describe the problem. For example, the problem may be expressed as "a lack of information" rather than "a lack of technology" and should detail the business decisions or analysis that is required to resolve the lack of information. The best way to gather this type of information is by interviewing the Project Sponsor and/or the project beneficiaries.

• Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding problems and needs related to project.

• Deliverable - Problem/Need Statement

3. The next step in creating the project scope is defining the business goals and objectives for the project and detailing them in a comprehensive Statement of Project Goals and Objectives. This statement should be a high-level expression of the desired business solution (e.g., what strategic or tactical benefits does the business expect to gain from the project), and should avoid any technical considerations at this point. Again, the Project Sponsor and beneficiaries are the best sources for this type of information. It may be practical to combine information gathering for the needs assessment and goals definition, using individual interviews or general meetings to elicit the information.

• Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding business goals and objectives for the project.

• Deliverable - Statement of Project Goals and Objectives

4. The final step is creating a Project Scope and Assumptions statement that clearly defines the boundaries of the project based on the Statement of Project Goals and Objectives and the associated project assumptions. This statement should focus on the type of information or analysis that will be included in the project rather than what will not.

The assumptions statements are optional and may include qualifiers on the scope, such as assumptions of feasibility, specific roles and responsibilities, or availability of resources or data.

• Activity - Business Analyst develops Project Scope and Assumptions statement for presentation to the Project Sponsor.

• Deliverable - Project Scope and Assumptions statement

Managing the Project Lifecycle

Challenge

Providing a structure for on-going management throughout the project lifecycle.

Description

It is important to remember that the quality of a project can be directly correlated to the amount of review that occurs during its lifecycle.

Project Status and Plan Reviews

In addition to the initial project plan review with the Project Sponsor, schedule regular status meetings with the sponsor and project team to review status, issues, scope changes and schedule updates.

Gather status, issues and schedule update information from the team one day before the status meeting in order to compile and distribute the Status Report.

Project Content Reviews

The Project Manager should coordinate, if not facilitate, reviews of requirements, plans and deliverables with company management, including business requirements reviews with business personnel and technical reviews with project technical personnel.

Set a process in place beforehand to ensure appropriate personnel are invited, any relevant documents are distributed at least 24 hours in advance, and that reviews focus on questions and issues (rather than a laborious "reading of the code").

Reviews may include:

• Project scope and business case review
• Business requirements review
• Source analysis and business rules reviews
• Data architecture review
• Technical infrastructure review (hardware and software capacity and configuration planning)
• Data integration logic review (source to target mappings, cleansing and transformation logic, etc.)
• Source extraction process review
• Operations review (operations and maintenance of load sessions, etc.)
• Reviews of operations plan, QA plan, deployment and support plan

Change Management

Directly address and evaluate any changes to the planned project activities, priorities, or staffing as they arise, or are proposed, in terms of their impact on the project plan.

• Use the Scope Change Assessment to record the background problem or requirement and the recommended resolution that constitutes the potential scope change.

• Review each potential change with the technical team to assess its impact on the project, evaluating the effect in terms of schedule, budget, staffing requirements, and so forth.

• Present the Scope Change Assessment to the Project Sponsor for acceptance (with formal sign-off, if applicable). Discuss the assumptions involved in the impact estimate and any potential risks to the project.

The Project Manager should institute this type of change management process in response to any issue or request that appears to add or alter expected activities and has the potential to affect the plan. Even if there is no evident effect on the schedule, it is important to document these changes because they may affect project direction and it may become necessary, later in the project cycle, to justify these changes to management.

Issues Management

Any questions, problems, or issues that arise and are not immediately resolved should be tracked to ensure that someone is accountable for resolving them so that their effect can also be visible.

Use the Issues Tracking template, or something similar, to track issues, their owner, and dates of entry and resolution as well as the details of the issue and of its solution.

Significant or "showstopper" issues should also be mentioned on the status report.

Project Acceptance and Close

Rather than simply walking away from a project when it seems complete, there should be an explicit close procedure. For most projects this involves a meeting where the Project Sponsor and/or department managers acknowledge completion or sign a statement of satisfactory completion.

• Even for relatively short projects, use the Project Close Report to finalize the project with a final status report detailing:

o What was accomplished
o Any justification for tasks expected but not completed
o Recommendations


• Prepare for the close by considering what the project team has learned about the environments, procedures, data integration design, data architecture, and other project plans.

• Formulate the recommendations based on issues or problems that need to be addressed. Succinctly describe each problem or recommendation and if applicable, briefly describe a recommended approach.


Using Interviews to Determine Corporate Analytics Requirements

Challenge

Data warehousing projects are usually initiated out of a business need for a certain type of report (i.e., “we need consistent reporting of revenue, bookings and backlog”). Except in the case of narrowly-focused, departmental data marts, however, this is not enough guidance to drive a full analytic solution. Further, a successful, single-purpose data mart can build a reputation such that, after a relatively brief period of proving its value to users, business management floods the technical group with requests for more data marts in other areas. The only way to avoid silos of data marts is to think bigger at the beginning and canvass the enterprise (or at least the department, if that’s your limit of scope) for a broad analysis of analytic requirements.

Description

Determining the analytic requirements in satisfactory detail and clarity is a difficult task, however, especially while ensuring that the requirements are representative of all the potential stakeholders. This Best Practice summarizes the recommended interview and prioritization process for this requirements analysis.

Process Steps

The first step in the process is to identify and interview “all” major sponsors and stakeholders. This typically includes the executive staff and CFO since they are likely to be the key decision makers who will depend on the analytics. At a minimum, figure on 10 to 20 interview sessions.

The next step in the process is to interview representative information providers. These individuals include the decision makers who provide the strategic perspective on what information to pursue, as well as details on that information, and how it is currently used (i.e., reported and/or analyzed). Be sure to provide feedback to all of the sponsors and stakeholders regarding the findings of the interviews and the recommended subject areas and information profiles. It is often helpful to facilitate a Prioritization Workshop with the major stakeholders, sponsors, and information providers in order to set priorities on the subject areas.

Conduct Interviews


The following paragraphs offer some tips on the actual interviewing process. Two sections at the end of this document provide sample interview outlines for the executive staff and information providers.

Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A focused, consistent interview format is desirable. Don't feel bound to the script, however, since interviewees are likely to raise some interesting points that may not be included in the original interview format. Pursue these subjects as they come up, asking detailed questions. This approach often leads to “discoveries” of strategic uses for information that may be exciting to the client and provide sparkle and focus to the project.

Questions to the “executives” or decision-makers should focus on the business strategies and decisions that need information to support or monitor them. (Refer to Outline for Executive Interviews at the end of this document). Coverage here is critical: if key managers are left out, you may miss an important viewpoint and an opportunity for buy-in.

Interviews of information providers are secondary but can be very useful. These are the business analyst-types who report to decision-makers and currently provide reports and analyses using Excel or Lotus or a database program to consolidate data from more than one source and provide regular and ad hoc reports or conduct sophisticated analysis. In subsequent phases of the project, you must identify all of these individuals, learn what information they access, and how they process it. At this stage however, you should focus on the basics, building a foundation for the project and discovering what tools are currently in use and where gaps may exist in the analysis and reporting functions.

Be sure to take detailed notes throughout the interview process. If there are a lot of interviews, you may want the interviewer to partner with someone who can take good notes, perhaps on a laptop to save note transcription time later. It is important to take down the details of what each person says because, at this stage, it is difficult to know what is likely to be important. While some interviewees may want to see detailed notes from their interviews, this is not very efficient since it takes time to clean up the notes for review. The most efficient approach is to simply consolidate the interview notes into a summary format following the interviews.

Be sure to review previous interviews as you go through the interviewing process. You can often use information from earlier interviews to pursue topics in later interviews in more detail and with varying perspectives.

The executive interviews must be carried out in “business terms.” There can be no mention of the data warehouse or systems of record or particular source data entities or issues related to sourcing, cleansing, or transformation; it is strictly forbidden to use any technical language. It can be valuable to have an industry expert prepare and even accompany the interviewer to provide business terminology and focus. If the interview falls into “technical details,” for example, into a discussion of whether certain information is currently available or could be integrated into the data warehouse, it is up to the interviewer to re-focus immediately on business needs. If this focus is not maintained, the opportunity for brainstorming is likely to be lost, which will reduce the quality and breadth of the business drivers.


Because of the above caution, it is rarely acceptable to have IS resources present at the executive interviews. These resources are likely to engage the executive (or vice versa) in a discussion of current reporting problems or technical issues and thereby destroy the interview opportunity.

Keep the interview groups small. One or two Professional Services personnel should suffice with at most one client project person. Especially for executive interviews, there should be one interviewee. There is sometimes a need to interview a group of middle managers together, but if there are more than two or three, you are likely to get much less input from the participants.

Distribute Interview Findings and Recommended Subject Areas

At the completion of the interviews, compile the interview notes and consolidate the content into a summary. This summary should help to break out the input into departments or other groupings significant to the client. Use this content and your interview experience along with “best practices” or industry experience to recommend specific, well-defined subject areas.

Remember that this is a critical opportunity to position the project to the decision-makers by accurately representing their interests while adding enough creativity to capture their imagination. Provide them with models or profiles of the sort of information that could be included in a subject area so they can visualize its utility. This sort of “visionary concept” of their strategic information needs is crucial to drive their awareness and is often suggested during interviews of the more strategic thinkers. Tie descriptions of the information directly to stated business drivers (e.g., key processes and decisions) to further accentuate the “business solution.”

A typical table of contents in the initial Findings and Recommendations document might look like this:

I. Introduction
II. Executive Summary
   A. Objectives for the Data Warehouse
   B. Summary of Requirements
   C. High Priority Information Categories
   D. Issues
III. Recommendations
   A. Strategic Information Requirements
   B. Issues Related to Availability of Data
   C. Suggested Initial Increments
   D. Data Warehouse Model
IV. Summary of Findings
   A. Description of Process Used
   B. Key Business Strategies (includes descriptions of processes, decisions, and other drivers)
   C. Key Departmental Strategies and Measurements
   D. Existing Sources of Information
   E. How Information is Used
   F. Issues Related to Information Access
V. Appendices
   A. Organizational structure and departmental roles
   B. Departmental responsibilities and relationships

Conduct Prioritization Workshop

This is a critical workshop for consensus on the business drivers. Key executives and decision-makers should attend, along with some key information providers. It is advisable to schedule this workshop offsite to assure attendance and attention, but the workshop must be efficient — typically confined to a half-day.

Be sure to announce the workshop well enough in advance to ensure that key attendees can put it on their schedules. Sending the announcement of the workshop may coincide with the initial distribution of the interview findings.

The workshop agenda should include the following items:

• Agenda and Introductions
• Project Background and Objectives
• Validate Interview Findings: Key Issues
• Validate Information Needs
• Reality Check: Feasibility
• Prioritize Information Needs
• Analytics Plan
• Wrap-up and Next Steps

Keep the presentation as simple and concise as possible, and avoid technical discussions or detailed sidetracks.

Validate information needs

Key business drivers should be determined well in advance of the workshop, using information gathered during the interviewing process. Prior to the workshop, these business drivers should be written out, preferably in display format on flipcharts or similar presentation media, along with relevant comments or additions from the interviewees and/or workshop attendees.

During the validation segment of the workshop, attendees need to review and discuss the specific types of information that have been identified as important for triggering or monitoring the business drivers. At this point, it is advisable to compile as complete a list as possible; it can be refined and prioritized in subsequent phases of the project. As much as possible, categorize the information needs by function, maybe even by specific driver (i.e., a strategic process or decision). Considering the information needs on a function by function basis fosters discussion of how the information is used and by whom.

Reality check: feasibility

With the results of brainstorming over business drivers and information needs listed (all over the walls, presumably), take a brief detour into reality before prioritizing and planning. You need to consider overall feasibility before establishing the first priority information area(s) and setting a plan to implement the data warehousing solution with initial increments to address those first priorities.

Briefly describe the current state of the likely information sources (SORs). What information is currently accessible with a reasonable likelihood of the quality and content necessary for the high priority information areas? If there is likely to be a high degree of complexity or technical difficulty in obtaining the source information, you may need to reduce the priority of that information area (i.e., tackle it after some successes in other areas).

Avoid getting into too much detail or technical issues. Describe the general types of information that will be needed (e.g., sales revenue, service costs, customer descriptive information, etc.), focusing on what you expect will be needed for the highest priority information needs.

Analytics plan

The project sponsors, stakeholders, and users should all understand that the process of implementing the data warehousing solution is incremental. Develop a high-level plan for implementing the project, focusing on increments that are both high-value and high-feasibility. Implementing these increments first provides an opportunity to build credibility for the project. The objective during this step is to obtain buy-in for your implementation plan and to begin to set expectations in terms of timing. Be practical though; don't establish too rigorous a timeline!

Wrap-up and next steps

At the close of the workshop, review the group's decisions (in 30 seconds or less), schedule the delivery of notes and findings to the attendees, and discuss the next steps of the data warehousing project.

Document the Roadmap

As soon as possible after the workshop, provide the attendees and other project stakeholders with the results:

• Definitions of each subject area, categorized by functional area
• Within each subject area, descriptions of the business drivers and information metrics
• Lists of the feasibility issues
• The subject area priorities and the implementation timeline

Outline for Executive Interviews

I. Introductions
II. General description of information strategy process
   A. Purpose and goals
   B. Overview of steps and deliverables
      • Interviews to understand business information strategies and expectations
      • Document strategy findings
      • Consensus-building meeting to prioritize information requirements and identify “quick hits”
      • Model strategic subject areas
      • Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
   A. Description of business vision and strategies
   B. Perspective on strategic business issues and how they drive information needs
      • Information needed to support or achieve business goals
      • How success is measured
IV. Briefly describe your roles and responsibilities.
   • The interviewee may provide this information before the actual interview. In this case, simply review it with the interviewee and ask if there is anything to add.
   A. What are your key business strategies and objectives?
      • How do corporate strategic initiatives impact your group?
      • These may include “MBOs” (personal performance objectives) and workgroup objectives or strategies.
   B. What do you see as the Critical Success Factors for an Enterprise Information Strategy?
      • What are its potential obstacles or pitfalls?
   C. What information do you need to achieve or support key decisions related to your business objectives?
   D. How will your organization’s progress and final success be measured (e.g., metrics, critical success factors)?
   E. What information or decisions from other groups affect your success?
   F. What are other valuable information sources (i.e., computer reports, industry reports, email, key people, meetings, phone)?
   G. Do you have regular strategy meetings? What information is shared as you develop your strategy?
   H. If it is difficult for the interviewee to brainstorm about information needs, try asking the question this way: "When you return from a two-week vacation, what information do you want to know first?"
   I. Of all the information you now receive, what is the most valuable?
   J. What information do you need that is not now readily available?
   K. How accurate is the information you are now getting?
   L. To whom do you provide information?
   M. Who provides information to you?
   N. Who would you recommend be involved in the cross-functional Consensus Workshop?

Outline for Information Provider Interviews


I. Introductions
II. General description of information strategy process
   A. Purpose and goals
   B. Overview of steps and deliverables
      • Interviews to understand business information strategies and expectations
      • Document strategy findings and model the strategic subject areas
      • Consensus-building meeting to prioritize information requirements and identify “quick hits”
      • Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
   A. Understanding of how business issues drive information needs
   B. High-level understanding of what information is currently provided to whom
      • Where does it come from
      • How is it processed
      • What are its quality or access issues
IV. Briefly describe your roles and responsibilities.
   • The interviewee may provide this information before the actual interview. In this case, simply review it with the interviewee and ask if there is anything to add.
   A. To whom do you provide information?
   B. What information do you provide to help support or measure the progress/success of their key business decisions?
   C. Of all the information you now provide, what is the most requested or most widely used?
   D. What are your sources for the information (both in terms of systems and personnel)?
   E. What types of analysis do you regularly perform (i.e., trends, investigating problems)? How do you provide these analyses (e.g., charts, graphs, spreadsheets)?
   F. How do you change/add value to the information?
   G. Are there quality or usability problems with the information you work with? How accurate is it?


PowerExchange Installation (for Mainframe)

Challenge

Installing and configuring PowerExchange on a mainframe, ensuring that the process is both efficient and effective.

Description

PowerExchange installation is straightforward and can generally be accomplished in a timely fashion. When considering a PowerExchange installation, be sure that the appropriate resources are available. These include, but are not limited to:

• MVS systems operator
• Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.)
• MVS security resources

Follow the steps below in sequence to install PowerExchange successfully. Note that in this very typical scenario, the mainframe source data is going to be “pulled” across to a server box.

1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install PowerExchange on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.

Complete the PowerExchange Pre-install Checklist and Obtain Valid License Keys

This is a prerequisite. Reviewing the environment and recording the information in this detailed checklist facilitates the PowerExchange install. The checklist can be found in the Velocity appendix. Be sure to complete all relevant sections.

You will need a valid license key in order to run any of the PowerExchange components. This is a 44-byte key that uses hyphens every 4 bytes. For example:


1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1

The key is not case-sensitive and uses hexadecimal characters (0-9 and A-F). Keys are valid for a specific time period and are also linked to an exact or generic TCP/IP address. They also control access to certain databases and determine whether the PowerCenter Mover can be used. You cannot successfully install PowerExchange without a valid key for all required components.

Note: When copying software from one machine to another, you may encounter license key problems since the license key is IP specific. Be prepared to deal with this eventuality, especially if you are going to a backup site for disaster recovery testing.

Install PowerExchange on the Mainframe

Step 1: Create a folder c:\Detail on the workstation. Copy the file “DETAIL_V5xx\software\MVS\dtlosxx.v5xx” from the PowerExchange CD to this directory. Double-click the file to unzip its contents to the c:\Detail folder.

Step 2: Create a PDS “HLQ.DTLV5xx.RUNLIB” on the mainframe in order to pre-allocate the Detail library. Ensure sufficient space for the required jobs/tasks by setting the Cylinders to 150.
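
For reference, the pre-allocation can be done with a simple IEFBR14 batch job. The sketch below is illustrative only: the job card, unit, directory blocks, and DCB attributes are assumptions and should be replaced with your site standards and the values in the PowerExchange install documentation, with HLQ and the version qualifier adjusted to match your environment.

//DTLALLOC JOB (ACCT),'ALLOC RUNLIB',CLASS=A,MSGCLASS=X
//* Pre-allocate the PowerExchange RUNLIB PDS with 150 cylinders
//ALLOC    EXEC PGM=IEFBR14
//RUNLIB   DD DSN=HLQ.DTLV5XX.RUNLIB,
//            DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,
//            SPACE=(CYL,(150,15,50)),
//            DCB=(DSORG=PO,RECFM=FB,LRECL=80,BLKSIZE=27920)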

Step 3: Run the “MVS_Install” file in the c:\Detail folder. This displays the MVS Install Assistant (as shown below). Configure the IP Address, Logon ID, Password, HLQ, and Default volume setting on the display screen. Also, enter the license key.


Click the Custom buttons to configure the desired data sources.

Be sure that the HLQ on this screen matches the HLQ of the allocated RUNLIB (from step 2).

Save these settings and click Process. This creates the JCL libraries and opens the following screen to FTP these libraries to MVS. Click XMIT to complete the FTP process.


Step 4: Edit JOBCARD in RUNLIB and configure as per the environment (e.g., execution class, message class, etc.)

Step 5: Edit the SETUP member in RUNLIB. Copy in the JOBCARD and SUBMIT. This process can submit from 5 to 24 jobs. All jobs should end with return code 0 (success).

Step 6: If implementing change capture, APF authorize the .LOAD and the .LOADLIB libraries. This is required for external security and change capture only.

Step 7: If implementing change capture, copy the Agent from the PowerExchange PROCLIB to the system site PROCLIB. In addition, when the Agent has been started, run job SETUP2 (for change capture only).

Start The PowerExchange Jobs/Tasks on the Mainframe

The installed PowerExchange Listener can be run as a normal batch job or as a started task. Informatica recommends that it initially be submitted as a batch job: RUNLIB(STARTLST)

It should return: DTL-00607 Listener VRM 5.x.x Build V5xx_P0x started.
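
If the Listener is later converted to run as a started task, it is started from the MVS operator console in the same way as the Agent shown below. The started task name used here (DTLLST) is only an assumption; the actual name is site-specific and is defined when the Listener procedure is copied to the system PROCLIB.

/S DTLLST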

If implementing change capture, start the PowerExchange Agent (as a started task):

/S DTLA

It should return: DTLEDMI1722561: EDM Agent DTLA has completed initialization.


Install The PowerExchange Client (Navigator) on a Workstation

Step 1: Run file “\DETAIL_V5xx\software\Windows\detail_pc_v5xx.exe” on the DETAIL installation CD and follow the prompts.

Step 2: Enter the license key.

Step 3: Follow the wizard to complete the install and reboot the machine.

Step 4: Add a Node entry to the configuration file “\Program Files\Striva\DETAIL\dbmover.cfg” to point to the Listener on the mainframe.

node = (mainframe location name, TCPIP, mainframe IP address, 2480)
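
For example, assuming the mainframe location is to be known as mvs1 and its TCP/IP address is 10.1.1.10 (both values are illustrative only), the entry would read:

node=(mvs1,TCPIP,10.1.1.10,2480)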

Test Connectivity to the Mainframe from the Workstation

Ensure communication to the PowerExchange Listener on the mainframe by entering the following in DOS on the workstation:

DTLREXE PROG=PING LOC=mainframe location

It should return: DTL-00755 DTLREXE Command OK!

Install PowerExchange on the UNIX Server

Step 1: Create a user for the PowerExchange installation on the UNIX box.

Step 2: Create a UNIX directory “/opt/inform/dtlv5xxp0x”.

Step 3: FTP the file “\DETAIL_V5xx\software\Unix\dtlxxx_v5xx.tar” on the DETAIL installation CD to the DETAIL installation directory on UNIX.

Step 4: Use the UNIX tar command to extract the files. The command is “tar -xvf dtlxxx_v5xx.tar”.

Step 5: Update the logon profile with the correct path, library path, and DETAIL_HOME environment variables.
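
A minimal sketch of the profile entries, assuming a Bourne/Korn shell, the installation directory created in Step 2, and a platform that uses LD_LIBRARY_PATH (the library path variable name differs by UNIX flavor, e.g., LIBPATH on AIX or SHLIB_PATH on HP-UX):

# PowerExchange environment (illustrative values)
DETAIL_HOME=/opt/inform/dtlv5xxp0x
PATH=$PATH:$DETAIL_HOME
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DETAIL_HOME
export DETAIL_HOME PATH LD_LIBRARY_PATH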

Step 6: Update the license key file on the server.

Step 7: Update the configuration file on the server (dbmover.cfg) by adding a Node entry to point to the Listener on the mainframe.

Step 8: If using an ETL tool in conjunction with PowerExchange, via ODBC, update the odbc.ini file on the server by adding data source entries that point to PowerExchange-accessed data:

[striva_mvs_db2]
DRIVER=<DETAIL install dir>/libdtlodbc.so
DESCRIPTION=MVS DB2
DBTYPE=db2
LOCATION=mvs1
DBQUAL1=DB2T

Test Connectivity to the Mainframe from the Server

Ensure communication to the PowerExchange Listener on the mainframe by entering the following on the UNIX server:

DTLREXE PROG=PING LOC=mainframe location

It should return: DTL-00755 DTLREXE Command OK!


Running Sessions in Recovery Mode

Challenge

Using the Load Manager architecture for manual error recovery by suspending and resuming workflows and worklets when an error is encountered.

Description

When a task in the workflow fails at any point, one option is to truncate the target and run the workflow again from the beginning. The Load Manager architecture offers an alternative to this scenario: the workflow can be suspended so that the user can fix the error, rather than re-processing the portion of the workflow that ran without errors. This option, "Suspend on Error", results in accurate and complete target data, as if the session completed successfully in one run.

Configure Mapping for Recovery

For consistent recovery, the mapping needs to produce the same result, and in the same order, in the recovery execution as in the failed execution. This can be achieved by sorting the input data, using either the sorted ports option in the Source Qualifier (or Application Source Qualifier) or a Sorter transformation with the distinct rows option immediately after the source qualifier transformation. Additionally, ensure that all the targets receive data from transformations that produce repeatable data.

Configure Session for Recovery

Enable the session for recovery by setting the enable recovery option in the Config Object tab of Session Properties.


For consistent data recovery, the session properties for the recovery session must be the same as the session properties of the failed session.

Configure Workflow for Recovery

The Suspend on Error option directs the PowerCenter Server to suspend the workflow while the user fixes the error, and then to resume the workflow.


The server suspends the workflow when any of the following tasks fail:

• Session
• Command
• Worklet
• Email
• Timer

If any of the above tasks fail during the execution of a workflow, execution suspends at the point of failure. The PowerCenter Server does not evaluate the outgoing links from the task. If no other task is running in the workflow, the Workflow Monitor displays a status of Suspended for the workflow. However, if other tasks are being executed in the workflow when a task fails, the workflow is considered partially suspended or partially running and the Workflow Monitor displays the status as Suspending.

When a user discovers that a workflow is either partially or completely suspended, he or she can fix the cause of the error(s). The workflow then resumes execution from the point of suspension, with the PowerCenter Server running the resumed tasks as if they had never run.

The following table lists the possible combinations for suspend and resume.

SUSPEND/RESUME Scenarios:

Initial Command / Resume Workflow / Resume Worklet:
• Start workflow / Runs the whole workflow / Runs the whole workflow
• Start workflow from / Runs the whole workflow from the specified task / Runs the whole workflow from the specified worklet
• Start task / Runs only the suspended task (workflow task) / Runs only the suspended task (worklet task)

Truncate Target Table

If the truncate table option is enabled in a recovery-enabled session, the target table is not truncated during the recovery process.

Session Logs

In a suspended workflow scenario, the PowerCenter Server uses the existing session log when it resumes the workflow from the point of suspension. However, the earlier runs that caused the suspension are recorded in the historical run information in the repository.

Suspension Email

The workflow can be configured to send an email when the PowerCenter Server suspends the workflow. When a task fails, the server suspends the workflow and sends the suspension email. The user can then fix the error and resume the workflow. If another task fails while the PowerCenter Server is suspending the workflow, the server does not send another suspension email. The server only sends out another suspension email if another task fails after the workflow resumes. Check the "Browse Emails" button on the General tab of the Workflow Designer Edit sheet to configure the suspension email.


Suspending Worklets

When the "Suspend On Error" option is enabled for the parent workflow, the PowerCenter Server also suspends the worklet if a task within the worklet fails. When a task in the worklet fails, the server stops executing the failed task and other tasks in its path. If no other task is running in the worklet, the status of the worklet is "Suspended". If other tasks are still running in the worklet, the status of the worklet is "Suspending". The parent workflow is also suspended when the worklet is "Suspended" or "Suspending".

Assume that the suspension always occurs in the worklet and you issue a Resume command after the error is fixed. The following scenarios describe the various suspend and resume combinations with reference to the diagram above. Note that the worklet contains a start task and Session3:

• Start workflow: Session1 and Session2 run; the worklet runs and suspends; Session4 does not run.
  Resume Workflow: runs the worklet and Session4. Resume Worklet: runs the worklet and Session4.
• Start workflow from Worklet: the worklet runs and suspends; Session1, Session2, and Session4 do not run.
  Resume Workflow: runs the worklet and Session4. Resume Worklet: runs the worklet and Session4.
• Start task: the worklet runs and suspends; no other tasks run.
  Resume Workflow: runs only the suspended task (the worklet). Resume Worklet: runs only the suspended task (the worklet).

Starting Recovery

The recovery process can be started using the Workflow Manager or Workflow Monitor client tools. Alternatively, the recovery process can be started using pmcmd in command-line mode or from a script.
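
As an illustration only, a script can wrap pmcmd to restart a recovery-enabled workflow and check the result. The server address, user, password, folder, and workflow names below are placeholders, and the exact pmcmd command and flag syntax should be verified against the pmcmd reference for your PowerCenter release:

#!/bin/ksh
# Restart a recovery-enabled workflow via pmcmd (placeholder values throughout)
pmcmd startworkflow -s pcserver:4001 -u pc_operator -p secret -f PRODUCTION -wait wf_daily_load
if [ $? -ne 0 ]; then
  echo "wf_daily_load did not complete successfully; review the session log before re-running" >&2
  exit 1
fi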

Recovery Tables and Recovery Process

When sessions are enabled for recovery, the PowerCenter Server creates two tables (PM_RECOVERY and PM_TGT_RUN_ID) at the target database. During regular session runs, the server updates these tables with target load status. The session will fail if the PowerCenter Server cannot create these tables due to insufficient privileges. Once they are created, these tables are re-used.
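
Because the server creates and writes these tables under the database user defined in the target connection, that user needs table-creation privileges in the target schema. A minimal sketch, assuming an Oracle target and a connection user named PC_TARGET (both assumptions):

-- Run as a DBA: allow the PowerCenter target connection user to
-- create and populate PM_RECOVERY and PM_TGT_RUN_ID
GRANT CREATE TABLE TO pc_target;
ALTER USER pc_target QUOTA UNLIMITED ON users;  -- or an explicit quota on its default tablespace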

When a session is run in recovery mode, the PowerCenter Server uses the information in these tables to determine the point of failure and continues to load target data from that point. If the recovery tables (PM_RECOVERY and PM_TGT_RUN_ID) are not present in the target database, the recovery session will fail.

Unrecoverable Sessions

The following session configurations are not supported by PowerCenter for session recovery:

• Sessions using partitioning other than pass-through partitioning.
• Sessions using database partitioning.
• Recovery using the debugger.
• Test load using a recovery-enabled session.

Inconsistent Data During Recovery Process

For recovery to be effective, the recovery session must produce the same set of rows, and in the same order. Any change after the initial failure (in the mapping, the session, or the server) that affects the ability to produce repeatable data will result in inconsistent data during the recovery process.

The following cases may produce inconsistent data during a recovery session:

• The session performs incremental aggregation and the server stops unexpectedly.
• The mapping uses a Sequence Generator transformation.
• The mapping uses a Normalizer transformation.
• The source and/or target changes after the initial session failure.
• The data movement mode changes after the initial session failure.
• A code page (server, source, or target) changes after the initial session failure.
• The mapping is changed in a way that causes the server to distribute, filter, or aggregate rows differently.
• The session configuration is not supported by PowerCenter for session recovery.
• The mapping uses a lookup table and the data in the lookup table changes between session runs.
• The session sort order changes when the server is running in Unicode mode.

Complex Mappings and Recovery

In the case of complex mappings that load more than one related target (i.e., a primary key/foreign key relationship), session failure and subsequent recovery may lead to data integrity issues. In such cases, the integrity of the target tables needs to be checked and fixed prior to starting the recovery process.


Configuring Security

Challenge

Configuring a PowerCenter security scheme to prevent unauthorized access to mappings, folders, sessions, workflows, repositories, and data in order to ensure system integrity and data confidentiality.

Description

Configuring security is one of the most important components of building a data warehouse. Determining an optimal security configuration for a PowerCenter environment requires a thorough understanding of business requirements, data content, and end-user access requirements. Knowledge of PowerCenter's security functionality and facilities is also a prerequisite to security design.

Implement security with the goals of easy maintenance and scalability. When establishing repository security, keep it simple. Although PowerCenter includes the utilities for a complex web of security, the simpler the configuration, the easier it is to maintain. Securing the PowerCenter environment involves the following basic principles:

• Create users and groups
• Define access requirements
• Grant privileges and permissions

Before implementing security measures, ask and answer the following questions:

• Who will administer the repository?
• How many projects need to be administered? Will the administrator be able to manage security for all PowerCenter projects or just a select few?
• How many environments will be supported in the repository?
• Who needs access to the repository? What do they need the ability to do?
• How will the metadata be organized in the repository? How many folders will be required?
• Where can we limit repository privileges by granting folder permissions instead?
• Who will need Administrator or Super User-type access?

After you evaluate the needs of the repository users, you can create appropriate user groups, assign repository privileges and folder permissions. In most implementations, the administrator takes care of maintaining the repository. Limit the number of administrator accounts for PowerCenter. While this concept is important in a development/unit test environment, it is critical for protecting the production environment.

Repository Security Overview

A security system needs to properly control access to all sources, targets, mappings, reusable transformations, tasks and workflows in both the test and production repositories. A successful security model needs to support all groups in the project lifecycle and also consider the repository structure.

Informatica offers multiple layers of security, which enables you to customize the security within your data warehouse environment. Metadata-level security controls access to PowerCenter repositories, which contain objects grouped by folders. Access to metadata is determined by the privileges granted to the user or to a group of users and the access permissions granted on each folder. Some privileges do not apply at the folder level; they are granted by privilege alone (i.e., for repository-level tasks).

Just beyond PowerCenter authentication is the connection to the repository database. All client connectivity to the repository is handled by the PowerCenter Repository Server and Repository Agent over a TCP/IP connection. The particular database account and password is specified at installation and during the configuration of the Repository Server.

Other forms of security available in PowerCenter include permissions for connections. Connections include database, FTP, and external loader connections. These permissions are useful when you want to limit access to schemas in a relational database and can be set-up in the Workflow Manager when source and target connections are defined.

Occasionally, you may want to restrict changes to source and target definitions in the repository. A common way to approach this security issue is to use shared folders, which are owned by an Administrator or Super User. Granting read access to developers on these folders allows them to create read-only copies in their work folders.

Informatica Security Architecture

The following diagram, Informatica PowerCenter Security, depicts PowerCenter security, including access to the repository, Repository Server, PowerCenter Server and the command-line utilities pmrep and pmcmd.

As shown in the diagram below, the repository server is the central component when using default security. It sits between the PowerCenter repository and all client applications, including the GUI tools, the command line tools, and the PowerCenter Server. Each application must be authenticated against metadata stored in several tables within the repository. The repository server requires one database account, under which all security data is stored as part of the repository metadata. This provides a second layer of security, since only the repository server uses this account; it authenticates all client applications against this metadata.


Repository server security

Connection to the PowerCenter repository database is one level of security. All client connectivity to the repository is handled by the Repository Server and Repository Agent over a TCP/IP connection. The Repository Server process is installed in a Windows or UNIX environment, typically on the same physical server as the PowerCenter Server. It can be installed under the same or different operating system account as the PowerCenter Server.

When the Repository Server is installed, the database connection information is entered for the metadata repository. At this time you need to know the database user id and password to access the metadata repository. The database user id must be able to read and write to all tables in the database. As a developer creates, modifies, and executes mappings and sessions, this information continuously updates the metadata in the repository. Actual database security should be controlled by the DBA responsible for that database, in conjunction with the PowerCenter Repository Administrator. After the Repository Server is installed and started, all subsequent client connectivity is automatic. The users are simply prompted for the name of the Repository Server and host name. The database id and password are transparent at this point.

PowerCenter server security

Like the Repository Server, the PowerCenter Server communicates with the metadata repository when it executes workflows or when users are using Workflow Monitor. During configuration of the PowerCenter Server, the repository database is identified with the appropriate user id and password to use. This information is specified in the PowerCenter configuration file (pmserver.cfg). Connectivity to the repository is made using native drivers supplied by Informatica.

Certain permissions are also required to use the command line utilities pmrep and pmcmd.

Connection Object Permissions

Within Workflow Manager, you can grant read, write, and execute permissions to groups and/or users for all types of connection objects. This controls who can create, view, change, and execute workflow tasks that use those specific connections, providing another level of security for these global repository objects.

Users with the ‘Use Workflow Manager’ privilege can create and modify connection objects. Connection objects allow the PowerCenter Server to read from and write to source and target databases. Any database the server accesses requires a connection definition. As shown below, connection information is stored in the repository. Users executing workflows require execute permission on all connections used by the workflow. The PowerCenter Server looks up the connection information in the repository and verifies permission for the required action. If permissions are properly granted, the server reads from and writes to the databases as defined by the workflow.


Users

Users are the fundamental objects of security in a PowerCenter environment. Each individual logging into the PowerCenter repository should have a unique user account. Informatica does not recommend creating shared accounts; a unique account should be created for each user. Each repository user needs a user name and password to access the repository, which should be provided by the PowerCenter Repository Administrator.

Users are created and managed through Repository Manager. Users should change their passwords from the default immediately after receiving the initial user id from the Administrator. Passwords can be reset by the user if they are granted the privilege ‘Use Repository Manager’.

When you create the repository, the repository automatically creates two default users:

• Administrator. The default password for Administrator is Administrator.
• Database user. The username and password used when you created the repository.

These default users are in the Administrators user group, with full privileges within the repository. They cannot be deleted from the repository, nor have their group affiliation changed.

To administer repository users, you must have one of the following privileges:

• Administer Repository
• Super User

LDAP (Lightweight Directory Access Protocol)

In addition to default repository user authentication, LDAP can be used to authenticate users. Using LDAP authentication, the repository maintains an association between the repository user and the external login name. When a user logs into the repository, the security module authenticates the user name and password against the external directory. The repository maintains a status for each user. Users can be enabled or disabled by modifying this status.

Prior to implementing LDAP, the administrator must know:

• Repository server username and password
• An administrator or superuser user name and password for the repository
• An external login name and password

Configuring LDAP

• Edit ldap_authen.xml, modifying the following attributes:
  o NAME – the .dll that implements the authentication
  o OSTYPE – the host operating system
• Register ldap_authen.xml in the Repository Server Administration Console.
• In the Repository Server Administration Console, configure the authentication module.


User Groups

When you create a repository, the Repository Manager creates two repository user groups. These two groups exist so you can immediately create users and begin developing repository objects.

The default repository user groups are:

• Administrators
• Public

The Administrators group has super user access. The Public group has a subset of default repository privileges. These groups cannot be deleted from the repository nor have their configured privileges changed.

You should create custom user groups to manage users and repository privileges effectively. The number and types of groups that you create should reflect the needs of your development teams, administrators, and operations group. Informatica recommends minimizing the number of custom user groups that you create in order to facilitate the maintenance process.

A starting point is to create a group for each type of combination of privileges needed to support the development cycle and production process. This is the recommended method for assigning privileges. After creating a user group, you assign a set of privileges for that group. Each repository user must be assigned to at least one user group. When you assign a user to a group, the user:

• Receives all group privileges.
• Inherits any changes to group privileges.
• Loses and gains privileges if you change the user group membership.

You can also assign users to multiple groups, which grants the user the privileges of each group. Use the Repository Manager to create and edit repository user groups.

Folder Permissions

When you create or edit a folder, you define permissions for the folder. The permissions can be set at three different levels:

1. Owner
2. Owner's group
3. Repository (the remainder of users within the repository)

o First, choose an owner (i.e., user) and group for the folder. If the owner belongs to more than one group, you must select one of the groups listed.

o Once the folder is defined and the owner is selected, determine what level of permissions you would like to grant to the users within the group.

o Then determine the permission level for the remainder of the repository users.


The permissions that can be set include: read, write, and execute. Any combination of these can be granted to the owner, group or repository.

Be sure to consider folder permissions very carefully. They offer the easiest way to restrict user and/or group access to folders. The following table gives some examples of folders, their type, and recommended ownership.

Folder Name / Folder Type / Proposed Owner:
• DEVELOPER_1 / Initial development, temporary work area, unit test / Individual developer
• DEVELOPMENT / Integrated development / Development lead, Administrator or Super User
• UAT / Integrated User Acceptance Test / UAT lead, Administrator or Super User
• PRODUCTION / Production / Administrator or Super User
• PRODUCTION SUPPORT / Production fixes and upgrades / Production support lead, Administrator or Super User

Repository Privileges

Repository privileges work in conjunction with folder permissions to give a user or group authority to perform tasks. Repository privileges are the most granular way of controlling a user’s activity. Consider the privileges that each user group requires, as well as folder permissions, when determining the breakdown of users into groups. Informatica recommends creating one group for each distinct combination of folder permissions and privileges.

When you assign a user to a user group, the user receives all privileges granted to the group. You can also assign privileges to users individually. When you grant a privilege to an individual user, the user retains that privilege even if his or her user group affiliation changes. For example, you have a user in a Developer group who has limited group privileges, and you want this user to act as a backup administrator when you are not available. For the user to perform every task in every folder in the repository, and to administer the PowerCenter Server, the user must have the Super User privilege. For tighter security, grant the Super User privilege to the individual user, not the entire Developer group. This limits the number of users with the Super User privilege, and ensures that the user retains the privilege even if you remove the user from the Developer group.

The Repository Manager grants a default set of privileges to each new user and group for working within the repository. You can add or remove privileges from any user or group except:

• Administrators and Public (the default read-only repository groups)
• Administrator and the database user who created the repository (the users automatically created in the Administrators group)

The Repository Manager automatically grants each new user and new group the default privileges. These privileges allow you to perform basic tasks in Designer, Repository Manager, Workflow Manager, and Workflow Monitor. The following list shows the default repository privileges:

Default Repository Privileges

Use Designer
• No folder or connection object permission required:
  o Connect to the repository using the Designer.
  o Configure connection information.
• Read permission on the folder:
  o View objects in the folder.
  o Change folder versions.
  o Create shortcuts to objects in the folder.
  o Copy objects from the folder.
  o Export objects.
• Read/Write permission on the folder:
  o Create or edit metadata.
  o Create shortcuts from shared folders.
  o Copy objects into the folder.
  o Import objects.

Browse Repository
• No folder or connection object permission required:
  o Connect to the repository using the Repository Manager.
  o Add and remove reports.
  o Import, export, or remove the registry.
  o Search by keywords.
  o Change your user password.
• Read permission on the folder:
  o View dependencies.
  o Unlock objects, versions, and folders locked by your username.
  o Edit folder properties for folders you own.
  o Copy a version. (You must also have the Administer Repository or Super User privilege in the target repository and write permission on the target folder.)
  o Copy a folder. (You must also have the Administer Repository or Super User privilege in the target repository.)

Use Workflow Manager
• No folder or connection object permission required:
  o Connect to the repository using the Workflow Manager.
  o Create database, FTP, and external loader connections in the Workflow Manager.
  o Run the Workflow Monitor.
• Read/Write permission on the connection object:
  o Edit database, FTP, and external loader connections in the Workflow Manager.
• Read permission on the folder:
  o Export sessions.
  o View workflows, sessions, and tasks.
  o View session details and session performance details.
• Read/Write permission on the folder:
  o Create and edit workflows and tasks.
  o Import sessions.
  o Validate workflows and tasks.
• Read/Write permission on the folder and Read permission on the connection object:
  o Create and edit sessions.
• Read/Execute permission on the folder:
  o View the session log.
• Read/Execute permission on the folder and Execute permission on the connection object:
  o Schedule or unschedule workflows.
  o Start workflows immediately.
• Execute permission on the folder:
  o Restart, stop, abort, or resume a workflow.

Use Repository Manager
• No folder or connection object permission required:
  o Remove label references.
• Write permission on the deployment group:
  o Delete from the deployment group.
• Write permission on the folder:
  o Change object version comments if not the owner.
  o Change the status of an object.
  o Check in.
  o Check out/undo check-out.
  o Delete objects from the folder.
  o Mass validation (needs write permission if options are selected).
  o Recover after delete.
• Read permission on the folder:
  o Export objects.
• Read/Write permission on the folder and the deployment group:
  o Add to the deployment group.
• Read/Write permission on the original folders and the target folder:
  o Copy objects.
  o Import objects.
• Read/Write/Execute permission on the folder and the label:
  o Apply the label.

Extended Privileges

In addition to the default privileges listed above, Repository Manager provides extended privileges that you can assign to users and groups. These privileges are granted to the Administrator group by default. The following table lists the extended repository privileges:

Extended Repository Privileges

Workflow Operator
• No folder or connection object permission required:
  o Connect to the Informatica Server.
• Execute permission on the folder:
  o Restart, stop, abort, or resume a workflow.
• Execute permission on the folder and on the connection object:
  o Use pmcmd to start workflows in folders for which you have execute permission.
• Read permission on the folder and Execute permission on the connection object:
  o Start workflows immediately.
• Read permission on the folder:
  o Schedule and unschedule workflows.
  o View the session log.
  o View session details and performance details.

Administer Repository
• No folder or connection object permission required:
  o Connect to the repository using the Repository Manager.
  o Connect to the Repository Server.
  o Create, upgrade, back up, delete, and restore the repository.
  o Start, stop, enable, disable, and check the status of the repository.
  o Manage passwords, users, groups, and privileges.
  o Manage connection object permissions.
• Read permission on the folder and on the connection object:
  o Copy a folder within the same repository.
  o Copy a folder into a different repository when you have the Administer Repository privilege on the destination repository.
• Read permission on the folder:
  o Edit folder properties.
• Read/Write permission on the folder:
  o Copy a folder into the repository.

Administer Server
• No folder or connection object permission required:
  o Register Informatica Servers with the repository.
  o Edit server variable directories.
  o Start the Informatica Server. (The user entered in the Informatica Server setup must have this repository privilege.)
  o Stop the Informatica Server through the Workflow Manager.
  o Stop the Informatica Server using the pmcmd program.

Super User
• No folder or connection object permission required:
  o Perform all tasks, across all folders in the repository.
  o Manage connection object permissions.

Extended privileges allow you to perform more tasks and expand the access you have to repository objects. Informatica recommends that you reserve extended privileges for individual users and grant default privileges to groups.

Audit trails

Audit trails can be accessed through the Repository Server Administration Console. The repository agent logs security changes in the repository server installation directory.


The audit trail can be turned on or off for each individual repository through the Configuration tab of its Properties window, by selecting the ‘SecurityAuditTrail’ checkbox as shown in the following illustration.

The audit log contains the following information:

• Changing the owner, owner's group, or permissions for a folder.
• Changing the password of another user.
• Adding or removing a user.
• Adding or removing a group.
• Adding or removing users from a group.
• Changing global object permissions.
• Adding or removing user and group privileges.

Sample Security Implementation

The following steps provide an example of how to establish users, groups, permissions and privileges in your environment. Again, the requirements of your projects and production systems need to dictate how security is established.


1. Identify users and the environments they will support (development, UAT, QA, production, production support, etc).

2. Identify the PowerCenter repositories in your environment (this may be similar to the basic groups listed in Step 1, e.g., development, UAT, QA, production, etc).

3. Identify which users need to exist in each repository.
4. Define the groups that will exist in each PowerCenter repository.
5. Assign users to groups.
6. Define privileges for each group.

The following table provides an example of groups and privileges that may exist in the PowerCenter repository. This example assumes one PowerCenter project with three environments co-existing in one PowerCenter repository.

• ADMINISTRATORS: folder All; folder permissions All; privileges Super User (all privileges).
• DEVELOPERS: individual development folder and integrated development folder; folder permissions Read, Write, Execute; privileges Use Designer, Browse Repository, Use Workflow Manager.
• DEVELOPERS: UAT folder; folder permissions Read; privileges Use Designer, Browse Repository, Use Workflow Manager.
• UAT: UAT working folder; folder permissions Read, Write, Execute; privileges Use Designer, Browse Repository, Use Workflow Manager.
• UAT: Production folder; folder permissions Read; privileges Use Designer, Browse Repository, Use Workflow Manager.
• OPERATIONS: Production folder; folder permissions Read, Execute; privileges Browse Repository, Workflow Operator.
• PRODUCTION SUPPORT: production maintenance folders; folder permissions Read, Write, Execute; privileges Use Designer, Browse Repository, Use Workflow Manager.
• PRODUCTION SUPPORT: Production folder; folder permissions Read; privileges Browse Repository.

Informatica PowerCenter Security Administration

As mentioned earlier, one individual should be identified as the Informatica Administrator. This individual should be responsible for a number of tasks in the Informatica environment, including security. To summarize, the security-related tasks an administrator should be responsible for are:

• Creating user accounts.
• Defining and creating groups.
• Defining and granting folder permissions.
• Defining and granting repository privileges.
• Enforcing changes in passwords.
• Controlling requests for changes in privileges.
• Creating and maintaining database, FTP, and external loader connections in conjunction with the database administrator.
• Working with the operations group to ensure tight security in the production environment.

Remember, you must have one of the following privileges to administer repository users:

• Administer Repository
• Super User
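The group and privilege matrix above can also be recorded outside the repository so that requested changes can be reviewed before they are applied. The sketch below is purely illustrative and uses none of Informatica's own APIs; the group names, folder names, and checks are hypothetical examples taken from the matrix.

    # Illustrative only: a plain-Python record of the example security matrix,
    # useful for reviewing or diffing privilege requests. Not an Informatica API.
    SECURITY_MATRIX = {
        "ADMINISTRATORS": {
            "folders": {"All": "All"},
            "privileges": {"Super User"},
        },
        "OPERATIONS": {
            "folders": {"Production": "Read, Execute"},
            "privileges": {"Browse Repository", "Workflow Operator"},
        },
    }

    def groups_with_privilege(matrix, privilege):
        """Return the groups holding a given privilege, e.g. to confirm Super User stays limited."""
        return [group for group, entry in matrix.items() if privilege in entry["privileges"]]

    print(groups_with_privilege(SECURITY_MATRIX, "Super User"))  # ['ADMINISTRATORS']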

Summary of Recommendations

When implementing your security model, keep the following recommendations in mind:

• Create groups with limited privileges.
• Do not use shared accounts.
• Limit user and group access to multiple repositories.
• Customize user privileges.
• Limit the Super User privilege.
• Limit the Administer Repository privilege.
• Restrict the Workflow Operator privilege.
• Follow a naming convention for user accounts and group names.
• For more secure environments, turn Audit Trail logging on.


Custom XConnect Implementation

Challenge

Each XConnect extracts metadata from a particular repository type and loads it into the SuperGlue warehouse. The SuperGlue Configuration Console is used to run each XConnect. A custom XConnect is the process of loading metadata for tools or processes for which Informatica does not provide an out-of-the-box metadata solution.

Description

To integrate custom metadata, complete the steps for the following tasks:

• Design the metamodel
• Implement the metamodel design
• Set up and run the custom XConnect
• Configure the reports and schema

Prerequisites for Integrating Custom Metadata

To integrate custom metadata, install SuperGlue and the other required applications. The custom metadata integration process assumes knowledge of the following topics:

• Common Warehouse Metamodel (CWM) and Informatica-defined metamodels. The CWM metamodel includes industry-standard packages, classes, and class associations. The Informatica-defined metamodel supplements the CWM metamodel by providing repository-specific packages, classes, and class associations. For more information about CWM, see http://www.omg.org/cwm/. For more information about the Informatica-defined metamodel components, run and review the metamodel reports.

• PowerCenter functionality. The metadata integration process requires configuring and running PowerCenter workflows that extract custom metadata from source repositories and load it into the SuperGlue warehouse. PowerCenter can be used to build a custom XConnect.

• PowerAnalyzer functionality. SuperGlue embeds PowerAnalyzer functionality to create, run, and maintain a metadata reporting environment. Knowledge of creating, modifying, and deleting reports, dashboards, and analytic workflows in PowerAnalyzer is required, as is knowledge of creating, modifying, and deleting table definitions, metrics, and attributes in order to update the schema with new or changed objects.

Design the Metamodel

The objective of this phase is to design the metamodel. A UML modeling tool can be used to help define the classes, class properties, and associations.

This task consists of the following steps (an illustrative sketch of the resulting design record appears after the list):

1. Identify Custom Classes. Identify all custom classes. To identify classes, determine the various types of metadata in the source repository that need to be loaded into the SuperGlue warehouse. Each type of metadata corresponds to one class.

2. Identify Custom Class Properties. For each class identified in step 1, identify all class properties that need to be tracked in the SuperGlue warehouse.

3. Map Custom Classes to CWM Classes. SuperGlue prepackages all CWM classes, class properties, and class associations. To quickly develop a custom metamodel and reduce redundancy, reuse the predefined class properties and associations instead of recreating them. To determine which custom classes can inherit properties from CWM classes, map custom classes to the packaged CWM classes. For all properties that cannot be inherited, define them in SuperGlue.

4. Determine the Metadata Tree Structure. Configure the way the metadata tree displays objects. Configure the metadata tree structure for a class when defining the class in the next task "Implement the Metamodel Design". Configure classes of objects to display in the metadata tree along with folders and the objects they contain.

5. Identify Custom Class Associations. The metadata browser uses class associations to display metadata. For each identified class association, determine if a predefined association from a CWM base class can be reused or if an association needs to be defined manually in SuperGlue.

6. Identify Custom Packages. A package contains related classes and class associations. Import and export packages of classes and class associations from SuperGlue. Assign packages to repository types to define the structure of the contained metadata. In this step, identify packages to group the custom classes and associations you identified in previous steps.
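Before implementing the design in SuperGlue, it can help to write the outcome of these six steps down in a machine-readable form. The sketch below is one hypothetical way of recording the design; the class, property, package, and association names are invented for illustration and do not correspond to SuperGlue's internal model or API.

    # Hypothetical record of a custom metamodel design; all names are illustrative only.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class CustomClass:
        name: str                      # one source metadata type, e.g. a "ScheduledJob"
        cwm_base: str                  # CWM class whose properties are inherited (step 3)
        extra_properties: List[str]    # properties that cannot be inherited (step 2)
        show_in_tree: bool = True      # metadata tree behaviour decided in step 4

    @dataclass
    class CustomAssociation:
        from_class: str
        to_class: str
        reused_from_cwm: bool          # step 5: reuse a CWM association or define it manually

    @dataclass
    class CustomPackage:
        name: str
        classes: List[CustomClass] = field(default_factory=list)
        associations: List[CustomAssociation] = field(default_factory=list)

    scheduler_pkg = CustomPackage(
        name="SchedulerMetadata",
        classes=[CustomClass("ScheduledJob", "Transformation", ["cron_expression", "owner"])],
        associations=[CustomAssociation("ScheduledJob", "JobStream", reused_from_cwm=False)],
    )
    print(scheduler_pkg.name, len(scheduler_pkg.classes))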

Implement the Metamodel Design

Using the metamodel design specifications from the previous task, implement the metamodel in SuperGlue. To complete the steps in this task, you will need one of the following roles:

• Advanced Provider
• Schema Designer
• System Administrator

This task includes the following steps.


1. Create Custom Metamodel Originator in SuperGlue. The SuperGlue warehouse may contain many metamodels that store metadata from a variety of source systems. When creating a new metamodel, enter the originator of each metamodel. An originator is the organization that creates and owns the metamodel. When defining a new custom originator in SuperGlue, select ‘Customer’ as the originator type.

2. Create Custom Packages in SuperGlue. Define the packages to which custom classes and associations are assigned. Packages contain classes and their class associations. Packages have a hierarchical structure, where one package can be the parent of another package. Parent packages are generally used to group child packages together.

3. Create Custom Classes in SuperGlue. In this step, create custom classes identified in the metamodel design task.

4. Create Custom Class Associations in SuperGlue. In this step, implement the custom class associations identified in the metamodel design phase. In the previous step, CWM classes are added as base classes. Any of the class associations from the CWM base classes can be reused. Define those custom class associations that cannot be reused.

5. Create Custom Repository Type in SuperGlue. Each type of repository contains unique metadata. For example, a PowerCenter data integration repository type contains workflows and mappings, but a PowerAnalyzer business intelligence repository type does not.

6. Associate packages to Custom Repository Type. To maintain the uniqueness of each repository type, define repository types in SuperGlue, and for each repository type, assign packages of classes and class associations to it.

Set Up and Run the Custom XConnect

The objective of this task is to set up and run the custom XConnect. First, transform the source metadata into the format specified in the IME interface files. The custom XConnect then extracts the metadata from the IME interface files and loads it into the SuperGlue warehouse.

This task includes the following steps:

1. Determine which SuperGlue warehouse tables to load. Based on the type of metadata that needs to be viewed in the metadata directory and reports, determine which SuperGlue warehouse tables are required for the metadata load. To stop the metadata load into particular SuperGlue warehouse tables, disable the worklets that load those tables.

2. Reformat the source metadata. In this step, reformat the source metadata so that it conforms to the format specified in each required IME interface file, and present the reformatted metadata in a valid source type format. To extract the reformatted metadata, the integration workflows require it to be in one or more of the following source type formats: database table, database view, or flat file. Metadata can be loaded into a SuperGlue warehouse table using more than one of the accepted source type formats; for example, you can load metadata into the IMW_ELEMENT table from both a database view and a flat file. (A small reformatting sketch follows this list.)

3. Register the Source Repository Instance in SuperGlue. Before extracting metadata, you must first register the source repository in SuperGlue. Register the repository under the custom repository type created in the previous task. All packages, classes, and class associations defined for the custom repository type apply to all repository instances registered to the repository type. When defining the repository, provide descriptive information about the repository instance. When registering the repository, define a repository ID that uniquely identifies the repository; if the source repository stores a repository ID, use that value. Once the repository is registered in SuperGlue, SuperGlue adds an XConnect in the Configuration Console for the repository. To register a repository in SuperGlue, you need one of the following roles: Advanced Provider, Schema Designer, or System Administrator.

4. Configure the Custom Parameter File. SuperGlue prepackages a parameter file for each XConnect. Update the parameter file with the following information: the source type (database table, database view, or flat file), the names of the database views or tables used to load the SuperGlue warehouse, the list of flat files used to load a particular SuperGlue warehouse table, the frequency at which the SuperGlue warehouse is updated, the worklets to enable or disable, and the method used to determine field datatypes. (A sample parameter file fragment follows this list.)

5. Configure the Custom XConnect. Once the custom repository type is defined in SuperGlue, the SuperGlue Server registers the corresponding XConnect in the Configuration Console. Specify the following information in the Configuration Console to configure the XConnect: repository type to which the custom repository belongs, workflows required to load the metadata, name of the XConnect, and parameter file used by the workflows to load the metadata.

6. Run the Custom XConnect. Using the Configuration Console, run the XConnect and ensure that the metadata loads correctly.

7. Reset the $$SRC_INCR_DATE Parameter. After completing the first metadata load, reset the $$SRC_INCR_DATE parameter to extract metadata in shorter intervals, such as every 5 days. The value depends on how often the SuperGlue warehouse needs to be updated. If the source does not provide the date when the records were last updated, records are extracted regardless of the $$SRC_INCR_DATE parameter setting.
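The reformatting mentioned in step 2 is often a simple export of source metadata into a delimited flat file that the IME interface file describes. The sketch below is a hedged illustration only: the field names, their order, and the output file name are placeholders, and the actual IME interface file dictates the real record layout.

    # Hedged sketch of step 2: reshaping source metadata into a flat file.
    # Column names and order are placeholders; follow the actual IME interface file.
    import csv

    source_rows = [
        {"job_name": "nightly_extract", "description": "Pulls orders", "last_updated": "2004-11-02"},
    ]

    with open("ime_element_load.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for row in source_rows:
            # One output record per metadata element; a real load would also carry
            # the repository ID and class identifiers required by the interface file.
            writer.writerow([row["job_name"], row["description"], row["last_updated"]])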
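The $$SRC_INCR_DATE parameter referred to in step 7 lives in the XConnect's parameter file (step 4). A fragment might look like the following sketch. The section header follows the usual PowerCenter parameter file convention of [folder_name.WF:workflow_name], but the folder name, workflow name, second parameter name, and date format are placeholders; start from the parameter file packaged with the XConnect rather than this example.

    [MyCustomXConnect_Folder.WF:wf_custom_xconnect]
    $$SRC_INCR_DATE=01/01/2005
    $$SOURCE_TYPE=Flat File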

Configure the Reports and Schema

The objective of this task is to set up the reporting environment used to run reports on the metadata stored in the SuperGlue warehouse. How you set up the reporting environment depends on the reporting requirements. The following options are available for creating reports:

• Use the existing schema and reports. SuperGlue contains packaged reports that can be used to analyze business intelligence metadata, data integration metadata, data modeling tool metadata, and database catalog metadata. SuperGlue also provides impact analysis and lineage reports that provide information on any type of metadata.

• Create new reports using the existing schema. Build new reports using the existing SuperGlue metrics and attributes.

• Create new SuperGlue warehouse tables and views to support the schema and reports. If the packaged SuperGlue schema does not meet the reporting requirements, create new SuperGlue warehouse tables and views. Prefix the names of custom-built tables with Z_IMW_ and custom-built views with Z_IMA_. If you build new SuperGlue warehouse tables or views, register the tables in the SuperGlue schema and create new metrics and attributes in the SuperGlue schema. Note that the SuperGlue schema is built on the SuperGlue views.

After the environment setup is complete, test all schema objects, such as dashboards, analytic workflows, reports, metrics, attributes, and alerts.


Customizing the SuperGlue Interface

Challenge

Customizing the SuperGlue presentation layer to meet specific business needs.

Description

Configuring Metamodels

It may be necessary to configure metamodels for a repository type in order to integrate additional metadata into a SuperGlue Warehouse and/or to adapt to changes in metadata reporting and browsing requirements. For more information about creating a metamodel for a new repository type, see the SuperGlue Custom Metadata Integration Guide.

Use SuperGlue to define a metamodel, which consists of the following objects:

• Originator - the party that creates and owns the metamodel.
• Packages - contain related classes that model metadata for a particular application domain or a specific application. Multiple packages can be defined under the newly defined originator. Each package stores classes and associations that represent the metamodel.
• Classes and Class Properties - define a type of object, along with its properties, contained in a repository. Multiple classes can be defined under a single package. Each class has multiple properties associated with it. These properties can be inherited from one or many base classes already available; additional properties can be defined directly under the new class.
• Associations - define the relationships between classes and their objects. Associations help define relationships across individual classes. The cardinality defines 1-1, 1-n, or n-n relationships. These relationships mirror real-life associations among the logical, physical, or design-level building blocks of systems and processes.

For more information about metamodels, originators, packages, classes, and associations, see “SuperGlue Concepts” in the SuperGlue Installation and Administration Guide.


After the metamodel is defined, it needs to be associated with a repository type. When registering a repository under a repository type, all classes and associations assigned to the repository type through packages apply to the repository.

The Metamodel Management task area on the Administration tab in SuperGlue provides the following options for configuring metamodels:

Repository types

You can configure types of repositories for the metadata you want to store and manage in the SuperGlue Warehouse. You must configure a repository type when you develop an XConnect. You can modify some attributes for existing XConnects and XConnect repository types. For more information, see “Configuring Repository Types” in the SuperGlue Installation and Administration Guide.

Displaying Objects of an Association in the Metadata Tree

SuperGlue displays many objects in the metadata tree by default because of the predefined associations among metadata objects. Associations determine how objects display in the metadata tree.

If you want to display an object in the metadata tree that does not already display, add an association between the objects in the IMM.properties file.

For example, Object A displays in the metadata tree and Object B does not. To display Object B under Object A in the metadata tree, perform the following actions:

• Create an association from Object B to Object A. From Objects in an association display as parent objects; To Objects display as child objects. The To Object displays in the metadata tree only if the From Object in the association already displays in the metadata tree. For more information about adding associations, refer to “Adding an Association” in the SuperGlue Installation and Administration Guide.

• Add the association to the IMM.properties file. SuperGlue only displays objects in the metadata tree if the corresponding association between their classes is included in the IMM.properties file.

Note: Some associations are not explicitly defined among the classes of objects. Some objects reuse associations based on the ancestors of the classes. The metadata tree displays objects that have explicit or reused associations. For more information about ancestors and reusing associations, see “Reusing Class Associations of a Base Class or Ancestor” in the SuperGlue Installation and Administration Guide.

To add the association to the IMM.properties file

1. Open the IMM.properties file. The file is located in the following directory:

• For WebLogic: <WebLogic_Home>\wlserver6.1 • For WebSphere: <WebSphere_Home>\AppServer

2. Add the association ID under findtab.parentChildAssociations.


• To determine the ID of an association, click the association on the Associations page. To access the Associations page, click Administration > Metamodel Management > Associations.

3. Save and close the IMM.properties file.
4. Stop and then restart the SuperGlue Server to apply the changes. (A sample IMM.properties fragment follows these steps.)
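For illustration, the entry added in step 2 might look like the following fragment. The property key is the one named above; the association IDs are placeholders taken from the Associations page, and the exact delimiter and formatting should be copied from existing entries in your IMM.properties file rather than from this sketch.

    findtab.parentChildAssociations=2041,2057,3108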

Customizing SuperGlue Metadata Browser

The Metadata Browser, on the Metadata Directory page of the Find tab, is used for browsing source repository metadata stored in the SuperGlue Warehouse.


The Metadata Directory page consists of the following areas:

• Query task area - allows you to search for metadata objects stored in the SuperGlue Warehouse.

• Metadata Tree task area - allows you to navigate to a metadata object in a particular repository.

• Results task area - displays metadata objects based on an object search in the Query task area or based on the object selected in the Metadata Tree task area.

• Details task area - displays properties about the selected object. You can also view associations between the object and other objects, and run related reports from the Details task area.

For more information about the Metadata Directory page on the Find tab, refer to the “Accessing Source Repository Metadata” chapter in the SuperGlue User Guide.

You can perform the following customizations while browsing the source repository metadata:

Configure the display properties

SuperGlue displays a set of default properties for all items in the Results task area. The default properties are generic properties that apply to all metadata objects stored in the SuperGlue Warehouse.


By default, SuperGlue displays the following properties in the Results task area for each source repository object:

• Class - Displays an icon that represents the class of the selected object. The class name appears when you place the pointer over the icon.

• Label - Label of the object.
• Source Update Date - Date the object was last updated in the source repository.
• Repository Name - Name of the source repository from which the object originates.
• Description - Description of the object.

The default properties that appear in the Results task area can, however, be rearranged, added to, or removed for a SuperGlue user account. For example, you can remove the default Class and Source Update Date properties, move the Repository Name property to precede the Label property, and add a different property, such as the Warehouse Insertion Date, to the list.

Additionally, you can add other properties that are specific to the class of the selected object. With the exception of Label, all other default properties can be removed. You can select up to ten properties to display in the Results task area. SuperGlue displays them in the order specified while configuring.

If there are more than ten properties to display, SuperGlue displays the first ten, displaying common properties first in the order specified and then all remaining properties in alphabetical order based on the property display label.

Applying Favorite Properties to Multiple Classes of Objects

The modified property display settings can be applied to any class of objects displayed in the Results task area. When selecting an object in the metadata tree, multiple classes of objects may appear in the Results task area; the modified display settings can be applied to each class of objects shown there.


The same settings can be applied to the other classes of objects that currently display in the Results task area.

If the settings are not applied to the other classes, then the settings apply to the objects of the same class as the object selected in the metadata tree.

Configuring Object Links

Object links are created to link related objects without navigating the metadata tree or searching for the object. Refer to the SuperGlue User Guide to configure the object link.

Configuring Report Links

Report links can be created to run reports on a particular metadata object. When creating a report link, assign a SuperGlue report to a specific object. While creating a report link, you can also create a run report button to run the associated report. The run report button appears in the top right corner of the Details task area. When you create the run report button, you also have the option of applying it to all objects of the same class. You can create a maximum of three run report buttons per object.

Customizing SuperGlue Packaged Reports, Dashboards and Indicators

You can create new reporting elements and attributes under ‘Schema Design’. These new elements can be used in new reports or in extensions of existing reports. You can also extend or customize out-of-the-box reports, indicators, or dashboards. Informatica recommends using the ‘Save As’ option to save such changes as a new report in order to avoid conflicts during upgrades.

Further, you can create new reports using the 1-2-3-4 report creation wizard of Informatica PowerAnalyzer. Informatica recommends saving such reports in a new report folder to avoid conflict during upgrades.

Customizing SuperGlue ODS Reports

Use the operational data store (ODS) report templates to analyze metadata stored in a particular repository. Although these reports can be used as is, they can also be customized to suit particular business requirements. The out-of-the-box reports can also serve as a guideline for creating reports for other types of source repositories, such as a repository for which SuperGlue does not package an XConnect.


Estimating SuperGlue Volume Requirements

Challenge

Understanding the relationship between the various inputs to the SuperGlue solution in order to estimate volumes for the SuperGlue Warehouse.

Description

The size of the SuperGlue warehouse is directly proportional to the size of the metadata being loaded into it. The size also depends on the number of element attributes being captured in the source metadata and the associations defined in the metamodel.

When estimating volume requirements for a SuperGlue implementation, consider the following SuperGlue components:

• SuperGlue Server • SuperGlue Console • SuperGlue Integration Repository • SuperGlue Warehouse

NOTE: Refer to the SuperGlue Installation Guide for complete information on minimum system requirements for server, console and integration repository.

Considerations

Volume estimation for SuperGlue is an iterative process. Use the SuperGlue development environment to get an accurate size estimate for the SuperGlue production environment. The required steps are as follows:

1. Identify the source metadata that needs to be loaded in the SuperGlue Production warehouse.

2. Size the SuperGlue Development warehouse based on the initial sizing estimates (as explained in the next section of this document).

3. Run the XConnects and monitor the disk usage. If the XConnect run fails due to insufficient space, add the same amount of space as recommended by the initial sizing estimates.

4. Restart the XConnect.

Repeat steps 1 through 4 until the XConnect run is successful.


Following are the initial sizing estimates for a typical SuperGlue implementation:

SuperGlue Server

SuperGlue Console

SuperGlue Integration Repository


SuperGlue Warehouse

The following table is an initial estimation matrix that should help in deriving a reasonable initial estimate. For larger input sizes, expect the SuperGlue Warehouse target size to increase in direct proportion.

XConnect input size and expected SuperGlue Warehouse target size:

• Metamodel and other tables: 50MB (fixed, no input)
• PowerCenter: 1MB input, 10MB target
• PowerAnalyzer: 1MB input, 4MB target
• Database: 1MB input, 5MB target
• Other XConnect: 1MB input, 4.5MB target
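As a worked example of the ratios above, with entirely hypothetical input sizes, the sketch below adds the fixed metamodel allowance to each XConnect's input size multiplied by its expansion factor:

    # Rough initial estimate from the matrix above; input sizes (MB) are hypothetical.
    EXPANSION = {"PowerCenter": 10, "PowerAnalyzer": 4, "Database": 5, "Other": 4.5}
    METAMODEL_OVERHEAD_MB = 50

    inputs_mb = {"PowerCenter": 200, "Database": 80}   # metadata volume per source repository

    estimate_mb = METAMODEL_OVERHEAD_MB + sum(
        size * EXPANSION[kind] for kind, size in inputs_mb.items()
    )
    print(f"Initial SuperGlue warehouse estimate: ~{estimate_mb} MB")   # ~2450 MB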


SuperGlue Metadata Load Validation

Challenge

Just as confirming that all data for the current load cycle has loaded correctly is essential for good data warehouse management, it is equally important to validate that all metadata extractions (XConnects) have loaded correctly into the SuperGlue warehouse. If metadata extractions do not execute successfully, the SuperGlue warehouse will not reflect the most up-to-date metadata.

Description

The process for validating the SuperGlue metadata loads is very simple using the SuperGlue Configuration Console. In the SuperGlue Configuration Console, you can view the run history for each of the XConnects. For those who are familiar with PowerCenter, the “Run History” portion of the SuperGlue Configuration Console is similar to the Workflow Monitor in PowerCenter.

To view XConnect run history, first log into the SuperGlue Configuration Console.

After logging into the console, click XConnects > Execute Now (or click on the “Execute Now” shortcut on the left navigation panel).


The XConnect run history is displayed on the “Execute Now” screen. A SuperGlue Administrator should log into the SuperGlue Configuration Console on a regular basis and verify that all scheduled XConnects ran to successful completion.


If any XConnect has a status of “Failure”, the issue should be investigated and corrected, and the XConnect should be re-executed. XConnects can fail for a variety of common IT reasons, such as database unavailability, network failure, or improper configuration.

More detailed error messages can be found in the event log or in the workflow log files. By clicking the “Schedule” shortcut on the left navigation pane of the SuperGlue Configuration Console, you can view the logging options configured for the XConnect. In most cases, logging is set up to write to the <SUPERGLUE_HOME>/Console/SuperGlue_Log.log file.


After investigating and correcting the issue, the XConnect that failed should be re-executed at the next available time in order to load the most recent metadata.


Using SuperGlue Console to Tune the XConnects

Challenge

Improving the efficiency and reducing the run time of your XConnects through the parameter settings of the SuperGlue Console.

Description

Remember that the minimum system requirements for a machine hosting the SuperGlue console are:

• Windows operating system (2000, NT 4.0 SP 6a)
• 400MB disk space
• 128MB RAM (256MB recommended)
• 133 MHz processor

If the system meets or exceeds the minimum requirements but an XConnect is still taking an inordinately long time to run, use the following steps to try to improve its performance.

To improve performance of your XConnect loads from database catalogs:

• Modify the inclusion/exclusion schema list (if the number of schemas to be loaded is greater than the number to be excluded, use the exclusion list).

• Carefully examine how many old objects the project needs by default, and modify the “sysdate - 5000” filter to a smaller value to reduce the result set (see the example below).
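For example, if a project only needs objects modified in roughly the last quarter, the extraction filter could be narrowed from sysdate - 5000 to something like sysdate - 90, which shrinks the result set considerably. The exact clause and the view or parameter in which it appears vary by XConnect, so confirm where the value is set before changing it.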

To improve performance of your XConnect loads from the PowerCenter repository:

• Load only the production folders that are needed for a particular project.
• Run the XConnects with just one folder at a time, or select the list of folders for a particular run.