Using Informatica to Load Teradata


Using Informatica to Load Teradata at Cisco

Using Informatica with Teradata Load Utilities

Cisco uses Informatica for data extracts and loads. Informatica can load data into Teradata databases either through an ODBC connection or by building and launching Teradata load utility scripts. The Teradata load utilities are designed to load massive amounts of data in a short amount of time, so loading through ODBC should be considered for very small tables only. This document discusses three Teradata utilities (Fastload, Multiload, and Tpump) and is valid through Informatica version 7.1.2.

Fastload: Fastload inserts large volumes of data very rapidly into Teradata tables. It can load one table from multiple input files. The biggest restriction with Fastload is that the table being loaded must be empty. This makes it useful for initial loads, or for loading tables that are emptied prior to scheduled loads, but it can’t be used for incremental updates. Fastload will not load duplicate rows into a table, even if the table is created as a multiset table. Completely duplicate input rows don’t cause errors; they are simply dropped during the load process. A table being fastloaded is not available to users for queries.

Multiload: Multiload supports insert, update, delete, and upsert operations for up to five target tables. It can apply conditional logic to determine what updates to apply. Its speed approaches that of Fastload. Multiload is limited to one input file. Tables being multiloaded are available for select access only.

Tpump: Tpump is generally used for low-volume maintenance of large tables, near-real-time maintenance, or both. It does row-at-a-time processing using SQL, and is slower than Fastload and Multiload. A table being maintained by Tpump remains available for other updates while Tpump is running against it. Tpump does not support multiple input files.

When deciding which load utility to select, you must consider the volume of data, the frequency of the load, and what type of availability is needed for the table while it is being loaded. All three utilities provide some level of restartability following errors.
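The selection criteria above can be sketched as a small decision helper. This is only an illustration in Python, not something Informatica or Teradata provides: the LoadProfile fields and both row-count cut-offs are assumptions made up for the example, since this document gives no numeric thresholds.

```python
from dataclasses import dataclass

# Assumed cut-offs for illustration only; the document gives no numbers.
SMALL_TABLE_ROWS = 10_000
LOW_VOLUME_ROWS = 100_000


@dataclass
class LoadProfile:
    """Facts about one load job (hypothetical field names)."""
    row_count: int                  # expected rows per run
    target_is_empty: bool           # initial load or truncate-and-reload
    inserts_only: bool              # False if updates, deletes, or upserts are needed
    needs_concurrent_updates: bool  # other jobs must update the table during the load


def choose_loader(profile: LoadProfile) -> str:
    """Suggest a load method from the criteria described in this document."""
    if profile.row_count <= SMALL_TABLE_ROWS:
        return "odbc"        # ODBC is for very small tables only
    if profile.needs_concurrent_updates:
        return "tpump"       # only Tpump leaves the table open to other updates
    if profile.target_is_empty and profile.inserts_only:
        return "fastload"    # Fastload needs an empty table and only inserts
    if profile.row_count <= LOW_VOLUME_ROWS:
        return "tpump"       # low-volume maintenance of an existing table
    return "multiload"       # high-volume insert/update/delete/upsert


print(choose_loader(LoadProfile(5_000_000, True, True, False)))
```

The order of the checks matters: availability requirements are tested before volume, because a table being fastloaded or multiloaded cannot accept other updates.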

The following table compares the features of the three load utilities.

11/07/2005 Page 1


Feature                  Fastload      Multiload    Tpump
-----------------------  ------------  -----------  -----------
DDL Functions            Limited       All          All
DML Functions            Insert        Ins/Upd/Del  Ins/Upd/Del
Multiple DML             No            Yes          Yes
Multiple Tables          No            Yes          Yes
Multiple Sessions        Yes           Yes          Yes
Protocol Used            FASTLOAD      MULTILOAD    SQL
Conditional Expressions  No            Yes          Yes
Arithmetic Calculations  No            Yes          No
Data Conversion          1 per column  Yes          Yes
Error Files              Yes           Yes          Yes
Error Limits             Yes           Yes          Yes
User-written Routines    Yes           Yes          Yes

Informatica/Teradata Connections

The load method for an Informatica mapping is set on the mapping tab of the session, under TARGET. For Teradata load utilities, Writer is set to File Writer, Connection Type is set to Loader, and Value is set to the name of the connection. Connections are set up using the Connections tab in Workflow Designer.

Attribute                   Fastload              Multiload                      Tpump                 Description
--------------------------  --------------------  -----------------------------  --------------------  -----------
TDPID                       varies (td0 for POC)  varies (td0 for POC)           varies (td0 for POC)  Teradata server.
Database Name               varies                varies                         varies                Database containing the target table.
Date Format                 N/A                   blank                          N/A                   Leave blank, assuming the value loaded into a date column in a target is a date/time type in Informatica.
Error Limit                 0                     0                              0                     Max # of rows that can be rejected before the job is aborted. (0 = no limit)
Checkpoint                  0                     0 not staged, >=10,000 staged  0                     # of rows (>= 60) or minutes (1-59) between checkpoints. If IS STAGED is selected, select a reasonable # of records or amount of time based on the size of the output file. If the connection is not staged, this should be set to 0 (no checkpoints).
Tenacity                    4                     4                              4                     # of hours the job will keep trying to logon the required sessions.
Load Mode                   N/A                   Upsert                         Upsert                Insert, Update, Delete, Upsert, or Data Driven. Data driven uses the property set in the update strategy transformation in the mapping.
Drop Error Tables           No                    No                             No                    Specifies whether or not to drop the error tables prior to starting the loader.
External Loader Executable  fastload              mload                          tpump                 Name of the loader executable.
Max Sessions                80 for POC            80 for POC                     10 for POC            Default to one per AMP.
Sleep                       6                     6                              6                     # of minutes between logon tries.
Packing Factor              N/A                   N/A                            1                     # of statements to pack into a multi-statement request. Max is 600, default is 20.
Statement Rate              N/A                   N/A                            blank                 Maximum rate at which statements are sent to Teradata per minute. Unlimited if not specified.
Serialize                   N/A                   N/A                            On                    If set, actions to a given row are executed in order.
Robust                      N/A                   N/A                            Off                   If off, simple restart logic is used (restart after last checkpoint).
No Monitor                  N/A                   N/A                            On                    If set, prevents Tpump from checking for statement rate changes to send to the monitor.
Truncate Target Table       Off                   Off                            Off                   If set, all rows in target table are deleted prior to the load job starting.
Is Staged                   Off                   Off                            Off                   Data is written to a flat file before the load job starts.
Error Database              Varies (dw_errlog)    Varies (dw_errlog)             Varies (dw_errlog)    Database where error tables will be created.
Work Table Database         N/A                   Varies (dw_errlog)             N/A                   Database where work tables will be created.
Log Table Database          N/A                   Varies (dw_errlog)             N/A                   Database where log table will be created.
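The Checkpoint attribute is the one setting whose meaning depends on its value and on Is Staged, so it is easy to set inconsistently. The sketch below restates those rules in Python as a sanity check. It is an assumption-laden illustration, not an Informatica API: the function names are invented, and the 10,000-row floor for staged Multiload is taken from the recommended value in the table rather than from any product limit.

```python
from typing import List


def describe_checkpoint(value: int) -> str:
    """Interpret a Checkpoint value: 0 = none, 1-59 = minutes, 60 or more = rows."""
    if value < 0:
        raise ValueError("checkpoint cannot be negative")
    if value == 0:
        return "no checkpoints"
    if value < 60:
        return f"every {value} minute(s)"
    return f"every {value} row(s)"


def check_checkpoint(loader: str, is_staged: bool, checkpoint: int) -> List[str]:
    """Return warnings where a connection departs from the recommendations above."""
    problems = []
    if not is_staged and checkpoint != 0:
        problems.append("connection is not staged: Checkpoint should be 0")
    # Row-based checkpoints below 10,000 (or none at all) are flagged for
    # staged Multiload; minute-based values (1-59) are accepted as-is.
    if loader == "mload" and is_staged and (checkpoint == 0 or 60 <= checkpoint < 10_000):
        problems.append("staged Multiload: use at least 10,000 rows between checkpoints")
    return problems


print(describe_checkpoint(10_000))
print(check_checkpoint("mload", True, 0))
```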

Staged vs. Not Staged

When a loader connection has IS STAGED selected, Informatica will write output to a flat file on the Informatica server. Data is sent to the target database only after Informatica has completed creating the flat file. Informatica does not delete the flat file after the loader has completed.

If a loader connection is not staged, Informatica will start sending data to the target database using named pipes as soon as it has data to send. After job completion, there is no flat file.

Source disk space requirements and restartability requirements need to be considered when choosing which option to use.
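Because staged flat files are left behind after each load, disk usage on the Informatica server grows unless something removes them. A minimal Python sketch of finding old staged files follows. The directory, the "*.out" file pattern, and the retention period are all assumptions for illustration; check how your sessions actually name their output files before deleting anything.

```python
import time
from pathlib import Path


def stale_staged_files(directory, max_age_days, pattern="*.out", now=None):
    """List files matching `pattern` that were last modified over `max_age_days` ago.

    `pattern` is an assumed naming convention, not something Informatica guarantees.
    """
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400
    return sorted(
        p for p in Path(directory).glob(pattern)
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

The function only reports candidates. Deleting them (for example with `Path.unlink`) is left to the caller, since a staged file may still be needed to restart a failed job.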

Restarting Load Jobs

Multiload

Staged: If a job abends prior to the application phase, you can choose to restart the job, or abandon the job. If it is restarted, it will pick up after the last checkpoint. To abandon the job, execute a RELEASE MLOAD statement against the target table, and drop the error and log tables. If the job has entered the application phase, you either have to restart it, or drop the target table, recreate it, and restore the data from a backup.

Not Staged: If a job abends prior to the application phase, it can’t be restarted. Since there isn’t an input file, there’s no way to guarantee that the input will match the original input, and data corruption can occur. If the job abends in the application phase, either the job must be restarted, or the target table must be dropped and recreated.
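Abandoning a Multiload job that failed before the application phase comes down to a handful of SQL statements. The Python sketch below only builds those statements as strings; it does not connect to Teradata. The table names in the usage line are hypothetical, so substitute the error and log table names your connection actually created.

```python
def abandon_mload_sql(target_table, error_and_log_tables):
    """Build the statements that abandon a Multiload job before the application phase."""
    statements = [f"RELEASE MLOAD {target_table};"]
    # Drop the leftover error and log tables so the next run starts clean.
    statements += [f"DROP TABLE {name};" for name in error_and_log_tables]
    return statements


# Hypothetical names, for illustration only.
for stmt in abandon_mload_sql("dw.orders", ["dw_errlog.ET_orders", "dw_errlog.LOG_orders"]):
    print(stmt)
```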

Fastload

The same considerations apply regarding staged and not staged input. It’s usually easiest to drop/recreate the table and start from the top.

Tpump

Staged: Restart the tpump job. It will use the error and log tables to determine where it left off.

Not Staged: The job can’t be restarted.
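The restart rules in this section can be condensed into one lookup. The following Python sketch simply encodes the options described above. The function name and the returned wording are invented for illustration, and for Fastload it returns the "usually easiest" option rather than every possibility.

```python
def restart_action(loader: str, is_staged: bool, in_application_phase: bool = False) -> str:
    """Return the recovery options this section describes for a failed load job."""
    loader = loader.lower()
    if loader == "multiload":
        if in_application_phase:
            # Staged or not, the job must be finished or the table rebuilt.
            return "restart the job, or drop and recreate the target table"
        if is_staged:
            return "restart from the last checkpoint, or abandon with RELEASE MLOAD"
        return "cannot restart"
    if loader == "fastload":
        return "drop and recreate the table, then rerun"
    if loader == "tpump":
        return "restart the job" if is_staged else "cannot restart"
    raise ValueError(f"unknown loader: {loader!r}")


print(restart_action("multiload", is_staged=True))
```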


Troubleshooting

When Informatica launches a Teradata load job, the session waits for a return code from the load job. If a zero return code is received, the session will be reported as successful; non-zero will result in a failure. But a successful load job doesn’t necessarily mean that all rows were loaded successfully. Some or all of the rows may have been rejected and sent to the error table, or rows that were assumed to be inserts may actually have been applied as updates due to duplicate keys in the input data.

Following any load job, check its log to determine the actual results of the job. The log files are written to the …/TgtFiles directory, with an extension of ‘ldrlog’. Two areas of the log contain the relevant information: the application section reports the number of inserts, updates, and deletes, and entries in the clean-up section report the number of rows sent to the error table(s).
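Finding the right log is the first step. As a minimal Python sketch, the helper below lists the most recently written loader logs in a directory so they can be opened and read. The directory argument stands in for your own TgtFiles location, and the function name and default limit are assumptions. The sketch does not try to parse the log contents, because the exact layout depends on the load utility.

```python
from pathlib import Path


def latest_loader_logs(tgt_files_dir, limit=5):
    """Return up to `limit` *.ldrlog files from the directory, newest first."""
    logs = sorted(
        Path(tgt_files_dir).glob("*.ldrlog"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    return logs[:limit]
```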

The error tables are created in the database specified in the Informatica connection. They are dropped at the end of the job if they are empty, so the existence of an error table after a load job indicates that at least one row was rejected. Look at the rows in the error table to find the error code.

When a load job is running much more slowly than expected, it’s a good idea to check the number of rows in the associated error tables. Rows are written one at a time into the error table, as opposed to the much faster writes to the target tables, so if all or most of the rows are being rejected, the writes to the error tables will slow down the load job. If the number of rejected rows is very high, you may want to abort the load job, fix the problem, then rerun it. The most common causes of rejected rows are NOT NULL violations resulting from failed lookup transformations, and data conversion errors.
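Both checks described above (which error tables still exist, and how many rows one of them holds) are simple queries. The Python sketch below builds them as strings and does not connect to Teradata. It assumes the DBC.TablesV dictionary view is available (older releases expose the same columns through DBC.Tables), and the table name in the usage line is hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$#]*$")


def _checked(name: str) -> str:
    """Reject anything that is not a plain identifier before placing it in SQL."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def leftover_error_tables_sql(error_db: str) -> str:
    """Query listing tables that still exist in the error database, newest first."""
    return (
        "SELECT TableName, CreateTimeStamp "
        "FROM DBC.TablesV "
        f"WHERE DatabaseName = '{_checked(error_db)}' AND TableKind = 'T' "
        "ORDER BY CreateTimeStamp DESC;"
    )


def error_row_count_sql(error_db: str, error_table: str) -> str:
    """Query counting the rejected rows currently in one error table."""
    return f"SELECT COUNT(*) FROM {_checked(error_db)}.{_checked(error_table)};"


print(leftover_error_tables_sql("dw_errlog"))
print(error_row_count_sql("dw_errlog", "ET_orders"))  # hypothetical table name
```

Running the count query repeatedly while a slow job is in progress shows whether the error table is growing, which is the signal to abort and fix the input.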
