Xi3 Ds Performance Optimization Guide En

7/31/2019 Xi3 Ds Performance Optimization Guide En

http://slidepdf.com/reader/full/xi3-ds-performance-optimization-guide-en 1/198

Data Services Performance

Optimization Guide

BusinessObjects Data Services XI 3.0 (12.0.0)



© 2008 Business Objects. All rights reserved. Business Objectsowns the followingU.S. patents, which may cover products that are offered and licensed by Business

Copyright

Objects: 5,555,403; 5,857,205; 6,289,352; 6,247,008; 6,490,593; 6,578,027;6,831,668; 6,768,986; 6,772,409; 6,882,998; 7,139,766; 7,299,419; 7,194,465;7,222,130; 7,181,440 and 7,181,435. Business Objects and the Business Objects

logo, BusinessObjects, Business Objects Crystal Vision, Business Process OnDemand, BusinessQuery, Crystal Analysis, Crystal Applications, Crystal Decisions,Crystal Enterprise, Crystal Insider, Crystal Reports, Desktop Intelligence, Inxight,the Inxight Logo, LinguistX, Star Tree, Table Lens, ThingFinder, Timewall, Letthere be light, Metify, NSite, Rapid Marts, RapidMarts, the Spectrum Design, WebIntelligence, Workmail and Xcelsius are trademarks or registered trademarks inthe United States and/or other countries of Business Objects and/or affiliatedcompanies. All other names mentioned herein may be trademarks of their respectiveowners.

Business Objects products in this release may contain redistributions of software

licensed from third-party contributors. Some of these individual components may

Third-party

Contributorsalso be available under alternative licenses. A partial listing of third-partycontributors that have requested or permitted acknowledgments, as well as requirednotices, can be found at: http://www.businessobjects.com/thirdparty

2008-03-12



ContentsWelcome to Data Services 9Chapter 1

Welcome....................................................................................................10

Documentation set for Data Services........................................................10

Accessing documentation..........................................................................12

Accessing documentation on Windows................................................13

Accessing documentation on UNIX......................................................13

Accessing documentation from the Web..............................................13

Business Objects information resources...................................................14

Environment Test Strategy 17Chapter 2

The source OS and database server.........................................................18

Operating system.................................................................................18

Database..............................................................................................18

The target OS and database server..........................................................19

Operating system.................................................................................19Database..............................................................................................19

The network...............................................................................................20

Data Services Job Server OS and job options..........................................20

Operating system.................................................................................20

Data Services jobs...............................................................................21

Measuring Data Services Performance 23Chapter 3

Data Services processes and threads.......................................................24Data Services processes .....................................................................24

Data Services threads..........................................................................25

Measuring performance of Data Services jobs..........................................25

Data Services Performance Optimization Guide 3



Checking system utilization..................................................................26

Analyzing log files for task duration......................................................30

Reading the Monitor Log for execution statistics ................................31

Reading the Performance Monitor for execution statistics...................32

To view the Performance Monitor.........................................................33

Reading Operational Dashboards for execution statistics...................34

To compare execution times for the same job over time......................34

Tuning Overview 37Chapter 4

Strategies to execute Data Services jobs..................................................38

Maximizing push-down operations to the database server..................38Improving Data Services throughput....................................................39

Using advanced Data Services tuning options.....................................40

Maximizing Push-Down Operations 41Chapter 5

Push-down operations...............................................................................42

Full push-down operations...................................................................42

Partial push-down operations...............................................................43

Operations that cannot be pushed down.............................................44Push-down examples................................................................................45

Scenario 1: Combine transforms to push down ..................................45

Scenario 2: Full push down from the source to the target....................46

Scenario 3: Full push down for auto correct load to the target ............46

Scenario 4: Partial push down to the source .......................................47

To view SQL...............................................................................................47

Data_Transfer transform for push-down operations..................................50

Scenario 1: Push down an operation after a blocking operation..........50

Scenario 2: Using Data_Transfer tables to speed up auto correctloads.....................................................................................................53

Database link support for push-down operations across datastores.........57

4 Data Services Performance Optimization Guide

Contents



Software support..................................................................................58

To take advantage of linked datastores................................................58

Example of push-down with linked datastores.....................................59

Generated SQL statements..................................................................60

Tuning performance at the data flow or Job Server level.....................61

Using Caches 63Chapter 6

Caching data..............................................................................................64

Caching sources...................................................................................65

Caching joins........................................................................................66

To change the cache type for a data flow.............................................66Caching lookups...................................................................................67

Caching table comparisons..................................................................68

Specifying a pageable cache directory................................................69

Using persistent cache..............................................................................69

Using persistent cache tables as sources ...........................................70

Monitoring and tuning caches....................................................................70

Using statistics for cache self-tuning ...................................................71

To have Data Services automatically choose the cache type..............71

To monitor and tune in-memory and pageable caches........................71

Using parallel Execution 77Chapter 7

Parallel data flows and work flows.............................................................78

Parallel execution in data flows.................................................................79

Table partitioning..................................................................................79

Degree of parallelism ..........................................................................86

Combining table partitioning and a degree of parallelism....................94

File multi-threading...............................................................................96


Contents



Distributing Data Flow Execution 101Chapter 8

Splitting a data flow into sub data flows...................................................102

Run as a separate process option ....................................................103

Examples of multiple processes for a data flow.................................104

Data_Transfer transform....................................................................111

Examples of multiple processes with Data_Transfer.........................111

Using grid computing to distribute data flows execution..........................117

Server Group......................................................................................117

Distribution levels for data flow execution..........................................117

Using Bulk Loading 123Chapter 9

Bulk loading in Oracle..............................................................................124

Bulk-loading methods.........................................................................124

Bulk-loading modes............................................................................125

Bulk-loading parallel-execution options..............................................125

Bulk-loading scenarios.......................................................................126

Using bulk-loading options.................................................................127

Bulk loading in DB2 Universal Database.................................................129When to use each DB2 bulk-loading method.....................................130

Using the DB2 CLI load method.........................................................132

To configure your system to use the CLI load method.......................132

To use the CLI load method in a Data Services job...........................132

Using the DB2 bulk load utility............................................................133

To configure your system to use the load utility.................................133

To use the load utility in a Data Services job......................................136

Using the import utility .......................................................................137

Bulk loading in Sybase ASE....................................................................138

Bulk loading in Sybase IQ........................................................................138

Bulk loading in Teradata..........................................................................139


Contents



When to use each Teradata bulk load method...................................139

How Data Services and Teradata use the file options to load............148

Using the UPSERT bulk operation.....................................................150

Warehouse Builder method................................................................151

Teradata standalone utilities ..............................................................157

Bulk loading in Netezza...........................................................................177

Netezza bulk-loading process............................................................177

Options overview................................................................................178

Netezza log files.................................................................................179

Configuring bulk loading for Netezza ................................................179

To configure the target table...............................................................180

Other Tuning Techniques 183Chapter 10

Source-based performance options........................................................185

Join ordering.......................................................................................185

Minimizing extracted data...................................................................188

Using array fetch size.........................................................................189

To set the Array fetch size parameter ...............................................189

Target-based performance options..........................................................190

Loading method .................................................................................190

Rows per commit................................................................................191

Job design performance options.............................................................192

Loading only changed data ...............................................................192

Minimizing data type conversion........................................................192

Minimizing locale conversion..............................................................193

Precision in operations ......................................................................193

Index 195


Contents



Welcome to Data Services

1



What this document providesDocument

Information for third-party developers to access DataServices functionality

Integrator's Guide

Information about how to use Data Services Adminis-trator

Management Console: Administrator

Guide

Information about how to use Data Services MetadataReports

Management Console: Metadata Re-

ports Guide

Release-specific product behavior changes fromearlier versions of Data Services to the latest release

Migration Considerations

Information about how to migrate from Data Quality

to Data ServicesMigration Guide

Information about how to improve the performanceof Data Services

Performance Optimization Guide

Detailed reference material for Data Services Design-er

Reference Guide

A compiled “master” PDF of core DataServices bookscontaining a searchable master table of contents andindex:

• Installation Guide• Getting Started Guide

• Designer Guide

• Reference Guide

• Management Console: Metadata Reports Guide

• Management Console: Administrator Guide

• Performance Optimization Guide

• Advanced Development Guide

• Supplement for J.D. Edwards

• Supplement for Oracle Applications

• Supplement for PeopleSoft

• Supplement for Siebel

• Supplement for SAP

Technical Manuals

A step-by-step introduction to using Data ServicesTutorial


1Welcome to Data Services

Documentation set for Data Services



In addition, you may need to refer to several Adapter Guides andSupplemental Guides.

What this document providesDocument

Information about how to install, configure, and use the DataServices Adapter for JMS

JMS Adapter Interface

Information about how to install, configure, and use the DataServices Salesforce.com Adapter Interface

Salesforce.com Adapter

Interface

Information about license-controlled interfaces between DataServices and J.D. Edwards World and J.D. Edwards OneWorld

Supplement for J.D. Ed-

wards

Information about the license-controlled interface between Data

Services and Oracle Applications

Supplement for Oracle Ap-

plications

Information about license-controlled interfaces between DataServices and PeopleSoft

Supplement for PeopleSoft

Information about license-controlled interfaces between DataServices, SAP ERP and R/3, and SAP BI/BW

Supplement for SAP

Information about the license-controlled interface between DataServices and Siebel

Supplement for Siebel

Accessing documentation

You can access the complete documentation set for Data Services in severalplaces.

Note: For the latest tips and tricks on Data Services, access our KnowledgeBase on the Customer Support site at http://technicalsupport.businessob

jects.com. We have posted valuable tips for getting the most out of your Data

Services product.


Welcome to Data Services1 Accessing documentation



Accessing documentation on Windows

After you install Data Services, you can access the documentation from theStart menu.

1. Choose Start > Programs > BusinessObjects XI 3.0 >BusinessObjects Data Services > Data Services Documentation.

Note: Only a subset of the documentation is available from the Startmenu. The documentation set for this release is available inLINK_DIR \Doc\Books\en.

2. Click the appropriate shortcut for the document that you want to view.

Accessing documentation on UNIX

After you install Data Services, you can access the online documentation bygoing to the directory where the printable PDF files were installed.

1. Go toLINK_DIR /doc/book/en/.

2. Using Adobe Reader, open the PDF file of the document that you wantto view.

Accessing documentation from the Web

You can access the complete documentation set for Data Services from theBusiness Objects Customer Support site.

1. Go to www.businessobjects.com

2. From the "Support" pull-down menu, choose Documentation.

3. On the "Documentation" screen, choose Product Guides and navigateto the document that you want to view.

You can view the PDFs online or save them to your computer.



Accessing documentation



Business Objects information resources

Customer support, consulting, and training

A global network of Business technology experts provides customer support,education, and consulting to ensure maximum business intelligence benefitto your business.

Useful addresses at a glance

ContentAddress

Information about the full range of

Business Objects products.

Product information

http://www.businessobjects.com

Business Objects product documenta-

tion, including the Business Objects

Documentation Roadmap.

Product documentation

http://www.businessobjects.com/sup

port

Send us feedback or questions about

your Business Objects documentation.

Documentation mailbox

documentation@businessobjects.com

Information on Customer Support pro-

grams, as well as links to technical arti-

cles, downloads, and online forums.

Online Customer Support

http://www.businessobjects.com/sup

port

An online resource for sharing and

learning about Data Services with your

developer colleagues.

Online Developer Community

http://diamond.businessobjects.com/

Information about how Business Ob-

jects can help maximize your business

intelligence investment.

Consulting services

http://www.businessobjects.com/ser

vices/consulting/

Information on Business Objects train-

ing options and modules.

Education services

http://www.businessobjects.com/ser

vices/training


Welcome to Data Services1 Business Objects information resources

http://www.businessobjects.com/

http://www.businessobjects.com/support





http://www.businessobjects.com/services/consulting/


http://www.businessobjects.com/services/training











http://www.businessobjects.com/



Online Customer Support

The Business Objects Customer Support web site contains information aboutCustomer Support programs and services. It also has links to a wide range

of technical information including Knowledge Base articles, downloads, andsupport forums. http://www.businessobjects.com/support

Looking for training options?

From traditional classroom learning to targeted e-learning seminars, BusinessObjects can offer a training package to suit your learning needs and preferredlearning style. Find more information on the Business Objects Educationweb site: http://www.businessobjects.com/services/training

Send us your feedback

Do you have a suggestion on how we can improve our documentation? Isthere something that you particularly like or have found useful? Drop us aline, and we will do our best to ensure that your suggestion is considered for the next release of our documentation: [email protected].

Note: If your issue concerns a Business Objects product and not thedocumentation, please contact our Customer Support experts.



Business Objects information resources








Welcome to Data Services1 Business Objects information resources



Environment Test Strategy

2



This section covers suggested methods of tuning source and target databaseapplications, their operating systems, and the network used by your DataServices environment. It also introduces key Data Services job execution

options.This section contains the following topics:

• The source OS and database server on page 18

• The target OS and database server on page 19

• The network on page 20

• Data Services Job Server OS and job options on page 20

To test and tune Data Services jobs, work with all four of these components

in the order shown above.In addition to the information in this section, you can use your UNIX or Windows operating system and database server documentation for specifictechniques, commands, and utilities that can help you measure and tune theData Services environment.

The source OS and database server

Tune the source operating system and database to quickly read data from

disks.

Operating system

Make the input and output (I/O) operations as fast as possible. Theread-ahead protocol, offered by most operating systems, can greatly improveperformance. This protocol allows you to set the size of each I/O operation.Usually its default value is 4 to 8 kilobytes which is too small. Set it to at least64K on most platforms.

Database

Tune your database on the source side to perform SELECTs as quickly aspossible.


Environment Test Strategy2 The source OS and database server



In the database layer, you can improve the performance of SELECTs inseveral ways, such as the following:

• Create indexes on appropriate columns, based on your Data Services

data flows.• Increase the size of each I/O from the database server to match the OS

read-ahead I/O size.

• Increase the size of the shared buffer to allow more data to be cached inthe database server.

• Cache tables that are small enough to fit in the shared buffer. For example,if jobs access the same piece of data on a database server, then cachethat data. Caching data on database servers will reduce the number of I/O operations and speed up access to database tables.

See your database server documentation for more information abouttechniques, commands, and utilities that can help you measure and tune thethe source databases in your Data Services jobs.

The target OS and database server

Tune the target operating system and database to quickly write data to disks.

Operating system

Make the input and output operations as fast as possible. For example, theasynchronous I/O, offered by most operating systems, can greatly improveperformance. Turn on the asynchronous I/O.

Database

Tune your database on the target side to perform INSERTs and UPDATESas quickly as possible.

In the database layer, there are several ways to improve the performanceof these operations.

Here are some examples from Oracle:


2Environment Test Strategy

The target OS and database server



• Consider a design that will run one al_engine process per CPU at atime.

• Tune the Job Server OS so that Data Services threads spread to all

available CPUs.

For more information, see Checking system utilization on page 26.

Data Services jobs

You can tune Data Services job execution options after:

• Tuning the database and operating system on the source and the targetcomputers

• Adjusting the size of the network buffer

• Your data flow design seems optimal

You can tune the following execution options to improve the performance of Data Services jobs:

• Monitor sample rate

• Collect statistics for optimization and Use collected statistics

Setting Monitor sample rate

During job execution, Data Services writes information to the monitor log fileand updates job events after processing the number of rows specified inMonitor sample rate. Default value is 1000. Increase Monitor sample rate

to reduce the number of calls to the operating system to write to the log file.

When setting Monitor sample rate, you must evaluate performanceimprovements gained by making fewer calls to the operating system againstyour ability to view more detailed statistics during job execution. With a higher

Monitor sample rate, Data Services collects more data before calling theoperating system to open the file, and performance improves. However, witha higher monitor rate, more time passes before you can view statistics during

job execution.


2Environment Test Strategy

Data Services Job Server OS and job options



This section contains the following topics:

• Data Services processes and threads on page 24

• Measuring performance of Data Services jobs on page 25

Data Services processes and threads

Data Services uses processes and threads to execute jobs that extract datafrom sources, transform the data, and load data into a data warehouse. Thenumber of concurrently executing processes and threads affects theperformance of Data Services jobs.

Data Services processes

The processes Data Services uses to run jobs are:

• al_jobserver

The al_jobserver initiates one process for each Job Server configured ona computer. This process does not use much CPU power because it isonly responsible for launching each job and monitoring the job's execution.

• al_engine

For batch jobs, an al_engine process runs when a job starts and for eachof its data flows. Real-time jobs run as a single process.

The number of processes a batch job initiates also depends upon thenumber of:

• parallel work flows

• parallel data flows

• sub data flows

For an example of the Data Services monitor log that displays the processes,see Analyzing log files for task duration on page 30.


Measuring Data Services Performance3 Data Services processes and threads



Data Services threads

A data flow typically initiates one al_engine process, which creates onethread per data flow object. A data flow object can be a source, transform,or target. For example, two sources, a query, and a target could initiate four threads.

If you are using parallel objects in data flows, the thread count will increaseto approximately one thread for each source or target table partition. If youset the Degree of parallelism (DOP) option for your data flow to a valuegreater than one, the thread count per transform will increase. For example,a DOP of 5 allows five concurrent threads for a Query transform. To runobjects within data flows in parallel, use the following Data Services features:

• Table partitioning

• File multithreading

• Degree of parallelism for data flows

Related Topics

• Using parallel Execution on page 77

Measuring performance of Data Services jobs

You can use several techniques to measure performance of Data Services jobs:

• Checking system utilization on page 26

• Analyzing log files for task duration on page 30

• Reading the Monitor Log for execution statistics on page 31

• Reading the Performance Monitor for execution statistics on page 32

• Reading Operational Dashboards for execution statistics on page 34


3Measuring Data Services Performance




For example, if you run a job and see that the CPU utilization is very high,you might decrease the DOP value or run less parallel jobs or data flows.Otherwise, CPU thrashing might occur.

For another example, if you run a job and see that only half a CPU is beingused, or if you run eight jobs on an eight-way computer and CPU usage isonly 50%, you can be interpret this CPU utilization in several ways:

• One interpretation might be that Data Services is able to push most of the processing down to source and/or target databases.

• Another interpretation might be that there are bottlenecks in the databaseserver or the network connection. Bottlenecks on database servers donot allow readers or loaders in jobs to use Job Server CPUs efficiently.

To determine bottlenecks, examine:

• Disk service time on database server computers

Disk service time typically should be below 15 milliseconds. Consultyour server documentation for methods of improving performance.For example, having a fast disk controller, moving database server log files to a raw device, and increasing log size could improve diskservice time.

• Number of threads per process allowed on each database server operating system. For example:

• On HPUX, the number of kernel threads per process is configurable.The CPU to thread ratio defaults to one-to-one. Business Objectsrecommends setting the number of kernel threads per CPU tobetween 512 and 1024.

• On Solaris and AIX, the number of threads per process is notconfigurable. The number of threads per process depends onsystem resources. If a process terminates with a message like"Cannot create threads," you should consider tuning the job.

For example, use the Run as a separate process option to splita data flow or use the Data_Transfer transform to create two subdata flows to execute sequentially. Since each sub data flow isexecuted by a different Data Services al_engine process, thenumber of threads needed for each will be 50% less than in your previous job design.






For example, if your data flow reads data from a table, joins it to a file,and then groups it to calculate an average, the group by operation mightbe occurring in memory. If you stage the data after the join and beforethe group by into a database on a different computer, then when a subdata flow reads the staged data and continues with the group processing,it can utilize memory from the database server on a different computer.This situation optimizes your system as a whole.

For information about how to stage your data, see Data_Transfer transform

on page 111. For more information about distributing sub data flows todifferent computers, see Using grid computing to distribute data flows

execution on page 117.

• Large amount of memory but it is under-utilized.

In this case, you might cache more data. Caching data can improve theperformance of data transformations because it reduces the number of times the system must access the database.

Data Services provides two types of caches: in-memory and pageable.For more information, see Caching data on page 64.

• Paging occurs.

Pageable cache is the default cache type for data flows. On Windowsand Linux, the virtual memory available to the al_engine process is 1.5gigabytes (500 megabytes of virtual memory is reserved for other engine

operations, totaling 2GB). On UNIX, Data Services limits the virtualmemory for the al_engine process to 3.5 gigabytes (500MB is reservedfor other engine operations, totaling 4GB). If more memory is neededthan these virtual memory limits, Data Services starts paging to continueexecuting the data flow.

If your job or data flow requires more memory than these limits, you cantake advantage of one of the following Data Services features to avoidpaging:

• Split the data flow into sub data flows that can each use the amount

of memory set by the virtual memory limits.

Each data flow or each memory-intensive operation within a data flowcan run as a separate process that uses separate memory from eachother to improve performance and throughput. For more information,see Splitting a data flow into sub data flows on page 102.






• Push-down memory-intensive operations to the database server sothat less memory is used on the Job Server computer. For moreinformation, see Push-down operations on page 42.

Analyzing log files for task duration

The trace log shows the progress of an execution through each component(object) of a job. The following sample Trace log shows a separate ProcessID (Pid) for the Job, data flow, and each of the two sub data flows.


Measuring Data Services Performance3 Measuring performance of Data Services jobs



Note: Memory statistics (Cache Size column) display in the PerformanceMonitor only if you select the Collect statistics for monitoring option whenyou execute the job.

The following sample Performance Monitor shows the following information:• The Query_Lookup transform used 110 kilobytes of memory.

• The first sub data flow processed 830 rows, and the second sub dataflow processed 35 rows.

To view the Performance Monitor

1. Access the Management Console with one of the following methods:

• In the Designer top menu bar, click Tools and select Data Services

Management Console.

• Click Start > Programs > BusinessObjects Data Services >Management Console.

2. On the launch page, click Administrator .

3. Select Batchrepository

4. On the Batch Job Statuspage, find a job execution instance.

5. Under Job Information for an instance, click Performance > Monitor .






For SQL sources and targets, Data Services creates database-specific SQLstatements based on the data flow diagrams in a job. Data Services generatesSQL SELECT statements to retrieve the data from source databases. To

optimize performance, Data Services pushes down as many SELECToperations as possible to the source database and combines as manyoperations as possible into one request to the database. Data Services canpush down SELECT operations such as joins, Group By, and commonfunctions such as decode and string functions.

Data flow design influences the number of operations that Data Servicescan push to the database. Before running a job, you can view the SQL thatData Services generates and adjust your design to maximize the SQL thatis pushed down to improve performance.

You can use database links and the Data_Transfer transform to pushdownmore operations.

This section discusses:

• Push-down operations

• Push-down examples

• Viewing SQL

• Data_Transfer transform for push-down operations

• Database link support for push-down operations across datastores

Push-down operations

By pushing down operations to the source database, Data Services reducesthe number of rows and operations that the engine must retrieve and process,which improves performance. When determining which operations to pushto the database, Data Services examines the database and its environment.

Full push-down operations

The Data Services optimizer always first tries to do a full push-down operation. A full push-down operation is when all Data Services transform operationscan be pushed down to the databases and the data streams directly fromthe source database to the target database. Data Services sends SQL


Maximizing Push-Down Operations5 Push-down operations



Push-down examples

The following are typical push-down scenarios.

Scenario 1: Combine transforms to push down

When determining how to push operations to the database, Data Servicesfirst collapses all the transforms into the minimum set of transformationsexpressed in terms of the source table columns. Next, Data Services pushesall possible operations on tables of the same database down to that DBMS.

For example, the following data flow extracts rows from a single source table.

The first query selects only the rows in the source where column A containsa value greater than 100. The second query refines the extraction further,reducing the number of columns returned and further reducing the qualifyingrows.

Data Services collapses the two queries into a single command for the DBMSto execute. The following command uses AND to combine the WHEREclauses from the two queries:

SELECT A, MAX(B), CFROM sourceWHERE A > 100 AND B = CGROUP BY A, C

Data Services can push down all the operations in this SELECT statementto the source DBMS.


5Maximizing Push-Down Operations

Push-down examples



Data Services only shows the SELECT generated for table sources andINSERT INTO... SELECT for targets. Data Services does not show theSQL generated for SQL sources that are not table sources, such as:

• Lookup function

• Key_generation function

• Key_Generation transform

• Table_Comparison transform

4. Select a name from the list of datastores on the left to view the SQL thatthis data flow applies against the corresponding database or application.

The following example shows the optimized SQL for the second datastorewhich illustrates a full push-down operation (INSERT INTO... SELECT).This data flows uses a Data_Transfer transform to create a transfer tablethat Data Services loads directly into the target. For more information,see "Data_Transfer transform for push-down operations" on page 1667


Maximizing Push-Down Operations5 To view SQL



3. On the General tab of the Properties window, select Pageable in thedrop-down list for the Cache type option.

Caching lookups

You can also improve performance by caching data when looking upindividual values from tables and files.

There are two methods of looking up data:

• Using a Lookup function in a query on page 67

• Using a source table and setting it as the outer join on page 68

Using a Lookup function in a query

Data Services has three Lookup functions: lookup, lookup_seq, andlookup_ext. The lookup and lookup_ext functions have cache options.Caching lookup sources improves performance because Data Servicesavoids the expensive task of creating a database query or full file scan oneach row.

You can set cache options when you specify a lookup function. There arethree caching options:

• NO_CACHE — Does not cache any values.

• PRE_LOAD_CACHE — Preloads the result column and compare columninto memory (it loads the values before executing the lookup).

• DEMAND_LOAD_CACHE — Loads the result column and comparecolumn into memory as the function executes.

Use this option when looking up highly repetitive values that are a smallsubset of the data and when missing values are unlikely.

Demand-load caching of lookup values is helpful when the lookup resultis the same value multiple times. Each time Data Services cannot findthe value in the cache, it must make a new request to the database for that value. Even if the value is invalid, Data Services has no way of knowing if it is missing or just has not been cached yet.


6Using Caches

Caching data



At runtime, Data Services translates this data flow to:

The source threads execute in parallel and the Case, Merge, and targetsexecute in parallel.

Viewing, creating, and enabling table partitions

Oracle databases support range, list, and hash partitioning. You can importthis information as table metadata and use it to extract data in parallel. Youcan use range and list partitions to load data to Oracle targets. You can alsospecify logical range and list partitions using Data Services metadata for Oracle tables.

In addition, Data Services provides the ability to specify logical rangepartitions for DB2, Microsoft SQL Server, Sybase ASE, and Sybase IQ tables

by modifying imported table metadata.

Data Services uses partition information by instantiating a thread at runtimefor each partition. These threads execute in parallel. To maximizeperformance benefits, use a multi-processor environment.

To view partition information

1. Import a table into Data Services.

2. In the Datastores tab of the object library, right-click the table name and

select Properties.

3. Click the Partitions tab.

When you import partitioned tables from your database, you will find thesepartitions displayed on the Partitions tab of the table's Properties window.


7Using parallel Execution

Parallel execution in data flows



• soundex()• is_group_changed()

• sql()• key_generation()

• sum()• mail_to()

• total_rows()• max()

To enable stored procedures to run in parallel

Use the Enable parallel execution option to set functions to run in parallelwhen the transforms in which they are used execute in parallel.

1. In the Datastores tab of the object library, expand a Datastore node.

2. Expand its Function node.

3. Right-click a function and select Properties.

4. In the Properties window, click the Function tab.

5. Click the Enable Parallel Execution check box.

6. Click OK.

To enable custom functions to run in parallel

1. In the Custom Functions tab of the object library, right-click a functionname and select Properties.

2. In the Properties window, click the Function tab.

3. Click the Enable Parallel Execution check box.

4. Click OK.

Tips

DOP can degrade performance if you do not use it judiciously. The bestvalue to choose depends on the complexity of the flow and number of CPUsavailable. For example, on a computer with four CPUs, setting a DOP greater


7Using parallel Execution

Parallel execution in data flows



general rule. This produces a data flow without the Merge Round Robin Splitand each partition pipes the data directly into the consuming transform.

Two source partitions, DOP of two, and two target partitions

The best case situation is when the following conditions exist:

• The source and target are partitioned the same way.

• The source and target have the same number of partitions.

• DOP is equal to the same number of partitions.

When a source has two partitions, it replicates twice. Because the DOP valueis two, the query transform replicates twice. When a target has two partitions,it replicates twice. The following figure shows that each source partition feedsdirectly into a replicated query transform, and the output from each queryfeeds directly into a replicated target.

File multi-threading

You can set the number of threads Data Services uses to process flat filesources and targets. The Parallel process threads option is available onthe:

• File format editor

• Source file editor

• Target file editor

• Properties window of an SAP ERP or R/3 data flow.

Without multi-threading:

• With delimited file reading, Data Services reads a block of data from thefile system and then scans each character to determine if it is a column


Using parallel Execution7 Parallel execution in data flows



Run as a separate process option

If your data flow contains multiple resource-intensive operations, you canrun each operation as a separate process (sub data flow) that uses separateresources (memory and computer) from each other to improve performanceand throughput. When youspecify multiple Run as separate process optionsin objects in a data flow, Data Services splits the data flow into sub dataflows that run in parallel.

The Run as a separate process option is available on resource-intensiveoperations that including the following:

• Hierarchy_Flattening transform

• Associate transform

• Country ID transform

• Global Address Cleanse transform

• Global Suggestion Lists transform

• Match Transform

• United States Regulatory Address Cleanse transform

• User-Defined transform

• Query operations that are CPU-intensive and memory-intensive:

• Join

• GROUP BY

• ORDER BY

• DISTINCT

• Table_Comparison transform

• Lookup_ext function

• Count_distinct function


8Distributing Data Flow Execution

Splitting a data flow into sub data flows



Scenario 2: Run multiple sub data flows with DOP greater than

1

When Degree Of Parallelism (DOP) is set to a value greater than 1, eachtransform defined in the data flow replicates for use on a parallel subset of data. For more information, see Degree of parallelism on page 86.

Set DOP to a value greater than 1 on the data flow Properties window.


8Distributing Data Flow Execution

Splitting a data flow into sub data flows



4. After you save the data flow and click ValidationDisplay Optimized

SQL..., the Optimized SQL window shows that the join between thetransfer table and source Orders table is pushed down to the database.

Data Services can push down many operations without using theData_Transfer transform. For more information, see Push-down operations

on page 42.

5. When you execute the job, the Trace Log shows messages such as thefollowing that indicate that Data Services created two sub data flows with


Distributing Data Flow Execution8 Splitting a data flow into sub data flows



Data Services supports capabilities present in Oracle, Microsoft SQL Server,Informix, DB2, Netezza, Sybase ASE, Sybase IQ, and Teradata that enableyou to load data in bulk rather than using SQL statements. Some general

considerations when using bulk loading are:• You specify bulk-loading options on the Data Services target table editor

in the Bulk Loader Options tab.

• If you are using bulk loading, parallel loading is available only in the Oracleand Informix environment.

• When you use a template table, bulk loading is disabled.

Related Topics

• Reference Guide:Target

Bulk loading in Oracle

Data Services supports Oracle bulk loading.

Bulk-loading methods

You can bulk load to Oracle using an API or a staging file:

• If you select the API method, Data Services accesses the direct pathengine of Oracle's database server associated with the target table andconnected to the target database. Using Oracle's Direct-Path Load API,


Using Bulk Loading9 Bulk loading in Oracle



Note: The Enable partitioning check box does not appear on the Options

tab if the target table is not partitioned.

Bulk-loading scenarios

With two bulk-loading methods, two load modes, and two parallel load options,there are several scenarios you can configure:

Parallel Load Op-

tionsLoad ModeMethodScenario

Enable partitions

is not selected(One loader isused by default)

Direct-path API1

Enable partitionsis selected

Direct-path API2

Number of load-ers = 1

Direct-pathFile3

Number of load-ers > 1

Direct-pathFile4


Direct-pathFile5

Number of load-

ers = 1ConventionalFile6

Number of load-ers > 1

ConventionalFile7


Using Bulk Loading9 Bulk loading in Oracle



Parallel Load Op-

tionsLoad ModeMethodScenario


ConventionalFile8

Here are some tips for using these scenarios:

• The API method always uses the direct-path load type, and when it isused with a partitioned target table, Data Services processes loads inparallel. Data Services instantiates multiple loaders based on the number of partitions in a table. Each loader receives rows that meet the conditionsspecified by the partition.

• With the File method, direct-path is faster than conventional load, but theFile method is slower than using an API because of the need to generatea staging file, logs, and invoke Oracle's SQL*Loader.

• With the File method, when you use a value of greater than one for either the Number of Loaders or the Enable partitioning option, loads cannottruly run in parallel. The creation of a staging file and log for each loader is serialized.

Using bulk-loading options

As seen in the table on the previous page, there are many ways to set upbulk loading for an Oracle database. The following sections describe twoscenarios in detail.

Direct-path loads using Number of Loaders and File method

In the Options tab of the target table editor, when you enter a value for Number of loaders, Data Services instantiates multiple loaders. Each loader receives rows equal to the amount specified in the Rows per commit boxon the Bulk Loader Options tab. The loaders pipe rows to a staging file,then call the SQL*Loader to load the staging file contents into the table.


9Using Bulk Loading

Bulk loading in Oracle



• Copy target directory—Defines the directory of the copy files when youenable both the database forward log recovery and select theRecoverable option. Data Services supports only the copy files optionfor the DB2 CLI load method.

• Data file on client computer —When you select this option, Data Serviceswrites to the data file on the DB2 client computer. Data Services doesnot need to FTP the data file because the DB2 client transfers the datadirectly to the DB2 server during the load process. To use this option:

• You must use DB2 Version 8.x or later.

• The target DB2 cannot be a DB2 enterprise (extended editionenvironment).

• The target table and database must not be partitioned.

• This option is only applicable if Data Services and DB2 are on differentservers.

When you execute the DB2 bulk load utility, DB2 automatically generatesthe following files:

• Local message file (named .log) in the bulk loader working directory. DB2writes output messages into this log file.

• "Dump" file (named .bad) in the DB2 server working directory. DB2 writesrejected input rows into this .bad file. If you clear the Data file on client

machine option, Data Services uses FTP to send the .bad file to the bulkloader working directory and deletes it after the load completessuccessfully.

Check the trace log to find either of these files.

Related Topics

• Reference Guide: Data Services Objects, Target tables (Teradata target

table options)

Using the import utility

Data Services also supports bulk loading in the DB2 Universal Database 5.2environment using the import utility. For Data Services to initiate DB2 bulkloading by this method directly, the Data Services Job Server and DB2 must


9Using Bulk Loading

Bulk loading in DB2 Universal Database



be located on the same system. If they are not, use the following procedureto initiate bulk loading:

1. Generate a control file and data file. Check Generate files only in the

target table editor on the Bulk Loader Options tab.2. Manually move the control file and data file to the system where the target

database is located.

3. Start the execution of the bulk loader manually.

Bulk loading in Sybase ASE

Data Services supports bulk loading of Sybase ASE databases through the

Sybase ASE bulk copy utility. For detailed information about the Sybase ASEbulk loader options and their behavior in the Sybase ASE DBMS environment,see the relevant Sybase ASE product documentation.

Bulk loading in Sybase IQ

Data Services supports bulk loading Sybase IQ databases via the SybaseIQ LOAD TABLE SQL command. For detailed information about the SybaseIQ LOAD TABLE parameters and their behavior in the Sybase IQ databaseenvironment, see the relevant Sybase IQ product documentation.

For Sybase IQ version 12.6 only, if the job server and Sybase IQ databaseserver are on the same computer, you can use named pipes for bulk loadinginstead of files. Using named pipes eliminates disk access time and usuallyimproves the performance of the bulk loader.

If the job server and Sybase IQ database server are not on the samecomputer, Data Services generates a data file and transfers it via FTP to thedatabase computer for loading. You set FTP parameters in the Sybase IQdatastore editor.

After a job executes, Data Services stores the Sybase IQ message and rowlogs in the Bulk loader directory specified in the datastore editor (regardlessof the setting for the JS and DB on same machine option). A data file willalso be present if you do not use the named pipe option.

• message log—Records constraint violations specified in the Error

handling section of the target table Bulk Loader Options tab.


Using Bulk Loading9 Bulk loading in Sybase ASE



RestrictionsAdvantagesFile OptionBulk loader

method

• User must pro-vide the load-ing script.

• Cannot useData Servicesparallel pro-cessing

Load utilities arefaster than IN-SERT statementsthrough the ODBCdriver.

File — Loads databy writing to a datafile that it passesto the Teradataserver.

Load utility

• User must pro-vide the load-ing script.

• Cannot useData Servicesparallel pro-

cessing• A job that uses

a generic pipeis notrestartable.

• Load utilitiesare faster thanINSERT state-

ments throughthe ODBC driv-er.

• Named pipesare faster thandata files be-cause:

• As soon asData Ser-vices writesto a pipe,

Teradatacan readfrom thepipe.

• On Win-dows, noI/O to an in-termediatedata file oc-curs be-cause a

pipe is inmemory

Generic named

pipe — Loads alarge volume of

data by writing toa pipe from whichTeradata reads.

Load utility


Using Bulk Loading9 Bulk loading in Teradata



RestrictionsAdvantagesFile OptionBulk loader

method

• Load utilitiesare faster thanINSERT orUP-SERT state-ments throughthe ODBC driv-er.

• Named pipesshould befaster than da-ta files be-

cause:• As soon as

Data Ser-vices writesto a pipe,Teradatacan readfrom thepipe.

• On Win-dows, no

I/O to an in-termediatedata file oc-curs be-cause apipe is inmemory

• The job isrestartable.For details, see"Automatically

recovering jobs" in the Da-

ta Services De-

signer Guide.

• Data Servicescreates the


9Using Bulk Loading

Bulk loading in Teradata



When you use the Warehouse Builder method, you can leverage DataServices' powerful parallel processing capabilities to specify a maximumnumber of files or pipes for Data Services to use in processing large quantitiesof data. For details, see Parallel processing with Teradata Warehouse Builder

on page 154.

With Teradata Warehouse Builder, you can choose from four types of loadoperators. Each operator processes your data in a slightly different way.

DescriptionOperator

Loads a large amount of data at highspeed into an empty table on the

Teradata RDBMS. Use this operator when initially loading tables in thedata warehouse.

Load

Inserts data into a specified existingtable on the Teradata RDBMS. Asingle SQL session can insert data,while other operators require multiplesessions. Because it uses the fewestRDBMS resources, this is the least

intrusive and slowest method of loading data.

SQL Inserter





To use Warehouse Builder bulk loading with Data Services

parallel processing

1. In your target table Options tab, specify the Number of loaders to controlthe number of data files or named pipes. Data Services will write data tothese files in batches of 999 rows.

2. In the Bulk Loader Options tab, choose Warehouse Builder as your bulkloader.

3. In File Option, choose the type of file (Data File, Generic named pipe,or Named pipe access module) to contain the data to bulk load.

4. If you chose Data Fileor Generic named pipein File Option, specify thenumber of read and load instances in the loading scripts.


9Using Bulk Loading




Teradata standalone utilities

In addition to the Warehouse Builder (Parallel Transporter) interface, DataServices supports several Teradata utilities that load to and extract from theTeradata database. Each load utility is a separate executable designed tomove data into a Teradata database. Choose from the following bulk loader utilities:

DescriptionUtility

Loads large quantities of data into

populated tables. MultiLoad alsosupports bulk inserts, updates, up-serts, and deletions against populat-ed tables.

MultiLoad on page 158

Loads unpopulated tables only. Boththe client and server environmentssupport FastLoad. Provides a high-performance load (inserts only) toone empty table each session.

FastLoad on page 162

Uses standard SQL/DML to maintaindata in tables. It also contains amethod that you can use to specifythe percentage of system resourcesnecessary for operations on tables.

Allows background maintenance for insert, update, upsert, and deleteoperations to take place at any time

you specify. Used with small datavolumes.

TPump on page 165


9Using Bulk Loading




DescriptionUtility

Invokes one of the above utilities(MultiLoad, FastLoad, or TPump) withthe interface prior to Data Servicesversion 11.5.1.

Load on page 177

MultiLoad

To bulk load a table using the Teradata MultiLoad utility1. Ensure that your Teradata datastore specifies a value in TdpId (Teradata

Director Program Identifier). This option identifies the name of the Teradatadatabase to load and is mandatory for bulk loading.

2. In the Bulk Loader Options tab of your target table editor, chooseMultiLoad in the Bulk loader drop-down list.

3. In File Option, choose the type of file (Data File, Generic named pipe,or Named pipe access module) to contain the data to bulk load. Thedefault is Data File.

4. You can specify the following MultiLoad parameters:

Short descriptionMultiLoad parameter

The default is not to reduce printoutput.

Reduced print output

The default is not to encrypt allsessions.

Data Encryption

Particular mapping betweencharac-ters and byte strings (such as ASCIIor UTF-8).

Character set





For more information about these parameters, see the Teradata MultiLoadReference.

5. In Attributes, you can usually use the default settings for the following

attributes in the MultiLoad script that Data Services generates.

Short descriptionScript attribute

Table in which Teradata stores theload job status. Specify the restartlog table that will maintain thecheckpoint information for your MultiLoad job.

LogTable

Identifier, of up to 30 characters,associated with the user name thatwill logon to the Teradata database.

AccountId

Teradata uses this table to stageinput data.

WorkTable

Teradata uses this table to storeerrors that it detects during the ac-quisition phase of the MultiLoadimport task.

ErrorTable1

Teradata uses this table to storeerrors that it detects during the ap-plication phase of the MultiLoadimport task.

ErrorTable2


9Using Bulk Loading





Maximum number of rejectedrecords that Teradata can write tothe error table 1 during the acquisi-tion phase of the MultiLoad importtask. If used with ErrorPercentage,ErrorLimit specifies the number of records that must be sent to theTeradata database beforeErrorPer-

centage takes effect.

ErrorLimit

Approximate percentage (ex-pressed as an integer) of totalrecords sent so far (ErrorLimit) tothe Teradata database that the ac-quisition phase might reject.

ErrorPercentage

Interval between checkpoint opera-tions during the acquisition phase.Express this value as either:

• The number of rows read fromyour client system or sent to theTeradata database

• An amount of time in minutes

CheckpointRate

Maximum number of MultiLoadsessions for the load job.

MaxSessions

Minimum number of MultiLoadsessions required for the load jobto continue.

MinSessions






Number of hours that the MultiLoadutility continues trying to logonwhen the maximum number of load

jobs are already running on theTeradata database.

TenacityHours

Number of minutes that the Multi-Load utility waits before it retries alogon operation. The default is six

minutes.

TenacitySleep

Number of hours that MultiLoadcontinues trying to start when oneof the target tables is being loadedby some other job.

TableWait

Specifies how MultiLoad should re-spond when an Access Module

Processor (AMP) is down.

AmpCheck

Select IgnoreDuplicate to not placeduplicate rows in error table 2. Thedefault is to load the duplicate rows.

IgnoreDuplicate

Note: By default, Data Services uses the bulk loader directory to storethe script, data, error, log, and command (bat) files.

For more information about these parameters, see the Teradata MultiLoadReference.

6. If you specified Data file in File Option, you can increase the Number

of loaders in the Options tab which increase the number of data files.Data Services can use parallel processing to write data to multiple datafiles in batches of 999 rows.


9Using Bulk Loading




If you specified Generic named pipe or Named pipe access module, DataServices supports only one loader and disables the Number of loadersoption.

Related Topics

• Reference Guide: Data Services Objects, DBC driver manager for UNIX

• Reference Guide: Data Services Objects, Target tables (Teradata target

table options)

FastLoad

To bulk load a table using the Teradata FastLoad utility

1. Ensure that your Teradata datastore specifies a value in TdpId (TeradataDirector Program Identifier). This option identifies the name of the Teradatadatabase to load and is mandatory for bulk loading.

2. In the Bulk Loader Options tab of your target table editor, chooseFastLoad in the Bulk loader drop-down list.


4. You can specify the following FastLoad parameters:

Short descriptionFastLoad parameter

Encrypt data and requests in allsessions used by the job.

The default is not to encrypt allsessions.

Data Encryption

Prints every request sent to the

Teradata database. The default isnot to reduce print output.Print all requests






Number of kilobytes for the outputbuffer that FastLoad uses for mes-sages to the Teradata database.The default is 63 kilobytes which isalso the maximum size.

Buffer size


Character set

For more information about these parameters, see the Teradata FastLoadReference.

5. In Attributes, you can usually use the default settings for the followingattributes in the FastLoad script that Data Services generates.


Identifier, of up to 30 characters,associated with the user name thatwill logon to the Teradata database.

AccountId

The number of rows sent to theTeradata database betweencheckpoint operations. The defaultis not to checkpoint.

CheckpointRate

Maximum number of rejectedrecords that Teradata can write tothe error table 1 while inserting intoa FastLoad table.

ErrorLimit


9Using Bulk Loading





FastLoad uses this table to storerecords that were rejected for errorsother than unique primary index or duplicate row violation.

ErrorTable1

FastLoad uses this table to storerecords that violated the uniqueprimary index constraint.

ErrorTable2

Maximum number of FastLoadsessions for the load job.

MaxSessions

Minimum number of FastLoad ses-sions required for the load job tocontinue.

MinSessions

Number of hours that the FastLoad

utility continues trying to logonwhen the maximum number of load jobs are already running on theTeradata database.

TenacityHours

Number of minutes that the Fast-Load utility waits before it retries alogon operation. The default is sixminutes.

TenacitySleep


For more information about these parameters, see the Teradata FastLoadReference.








Related Topics


TPump

To bulk load a table using the Teradata TPump utility

1. Ensure that your Teradata datastore specifies a value in TdpId (TeradataDirector Program Identifier). This option identifies the name of the Teradatadatabase to load and is mandatory for bulk loading.

2. In the Bulk Loader Options tab of your target table editor, choose TPump

in the Bulk loader drop-down list.


4. You can specify the following TPump parameters:


Reduce the print output of TPumpto the minimal information requiredto determine the success of the job.

The default is not to reduce printoutput.

Reduced print output


9Using Bulk Loading





Keep macros that were createdduring the job run. You can usethese macrosas predefined macrosfor subsequent runs of the same

job.

Retain Macros

Encrypt data and requests in allsessions used by the job.

The default is not to encrypt all

sessions.

Data Encryption

Number of request buffers thatTPump uses for SQL statements tomaintain the Teradata database.

Number of buffers


Character set

Configuration file for the TPump job.

Configuration file

Controls the rate at which TPumptransfers SQL statements to theTeradata database. Value can bebetween 1 and 600, which specifies

the number of periods per minute.The default value is 4 15-secondperiods per minute.

Periodicity value






Turns on verbose mode which pro-vides additional statistical data inaddition to the regular statistics.

Print all requests

For more information about these parameters, see the Teradata ParallelData Pump Reference.

5. In Attributes, you specify Data Services parameters that correspond toTeradata parameters in TPump commands. You can usually use the

default settings for the following parameters in the TPump script that DataServices generates.

Short description

DataServices param-

eter in Attributes

pane

TPump command

Identifier, of up to 30characters, associated

with the user name thatwill logon to the Terada-ta database.

AccountIdNAME


9Using Bulk Loading




Short description

DataServices param-

eter in Attributes

pane

TPump command

Use the error tablespecified in ErrorTable.If the table does notexist, TPump createsit. If the structure of theexisting error table isnot compatible with theerror table TPump cre-

ates, the job will run in-to an error whenTPump tries to insert or update the error table.

AppendBEGIN LOAD

Number of minutes be-tween checkpoint oper-ations. Value must bean unsigned integer from 0 through 60, in-

clusive.

The default is tocheckpoint every 15minutes.

CheckpointRateBEGIN LOAD





Short description

DataServices param-

eter in Attributes

pane

TPump command

Maximum number of rejected records thatTPump can write to theerror table while main-taining a table. The de-fault is no limit.

If you specify ErrorPer-centage , ErrorLimit

specifies the number of records that must besent to the Teradatadatabase beforeError-

Percentage takes ef-fect. For example, if ErrorLimit is 100 andErrorPercentage is 5,100 records must besent to the Teradata

database before theapproximate 5% rejec-tion limit is applied. If only 5 records were re-

jected when the 100threcord is sent, the limitis not exceeded. How-ever, if six recordswere rejected when the100th record is sent,TPump stops process-ing because the limit isexceeded.

ErrorLimitBEGIN LOAD


9Using Bulk Loading




Short description

DataServices param-

eter in Attributes

pane

TPump command

Integer value that repre-sents the approximatepercent of the totalnumber of records sentto the TeradataDatabase that might berejected during theTPump task. You can-

not specify this param-eter without ErrorLimit.

ErrorPercentageBEGIN LOAD

Name of the table inwhich TPump storesinformation about er-rors and the rejectedrecords.

ErrorTableBEGIN LOAD

Name of macro to exe-cute. Using predefinedmacros saves time be-cause TPump does notneed to create anddrop new macros eachtime you run a TPump

job script.

ExecuteMacroEXECUTE

Select Ignore dupli-

cate inserts to notplace duplicate rows inthe error table.

Ignore duplicate in-

sertsDML LABEL





Short description

DataServices param-

eter in Attributes

pane

TPump command

Character string thatidentifies the name of a job. The maximumlength is 16 characters.

JobNameNAME

Number of secondsthat the oldest record

resides in the buffer before TPump flushesit to the Teradatadatabase. Value can-not be less than onesecond.

If the SerializeOn isnot specified, only thecurrent buffer can pos-sibly be stale. If you

specify SerializeOn,the number of stalebuffers can range fromzero to the number of sessions.

LatencyBEGIN LOAD

Name of the table touse to write checkpointinformation that is re-quired for the safe and

automatic restart of aTPump job.

The default name hasthe following format:

owner.table _LT

LogTableOther TPump com-mands


9Using Bulk Loading




Short description

DataServices param-

eter in Attributes

pane

TPump command

Name of database tocontain any macrosTPump uses or builds.The default is to placemacros in the samedatabase that containstheTPump target table.

MacroDatabaseBEGIN LOAD

Maximum number of sessions for TPump touse to update the Tera-data database. DataServices uses a defaultof 3.

MaxSessionsBEGIN LOAD

Minimum number of sessions for TPump to

use to update the Tera-data database.

MinSessionsBEGIN LOAD

Do not drop the error table, even if it is emp-ty, at the end of a job.You can use NoDrop

with Append to persistthe error table, or youcan use it alone.

NoDropBEGIN LOAD





Short description

DataServices param-

eter in Attributes

pane

TPump command

Prevents TPump fromchecking for statementrate changes from, or update status informa-tion for, the TPumpMonitor.

NoMonitor BEGIN LOAD

Prevents TPump fromterminating because of an error associatedwith a variable-lengthrecord.

NoStopIMPORT INFILE

Number of SQL state-ments to pack into amultiple-statement re-quest. The default is 20

statements per re-quest. The maximumvalue is 600.

PackBEGIN LOAD

Select PackMaximumto have TPump dynam-ically determine thenumber of records topack within one re-quest. The maximum

value is 600.

PackMaximumBEGIN LOAD


9Using Bulk Loading




Short description

DataServices param-

eter in Attributes

pane

TPump command

Initial maximum rate atwhich TPump sendsSQL statements to theTeradata database.Value must be a posi-tive integer. If unspeci-fied, Rate is unlimited.

RateBEGIN LOAD





Short description

DataServices param-

eter in Attributes

pane

TPump command

Specifies whether or not to use robustrestart logic. Value canbe YES or NO.

• NO specifies simplerestart logic, whichcause TPump tobegin where the last

checkpoint occurredin the job. TPumpredoesanyprocess-ing that occurredafter the check-point.

• YES specifies robustrestart logic, whichyou would use for DML statements

that change the re-sults when you re-peat the operation.Examples of suchstatements includethe following:

INSERTs into ta-bles which allowduplicate rows

UPDATE foo SET A=A+1...

RobustBEGIN LOAD


9Using Bulk Loading




Short description

DataServices param-

eter in Attributes

pane

TPump command

Specify a comma sepa-rated list of columns touse as the key for rowsand guarantee that op-erations on these rowsoccur serially.

SerializeOnBEGIN LOAD

Number of hours thatthe utility tries to log onsessions required toperform the TPump

job. The default is four hours.

TenacityHoursBEGIN LOAD

Number of minutes thatTPump waits before itretries a logon opera-

tion. The default is sixminutes.

TenacitySleepBEGIN LOAD


For more information about these parameters, see the Teradata ParallelData Pump Reference.








Related Topics


Load

To bulk load a Teradata table using the Load utility

1. In the Bulk Loader Options tab of your target table editor, choose Load

in the Bulk loader drop-down list.


3. Enter a command to be invoked by Data Services in the Command linetext box. For example, fastload<C:\tera_script\float.ctl

4. If you chose Data File in File Option, enter (or browse to) the directorypath where you want Data Services to place your data file.

5. If you chose Generic named pipe or Named pipe access module inFile Option, enter the pipe name.

Bulk loading in Netezza

Data Services supports bulk loading to Netezza Performance Servers.

For detailed information about Netezza loading options and their behavior in theNetezza environment, see the relevant Netezza product documentation.

Netezza recommends using the bulk-loading method to load data for faster performance. Unlike other bulk loaders, the Data Services bulk loader for Netezza supports UPDATE and DELETE as well as INSERT operations,which allows for more flexibility and performance.

Netezza bulk-loading process

To bulk load to a Netezza target table, Data Services:

• Creates an external table that is associated with a local file or namedpipe

• Loads data from the source into the file or named pipe


9Using Bulk Loading




• Loads data from the external table into a staging table by executing anINSERT statement

• Loads data from the staging table to the target table by executing a set

of INSERT/UPDATE/DELETE statements

Options overview

From the Bulk Loader Options tab of your Netezza target table editor, selectone of these methods depending on your Netezza environment:

• Named pipe—Data Services streams data as it is written to the namedpipe through the external table to the staging table. For files that are larger

than 4 GB in size, select this option for faster performance.• File—Data Services writes the data to a file before loading through the

external table to the staging table. For files that are smaller than 4 GB insize, select this option for faster performance.

• None—Data Services does not use bulk loading.

Because the Data Services bulk loader for Netezza also supports UPDATEand DELETE operations, the following options (on the target table editor Options tab) are also available for Netezza bulk loading.

Note: These are the only options that will be available on the Options tab.• Column comparison

• Number of loaders

• Use input keys

• Update key columns


Using Bulk Loading9 Bulk loading in Netezza



• Auto correct load

Related Topics

• Reference Guide:Target tables

Netezza log files

When writing from the external table to the staging table, Netezza generateslogs (nzlog and nzbad files) and writes them to a database server workingdirectory. You can use these logs to troubleshoot your jobs. (If you do notenter a Database server working directory in the datastore editor, Netezzauses the temp directory on its server, /tmp, to store the nzlog and nzbad

files.)

For Data Services to access and manage these logs, configure the FTPparameters in the datastore editor. After a load, Data Services copies thesefiles from the specified Netezza Database server working directory to thespecified Data Services Bulk loader directory and deletes them from theNetezza server.

For successful loads, Data Services then deletes these log files from theBulk loader directory (assuming the Clean up bulk loader directory after

load option is checked in the target table editor).

For failed loads, Data Services does not delete the log files from the Bulkloader directory even if the Clean up bulk loader directory after load

option is checked in the target table editor.

Configuring bulk loading for Netezza

To configure the datastore:

1. Create a new ODBC datastore.

2. Specify the options for the Netezza database by selecting the Databasetype: ODBC.

3. Click Advanced.

4. Click in the field to the right of Bulk loader directory and type thedirectory path or click Browse to where Data Services should write SQLand data files for bulk loading.


9Using Bulk Loading




5. Expand the FTP category and enter the FTP host name, login user name,login password, and host working directory.

Data Services uses these options to transfer the Netezza nzlog and nzbad

files.Note: If this datastore is not being used specifically for Netezza bulkloading, Data Services ignores any FTP option entries.

6. If you are loading non-ASCII character data, set the Code page to latin-9.

If you are loading mulitbyte data, set the Code page to utf-8.

7. Click OK or Apply.

Related Topics

• Reference Guide: ODBC • Designer Guide: Defining a database datastore

To configure the target table

For descriptions of all Netezza bulk-loading options.

1. Create a data flow with a Netezza target table.

2. Double-click the target table to open its editor.

3. Click the Bulk Loader Options tab.4. For Bulk load, from the drop-down list select a bulk-loading method (File

or Named pipe).

5. Edit the other options as necessary.

Click the Options tab. For Netezza databases, only the following optionsare available. Edit them as necessary.

• Column comparison

• Number of loaders

• Use input keys

• Update key columns

• Auto correct load





Related Topics

• Reference Guide:Target tables


9Using Bulk Loading




Other Tuning Techniques

10



The previous chapters describe the following tuning techniques:

• Maximizing push-down operations

• Using Caches

• Using Parallel Execution

• Distributing Data Flow Execution

• Using Bulk Loading

This section describes other tuning techniques that you can use to adjustData Services performance:

• Source-based performance options

• Join ordering

• Minimizing extracted data

• Using array fetch size

• Target-based performance options

• Loading method

• Rows per commit

• Job design performance options

• Loading only changed data

• Minimizing data type conversion

• Minimizing locale conversion

• Precision in operations

These techniques require that you monitor the change in performancecarefully as you make small adjustments.

Related Topics

• Maximizing Push-Down Operations on page 41

• Using Caches on page 63

• Using parallel Execution on page 77

• Distributing Data Flow Execution on page 101

• Using Bulk Loading on page 123


Other Tuning Techniques10 Source-based performance options



• Join ordering on page 185

• Minimizing extracted data on page 188

• Using array fetch size on page 189

• Loading method on page 190• Rows per commit on page 191

• Job design performance options on page 192

• Minimizing data type conversion on page 192

• Minimizing locale conversion on page 193

• Precision in operations on page 193

• Measuring Data Services Performance on page 23

Source-based performance options

Join ordering

Data Services reads sources in preparation for the join in an order determinedby a "depth first traversal" of a tree starting from the root node. Controllingthis order—referred to as the join order in this description—can often havea profound effect on the performance of producing the join result. The joinorder has no effect on the actual result produced.

Related Topics

• How join ranks affect join order on page 185

• How join ranks are used in outer joins on page 186

How join ranks affect join order

Consider these examples without join ranks set:

In this join tree, there is only onebranch to traverse. Table A is readthen joined with Table B. The resultsfrom that join are joined with TableC.


1Other Tuning Techniques




In this join tree, the longest branch istraversed first. Table J is read, then

joined with Table L. The results are joined with Table M. The results arethen joined with Table K.

You can use join ranks to control the order in which the sources are joined.Data Services joins the child nodes in the order of decreasing join ranks.The join rank associated with a node is the maximum join rank of any nodefurther down the branch.

For example, suppose you are joining four tables in an outer join—tables J

through M—and you assign each a different join rank.

In this join tree, the join rank of Table L is considered 3 because the join rankof Table M is greater than the original join rank of Table L. However, the joinrank of Table K (5) is still larger than the join rank of Table L. The join order is as follows: Table J is read and joined with Table K. The results are joinedwith Table L. The result of that join is joined with Table M.

Note: If the value of Join rank is 0 and you collected cache statistics for thesources involved in the join, the optimizer determines the join ranks basedon the cache statistics.

How join ranks are used in outer joins

Join ordering is determined by the type of join you choose to use in DataServices. There are two types of joins:

• Normal — Created by drawing connection lines from multiple sources ina data flow to the same query and then entering an integer as a Join

rank value for each source using source editors. Data Services orders join operations using these values.





For a normal join between three or more tables, Data Services internallyimplements a series of two-way joins. First, it joins the two highest rankedtables together using the join ranking rules outlined above. Next, DataServices joins the result of the first join with the third highest join rankedtable. This process continues until all tables are included in the join.

• Outer Join — Created by taking a normal join and then using the Outer

Join tab of the query to specify the outer and inner sources. Data Servicesprocesses the outer source as if it was assigned the highest join rank.

For an Outer join, Data Services ignores the Join rank values unless youare using three or more tables and a pair of joins share the same outer

join table.

For example, if table A is an outer join to tables B and C, the join rank iscalculated by taking the maximum join rank value of the two tables ineach outer join pair (A and B and A and C). The pair with the higher joinrank becomes the outer loop.

Data Services implements joins as nested loop joins. The source with thehigher join rank or the one specified as an Outer Source in the Outer Jointab becomes an outer loop. If all the ranks are equal or not set (defaults to0), then Data Services picks an inner and outer source at random. During

job execution, Data Services reads the source in the outer loop one time;and reads the source in the inner loop for each row in the outer loop.Performance improves when fewer rows are read.

The number of output rows for a join depends on the type of join you use ina data flow.

• For a Normal join, the output is limited to the number of rows that meet join conditions.

• For an Outer join, the output is equal to or greater than the number of rows in the source specified as the Outer Source.






Using an Outer join is often preferable with real-time jobs because oftenyou want the whole message passed on whether or not conditions a joinlooks for in the inner source exist.

In fact, if you do not use an Outer Join to order joins for a message, DataServices will still process the message as if it has the highest join rank.The message source editor lacks a Join rank option because DataServices automatically gives it the highest join rank.

Related Topics

• Reference Guide: Query

Tips

For a join between two tables, assign a higher join rank value to the larger table and, if possible, cache the smaller table.

For a join between a table and file:

• If the file is small and can be cached, then assign it a lower join rank valueand cache it.

• If you cannot cache the file, then assign it a higher join rank value so thatit becomes the "outer table" in the join.

For a join between two files, assign a higher rank value to the larger file and,if possible, cache the smaller file.

Minimizing extracted data

The best way to minimize the amount of data extracted from the sourcesystems is to retrieve only the data that has changed since the last time youperformed the extraction. This technique is called changed-data capture.

Related Topics

• Designer Guide: Techniques for Capturing Changed Data





Using array fetch size

Data Services provides an easy way to request and obtain multiple data rowsfrom source databases. The array fetch size feature allows you to retrievedata using fewer requests, thereby significantly reducing traffic on your network. Tuning array fetch size can also reduce the amount of CPU use onthe Job Server computer.

The Data Services array fetch feature lowers the number of databaserequests by "fetching" multiple rows (an array) of data with each request.Enter the number of rows to fetch per request in the Array fetch size optionon any source table editor or SQL transform editor. The default setting is1000, meaning that with each database request, Data Services will

automatically fetch 1000 rows of data from your source database. Themaximum array fetch size that you can specify is 5000 bytes.

Business Objects recommends that you set the array fetch size based onnetwork speed.

Note: Higher array fetch settings will consume more processing memoryproportionally to the length of the data in each row and the number of rowsin each fetch.

Regardless of the array fetch setting, sources reading columns with an Oracle

LONG data type cannot take advantage of this feature. If a selected datacolumn is of type LONG, the array fetch size internally defaults to 1 row per request.

To set the Array fetch size parameter

1. Use either a source table editor or an SQL transform editor.

To use a source table editor:

a. Double-click a source table in the Designer's workspace.

b. In the Performance section of the Source tab, enter a number in theArray fetch size text box.

To use an SQL transform editor:

a. Double-click an SQL transform in the Designer's workspace.

b. In the SQL transform editor, enter a number in the Array fetch size

text box.






Array Fetch Size indicates the number of rows returned in a single fetchcall to a source table. The default value is 1000. This value reduces thenumber of round-trips to the database and can improve performance for table reads.

The Array Fetch Size option does not support long column data types. If the SELECT list contains a long column, Data Services sets the ArrayFetch Size to 1 and reads one row of data at a time from the database.

2. Click OK.

Target-based performance options

Loading method

You can choose to use regular loading or bulk loading with Data Services.For a regular load, the Parameterized SQL option is automatically selectedby Data Services when it must generate, parse, and compile the statement.By using parameterized SQL, Data Services can minimize these efforts byusing one handle for a set of values instead of one handle per value.

You cannot combine bulk loading with the following options:

• Auto-correct load

• Enable Partitioning

• Number of Loaders

• Full push down to a database

Data Services automatically selects this optimizer process when thefollowing conditions are met:

• The source and target in a data flow are on the same database

• The database supports the operations in the data flow

If the optimizer pushes down source or target operations, then it ignoresthe performance options set for sources (Array fetch size, Caching, andJoin rank) because Data Services is not solely processing the data flow.

• Overflow file

• Transactional loading


Other Tuning Techniques10 Target-based performance options



To improve performance for a regular load (parameterized SQL), you canselect the following options from the target table editor. Note that if you useone, you cannot use the others for the same target.

• Enable PartitioningParallel loading option. The number of parallel loads is determined by thenumber of partitions in the target table.

• Number of Loaders

Parallel loading option. The number of parallel loads is determined by thenumber you enter for this option.

Related Topics

• Push-down operations on page 42

• Table partitioning on page 79

• Using Bulk Loading on page 123

Rows per commit

Rows per commit for regular loading defaults to 1000 rows. Setting theRows per commit value significantly affects job performance. Adjust therows per commit value in the target table editor's Options tab, noting the

following rules:• Do not use negative number signs and other non-numeric characters.

• If you enter nothing or 0, the text box will automatically display 1000.

• If you enter a number larger than 5000, the text box automatically displays5000, which is the maximum value allowed.

Business Objects recommends setting rows per commit between 500 and2000 for best performance. You might also want to calculate a value. To dothis, use the following formula:

max_IO_size/row size (in bytes)

For most platforms, max_IO_size is 64K. For Solaris, max_IO_size is 1024K.For a formula to calculate row size, see Caching sources on page 65.

Note that even with a value greater than one set for Rows per commit, DataServices will submit data one row at a time if the following conditions exist:



Target-based performance options



• You set Rows per commit too high for your environment

• You are using an overflow file

• The initial transaction fails

In this case, performance becomes worse than setting Rows per commit

to 1.

Job design performance options

Loading only changed data

Identifying and loading only changed data is called changed-data capture(CDC), which includes only incremental data that has changed since the lastrefresh cycle. Performance improves because with less data to extract,transform, and load, the job typically takes less time.

Related Topics

• Designer Guide: Techniques for Capturing Changed Data

Minimizing data type conversion

Data Services offers very robust and easy-to-use data type conversions viacolumn mappings of different data types. Business Objects recommendsthat you:

• Avoid unnecessary data conversions.

• Verify that Data Services is performing the implicit conversions (selectedwhen you drag and drop columns from input to output schemas in thequery transform) as expected. This can be done by looking at the warningsgenerated during job validation.


Other Tuning Techniques10 Job design performance options



Minimizing locale conversion

If your jobs do not require the use of different or multi-byte locales, you canincrease performance by ensuring that locales are single-byte and notmismatched.

Precision in operations

Data Services supports the following precision ranges: 0-28, 29-38, 39-67,68-96. Note that as you decrease precision, performance increases for arithmetic operations and comparison operations. In addition, when

processing an arithmetic or boolean operation that includes decimals indifferent precision ranges, Data Services converts all to the highest precisionrange value because it cannot process more than one decimal data typeprecision range for a single operation. For example, if Data Services mustperform an arithmetic operation for decimals with precision 28 and 38, itconverts both to precision 38 before completing the operation.



Job design performance options



Index

A

aggregations, pushing to database 43al_engine process 24al_jobserver process 24analyzing trace log files 30array fetch size 189auto correct load

enabling push down 53

B

bulk loadingDB2 129DB2, using import utility 137Oracle 124Oracle, conventional-path load 125Oracle, direct-path load 125Sybase 138

Teradata 139Teradata, Load Utilities 157Teradata, Warehouse Builder 151using 177vs. regular loading 190

C

caching joins 66Table_Comparison transform 68tables 65using a lookup function 67

caching data 39, 64caching lookups vs. setting source table as

outer join 67

changed-data capture, performance, using for 188

conventional-path loading in Oracle 125

D

datacapturing changes 192

Data Services, optimizing data transformations42

data transformationsoptimize by caching 28, 39, 64optimizing 42pushing to source database 47

data type conversion 192database links

and datastores 57defined 57example data flow 59

importing 58software support 58tuning performance 61

datastoresadding database link support 58and database links 57persistent cache 69

DB2, bulk loading 129degree of parallelism, enabled for functions 92degree of parallelism, for data flows 86

Eextracting data

caching 64minimizing 188




F

filtering, pushing to database 43functions, pushing to database 43

I

import utility for DB2 bulk loading 137IPC communications

between data flows 120between sub data flows 121

J

join ordering 185 join rank

outer joins 185 joins, caching 66 joins, pushing to database 43

L

linked datastorescreating 58example data flow 59

tuning performance 61Loading 190logs

performance monitor 71lookup function, caching 67

M

minimize data extracted 188monitor log files, performance,

href="../perf_opt/measuring with 31

O

operational dashboards 34optimizing data transformations 42

Oraclebulk loading 124bulk loading, conventional-path load 125bulk loading, direct-path loading 125parallel loading 127

ordering, pushing to database 43outer joins

rank in 185

P

parallel execution 78, 99degree of parallelism 86Degree Of Parallelism 86, 107

degree of parallelism, enabled for functions92

file multi-threading 96for different data flows and work flows 78partitioned tables 79partitioned tables, creating 81partitioned tables, enabling 84using table partitioning and DOP 94within a single data flow 79, 99

parallel loading in Oracle 127peer-to-peer communications

between data flows 120between sub data flows 121

performancechanged-data capture and 192data transformations and 47improving, DS environment 18, 22improving, with bulk loading 177improving, with parallel execution 78, 99measuring, with DI options 25measuring, with DS options 34tuning, with DS job design options 39, 184tuning, with DS source options 39, 184

tuning, with DS target options 39, 184performance monitor log 71Performance Monitor, reading execution

statistics 32persistent cache datastore 69


Index



projection, pushing to database 43pushing operations to database

example 45logic 42overview 42

R

rowsduplicate, avoiding loading 53

rows per commit 191rows, retrieving multiple 26, 189

Sserver groups

grid computing 117sizing tables 65sources

retrieving multiple rows 26, 189SQL

pushing to database 42Sybase, bulk loading 138

T

Table_Comparison transform, performance,improving 68

tablescaching 65retrieving multiple rows 26, 189sizing 65

Teradatabulk loading 139bulk loading using Load Utilities 157bulk loading using Warehouse Builder 151load utilities 158, 162, 165, 177

throughput, improving 39

trace log files, analyzing 30tuning techniques

array fetch size 189caching data 39, 64 join ordering 185minimize data extracted 188minimize data type conversion 192rows per commit 191


Index



Index

Xi3 Ds Performance Optimization Guide En

Documents

Transcript of Xi3 Ds Performance Optimization Guide En