Transcript of ds325ee
© 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
DataStage Enterprise Edition
Advanced Architecture and Best Practices
NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.
Last revision: June 22, 2004
Welcome!
This course is intended to provide:
• An overview of the development and runtime architecture of DataStage Enterprise Edition
• Recommendations for parallel job design and best practices

There is purposely a combination of baseline and advanced material:
• Most of this information does not exist in the current course offerings or DataStage documentation
• This material will eventually be rolled into future Essentials and Advanced course offerings
DataStage Enterprise Edition: Advanced Architecture and Best Practices
Agenda

Day 1
• Module 1: Parallel Framework Architecture
• Module 2: Partitioning, Collecting, and Sorting
• Module 3: The Parallel Job Score

Day 2
• Module 4: Best Practices and Job Design Tips
• Module 5: Environment Variables
• Module 6: Introduction to Performance Tuning
DataStage Enterprise Edition
Module 01: Parallel Framework Architecture
Paul Christensen, Solution Architect
Last revision: June 23, 2004
Why You Need to Know This
DataStage Client is a developer productivity tool
• The GUI is not intended as a replacement for understanding parallel, flow-based ETL design
• DataStage Designer includes intelligence to facilitate quick development of simple flows
• But this is a development environment, not Visio (picture drawing)

The key to mastering Enterprise Edition is understanding the DataStage Parallel Framework
• Parallel ETL is a fundamentally different process
• Complex, high-volume flows require an understanding of the underlying engine architecture
• For now (v7.x1), you'll ALWAYS need a copy of the "OEM" (Orchestrate) documentation, which documents the DataStage Parallel Framework
DataStage Enterprise Edition: Parallel Framework Architecture
DataStage Enterprise Edition Component Architecture

(Layered diagram, bottom to top:)
• UNIX operating system / networking; parallel hardware (SMP, Cluster, MPP)
• DataStage Parallel Application Framework and Runtime System
• Component layer: Ascential Data Management Components, Ascential Data Analysis Components, Transformer and BuildOp Components, Third Party Components
• Application layer: Ascential Applications (DataStage Client), Third Party Applications
Introduction to Enterprise Edition

Parallel processing = executing your application on multiple CPUs
Scalable processing = adding more resources (CPUs and disks) to increase system performance
• Example: a system containing 6 CPUs (or processing nodes) and disks
• Run an application in parallel by executing it on 2 or more CPUs
• Scale up the system by adding more CPUs
• New CPUs can be added as individual nodes, or added to an SMP node
Traditional Batch Processing

(Flow: Source (Operational Data, Archived Data) → Transform → Clean → Load → Data Warehouse, with disk staging between each step.)

Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
  • a 10 GB stream leads to 70 GB of I/O
  • processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
  • disk I/O consumes the processing
  • terabytes of disk required for temporary staging
Pipeline Multiprocessing

(Flow: Source (Operational Data, Archived Data) → Transform → Clean → Load → Data Warehouse, with rows streamed between stages.)

• Transform, clean, and load processes are executing simultaneously
• Rows are moving forward through the flow
• Start a downstream process while an upstream process is still running
• This eliminates intermediate storing to disk, which is critical for big data
• This also keeps the processors busy
• Still have limits on scalability

Think of a conveyor belt moving rows from process to process!
Partition Parallelism

(Diagram: Source Data is divided into subset1 through subset4, with Node 1 through Node 4 each running the same Transform on its own subset.)

Divide large data into smaller subsets (partitions) across resources
• Goal is to evenly distribute the data
• Some transforms require all data within the same group to be in the same partition
• Requires the same transform on all partitions
• BUT: each partition is independent of the others; there is no concept of "global" state

Facilitates near-linear scalability (corresponding to hardware resources)
• 8X faster on 8 processors
• 24X faster on 24 processors…
Enterprise Edition Combines Partition and Pipeline Parallelism

(Diagram: Source Data flows through Source → Transform → Clean → Load into the Data Warehouse, pipelined and partitioned simultaneously.)

Within the Parallel Framework, pipelining and partitioning are always automatic. The job developer need only identify:
• Sequential vs. parallel operations (by stage)
• Method of data partitioning
• Configuration file (there are advanced topics here)
• Advanced per-stage options (buffer tuning, combination, etc.)
Job Design vs. Execution

The user assembles the flow using DataStage Designer… at runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes).

No need to modify or recompile your job design!
Example: Three Types of Parallelism

• Explicit parallelism
• Implicit pipeline "parallelism"
• Implicit data-partition parallelism

(Diagram: a job flow with Sort, Derivation, Sample, Lookup, and Link Constraint stages, annotated to show where explicit, pipeline, and data-partition parallelism occur.)
Defining Parallelism

Execution mode (sequential/parallel) is controlled by stage definition and properties
• Default = parallel for most Ascential-supplied stages
• Can override the default in most cases through Advanced stage properties; examples where stage usage defines parallelism:
  • Sequential File reads (unless "number of readers per node" is set)
  • Sequential File targets
  • Oracle Enterprise sources (unless "partition table" is set)
  • others...

Degree of parallelism is determined by the configuration file
• Total number of logical nodes in the nameless default pool, or
• Nodes listed in [nodemap] or in a named [nodepool]
The Parallel Configuration File

Configuration files separate configuration (hardware/software) from job design
• Specified per job at runtime by $APT_CONFIG_FILE
• Alter hardware and resources without changing the job design

Defines # of nodes = logical processing units with corresponding resources (need not match physical CPUs)
• Dataset, scratch, and buffer disk (filesystems)
• Optional resources (e.g. database, SAS, etc.)
• Advanced topics ("pools" - named subsets of nodes)

Multiple configuration files should be used
• Optimize overall throughput and match job characteristics to overall hardware resources
• Provide a runtime "throttle" on resource usage on a per-job basis
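Since the configuration file is resolved per job at runtime via $APT_CONFIG_FILE, switching a job between configurations is just an environment change. A minimal sketch (the paths shown are hypothetical):

```shell
# Point the next run at a 4-node configuration; no recompile is needed.
export APT_CONFIG_FILE=/opt/ds/configs/4node.apt
echo "Running with config: $APT_CONFIG_FILE"
```

In practice this is typically supplied as a job-level environment-variable parameter rather than exported by hand.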
The Parallel Configuration File

  {
    node "n1" {
      fastname "s1"
      pool "" "n1" "s1" "app2" "sort"
      resource disk "/orch/n1/d1" {}
      resource disk "/orch/n1/d2" {"bigdata"}
      resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
      fastname "s2"
      pool "" "n2" "s2" "app1"
      resource disk "/orch/n2/d1" {}
      resource disk "/orch/n2/d2" {"bigdata"}
      resource scratchdisk "/temp" {}
    }
    node "n3" {
      fastname "s3"
      pool "" "n3" "s3" "app1"
      resource disk "/orch/n3/d1" {}
      resource scratchdisk "/temp" {}
    }
    node "n4" {
      fastname "s4"
      pool "" "n4" "s4" "app1"
      resource disk "/orch/n4/d1" {}
      resource scratchdisk "/temp" {}
    }
  }

Key aspects:
1. # of nodes defined (LOGICAL processing entities)
2. Resources assigned to each node (order of entries within each node is significant!)
3. Advanced resource optimizations and configuration (named pools, database, SAS)
DataStage Enterprise Edition: Job Compilation
DataStage Designer: Parallel Canvas Job Compilation

The DataStage Designer client generates all code
• Validates link requirements, mandatory stage options, transformer logic, etc.
• Generates an OSH representation of the job data flow and stages
  • GUI "stages" are representations of Framework "operators"
  • Stages in parallel shared containers are statically inserted in the job flow
  • Each server shared container becomes a dsjobsh operator
• Generates transform code for each parallel Transformer
  • Compiled on the DataStage server into C++ and then into corresponding native operators
  • To improve compilation times, previously compiled Transformers that have not been modified are not recompiled
  • Force Compile recompiles all Transformers (use after client upgrades)
• BuildOp stages must be compiled manually, within the GUI or using the buildop UNIX command line

(Diagram: the Designer client sends the job to the DataStage server, which compiles the generated OSH and the C++ Transformer components into an executable job.)
Viewing Generated OSH

Enable viewing of generated OSH in DS Administrator.

OSH is visible in:
- Job Properties
- Job run log
- View Data
- Table Definitions (Show Schema)

(The screenshot highlights the Operator, Schema, and Comments sections of the generated OSH.)
Example Stage / Operator Mapping

Within Designer, stages represent operators, but there is not always a 1:1 correspondence. Examples:

• Sequential File: import (source), export (target)
• DataSet: copy
• Sort (DataStage): tsort
• Aggregator: group
• Row Generator, Column Generator, Surrogate Key Generator: generator
• Oracle: oraread (source), oralookup (sparse lookup), orawrite (target load), oraupsert (target upsert)
• Lookup File Set (target): lookup -createOnly

See the "OEM" OperatorsRef.PDF.
Generated OSH Primer

OSH uses the familiar syntax of the UNIX shell to create applications for DataStage Enterprise Edition
• Designer inserts comment blocks to assist in understanding the generated OSH
• Note that operator order within the generated OSH is the order in which stages were added to the job canvas

Each operator consists of:
• operator name
• schema (for generator, import, export)
• operator options (use "-name value" format)
• input (indicated by n< where n is the input #)
• output (indicated by n> where n is the output #); may include modify

For every operator, input and/or output datasets (links) are numbered sequentially starting from 0. For example:
  op1 0> dst
  op1 1< src

The following operator input/output data sources are generated by DataStage Designer:
• Virtual data set (name.v)
• Persistent data set (name.ds or [ds] name)

The virtual data set (link) name is used to connect the output of one operator to the input of another.

Example of generated OSH for the first 2 stages of this job:

  #################################################
  ## STAGE: Row_Generator_0
  ## Operator
  generator
  ## Operator options
  -schema record(
    a:int32;
    b:string[max=12];
    c:nullable decimal[10,2] {nulls=10};
  )
  -records 50000
  ## General options
  [ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
  ## Outputs
  0> [] 'Row_Generator_0:lnk_gen.v';

  #################################################
  ## STAGE: SortSt
  ## Operator
  tsort
  ## Operator options
  -key 'a'
  -asc
  ## General options
  [ident('SortSt'); jobmon_ident('SortSt'); par]
  ## Inputs
  0< 'Row_Generator_0:lnk_gen.v'
  ## Outputs
  0> [modify (keep a,b,c;)] 'SortSt:lnk_sorted.v';
Terminology

  Framework                 DataStage
  ------------------------  -----------------------------
  schema                    table definition
  property                  format
  type                      SQL type + length [and scale]
  virtual dataset           link
  record / field            row / column
  operator                  stage
  step, flow, OSH command   job
  Framework                 DS engine

• The GUI uses both terminologies
• Log messages (info, warnings, errors) use Framework terms
DataStage Enterprise Edition: Runtime Architecture
Enterprise Edition Runtime Execution

The generated OSH and configuration file are used to "compose" a job SCORE, similar to the way an RDBMS builds a query optimization plan
• Identifies the degree of parallelism and node assignment for each operator
• Inserts sorts and partitioners as needed to ensure correct results
• Defines the connection topology (datasets) between adjacent operators
• Inserts buffer operators to prevent deadlocks (e.g. fork-joins)
• Defines the number of actual UNIX processes
  • Where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements

The job SCORE is used to fork UNIX processes with communication interconnects for data, messages, and control
• Set $APT_PM_SHOW_PIDS to show UNIX process IDs in the DataStage log

It is only after these steps that processing begins
• This is the "startup overhead" of an Enterprise Edition job

Job processing ends when:
• The last row (end of data) is processed by the final operator in the flow, or
• A fatal error is encountered by any operator, or
• The job is halted (SIGINT) by DataStage job control or human intervention (e.g. DataStage Director STOP)
Viewing the Job SCORE

• Set $APT_DUMP_SCORE to output the score to the DataStage job log
• For each job run, 2 separate score dumps are written to the log
  • The first score is actually from the license operator
  • The second score entry is the actual job score
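Both diagnostic variables mentioned in this module can be enabled together before a run; a sketch (the value 1 is an assumption, as any setting typically enables these switches in your environment):

```shell
# Write the job SCORE to the log, and tag log entries with UNIX PIDs.
export APT_DUMP_SCORE=1
export APT_PM_SHOW_PIDS=1
echo "APT_DUMP_SCORE=$APT_DUMP_SCORE APT_PM_SHOW_PIDS=$APT_PM_SHOW_PIDS"
```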
Example Job Score

Job scores are divided into two sections:
• Datasets: partitioning and collecting
• Operators: node/operator mapping

Both sections note sequential or parallel processing.

Why 9 UNIX processes?
Job Execution: The Orchestra

• Conductor - the initial Framework process, on the Conductor node
  - Score composer
  - Creates Section Leader processes (one per node)
  - Consolidates messages to the DataStage log
  - Manages orderly shutdown
• Section Leader (one per node)
  - Forks Player processes (one per stage)
  - Manages up/down communication
• Players
  - The actual processes associated with stages
  - Combined players: one process only
  - Send stderr and stdout to the Section Leader
  - Establish connections to other players for data flow
  - Clean up upon completion
• Default communication:
  - SMP: shared memory
  - MPP: shared memory (within a hardware node) and TCP (across hardware nodes)

(Diagram: the Conductor (C) oversees a Section Leader (SL) on each processing node; each Section Leader manages its Player (P) processes.)
Runtime Control and Data Networks

  $ osh "generator -schema record(a:int32) [par] | roundrobin | copy"

(Diagram: the Conductor connects through an APT_Communicator to Section Leaders 0, 1, and 2; each Section Leader forks its own generator and copy players. The control channel is TCP, while the stdout and stderr channels are pipes.)
Parallel Data Flow

Think of job runtime as a series of "conveyor belts" transporting rows for each link
• If the stage is parallel, each link will have multiple independent "belts" (partitions)

Row order is undefined ("non-deterministic") across partitions, or across multiple links
• Order within a particular link and partition is deterministic, based on partition type and, optionally, sort order

For this reason, job designs cannot include "circular" references
• e.g. cannot update a source or reference used in the same flow
DataStage Enterprise Edition: Data Types, Conversions, Nullability
Data Formats

The Framework processes only datasets. For external data, Enterprise Edition must perform conversion operations:
• Format translation using data type mappings
• May also require:
  • Recordization
  • Columnization

External data formats fall into two major categories:
• Automatic: the conversion is automatic or semi-automatic
  • data stored in a relational database (DB2, Informix, Oracle, Teradata)
  • data stored in a SAS data set
  • mapping rules are documented in OperatorsRef.pdf
• Manual: the user needs to manually specify formats
  • everything else: flat text files, binary files
  • use the Sequential File stage

(Diagram: external data is converted into the DataSet format on the way into the Parallel Framework, and converted back to external formats on the way out.)
Data Sets

Data Sets are the structured internal representation of data within the Parallel Framework. They consist of:
• Framework schema (format = name, type, nullability)
• Data records (data)
• Partitions (a subset of rows for each node)

Virtual Data Sets exist in memory
• Correspond to DataStage Designer links

Persistent Data Sets are stored on disk
• Descriptor file (metadata, configuration file, data file locations, flags)
• Multiple data files, one per node, stored in disk resource file systems
  (e.g. node1:/local/disk1/…, node2:/local/disk2/…)

There is no "DataSet" operator - the Designer GUI inserts a copy operator.
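The osh conventions shown in this module can be combined with a persistent Data Set target. The one-liner below is a sketch only (the output path is hypothetical, and exact generator options should be checked against the "OEM" documentation):

```
$ osh "generator -schema record(a:int32) -records 10 [par] > /tmp/example.ds"
```

The .ds descriptor file records the configuration file and the data file locations; the data itself lands in the disk resources of each logical node.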
When to Use Persistent Data Sets

When writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints)
• Stored in native internal format (no conversion overhead)
• Retain data partitioning and sort order (end-to-end parallelism across jobs)
• Maximum performance through parallel I/O

Data Sets are not intended for long-term or archive storage
• Internal format is subject to change with new DataStage releases
• Requires access to named resources (node names, file system paths, etc.)
• Binary format is platform-specific

For fail-over scenarios, servers should be able to cross-mount filesystems
• Can read a dataset as long as your current $APT_CONFIG_FILE defines the same NODE names (fastnames may differ)
• orchadmin -x lets you recover data from a dataset if the node names are no longer available
Caution on Using Plug-In Metadata

DataStage Server plug-ins do not always match the data type definitions used by native Enterprise database stages
• Do not use a plug-in to import Oracle table definitions
• Instead, use ORCHDBUTIL to import Oracle table definitions
Runtime Column Propagation

Runtime Column Propagation (RCP) allows you to define only part of your table definition (schema). When RCP is enabled, if your job encounters extra columns not defined in the metadata, it will adopt these extra columns and propagate them through the rest of the job.

• RCP must be enabled at the project level (it is off by default)
• Can then be enabled/disabled at the job level (Job Properties/Execution)
• Can also be enabled/disabled at the stage level (Output Columns)

RCP allows maximum re-use of parallel shared containers
• Input and output table definitions only need the columns required by the container stages
• A parallel shared container can be used by multiple jobs with different schemas, as long as the core input/output columns exist
• Must enable RCP in every stage within the parallel shared container
Output Mapping With RCP Disabled

When RCP is disabled (the default):
• DataStage Designer will enforce stage input column to output column mappings
• At job compile time, modify operators are inserted on output links in the generated OSH
Output Mapping With RCP Enabled

When RCP is enabled:
• DataStage Designer will not enforce mapping rules
• Modify is still inserted at compile time, but:
  • Columns are not removed from the output
  • Columns are not renamed unless explicitly dragged to the derivation

In this example, a runtime error occurs because Name will not map to NAME (RCP maps by case-sensitive column name). You must drag the column name to the derivation column.
Type Conversions

Enterprise Edition provides numerous conversion functions between source and target data types
• Default type conversions take place across the output mappings of any parallel stage when runtime column propagation is disabled for that stage
• Variable-length to fixed-length string conversions will pad the remaining length with ASCII NULL (0x0) characters
  • Use $APT_STRING_PADCHAR to change the default padding (also used by target Sequential File stages)
• Non-default type conversions require use of Transformer or Modify (recommended method)
• Look for warnings in the DataStage log that indicate unexpected conversions!
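The $APT_STRING_PADCHAR variable mentioned above can be set for the session; a sketch changing the pad character from ASCII NUL to a space (0x20 is hex notation for the space character):

```shell
# Pad fixed-length string conversions with spaces instead of 0x0 bytes.
export APT_STRING_PADCHAR=0x20
echo "pad character: $APT_STRING_PADCHAR"
```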
Source Type to Target Type Conversions

Legend:
  d = there is a default type conversion from the source field type to the destination field type
  e = you can use a Modify or a Transformer conversion function to convert from the source type to the destination type
  d e = both a default conversion and an explicit conversion function exist
  (a blank cell in the original matrix indicates that no conversion is provided)

Destination field types, in column order: int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, ustring, raw, date, time, timestamp

Source field type, followed by its row of conversion markers:
  int8:      d d d d d d d d d e d d e d e e e e
  uint8:     d d d d d d d d d d d d
  int16:     d e d d d d d d d d d d e d e
  uint16:    d d d d d d d d d d d e d e
  int32:     d e d d d d d d d d d d e d e e e
  uint32:    d d d d d d d d d d d e d e e
  int64:     d e d d d d d d d d d d d
  uint64:    d d d d d d d d d d d d
  sfloat:    d e d d d d d d d d d d d
  dfloat:    d e d d d d d d d d d e d e d e e e
  decimal:   d e d d d d e d d e d e d d e d e d e
  string:    d e d d e d d d e d d d d e d e d e e e
  ustring:   d e d d e d d d e d d d d e d e d e e
  raw:       e e
  date:      e e e e e e e
  time:      e e e e e d e
  timestamp: e e e e e e e

(The placement of blank cells was lost in transcription; the full matrix appears in the "OEM" OperatorsRef.pdf.)
Enterprise Edition Nullable Data

• Out-of-band: an internal data value marks a field as NULL
• In-band: a specific user-defined field value indicates a NULL
  • Required for Transformer processing
  • Disadvantage: must reserve a field value that cannot be used as valid data elsewhere in the flow
  • Examples: a numeric field's most negative possible value; an empty string

To convert a NULL representation from out-of-band to in-band and vice versa:
• Transformer stage:
  • Stage variables: IF ISNULL(linkname.colname) THEN … ELSE …
  • Derivations: SetNull(linkname.colname)
• Modify stage:
  • destinationColumnName = handle_null(sourceColumnName, value)
  • destinationColumnName = make_null(sourceColumnName, value)
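As a concrete sketch using the Modify syntax above (the column names and the reserved value -1 are hypothetical), converting an out-of-band NULL in a nullable balance column into the in-band value -1, and the reverse:

```
balance_ib = handle_null(balance, -1)
balance_ob = make_null(balance_ib, -1)
```

Remember the in-band caveat above: once -1 is chosen as the NULL marker, it cannot appear as valid data elsewhere in the flow.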
Null Transfer Rules
When mapping between source and destination columns of different nullability, the following rules apply:
Source Field      Destination Field    Result
not_nullable      not_nullable         Source value propagates to destination.
nullable          nullable             Source value or NULL propagates.
not_nullable      nullable             Source value propagates; destination value is never NULL.
nullable          not_nullable         WARNING messages in the log. If the source value is NULL, a fatal error occurs. Must handle in a Transformer or Modify stage.
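The four rules above can be sketched as a small Python function, with None standing in for NULL. The behavior shown for the nullable-to-not_nullable case (an error when the source value is NULL) mirrors the fatal runtime error unless the null is handled first. This is an illustrative model, not DataStage code:

```python
def transfer(value, src_nullable, dst_nullable):
    """Model of the null transfer rules; None stands in for NULL."""
    if value is None and not src_nullable:
        raise ValueError("a non-nullable source field cannot hold NULL")
    if value is None and not dst_nullable:
        # nullable -> not_nullable with a NULL value: fatal at runtime,
        # so the null must be handled in a Transformer or Modify first
        raise ValueError("NULL into a non-nullable column")
    return value   # every other combination propagates the value (or NULL)

ok = transfer(7, src_nullable=False, dst_nullable=True)             # 7 propagates
passthrough = transfer(None, src_nullable=True, dst_nullable=True)  # NULL propagates
```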
43April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
NULLS and Sequential Files
For NULLABLE columns, the following properties are used when reading from or writing to Sequential Files:

null_field
  A number, string, or C-style literal escape value (e.g. \xAB) that defines the NULL value representation

null_length
  A field length that indicates a NULL value (only appropriate for variable-length fields)

The null field representation can be any string, regardless of the valid values for the actual column data type.
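As an illustration of the null_field property, here is a Python sketch of reading a delimited line where any field matching a chosen null string becomes NULL (None). The marker string "NULL" is an arbitrary assumption for this example, not a default:

```python
NULL_FIELD = "NULL"   # arbitrary null_field marker chosen for this example

def parse_line(line, delimiter=","):
    """Split one delimited record; fields equal to the null marker
    become NULL (None), whatever the column's actual data type."""
    return [None if f == NULL_FIELD else f
            for f in line.rstrip("\n").split(delimiter)]

row = parse_line("42,NULL,Ford\n")
# row -> ["42", None, "Ford"]
```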
44April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Lookup and Nullable Columns
When using Lookup with "If Not Found = Continue", unmatched output rows follow the nullability attributes of the reference link for non-key columns:

- If the non-key columns of the reference link are defined as non-nullable, the Lookup stage assigns a "default value" to unmatched records. The default value depends on the data type*. For example:
  - Integer columns default to zero
  - Varchar defaults to a zero-length string (distinctly different from a NULL value)
  - Char defaults to a fixed-length string of $APT_STRING_PADCHAR characters
- If the non-key columns of the reference link are defined as nullable, the Lookup stage places NULL values in these columns for unmatched records

* Data type default values are documented in the OEM UserGuide.pdf

[Diagram: a Lookup stage with "If Not Found = Continue"; unmatched rows follow the nullability attributes of the non-key reference link columns]

TIP: When changing column attributes, be careful to propagate the change through the remaining links of your job design (including the output column definition of the Lookup stage in this example).
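The unmatched-row behavior described above can be modeled in a few lines of Python. The type defaults shown (zero, zero-length string) follow the slide, but the function and its signature are illustrative, not the Lookup stage's implementation:

```python
# Illustrative type defaults: zero for integers, zero-length string for varchar
TYPE_DEFAULTS = {"integer": 0, "varchar": ""}

def lookup_continue(stream, reference, ref_nullable, ref_type="integer"):
    """Lookup with If Not Found = Continue: unmatched rows get NULL
    (nullable reference columns) or a type default (non-nullable)."""
    out = []
    for key, payload in stream:
        if key in reference:
            out.append((key, payload, reference[key]))
        elif ref_nullable:
            out.append((key, payload, None))                     # NULL
        else:
            out.append((key, payload, TYPE_DEFAULTS[ref_type]))  # default value
    return out

reference = {1: 500}
rows = [(1, "a"), (2, "b")]
nullable_result = lookup_continue(rows, reference, ref_nullable=True)
# -> [(1, 'a', 500), (2, 'b', None)]
default_result = lookup_continue(rows, reference, ref_nullable=False)
# -> [(1, 'a', 500), (2, 'b', 0)]
```

Note how the non-nullable case silently produces a zero for the unmatched row: downstream logic cannot distinguish it from a genuine zero, which is why the nullability of the reference link matters.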
45April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Outer JOINs and Nullable Columns
Similar to Lookup, when performing an OUTER JOIN (Left Outer, Right Outer, Full Outer), unmatched output rows follow the nullability attributes of the corresponding outer link(s):

- If the non-key columns of the outer link(s) are defined as non-nullable, the Join stage assigns a "default value" to unmatched records, based on their data type
- If the non-key columns of the outer link(s) are defined as nullable, the Join stage places NULL values in these columns for unmatched records

[Diagrams: a Left Outer JOIN of Left and Right inputs, and a Full Outer JOIN; in both, unmatched rows follow the nullability attributes of the non-key columns of the outer link(s)]
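A Left Outer Join with these nullability rules can be sketched as follows. As with the Lookup example, the default value and function are illustrative, not the Join stage's implementation:

```python
def left_outer_join(left, right, right_nullable, default=0):
    """Left outer join on the first tuple element; unmatched right-side
    columns become NULL (nullable) or a type default (non-nullable)."""
    right_map = dict(right)
    filler = None if right_nullable else default
    return [(k, lv, right_map.get(k, filler)) for k, lv in left]

left = [(1, "x"), (2, "y")]
right = [(1, 100)]
nullable_result = left_outer_join(left, right, right_nullable=True)
# -> [(1, 'x', 100), (2, 'y', None)]
default_result = left_outer_join(left, right, right_nullable=False)
# -> [(1, 'x', 100), (2, 'y', 0)]
```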
46April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Transformer and Null Expressions
Within a parallel Transformer, any expression that includes a NULL value produces a NULL result:
  1 + NULL = NULL
  "John" : NULL : "Doe" = NULL

When the result of a link constraint or output derivation is NULL, the Transformer outputs that row to its reject link (dashed line):
- Always create a Transformer reject link in DataStage Designer
- Always test for null values before using them in an expression:
  IF ISNULL(link.col) THEN ... ELSE ...
  Use stage variables if the result is re-used

The v7 Transformer now warns when rows are rejected.
v7 also clarifies the naming of output link constraints ("Otherwise").
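The NULL-propagation rules above can be modeled in Python, where None stands in for an out-of-band NULL; the helper names are illustrative:

```python
def null_add(a, b):
    """1 + NULL = NULL: any arithmetic touching NULL yields NULL."""
    return None if a is None or b is None else a + b

def null_concat(*parts):
    """'John' : NULL : 'Doe' = NULL: concatenation with NULL yields NULL."""
    return None if any(p is None for p in parts) else "".join(parts)

def safe_derivation(col):
    """The IF ISNULL(link.col) THEN ... ELSE ... pattern: test first,
    so the expression never sees a NULL and the row is never rejected."""
    return "UNKNOWN" if col is None else col.upper()

a = null_add(1, None)                    # None: the whole expression is NULL
b = null_concat("John", None, "Doe")     # None
c = safe_derivation(None)                # "UNKNOWN": null handled explicitly
```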
47April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Framework “OEM” Documentation
UserGuide.PDF
  Covers framework architecture, parallel processing, partitioning/collecting data, data sets, data types, conversion functions, OSH
  Also includes detailed documentation on buildops

OperatorsRef.PDF
  Detailed reference for every built-in operator

RecordSchema.PDF
  Format of the Framework schema definition (including import, export, generator)

DevGuide.PDF, HeaderSorted.PDF, ClassSorted.PDF
  Low-level Orchestrate C++ APIs for building custom operators

Available in the documentation section ("Orchestrate") of the Ascential eServices public website
48April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
For More Information
Framework "OEM" Documentation: User Guide, Operators Reference, Record Schema

DataStage Enterprise Edition Best Practices and Performance Tuning document

PLEASE send your comments and feedback to:

Don't be afraid to try!
DataStage Enterprise Edition
Module 01: Parallel Framework Architecture
Paul Christensen, Solution Architect
Last revision: June 23, 2004
DataStage Enterprise Edition
Module 02: Partitioning, Collecting, and Sorting Data
Paul Christensen, Solution Architect
Last revision: June 22, 2004
Partitioners, Collectors, and Sorting
Partitioners distribute rows of a single link into smaller segments that can be processed independently, in parallel. They appear ONLY before parallel stages.

Collectors combine parallel partitions of a single link for sequential processing. They appear ONLY before sequential stages.

Sorting is used to arrange rows into specific groupings and order. It may be parallel or sequential.

[Diagram: partitioner and collector icons on the links between stages running sequentially and stages running in parallel]
Partitioning and Collecting Icons
"Fan-Out" Partitioner: Sequential to Parallel

"Fan-In" Collector: Parallel to Sequential

NOTE: Partitioner and Collector icons are ALWAYS drawn "Left to Right", regardless of how the link is drawn!
Partitioning Data
[Diagram: a partitioner on the link between two stages running in parallel]
Partitioners
Partitioners distribute rows of a single link (data set) into smaller segments that can be processed independently, in parallel.

Partitioners exist before ANY parallel stage. The previous stage may be running:
- Sequentially: results in a "fan-out" operation (and link icon)
- In Parallel: if the partitioning method changes, data is repartitioned

[Diagrams: a partitioner between two stages running in parallel; a sequential stage fanning out to a parallel stage; two parallel stages joined by a repartitioning icon]
Partition Numbers and Director Job Log
At runtime, the Parallel Framework determines the degree of parallelism for each stage using:
- $APT_CONFIG_FILE
- (optionally) a stage's node pool (Advanced properties)

Partitions are assigned numbers, starting at zero. The partition number is appended to the stage name for messages written to the DataStage Director job log.

[Screenshot: a Director job log entry, with the stage name and partition number labeled]
System Variables for Parallel Derivations
To facilitate parallel calculations regardless of the actual runtime configuration, system variables are provided in the Column/Row Generator and Transformer stages.

Within Column/Row Generator, two reserved words are provided for numeric cycles:
- part: actual partition number (starts at zero)
- partcount: total number of partitions at runtime

Starting with v7.1, the Surrogate Key Generator stage can generate a sequence of integer values in parallel:
- Internally similar to using a Column Generator stage with the part and partcount keywords
- Also supports an initial value for the sequence(s)

Within the Transformer, the @INROWNUM system variable is generated for each node. Instead, use:
- @PARTITIONNUM: actual partition number (starts at zero)
- @NUMPARTITIONS: total number of partitions

Example Generator sequence:
  Type = Cycle
  Initial value = part
  Increment = partcount

For a 4-node configuration file:
  @NUMPARTITIONS = 4
  @PARTITIONNUM = 0 through 3

Assuming incoming data is round-robin partitioned (the first four rows carry the initial values, the next four the first increment):

  Row#  Part  Partcount  Result
  1     0     4          0
  2     1     4          1
  3     2     4          2
  4     3     4          3
  5     0     4          4
  6     1     4          5
  7     2     4          6
  8     3     4          7
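The Cycle (initial value = part, increment = partcount) technique can be sketched in Python to show why the partitions together produce a dense, non-overlapping integer sequence:

```python
def partition_sequence(part, partcount, rows):
    """Values a single partition generates with Cycle(initial=part,
    increment=partcount)."""
    return [part + i * partcount for i in range(rows)]

partcount = 4                                  # 4-node configuration file
per_partition = [partition_sequence(p, partcount, 2) for p in range(partcount)]
# per_partition -> [[0, 4], [1, 5], [2, 6], [3, 7]]

# Together the partitions cover a dense range with no duplicates:
all_values = sorted(v for seq in per_partition for v in seq)
# all_values -> [0, 1, 2, 3, 4, 5, 6, 7]
```

Because each partition's values are congruent to its partition number modulo partcount, no two partitions can ever generate the same value, regardless of how many rows each one processes.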
Selecting a Partitioning Method
Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition
- Ensures that processing is evenly distributed across nodes
- Greatly varied partition sizes (skew) increase processing time

Enable "Show Instances" in the DataStage Director Job Monitor to show data distribution (skew) across partitions.

Setting the environment variable $APT_RECORD_COUNTS outputs row counts per partition to the DataStage log as each stage/node (operator) completes processing.
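Given the per-partition row counts that Show Instances or $APT_RECORD_COUNTS expose, a simple skew metric is the ratio of the largest partition to the mean partition size. This helper is an illustrative convention for eyeballing balance, not a DataStage metric:

```python
def skew(counts):
    """Ratio of the largest partition to the mean partition size
    (1.0 = perfectly balanced; larger = more skewed)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean

balanced = skew([250, 250, 250, 250])   # -> 1.0
skewed = skew([10, 10, 10, 970])        # -> 3.88 (one node does nearly all the work)
```

In the skewed case the job's elapsed time is dominated by the overloaded partition, so the effective parallelism is far below the configured degree.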
Selecting a Partitioning Method
Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition
- Ensures that processing is evenly distributed across nodes
- Greatly varied partition sizes (skew) increase processing time

Objective 2: The partition method MUST match the stage logic, assigning related records to the same partition if required
- Applies to any stage that operates on groups of related data (often using key columns)
- Examples: Aggregator, Join, Merge, Sort, Remove Duplicates, etc. (perhaps also Transformers, Buildops)
- The partitioning method needed to ensure correct results may violate Objective 1, depending on actual data distribution

Objective 3: The partition method should not be overly complex
- Use the simplest method that meets Objectives 1 and 2
- If possible, leverage partitioning performed earlier in the flow
59April 9, 2023 © 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.
Specifying Partitioning
Partitioning method is defined on the Input properties, Partitioning category, of any stage running in parallel
Partitioning Methods
Keyless Partitioning (rows are distributed independent of actual data values):
- Same: existing partitioning is not altered
- Round Robin: rows are evenly alternated among partitions
- Random: rows are randomly assigned to partitions
- Entire: each partition gets the entire data set (rows are duplicated)

Keyed Partitioning (rows are distributed at runtime based on values in specified key column(s)):
- Hash: rows with the same key column value go to the same partition
- Modulus: assigns each row of an input data set to a partition, as determined by a specified numeric key column
- Range: similar to Hash, but the partition mapping is user-determined and partitions are ordered
- DB2: matches DB2 EEE partitioning (discussed in the database chapter)

Auto (the default method): DataStage EE chooses an appropriate partitioning method. Round Robin, Same, or Hash are most commonly chosen.
SAME Partitioning

Keyless partitioning method:
- Rows retain their current distribution and order from the output of the previous parallel stage
- Doesn't move data between nodes
- Retains "carefully partitioned" data (such as the output of a previous sort)
- Fastest partitioning method (no overhead)

[Diagram: row IDs 0,3,6 / 1,4,7 / 2,5,8 stay in the same partitions across the SAME partitioning icon]
Impact of SAME Partitioning
Don't over-use SAME partitioning in a job flow. Because SAME does not alter existing partitions, the degree of parallelism remains unchanged in the downstream stage:
- If you read a Sequential File using SAME partitioning (without specifying the Readers Per Node option), the downstream stage will run sequentially!
- If you read a persistent Data Set using SAME partitioning, the downstream stage runs with the degree of parallelism used to create the data set, regardless of the current $APT_CONFIG_FILE / specified node pool
Round Robin and Random Partitioning
Keyless partitioning methods:
- Rows are evenly distributed across partitions
- Good for initial import of data if no other partitioning is needed
- Useful for redistributing data
- Fairly low overhead

Round Robin assigns rows to partitions like dealing cards:
- The row/partition assignment will be the same for a given $APT_CONFIG_FILE

Random distributes rows in random order:
- Higher overhead than Round Robin
- Not subject to regular patterns that might exist in the source data
- The row/partition assignment will differ between runs of the same input data

[Diagram: input rows ...8 7 6 5 4 3 2 1 0 dealt Round Robin into partitions 0,3,6 / 1,4,7 / 2,5,8]
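Round robin partitioning can be sketched in a few lines; note the deterministic row/partition assignment for a fixed number of partitions, matching the diagram above:

```python
def round_robin(rows, partitions):
    """Deal rows to partitions in turn, like dealing cards."""
    out = [[] for _ in range(partitions)]
    for i, row in enumerate(rows):
        out[i % partitions].append(row)
    return out

parts = round_robin(list(range(9)), 3)
# parts -> [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```

Partition sizes differ by at most one row regardless of data values, which is why round robin is the safe choice when no key grouping is required.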
Parallel Runtime Example
Remember, row order is undefined (non-deterministic) across partitions, or across multiple links.

Consider this example job: a Row Generator produces 10 rows {A: Integer, initial_value=1, incr=1} with Round Robin partitioning. Round robin distributes rows in a specific order to the number of nodes at runtime. But the order in which a particular node outputs its results may change with each run.

Results with a 4-node $APT_CONFIG_FILE:
  Node 0: 1, 5, 9
  Node 1: 2, 6, 10
  Node 2: 3, 7
  Node 3: 4, 8

With round robin partitioning, rows are distributed in the same order for the same input data and $APT_CONFIG_FILE.
ENTIRE Partitioning
Keyless partitioning method: each partition gets a complete copy of the data
- Useful for distributing lookup and reference data
- May have a performance impact in MPP / clustered environments
- On SMP platforms, the Lookup stage (only) uses shared memory instead of duplicating the ENTIRE reference data

ENTIRE is the default partitioning for Lookup reference links with "Auto" partitioning
- On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)

[Diagram: input rows ...8 7 6 5 4 3 2 1 0 copied by ENTIRE so that every partition holds rows 0,1,2,3,...]
HASH Partitioning
Keyed partitioning method: rows are distributed according to the values in one or more key columns
- Guarantees that rows with an identical combination of values in the key column(s) are assigned to the same partition
- Needed to prevent matching rows from "hiding" in other partitions (e.g. Join, Merge, RemDup, ...)
- Partition size will be relatively equal if the data across the source key column(s) is evenly distributed

[Diagram: key column values ...0 3 2 1 0 2 3 2 1 1 hashed into partitions 0,3,0,3 / 1,1,1 / 2,2,2]
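A hash partitioner can be sketched as follows; Python's built-in hash() stands in for the Framework's hashing function, but the invariant is the same: equal key values always map to the same partition:

```python
def hash_partition(rows, key, partitions):
    """Assign each row to a partition by hashing its key column value."""
    out = [[] for _ in range(partitions)]
    for row in rows:
        # Equal key values always produce the same partition number.
        out[hash(row[key]) % partitions].append(row)
    return out

rows = [
    {"ID": 1, "LName": "Ford"},
    {"ID": 5, "LName": "Dodge"},
    {"ID": 7, "LName": "Ford"},
]
parts = hash_partition(rows, "LName", 4)

# All "Ford" rows land in a single partition, wherever that happens to be:
ford_partitions = {i for i, p in enumerate(parts)
                   for r in p if r["LName"] == "Ford"}
```

Which partition each key lands in is an implementation detail; the guarantee is only that a given key value never splits across partitions, so grouped operations see complete groups.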
Hash Key Selection
HASH ensures that rows with the same combination of all key column values are assigned to the same partition.

Hash on LName with a 4-node config file distributes as follows.

Source Data:

  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Partition 1:

  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Partition 0:

  ID  LName  FName    Address
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard

NOTE: Partition distribution matches the source data distribution. In this example, the number of distinct hash key values limits parallelism!
Another Hash Key Example
Using the same source data, Hash on LName, FName with a 4-node config file:

Part 3:
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  7   Ford   Henry    4901 Evergreen

Part 2:
  ID  LName  FName    Address
  4   Ford   Eleanor  7900 Jefferson
  6   Dodge  John     75 Boston Boulevard
  10  Ford   Eleanor  1100 Lakeshore

Part 1:
  ID  LName  FName    Address
  3   Ford   Edsel    7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  9   Ford   Edsel    1100 Lakeshore

Part 0:
  ID  LName  FName    Address
  2   Ford   Clara    66 Edison Avenue
  8   Ford   Clara    4901 Evergreen

NOTE: Improved distribution. Only rows with the same unique combination of key column values appear in the same partition.

For partitioning purposes, the order of HASH key columns is insignificant.
NOTE: To avoid repartitioning, key column order should be consistent across stages with the same keys.
Modulus Partitioning
Keyed partitioning method: rows are distributed according to the values in one integer key column
- Simpler (and faster) calculation than HASH, using the modulus (remainder) of division:
  partition = MOD(key_value, #partitions)
- Guarantees that rows with identical values in the key column end up in the same partition
- Partition size will be relatively equal if the data within the key column is evenly distributed

[Diagram: key column values ...0 3 2 1 0 2 3 2 1 1 assigned by MODULUS into partitions 0,3,0,3 / 1,1,1 / 2,2,2]
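The modulus calculation can be shown directly; with the key values from the diagram and 3 partitions, the distribution matches the figure:

```python
def modulus_partition(rows, key, partitions):
    """partition = MOD(key_value, number_of_partitions)"""
    out = [[] for _ in range(partitions)]
    for row in rows:
        out[row[key] % partitions].append(row)
    return out

key_values = [0, 3, 2, 1, 0, 2, 3, 2, 1, 1]   # from the diagram
rows = [{"id": v} for v in key_values]
parts = modulus_partition(rows, "id", 3)
contents = [[r["id"] for r in p] for p in parts]
# contents -> [[0, 3, 0, 3], [1, 1, 1], [2, 2, 2]]
```

Unlike hash, the partition number is fully determined by the key value and the partition count, with no hashing step, which is why it is only applicable to a single integer key.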
RANGE Partitioning
Keyed partitioning method: rows are evenly distributed according to the values in one or more key columns
- Requires "pre-processing" the data to generate a range map
- More expensive than HASH partitioning: must read the entire data TWICE to guarantee results
- Guarantees that rows with identical values in the key columns end up in the same partition
- The "Write Range Map" stage is used to generate the range map file
- If the source data distribution is consistent over time, it may be possible to re-use the map file
- Values outside of a given range map land in the first or last partition, as appropriate

[Diagram: key column values 4 0 5 1 6 0 5 4 3 split by a Range Map file into ordered partitions 0,1,0 / 4,4,3 / 5,6,5]

QUIZ: If incoming data is ordered on the key, something bad happens. WHAT?
ANSWER: The process runs sequentially (key value adjacency)!
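Range partitioning can be sketched in two steps: build a range map (boundary values) from the data, then assign each row by binary search against the boundaries. Using the key values from the diagram, the resulting partitions match the figure; bisect stands in for the range map lookup:

```python
import bisect

def build_range_map(sample, partitions):
    """Boundary values that split the sorted sample into equal-sized ranges
    (a stand-in for the Write Range Map stage's pre-processing pass)."""
    s = sorted(sample)
    return [s[len(s) * i // partitions] for i in range(1, partitions)]

def range_partition(rows, boundaries):
    """Assign each value to an ordered partition by binary search; values
    outside the map fall into the first or last partition."""
    out = [[] for _ in range(len(boundaries) + 1)]
    for v in rows:
        out[bisect.bisect_right(boundaries, v)].append(v)
    return out

data = [4, 0, 5, 1, 6, 0, 5, 4, 3]           # key values from the diagram
boundaries = build_range_map(data, 3)         # -> [3, 5]
parts = range_partition(data, boundaries)     # -> [[0, 1, 0], [4, 4, 3], [5, 6, 5]]
```

The two passes are visible here: one over the data to build the boundaries, a second to assign rows. The quiz answer also falls out of this model: if the incoming rows are already ordered on the key, consecutive rows fall into the same range, so only one partition is busy at a time.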
Example Partitioning Icons
[Icons: a "fan-out" (Sequential to Parallel) AUTO partitioner; a re-partition icon (watch for this!); and a SAME partitioner]
Automatic Partitioning
By default, the Parallel Framework inserts partition components as necessary to ensure correct results (check the job score):
- Before any stage with "Auto" partitioning
- Generally chooses between ROUND-ROBIN and SAME
- Inserts HASH on stages that require matched key values (e.g. Join, Merge, RemDup)
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links; NOT always appropriate for MPP/clusters

Since the Framework has limited awareness of your data and business rules, it is usually best to explicitly specify HASH partitioning when key groupings are required:
- The Framework has no visibility into Transformer logic
- Required before SORT and AGGREGATOR (hash method) stages
- The Framework may insert un-needed or non-optimal partitioning
Preserve Partitioning Flag
The "preserve partitioning" flag is designed for stages that use "Auto" partitioning. The flag has 3 possible settings:
- Set: instructs downstream stages to attempt to retain partitioning and sort order
- Clear: downstream stages need not retain partitioning and sort order
- Propagate: passes (if possible) the flag setting from input to output links

The flag is set automatically by some operators (e.g. Sort, Hash partitioning), and can be manually set by users through the stage Advanced properties.

It is functionally equivalent to explicitly specifying SAME partitioning, but allows the Parallel Framework to over-ride and optimize for performance (e.g. if the degree of parallelism differs).

The Preserve Partitioning setting is part of the data set structure, in memory (virtual) and on disk (persistent).

At runtime, if the Preserve Partitioning flag is set and a downstream operator cannot use the previous partitioning, a warning is issued.
Summary: Partitioning Strategy
Use HASH partitioning when a stage requires grouping of related values:
- Specify only the key columns that are necessary for correct grouping (as long as the number of unique values is sufficient)
- Use MODULUS if the group key is a single Integer column
- RANGE may be appropriate in rare instances when data distribution is uneven but consistent over time
- Know your data! How many unique values are in the hash key column(s)?

If grouping is not required, use ROUND ROBIN to redistribute data equally across all partitions:
- The Framework will often do this with AUTO partitioning

Try to optimize partitioning for the entire job flow.
Job SCORE: Data Sets
The job SCORE can be used to verify the partitioning and collecting methods used at runtime.

Partitioners and collectors are associated with data sets (the top portion of the SCORE). Data sets connect a source and a target, each of which is either:
- operator(s) (see the lower portion of the SCORE)
- persistent data set(s)

The partitioner / collector method is shown between the source and target.
Interpreting the Job Score - Partitioning
The DataStage Parallel Framework implements a producer-consumer data flow model: upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets).

The partitioning method is associated with the producer; the collector method is associated with the consumer. They are separated by an indicator:

  ->  Sequential to Sequential
  <>  Sequential to Parallel
  =>  Parallel to Parallel (SAME)
  #>  Parallel to Parallel (not SAME)
  >>  Parallel to Sequential
  >   No producer or no consumer

The indicator may also include a [pp] notation when the Preserve Partitioning flag is set.
Optimizing Partitioning
Minimize the number of re-partitions within and across job flows.

Within a flow:
- Examine up-stream partitioning and sort order, and attempt to preserve them for down-stream stages using SAME partitioning
- May require re-examining key column usage within stages and processing (stage) order

Across jobs, through a persistent data set:
- Data sets retain partitioning AND sort order across flows
- If sort order is significant, write to a persistent data set with the Preserve Partitioning flag SET
- Useful if downstream jobs are run with the same degree of parallelism and require the same partition and sort order
Collecting Data
[Diagram: a collector feeding a stage running sequentially]
Collectors

Collectors combine partitions of a data set into a single input stream to a sequential stage.

[Diagram: data partitions (NOT links) feed a collector, which feeds the sequential stage]
Specifying Collector Type
The collector method is defined on the Input properties, Partitioning category, of any stage running sequentially when the previous stage is running in parallel.

[Diagram: a stage running in parallel, a collector icon on the link, and a stage running sequentially]
Collector Methods
(Auto)
- Eagerly reads any row from any input partition
- Output row order is undefined (non-deterministic)
- This is the default collector method

Round Robin
- Patiently picks rows from input partitions in round robin order
- Slower than Auto; rarely used

Ordered
- Reads all rows from the first partition, then the second, ...
- Preserves the order that exists within partitions

Sort Merge
- Produces a single (sequential) stream of rows sorted on specified key column(s), for input already sorted on those keys
- Row order is undefined for non-key columns
Choosing a Collector Method
In most instances, the Auto collector (the default) is the fastest and most efficient method of collecting data into a sequential stream.

To generate a single stream of sorted data, use the Sort Merge collector for previously-sorted input:
- Input data must be sorted on these keys to produce a sorted result
- Sort Merge does not perform a sort; it simply defines the order in which rows are read from all partitions, using the values in one or more key columns

The Ordered collector is only appropriate when sorted input has been range-partitioned:
- No sort is required to produce sorted output

The Round Robin collector can be used to reconstruct the original (sequential) row order for round-robin partitioned inputs:
- As long as intermediate processing (e.g. sort, aggregator) has not altered row order or reduced the number of rows
- Rarely used in real-life scenarios
Collectors vs. Funnels
Don't confuse a collector with a Funnel stage!

Collector:
- Operates on a single, partitioned link (a single virtual data set)
- Consolidates partitions as the input to a sequential stage
- Always identified by a "fan-in" link icon

Funnel stage:
- A stage that runs in parallel
- Merges data from multiple links (multiple virtual data sets) into a single output link
- Table Definitions (schemas) of all input links must match
Sorting Data
Traditional (Sequential) Sort
Traditionally, the process of sorting data uses one primary key column and (optionally) multiple secondary key columns to generate a sequential, ordered result set. The order of the key columns determines the sort sequence (and groupings), and each key column specifies an ascending or descending sort. This is the method SQL uses for an ORDER BY clause.
Source Data:
ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore
Sort on: LName (asc), FName (desc)
Sorted Result:
ID LName FName Address
6 Dodge John 75 Boston Boulevard
5 Dodge Horace 17840 Jefferson
1 Ford Henry 66 Edison Avenue
7 Ford Henry 4901 Evergreen
4 Ford Eleanor 7900 Jefferson
10 Ford Eleanor 1100 Lakeshore
3 Ford Edsel 7900 Jefferson
9 Ford Edsel 1100 Lakeshore
2 Ford Clara 66 Edison Avenue
8 Ford Clara 4901 Evergreen
Parallel Sort
- In most cases, there is no need to globally sort data to produce a single sequence of rows.
- Instead, sorting is most often needed to establish order within specified groups of data (Join, Merge, Aggregator, Remove Duplicates, etc.). This sort can be done in parallel!
- Partitioning is used to gather related rows: it assigns rows with the same key column value(s) to the same partition.
- Sorting is used to establish grouping and order within each partition, based on one or more key column(s), so that equal key values are adjacent.
- Partition and sort keys need not be the same! This is often the case before Remove Duplicates.
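The partition-then-sort idea above can be sketched in a few lines of Python (an illustration only, not DataStage code): hash-partition on the key, then sort each partition independently. Every key group ends up whole, in key order, inside one partition, with no global sort.

```python
# Grouped parallel sort: hash-partition on the key column, then sort
# each partition on its own.
rows = [("Ford", 1), ("Dodge", 2), ("Ford", 3), ("Dodge", 4), ("Ford", 5)]

nparts = 2
parts = [[] for _ in range(nparts)]
for row in rows:
    parts[hash(row[0]) % nparts].append(row)   # partition on the key column

sorted_parts = [sorted(p, key=lambda r: r[0]) for p in parts]  # sort per partition

# Each partition is now in key order, and each key lives in exactly one partition.
for p in sorted_parts:
    keys = [r[0] for r in p]
    assert keys == sorted(keys)
```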
Example Parallel Sort
Using the same source data, hash partition on LName, FName (4-node config):
Part 3
ID LName FName Address
1 Ford Henry 66 Edison Avenue
7 Ford Henry 4901 Evergreen
Part 2
ID LName FName Address
4 Ford Eleanor 7900 Jefferson
6 Dodge John 75 Boston Boulevard
10 Ford Eleanor 1100 Lakeshore
Part 1
ID LName FName Address
3 Ford Edsel 7900 Jefferson
5 Dodge Horace 17840 Jefferson
9 Ford Edsel 1100 Lakeshore
Part 0
ID LName FName Address
2 Ford Clara 66 Edison Avenue
8 Ford Clara 4901 Evergreen
Within each partition, sort using LName, FName:
Part 3
ID LName FName Address
1 Ford Henry 66 Edison Avenue
7 Ford Henry 4901 Evergreen
Part 2
ID LName FName Address
6 Dodge John 75 Boston Boulevard
4 Ford Eleanor 7900 Jefferson
10 Ford Eleanor 1100 Lakeshore
Part 1
ID LName FName Address
5 Dodge Horace 17840 Jefferson
3 Ford Edsel 7900 Jefferson
9 Ford Edsel 1100 Lakeshore
Part 0
ID LName FName Address
2 Ford Clara 66 Edison Avenue
8 Ford Clara 4901 Evergreen
(A Parallel Sort runs independently within each of the four partitions.)
Stages that require Sorted Data
- Stages that process data on groups:
  - Aggregator
  - Remove Duplicates
  - Compare (perhaps: if only comparing values, not order, between two sources)
  - Transformer, BuildOp (perhaps, depending on internal stage-variable logic)
- "Lightweight" stages that minimize memory usage by requiring data in key-column sort order:
  - Join
  - Merge
  - Sort Aggregator
Parallel (Grouped) Sorting Methods
DataStage Designer provides two methods for parallel (grouped) sorting:
- A Sort stage in parallel execution mode, OR
- A sort specified on a link, when partitioning is not Auto. Links with a sort defined display a Sort icon.

By default, both methods use the same internal sort package (the tsort operator).
Sorting on a Link
- Easier job maintenance (fewer stages on the job canvas)
- BUT fewer options (tuning, features)
- Right-click on a key column to specify sort options
- Specify key usage for Sorting, Partitioning, or Both
Sort Stage
The Sort stage offers more options than a link sort
Always specify “DataStage” Sort Utility (much faster than UNIX sort)
Stable Sorts
- A stable sort preserves the order of non-key columns within each sort group.
- Stable sorts are slightly slower than non-stable sorts for the same data and keys, so only use a stable sort when needed.
- By default, stable sort is enabled on Sort stages!
- Stable sort is not the default for link sorts.
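As a quick illustration of what "stable" means (Python's built-in sorted() happens to be stable, so it serves as a stand-in, not as DataStage code):

```python
# A stable sort keeps the prior relative order of rows with equal keys.
rows = [("Ford", "Henry"), ("Ford", "Clara"), ("Dodge", "John"), ("Ford", "Edsel")]

by_lname = sorted(rows, key=lambda r: r[0])   # stable: ties keep input order
print(by_lname)
# [('Dodge', 'John'), ('Ford', 'Henry'), ('Ford', 'Clara'), ('Ford', 'Edsel')]
```

The three "Ford" rows stay in their original Henry, Clara, Edsel order; a non-stable sort would be free to permute them.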
Resorting on Sub-Groups
- Use the Sort Key Mode property to re-use key-column groupings from previous sorts. This uses significantly less memory/disk!
- The sort then operates on previously-sorted key-column groups, not the entire data set, and outputs rows after each group.
- Key column order is important! Don't forget to retain the incoming sort order (e.g. SAME partitioning).
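The memory saving comes from sorting one group at a time. A minimal Python sketch of the idea (illustrative only; the "Don't Sort, Previously Sorted" behavior is modeled with itertools.groupby):

```python
# Input already sorted/grouped on the major key (column 0), so only each
# group needs sorting on the minor key, never the whole data set.
from itertools import groupby

rows = [("Dodge", 9), ("Dodge", 2), ("Ford", 7), ("Ford", 1), ("Ford", 4)]

out = []
for _, group in groupby(rows, key=lambda r: r[0]):
    out.extend(sorted(group, key=lambda r: r[1]))  # sort within each group only

print(out)  # [('Dodge', 2), ('Dodge', 9), ('Ford', 1), ('Ford', 4), ('Ford', 7)]
```

Only one group is ever held in memory at a time, and rows flow out as each group completes, which is exactly why this mode uses far fewer resources than a full re-sort.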
Partitioning and Sort Order
- When partitioning data (except with SAME), sort order is not maintained.
- To restore row order / groupings, a sort is required after any repartitioning.

(Diagram: sorted rows 1, 2, 3 and 101, 102, 103 pass through a partitioner; after repartitioning, the rows within each partition are no longer in sorted order.)
Sequential (Total) Sorting Methods
Within Enterprise Edition, DataStage provides two methods for generating a sequential (totally sorted) result:
- A Sort stage in sequential execution mode, OR
- A Sort Merge collector, for sorted input

In general, a parallel Sort plus a Sort Merge collector will be MUCH faster than a sequential Sort, unless the data is already sequential. (This is similar to how databases perform a "parallel sort".)
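The parallel Sort plus Sort Merge collector pattern can be sketched with Python's heapq.merge (an illustration of the technique, not DataStage code): sort each partition independently, then merge the already-sorted streams into one total order.

```python
# Per-partition sorts (could run in parallel), then a Sort Merge-style
# collection: merge sorted streams without re-sorting anything.
import heapq

partitions = [[3, 1, 8], [7, 2], [6, 4, 5]]
sorted_parts = [sorted(p) for p in partitions]   # "parallel" per-partition sorts

total = list(heapq.merge(*sorted_parts))         # collector merges on the key
print(total)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The merge step only ever compares the head row of each partition, which is why it is so much cheaper than sorting the full data set sequentially.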
Automatic Sorting
- By default, the Parallel Framework inserts sort operators as necessary to ensure correct results:
  - before any stage that requires matched key values (e.g. Join, Merge, Remove Duplicates)
  - only when the user has NOT explicitly defined an input sort
- Check the job SCORE for inserted tsort operators.
- For versions 7.01 and later, set $APT_SORT_INSERTION_CHECK_ONLY to change the behavior of automatically inserted sorts: instead of actually performing the sort, the inserted sort operators only VERIFY that the data is sorted. If the data is not sorted properly at runtime, the job will fail. Recommended only on a per-job basis during performance tuning.

op1[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}

op1[4p] {(parallel inserted tsort operator {key={value=LastName, subArgs={sorted}}, key={value=FirstName, subArgs={sorted}}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
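Conceptually, check-only mode replaces the sort with a pass-through that aborts on the first out-of-order row. A rough Python sketch of that idea (plain illustration, not Framework code):

```python
# Verify order instead of sorting: pass rows through unchanged, but fail
# the "job" if any row breaks the expected key sequence.
def check_sorted(rows, key):
    prev = None
    for i, row in enumerate(rows):
        k = key(row)
        if prev is not None and k < prev:
            raise RuntimeError(f"row {i} out of sort order")  # job would abort
        prev = k
    return rows  # rows pass through unchanged

check_sorted([("Dodge",), ("Ford",)], key=lambda r: r[0])   # passes
# check_sorted([("Ford",), ("Dodge",)], key=lambda r: r[0]) # would raise
```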
Sort Resource Usage
- By default, each sort uses 20MB per partition as an internal memory buffer. This includes user-defined (link, stage) and framework-inserted sorts.
- A different size can be specified for each Sort stage using the "Restrict Memory Usage" option:
  - Increasing this value can improve performance, especially if the entire data partition (or group) can fit into memory.
  - Decreasing this value may hurt performance, but will use less memory (the minimum is 1MB per partition).
  - From Designer, this option is unavailable for link sorts.
- When the memory buffer is filled, sort uses temporary disk space in the following order:
  1. Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
  2. Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
  3. The default directory specified by $TMPDIR
  4. The UNIX /tmp directory
Optimizing Sort Performance
- Minimize the number of sorts within a job flow. Each sort interrupts the parallel pipeline: it must read all rows before generating output.
- Specify only the necessary key columns.
- Avoid stable sorts unless needed to retain the order of non-key-column data.
- If possible, use the "Sort Key Usage" key-column option to re-use previous sort keys.
- Within the Sort stage, adjusting the "Restrict Memory Usage" option may improve performance.
Partitioning Examples
Partitioning Example 1
Scenario: assign an average value to existing detail rows.
"Standard" solution (3 hash/sorts): Copy the data, then hash and sort on all inputs to the Aggregator and Join. This is also the method the framework would use with Auto partitioning to ensure correct results.

(Diagram: Copy feeding Aggregate and Join. Notice that all 3 hash partitioners and sorts use the same key columns and order!)
Example 1 - Optimized Solution
Optimize partitioning keys (and sort order) across multiple stages in a single flow, to minimize re-sorts and re-partitions.
Optimized solution (1 hash/sort):
- Move the Hash/Sort upstream, before the Copy
- Use SAME partitioning to preserve partitioning and sort order

(Diagram: partition and sort on the key column(s) before the Copy; SAME partitioning retains the previous sort order, so the inputs to the Join do not need to be sorted!)
Example 1: Sort Insertion
- Looking at the job SCORE for the "optimal" solution, the Framework inserts sorts before each Join input to ensure correct results, regardless of the partitioning method chosen. In this example we don't want these extra sorts.
- To change the behavior of framework-inserted sorts for this job, set $APT_SORT_INSERTION_CHECK_ONLY. The inserted sorts will verify row order at runtime, but will not actually sort the data.

op3[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
op4[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
Partitioning Example 2: Header / Detail
Know your data: HASH guarantees correct grouping results, but it is not always the most efficient choice.
Scenario: header / detail processing. Assign data from the header row to all detail rows.
- Use a Transformer to separate the header and detail data
- Add a join key column (constant value) to both outputs

(Diagram: Src -> Transformer -> Header and Detail links -> Join -> Out)

NOTE: since the join key value is constant, the inputs to the Join stage should NOT be sorted.
Partitioning Example 2: Solutions
Solution 1: hash on the key columns and Join.
- This is the "standard" approach; it is also the method the Framework would use with Auto partitioning.
- BUT there is only one hash key value, so the Join runs sequentially.

Solution 2: use Entire to copy the header data to all partitions.
- Distribute the detail data using Round Robin
- The Join will now run in parallel

But there is still a possible problem with either solution! For either solution, to counteract framework-inserted sorts, set $APT_SORT_INSERTION_CHECK_ONLY.
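Solution 2 can be sketched in plain Python (an illustration of the partitioning idea, not DataStage code; the header and detail field names here are made up for the example):

```python
# Entire partitioning: the single header row is copied to every partition,
# so each partition can join its share of detail rows independently.
header = {"batch": "B1"}                     # hypothetical header values
details = [{"id": i} for i in range(6)]      # hypothetical detail rows

nparts = 3
part_header = [dict(header) for _ in range(nparts)]          # Entire: full copy
part_details = [details[i::nparts] for i in range(nparts)]   # Round Robin

joined = [
    {**d, **part_header[p]}                                  # per-partition join
    for p in range(nparts)
    for d in part_details[p]
]
assert len(joined) == len(details) and all("batch" in r for r in joined)
```

Because every partition holds its own copy of the header, no partition has to wait on another, which is why the join can run in parallel instead of collapsing onto one hash key value.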
Introducing the Buffer Operator
At runtime, the Framework automatically inserts buffer operators to prevent deadlocks and to optimize overall performance:
- For job flows with a fork-join (any link split that is later combined in the same job flow), buffer operators are inserted on all inputs to the downstream joining operator.
- Buffer operators may also be inserted in an attempt to match producer and consumer rates.
- Data is never repartitioned across a buffer operator; rows are processed first-in, first-out.
- Some stages (e.g. Sort, Hash Aggregator) internally buffer the entire data set before outputting a row; buffer operators are never inserted after these stages.

(Diagram: Stage 1 forks to two paths that rejoin at Stage 3; a buffer operator sits on each input to Stage 3.)
Identifying Buffer Operators
- At runtime, buffers are identified in the operators section of the job SCORE.
- For more details on buffering, see the OEM User Guide PDF, Appendix A.

It has 6 operators:
op0[1p] {(sequential Row_Generator_0) on nodes ( ecc3671[op0,p0] )}
op1[1p] {(sequential Row_Generator_1) on nodes ( ecc3672[op1,p0] )}
op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3) on nodes ( ecc3671[op2,p0] )}
op3[4p] {(parallel buffer(0)) on nodes ( ecc3671[op3,p0] ecc3672[op3,p1] ecc3673[op3,p2] ecc3674[op3,p3] )}
op4[4p] {(parallel APT_CombinedOperatorController:
    (APT_LUTProcessImpl in Lookup_3)
    (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
    (PeekNull)
  ) on nodes ( ecc3671[op4,p0] ecc3672[op4,p1] ecc3673[op4,p2] ecc3674[op4,p3] )}
op5[1p] {(sequential APT_RealFileExportOperator in Sequential_File_12) on nodes ( ecc3672[op5,p0] )}
It runs 12 processes on 4 nodes.
How Buffer Operators Work
- The primary goal of a buffer operator is to prevent deadlocks.
- This is accomplished by holding rows until the downstream operator is ready to process them.
- Rows are held in memory up to the size defined by $APT_BUFFER_MAXIMUM_MEMORY (the default is 3MB per buffer, per partition).
- When buffer memory is filled, rows are spilled to disk: by default, up to the amount of available scratch disk, unless a QUEUE UPPER BOUND limit has been set.

(Diagram: Producer -> Buffer -> Consumer)
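The hold-in-memory-then-spill behavior can be modeled with a toy Python class (purely illustrative; the row limit stands in for the 3MB memory default, and a list stands in for scratch disk):

```python
# Toy buffer operator: FIFO, bounded memory, spill-to-"disk" overflow.
from collections import deque

class Buffer:
    def __init__(self, max_rows_in_memory=4):
        self.mem = deque()
        self.disk = deque()              # stands in for scratch-disk spill
        self.max = max_rows_in_memory

    def put(self, row):                  # producer side
        if len(self.mem) < self.max:
            self.mem.append(row)
        else:
            self.disk.append(row)        # memory full: spill to disk

    def get(self):                       # consumer side, first-in first-out
        if not self.mem:
            return None
        row = self.mem.popleft()
        if self.disk:                    # refill memory from spilled rows
            self.mem.append(self.disk.popleft())
        return row

buf = Buffer()
for r in range(6):
    buf.put(r)                           # rows 4 and 5 spill
assert [buf.get() for _ in range(6)] == [0, 1, 2, 3, 4, 5]  # order preserved
```

Even after spilling, rows come out in their original order, matching the first-in, first-out guarantee of the real buffer operator.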
Buffer Flow Control
- When buffer memory usage reaches $APT_BUFFER_FREE_RUN, the buffer operator offers resistance to new rows, slowing down the rate of the upstream producer. The default is 0.5 (50%).
- Setting $APT_BUFFER_FREE_RUN greater than 1 (100%) prevents the buffer from slowing down the upstream producer until $APT_BUFFER_MAXIMUM_MEMORY * $APT_BUFFER_FREE_RUN of data has been buffered. This assumes that the overhead of disk I/O for buffer scratch usage is less than the impact of slowing down the upstream operator.

(Diagram: Producer -> Buffer -> Consumer; at the $APT_BUFFER_FREE_RUN threshold, the buffer offers resistance to new rows, slowing down the upstream producer.)
Tuning Buffer Settings
- On a per-job basis, through environment variables: $APT_BUFFER_MAXIMUM_MEMORY, $APT_BUFFER_FREE_RUN, $APT_BUFFER_DISK_WRITE_INCREMENT, and many other advanced options.
- On a per-link basis (Inputs/Outputs -> Advanced). Buffer options are defined per link (virtual data set); hence the Output of one stage is the Input of the following stage.
- In general, Auto buffering (the default) is recommended. Don't change it unless you really understand your job flow and data! Disabling buffering may cause the job to deadlock (hang).
- In general, buffer tuning is an advanced topic; the default settings should be appropriate for most job flows.
- For very wide rows, it may be necessary to increase the default buffer size to handle more rows in memory. Calculate the total record width using the internal storage for each data type / length / scale; for variable-length (varchar) columns, use the maximum length.
Buffer Resource Usage
- By default, each buffer operator uses 3MB per partition of virtual memory. This can be changed through the Advanced link properties, or globally using $APT_BUFFER_MAXIMUM_MEMORY.
- When buffer memory is filled, temporary disk space is used in the following order:
  1. Scratch disks in the $APT_CONFIG_FILE "buffer" named disk pool
  2. Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
  3. The default directory specified by $TMPDIR
  4. The UNIX /tmp directory
End of Data / End of Data Group
- Stages that process groups of data (e.g. Join, Merge, Remove Duplicates, Sort Aggregator) cannot output a row until:
  - the data in the grouping key column(s) changes (a logical End of Data Group), or
  - all rows have been processed (End of Data)
- For stages that process groups, rows are buffered in memory until an End of Data Group or End of Data.
- Some stages (e.g. Sort, Hash Aggregator) must read the entire input data set (until End of Data) before outputting a single record. Setting the "Don't Sort, Previously Sorted" key option changes the Sort behavior to output on groups instead of the entire data set.
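The difference between emitting at each End of Data Group versus holding until End of Data can be sketched with a group-wise aggregation in Python (an analogy, not DataStage code; itertools.groupby stands in for sorted-input group detection):

```python
# On key-sorted input, each key change is an End of Data Group, so a
# group-processing stage can emit its result group by group.
from itertools import groupby

rows = [("Dodge", 10), ("Dodge", 20), ("Ford", 5), ("Ford", 7), ("Ford", 8)]

def grouped_sums(sorted_rows):
    for key, group in groupby(sorted_rows, key=lambda r: r[0]):
        yield key, sum(v for _, v in group)   # output as soon as the group ends

print(list(grouped_sums(rows)))  # [('Dodge', 30), ('Ford', 20)]
```

A hash-style aggregator on unsorted input would instead accumulate every key in memory and emit nothing until all rows had been read.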
Revisiting Example 2: Buffering Impact
For large data volumes, buffering introduces a possible problem with this solution:
- At runtime, buffer operators are inserted for this fork-join scenario.
- The Join stage, operating on key-column groups, is unable to output rows until an End of Data Group or End of Data.
- Since one header row is generated with no subsequent change in the join column, data is buffered until End of Data.

Solution: use stage variables to hold the header data values, and output multiple header rows with different join-key values.
- This additional logic may impact Transformer performance.
- The proper solution ultimately depends on data volume and available hardware resources.

(Diagram: Src -> Transformer -> Header and Detail links, each passing through a buffer -> Join -> Out)
Revisiting Example 2: Buffering Solution
- Define stage variables to hold the header-row values; set their initial values to empty. Only set the header values when a header row is identified.
- Header link: use output link constraints to only output data after the header values have been captured. Assign more than one join key value using @INROWNUM (assumes only one header row).
- Detail link: assign a constant value to the detail join column.
Join Stage: Internal Buffering
- Even for inner joins, there is a difference between the inputs of a Join stage!
  - The first link (#0, "LEFT" within Link Ordering) establishes the "driver" input; its rows are read one at a time.
  - For non-unique key values, all rows within the same key-value group are read into memory from the second link (#1, "RIGHT" by Link Ordering).
- For Example 2, the single header row must be the second input link (#1) to the Join stage. Otherwise, all input data will be read into virtual memory.
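The left/right asymmetry can be sketched in Python (a simplified model: here the whole right link is pre-grouped into a dict, whereas the real stage holds only the current key group in memory, with both inputs key-sorted):

```python
# Left ("driver") link is streamed row by row; the right link's key
# groups are what get buffered.
from itertools import groupby

left  = [("K1", "a"), ("K1", "b"), ("K2", "c")]   # driver: streamed
right = [("K1", 1), ("K1", 2), ("K2", 3)]         # buffered per key group

right_groups = {k: [v for _, v in g]
                for k, g in groupby(right, key=lambda r: r[0])}

joined = [(k, lv, rv) for k, lv in left for rv in right_groups.get(k, [])]
print(joined)
# [('K1', 'a', 1), ('K1', 'a', 2), ('K1', 'b', 1), ('K1', 'b', 2), ('K2', 'c', 3)]
```

This is why the single header row belongs on the right link: buffering a one-row group is free, while buffering the entire detail stream is not.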
Avoiding Buffer Contention
- Datasets do not buffer; there is no upstream operation that would prevent rows from being output.
- In some cases, the best solution to avoiding fork-join buffer contention is to split the job, landing results to intermediate data sets:
  - Develop a single job first.
  - If performance / volume testing indicates a buffering-related performance issue that cannot be resolved by adjusting buffer settings, then split the job across intermediate data sets.
Example 2: Why Not Use Lookup?
- A Lookup cannot output any rows until ALL reference-link data has been read into memory (End of Data), except for sparse database lookups.
- NEVER generate Lookup reference data using a fork-join of the source data. Separate the creation of the lookup reference data from the lookup processing.

(Diagram: Src -> Transformer -> Header and Detail links; the Header link feeds the HeaderRef reference input of the Lookup -> Out)
Summary
Summary

- Partitioning: the method should ensure correct results AND (if possible) evenly distribute the data. Be aware of the data distribution and its impact on processing.
- Collecting: used to consolidate partitioned data into a sequential process.
- Sorting: parallel sorting establishes row order within groups; partitioning gathers the related rows. Sequential sorting is only needed to produce a single, globally sorted sequential result set.
DataStage Enterprise Edition
Module 02: Partitioning, Collecting, and Sorting Data
Paul Christensen, Solution Architect
NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.
Last revision: June 22, 2004
DataStage Enterprise Edition
Module 03: The Parallel Job Score
Paul Christensen, Solution Architect
The Parallel Job SCORE
The job SCORE details the optimization plan used by the DataStage Parallel Framework to run a given job design, based on the specified $APT_CONFIG_FILE (similar to the way a parallel RDBMS builds a query plan). It:
- Identifies the degree of parallelism and node assignment(s) for each operator
- Details the mappings between functional stages/operators and actual UNIX processes
- Includes buffer operators inserted to prevent deadlocks and optimize data flow rates between stages
- Can be used to identify sorts and partitioners that have been automatically inserted to ensure correct results
- Outlines the connection topology (datasets) between adjacent operators and/or persistent data sets
- Defines the number of actual UNIX processes; where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements
Viewing the Job SCORE
• Set $APT_DUMP_SCORE to output the Score to the DataStage job log
• Can be enabled at the project level to apply to all jobs
• For each job run, 2 separate Score dumps are written to the log:
  • The first score is actually from the license operator
  • The second score entry is the actual job score
Example Job Score
Job scores are divided into two sections:
- Datasets: partitioning and collecting
- Operators: node/operator mapping

Both sections note sequential or parallel processing.
Job SCORE: Operators
The operators (lower) section of the job SCORE details the mapping between stages and the actual processes created at runtime:
- Operator combination
- Operator-to-node mappings
- Degree of parallelism per operator
- Framework-inserted sorts
- Buffer operators

op0[1p] {(sequential APT_CombinedOperatorController:
    (Row_Generator_0)
    (inserted tsort operator {key={value=LastName}, key={value=FirstName}})
  ) on nodes ( node1[op0,p0] )}
op1[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
op2[4p] {(parallel buffer(0)) on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
Operator Combination
At runtime, the DataStage Parallel Framework can only combine stages (operators) that:
• Use the same partitioning method
  - Repartitioning prevents operator combination between the corresponding producer and consumer stages
  - Implicit repartitioning (eg. sequential operators, node maps) also prevents combination
• Are Combinable
  - Set automatically within the stage/operator definition
  - Can also be set within DataStage Designer: Advanced stage properties
Composite Operator Example: Lookup
The Lookup stage is a composite operator. Internally it contains more than one component, but to the user it appears to be one stage:
• LUTCreateImpl
  - Reads the reference data into memory
• LUTProcessImpl
  - Performs the actual lookup processing once the reference data has been loaded
At runtime, each internal component is assigned to operators independently
op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3)
    on nodes (
      ecc3671[op2,p0]
  )}
op3[4p] {(parallel buffer(0))
    on nodes (
      ecc3671[op3,p0]
      ecc3672[op3,p1]
      ecc3673[op3,p2]
      ecc3674[op3,p3]
  )}
op4[4p] {(parallel APT_CombinedOperatorController:
      (APT_LUTProcessImpl in Lookup_3)
      (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
      (PeekNull)
    ) on nodes (
      ecc3671[op4,p0]
      ecc3672[op4,p1]
      ecc3673[op4,p2]
      ecc3674[op4,p3]
  )}
Job SCORE: Data Sets
The Job Score can be used to verify the partitioning and collecting methods used at runtime:
• Partitioners and Collectors are associated with datasets (top portion of the Score)
• Datasets connect a source and a target, each of which can be:
  - operator(s) (see the lower portion of the Score)
  - persistent Dataset(s)
• The Partitioner / Collector method is shown between the source and target
Interpreting the Job Score - Partitioning
The DataStage Parallel Framework implements a producer-consumer data flow model:
• Upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets)
• The partitioning method is associated with the producer
• The collector method is associated with the consumer
  - "eCollectAny" is specified for parallel consumers, although no collection occurs!
• Producer and consumer are separated by an indicator:
    ->  Sequential to Sequential
    <>  Sequential to Parallel
    =>  Parallel to Parallel (SAME)
    #>  Parallel to Parallel (not SAME)
    >>  Parallel to Sequential
    >   No producer or no consumer
• The Score may also include [pp] notation when the Preserve Partitioning flag is set
Using the Job SCORE
$APT_DUMP_SCORE = 1 ("True") is a recommended default (project-level) setting for all jobs.
At runtime, the Job Score can be examined to identify:
• The number of UNIX processes generated for a given job and $APT_CONFIG_FILE
• Operator combination
• Partitioning methods between operators
• Framework-inserted components, including sorts, partitioners, and buffer operators
DataStage Enterprise Edition
Module 04: Best Practices and Job Design Tips
Paul ChristensenSolution Architect
NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.
Last revision: June 22, 2004
Assumptions
This module assumes that you have an understanding of the topics covered in:
• Module 01: Parallel Framework Architecture
• Module 02: Partitioning, Collecting, and Sorting
• Module 03: The Parallel Job Score
• DS324PX: DataStage Enterprise Edition Essentials
DataStageEnterprise Edition
Job Design Tips
Overall Job Design
Ideal job design must strike a balance between performance, resource usage, and restartability.
In theory, the best performance results from processing all data in memory without landing to disk. However:
• This requires hardware resources (eg. CPU, memory) and UNIX resources (eg. ulimit, nfiles, etc)
  - Resource usage grows multiplicatively with the degree of parallelism and the number of stages in a flow
  - Must also consider what else is running on the server(s)
• It may not be possible with very large amounts of data
  - eg. Sort will use scratch disk if the data is larger than its memory buffer
• Business rules may dictate job boundaries
  - eg. Dimensional maintenance before Fact table processing/load
  - eg. Lookup reference data must be created before lookup processing
Modular Job Design
Parallel shared containers facilitate modular job design by creating re-usable components (stages, logic):
• Runtime column propagation allows maximum parallel shared container re-use (only the columns used within the container logic need to be defined)
• The total number of stages in a job includes all stages in all parallel shared containers
Job parameters and multi-instance job properties facilitate job re-use.
Land intermediate results to parallel data sets.
Establishing Job Boundaries
Consider the following when establishing job boundaries:
• Business requirements
• Functional / DataStage requirements
• Establishing restart points in the event of a failure
  - Segment long-running steps
  - Separate the final database Load from the Extract and Transformation steps
• Resource utilization (number of stages, etc)
• Performance
  - Fork-join job flows may run faster if split into two separate jobs with intermediate datasets
  - Depends on processing requirements and the ability to tune buffering
Job Sequences
Job Sequences can be used to combine individual jobs into functional "modules" that perform a sequence of activities.
Starting with DataStage release 7.1, Job Sequences can be "restartable":
• In the event of a failure, re-running the sequence will not re-run activities that completed successfully
• It is the developer's responsibility to ensure that an individual job can be re-run after a failure
• Enable Sequence restart in Job Properties (enabled by default)
• The "do not checkpoint run" sequence stage property forces that step to execute on every run of the Sequence
Job Design – Stage Usage Tips
• Sequential File
  - Optimizing performance
  - Reading and writing fixed-width files
  - Adjusting write buffer size
• Column Import
• Lookup
• Sort
• Aggregator
• Transformer
• Database Stages
Reading a Sequential File in Parallel
By default, Sequential File reads are not parallel unless multiple files are specified.
The Readers Per Node option can be used to read a single input file in parallel at evenly spaced offsets.
Note that sequential row order cannot be maintained when reading a file in parallel.
Partitioning and Sequential Files
Sequential File sources (import operator) create one partition for each input file.
• Always follow a Sequential File with ROUND ROBIN or another appropriate partitioning type
• NEVER follow a Sequential File source with SAME partitioning
  - If reading from one file, this will cause the downstream flow to run sequentially!
  - SAME is only appropriate in unusual scenarios where the source data is already separated into multiple files by partition
Capturing Sequential File Rejects
The Sequential File stage supports an optional reject link to capture rows that do not match the source or target format.
• The reject schema is a single raw (binary) column
• Be careful writing rejects to another Sequential File
• It is easiest to output rejects to a Dataset (with a Peek for debugging)
Sequential File Tips
To write fixed-length files from variable-length fields, use the following column properties:
• field width: specifies the output column width
• pad string: specifies the character used to pad data to the specified field width (if not specified, an ASCII NUL character 0x0 is used for padding)
When reading delimited files, extra characters are silently truncated for source file values longer than the maximum specified length of VARCHAR columns.
• Starting with v7.01, set the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to reject these records instead
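The effect of the field width and pad string properties can be sketched in plain Python (this is an illustration, not DataStage code; the function name `to_fixed_width` is ours). The NUL default mirrors the slide's note about unspecified pad strings:

```python
# Hypothetical sketch: pad a variable-length value to a fixed "field width".
# DataStage pads with ASCII NUL (0x00) when no pad string is specified;
# here both cases are made explicit.

def to_fixed_width(value, field_width, pad_string="\x00"):
    """Pad a value so the output column is exactly field_width characters."""
    if len(value) > field_width:
        raise ValueError("value %r exceeds field width %d" % (value, field_width))
    return value + pad_string * (field_width - len(value))

print(to_fixed_width("AB", 5, " "))     # space-padded: 'AB   '
print(repr(to_fixed_width("AB", 5)))    # default NUL padding: 'AB\x00\x00\x00'
```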
Buffering Sequential File Writes
By default, Sequential File targets (export operator) buffer writes to optimize performance.
• Buffers are automatically flushed when the job completes successfully
For real-time applications, the environment variable $APT_EXPORT_FLUSH_COUNT can be used to specify the number of rows to buffer.
• For example, $APT_EXPORT_FLUSH_COUNT=1 flushes to disk for every row
• Setting this value low incurs a SIGNIFICANT performance penalty!
Using Column Import
The Column Import stage can be used to improve the performance of non-parallel Sequential File reads and FTP sources.
• Allows column parsing to run in parallel
• Separates parsing (CPU) from sequential source I/O
Define the source file/FTP output as a single column:
• Type RAW or [VAR]CHAR
• Maximum length = record size
• Note that there are metadata implications
Define Columns, Data Types, and other format options in the Column Import stage (similar to a Sequential File definition).
Lookup Stage Usage
The Lookup stage is most appropriate when the reference data is small enough to fit into physical (shared) memory.
• For reference datasets larger than available memory, use the JOIN or MERGE stage
Limit the use of Sparse Lookup (for DB2 and Oracle reference tables):
• Per-row database lookups are extremely expensive (slow)
• For small numbers of rows, can be used for database-generated variables / function results
• ONLY appropriate when the number of input rows is significantly smaller (eg. 1:100) than the number of reference rows
Lookup Reference Data Partitioning
ENTIRE is the default partitioning for Lookup reference links with "Auto" partitioning.
• On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)
• On SMP platforms, the Lookup stage uses shared memory instead of duplicating the ENTIRE reference data
To minimize data movement across nodes on clustered / MPP platforms, it may be appropriate to select a keyed partitioning method:
• Especially if the data is already partitioned on those keys
• Input and Reference data partitioning must match
Lookup Reference Data
NEVER generate Lookup reference data using a fork-join of the source data.
• Lookup cannot output rows until all reference data has been read into memory (except for Oracle or DB2 Sparse reference links)
• Use Lookup File Sets to separate the creation of lookup reference data from lookup processing
[Diagram: fork-join example job flow, with links labeled Header, Detail, Src, HeaderRef, and Out]
Lookup File Sets
Lookup File Sets should be used to store reference data on disk.
• Data is stored in native format, partitioned, and pre-indexed on the lookup key column(s)
• Key column(s) and partitioning are specified when the file is created
Lookup File Sets can only be used as a reference input link to a Lookup stage.
• The partitioning method and key columns specified when the Lookup File Set is created will be used to process the reference data on subsequent Lookups that use this file
Particularly useful when static reference data can be re-used in multiple jobs (or runs of the same job).
Aggregator
The Aggregator stage summarizes data based on groupings of key-column values.
• Input partitioning must match the desired groupings
Use the Hash method for inputs with a limited number of distinct key-column values:
• Uses 2K of memory per group
• Incoming data does not need to be pre-sorted
• Results are output after all rows have been read
• Output row order is undefined, even if the input data is sorted
Use the Sort method with a large (or unknown) number of distinct key-column values:
• Requires input pre-sorted on the key columns
• Results are output after each group
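The trade-off between the two Aggregator methods can be sketched in plain Python (function names are ours, not a DataStage API): the hash method keeps one in-memory bucket per distinct key and produces results only after all input is read, while the sort method streams pre-sorted input and emits each group as soon as it ends.

```python
from collections import defaultdict
from itertools import groupby

rows = [("east", 10), ("west", 5), ("east", 7), ("west", 3)]

def hash_aggregate(rows):
    totals = defaultdict(int)        # memory grows with the number of distinct keys
    for key, value in rows:
        totals[key] += value
    return dict(totals)              # available only after all rows are read

def sort_aggregate(sorted_rows):
    # requires input pre-sorted on the key column; constant memory per group
    for key, group in groupby(sorted_rows, key=lambda r: r[0]):
        yield key, sum(v for _, v in group)   # emitted after each group

print(hash_aggregate(rows))                   # {'east': 17, 'west': 8}
print(list(sort_aggregate(sorted(rows))))     # [('east', 17), ('west', 8)]
```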
Sequential (Total) Aggregations
To summarize over all input rows:
• Generate a constant-value key column, using a Column Generator or a Transformer (if one is already in the upstream job flow)
• Sequentially Aggregate on the generated key column
  - No need to sort or hash-partition the input data!
Use 2 aggregators to prevent the sequential aggregation (and collector) from slowing down the upstream data flow:
• The first aggregator runs in parallel, grouping on the generated key column
  - Round-robin the input if it is not evenly distributed
• The second aggregator runs sequentially, grouping on the generated key column
  - Auto collector
[Diagram: parallel first aggregator feeding a sequential second aggregator]
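The two-aggregator total pattern can be sketched in plain Python (an illustration under our own variable names, not DataStage code): each "node" first totals its own partition in parallel, and only the small set of partial results crosses the collector to the sequential final aggregation.

```python
# Sketch of the two-aggregator total: a constant key makes every row one
# group, so the parallel step produces one partial total per partition and
# the sequential step only ever sees len(partitions) rows.

partitions = [[1, 2, 3], [4, 5], [6]]   # round-robin-style split across 3 nodes

# First aggregator: runs in parallel, one partial total per partition
partials = [sum(p) for p in partitions]

# Second aggregator: runs sequentially over only the partial results
grand_total = sum(partials)

print(partials, grand_total)   # [6, 9, 6] 21
```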
Transformer vs. Other Stages
For optimum performance, consider more appropriate stages instead of a Transformer in parallel job flows:
- Use the Copy stage as a placeholder
  - this is different from DataStage Server!
  - unless FORCE=TRUE, Copy is optimized out at runtime
- Leverage stage (eg. Copy) Output Mappings (RCP off) to:
  - rename columns
  - drop columns
  - perform default type conversions
- Modify is the most efficient "stage". Use it for:
  - non-default type conversions
  - null handling (converting between in-band and out-of-band)
  - string trimming (v7.01 and later)
- NOTE: starting with v7.01, Transformer output link constraints are FASTER than the Filter stage! (Filter is always interpreted)
Transformer vs. Lookup
- Consider implementing Lookup tables for expressions that depend on value mapping
- For example, instead of using transformer expressions such as:
  - … link.A=1 OR link.A=3 OR link.A=5 …
  - … link.A=2 OR link.A=7 OR link.A=15 OR link.A=20 …
  create a Lookup table for the source-value pairs, and use the Lookup stage to assign values
- This method can also be used to simplify output link constraints
A    Result
1    1
3    1
5    1
2    2
7    2
15   2
20   2
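The table above amounts to a key/value mapping; in plain Python the same substitution for the chained OR expressions looks like this (a dict standing in for the Lookup stage's reference table; names are ours):

```python
# Reference table from the slide: source value -> assigned result
value_map = {1: 1, 3: 1, 5: 1, 2: 2, 7: 2, 15: 2, 20: 2}

def assign_result(a):
    # Keys absent from the table would go to the Lookup stage's
    # reject/failure handling; here we simply return None.
    return value_map.get(a)

print([assign_result(a) for a in (1, 7, 20, 99)])   # [1, 2, 2, None]
```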
Transformer Performance Guidelines
- Minimize the number of Transformers by combining derivations from multiple Transformers
- NEVER use the Server-side BASIC Transformer in high-volume data flows
  - It is intended to provide a migration path for existing DataStage Server applications that use DataStage BASIC routines
  - Starting with v7, the parallel Transformer supports user-defined functions (external object files or libraries, not DataStage BASIC routines)
- Replace Transformer stages that do not meet performance requirements with BuildOps
  - It is generally not necessary to replace all Transformers, just those that are bottlenecks
  - Remember, BuildOps require more knowledgeable developers than equivalent Transformer logic
Optimizing Transformer Expressions
The parallel Transformer uses the following evaluation algorithm:
• Evaluate each stage variable initial value
• For each input row:
  - Evaluate each stage variable derivation value, unless the derivation is empty
  - For each output link:
    - Evaluate each column derivation value
    - Write the output record
Stage variables and columns within a link are evaluated in the order displayed in the Transformer editor.
Optimizing Transformer Expressions
Given the Transformer evaluation order, use stage variables instead of per-column derivations to minimize repeated use of the same derivation (move repeated expressions outside of loops). Examples:
• Portions of output column derivations that are used in multiple derivations
• Expressions that include calculated constant values
  - Use the stage variable Initial Value to evaluate once for all rows
• Expressions requiring a type conversion that are used as a constant, or used in multiple places
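The stage-variable optimization can be sketched in plain Python (the function and variable names are ours, and this is an analogy, not Transformer code): a constant is evaluated once per run, like a stage variable's Initial Value, and a shared sub-expression is evaluated once per row instead of once per output column.

```python
# "Initial value": evaluated once for all rows, not per row
PREFIX = "ID-"

def transform(rows):
    out = []
    for first, last in rows:
        # "Stage variable": the shared expression is computed once per row,
        # then reused by both output-column derivations below.
        full = (first + " " + last).upper()
        out.append((PREFIX + full, len(full)))
    return out

print(transform([("Ada", "Lovelace")]))   # [('ID-ADA LOVELACE', 12)]
```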
Transformer Decimal Arithmetic
Starting with v7.0.1 and v6.0.2, the Transformer supports DECIMAL arithmetic (earlier releases converted to dfloat).
• Default internal decimal variables are precision 38, scale 10, but this can be changed by specifying $APT_DECIMAL_INTERM_PRECISION and $APT_DECIMAL_INTERM_SCALE
• Set $APT_DECIMAL_INTERM_ROUND_MODE to specify:
  - ceil: rounds toward positive infinity (1.4 -> 2, -1.6 -> -1)
  - floor: rounds toward negative infinity (1.6 -> 1, -1.4 -> -2)
  - round_inf: rounds or truncates to the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity (1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2)
  - trunc_zero: discards any fractional digits to the right of the rightmost supported fractional digit, regardless of sign (1.56 -> 1.5, -1.56 -> -1.5)
• If $APT_DECIMAL_INTERM_SCALE is smaller than the result of an internal calculation, the result is rounded or truncated to the scale size
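The four rounding behaviors can be illustrated with Python's decimal module. The mapping to decimal's rounding constants is our approximation, not an official equivalence: ceil ~ ROUND_CEILING, floor ~ ROUND_FLOOR, round_inf ~ ROUND_HALF_UP (nearest, ties away from zero), trunc_zero ~ ROUND_DOWN.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP, ROUND_DOWN

def rnd(value, mode):
    # Round a decimal value to integer precision using the given mode
    return Decimal(value).quantize(Decimal("1"), rounding=mode)

print(rnd("1.4", ROUND_CEILING), rnd("-1.6", ROUND_CEILING))   # 2 -1  (ceil)
print(rnd("1.6", ROUND_FLOOR), rnd("-1.4", ROUND_FLOOR))       # 1 -2  (floor)
print(rnd("1.5", ROUND_HALF_UP), rnd("-1.5", ROUND_HALF_UP))   # 2 -2  (round_inf ties)

# trunc_zero: drop extra fractional digits beyond the supported scale
print(Decimal("1.56").quantize(Decimal("0.1"), rounding=ROUND_DOWN))   # 1.5
```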
Conditionally Aborting a Job
Use the "Abort After Rows" setting in the output link constraints of the parallel Transformer to conditionally abort a parallel job:
• Create a new output link and assign a link constraint that matches the abort condition
• Set "Abort After Rows" for this link to the number of rows allowed before the job aborts
When the "Abort After Rows" threshold is reached, the Transformer immediately aborts the job flow, potentially leaving uncommitted database rows or unflushed file buffers.
More Transformer Best Practices
• Always include a Reject link
  - Captures NULL errors from Transformer expressions
• Always test for a null value before using a column in a function
• Avoid type conversions
  - Try to maintain the data type as imported
• Be aware of Column and Stage Variable data types
  - It is easy to neglect setting the proper Stage Variable type
“First Row” Transformer Derivations
Within a Transformer, stage variables can be used to identify the first row of an input group:
• Define one stage variable for each grouping key column
• Define a stage variable to flag when the input key column(s) do not match the previous value(s)
• On a new group (flag set), set the stage variable(s) to the incoming key column value(s)
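The stage-variable pattern above can be sketched in plain Python (names are ours; the input is assumed to be already partitioned and sorted on the key, as the technique requires):

```python
def flag_first_rows(sorted_rows):
    prev_key = None                 # stage variable holding the previous key value
    out = []
    for key, value in sorted_rows:
        is_first = key != prev_key  # flag stage variable: start of a new group?
        if is_first:
            prev_key = key          # remember the new group's key value
        out.append((key, value, is_first))
    return out

rows = [("A", 1), ("A", 2), ("B", 3)]
print(flag_first_rows(rows))   # [('A', 1, True), ('A', 2, False), ('B', 3, True)]
```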
“Last Row” Transformer Derivations
Since the Transformer cannot "read ahead", other methods must be used when derivations depend on the last row of a group.
For aggregate calculations within the Transformer, generate a "running total" for each group, then Remove Duplicates, retaining the Last row.
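The running-total-then-keep-last pattern can be sketched in plain Python (names are ours; the input is assumed pre-sorted on the group key). Every row carries the total so far, so the last row of each group holds the group total, and a stand-in for Remove Duplicates with "retain Last" keeps just that row:

```python
from itertools import groupby

def running_totals(sorted_rows):
    out, total, prev_key = [], 0, None
    for key, value in sorted_rows:
        total = value if key != prev_key else total + value
        prev_key = key
        out.append((key, total))    # every row carries the total so far
    return out

def keep_last_per_group(rows):
    # stand-in for Remove Duplicates retaining the Last row of each group
    return [list(group)[-1] for _, group in groupby(rows, key=lambda r: r[0])]

rows = [("A", 1), ("A", 2), ("B", 5)]
print(keep_last_per_group(running_totals(rows)))   # [('A', 3), ('B', 5)]
```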
Identifying “Last Row” in a Group
In general, it is a bad idea to perform multiple back-to-back sorts. The Sort stage, however, can be used for more than just sorting:
• Sub-sorting on groups (instead of complete sorts)
• Creating key-change columns
Example: For derivations that cannot output a running total, use 3 Sort stages before the Transformer to generate a change-key column for the last row in each group:
• Often, the data is already sorted earlier in the same flow
• Hash-partition/Sort on the key columns before the first sort
• Use SAME partitioning to ensure that subsequent stages keep the grouping and sort order
[Diagram: Sort -> KeyChange -> SubSort stages preceding the Transformer]
“Last Row” Sort Details
• First Sort: sorts on the key columns; sorts Descending on the group order column
• Second "Sort": does no sorting; creates the key-change column (specify only the key columns)
• Final "Sub-Sort": does not sort on the key columns; sub-sorts Ascending on the group order column
DataStageEnterprise Edition
Database Stage Usage
Database Stage Usage
• Overall Database Guidelines
• Native Parallel vs. Plug-In Stages
• DB2 Guidelines
• Oracle Guidelines
• Teradata Guidelines
• SQL or DataStage?
Optimizing Select Lists for Read
For source database stages, limit the use of "SELECT *" to read all columns:
• Uses more memory and may impact job performance
• Only needed for "dynamic" source / target flows (uncommon)
Instead, explicitly specify ONLY the columns needed in the flow:
• For the Table read method, specify the Select List property
• Or use Auto-Generated or User-Defined SQL
Native Parallel Database Stages
Starting with release 7, DataStage Enterprise Edition offers database connectivity through native parallel and plug-in stage types.
In general, for maximum parallel performance, scalability, and features, it is best to use the native parallel database stages:
• Parallel read and write
• OPEN and CLOSE commands
Upsert (API) vs. Load Methods
For database targets, most Enterprise stages provide a choice of Upsert or Load methods.
• The Upsert method uses database APIs
  - Allows concurrent processing with other jobs and applications
  - Does not bypass database constraints, indexes, or triggers
• The Load method uses the corresponding database-specific parallel load utility
  - Can be significantly faster than the Upsert method for large data volumes
  - Subject to database-specific limitations of the load utilities
    - May be issues with index maintenance, constraints, etc
    - May not work with tables that have associated triggers
  - Requires exclusive access to the target table
OPEN and CLOSE commands
The OPEN command allows the user to specify SQL to be executed before the stage begins reading or writing.
• Example: create a temporary table used to write rows
The CLOSE command allows the user to specify SQL to be executed after the stage completes reading or writing.
• Example: an "INSERT INTO … SELECT FROM …" statement to move rows from the temporary table to the actual table
• Example: delete temporary table(s)
These commands are available only in the native parallel (Enterprise) database stages.
Plug-In Database Stages
Plug-in stage types are intended to provide connectivity to database configurations not offered by the native parallel stages. Plug-in stages:
• Cannot read in parallel
• Cannot span multiple servers in clustered or MPP configurations
Designer Palette Customization
The DataStage repository window displays all stages available on the parallel canvas
- Stage Types/Parallel category
Not all of these stages are included in the default Designer palette
- Customize the palette to add stage types (e.g. Teradata API)
Enterprise Edition DB2 Stages
DB2 Enterprise stage
- Should always be used when reading from, performing lookups against, or writing to DB2 Enterprise Server Edition with the Database Partitioning Feature (DPF)
  - In DB2 v7.x this was called "DB2 EEE"
- Tightly coupled with DB2; communicates directly with each DB2 database node, using the same partitioning as the DB2 table
- Supports parallel Read, Upsert, Load, and Sparse Lookup
DB2 API stage
- Provides connectivity to non-UNIX DB2 databases (such as mainframe editions through DB2-CONNECT)
DB2 Upsert Commit Interval
For target DB2 tables using the Upsert method, the DB2 Enterprise stage provides options to specify the database commit interval for each stage
Rows are committed after a period of time or a number of rows, whichever comes first:
- Default is every 2 seconds or 2000 rows
Cleaning Up Failed DB2 Loads
In the event of a failure during a DB2 Load operation, the DB2 Fast Loader marks the table inaccessible (quiesced exclusive or load pending state)
To reset the target table to normal mode:
- Re-run the job specifying the "CleanupOnFailure=True" option
- Any rows that were inserted before the load failure must be deleted manually
Enterprise Edition Oracle Stages
Oracle Enterprise
- Source
  - Supports sequential (default) or parallel reads
- Target
  - Upsert: uses the Oracle API
  - Load: invokes SQL*Loader, subject to its limitations
Oracle OCI Load
- ONLY used for heterogeneous loads
  - When the target database's hardware platform differs from the Oracle client (DataStage server) platform
Specifying Oracle Remote Server
The Oracle Enterprise Remote Server connection option is intended for Oracle instances on remote hosts
In general, avoid using this option for local Oracle databases (on the same host as the DataStage server)
- Specifying it for local Oracle instances forces a TCP (network) database connection instead of shared memory
Instead, set the environment variable $ORACLE_SID
- The Oracle environment is typically defined within the DataStage dsenv file
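A minimal sketch of the Oracle entries one might add to $DSHOME/dsenv. The installation path and SID value below are illustrative placeholders, not values from the course — adjust them per site.

```shell
# Illustrative Oracle settings for $DSHOME/dsenv (path and SID are placeholders)
ORACLE_HOME=/u01/app/oracle/product/9.2.0; export ORACLE_HOME
ORACLE_SID=PROD; export ORACLE_SID
PATH=$ORACLE_HOME/bin:$PATH; export PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:${LD_LIBRARY_PATH:-}; export LD_LIBRARY_PATH
```

Because dsenv is sourced for every project on the server, settings here apply to all jobs unless overridden at the project or job level.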
Reading from Oracle in Parallel
By default, Oracle Enterprise reads sequentially. Use the "partition table" option to read in parallel from Oracle sources
Limitations of parallel read:
- Source table can only be non-partitioned or range-partitioned
- Cannot run queries containing a GROUP BY clause that are not also partitioned by the same field
- Cannot perform a non-collocated join
Oracle Schema Owner
To access Oracle tables that were created by a different user, fully qualify the table name
- Syntax: ownername.tablename
- NOTES:
  - Parameterize ownername
  - Database permissions must allow access
  - CANNOT use an unqualified synonym
    - A synonym provides no access to the Oracle system catalog information required by the Oracle Enterprise stage
Improving Oracle Upsert Performance
In Upsert write mode, the Oracle Enterprise stage:
- Executes the Insert statement (if present) first
- If the Insert fails with a unique-constraint violation, it then executes the Update statement
For larger data volumes, it is often faster to identify Insert and Update data within the job and separate them into different Oracle Enterprise targets
- Set Upsert Mode="Update Only" for rows to be updated
- Set Upsert Mode="Update and Insert" for rows to be inserted
- Prevents double-processing of update records
Insert processing uses Oracle host arrays to improve performance
- The optional InsertArraySize parameter can enhance performance (default is 500 rows)
Oracle Upsert Commit Interval
For target Oracle tables using the Upsert method, two environment variables specify the database commit interval
- As environment variables, commit settings apply to all Oracle stages in a job
Rows are committed after a period of time or a number of rows, whichever comes first, for each Oracle stage/partition:
- $APT_ORAUPSERT_COMMIT_ROW_INTERVAL
  - Default is every 2000 rows (per stage/partition)
- $APT_ORAUPSERT_COMMIT_TIME_INTERVAL
  - Default is every 2 seconds
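For example, to commit less frequently on a high-volume load, the two intervals might be raised like this (the values 10000 and 5 are illustrative; variable names follow the Oracle environment-variable table later in this module):

```shell
# Commit every 10000 rows or every 5 seconds, whichever comes first
# (illustrative values -- applies to all Oracle Upsert stages in the job)
export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=10000
export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=5
```

Larger intervals reduce commit overhead but increase the amount of uncommitted work lost on a failure.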
Oracle Load into Indexed Tables
By default, Oracle Enterprise will not Load an indexed table
- Must drop indexes before the load and recreate them after the load (requires appropriate Oracle privileges)
  - Can use OPEN and CLOSE commands
In Append or Truncate modes, the IndexMode option can allow a load into an indexed table:
- Rebuild: bypasses indexes during the load, rebuilds indexes after the load completes
  - Uses the Oracle ALTER INDEX REBUILD command
  - Indexes cannot be partitioned
- Maintenance: maintains the index during the load
  - Loads each partition sequentially
  - Table and index must be partitioned
  - Index must be local range-partitioned using the same range values used to partition the table
Alternate: Load into Indexed Tables
If the index mode options are not possible, or if you do not have the proper Oracle permissions, it is still possible to Load into an indexed table:
- Set the Oracle Enterprise stage to run sequentially
- Set the environment variable $APT_ORACLE_LOAD_OPTIONS to OPTIONS(DIRECT=TRUE,PARALLEL=FALSE)
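Set as a job-level environment variable, this looks like:

```shell
# Force a single direct-path, non-parallel SQL*Loader session,
# which allows loading into a table whose indexes remain in place
export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=TRUE,PARALLEL=FALSE)'
```

Quoting the value is important so the shell does not interpret the parentheses.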
Teradata Stage Usage
Because of limitations imposed by the Teradata utilities, it is sometimes appropriate to use plug-in stages for Teradata sources or targets
- Teradata imposes a system-wide limit on the number of concurrent database utilities
  - Can be adjusted by the DBA, but cannot be greater than 15
  - Within a parallel job, each use of the Teradata Enterprise, Teradata MultiLoad, or Teradata Load stages counts against this limit when the job is run
Which Teradata stage to use?
- Source or Target: Teradata Enterprise
  - Uses the FastExport and FastLoad utilities
  - High-volume parallel reads and writes
  - Targets are limited to Insert operations (empty table or Append)
  - Supports OPEN and CLOSE commands
Teradata Enterprise DBOptions
For Teradata instances with a large number of AMPs (VPROCs), it may be necessary to set the optional SessionsPerPlayer and RequestedSessions values in the DBOptions string of the Teradata Enterprise stage
- It is a good idea to parameterize these settings
- Syntax is:
  user=[user],password=[password],SessionsPerPlayer=nn,RequestedSessions=nn
Teradata Enterprise Sessions
RequestedSessions determines the total number of distributed connections to the Teradata source or target
- When not specified, it equals the number of Teradata VPROCs (AMPs) (your DBA can provide this)
- Can be set between 1 and the number of VPROCs
SessionsPerPlayer determines the number of connections each player will have to Teradata. Indirectly, it also determines the number of players (degree of parallelism).
- Default is 2 sessions per player
- The number selected should be such that:
  SessionsPerPlayer * number of nodes * number of players per node = RequestedSessions
Setting the value of SessionsPerPlayer too low on a large system can result in so many players that the job fails due to insufficient resources. In that case, the value of SessionsPerPlayer should be increased.
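The relationship can be verified with simple arithmetic. Here, a hypothetical 8-node configuration file with one player per node and 2 sessions per player yields 16 requested sessions:

```shell
# Illustrative check of:
#   SessionsPerPlayer * number of nodes * players per node = RequestedSessions
SESSIONS_PER_PLAYER=2
NODES=8
PLAYERS_PER_NODE=1
REQUESTED_SESSIONS=$((SESSIONS_PER_PLAYER * NODES * PLAYERS_PER_NODE))
echo "$REQUESTED_SESSIONS"   # prints 16, matching a 16-AMP Teradata system
```

Raising SessionsPerPlayer while holding RequestedSessions constant reduces the number of players, and vice versa.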
Teradata SessionsPerPlayer Example
Example settings for:
- Teradata Server: MPP with 4 TPA nodes, 4 AMPs per TPA node (16 AMPs total)
- DataStage Server:

Configuration File | Sessions Per Player | Total Sessions
16 nodes           | 1                   | 16
8 nodes            | 2                   | 16
8 nodes            | 1                   | 8
4 nodes            | 4                   | 16
Teradata Plug-Ins
Target: Teradata MultiLoad plug-in (MultiLoad utility)
- Targets allow Insert, Update, Delete, or Upsert of moderate data volumes (stage cannot run in parallel)
- Do NOT use as a source in an EE flow! (runs FastExport sequentially)
Target: Teradata MultiLoad plug-in (TPump utility)
- Targets allow Insert, Update, Delete, or Upsert of small data volumes in a large database
- Does NOT lock the target table exclusively
- Stage cannot run in parallel
Source or Target: Teradata API stage
- Does not use database utilities
- Intended for small volumes of data
- Does not count against the Teradata utility limit, but is slower than TPump
- And… cannot read in parallel (parallel writes are allowed)
Teradata Stage Usage Guidelines
Stages that use Teradata utilities (database-wide limit):
- Teradata Enterprise: always has maximum performance for high volumes of data
  - The ONLY stage that will read in parallel
  - Limited target capabilities (insert, append)
- Teradata MultiLoad: for moderate data volumes
  - Inserts, Updates, Deletes, Upserts
  - Target stage ONLY!
  - Must run sequentially
- Teradata MultiLoad (TPump option)
  - Similar to MultiLoad, but does not lock the target table exclusively
Stages that do not use Teradata utilities:
- Teradata API
SQL or DataStage?
When reading data from multiple tables in the same database, it is possible to use either SQL or DataStage for some tasks.
In general, the optimal implementation leverages the strengths of each technology:
- When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to the DataStage job
- Use a SQL JOIN to combine data from tables with small-to-medium numbers of rows, especially when the join columns are indexed
- In general, avoid SQL SORTs – the DataStage SORT is much faster and runs in parallel without the overhead of sort-merge
- Use DataStage SORT and JOIN to combine data from very large tables, or when the join condition is complex
- Avoid invoking database stored procedures (e.g. Oracle PL/SQL) on a per-row basis. Implement these routines using native DataStage components.
When the direction is not obvious, the decision is often made by actual tests, or influenced by other factors such as metadata needs and developer skill sets
For More Information
Orchestrate "OEM" documentation (available in the documentation section of the Ascential eServices public website)
- User Guide
- Operators Reference
- Record Schema
DataStage Enterprise Edition Best Practices and Performance Tuning document
Don’t be afraid to try!
DataStage Enterprise Edition
Module 04: Best Practices and Job Design Tips
Paul Christensen
Solution Architect
DataStage Enterprise Edition
Module 05: Environment Variables
Paul Christensen
Solution Architect
Understanding a Job’s UNIX Environment
Jobs inherit environment variables at runtime based on this order of evaluation:
- Environment variables defined in $DSHOME/dsenv
  - Shared by all projects on the DataStage server
- Project-level environment variables defined with DataStage Administrator
  - Duplicate variables override $DSHOME/dsenv
  - NOTE: when migrating between environments, project-level environment variables are NOT exported
- Job-level environment variables set in Job Parameters
  - Duplicate variables override $DSHOME/dsenv and project-level settings
  - Cannot be set / passed in Job Sequences (bug!)
  - To avoid hard-coding job parameters, use the special values:
    - $ENV – pulls the value from the operating system environment
    - $PROJDEF – uses the project default value
Copying Project-Level Environment Variables
Project-level environment variables are not exported when performing a full export using DataStage Manager
With care, project-level environment variables can be copied between projects by editing the DSParams file located at the top level of the project directory
- User-defined settings are near the end of this file
IMPORTANT: Always make a backup copy of the DSParams file before any manual editing.
- It is possible to render a project unusable through improper editing of DSParams
[InternalSettings]
DisableParSCCheck=0
[AUTO-PURGE]
PurgeEnabled=0
DaysOld=0
PrevRuns=0
[EnvVarValues]
"ORACLE_SID"\1\"cpaul"
"APT_SORT_INSERTION_CHECK_ONLY"\1\"1"
Environment Variables For All Jobs
The following environment variables are recommended for all jobs. Although these can be set at the project level, it is better to specify them within job properties
- Provides a runtime parameter
- Specify them in your job template(s)
$APT_CONFIG_FILE=[filepath]
$APT_DUMP_SCORE=1
$APT_RECORD_COUNTS=1
- Outputs record counts to the job log as each operator completes processing
$OSH_ECHO=1
- Outputs the generated OSH to the job log
$APT_PM_SHOW_PIDS=1
- Places UNIX process ID entries in the job log for each process started at runtime
- Does not show DataStage phantom or Server processes
$APT_BUFFER_MAXIMUM_TIMEOUT=1
- Maximum buffer delay in seconds
$APT_COPY_TRANSFORM_OPERATOR=1
- For clusters/MPP only: copies Transform operator(s) to remote nodes
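Expressed as shell settings, the recommended baseline looks like the following sketch. The configuration-file path is a site-specific placeholder, not a value from the course.

```shell
# Recommended baseline environment for all jobs
# (the config path below is a placeholder -- substitute your own)
export APT_CONFIG_FILE=/opt/datastage/configs/default.apt
export APT_DUMP_SCORE=1
export APT_RECORD_COUNTS=1
export OSH_ECHO=1
export APT_PM_SHOW_PIDS=1
export APT_BUFFER_MAXIMUM_TIMEOUT=1
export APT_COPY_TRANSFORM_OPERATOR=1   # clusters/MPP only
```

In practice these would be defined as job parameters in a job template rather than literally exported in a shell, so each run can override them.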
Job Monitoring Environment Variables
Starting with DataStage v7, the Director Job Monitor captures results on a time interval
- Captured row counts are shown in Director, Job Monitor, and Designer (Show Performance Statistics)
- This data is also stored in the DataStage repository and can be extracted using Job Control or XML reports
The following environment variables alter Job Monitor characteristics:
- $APT_MONITOR_TIME=[seconds]
  - Specifies the time interval for capturing job monitor information at runtime
- $APT_MONITOR_SIZE=[rows]
  - If set, specifies that the job monitor capture information on a row (not time) basis. This is the method used in DataStage release 6.x
- $APT_NO_JOBMON=1
  - Disables job monitoring completely – no statistics will be captured
  - In rare instances, this may improve performance
Job Design Environment Variables
$APT_STRING_PADCHAR=[char]
- Overrides the default pad character of 0x0 (ASCII NULL)
- Can be a string character or C-notation
- Used for all variable-length to fixed-length string conversions
- May have implications for some target database stages (e.g. Oracle)
$APT_DECIMAL_INTERM_PRECISION=[precision]
$APT_DECIMAL_INTERM_SCALE=[scale]
- Specify the internal precision and scale used for internal Transformer derivations
- Default precision/scale is [38,10]; maximum is [255,255]
$APT_DECIMAL_INTERM_ROUND_MODE=[mode]
- ceil: rounds toward positive infinity
  - 1.4 -> 2, -1.6 -> -1
- floor: rounds toward negative infinity
  - 1.6 -> 1, -1.4 -> -2
- round_inf: rounds or truncates to the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity
  - 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
- trunc_zero: discards any fractional digits to the right of the rightmost fractional digit supported, regardless of sign. If $APT_DECIMAL_INTERM_SCALE is smaller than the result of an internal calculation, rounds or truncates to the scale size
  - 1.56 -> 1.5, -1.56 -> -1.5
Job Debugging Environment Variables
The following environment variables can assist with debugging a job flow:
- $OSH_PRINT_SCHEMAS=1
  - Outputs the actual schema used at runtime for each dataset in a job flow. This is useful for determining whether the actual schema matches what the job designer expected.
- $APT_PM_PLAYER_TIMING=1
  - When set, prints detailed information in the job log for each operator, including CPU utilization and elapsed processing time
- $APT_PM_PLAYER_MEMORY=1
  - When set, prints detailed information in the job log for each operator when additional memory is allocated
- $APT_BUFFERING_POLICY=FORCE and $APT_BUFFER_FREE_RUN=1000
  - Used in conjunction, these two environment variables effectively isolate each operator from slowing upstream production. Using the job monitor statistics, this can identify which part of a job flow is impacting overall performance.
  - NOT recommended for production job runs!
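For a debugging run, these variables might be combined as follows; this is a sketch of a debug-only environment, to be removed before production runs:

```shell
# Debug-only settings: verbose runtime schemas and per-operator
# timing/memory reporting, plus forced buffering to isolate slow operators
export OSH_PRINT_SCHEMAS=1
export APT_PM_PLAYER_TIMING=1
export APT_PM_PLAYER_MEMORY=1
export APT_BUFFERING_POLICY=FORCE
export APT_BUFFER_FREE_RUN=1000
```

With forced buffering in place, the job monitor row counts show which operator falls behind its upstream producers.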
Buffer Environment Variables
The following environment variables may also be specified on a per-stage basis within Designer:
- $APT_BUFFERING_POLICY
- $APT_BUFFER_MAXIMUM_MEMORY
- $APT_BUFFER_FREE_RUN
- $APT_BUFFER_DISK_WRITE_INCREMENT
Sequential File Stage Environment Variables
$APT_EXPORT_FLUSH_COUNT = [nrows]
- Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for real-time applications, but there is a small performance penalty from increased I/O.
$APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS = 1 (DataStage v7.01 and later)
- Directs DataStage to reject Sequential File records with strings longer than their declared maximum column length. By default, imported string fields that exceed their maximum declared length are truncated.
$APT_IMPORT_BUFFER_SIZE, $APT_EXPORT_BUFFER_SIZE = [Kbytes]
- Define the size of the I/O buffer for Sequential File reads (imports) and writes (exports) respectively. Default is 128 (128K), with a minimum of 8. Increasing these values on heavily loaded file servers may improve performance.
$APT_CONSISTENT_BUFFERIO_SIZE = [bytes]
- In some disk array configurations, setting this variable to a value equal to the read/write size in bytes can improve performance of Sequential File import/export operations.
$APT_DELIMITED_READ_SIZE = [bytes]
- Specifies the number of bytes the Sequential File (import) stage reads ahead to find the next delimiter. The default is 500 bytes, but this can be set as low as 2.
- Should be set to a lower value when reading from streaming inputs (e.g. socket, FIFO) to avoid blocking.
$APT_MAX_DELIMITED_READ_SIZE = [bytes]
- Controls the upper bound, which is by default 100,000 bytes. When more than 500 bytes of read-ahead is desired, use this variable instead of $APT_DELIMITED_READ_SIZE.
DB2 Environment Variables
$INSTHOME = [path]
- Specifies the DB2 install directory. This variable is usually set in a user's environment from .db2profile.
$APT_DB2INSTANCE_HOME = [path]
- Used as a backup for specifying the DB2 installation directory (if $INSTHOME is undefined).
$APT_DBNAME = [database]
- Specifies the name of the DB2 database for DB2/UDB Enterprise stages if the "Use Database Environment Variable" option is True. If $APT_DBNAME is not defined, $DB2DBDFT is used to find the database name.
$APT_RDBMS_COMMIT_ROWS = [rows]
- Specifies the number of records to insert between commits. The default value is 2000.
- Can also be specified with the "Row Commit Interval" stage input property.
$DS_ENABLE_RESERVED_CHAR_CONVERT = 1
- Allows DataStage to handle DB2 databases which use the special characters # and $ in column names.
Informix Environment Variables
$INFORMIXDIR = [path]
- Specifies the Informix install directory.
$INFORMIXSQLHOSTS = [filepath]
- Specifies the path to the Informix sqlhosts file.
$INFORMIXSERVER = [name]
- Specifies the name of the Informix server matching an entry in the sqlhosts file.
$APT_COMMIT_INTERVAL = [rows]
- Specifies the commit interval in rows for Informix HPL Loads. The default is 10000.
Oracle Environment Variables
$ORACLE_HOME = [path]
- Specifies the installation directory for the current Oracle instance. Normally set in a user's environment by scripts.
$ORACLE_SID = [sid]
- Specifies the Oracle service name, corresponding to a TNSNAMES entry.
$APT_ORAUPSERT_COMMIT_ROW_INTERVAL = [num], $APT_ORAUPSERT_COMMIT_TIME_INTERVAL = [seconds]
- These two environment variables work together to specify how often target rows are committed for target Oracle stages with the Upsert method.
- Commits are made whenever the time interval has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.
$APT_ORACLE_LOAD_OPTIONS = [SQL*Loader options]
- Specifies Oracle SQL*Loader options used in a target Oracle stage with the Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE)
$APT_ORACLE_LOAD_DELIMITED = [char] (DataStage 7.01 and later)
- Specifies a field delimiter for target Oracle stages using the Load method. Setting this variable makes it possible to load fields with trailing or leading blank characters.
$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM = 1
- When set, a target Oracle stage with the Load method will limit the number of players to the number of datafiles in the table's tablespace.
$APT_ORA_WRITE_FILES = [filepath]
- Useful in debugging Oracle SQL*Loader issues. When set, the output of a target Oracle stage with the Load method is written to files instead of invoking Oracle SQL*Loader. The filepath specified by this environment variable specifies the file with the SQL*Loader commands.
$DS_ENABLE_RESERVED_CHAR_CONVERT = 1
- Allows DataStage to handle Oracle databases which use the special characters # and $ in column names.
Teradata Environment Variables
$APT_TERA_SYNC_DATABASE = [name]
- Starting with v7, specifies the database used for the terasync table. By default, EE uses the
$APT_TERA_SYNC_USER = [user]
- Starting with v7, specifies the user that creates and writes to the terasync table.
$APT_TERA_SYNC_PASSWORD = [password]
- Specifies the password for the user identified by $APT_TERA_SYNC_USER.
$APT_TERA_64K_BUFFERS = 1
- Enables 64K buffer transfers (32K is the default). May improve performance depending on network configuration.
$APT_TERA_NO_ERR_CLEANUP = 1
- Not recommended for general use. When set, may assist in job debugging by preventing the removal of error tables and the partially written target table.
$APT_TERA_NO_PERM_CHECKS = 1
- Disables permission checking on the Teradata system tables that must be readable during the Teradata Enterprise load process. This can improve the startup time of the load.
For More Information
Orchestrate "OEM" documentation (available in the documentation section of the Ascential eServices public website)
- Admin Install Guide, Chapter 11: Environment Variables
- Operators Reference
DataStage Enterprise Edition Best Practices and Performance Tuning document
DataStage Enterprise Edition
Module 05: Environment Variables
Paul Christensen
Solution Architect
DataStage Enterprise Edition
Module 06: Introduction to Performance Tuning
Paul Christensen
Solution Architect
Assumptions
This module assumes that you have an understanding of the topics covered in:
- Module 01: Parallel Framework Architecture
- Module 02: Partitioning, Collecting, and Sorting
- Module 03: Parallel Job Score
- Module 04: Best Practices and Job Design Tips
- Material covered in DS324PX: DataStage Enterprise Edition Essentials
Optimizing Performance
The ability to process large volumes of data in a short period of time requires optimizing all aspects of the job flow and environment for maximum throughput and performance:
- Job Design
- Stage Properties
- DataStage Parameters
- Configuration File
- Disk Subsystem (especially RAID arrays / SANs)
- Source and Target databases
- Network
- etc.
Enterprise Edition Performance
Within DataStage, examine (in order):
- End-to-end process flow
  - Intermediate results, sources/targets, disk usage
- DataStage configuration file(s) for each job
  - Degree of parallelism
  - Impact on overall system resources
  - File system mappings, scratch disk
- Individual job design (including shared containers)
  - Stages chosen, overall design approach
  - Partitioning strategy
  - Combination
  - Buffering (as a last resort)

Ultimate job performance may be constrained by external sources / targets (e.g. disk subsystem, network, database). It may be appropriate to scale back the degree of parallelism to conserve unused resources.
Performance Tuning Methodology
Performance tuning is an iterative process:
- Test in isolation (nothing else should be running)
  - DataStage server
  - Source and target databases
- Change one item at a time, then examine the impact
- Use the Job Score to determine:
  - Number of processes generated
  - Operator combination
  - Framework-inserted sorting and partitioning
- Use the DataStage Job Monitor to verify:
  - Data distribution (partitioning)
  - Throughput and bottlenecks
- Use UNIX system monitoring tools to determine resource utilization (CPU, memory, disk, network)
Using DataStage Director Job Monitor
- Enable “Show Instances” to show data distribution (skew) across partitions; best performance comes from an even distribution
- Enable “Show %CP” to display CPU utilization
Selectively Disabling Operator Combination
Operator combination is intended to improve overall performance and lower resource usage:
- Generally separates I/O from CPU activity

There may be instances when operator combination hurts performance:
- One process cannot use more than 100% of a CPU
- It is also a good idea to separate I/O from CPU tasks
- Use the DataStage Job Monitor to identify CPU bottlenecks
- Selectively disable combination through Designer stage properties

In unusual circumstances, disable all combination by setting $APT_DISABLE_COMBINATION=TRUE:
- Generates significantly more UNIX processes
- May negatively impact performance
Operator Combination Example
In this example, the combined operator is using 100% of a CPU. Disabling operator combination allows each stage to use more CPU, and separates I/O from CPU.

With Operator Combination, the job score has 2 operators:

  op0[1p] {(parallel APT_CombinedOperatorController:
        (FileSetIn.InStream)
        (APT_TransformOperatorImplJob_Transformer in Transformer)
        (APT_RealFileExportOperator in File_Set_6.ToOutput)
      ) on nodes ( node1[op0,p0] )}
  op1[1p] {(sequential APT_WriteFilesetExportOperator in File_Set_6.ToOutput)
      on nodes ( node1[op1,p0] )}

  It runs 2 processes on 1 node.

Without Operator Combination, the job score has 4 operators:

  op0[1p] {(parallel FileSetIn.InStream)
      on nodes ( node1[op0,p0] )}
  op1[1p] {(parallel APT_TransformOperatorImplJob_Transformer in Transformer)
      on nodes ( node1[op1,p0] )}
  op2[1p] {(parallel APT_RealFileExportOperator in File_Set_6.ToOutput)
      on nodes ( node1[op2,p0] )}
  op3[1p] {(sequential APT_WriteFilesetExportOperator in File_Set_6.ToOutput)
      on nodes ( node1[op3,p0] )}

  It runs 4 processes on 1 node.
Configuration File Guidelines
Minimize I/O overlap across nodes:
- If multiple file systems are shared across nodes, alter the order of file systems within each node definition
- Pay particular attention to the mapping of file systems to physical controllers / drives within a RAID array or SAN
- Use local disks for scratch storage if possible

Named pools can be used to further separate I/O:
- “buffer” – file systems are only used for buffer overflow
- “sort” – file systems are only used for sorting

On clustered / MPP configurations, named pools can be used to further specify resources across physical servers:
- Through careful job design, can minimize data shipping
- Specifies server(s) with database connectivity
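To make the named-pool guidance concrete, here is a sketch of a configuration file that places “sort” and “buffer” scratch space in their own pools and staggers file-system order across the node definitions. The host name and mount points are hypothetical, not taken from this course:

```
{
  node "node1" {
    fastname "etl_host"
    pools ""
    resource disk "/u1/dsdata/node1" {pools ""}
    resource scratchdisk "/u1/scratch/node1" {pools ""}
    resource scratchdisk "/u2/sortwork/node1" {pools "sort"}
    resource scratchdisk "/u3/bufwork/node1" {pools "buffer"}
  }
  node "node2" {
    fastname "etl_host"
    pools ""
    resource disk "/u2/dsdata/node2" {pools ""}
    resource scratchdisk "/u2/scratch/node2" {pools ""}
    resource scratchdisk "/u3/sortwork/node2" {pools "sort"}
    resource scratchdisk "/u1/bufwork/node2" {pools "buffer"}
  }
}
```

Note how node1 lists /u1 first while node2 lists /u2 first: altering the order this way reduces I/O overlap when both nodes are writing at once.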
Use Parallel Data Sets
Use Parallel Data Sets to land intermediate results between parallel jobs:
- Stored in native internal format (no conversion overhead)
- Retains data partitioning and sort order (end-to-end parallelism across jobs)
- Maximum performance through parallel I/O
- But, can only be used by other DataStage Enterprise Edition parallel jobs

When generating Lookup reference data to be used in subsequent jobs, use Lookup File Sets:
- Internal format, partitioned
- Pre-indexed
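The benefit of pre-indexed reference data can be sketched in plain Python (not DataStage code); the customer IDs and tier names below are invented for illustration:

```python
# Sketch (plain Python, not DataStage) of why pre-indexed reference
# data beats re-scanning it for every incoming row.
reference_rows = [
    ("CUST001", "Gold"),
    ("CUST002", "Silver"),
    ("CUST003", "Gold"),
]

# Build the index once, up front (the "pre-indexed" step) ...
index = dict(reference_rows)

# ... so every probe is a constant-time keyed lookup instead of an
# O(n) scan of the reference data per incoming row.
def lookup(customer_id, default="Unknown"):
    return index.get(customer_id, default)

print(lookup("CUST002"))  # Silver
print(lookup("CUST999"))  # Unknown
```

A Lookup File Set plays the same role across jobs: the indexing cost is paid once when the set is written, not on every run that reads it.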
Impact of Partitioning
Ensure data is as close to evenly distributed as possible
- When business rules dictate otherwise, re-partition to a more balanced distribution as soon as possible to improve performance of downstream stages

Minimize repartitions by optimizing the flow to re-use upstream partitioning
- Especially in clustered / MPP environments

Know your data
- Choose hash key columns that generate sufficient unique key combinations (while meeting business requirements)

Use SAME partitioning carefully
- Maintains degree of parallelism
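The “sufficient unique key combinations” point can be illustrated with a small Python sketch (hash partitioning simulated with Python's built-in hash; the data is made up):

```python
# Sketch (not DataStage code): why hash keys need enough unique values.
# Simulates hash partitioning across 4 partitions.
NUM_PARTITIONS = 4
rows = [(gender, "cust%04d" % i)
        for i, gender in enumerate(["M", "F"] * 500)]  # 1000 rows

def partition_counts(rows, key):
    """Count how many rows each partition would receive."""
    counts = [0] * NUM_PARTITIONS
    for row in rows:
        counts[hash(key(row)) % NUM_PARTITIONS] += 1
    return counts

# Hashing on gender alone (2 unique values): at most 2 of the
# 4 partitions ever receive data -- the rest sit idle.
skewed = partition_counts(rows, key=lambda r: r[0])

# Hashing on (gender, customer id): 1000 unique combinations,
# so rows spread far more evenly across all 4 partitions.
balanced = partition_counts(rows, key=lambda r: (r[0], r[1]))
print(skewed, balanced)
```

The skewed case caps the effective parallelism at 2 no matter how many partitions the configuration file defines, which is exactly what the Job Monitor's per-instance row counts reveal.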
Impact of Sorting
Use parallel sorts if possible (sort by key-column groups)
- Where a sequential sort is required, a parallel sort + Sort Merge collector is generally much faster than a sequential sort

Complete sorts are expensive:
- Interrupts the pipeline: rows cannot be output until all rows have been read
- Uses scratch disk for intermediate storage, unless the data set is small enough to fit in the sort buffer

Minimize and combine sorts where possible
- Use the “Don’t Sort, Previously Sorted” key-column option to leverage previous sort groupings
  - Uses much less memory
  - Outputs rows after each key-column group
- Parallel data sets maintain sort order and partitioning across jobs

Stable sorts are slower than non-stable sorts; use only when necessary

Use the “Restrict Memory Usage (MB)” option to increase the amount of memory per partition (default is 20 MB)
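The difference between a complete sort and leveraging previous sort groupings can be sketched in Python using itertools.groupby (illustrative data only):

```python
# Sketch of "Don't Sort, Previously Sorted": when input already arrives
# sorted on the key, groups can be emitted as soon as each key completes,
# instead of buffering the entire data set for a full sort.
from itertools import groupby

already_sorted = [("A", 1), ("A", 2), ("B", 5), ("B", 1), ("C", 9)]

# Complete sort: all rows must be read (and buffered) before any output.
fully_sorted = sorted(already_sorted)

# Previously-sorted input: groupby streams one key group at a time,
# holding only the current group in memory.
group_totals = {key: sum(v for _, v in grp)
                for key, grp in groupby(already_sorted, key=lambda r: r[0])}
print(group_totals)  # {'A': 3, 'B': 6, 'C': 9}
```

Like groupby, the previously-sorted option can release each key group downstream immediately, which is why it uses far less memory and does not interrupt the pipeline.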
Impact of Transformers
Minimize the number and use of Transformers:
- Consider more appropriate stages / methods: Copy, Output Mappings, Modify, Lookup
- Combine derivations from multiple Transformers
- Use stage variables to perform calculations used by multiple derivations
- Replace complex Transformers that do not meet performance requirements with BuildOps

And NEVER use the BASIC Transformer for high-volume flows!
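The stage-variable tip can be sketched in Python (hypothetical column names; this is not Transformer derivation syntax):

```python
# Sketch of the stage-variable idea: evaluate a shared expression once
# per row, then reuse it in several output derivations, rather than
# re-evaluating it inside each derivation.
def transform(row):
    # "stage variable": computed once per row
    net = row["price"] * row["qty"] - row["discount"]
    # multiple derivations reuse it without recomputation
    return {
        "net": net,
        "tax": round(net * 0.08, 2),
        "total": round(net * 1.08, 2),
    }

out = transform({"price": 10.0, "qty": 3, "discount": 2.0})
print(out)  # {'net': 28.0, 'tax': 2.24, 'total': 30.24}
```

At millions of rows, removing even one redundant expression per derivation is a measurable saving.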
Impact of Buffering
Consider the maximum row width:
- For very wide rows, it may be necessary to increase the buffer size to hold more rows in memory (default is 3 MB / partition)
- Set through stage properties, or for the entire job using $APT_BUFFER_MAXIMUM_MEMORY

Tune all other factors (job design, configuration file, disk, resources, etc.) before tuning buffer settings.

Be careful changing the buffering mode:
- Disabling buffering might cause deadlocks (job hang)

In some cases, the best solution to avoiding fork-join buffer contention may be to split the job, landing to intermediate data sets.
Isolating Buffers from Overall Performance
Buffer operators may make it difficult to identify performance bottlenecks in a job flow.

Setting the following environment variables effectively isolates each stage (by inserting buffers) and prevents the buffers from slowing down upstream stages (by spilling to disk):
- $APT_BUFFERING_POLICY=FORCE
  - Inserts buffers between each operator (isolates)
- $APT_BUFFER_FREE_RUN=1000
  - Writes excess buffer to disk instead of slowing down the producer
  - The buffer will not slow down the producer until it has written 1000 * $APT_BUFFER_MAXIMUM_MEMORY to disk

Important notes:
- These settings will generate a significant amount of disk I/O! Use configuration file “buffer” disk pools to isolate buffer file systems from scratch and resource disks
- Do NOT use these settings for production jobs!
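A minimal sketch of setting these diagnostics from a shell before launching a job; how the exports reach the engine (dsenv, job parameters, etc.) depends on your installation:

```shell
# Diagnostic use only -- NOT for production jobs.
export APT_BUFFERING_POLICY=FORCE   # insert a buffer between every pair of operators
export APT_BUFFER_FREE_RUN=1000     # spill to disk instead of throttling producers
```

With these in force, a stage that remains slow is genuinely slow itself, rather than being back-pressured by a downstream consumer.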
Other Performance Tips
Remove unneeded columns as early as possible within the flow:
- Minimizes memory usage, optimizes buffering
- Use a select list when reading from database sources
- To remove columns on an Output Mapping, disable runtime column propagation

Always specify a maximum length for VARCHAR columns: significant performance benefits.

Avoid type conversions if possible:
- Verify with $OSH_PRINT_SCHEMAS
- Always import Oracle table definitions using orchdbutil
Tuning Sequential File Performance
On heavily loaded file servers or some RAID/SAN configurations, setting these environment variables may improve performance (specify a number in KB; the default is 128):
- $APT_IMPORT_BUFFER_SIZE
- $APT_EXPORT_BUFFER_SIZE

In some disk array configurations, set the following environment variable equal to the read/write size in bytes:
- $APT_CONSISTENT_BUFFERIO_SIZE
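What a larger import buffer changes can be sketched in Python: a bigger read size means fewer, larger I/O calls over the same data. The 512 KB file and buffer sizes below are illustrative only:

```python
# Sketch: a larger read buffer means fewer, larger I/O calls over the
# same data, which matters on loaded file servers and some RAID/SANs.
import os
import tempfile

data = b"x" * (512 * 1024)  # 512 KB of sample data
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

def count_reads(buffer_size):
    """Read the whole file unbuffered, counting read() calls."""
    reads = 0
    with open(path, "rb", buffering=0) as raw:
        while raw.read(buffer_size):
            reads += 1
    return reads

small = count_reads(128 * 1024)  # 128 KB, the documented default
large = count_reads(256 * 1024)  # doubling the buffer halves the calls
os.unlink(path)
print(small, large)
```

The trade-off is memory: each sequential-file partition holds a buffer of the configured size, so raise it in measured steps rather than all at once.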
For More Information
Orchestrate “OEM” documentation (available in the documentation section of the Ascential eServices public website):
- User Guide
- Operators Reference

DataStage Enterprise Edition Best Practices and Performance Tuning document
Don’t be afraid to try!