
Transcript of ds325ee

Page 1: ds325ee

© 2004 Ascential Software Corporation. All rights reserved. Ascential is a trademark of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. Reproduction and redistribution is prohibited.

DataStage Enterprise Edition

Advanced Architecture and Best Practices

NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Last revision: June 22, 2004

Page 2: ds325ee


Welcome!

This course is intended to provide:
• An overview of the development and runtime architecture of DataStage Enterprise Edition
• Recommendations for parallel job design and best practices

There is purposely a combination of baseline and advanced material:
• Most of this information does not exist in the current course offerings or DataStage documentation
• This material will eventually be rolled into future Essentials and Advanced course offerings

Page 3: ds325ee


DataStage Enterprise Edition
Advanced Architecture and Best Practices

Agenda:

Day 1
• Module 1: Parallel Framework Architecture
• Module 2: Partitioning, Collecting, and Sorting
• Module 3: The Parallel Job Score

Day 2
• Module 4: Best Practices and Job Design Tips
• Module 5: Environment Variables
• Module 6: Introduction to Performance Tuning

Page 4: ds325ee


DataStage Enterprise Edition

Module 01: Parallel Framework Architecture

Paul Christensen, Solution Architect


Last revision: June 23, 2004

Page 5: ds325ee


Why You Need to Know This

• DataStage Client is a developer productivity tool
  - The GUI is not intended as a replacement for understanding parallel, flow-based ETL design
  - DataStage Designer includes intelligence to facilitate quick development of simple flows
  - But this is a development environment, not Visio (picture drawing)

• The key to mastering Enterprise Edition is in understanding the DataStage Parallel Framework
  - Parallel ETL is a fundamentally different process
  - Complex, high-volume flows require an understanding of the underlying engine architecture
  - For now (v7.x), you’ll ALWAYS need a copy of the “OEM” (Orchestrate) documentation: the documentation for the DataStage Parallel Framework

Page 6: ds325ee


DataStage Enterprise Edition
Parallel Framework Architecture

Page 7: ds325ee


DataStage Enterprise Edition Component Architecture

[Diagram: layered component stack, top to bottom]
• Applications: Ascential Applications (DataStage Client), Third Party Applications
• Components: Ascential Data Management Components, Ascential Data Analysis Components, Transformer/BuildOp Components, Third Party Components
• DataStage Parallel Application Framework and Runtime System
• UNIX Operating System / Networking, Parallel Hardware (SMP, Cluster, MPP)

Page 8: ds325ee


Introduction to Enterprise Edition

• Parallel processing = executing your application on multiple CPUs
• Scalable processing = adding more resources (CPUs and disks) to increase system performance

Example system containing 6 CPUs (or processing nodes) and disks:
• Run an application in parallel by executing it on 2 or more CPUs
• Scale up the system by adding more CPUs
• Can add new CPUs as individual nodes, or add CPUs to an SMP node

Page 9: ds325ee


Traditional Batch Processing

[Diagram: Source (Operational Data, Archived Data) → Transform → disk → Clean → disk → Load → Data Warehouse]

Write to disk and read from disk before each processing operation:
• Sub-optimal utilization of resources
  - a 10 GB stream leads to 70 GB of I/O
  - processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
  - disk I/O consumes the processing
  - terabytes of disk required for temporary staging

Page 10: ds325ee


Pipeline Multiprocessing

[Diagram: Source (Operational Data, Archived Data) → Transform → Clean → Load → Data Warehouse]

• Transform, clean, and load processes are executing simultaneously
• Rows are moving forward through the flow
• Start a downstream process while an upstream process is still running
• This eliminates intermediate storing to disk, which is critical for big data
• This also keeps the processors busy
• Still have limits on scalability

Think of a conveyor belt moving rows from process to process!
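The conveyor-belt idea maps directly onto ordinary UNIX pipes, which is the model OSH itself builds on. A minimal stand-in using plain coreutils (not DataStage): each downstream command starts consuming rows while the upstream command is still producing them, with no intermediate disk staging.

```shell
# Pipeline "parallelism" with plain UNIX pipes: the generate, transform,
# and aggregate steps run as concurrent processes connected by in-memory
# pipes rather than staged disk files.
seq 1 5 | awk '{ print $1 * 10 }' | sort -n | tail -1   # prints 50
```

The shell starts all four processes at once; `awk` begins transforming as soon as `seq` emits its first row, which is exactly the behavior the slide describes.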

Page 11: ds325ee


Partition Parallelism

[Diagram: Source Data divided into subset1..subset4, each processed by its own Transform on Node 1..Node 4]

• Divide large data into smaller subsets (partitions) across resources
  - Goal is to evenly distribute data
  - Some transforms require all data within the same group to be in the same partition
• Requires the same transform on all partitions
  - BUT: each partition is independent of the others; there is no concept of “global” state
• Facilitates near-linear scalability (correspondence to hardware resources)
  - 8X faster on 8 processors
  - 24X faster on 24 processors…
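A rough stand-in for this idea, with awk playing the role of the Framework on a hypothetical 4-node layout: rows are dealt round-robin across four partitions, and the same aggregate then runs independently within each partition, with no global state shared between them.

```shell
# Deal 8 rows round-robin across 4 logical partitions, then apply the
# same transform (a per-partition sum) independently in each partition.
seq 1 8 | awk '{ sum[(NR - 1) % 4] += $1 }
  END { for (p = 0; p < 4; p++) print "partition " p ": " sum[p] }'
```

Each partition's result depends only on the rows it received, which is why a real job must partition on the grouping key when a transform needs all related rows together.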

Page 12: ds325ee


Enterprise Edition Combines Partition and Pipeline Parallelism

[Diagram: Source Data → Source → Transform → Clean → Load → Data Warehouse, pipelined across stages and partitioned within each stage]

Within the Parallel Framework, pipelining and partitioning are always automatic. The job developer need only identify:
• Sequential vs. parallel operations (by stage)
• Method of data partitioning
• Configuration file (there are advanced topics here)
• Advanced per-stage options (buffer tuning, combination, etc)

Page 13: ds325ee


User assembles the flow using DataStage Designer

… at runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)

Job Design vs. Execution

No need to modify or recompile your job design!

Page 14: ds325ee


Example: Three Types of Parallelism

[Diagram: a job flow (Sort, Derivation/Sample, Lookup, Link Constraint stages) annotated with the three kinds of parallelism]

• Explicit parallelism
• Implicit pipeline "parallelism"
• Implicit data-partition parallelism

Page 15: ds325ee


Defining Parallelism

Execution mode:
• Execution mode (sequential/parallel) is controlled by stage definition and properties
• Default = parallel for most Ascential-supplied stages
• Can override the default in most cases through Advanced stage properties; examples where stage usage defines parallelism:
  - Sequential File reads (unless number of readers per node is set)
  - Sequential File targets
  - Oracle Enterprise sources (unless partition table is set)
  - others...

Degree of parallelism:
• Degree of parallelism is determined by the configuration file
• Total number of logical nodes in the nameless default pool, or nodes listed in [nodemap] or in a named [nodepool]

Page 16: ds325ee


The Parallel Configuration File

• Configuration files separate configuration (hardware/software) from job design
  - Specified per job at runtime by $APT_CONFIG_FILE
  - Alter hardware and resources without changing job design
• Defines # nodes = logical processing units with corresponding resources (need not match physical CPUs)
  - Dataset, Scratch, Buffer disk (filesystems)
  - Optional resources (eg. Database, SAS, etc)
  - Advanced topics (“pools” - named subsets of nodes)
• Multiple configuration files should be used
  - Optimize overall throughput and match job characteristics to overall hardware resources
  - Provides a runtime “throttle” on resource usage on a per-job basis
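Selecting a configuration file at runtime is just an environment variable assignment, typically done in a job parameter or a wrapper script; the paths below are hypothetical. A common pattern is to keep several files (say, a 2-node and an 8-node layout) and point each run at the one matching its workload.

```shell
# Hypothetical paths: run the same compiled job against a small or large
# configuration simply by switching $APT_CONFIG_FILE before the run.
export APT_CONFIG_FILE=/opt/Ascential/configs/2node.apt
echo "Job will run with: $APT_CONFIG_FILE"
```

No recompile is needed; the score composer reads this file at startup and sizes the job accordingly.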

Page 17: ds325ee


{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

key aspects:

1. # Nodes defined (LOGICAL processing entities)

2. Resources assigned to each node (order of entries within each node is significant!)

3. Advanced resource optimizations and configuration (named pools, database, SAS)

The Parallel Configuration File

Page 18: ds325ee


DataStage Enterprise Edition
Job Compilation

Page 19: ds325ee


DataStage Designer Parallel Canvas Job Compilation

• DataStage Designer client generates all code
  - Validates link requirements, mandatory stage options, transformer logic, etc.
• Generates OSH representation of job data flow and stages
  - GUI “stages” are representations of Framework “operators”
  - Stages in parallel shared containers are statically inserted in the job flow
  - Each server shared container becomes a dsjobsh operator
• Generates transform code for each parallel Transformer
  - Compiled on the DataStage server into C++ and then to corresponding native operators
  - To improve compilation times, previously compiled Transformers that have not been modified are not recompiled
  - Force Compile recompiles all Transformers (use after client upgrades)
• Buildop stages must be compiled manually within the GUI or using the buildop UNIX command line

[Diagram: Designer Client sends generated OSH and Transformer components to the DataStage server, where a Compile step produces C++ for each Transformer and the executable job]

Page 20: ds325ee


Viewing Generated OSH

Enable viewing of generated OSH in DS Administrator.

OSH is visible in:
- Job Properties
- Job run log
- View Data
- Table Definitions (Show Schema)

[Screenshot callouts: Schema, Operator, Comments]

Page 21: ds325ee


Example Stage / Operator Mapping

Within Designer, stages represent operators, but there is not always a 1:1 correspondence. Examples:

• Sequential File - Source: import; Target: export
• DataSet: copy
• Sort (DataStage): tsort
• Aggregator: group
• Row Generator, Column Generator, Surrogate Key Generator: generator
• Oracle - Source: oraread; Sparse Lookup: oralookup; Target Load: orawrite; Target Upsert: oraupsert
• Lookup File Set - Target: lookup -createOnly

See the “OEM” OperatorsRef.PDF

Page 22: ds325ee


Generated OSH Primer

• Designer inserts comment blocks to assist in understanding the generated OSH
  - Note that operator order within the generated OSH is the order a stage was added to the job canvas
• OSH uses the familiar syntax of the UNIX shell to create applications for DataStage Enterprise Edition
  - operator name
  - schema (for generator, import, export)
  - operator options (use “-name value” format)
  - input (indicated by n< where n is the input #)
  - output (indicated by n> where n is the output #); may include modify
• For every operator, input and/or output datasets (links) are numbered sequentially starting from 0. For example:
  op1 0> dst
  op1 1< src
• The following operator input/output data sources are generated by DataStage Designer:
  - Virtual data set (name.v)
  - Persistent data set (name.ds or [ds] name)
• The virtual data set (link) name is used to connect the output of one operator to the input of another

Example of generated OSH for the first 2 stages of this job:

  #################################################
  ## STAGE: Row_Generator_0
  ## Operator
  generator
  ## Operator options
  -schema record(
    a:int32;
    b:string[max=12];
    c:nullable decimal[10,2] {nulls=10};
  )
  -records 50000
  ## General options
  [ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
  ## Outputs
  0> [] 'Row_Generator_0:lnk_gen.v';

  #################################################
  ## STAGE: SortSt
  ## Operator
  tsort
  ## Operator options
  -key 'a'
  -asc
  ## General options
  [ident('SortSt'); jobmon_ident('SortSt'); par]
  ## Inputs
  0< 'Row_Generator_0:lnk_gen.v'
  ## Outputs
  0> [modify (keep a,b,c;)] 'SortSt:lnk_sorted.v';

Page 23: ds325ee


Terminology

Framework                  DataStage
schema                     table definition
property                   format
type                       SQL type + length [and scale]
virtual dataset            link
record / field             row / column
operator                   stage
step, flow, OSH command    job
Framework                  DS engine

• GUI uses both terminologies
• Log messages (info, warnings, errors) use Framework terms

Page 24: ds325ee


DataStage Enterprise Edition
Runtime Architecture

Page 25: ds325ee


Enterprise Edition Runtime Execution

• Generated OSH and the configuration file are used to “compose” a job SCORE, similar to the way an RDBMS builds a query optimization plan
  - Identifies degree of parallelism and node assignment for each operator
  - Inserts sorts and partitioners as needed to ensure correct results
  - Defines connection topology (datasets) between adjacent operators
  - Inserts buffer operators to prevent deadlocks (eg. fork-joins)
  - Defines the number of actual UNIX processes
  - Where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements
• The job SCORE is used to fork UNIX processes with communication interconnects for data, message, and control
  - Set $APT_PM_SHOW_PIDS to show UNIX process IDs in the DataStage log
• It is only after these steps that processing begins
  - This is the “startup overhead” of an Enterprise Edition job
• Job processing ends when
  - the last row (end of data) is processed by the final operator in the flow, (or)
  - a fatal error is encountered by any operator, (or)
  - the job is halted (SIGINT) by DataStage Job Control or human intervention (eg. DataStage Director STOP)
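Both runtime diagnostics mentioned here are plain environment variables, typically set in the job's environment or in a wrapper script before the run; a minimal sketch:

```shell
# Enable runtime diagnostics for the next job run:
#   APT_DUMP_SCORE   - write the composed job score to the job log
#   APT_PM_SHOW_PIDS - include UNIX process IDs in log entries
export APT_DUMP_SCORE=1
export APT_PM_SHOW_PIDS=1
echo "score dump: $APT_DUMP_SCORE, show pids: $APT_PM_SHOW_PIDS"
```

With both set, the log shows which operators the composer combined into single processes and which PIDs correspond to each player.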

Page 26: ds325ee


Viewing the Job SCORE

• Set $APT_DUMP_SCORE to output the score to the DataStage job log
• For each job run, 2 separate score dumps are written to the log
  - The first score is actually from the license operator
  - The second score entry is the actual job score

[Screenshot: License Operator job score above the actual job score]

Page 27: ds325ee


Example Job Score

Job scores are divided into two sections:
• Datasets: partitioning and collecting
• Operators: node/operator mapping

Both sections note sequential or parallel processing.

Why 9 UNIX processes?

Page 28: ds325ee


Job Execution: The Orchestra

[Diagram: Conductor node (C) connected to processing nodes, each running a Section Leader (SL) with Player (P) processes]

• Conductor - initial Framework process
  - Score Composer
  - Creates Section Leader processes (one per node)
  - Consolidates messages to the DataStage log
  - Manages orderly shutdown
• Section Leader (one per node)
  - Forks Player processes (one per stage)
  - Manages up/down communication
• Players
  - The actual processes associated with stages
  - Combined players: one process only
  - Send stderr, stdout to the Section Leader
  - Establish connections to other players for data flow
  - Clean up upon completion
• Default communication:
  - SMP: shared memory
  - MPP: shared memory (within a hardware node) and TCP (across hardware nodes)

Page 29: ds325ee


Runtime Control and Data Networks

$ osh "generator -schema record(a:int32) [par] | roundrobin | copy"

[Diagram: the Conductor connected to Section Leaders 0-2, each managing a generator and a copy player; legend: Control Channel/TCP, Stdout Channel/Pipe, Stderr Channel/Pipe, APT_Communicator]

Page 30: ds325ee


Parallel Data Flow

• Think of job runtime as a series of “conveyor belts” transporting rows for each link
  - If the stage is parallel, each link will have multiple independent “belts” (partitions)
• Row order is undefined (“non-deterministic”) across partitions, or across multiple links
  - Order within a particular link and partition is deterministic, based on partition type and, optionally, sort order
• For this reason, job designs cannot include “circular” references
  - eg. cannot update a source or reference used in the same flow

[Diagram: data flow with undefined order across partitions and links]

Page 31: ds325ee


DataStage Enterprise Edition
Data Types, Conversions, Nullability

Page 32: ds325ee


The Framework processes only datasets

• For external data, Enterprise Edition must perform conversion operations:
  - Format translation using data type mappings
  - May also require: recordization, columnization
• External data formats fall into two major categories:
  - Automatic: the conversion is automatic or semi-automatic
    - data stored in a relational database (DB2, Informix, Oracle, Teradata)
    - data stored in a SAS data set
    - mapping rules are documented in OperatorsRef.pdf
  - Manual: the user needs to manually specify formats
    - everything else: flat text files, binary files
    - use the Sequential File Stage

[Diagram: External Data → Conversion → DataSet format (Parallel Framework) → Conversion → External Data]

Page 33: ds325ee


Data Sets

Data Sets are the structured internal representation of data within the Parallel Framework.

• Consist of:
  - Framework schema (format = name, type, nullability)
  - Data records (data)
  - Partition (subset of rows for each node)
• Virtual Data Sets exist in-memory
  - Correspond to DataStage Designer links
• Persistent Data Sets are stored on-disk
  - Descriptor file (metadata, configuration file, data file locations, flags)
  - Multiple data files, one per node, stored in disk resource file systems
    (node1:/local/disk1/… node2:/local/disk2/…)
• There is no “DataSet” operator - the Designer GUI inserts a copy operator

Page 34: ds325ee


When to Use Persistent Data Sets

• When writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints)
  - Stored in native internal format (no conversion overhead)
  - Retain data partitioning and sort order (end-to-end parallelism across jobs)
  - Maximum performance through parallel I/O
• Data Sets are not intended for long-term or archive storage
  - Internal format is subject to change with new DataStage releases
  - Requires access to named resources (node names, file system paths, etc)
  - Binary format is platform-specific
• For fail-over scenarios, servers should be able to cross-mount filesystems
  - Can read a dataset as long as your current $APT_CONFIG_FILE defines the same NODE names (fastnames may differ)
  - orchadmin -x lets you recover data from a dataset if the node names are no longer available

Page 35: ds325ee


Caution on Using Plug-In MetaData

• DataStage Server plug-ins do not always match the data type definitions used by native Enterprise database stages
• Do not use a plug-in to import Oracle table definitions
• Instead, use ORCHDBUTIL to import Oracle table definitions

Page 36: ds325ee


Runtime Column Propagation

• Runtime Column Propagation (RCP) allows you to define only part of your table definition (schema). When RCP is enabled, if your job encounters extra columns not defined in the metadata, it will adopt these extra columns and propagate them through the rest of the job.
• RCP must be enabled at the project level (it is off by default)
  - Can then be enabled/disabled at the job level (Job Properties/Execution)
  - Can also be enabled/disabled at the stage level (Output Columns)
• RCP allows maximum re-use of parallel shared containers
  - Input and output table definitions only need the columns required by container stages. A parallel shared container can be used by multiple jobs with different schemas, as long as the core input/output columns exist.
  - Must enable RCP in every stage within the parallel shared container

Page 37: ds325ee


Output Mapping With RCP Disabled

When RCP is disabled (the default):
• DataStage Designer will enforce Stage Input Column to Output Column mappings
• At job compile time, modify operators are inserted on output links in the generated OSH

Page 38: ds325ee


Output Mapping With RCP Enabled

When RCP is enabled:
• DataStage Designer will not enforce mapping rules
• Modify is still inserted at compile time, but
  - Columns are not removed from the output
  - Columns are not renamed unless explicitly dragged to the derivation

In this example, a runtime error occurs because Name will not map to NAME (RCP maps by case-sensitive column name). You must drag the column name to the derivation column.

Page 39: ds325ee


Type Conversions

• Enterprise Edition provides numerous conversion functions between source and target data types
• Default type conversions take place across the output mappings of any parallel stage when runtime column propagation is disabled for that stage
  - Variable to fixed-length string conversions will pad the remaining length with ASCII NULL (0x0) characters
  - Use $APT_STRING_PADCHAR to change the default padding (also used by target Sequential File stages)
• Non-default type conversions require use of Transformer or Modify (recommended method)
• Look for warnings in the DataStage log to indicate unexpected conversions!
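As an illustration of what fixed-length padding does, with printf standing in for the engine's string conversion (the engine itself pads with 0x0 unless $APT_STRING_PADCHAR is set):

```shell
# A 3-character value placed in a fixed string[8] field is padded out to
# the full length; printf pads with spaces here to make the effect visible.
export APT_STRING_PADCHAR=' '   # override the engine default of ASCII NUL
printf '[%-8s]\n' "abc"          # prints [abc     ]
```

Padding with 0x0 is invisible in most viewers but very visible to downstream comparisons and database loads, which is why overriding the pad character to a space is common for target Sequential File stages.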

Page 40: ds325ee


Source Type to Target Type Conversions

Source field = row; target field = column. Column order: int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, ustring, raw, date, time, timestamp.

d = there is a default type conversion from the source field type to the destination field type.
e = you can use a Modify or a Transformer conversion function to convert from the source type to the destination type (some cells carry both d and e).
A blank cell indicates that no conversion is provided.

int8 d d d d d d d d d e d d e d e e e e

uint8 d d d d d d d d d d d d

int16 d e d d d d d d d d d d e d e

uint16 d d d d d d d d d d d e d e

int32 d e d d d d d d d d d d e d e e e

uint32 d d d d d d d d d d d e d e e

Int64 d e d d d d d d d d d d d

uint64 d d d d d d d d d d d d

sfloat d e d d d d d d d d d d d

dfloat d e d d d d d d d d d e d e d e e e

decimal d e d d d d e d d e d e d d e d e d e

string d e d d e d d d e d d d d e d e d e e e

ustring d e d d e d d d e d d d d e d e d e e

raw e e

date e e e e e e e

time e e e e e d e

timestamp e e e e e e e

Page 41: ds325ee


Enterprise Edition Nullable Data

Out-of-band: an internal data value marks a field as NULL
In-band: a specific user-defined field value indicates a NULL
- Required for Transformer processing
- Disadvantage: you must reserve a field value that cannot be used as valid data elsewhere in the flow
- Examples: a numeric field's most negative possible value; an empty string

To convert a NULL representation from out-of-band to in-band and vice versa:

- Transformer stage:
  Stage variables: IF ISNULL(linkname.colname) THEN ... ELSE ...
  Derivations: SetNull(linkname.colname)
- Modify stage:
  destinationColumnName = handle_null(sourceColumnName, value)
  destinationColumnName = make_null(sourceColumnName, value)
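The two Modify conversions above can be mirrored in a small Python sketch. Python's None stands in for the out-of-band NULL and the function names echo handle_null/make_null; this is illustrative code, not DataStage syntax:

```python
def handle_null(value, in_band):
    """Out-of-band -> in-band: a real NULL becomes the reserved value."""
    return in_band if value is None else value

def make_null(value, in_band):
    """In-band -> out-of-band: the reserved value becomes a real NULL."""
    return None if value == in_band else value

print(make_null(-9999, -9999))   # None
print(make_null(42, -9999))      # 42
print(handle_null(None, -9999))  # -9999
```

The disadvantage noted above is visible here: once -9999 is chosen as the in-band value, a legitimate -9999 in the data can no longer be distinguished from a NULL.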

Page 42: ds325ee


Null Transfer Rules

When mapping between source and destination columns of different nullability, the following rules apply:

Source Field    Destination Field   Result
not_nullable    not_nullable        Source value propagates to destination.
nullable        nullable            Source value or null propagates.
not_nullable    nullable            Source value propagates; destination value is never null.
nullable        not_nullable        WARNING messages in the log. If the source value is null, a fatal error occurs. Must handle in a Transformer or Modify stage.

Page 43: ds325ee


NULLS and Sequential Files

For NULLABLE columns, the following properties are used when reading from or writing to Sequential Files:

- null_field: a number, string, or C-style literal escape value (e.g. \xAB) that defines the NULL value representation
- null_length: a field length that indicates a NULL value (only appropriate for variable-length fields)

The null field representation can be any string, regardless of the valid values for the actual column data type.

Page 44: ds325ee


Lookup and Nullable Columns

When using Lookup with "If Not Found = Continue", unmatched output rows follow the nullability attributes of the reference link for non-key columns:

- If the non-key columns of the reference link are defined as non-nullable, the Lookup stage assigns a "default value" to unmatched records. The default value depends on the data type*. For example:
  - Integer columns default to zero
  - Varchar defaults to a zero-length string (distinctly different from a NULL value)
  - Char defaults to a fixed-length string of $APT_STRING_PADCHAR characters
- If the non-key columns of the reference link are defined as nullable, the Lookup stage will place NULL values in these columns for unmatched records

* Data type default values are documented in the OEM UserGuide.pdf

[Diagram: Lookup stage with "If Not Found = Continue"; unmatched rows follow the nullability attributes of non-key reference link columns]

TIP: When changing column attributes, be careful to propagate the change through the remaining links of your job design (including the output column definition of the Lookup stage in this example).

Page 45: ds325ee


Outer JOINs and Nullable Columns

Similar to Lookup, when performing an OUTER JOIN (Left Outer, Right Outer, Full Outer), unmatched output rows follow the nullability attributes of the corresponding outer link(s):

- If the non-key columns of the outer link(s) are defined as non-nullable, the Join stage assigns a "default value" to unmatched records, based on their data type
- If the non-key columns of the outer link(s) are defined as nullable, the Join stage will place NULL values in these columns for unmatched records

[Diagrams: Left Outer JOIN (Left and Right inputs) and Full Outer JOIN; unmatched rows follow the nullability attributes of non-key outer link columns]

Page 46: ds325ee


Transformer and Null Expressions

Within a parallel Transformer, any expression that includes a NULL value will produce a NULL result:
- 1 + NULL = NULL
- "John" : NULL : "Doe" = NULL

When the result of a link constraint or output derivation is NULL, the Transformer will output that row to its reject link (dashed line):
- Always create a Transformer reject link in DataStage Designer
- Always test for null values before using them in an expression: IF ISNULL(link.col) THEN ... ELSE ...
- Use stage variables if an expression is re-used

v7 Transformer now warns when rows reject

v7 also clarifies naming of output link constraints ("Otherwise")
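The NULL-propagation and reject behavior above can be sketched in Python, with None standing in for NULL. The function names are illustrative only, not DataStage code:

```python
def null_concat(*parts):
    """The ':' concatenation operator: any NULL operand poisons the result."""
    return None if any(p is None for p in parts) else "".join(parts)

def route_row(row, constraint):
    """A row whose constraint evaluates to NULL goes to the reject link."""
    result = constraint(row)
    if result is None:
        return "reject"
    return "output" if result else "drop"

print(null_concat("John", " ", "Doe"))   # John Doe
print(null_concat("John", None, "Doe"))  # None

# Without an ISNULL guard, a NULL column makes the constraint itself NULL,
# so the row is rejected rather than evaluated:
print(route_row({"col": None},
                lambda r: None if r["col"] is None else r["col"] > 0))  # reject
```

This is why the slide insists on testing ISNULL before using a column: the guarded form returns a real True/False instead of NULL, so the row is routed deliberately instead of falling to the reject link.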

Page 47: ds325ee


Framework “OEM” Documentation

UserGuide.PDF
- Covers framework architecture, parallel processing, partitioning/collecting data, data sets, data types, conversion functions, OSH
- Also includes detailed documentation on buildops

OperatorsRef.PDF
- Detailed reference for every built-in operator

RecordSchema.PDF
- Format of the Framework schema definition (including import, export, generator)

DevGuide.PDF, HeaderSorted.PDF, ClassSorted.PDF
- Low-level Orchestrate C++ APIs for building custom operators

Available in the documentation section ("Orchestrate") of the Ascential eServices public website

Page 48: ds325ee


For More Information

Framework "OEM" Documentation: User Guide, Operators Reference, Record Schema

DataStage Enterprise Edition Best Practices and Performance Tuning document

PLEASE send your comments and feedback to: [email protected]

Don't be afraid to try!

Page 49: ds325ee


DataStage Enterprise Edition

Module 01: Parallel Framework Architecture

Paul Christensen, Solution Architect


Last revision: June 23, 2004

Page 50: ds325ee


DataStage Enterprise Edition

Module 02: Partitioning, Collecting, and Sorting Data

Paul Christensen, Solution Architect


Last revision: June 22, 2004

Page 51: ds325ee


Partitioners, Collectors, and Sorting

Partitioners distribute rows of a single link into smaller segments that can be processed independently in parallel
- Occur ONLY before parallel stages

Collectors combine parallel partitions of a single link for sequential processing
- Occur ONLY before sequential stages

Sorting is used to arrange rows into specific groupings and order
- May be parallel or sequential

[Diagram: a stage running sequentially feeds a partitioner into stages running in parallel; a collector gathers parallel partitions back for sequential processing]

Page 52: ds325ee


Partitioning and Collecting Icons

"Fan-Out" Partitioner icon: Sequential to Parallel
"Fan-In" Collector icon: Parallel to Sequential

NOTE: Partitioner and Collector icons are ALWAYS drawn "Left to Right", regardless of how the link is drawn!

Page 53: ds325ee


Partitioning Data

[Diagram: stage running in parallel -> partitioner -> stage running in parallel]

Page 54: ds325ee


Partitioners

Partitioners distribute rows of a single link (data set) into smaller segments that can be processed independently in parallel.

Partitioners exist before ANY parallel stage. The previous stage may be running:
- Sequentially: results in a "fan-out" operation (and link icon)
- In parallel: if the partitioning method changes, data is repartitioned

[Diagrams: parallel stage -> partitioner -> parallel stage; sequential stage -> partitioner (fan-out) -> parallel stage; parallel stage -> repartitioning icon -> parallel stage]

Page 55: ds325ee


Partition Numbers and Director Job Log

At runtime, the Parallel Framework determines the degree of parallelism for each stage using:
- $APT_CONFIG_FILE
- (optionally) a stage's node pool (Advanced properties)

Partitions are assigned numbers, starting at zero.
- The partition number is appended to the stage name for messages written to the DataStage Director job log

[Screenshot: Director job log entries showing stage name with partition # appended]

Page 56: ds325ee


System Variables for Parallel Derivations

To facilitate parallel calculations regardless of the actual runtime configuration, system variables are provided in the Column/Row Generator and Transformer stages.

Within Column/Row Generator, two reserved words are provided for numeric cycles:
- part: actual partition # (starts at zero)
- partcount: total number of partitions at runtime

Starting with v7.1, the Surrogate Key Generator stage can generate a sequence of integer values in parallel:
- Internally similar to using a Column Generator stage with the part and partcount keywords
- Also supports an initial value for the sequence(s)

Within the Transformer, the @INROWNUM system variable is generated for each node. Instead, use:
- @PARTITIONNUM: actual partition number (starts at zero)
- @NUMPARTITIONS: total number of partitions

Example Generator sequence: Type = Cycle, Initial value = part, Increment = partcount

For a 4-node configuration file: @NUMPARTITIONS = 4, @PARTITIONNUM = 0 through 3

Assuming incoming data is round-robin partitioned:

Row#  Part  Partcount  Result
1     0     4          0    (initial values)
2     1     4          1
3     2     4          2
4     3     4          3
5     0     4          4    (first increment)
6     1     4          5
7     2     4          6
8     3     4          7
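The cycle above (Initial value = part, Increment = partcount) can be simulated in a short Python sketch that reproduces the table for a 4-node configuration. This is a conceptual model of the arithmetic, not framework code:

```python
def generated_value(part, row_in_part, partcount):
    """Cycle with Initial value = part and Increment = partcount:
    each partition emits part, part + partcount, part + 2*partcount, ...
    which is globally unique across partitions."""
    return part + row_in_part * partcount

partcount = 4               # 4-node configuration file
results = []
for row in range(8):        # 8 incoming rows, round-robin partitioned
    part = row % partcount          # partition the row lands on
    idx = row // partcount          # the row's position within that partition
    results.append(generated_value(part, idx, partcount))
print(results)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

No two partitions can ever collide: partition p only emits values congruent to p modulo partcount, which is why this pattern yields unique surrogate keys in parallel.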

Page 57: ds325ee


Selecting a Partitioning Method

Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition
- Ensures that processing is evenly distributed across nodes
- Greatly varied partition sizes (skew) increase processing time

Enable "Show Instances" in the DataStage Director Job Monitor to show data distribution (skew) across partitions.

Setting the environment variable $APT_RECORD_COUNTS outputs row counts per partition to the DataStage log as each stage/node (operator) completes processing.

Page 58: ds325ee


Selecting a Partitioning Method

Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition
- Ensures that processing is evenly distributed across nodes
- Greatly varied partition sizes (skew) increase processing time

Objective 2: The partition method MUST match the stage logic, assigning related records to the same partition if required
- Applies to any stage that operates on groups of related data (often using key columns)
- Examples: Aggregator, Join, Merge, Sort, Remove Duplicates, etc. (perhaps also Transformers, Buildops)
- The partitioning method needed to ensure correct results may violate Objective 1, depending on actual data distribution

Objective 3: The partition method should not be overly complex
- Use the simplest method that meets Objectives 1 and 2
- If possible, leverage partitioning performed earlier in the flow

Page 59: ds325ee


Specifying Partitioning

Partitioning method is defined on the Input properties, Partitioning category, of any stage running in parallel

Page 60: ds325ee


Partitioning Methods

Keyless Partitioning: rows are distributed independent of actual data values
- Same: existing partitioning is not altered
- Round Robin: rows are evenly alternated among partitions
- Random: rows are randomly assigned to partitions
- Entire: each partition gets the entire data set (rows are duplicated)

Keyed Partitioning: rows are distributed at runtime based on values in specified key column(s)
- Hash: rows with the same key column value(s) go to the same partition
- Modulus: assigns each row to a partition based on a specified numeric key column in the input data set
- Range: similar to Hash, but partition mapping is user-determined and partitions are ordered
- DB2: matches DB2 EEE partitioning (discussed in the database chapter)

Auto (the default method): DataStage EE chooses an appropriate partitioning method
- Round Robin, Same, or Hash are most commonly chosen

Page 61: ds325ee


SAME Partitioning

[Diagram: row IDs in partitions {0,3,6}, {1,4,7}, {2,5,8} pass through unchanged; SAME partitioning icon]

Keyless partitioning method
- Rows retain their current distribution and order from the output of the previous parallel stage
- Doesn't move data between nodes
- Retains "carefully partitioned" data (such as the output of a previous sort)
- Fastest partitioning method (no overhead)

Page 62: ds325ee


Impact of SAME Partitioning

Don't over-use SAME partitioning in a job flow
- Because SAME does not alter existing partitions, the degree of parallelism remains unchanged in the downstream stage
- If you read a Sequential File using SAME partitioning (without specifying the Readers Per Node option), the downstream stage will run sequentially!
- If you read a persistent Data Set using SAME partitioning, the downstream stage runs with the degree of parallelism used to create the data set, regardless of the current $APT_CONFIG_FILE / specified node pool

Page 63: ds325ee


Round Robin and Random Partitioning

Keyless partitioning methods
- Rows are evenly distributed across partitions
- Good for initial import of data if no other partitioning is needed
- Useful for redistributing data
- Fairly low overhead

Round Robin assigns rows to partitions like dealing cards
- Row/partition assignment will be the same for a given $APT_CONFIG_FILE

Random distributes rows in random order
- Higher overhead than Round Robin
- Not subject to regular patterns that might exist in the source data
- Row/partition assignment will differ between runs of the same input data

[Diagram: rows ...8 7 6 5 4 3 2 1 0 dealt Round Robin into partitions {0,3,6}, {1,4,7}, {2,5,8}]
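The Round Robin dealing shown in the diagram can be sketched in a few lines of Python. A conceptual model only, not framework code:

```python
def round_robin(rows, partcount):
    """Deal rows to partitions like cards: row i goes to partition i % partcount.
    For a given partition count the assignment is deterministic."""
    partitions = [[] for _ in range(partcount)]
    for i, row in enumerate(rows):
        partitions[i % partcount].append(row)
    return partitions

# Nine rows dealt to three partitions, as in the diagram above.
parts = round_robin(list(range(9)), 3)
print(parts)  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```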

Page 64: ds325ee


Parallel Runtime Example

Remember: row order is undefined (non-deterministic) across partitions, or across multiple links.

Consider this example job: a Row Generator produces 10 rows {A: Integer, initial_value=1, incr=1}, feeding a downstream stage with Round Robin partitioning.

Round Robin partitioning distributes rows in a specific order to the number of nodes at runtime; for the same input data and $APT_CONFIG_FILE, rows are distributed in the same order. But the order in which a particular node outputs its results may change with each run.

Results with a 4-node $APT_CONFIG_FILE:
- Node 0: 1, 5, 9
- Node 1: 2, 6, 10
- Node 2: 3, 7
- Node 3: 4, 8

Page 65: ds325ee


ENTIRE Partitioning

Keyless partitioning method
- Each partition gets a complete copy of the data
- Useful for distributing lookup and reference data
- May have a performance impact in MPP / clustered environments
- On SMP platforms, the Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data

ENTIRE is the default partitioning for Lookup reference links with "Auto" partitioning
- On SMP platforms, it is good practice to set this explicitly on the Normal Lookup reference link(s)

[Diagram: rows ...8 7 6 5 4 3 2 1 0 -> ENTIRE -> every partition receives the full sequence 0, 1, 2, 3, ...]

Page 66: ds325ee


HASH Partitioning

Keyed partitioning method
- Rows are distributed according to the values in one or more key columns
- Guarantees that rows with an identical combination of values in the key column(s) are assigned to the same partition
- Needed to prevent matching rows from "hiding" in other partitions (e.g. Join, Merge, RemDup, ...)
- Partition size will be relatively equal if the data across the source key column(s) is evenly distributed

[Diagram: key values ...0 3 2 1 0 2 3 2 1 1 -> HASH -> all 0s and 3s land in one partition, all 1s in another, all 2s in a third]
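A conceptual Python sketch of keyed HASH partitioning follows. Python's built-in hash() stands in for the framework's internal hash function, so the actual partition numbers differ from DataStage's, but the invariant is the same: identical key values always land together:

```python
def hash_partition(rows, keys, partcount):
    """Rows with an identical combination of values in the key column(s)
    always land in the same partition; partition sizes follow the
    key-value distribution, so low key cardinality limits parallelism."""
    partitions = [[] for _ in range(partcount)]
    for row in rows:
        key = tuple(row[k] for k in keys)
        partitions[hash(key) % partcount].append(row)
    return partitions

# Hash on LName with 4 partitions and only two distinct last names:
# at most 2 of the 4 partitions ever receive rows.
rows = [
    {"ID": 1, "LName": "Ford",  "FName": "Henry"},
    {"ID": 2, "LName": "Ford",  "FName": "Clara"},
    {"ID": 5, "LName": "Dodge", "FName": "Horace"},
    {"ID": 6, "LName": "Dodge", "FName": "John"},
]
parts = hash_partition(rows, ["LName"], 4)
assert sum(1 for p in parts if p) <= 2
```

Adding FName to the key list raises the number of distinct key values and therefore the number of partitions that can receive data, which is exactly the effect shown on the next two slides.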

Page 67: ds325ee


Hash Key Selection

HASH ensures that rows with the same combination of all key column values are assigned to the same partition.

Hash on LName with a 4-node config file distributes the data as follows:

Source Data:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

Partition 1:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

Partition 0:
ID  LName  FName    Address
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard

NOTE: Partition distribution matches the source data distribution. In this example, the number of distinct hash key values limits parallelism!

Page 68: ds325ee


Another Hash Key Example

Using the same source data, hash on LName, FName with a 4-node config file:

Part 3:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
7   Ford   Henry    4901 Evergreen

Part 2:
ID  LName  FName    Address
4   Ford   Eleanor  7900 Jefferson
6   Dodge  John     75 Boston Boulevard
10  Ford   Eleanor  1100 Lakeshore

Part 1:
ID  LName  FName    Address
3   Ford   Edsel    7900 Jefferson
5   Dodge  Horace   17840 Jefferson
9   Ford   Edsel    1100 Lakeshore

Part 0:
ID  LName  FName    Address
2   Ford   Clara    66 Edison Avenue
8   Ford   Clara    4901 Evergreen

NOTE: Improved distribution. Only rows with the same unique combination of key column values appear in the same partition.

For partitioning purposes, the order of HASH key columns is insignificant.
NOTE: To avoid repartitioning, key column order should be consistent across stages with the same keys.

Page 69: ds325ee


Modulus Partitioning

Keyed partitioning method
- Rows are distributed according to the values in one integer key column
- Simpler (and faster) calculation than HASH, using the modulus (remainder) of division: partition = MOD(key_value, #partitions)
- Guarantees that rows with identical values in the key column end up in the same partition
- Partition size will be relatively equal if the data within the key column is evenly distributed

[Diagram: key values ...0 3 2 1 0 2 3 2 1 1 -> MODULUS with 3 partitions -> 0s and 3s (MOD 3 = 0) share a partition, 1s in another, 2s in the third]
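The modulus calculation can be sketched directly in Python. A conceptual model, not framework code:

```python
def modulus_partition(rows, key, partcount):
    """partition = MOD(key_value, number_of_partitions); integer keys only.
    Identical key values always land in the same partition."""
    partitions = [[] for _ in range(partcount)]
    for row in rows:
        partitions[row[key] % partcount].append(row)
    return partitions

# The slide's key values with a 4-node configuration:
rows = [{"k": v} for v in [0, 3, 2, 1, 0, 2, 3, 2, 1, 1]]
parts = modulus_partition(rows, "k", 4)
print([len(p) for p in parts])  # [2, 3, 3, 2]
```

Unlike HASH, no hash function is computed, which is why MODULUS is the cheaper choice when the group key is a single integer column.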

Page 70: ds325ee


RANGE Partitioning

Keyed partitioning method
- Rows are evenly distributed according to the values in one or more key columns
- Requires "pre-processing" the data to generate a range map
  - More expensive than HASH partitioning: the entire data must be read TWICE to guarantee results
- Guarantees that rows with identical values in the key columns end up in the same partition
- The "Write Range Map" stage is used to generate the range map file
  - If the source data distribution is consistent over time, it may be possible to re-use the map file
- Values outside of a given range map land in the first or last partition as appropriate

[Diagram: key values 4 0 5 1 6 0 5 4 3 -> RANGE (via range map file) -> ordered partitions {0,1,0}, {4,4,3}, {5,6,5}]

QUIZ: If incoming data is ordered on the key, something bad happens. WHAT?
ANSWER: The process runs sequentially (key value adjacency)!
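A range map can be modeled as a sorted list of boundary key values. The Python sketch below (conceptual only, not framework code) reproduces the diagram's three ordered partitions:

```python
import bisect

def range_partition(rows, key, boundaries):
    """Assign each row to an ordered partition via a pre-computed range map.
    boundaries is a sorted list of key values; a key equal to a boundary
    goes to the lower partition, and keys beyond either end of the map
    fall into the first or last partition."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        partitions[bisect.bisect_left(boundaries, row[key])].append(row)
    return partitions

# Range map with boundaries at 2 and 4 -> three ordered partitions,
# matching the slide's example key values.
rows = [{"k": v} for v in [4, 0, 5, 1, 6, 0, 5, 4, 3]]
parts = range_partition(rows, "k", [2, 4])
print([[r["k"] for r in p] for p in parts])  # [[0, 1, 0], [4, 4, 3], [5, 6, 5]]
```

The quiz answer is also visible in this model: if rows arrive sorted on the key, they fill partition 0 completely, then partition 1, and so on, so at any moment only one partition is receiving data and the flow degrades to sequential processing.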

Page 71: ds325ee


Example Partitioning Icons

- "fan-out" icon: Sequential to Parallel
- AUTO partitioner icon
- re-partitioning icon (watch for this!)
- SAME partitioner icon

Page 72: ds325ee


Automatic Partitioning

By default, the Parallel Framework inserts partitioning components as necessary to ensure correct results (check the job score):
- Before any stage with "Auto" partitioning
- Generally chooses between ROUND ROBIN or SAME
- Inserts HASH on stages that require matched key values (e.g. Join, Merge, RemDup)
- Inserts ENTIRE on Normal (not Sparse) Lookup reference links
  - NOT always appropriate for MPP/clusters

Since the Framework has limited awareness of your data and business rules, it is usually best to explicitly specify HASH partitioning when key groupings are required:
- The Framework has no visibility into Transformer logic
- Required before SORT and AGGREGATOR (hash method) stages
- The Framework may insert un-needed or non-optimal partitioning

Page 73: ds325ee


Preserve Partitioning Flag

The "preserve partitioning" flag is designed for stages that use "Auto" partitioning. The flag has 3 possible settings:
- Set: instructs downstream stages to attempt to retain partitioning and sort order
- Clear: downstream stages need not retain partitioning and sort order
- Propagate: passes (if possible) the flag setting from input to output links

It is set automatically by some operators (e.g. Sort, Hash partitioning), and can be manually set by users through the stage Advanced properties.

It is functionally equivalent to explicitly specifying SAME partitioning, but allows the Parallel Framework to over-ride and optimize for performance (e.g. if the degree of parallelism differs).

The Preserve Partitioning setting is part of the data set structure, in memory (virtual) and on disk (persistent).

At runtime, if the Preserve Partitioning flag is set and a downstream operator cannot use the previous partitioning, a warning is issued.

Page 74: ds325ee


Summary: Partitioning Strategy

Use HASH partitioning when a stage requires grouping of related values
- Specify only the key columns that are necessary for correct grouping (as long as the number of unique values is sufficient)
- Use MODULUS if the group key is a single Integer column
- RANGE may be appropriate in rare instances when data distribution is uneven but consistent over time
- Know your data! How many unique values are in the hash key column(s)?

If grouping is not required, use ROUND ROBIN to redistribute data equally across all partitions
- The Framework will often do this with AUTO partitioning

Try to optimize partitioning for the entire job flow

Page 75: ds325ee


Job SCORE: Data Sets

The Job SCORE can be used to verify the partitioning and collecting methods used at runtime
- Partitioners and Collectors are associated with data sets (top portion of the SCORE)
- Data sets connect a source and a target: operator(s) (see the lower portion of the SCORE) and/or persistent data set(s)
- The partitioner / collector method is shown between the source and target

Page 76: ds325ee


Interpreting the Job Score - Partitioning

The DataStage Parallel Framework implements a producer-consumer data flow model: upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets).

The partitioning method is associated with the producer; the collector method is associated with the consumer. The two are separated by an indicator:

->  Sequential to Sequential
<>  Sequential to Parallel
=>  Parallel to Parallel (SAME)
#>  Parallel to Parallel (not SAME)
>>  Parallel to Sequential
>   No producer or no consumer

The score may also include a [pp] notation when the Preserve Partitioning flag is set.

Page 77: ds325ee


Optimizing Partitioning

Minimize the number of re-partitions within and across job flows.

Within a flow:
- Examine up-stream partitioning and sort order and attempt to preserve them for down-stream stages using SAME partitioning
- May require re-examining key column usage within stages and the processing (stage) order

Across jobs, through a persistent data set:
- Data sets retain partitioning AND sort order across flows
- If sort order is significant, write to a persistent data set with the Preserve Partitioning flag SET
- Useful if downstream jobs are run with the same degree of parallelism and require the same partition and sort order

Page 78: ds325ee


Collecting Data

[Diagram: parallel partitions passing through a collector into a stage running sequentially]

Page 79: ds325ee


Collectors combine partitions of a dataset into a single input stream to a sequential Stage

[Diagram: data partitions (NOT links) feeding a collector, which feeds a sequential stage]

Page 80: ds325ee


Specifying Collector Type

Collector method is defined on the Input properties, Partitioning category, of any stage running sequentially when the previous stage is running in parallel

[Diagram: a stage running in parallel linked to a stage running sequentially; the link shows a collector icon]

Page 81: ds325ee


Collector Methods

(Auto)
- Eagerly reads any row from any input partition
- Output row order is undefined (non-deterministic)
- This is the default collector method

Round Robin
- Patiently picks a row from each input partition in round robin order
- Slower than Auto, rarely used

Ordered
- Reads all rows from the first partition, then the second, and so on
- Preserves the order that exists within partitions

Sort Merge
- Produces a single (sequential) stream of rows sorted on the specified key column(s), for input already sorted on those keys
- Row order is undefined for non-key columns
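The Sort Merge behavior can be sketched in a few lines of Python (a concept illustration only, not DataStage code; the partition contents here are invented):

```python
import heapq

# Concept sketch (not DataStage internals): a Sort Merge collector reads rows
# from already-sorted partitions in key order, producing one sorted sequential
# stream without performing a sort itself.
partitions = [
    [1, 4, 9],    # partition 0, already sorted on the key
    [2, 3, 10],   # partition 1
    [5, 6, 7],    # partition 2
]
collected = list(heapq.merge(*partitions))  # -> [1, 2, 3, 4, 5, 6, 7, 9, 10]
```

If the partitions were not sorted on the key, the merged output would not be sorted either, which mirrors the requirement stated above.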

Page 82: ds325ee


Choosing a Collector Method

In most instances, the Auto collector (the default) is the fastest and most efficient method of collecting data into a sequential stream.

To generate a single stream of sorted data, use the Sort Merge collector on previously-sorted input:
- Input data must be sorted on the collector keys to produce a sorted result
- Sort Merge does not perform a sort; it simply defines the order in which rows are read from all partitions, using the values in one or more key columns

The Ordered collector is only appropriate when sorted input has been range-partitioned:
- No sort is required to produce sorted output

The Round Robin collector can be used to reconstruct the original (sequential) row order for round-robin partitioned inputs:
- As long as intermediate processing (e.g. sort, aggregator) has not altered row order or reduced the number of rows
- Rarely used in real-life scenarios
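The round-robin reconstruction property can be sketched as follows (illustrative only; the row values and partition count are invented):

```python
# Concept sketch: rows partitioned round-robin can be re-collected round-robin
# to reconstruct the original sequential row order, provided intermediate
# processing has not reordered or dropped any rows.
rows = list(range(10))
nparts = 3
parts = [rows[i::nparts] for i in range(nparts)]        # round robin partitioner
restored = [parts[i % nparts][i // nparts]               # round robin collector
            for i in range(len(rows))]
```

Any stage between the partitioner and the collector that drops or reorders rows breaks this round trip, which is why the technique is rarely usable in practice.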

Page 83: ds325ee


Collectors vs. Funnels

Collector:
- Operates on a single, partitioned link (a single virtual data set)
- Consolidates partitions as the input to a sequential stage
- Always identified by a “fan-in” link icon

Funnel stage:
- A stage that runs in parallel
- Merges data from multiple links (multiple virtual data sets) into a single output link
- Table Definitions (schema) of all input links must match

Don’t confuse a collector with a Funnel stage!

[Diagram: a Funnel stage next to a collector “fan-in” link icon]

Page 84: ds325ee


Sorting Data

Page 85: ds325ee


Traditional (Sequential) Sort

Traditionally, the process of sorting data uses one primary key column and (optionally) multiple secondary key columns to generate a sequential, ordered result set:
- The order of the key columns determines the sequence (and the groupings)
- Each key column specifies an ascending or descending sort order

This is the method SQL uses for an ORDER BY clause.

Source Data:
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Sort on: LName (asc), FName (desc)

Sorted Result:
  ID  LName  FName    Address
  6   Dodge  John     75 Boston Boulevard
  5   Dodge  Horace   17840 Jefferson
  1   Ford   Henry    66 Edison Avenue
  7   Ford   Henry    4901 Evergreen
  4   Ford   Eleanor  7900 Jefferson
  10  Ford   Eleanor  1100 Lakeshore
  3   Ford   Edsel    7900 Jefferson
  9   Ford   Edsel    1100 Lakeshore
  2   Ford   Clara    66 Edison Avenue
  8   Ford   Clara    4901 Evergreen

Page 86: ds325ee


Parallel Sort

In most cases, there is no need to globally sort data to produce a single sequence of rows

Instead, sorting is most often needed to establish order within specified groups of data:
- Join, Merge, Aggregator, RemDup, etc.
- This sort can be done in parallel!

Partitioning is used to gather related rows:
- Assigns rows with the same key column value(s) to the same partition

Sorting is used to establish grouping and order within each partition, based on one or more key column(s):
- Key values become adjacent

Partition and sort keys need not be the same! This is often the case before Remove Duplicates.
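The partition-then-sort pattern can be sketched in Python (a concept illustration only; the partition-assignment function below is an invented, deterministic stand-in for the real hash):

```python
# Concept sketch: hash partitioning gathers rows with the same key into the
# same partition; a per-partition sort then makes key values adjacent.
# Order exists WITHIN each partition's groups; there is no global order.
rows = [("Ford", "Henry"), ("Dodge", "John"), ("Ford", "Clara"),
        ("Dodge", "Horace"), ("Ford", "Edsel")]
nparts = 3

def part_of(key):
    # Invented stand-in for a hash partitioner (deterministic for the example)
    return sum(ord(c) for c in key) % nparts

partitions = [[] for _ in range(nparts)]
for row in rows:
    partitions[part_of(row[0])].append(row)   # same LName -> same partition
for p in partitions:
    p.sort()                                  # each partition sorted independently
```

The guarantee that matters downstream is that all rows sharing a key land in one partition and sit next to each other after the sort.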

Page 87: ds325ee


Example Parallel Sort

Using the same source data, hash partition on LName, FName (4-node config):

  Part 3:
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  7   Ford   Henry    4901 Evergreen

  Part 2:
  ID  LName  FName    Address
  4   Ford   Eleanor  7900 Jefferson
  6   Dodge  John     75 Boston Boulevard
  10  Ford   Eleanor  1100 Lakeshore

  Part 1:
  ID  LName  FName    Address
  3   Ford   Edsel    7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  9   Ford   Edsel    1100 Lakeshore

  Part 0:
  ID  LName  FName    Address
  2   Ford   Clara    66 Edison Avenue
  8   Ford   Clara    4901 Evergreen

Within each partition, sort using LName, FName (a parallel Sort runs in every partition):

  Part 3:
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  7   Ford   Henry    4901 Evergreen

  Part 2:
  ID  LName  FName    Address
  6   Dodge  John     75 Boston Boulevard
  4   Ford   Eleanor  7900 Jefferson
  10  Ford   Eleanor  1100 Lakeshore

  Part 1:
  ID  LName  FName    Address
  5   Dodge  Horace   17840 Jefferson
  3   Ford   Edsel    7900 Jefferson
  9   Ford   Edsel    1100 Lakeshore

  Part 0:
  ID  LName  FName    Address
  2   Ford   Clara    66 Edison Avenue
  8   Ford   Clara    4901 Evergreen

Page 88: ds325ee


Stages that require Sorted Data

Stages that process data in groups:
- Aggregator
- Remove Duplicates
- Compare (perhaps)
  - If only comparing values, not order, between two sources
- Transformer, BuildOp (perhaps)
  - Depending on internal stage-variable logic

“Lightweight” stages that minimize memory usage by requiring data in key-column sort order:
- Join
- Merge
- Sort Aggregator

Page 89: ds325ee


Parallel (Grouped) Sorting Methods

DataStage Designer provides two methods for parallel (grouped) sorting:
- A Sort stage, in parallel execution mode

OR

- A sort specified on a link, when partitioning is not Auto
  - Links with a sort defined will show a sort icon

By default, both methods use the same internal sort package (tsort operator).

Page 90: ds325ee


Sorting on a Link

Easier job maintenance (fewer stages on the job canvas), BUT fewer options (tuning, features).

Right-click on a key column to specify sort options.

Specify key usage for Sorting, Partitioning, or Both.

Page 91: ds325ee


Sort Stage

The Sort stage offers more options than a link sort.

Always specify the “DataStage” Sort Utility (much faster than UNIX sort).

Page 92: ds325ee


Stable Sorts

Stable sort preserves the order of non-key columns within each sort group

Stable sorts are slightly slower than non-stable sorts for the same data and keys:
- Only use a stable sort when needed

By default, stable sort is enabled on Sort stages!
- Stable sort is not the default for link sorts

Page 93: ds325ee


Resorting on Sub-Groups

Use the Sort Key Mode property to re-use key column groupings from previous sorts:
- Uses significantly less memory and disk!
- The sort now operates on previously-sorted key-column groups, not the entire data set
- Outputs rows after each group

Key column order is important! Don’t forget to retain the incoming sort order (e.g. SAME partitioning).
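The per-group re-sort idea can be sketched as follows (illustrative only; not the tsort implementation, and the row values are invented):

```python
from itertools import groupby

# Concept sketch: input is already sorted on the first key, so only each
# existing key group is re-sorted on the secondary key. Memory is bounded by
# the largest group, not the whole data set, and rows are emitted per group.
rows = [("A", 3), ("A", 1), ("B", 2), ("B", 0)]   # already grouped on column 0
out = []
for _, group in groupby(rows, key=lambda r: r[0]):
    out.extend(sorted(group, key=lambda r: r[1]))  # sort within the group only
```

Note that `groupby` only works because the input is pre-grouped on the first key, which is exactly the incoming order the slide says must be retained.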

Page 94: ds325ee


Partitioning and Sort Order

When partitioning data (except for SAME), sort order is not maintained

To restore row order / groupings, a sort is required after any repartitioning

[Diagram: two sorted partitions (1, 2, 3 and 101, 102, 103) pass through a partitioner; after repartitioning, the rows arrive interleaved (e.g. 2, 101, 3, 1, 103, 102) and are no longer in sorted order]

Page 95: ds325ee


Sequential (Total) Sorting Methods

Within Enterprise Edition, DataStage provides two methods for generating a sequential (totally sorted) result:
- A Sort stage, in sequential execution mode

OR

- Parallel Sort stages followed by a Sort Merge collector (for sorted input)

In general, a parallel Sort plus a Sort Merge collector will be MUCH faster than a sequential Sort, unless the data is already sequential. (This is similar to how databases perform a “parallel sort”.)
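The parallel-sort-plus-Sort-Merge approach can be sketched in Python (a concept illustration only; the data and partition count are invented):

```python
import heapq

# Concept sketch: sort each partition independently (in parallel), then a
# Sort Merge collection yields one totally-sorted sequential stream. This is
# usually much faster than a single big sequential sort.
data = [5, 1, 4, 2, 8, 7, 3, 6]
nparts = 3
parts = [sorted(data[i::nparts]) for i in range(nparts)]  # per-partition sorts
total = list(heapq.merge(*parts))                         # Sort Merge collection
```

Because the merge only compares the head row of each partition, the collecting step is cheap relative to the sorts, which is where the parallelism pays off.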

Page 96: ds325ee


Automatic Sorting

By default, the Parallel Framework inserts sort operators as necessary to ensure correct results:
- Before any stage that requires matched key values (e.g. Join, Merge, RemDup)
- Only inserted when the user has NOT explicitly defined an input sort
- Check the job SCORE for inserted tsort operators

For versions 7.01 and later, set $APT_SORT_INSERTION_CHECK_ONLY to change the behavior of automatically inserted sorts:
- Instead of actually performing the sort, the inserted sort operators only VERIFY that the data is sorted
- If data is not sorted properly at runtime, the job will fail
- Recommended only on a per-job basis during performance tuning

op1[4p] {(parallel inserted tsort operator
  {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}

op1[4p] {(parallel inserted tsort operator
  {key={value=LastName, subArgs={sorted}}, key={value=FirstName, subArgs={sorted}}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
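The “verify instead of sort” behavior can be sketched as follows (illustrative only; this is not the tsort operator, just the idea of failing the job when order is violated):

```python
# Concept sketch of "check only" mode: instead of sorting, merely verify the
# key order of arriving rows and fail (here, raise) if any row is out of order.
def verify_sorted(rows, key=lambda r: r):
    prev = None
    for row in rows:
        k = key(row)
        if prev is not None and k < prev:
            raise ValueError("input data is not sorted")  # job aborts at runtime
        prev = k
    return rows
```

Verification is a single streaming pass with O(1) memory, which is why it is so much cheaper than actually sorting.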

Page 97: ds325ee


Sort Resource Usage

By default, each sort uses 20MB per partition as an internal memory buffer:
- This includes user-defined (link, stage) and framework-inserted sorts
- A different size can be specified for each Sort stage using the “Restrict Memory Usage” option
  - Increasing this value can improve performance, especially if the entire data partition (or group) can fit into memory
  - Decreasing this value may hurt performance, but will use less memory (the minimum is 1MB per partition)
  - In Designer, this option is unavailable for link sorts

When the memory buffer is filled, sort uses temporary disk space in the following order:
1. Scratch disks in the $APT_CONFIG_FILE “sort” named disk pool
2. Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
3. The default directory specified by $TMPDIR
4. The UNIX /tmp directory
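The documented spill-space search order can be sketched as a small helper (illustrative only; the function name and pool arguments are invented, and the order comes straight from the list above):

```python
import os

# Concept sketch of the documented fallback order for sort scratch space.
def sort_scratch_order(sort_pool_disks, default_pool_disks):
    dirs = list(sort_pool_disks) + list(default_pool_disks)  # named pool first
    tmpdir = os.environ.get("TMPDIR")                        # then $TMPDIR, if set
    if tmpdir:
        dirs.append(tmpdir)
    dirs.append("/tmp")                                      # final fallback
    return dirs
```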

Page 98: ds325ee


Optimizing Sort Performance

Minimize the number of sorts within a job flow:
- Each sort interrupts the parallel pipeline: it must read all rows before generating output

Specify only the necessary key columns.

Avoid stable sorts unless needed to retain the order of non-key column data.

If possible, use the “Sort Key Mode” key column option to re-use previous sort keys.

Within the Sort stage, adjusting the “Restrict Memory Usage” option may improve performance.

Page 99: ds325ee


Partitioning Examples

Page 100: ds325ee


Partitioning Example 1

Scenario: assign an average value to existing detail rows.

“Standard” solution (3 hash/sorts):
- Copy the data; hash and sort on all inputs to the Aggregator and Join
- This is also the method the Framework would use with Auto partitioning to ensure correct results

[Diagram: Copy feeding both Aggregate and Join]

Notice that all 3 hash partitioners and sorts use the same key columns and order!

Page 101: ds325ee


Example 1 - Optimized Solution

Optimize partitioning keys (and sort order) across multiple stages in a single flow, to minimize re-sorts and re-partitions.

Optimized solution (1 hash/sort):
- Move the hash/sort upstream, before the Copy
- Use SAME partitioning to preserve partitioning and sort order

Partition and sort on the key column(s); SAME partitioning retains the previous sort order.

[Diagram: Copy feeding both Aggregate and Join, with the single hash/sort upstream of the Copy]

Inputs to the JOIN do not need to be sorted again!

Page 102: ds325ee


Example 1: Sort Insertion

Looking at the job SCORE for the “optimal” solution, the Framework inserts sorts before each Join input to ensure correct results:
- Regardless of the partitioning method chosen
- In this example we don’t want these extra sorts

To change the behavior of framework-inserted sorts for this job, set $APT_SORT_INSERTION_CHECK_ONLY:
- Inserted sorts will verify row order at runtime, but will not actually sort data

op3[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}
op4[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
  on nodes ( node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3] )}

Page 103: ds325ee


Partitioning Example 2: Header / Detail

Know your data: HASH guarantees correct grouping results, but it is not always the most efficient choice.

Scenario: header / detail processing
- Assign data from the header row to all detail rows
- Use a Transformer to:
  - Separate the header and detail data
  - Add a join key column (constant value) to both outputs

[Diagram: Src feeds a Transformer, which splits into Header and Detail links that re-join to produce Out]

NOTE: since the join key value is constant, the inputs to the JOIN stage should NOT be sorted.

Page 104: ds325ee


Partitioning Example 2: Solutions

Solution 1: hash on the key columns and Join
- This is the “standard” approach
- It is also the method the Framework would use with Auto partitioning
- BUT… there is only one hash key value, so the Join runs sequentially

Solution 2: use Entire to copy the header data to all partitions
- Distribute the detail data using Round Robin
- The Join will now run in parallel

[Diagram: Src splits into Header and Detail links that re-join to produce Out]

But there is still a possible problem with either solution!

For either solution, to counteract framework-inserted sorts, set $APT_SORT_INSERTION_CHECK_ONLY.

Page 105: ds325ee


Introducing the Buffer Operator

At runtime, the Framework automatically inserts buffer operators to prevent deadlocks and to optimize overall performance:
- For job flows with a fork-join, buffer operators are inserted on all inputs to the downstream joining operator
  - That is, any link split that is later combined in the same job flow
- Buffer operators may also be inserted in an attempt to match producer and consumer rates

Data is never repartitioned across a buffer operator:
- First-in, first-out row processing

Some stages (e.g. Sort, Hash Aggregator) internally buffer the entire data set before outputting a row:
- Buffer operators are never inserted after these stages

[Diagram: a fork-join flow from Stage 1 through Stage 2, with Buffer operators inserted on both inputs to Stage 3]

Page 106: ds325ee


Identifying Buffer Operators

At runtime, buffers are identified in the operators section of the job SCORE

For more details on buffering: OEM UserGuide.PDF, Appendix A

It has 6 operators:
op0[1p] {(sequential Row_Generator_0) on nodes ( ecc3671[op0,p0] )}
op1[1p] {(sequential Row_Generator_1) on nodes ( ecc3672[op1,p0] )}
op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3) on nodes ( ecc3671[op2,p0] )}
op3[4p] {(parallel buffer(0)) on nodes ( ecc3671[op3,p0] ecc3672[op3,p1] ecc3673[op3,p2] ecc3674[op3,p3] )}
op4[4p] {(parallel APT_CombinedOperatorController:
  (APT_LUTProcessImpl in Lookup_3)
  (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
  (PeekNull)
  ) on nodes ( ecc3671[op4,p0] ecc3672[op4,p1] ecc3673[op4,p2] ecc3674[op4,p3] )}
op5[1p] {(sequential APT_RealFileExportOperator in Sequential_File_12) on nodes ( ecc3672[op5,p0] )}
It runs 12 processes on 4 nodes.

Page 107: ds325ee


How Buffer Operators Work

The primary goal of a buffer operator is to prevent deadlocks.

This is accomplished by “holding rows” until the downstream operator is ready to process them:
- Rows are held in memory up to the size defined by $APT_BUFFER_MAXIMUM_MEMORY (the default is 3MB per buffer, per partition)
- When buffer memory is filled, rows are spilled to disk
  - By default, up to the amount of available scratch disk, unless a QUEUE UPPER BOUND limit has been set

[Diagram: Producer feeding a Buffer, which feeds the Consumer]

Page 108: ds325ee


Buffer Flow Control

When buffer memory usage reaches $APT_BUFFER_FREE_RUN, the buffer operator will offer resistance to new rows, slowing down the rate of the upstream producer:
- The default is 0.5 = 50%
- Setting $APT_BUFFER_FREE_RUN > 1 (100%) will prevent the buffer from slowing down the upstream producer until a data size of $APT_BUFFER_MAXIMUM_MEMORY * $APT_BUFFER_FREE_RUN has been buffered
  - This assumes that the overhead of disk I/O for buffer scratch usage is less than the impact of slowing down the upstream operator

[Diagram: Producer feeding a Buffer, which feeds the Consumer; once $APT_BUFFER_FREE_RUN is reached, the buffer offers resistance to new rows, slowing down the upstream producer]
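The resistance threshold can be expressed as a one-line check (illustrative only; the function name is invented, and 3MB / 0.5 are the defaults stated in these slides):

```python
# Concept sketch: the buffer begins "offering resistance" (slowing the
# producer) once buffered data exceeds free_run * maximum_memory.
def buffer_resists(buffered_bytes, maximum_memory=3 * 1024 * 1024, free_run=0.5):
    return buffered_bytes > free_run * maximum_memory
```

With the defaults, resistance starts at 1.5MB of buffered data per partition; with free_run above 1.0 the threshold moves past the memory limit, so rows spill to disk before the producer is ever slowed.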

Page 109: ds325ee


Tuning Buffer Settings

On a per-job basis, through environment variables:
- $APT_BUFFER_MAXIMUM_MEMORY
- $APT_BUFFER_FREE_RUN
- $APT_BUFFER_DISK_WRITE_INCREMENT
- And many other advanced options…

On a per-link basis (Inputs/Outputs -> Advanced):
- Buffer options are defined per link (virtual data set); hence the Output of one stage is the Input of the following stage

In general, Auto buffering (the default) is recommended:
- Don’t change it unless you really understand your job flow and data!
- Disabling buffering may cause the job to deadlock (hang)

In general, buffer tuning is an advanced topic; the default settings should be appropriate for most job flows. For very wide rows, it may be necessary to increase the default buffer size to handle more rows in memory:
- Calculate the total record width using the internal storage for each data type / length / scale. For variable-length (varchar) columns, use the maximum length.
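The record-width calculation can be sketched as follows (illustrative only; the byte sizes in the table are assumptions for the example, not DataStage's exact internal storage sizes):

```python
# Concept sketch: estimate a record's internal width to size buffers.
# These per-type byte sizes are ASSUMED for illustration.
ASSUMED_BYTES = {"int32": 4, "dfloat": 8, "char": 1, "varchar": 1}

def record_width(columns):
    """columns: list of (type, length); varchar columns use their MAXIMUM length."""
    return sum(ASSUMED_BYTES[ctype] * length for ctype, length in columns)
```

For example, a record of one int32 and a varchar(20) would be estimated at 24 bytes under these assumed sizes; dividing the buffer size by this width gives the number of rows the buffer can hold in memory.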

Page 110: ds325ee


Buffer Resource Usage

By default, each buffer operator uses 3MB of virtual memory per partition:
- This can be changed through the Advanced link properties, or globally using $APT_BUFFER_MAXIMUM_MEMORY

When buffer memory is filled, temporary disk space is used in the following order:
1. Scratch disks in the $APT_CONFIG_FILE “buffer” named disk pool
2. Scratch disks in the $APT_CONFIG_FILE default disk pool (normally all scratch disks are part of the default disk pool)
3. The default directory specified by $TMPDIR
4. The UNIX /tmp directory

Page 111: ds325ee


End of Data / End of Data Group

Stages that process groups of data (e.g. Join, Merge, Remove Duplicates, Sort Aggregator) cannot output a row until:
- Data in the grouping key column(s) changes (a logical End of Data Group), or
- All rows have been processed (End of Data)

For stages that process groups, rows are buffered in memory until an End of Data Group or End of Data.

Some stages (e.g. Sort, Hash Aggregator) must read the entire input data set (until End of Data) before outputting a single record:
- Setting the “Don’t Sort, Previously Sorted” key option changes Sort behavior to output on groups instead of the entire data set
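The End of Data Group idea can be sketched with a small group aggregator (illustrative only; the function and row values are invented):

```python
# Concept sketch: with key-sorted input, a "lightweight" aggregator emits each
# group's result as soon as the key changes (End of Data Group), instead of
# buffering the whole data set until End of Data.
def sorted_aggregate(rows):
    out, cur_key, cur_sum = [], None, 0
    for key, val in rows:
        if cur_key is not None and key != cur_key:
            out.append((cur_key, cur_sum))   # End of Data Group: emit and reset
            cur_sum = 0
        cur_key, cur_sum = key, cur_sum + val
    if cur_key is not None:
        out.append((cur_key, cur_sum))       # End of Data: emit the last group
    return out
```

Memory use is bounded by one group's running state, which is precisely why these stages require key-sorted input.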

Page 112: ds325ee


Revisiting Example 2: Buffering Impact

For large data volumes, buffering introduces a possible problem with this solution:
- At runtime, buffer operators are inserted for this fork-join scenario
- The Join stage, operating on key-column groups, is unable to output rows until End of Data Group or End of Data
- Since we generate one header row with no subsequent change in the join column, data is buffered until End of Data

Solution: use stage variables to hold the header data values, and output multiple header rows with different join-key values:
- This additional logic may impact Transformer performance
- The proper solution ultimately depends on data volume and available hardware resources

[Diagram: Src splits into Header and Detail links that re-join to produce Out, with Buffer operators inserted on both Join inputs]

Page 113: ds325ee


Revisiting Example 2: Buffering Solution

Define stage variables to hold the header-row values:
- Set the initial values to empty
- Only set the header values when a header row is identified

Header link:
- Use output link constraints to only output data after the header values have been captured
- Assign more than one join key value using @INROWNUM
- Assumes only one header row

Detail link:
- Assign a constant value to the detail join column

Page 114: ds325ee


Join Stage: Internal Buffering

Even for inner joins, there is a difference between the inputs of a Join stage!
- The first link (#0, “LEFT” within Link Ordering) establishes the “driver” input: rows are read one at a time
- For non-unique key values, all rows within the same key-value group are read into memory from the second link (#1, “RIGHT” within Link Ordering)

For Example 2, the single header row must be the second input link (#1) to the Join stage:
- Otherwise, all input data will be read into virtual memory
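This asymmetry can be sketched with a simple sorted-input join (illustrative only; the function is invented and, for brevity, buffers all right-link groups rather than one group at a time):

```python
from itertools import groupby

# Concept sketch: the LEFT link drives, streaming one row at a time, while
# each key group from the RIGHT link is held in memory. This is why the
# single header row belongs on the RIGHT input: the small side is buffered.
def inner_join(left, right, key=lambda r: r[0]):
    right_groups = {k: list(g) for k, g in groupby(right, key=key)}
    out = []
    for lrow in left:                      # left rows stream one at a time
        for rrow in right_groups.get(key(lrow), []):
            out.append(lrow + rrow[1:])
    return out
```

Swapping the links in the header/detail example would put every detail row into the in-memory group, which is exactly the failure mode described above.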

Page 115: ds325ee


Avoiding Buffer Contention

Datasets do not buffer: there is no upstream operation that would prevent rows from being output.

In some cases, the best solution for avoiding fork-join buffer contention is to split the job, landing results to intermediate data sets:
- Develop a single job first
- If performance / volume testing indicates a buffering-related performance issue that cannot be resolved by adjusting buffer settings, then split the job across intermediate datasets

Page 116: ds325ee


Example 2: Why Not Use Lookup?

Lookup cannot output any rows until ALL reference link data has been read into memory (End of Data):
- Except for sparse database lookups

NEVER generate Lookup reference data using a fork-join of the source data:
- Separate the creation of lookup reference data from the lookup processing

[Diagram: Src splits into Header and Detail; the Header branch produces HeaderRef, which feeds the Lookup reference link, to produce Out]

Page 117: ds325ee


Summary

Partitioning:
- The partitioning method should ensure correct results AND (if possible) evenly distribute data
- Be aware of your data distribution and its impact on processing

Collecting:
- Used to consolidate partitioned data into a sequential process

Sorting:
- Parallel sorting establishes row order within groups
  - Partitioning gathers the related rows
- Sequential sorting is only needed to produce a single, globally sorted, sequential result set

Page 118: ds325ee


DataStage Enterprise Edition

Module 02: Partitioning, Collecting, and Sorting Data

Paul Christensen, Solution Architect

NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Last revision: June 22, 2004

Page 119: ds325ee


DataStage Enterprise Edition

Module 03: The Parallel Job Score

Paul Christensen, Solution Architect

NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Last revision: June 22, 2004

Page 120: ds325ee


The Parallel Job SCORE

- The Job SCORE details the optimization plan used by the DataStage Parallel Framework to run a given job design with a specified $APT_CONFIG_FILE
  - Similar to the way a parallel RDBMS builds a query plan
- Identifies degree of parallelism and node assignment(s) for each operator
- Details mappings between functional (stage/operator) and actual UNIX processes
- Includes buffer operators inserted to prevent deadlocks and optimize data flow rates between stages
- Can be used to identify sorts and partitioners that have been automatically inserted to ensure correct results
- Outlines connection topology (datasets) between adjacent operators and/or persistent data sets
- Defines the number of actual UNIX processes
  - Where possible, multiple operators are combined within a single UNIX process to improve performance and optimize resource requirements

Page 121: ds325ee


Viewing the Job SCORE

• Set $APT_DUMP_SCORE to output the Score to the DataStage job log
• Can be enabled at the project level to apply to all jobs
• For each job run, 2 separate Score dumps are written to the log
  • The first score is actually from the license operator
  • The second score entry is the actual job score

[Screenshot: job log entries showing the License Operator job score followed by the actual job score]

Page 122: ds325ee


Example Job Score

Job scores are divided into two sections:
- Datasets: partitioning and collecting
- Operators: node/operator mapping

Both sections note sequential or parallel processing.

Page 123: ds325ee


Job SCORE: Operators

The operators (lower) section of the Job Score details the mapping between stages and the actual processes created at runtime:
- Operator combination
- Operator-to-node mappings
- Degree of parallelism per operator
- Framework-inserted sorts
- Buffer operators

op0[1p] {(sequential APT_CombinedOperatorController:
      (Row_Generator_0)
      (inserted tsort operator {key={value=LastName}, key={value=FirstName}})
    ) on nodes (
      node1[op0,p0]
  )}
op1[4p] {(parallel inserted tsort operator
      {key={value=LastName}, key={value=FirstName}}(0))
    on nodes (
      node1[op1,p0] node2[op1,p1] node3[op1,p2] node4[op1,p3]
  )}
op2[4p] {(parallel buffer(0))
    on nodes (
      node1[op2,p0] node2[op2,p1] node3[op2,p2] node4[op2,p3]
  )}

Page 124: ds325ee


Operator Combination

At runtime, the DataStage Parallel Framework can only combine stages (operators) that:
- Use the same partitioning method
  - Repartitioning prevents operator combination between the corresponding producer and consumer stages
  - Implicit repartitioning (e.g. Sequential operators, node maps) also prevents combination
- Are Combinable
  - Set automatically within the stage/operator definition
  - Can also be set within DataStage Designer: Advanced stage properties

Page 125: ds325ee


Composite Operator Example: Lookup

The Lookup stage is a composite operator. Internally it contains more than one component, but to the user it appears to be one stage:
- LUTCreateImpl: reads the reference data into memory
- LUTProcessImpl: performs the actual lookup processing once reference data has been loaded

At runtime, each internal component is assigned to operators independently:

op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3)
    on nodes (
      ecc3671[op2,p0]
  )}
op3[4p] {(parallel buffer(0))
    on nodes (
      ecc3671[op3,p0] ecc3672[op3,p1] ecc3673[op3,p2] ecc3674[op3,p3]
  )}
op4[4p] {(parallel APT_CombinedOperatorController:
      (APT_LUTProcessImpl in Lookup_3)
      (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
      (PeekNull)
    ) on nodes (
      ecc3671[op4,p0] ecc3672[op4,p1] ecc3673[op4,p2] ecc3674[op4,p3]
  )}

Page 126: ds325ee


Job SCORE: Data Sets

- The Job SCORE can be used to verify the partitioning and collecting methods that are used at runtime
- Partitioners and Collectors are associated with datasets (top portion of the SCORE)
- Datasets connect a source and a target, each of which can be:
  - operator(s) (see the lower portion of the SCORE)
  - persistent Dataset(s)
- The Partitioner / Collector method is shown between the source and target

Page 127: ds325ee


Interpreting the Job Score - Partitioning

- The DataStage Parallel Framework implements a producer-consumer data flow model: upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets)
- The partitioning method is associated with the producer; the collector method is associated with the consumer
  - "eCollectAny" is specified for parallel consumers, although no collection occurs!
- Producer and consumer are separated by an indicator:
    ->  Sequential to Sequential
    <>  Sequential to Parallel
    =>  Parallel to Parallel (SAME)
    #>  Parallel to Parallel (not SAME)
    >>  Parallel to Sequential
    >   No producer or no consumer
- May also include [pp] notation when the Preserve Partitioning flag is set
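The indicator table above can be captured as a small mapping when scanning a dumped score by script. This is an illustrative Python sketch, not part of DataStage; the function name is invented:

```python
# Illustrative only: map the score's dataset indicators (shown between
# producer and consumer) to their meaning.
INDICATORS = {
    "->": "Sequential to Sequential",
    "<>": "Sequential to Parallel",
    "=>": "Parallel to Parallel (SAME)",
    "#>": "Parallel to Parallel (not SAME)",
    ">>": "Parallel to Sequential",
    ">":  "No producer or no consumer",
}

def describe(indicator):
    # Strip an optional [pp] Preserve Partitioning suffix before the lookup.
    return INDICATORS[indicator.replace("[pp]", "").strip()]
```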

Page 128: ds325ee


Using the Job SCORE

- $APT_DUMP_SCORE = 1 ("True") is a recommended default (project-level) setting for all jobs
- At runtime, the Job SCORE can be examined to identify:
  - The number of UNIX processes generated for a given job and $APT_CONFIG_FILE
  - Operator combination
  - Partitioning methods between operators
  - Framework-inserted components, including Sorts, Partitioners, and Buffer operators

Page 129: ds325ee


DataStage Enterprise Edition

Module 03: The Parallel Job Score

Paul Christensen, Solution Architect

NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Last revision: June 22, 2004

Page 130: ds325ee


DataStage Enterprise Edition

Module 04: Best Practices and Job Design Tips

Paul Christensen, Solution Architect

NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Last revision: June 22, 2004

Page 131: ds325ee


Assumptions

This module assumes that you have an understanding of the topics covered in:
- Module 01: Parallel Framework Architecture
- Module 02: Partitioning, Collecting, and Sorting
- Module 03: Parallel Job Score
- Material covered in DS324PX: DataStage Enterprise Edition Essentials

Page 132: ds325ee


DataStage Enterprise Edition

Job Design Tips

Page 133: ds325ee


Overall Job Design

- Ideal job design must strike a balance between performance, resource usage, and restartability
- In theory, best performance results from processing all data in memory without landing to disk
  - Requires hardware resources (e.g. CPU, memory) and UNIX resources (e.g. ulimit, nfiles, etc.)
  - Resource usage grows quickly with the degree of parallelism and the number of stages in a flow
  - Must also consider what else is running on the server(s)
  - May not be possible with very large amounts of data (e.g. Sort will use scratch disk if the data is larger than its memory buffer)
- Business rules may dictate job boundaries
  - e.g. Dimension maintenance before Fact table processing/load
  - e.g. Lookup reference data must be created before lookup processing

Page 134: ds325ee


Modular Job Design

- Parallel shared containers facilitate modular job design by creating re-usable components (stages, logic)
  - Runtime column propagation allows maximum parallel shared container re-use (only the columns used within the container logic need to be defined)
  - The total number of stages in a job includes all stages in all of its parallel shared containers
- Job parameters and multi-instance job properties facilitate job re-use
- Land intermediate results to parallel data sets

Page 135: ds325ee


Establishing Job Boundaries

- Business requirements
- Functional / DataStage requirements
  - Establish restart points in the event of a failure
  - Segment long-running steps
  - Separate the final database Load from the Extract and Transformation steps
- Resource utilization (number of stages, etc.)
- Performance
  - Fork-join job flows may run faster if split into two separate jobs with intermediate datasets
  - Depends on processing requirements and the ability to tune buffering

Page 136: ds325ee


Job Sequences

- Job Sequences can be used to combine individual jobs into functional "modules" that perform a sequence of activities
- Starting with DataStage release 7.1, Job Sequences can be "restartable"
  - In the event of a failure, re-running the sequence will not re-run activities that completed successfully
  - It is the developer's responsibility to ensure that an individual job can be re-run after a failure
  - Enable Sequence restart in Job Properties (enabled by default)
  - The "do not checkpoint run" sequence stage property forces that step to execute on every Sequence run

Page 137: ds325ee


Job Design – Stage Usage Tips

- Sequential File
  - Optimizing performance
  - Reading and writing fixed-width files
  - Adjusting write buffer size
- Column Import
- Lookup
- Sort
- Aggregator
- Transformer
- Database Stages

Page 138: ds325ee


Reading a Sequential File in Parallel

- By default, Sequential File reads are not parallel unless multiple files are specified
- The Readers Per Node option can be used to read a single input file in parallel at evenly spaced offsets
  - Note that sequential row order cannot be maintained when reading a file in parallel
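The evenly spaced offsets can be illustrated with a short Python sketch (not DataStage code; the function name is invented). Each reader after the first seeks to its nominal offset and skips forward past the next newline so it starts on a record boundary, which is why row order across readers is not preserved:

```python
import os

def split_points(path, readers):
    """Compute aligned start offsets for N parallel readers of one file.
    Readers after the first skip the partial record at their nominal
    offset and begin at the byte following the next newline."""
    size = os.path.getsize(path)
    points = [0]
    with open(path, "rb") as f:
        for i in range(1, readers):
            f.seek(i * (size // readers))  # nominal evenly spaced offset
            f.readline()                   # skip to the next record boundary
            points.append(f.tell())
    return points
```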

Page 139: ds325ee


Partitioning and Sequential Files

- Sequential File sources (import operator) create one partition for each input file
- Always follow a Sequential File with ROUND ROBIN or another appropriate partitioning type
- NEVER follow a Sequential File source with SAME partitioning
  - If reading from one file, this will cause the downstream flow to run sequentially!
  - SAME is only appropriate in unusual scenarios where the source data is already separated into multiple files by partition
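The effect of round-robin partitioning after a single-file source can be sketched in Python (illustrative only; DataStage performs this inside the framework). Rows from the one source partition are dealt out evenly across the configured partitions:

```python
def round_robin(rows, nparts):
    # Deal rows across partitions in turn, repairing the skew of a
    # single-partition Sequential File source.
    parts = [[] for _ in range(nparts)]
    for i, row in enumerate(rows):
        parts[i % nparts].append(row)
    return parts
```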

Page 140: ds325ee


Capturing Sequential File Rejects

- The Sequential File stage supports an optional reject link to capture rows that do not match the source or target format
  - The reject schema is a single raw (binary) column
  - Be careful writing rejects to another Sequential File
  - It is easiest to output rejects to a Dataset (with a Peek for debugging)

Page 141: ds325ee


Sequential File Tips

- To write fixed-length files from variable-length fields, use the following column properties:
  - field width: specifies the output column width
  - pad string: specifies the character used to pad data to the specified field width (if not specified, an ASCII NUL character 0x0 is used for padding)
- When reading delimited files, extra characters are silently truncated for source file values longer than the maximum specified length of VARCHAR columns
  - Starting with v7.01, set the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to reject these records instead
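The field width / pad string behavior described above can be mimicked with a few lines of Python (illustrative only; the function name and the ValueError behavior are inventions of this sketch, not DataStage semantics):

```python
def to_fixed_width(value, field_width, pad_string="\x00"):
    # Pad a variable-length value out to field_width; the default pad
    # character mirrors the slide's ASCII NUL (0x0) default.
    if len(value) > field_width:
        raise ValueError("value longer than field width")
    return value.ljust(field_width, pad_string)
```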

Page 142: ds325ee


Buffering Sequential File Writes

- By default, Sequential File targets (export operator) buffer writes to optimize performance
  - Buffers are automatically flushed when the job completes successfully
- For realtime applications, the environment variable $APT_EXPORT_FLUSH_COUNT can be used to specify the number of rows to buffer
  - For example, $APT_EXPORT_FLUSH_COUNT=1 flushes to disk for every row
  - Setting this value low incurs a SIGNIFICANT performance penalty!
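The flush-count trade-off can be sketched in Python (illustrative only; the class is invented for this sketch and is not how the export operator is implemented). A flush count of 1 triggers a flush on every row, which is why it is expensive:

```python
class BufferedExport:
    """Sketch of $APT_EXPORT_FLUSH_COUNT semantics: flush the write
    buffer every N rows. N=1 flushes per row (slow but timely)."""
    def __init__(self, fh, flush_count):
        self.fh = fh
        self.flush_count = flush_count
        self.rows = 0
        self.flushes = 0
    def write_row(self, row):
        self.fh.write(row + "\n")
        self.rows += 1
        if self.rows % self.flush_count == 0:
            self.fh.flush()
            self.flushes += 1
```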

Page 143: ds325ee


Using Column Import

- The Column Import stage can be used to improve performance of non-parallel Sequential File reads and FTP sources
  - Allows column parsing to run in parallel
  - Separates parsing (CPU) from sequential source I/O
- Define the source file/FTP output as a single column
  - Type RAW or [VAR]CHAR
  - Maximum length = record size
  - Note that there are metadata implications
- Define columns, data types, and other format options in the Column Import stage
  - Similar to a Sequential File definition

Page 144: ds325ee


Lookup Stage Usage

- The Lookup stage is most appropriate when the reference data is small enough to fit into physical (shared) memory
  - For reference datasets larger than available memory, use the JOIN or MERGE stage
- Limit use of Sparse Lookup (for DB2 and Oracle reference tables)
  - Per-row database lookups are extremely expensive (slow)
  - For small numbers of rows, can be used for database-generated variables / function results
  - ONLY appropriate when the number of input rows is significantly smaller (e.g. 1:100) than the number of reference rows

Page 145: ds325ee


Lookup Reference Data Partitioning

- ENTIRE is the default partitioning for Lookup reference links with "Auto" partitioning
  - On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)
- On SMP platforms, the Lookup stage uses shared memory instead of duplicating the ENTIRE reference data
- To minimize data movement across nodes on clustered / MPP platforms, it may be appropriate to select a keyed partitioning method
  - Especially if the data is already partitioned on those keys
  - Input and reference data partitioning must match

Page 146: ds325ee


Lookup Reference Data

- NEVER generate Lookup reference data using a fork-join of source data
  - Lookup cannot output rows until all reference data has been read into memory (except for Oracle or DB2 Sparse reference links)
- Use Lookup File Sets to separate the creation of lookup reference data from lookup processing

[Diagram: fork-join flow in which Src splits into Header and Detail; Header builds the HeaderRef reference link feeding the Lookup that produces Out]

Page 147: ds325ee


Lookup File Sets

- Lookup File Sets should be used to store reference data on disk
  - Data is stored in native format, partitioned, and pre-indexed on the lookup key column(s)
  - Key column(s) and partitioning are specified when the file is created
- Lookup File Sets can only be used as a reference input link to a Lookup stage
  - The partitioning method and key columns specified when the Lookup File Set is created will be used to process the reference data on subsequent Lookups that use this file
- Particularly useful when static reference data can be re-used in multiple jobs (or runs of the same job)

Page 148: ds325ee


Aggregator

- The Aggregator stage summarizes data based on groupings of key-column values
  - Input partitioning must match the desired groupings
- Use the Hash method for inputs with a limited number of distinct key-column values
  - Uses 2K of memory per group
  - Incoming data does not need to be pre-sorted
  - Results are output only after all rows have been read
  - Output row order is undefined, even if the input data is sorted
- Use the Sort method with a large (or unknown) number of distinct key-column values
  - Requires input pre-sorted on the key columns
  - Results are output after each group
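The difference between the two methods can be sketched in Python (illustrative only; DataStage implements these internally, and the function names are invented). The hash method keeps one accumulator per distinct key and emits nothing until end of input; the sort method streams over pre-sorted input and emits each group as it closes:

```python
from itertools import groupby

def hash_aggregate(rows, key, value):
    # Hash method: no pre-sort needed; memory grows with the number of
    # distinct keys; results available only after all input is read.
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return totals

def sort_aggregate(sorted_rows, key, value):
    # Sort method: input must already be sorted on the key; each group's
    # total is yielded as soon as that group ends.
    for k, grp in groupby(sorted_rows, key=lambda r: r[key]):
        yield k, sum(r[value] for r in grp)
```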

Page 149: ds325ee


Sequential (Total) Aggregations

- To summarize all input rows, generate a constant-value key column
  - Column Generator, or a Transformer (if one is already in the upstream job flow)
  - Then aggregate sequentially on the generated key column
  - No need to sort or hash-partition the input data!
- Use 2 Aggregators to prevent the sequential aggregation (and collector) from slowing down the upstream data flow
  - The first Aggregator runs in parallel, grouping on the generated key column (round-robin the input if it is not evenly distributed)
  - The second Aggregator runs sequentially (Auto collector), grouping on the generated key column
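The two-Aggregator pattern can be sketched in Python (illustrative only; the function name is invented). Each partition first produces its own partial total in parallel, and a single sequential step then combines the small set of partials:

```python
def total_with_two_aggregators(partitions):
    # Stage 1 (parallel): each partition aggregates on the constant key,
    # producing one partial total per partition.
    partials = [sum(p) for p in partitions]
    # Stage 2 (sequential): combine the few per-partition partials,
    # so the sequential step sees only N rows, not the whole input.
    return sum(partials)
```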

Page 150: ds325ee


Transformer vs. Other Stages

For optimum performance, consider more appropriate stages instead of a Transformer in parallel job flows:
- Use the Copy stage as a placeholder
  - This is different from DataStage Server!
  - Unless FORCE=TRUE, Copy is optimized out at runtime
- Leverage stage (e.g. Copy) Output Mappings (RCP off) to:
  - Rename columns
  - Drop columns
  - Perform default type conversions
- Modify is the most efficient "stage". Use it for:
  - Non-default type conversions
  - Null handling (converting between in-band and out-of-band)
  - String trimming (v7.01 and later)
- NOTE: starting with v7.01, Transformer output link constraints are FASTER than the Filter stage! (Filter is always interpreted)

Page 151: ds325ee


Transformer vs. Lookup

- Consider implementing Lookup tables for expressions that depend on value mapping
- For example, instead of using Transformer expressions such as:
  - … link.A=1 OR link.A=3 OR link.A=5 …
  - … link.A=2 OR link.A=7 OR link.A=15 OR link.A=20 …
- Create a Lookup table for the source-value pairs, and use the Lookup stage to assign values:

    A    Result
    1    1
    3    1
    5    1
    2    2
    7    2
    15   2
    20   2

- This method can also be used to simplify output link constraints
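The mapping table above is equivalent to a simple dictionary lookup; this Python sketch (illustrative only, names invented) shows the chained OR expressions collapsing into a single table probe:

```python
# The source-value pairs from the table above, as a lookup dictionary.
VALUE_MAP = {1: 1, 3: 1, 5: 1, 2: 2, 7: 2, 15: 2, 20: 2}

def map_value(a, default=None):
    # One table probe replaces the chained "link.A=1 OR link.A=3 OR ..."
    # expressions; unmatched values fall through to the default.
    return VALUE_MAP.get(a, default)
```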

Page 152: ds325ee


Transformer Performance Guidelines

- Minimize the number of Transformers by combining derivations from multiple Transformers
- NEVER use the Server-side BASIC Transformer in high-volume data flows
  - It is intended to provide a migration path for existing DataStage Server applications that use DataStage BASIC routines
  - Starting with v7, the parallel Transformer supports user-defined functions (external object files or libraries, not DataStage BASIC routines)
- Replace Transformer stages that do not meet performance requirements with BuildOps
  - It is generally not necessary to replace all Transformers, just those that are bottlenecks
  - Remember, BuildOps require more knowledgeable developers than equivalent Transformer logic

Page 153: ds325ee


Optimizing Transformer Expressions

The parallel Transformer uses the following evaluation algorithm:
- Evaluate each stage variable initial value
- For each input row:
  - Evaluate each stage variable derivation value, unless the derivation is empty
  - For each output link:
    - Evaluate each column derivation value
    - Write the output record

Stage variables and columns within a link are evaluated in the order displayed in the Transformer editor.

Page 154: ds325ee


Optimizing Transformer Expressions

Given the Transformer evaluation order, use stage variables instead of per-column derivations to minimize repeated evaluation of the same derivation (move repeated expressions outside of loops). Examples:
- Portions of output column derivations that are used in multiple derivations
- Expressions that include calculated constant values
  - Use the stage variable Initial Value to evaluate once for all rows
- Expressions requiring a type conversion that are used as a constant, or used in multiple places
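The hoisting principle is the same as in any language; this Python sketch (illustrative only, names invented) contrasts re-evaluating a constant sub-expression per row with computing it once, as a stage variable Initial Value would:

```python
def derive_naive(rows, rate):
    # The constant sub-expression (rate / 100.0 + 1.0) is recomputed
    # for every row, like repeating it in each column derivation.
    return [r * (rate / 100.0 + 1.0) for r in rows]

def derive_hoisted(rows, rate):
    # Evaluate the constant once for all rows, like a stage variable
    # Initial Value, then reuse it per row.
    factor = rate / 100.0 + 1.0
    return [r * factor for r in rows]
```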

Page 155: ds325ee


Transformer Decimal Arithmetic

- Starting with v7.0.1 and v6.0.2, the Transformer supports DECIMAL arithmetic (earlier releases converted to dfloat)
- Default internal decimal variables are precision 38, scale 10, but this can be changed by specifying:
  - $APT_DECIMAL_INTERM_PRECISION
  - $APT_DECIMAL_INTERM_SCALE
- Set $APT_DECIMAL_INTERM_ROUND_MODE to specify:
  - ceil: rounds toward positive infinity
    1.4 -> 2, -1.6 -> -1
  - floor: rounds toward negative infinity
    1.6 -> 1, -1.4 -> -2
  - round_inf: rounds or truncates to the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity
    1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
  - trunc_zero: discards any fractional digits to the right of the rightmost supported fractional digit, regardless of sign; if $APT_DECIMAL_INTERM_SCALE is smaller than the result of an internal calculation, round or truncate to the scale size
    1.56 -> 1.5, -1.56 -> -1.5
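The four modes correspond closely to rounding modes in Python's decimal module, which can be used to check the examples above. The mapping below is an assumption of this sketch (ceil -> ROUND_CEILING, floor -> ROUND_FLOOR, round_inf -> ROUND_HALF_UP as ties-away-from-zero, trunc_zero -> ROUND_DOWN), not anything DataStage ships:

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP, ROUND_DOWN

# Assumed Python analogues of the $APT_DECIMAL_INTERM_ROUND_MODE values.
MODES = {
    "ceil": ROUND_CEILING,       # toward positive infinity
    "floor": ROUND_FLOOR,        # toward negative infinity
    "round_inf": ROUND_HALF_UP,  # nearest value, ties away from zero
    "trunc_zero": ROUND_DOWN,    # discard digits (truncate toward zero)
}

def round_decimal(value, scale, mode):
    q = Decimal(1).scaleb(-scale)   # scale=1 -> quantum of 0.1
    return Decimal(value).quantize(q, rounding=MODES[mode])
```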

Page 156: ds325ee


Conditionally Aborting a Job

- Use the "Abort After Rows" setting in the output link constraints of the parallel Transformer to conditionally abort a parallel job
  - Create a new output link and assign a link constraint that matches the abort condition
  - Set "Abort After Rows" for this link to the number of rows allowed before the job aborts
- When the "Abort After Rows" threshold is reached, the Transformer immediately aborts the job flow, potentially leaving uncommitted database rows or un-flushed file buffers

Page 157: ds325ee


More Transformer Best Practices

- Always include a reject link
  - Captures NULL errors from Transformer expressions
- Always test for a null value before using a column in a function
- Avoid type conversions
  - Try to maintain the data type as imported
- Be aware of column and stage variable data types
  - It is easy to neglect setting the proper stage variable type

Page 158: ds325ee


“First Row” Transformer Derivations

Within a Transformer, stage variables can be used to identify the first row of an input group:
- Define one stage variable for each grouping key column
- Define a stage variable to flag when the input key column(s) do not match the previous value(s)
- On a new group (flag set), set the stage variable(s) to the incoming key column value(s)
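The stage-variable pattern above can be sketched in Python (illustrative only; the function name is invented). A saved previous-key value plays the role of the key stage variable, and the comparison plays the role of the flag:

```python
def flag_first_rows(rows, key):
    # Compare each incoming key with the saved previous value; on a
    # change, flag the row as the first of its group and save the key.
    prev = object()   # unique sentinel: the first row always differs
    for row in rows:
        first = row[key] != prev
        prev = row[key]
        yield first, row
```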

Page 159: ds325ee


“Last Row” Transformer Derivations

- Since the Transformer cannot "read ahead", other methods must be used when derivations depend on the last row of a group
- For aggregate calculations within the Transformer, generate a "running total" for each group, then Remove Duplicates, retaining the Last row

Page 160: ds325ee


Identifying “Last Row” in a Group

In general, it is a bad idea to perform multiple, back-to-back sorts

The Sort stage, however, can be used for more than just sorting:
Sub-sorting on groups (instead of complete sorts)
Creating key-change columns

Example: For derivations that cannot output a running total, use 3 Sort stages before the Transformer to generate a change-key column for the last row in the group
Often, data is already sorted earlier in the same flow
Hash-partition and Sort on key columns before the first sort
Use SAME partitioning to ensure that subsequent stages keep grouping and sort order

Flow: Sort -> KeyChange -> SubSort
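The three-sort trick can be mimicked to show why it flags the last row of each group: after a descending sub-sort on the group-order column, the key-change row is the group's last row, and a final ascending sub-sort restores the original order. A hypothetical Python sketch (simple in-memory lists, numeric order column assumed):

```python
def flag_last_rows(rows, key, order):
    # 1) Sort: key ascending, group-order column DESCENDING
    rows = sorted(rows, key=lambda r: (r[key], -r[order]))
    # 2) "Sort" that only creates a key-change column (no reordering)
    flagged, prev = [], object()
    for r in rows:
        flagged.append({**r, "last_in_group": r[key] != prev})
        prev = r[key]
    # 3) Sub-sort Ascending on the group-order column to restore order
    return sorted(flagged, key=lambda r: (r[key], r[order]))

rows = [{"k": 1, "seq": 1}, {"k": 1, "seq": 2}, {"k": 2, "seq": 1}]
flags = [r["last_in_group"] for r in flag_last_rows(rows, "k", "seq")]
# flags == [False, True, True]
```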

Page 161: ds325ee

“Last Row” Sort Details

First Sort:
Sorts on key columns
Sorts Descending on the group order column

Second "Sort" (KeyChange):
Does no sorting; creates the key-change column
Specify only the key columns

Final "Sub-Sort":
Does not sort on key columns
Sub-sorts Ascending on the group order column

Page 162: ds325ee

DataStage Enterprise Edition

Database Stage Usage

Page 163: ds325ee

Database Stage Usage

Overall Database Guidelines
Native Parallel vs. Plug-In Stages
DB2 Guidelines
Oracle Guidelines
Teradata Guidelines
SQL or DataStage?

Page 164: ds325ee

Optimizing Select Lists for Read

For source database stages, limit the use of "SELECT *" to read all columns:
Uses more memory, may impact job performance
Only needed for "dynamic" source / target flows (uncommon)

Instead, explicitly specify ONLY the columns needed in the flow:
For the Table read method, specify the Select List property
Or, use Auto-Generated or User-Defined SQL

Page 165: ds325ee

Native Parallel Database Stages

Starting with release 7, DataStage Enterprise Edition offers database connectivity through native parallel and plug-in stage types.

In general, for maximum parallel performance, scalability, and features, it is best to use the native parallel database stages:
Parallel read and write
OPEN and CLOSE commands

Page 166: ds325ee

Upsert (API) vs. Load Methods

For database targets, most Enterprise stages provide the choice of Upsert or Load methods

Upsert method uses database APIs:
Allows concurrent processing with other jobs and applications
Does not bypass database constraints, indexes, or triggers

Load method uses the corresponding database-specific parallel load utilities:
Can be significantly faster than the Upsert method for large data volumes
Subject to database-specific limitations of the load utilities:
May be issues with index maintenance, constraints, etc.
May not work with tables that have associated triggers
Requires exclusive access to the target table

Page 167: ds325ee

OPEN and CLOSE commands

The OPEN command allows the user to specify SQL to be executed before the stage begins reading or writing
Example: create a temporary table used to write rows

The CLOSE command allows the user to specify SQL to be executed after the stage completes reading or writing
Example: "INSERT INTO … SELECT … FROM …" from the temporary table to the actual table
Example: delete the temporary table(s)

Available only in the native parallel (Enterprise) database stages
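The temporary-table pattern can be mimicked with any SQL database. A hedged sketch using Python's built-in sqlite3 in place of a parallel database (table and column names are made up): the OPEN command creates the temp table, the stage writes rows, and the CLOSE command copies them into the real table and drops the temp table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, amt REAL)")

# OPEN command: runs before the stage starts writing
conn.execute("CREATE TABLE tmp_load (id INTEGER, amt REAL)")

# the stage writes its rows (here, a plain executemany)
conn.executemany("INSERT INTO tmp_load VALUES (?, ?)", [(1, 9.5), (2, 3.0)])

# CLOSE command: runs after the stage finishes writing
conn.execute("INSERT INTO target SELECT id, amt FROM tmp_load")
conn.execute("DROP TABLE tmp_load")

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
# count == 2
```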

Page 168: ds325ee

Plug-In Database Stages

Plug-in stage types are intended to provide connectivity to database configurations not offered by native parallel stages:
Cannot read in parallel
Cannot span multiple servers in clustered or MPP configurations

Page 169: ds325ee

Designer Palette Customization

DataStage repository window displays all stages available in the parallel canvas.

Stage Types/Parallel category

Not all of these stages are included in the default Designer palette.

Customize the palette to add stage types (e.g. Teradata API)

Page 170: ds325ee

Enterprise Edition DB2 Stages

DB2 Enterprise stage:
Should always be used when reading from, performing lookups against, or writing to DB2 Enterprise Server Edition with the Database Partitioning Feature (DPF)
In DB2 v7.x this was called "DB2 EEE"
Tightly coupled with DB2; communicates directly with each DB2 database node, using the same partitioning as the DB2 table
Supports parallel Read, Upsert, Load, and Sparse Lookup

DB2 API stage:
Provides connectivity to non-UNIX DB2 databases (such as mainframe editions, through DB2 Connect)

Page 171: ds325ee

DB2 Upsert Commit Interval

For target DB2 tables using the Upsert method, the DB2 Enterprise stage provides options to specify the database commit interval for each stage

Rows are committed after a period of time or a number of rows, whichever comes first:
Default is every 2 seconds or 2000 rows

Page 172: ds325ee

Cleaning Up Failed DB2 Loads

In the event of a failure during a DB2 Load operation, the DB2 Fast Loader marks the table inaccessible (quiesced-exclusive or load-pending state)

To reset the target table state to normal mode:
Re-run the job specifying the "CleanupOnFailure=True" option
Any rows that were inserted before the load failure must be deleted manually

Page 173: ds325ee

Enterprise Edition Oracle Stages

Oracle Enterprise:
Source: supports sequential (default) or parallel reads
Target:
Upsert: uses the Oracle API
Load: invokes SQL*Loader, subject to its limitations

Oracle OCI Load:
ONLY used for heterogeneous loads, when the target database hardware platform differs from the Oracle client (DataStage server) platform

Page 174: ds325ee

Specifying Oracle Remote Server

The Oracle Enterprise Remote Server connection option is intended for Oracle instances on remote hosts

In general, avoid using this option for local Oracle databases (on the same host as the DataStage server):
Specifying it for local Oracle instances forces a TCP (network) connection instead of a shared-memory database connection
Instead, set the environment variable $ORACLE_SID
The Oracle environment is typically defined within the DataStage dsenv file

Page 175: ds325ee

Reading from Oracle in Parallel

By default, Oracle Enterprise reads sequentially. Use the "partition table" option to read in parallel from Oracle sources

Limitations of parallel read:
The source table can only be non-partitioned or range-partitioned
Cannot run queries containing a GROUP BY clause that is not also partitioned by the same field
Cannot perform a non-collocated join

Page 176: ds325ee

Oracle Schema Owner

To access Oracle tables that were created by a different user, fully qualify the table name
Syntax: ownername.tablename

NOTES:
Parameterize ownername
Database permissions must allow access
CANNOT use an unqualified synonym: a synonym provides no access to the Oracle system catalog information required by the Oracle Enterprise stage

Page 177: ds325ee

Improving Oracle Upsert Performance

In Upsert write mode, the Oracle Enterprise stage:
Executes the Insert statement (if present) first
If the Insert fails with a unique-constraint violation, it then executes the Update statement

For larger data volumes, it is often faster to identify Insert and Update data within the job and separate them into different Oracle Enterprise targets:
Set Upsert Mode to "Update Only" for rows to be updated
Set Upsert Mode to "Update and Insert" for rows to be inserted
Prevents double-processing of update records

Insert processing uses Oracle host arrays to improve performance:
The optional InsertArraySize parameter can enhance performance (default is 500 rows)
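The insert/update split amounts to routing rows by whether their key already exists in the target (in a job this would typically be a Lookup against the target keys plus a Transformer constraint). A hypothetical Python illustration, with made-up column names:

```python
def split_upsert(rows, key, existing_keys):
    """Route rows to insert or update streams based on existing target keys."""
    inserts, updates = [], []
    for row in rows:
        # rows whose key exists go to the "Update Only" target,
        # the rest go to the insert target
        (updates if row[key] in existing_keys else inserts).append(row)
    return inserts, updates

rows = [{"id": 1}, {"id": 2}, {"id": 3}]
inserts, updates = split_upsert(rows, "id", existing_keys={2})
# inserts has ids 1 and 3; updates has id 2
```

Because each row reaches exactly one target, no row is processed twice, which is the saving over a blind Upsert.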

Page 178: ds325ee

Oracle Upsert Commit Interval

For target Oracle tables using the Upsert method, two environment variables specify the database commit interval
As environment variables, the commit settings apply to all Oracle stages in a job

Rows are committed after a period of time or a number of rows, whichever comes first, for each Oracle stage/partition:
$APT_ORAUPSERT_COMMIT_ROW_INTERVAL: default is every 2000 rows (per stage/partition)
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL: default is every 2 seconds

Page 179: ds325ee

Oracle Load into Indexed Tables

By default, Oracle Enterprise will not Load an indexed table:
Must drop indexes before the load, and recreate them after the load (needs appropriate Oracle privileges)
Can use OPEN and CLOSE commands

In Append or Truncate modes, the IndexMode option can allow a load into an indexed table:
Rebuild: bypasses indexes during the load, then rebuilds indexes after the load completes
Uses the Oracle ALTER INDEX REBUILD command
Indexes cannot be partitioned
Maintenance: maintains the index during the load
Loads each partition sequentially
Table and index must be partitioned
Index must be local, range-partitioned using the same range values used to partition the table

Page 180: ds325ee

Alternate: Load into Indexed Tables

If the index mode options are not possible, or if you do not have the proper Oracle permissions, it is still possible to Load into an indexed table:
Set the Oracle Enterprise stage to run sequentially
Set environment variable $APT_ORACLE_LOAD_OPTIONS to OPTIONS(DIRECT=TRUE,PARALLEL=FALSE)

Page 181: ds325ee

Teradata Stage Usage

Because of limitations imposed by the Teradata utilities, it is sometimes appropriate to use plug-in stages for Teradata sources or targets
Teradata imposes a system-wide limit on the number of concurrent database utilities
Can be adjusted by the DBA, but cannot be greater than 15
Within a parallel job, each use of the Teradata Enterprise, Teradata MultiLoad, or Teradata Load stages counts against this limit when the job is run

Which Teradata stage to use?
Source or Target: Teradata Enterprise
Uses the FastExport and FastLoad utilities
High-volume parallel reads and writes
Targets are limited to Insert operations (empty table or Append)
Supports OPEN and CLOSE commands

Page 182: ds325ee

Teradata Enterprise DBOptions

For Teradata instances with a large number of AMPs (VPROCs), it may be necessary to set the optional SessionsPerPlayer and RequestedSessions options in the DBOptions string of the Teradata Enterprise stage
It is a good idea to parameterize these settings
Syntax: user=[user],password=[password],SessionsPerPlayer=nn,RequestedSessions=nn

Page 183: ds325ee

Teradata Enterprise Sessions

RequestedSessions determines the total number of distributed connections to the Teradata source or target
When not specified, it equals the number of Teradata VPROCs (AMPs); your DBA can provide this
Can be set between 1 and the number of VPROCs

SessionsPerPlayer determines the number of connections each player will have to Teradata. Indirectly, it also determines the number of players (degree of parallelism).
Default is 2 sessions per player
The number selected should be such that:
SessionsPerPlayer * number of nodes * number of players per node = RequestedSessions

Setting the value of SessionsPerPlayer too low on a large system can result in so many players that the job fails due to insufficient resources. In that case, the value for SessionsPerPlayer should be increased.
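The relationship between the settings can be checked with simple arithmetic. A hypothetical helper (not part of DataStage) that rearranges the formula above to derive the players per node implied by a choice of SessionsPerPlayer:

```python
def players_per_node(requested_sessions, sessions_per_player, nodes):
    """SessionsPerPlayer * nodes * players_per_node = RequestedSessions,
    solved for players_per_node."""
    return requested_sessions / (sessions_per_player * nodes)

# 16 AMPs (RequestedSessions = 16), 8-node configuration file,
# 2 sessions per player -> 1 player per node
players = players_per_node(16, 2, 8)
# players == 1.0
```

Halving SessionsPerPlayer doubles the player count, which is why a too-small value can exhaust resources on a large system.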

Page 184: ds325ee

Teradata SessionsPerPlayer Example

Teradata Server: MPP with 4 TPA nodes, 4 AMPs per TPA node (16 AMPs total)

Example settings for the DataStage server:

Configuration File / SessionsPerPlayer / Total Sessions
16 nodes / 1 / 16
8 nodes / 2 / 16
8 nodes / 1 / 8
4 nodes / 4 / 16

Page 185: ds325ee

Teradata Plug-Ins

Target: Teradata MultiLoad plug-in (MultiLoad utility)
Targets allow Insert, Update, Delete, or Upsert of moderate data volumes (stage cannot run in parallel)
Do NOT use as a source in an EE flow! (runs FastExport sequentially)

Target: Teradata MultiLoad plug-in (TPump utility)
Targets allow Insert, Update, Delete, or Upsert of small data volumes in a large database
Does NOT lock the target table exclusively
Stage cannot run in parallel

Source or Target: Teradata API stage
Does not use database utilities
Intended for small volumes of data
Does not count against Teradata utility limits, but is slower than TPump
And… cannot read in parallel (parallel writes are allowed)

Page 186: ds325ee

Teradata Stage Usage Guidelines

Stages that use Teradata utilities (database-wide limit):
Teradata Enterprise will always have maximum performance for high volumes of data
The ONLY stage that will read in parallel
Limited target capabilities (insert, append)
Teradata MultiLoad for moderate data volumes
Inserts, Updates, Deletes, Upserts
Target stage ONLY!
Must run sequentially
Teradata MultiLoad (TPump option)
Similar to MultiLoad, but does not lock the target table exclusively

Stages that do not use Teradata utilities:
Teradata API

Page 187: ds325ee

SQL or DataStage?

When reading data from multiple tables in the same database, it is possible to use either SQL or DataStage for some tasks.

In general, the optimal implementation leverages the strengths of each technology:
When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to the DataStage job
Use a SQL JOIN to combine data from tables with a small-to-medium number of rows, especially when the join columns are indexed
In general, avoid SQL SORTs: the DataStage Sort is much faster and runs in parallel without the overhead of a sort-merge
Use DataStage Sort and Join to combine data from very large tables, or when the join condition is complex
Avoid the use of database stored procedures (e.g. Oracle PL/SQL) on a per-row basis; implement these routines using native DataStage components

When the direction is not obvious, the decision is often made by actual tests, or influenced by other factors such as metadata needs and developer skill sets

Page 188: ds325ee

For More Information

Orchestrate "OEM" documentation (available in the documentation section of the Ascential eServices public website):
User Guide
Operators Reference
Record Schema

DataStage Enterprise Edition Best Practices and Performance Tuning document

Don’t be afraid to try!

Page 189: ds325ee

DataStage Enterprise Edition

Module 04: Best Practices and Job Design Tips

Paul Christensen, Solution Architect

NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Last revision: June 22, 2004

Page 190: ds325ee

DataStage Enterprise Edition

Module 05: Environment Variables

Paul Christensen, Solution Architect

NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Last revision: June 22, 2004

Page 191: ds325ee

Understanding a Job’s UNIX Environment

Jobs inherit environment variables at runtime based on this order of evaluation:

Environment variables defined in $DSHOME/dsenv
Shared by all projects on the DataStage server

Project-level environment variables defined in DataStage Administrator
Duplicate variables override $DSHOME/dsenv
NOTE: when migrating between environments, project-level environment variables are NOT exported

Job-level environment variables set in Job Parameters
Duplicate variables override $DSHOME/dsenv and project-level settings
Cannot be set / passed in Job Sequences (bug!)
To avoid hard-coding job parameters, use the special values:
$ENV: pulls the value from the operating system environment
$PROJDEF: uses the project default value
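The order of evaluation amounts to a layered override in which later layers win. A minimal sketch, with plain Python dicts standing in for the dsenv, project, and job settings (variable names are made up):

```python
def effective_env(dsenv, project, job):
    """Later layers override earlier ones: dsenv < project < job."""
    env = dict(dsenv)
    env.update(project)   # project-level duplicates override dsenv
    env.update(job)       # job-level duplicates override everything
    return env

env = effective_env(
    dsenv={"ORACLE_SID": "dev", "TMPDIR": "/tmp"},
    project={"ORACLE_SID": "test"},
    job={"ORACLE_SID": "prod"},
)
# env["ORACLE_SID"] == "prod"; env["TMPDIR"] == "/tmp"
```

Variables defined only in an earlier layer (TMPDIR here) survive untouched; only duplicates are overridden.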

Page 192: ds325ee

Copying Project-Level Environment Variables

Project-level environment variables are not exported when performing a full export using DataStage Manager

With care, project-level environment variables can be copied between projects by editing the DSParams file located at the top level of the project directory
User-defined settings are near the end of this file

IMPORTANT: Always make a backup copy of the DSParams file before any manual editing. It is possible to render a project unusable through improper editing of DSParams.

Example from the end of a DSParams file:

[InternalSettings]
DisableParSCCheck=0
[AUTO-PURGE]
PurgeEnabled=0
DaysOld=0
PrevRuns=0
[EnvVarValues]
"ORACLE_SID"\1\"cpaul"
"APT_SORT_INSERTION_CHECK_ONLY"\1\"1"

Page 193: ds325ee

Environment Variables For All Jobs

The following environment variables are recommended for all jobs. Although these can be set at the project level, it is better to specify them within job properties:
Provides a runtime parameter
Specify them in your job template(s)

$APT_CONFIG_FILE=[filepath]
$APT_DUMP_SCORE=1
$APT_RECORD_COUNTS=1
Outputs record counts to the job log as each operator completes processing
$OSH_ECHO=1
Outputs the generated OSH to the job log
$APT_PM_SHOW_PIDS=1
Places UNIX process ID entries in the job log for each process started at runtime
Does not show DataStage phantom or Server processes
$APT_BUFFER_MAXIMUM_TIMEOUT=1
Maximum buffer delay in seconds
$APT_COPY_TRANSFORM_OPERATOR=1
For clusters/MPP only: copies Transform operator(s) to remote nodes

Page 194: ds325ee

Job Monitoring Environment Variables

Starting with DataStage v7, the Director Job Monitor captures results on a time interval
Captured row counts are shown in Director (Job Monitor) and Designer (Show Performance Statistics)
This data is also stored in the DataStage repository, and can be extracted using Job Control or XML reports

The following environment variables alter Job Monitor characteristics:
$APT_MONITOR_TIME=[seconds]
Specifies the time interval for capturing job monitor information at runtime
$APT_MONITOR_SIZE=[rows]
If set, specifies that the job monitor capture information on a row (not time) basis. This is the method used in DataStage release 6.x
$APT_NO_JOBMON=1
Disables job monitoring completely; no statistics will be captured
In rare instances, this may improve performance

Page 195: ds325ee

Job Design Environment Variables

$APT_STRING_PADCHAR=[char]
Overrides the default pad character of 0x0 (ASCII NULL)
Can be a string character, or C notation
Used for all variable-length to fixed-length string conversions
May have implications for some target database stages (e.g. Oracle)

$APT_DECIMAL_INTERM_PRECISION=[precision]
$APT_DECIMAL_INTERM_SCALE=[scale]
Specify the precision and scale used for internal Transformer derivations
Default precision/scale is [38,10]; maximum is [255,255]

$APT_DECIMAL_INTERM_ROUND_MODE=[mode]
ceil: rounds toward positive infinity
1.4 -> 2, -1.6 -> -1
floor: rounds toward negative infinity
1.6 -> 1, -1.4 -> -2
round_inf: rounds or truncates to the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity
1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
trunc_zero: discards any fractional digits to the right of the rightmost supported fractional digit, regardless of sign. If $APT_DECIMAL_INTERM_SCALE is smaller than the result of an internal calculation, rounds or truncates to the scale size
1.56 -> 1.5, -1.56 -> -1.5
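The four rounding modes can be reproduced to check the examples above. A sketch using Python's decimal module, which offers closely matching modes; the mapping to the DataStage behavior is my reading of the descriptions, not an official equivalence:

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP, ROUND_DOWN

# assumed mapping of DataStage mode names to decimal-module modes
MODES = {
    "ceil": ROUND_CEILING,       # toward positive infinity
    "floor": ROUND_FLOOR,        # toward negative infinity
    "round_inf": ROUND_HALF_UP,  # nearest, ties away from zero
    "trunc_zero": ROUND_DOWN,    # discard extra fractional digits
}

def round_mode(value, scale, mode):
    quantum = Decimal(1).scaleb(-scale)   # scale=0 -> 1, scale=1 -> 0.1
    return Decimal(value).quantize(quantum, rounding=MODES[mode])

# reproduce the slide's examples:
# ceil:       1.4  -> 2,   -1.6  -> -1    (scale 0)
# floor:      1.6  -> 1,   -1.4  -> -2    (scale 0)
# round_inf:  1.5  -> 2,   -1.5  -> -2    (scale 0)
# trunc_zero: 1.56 -> 1.5, -1.56 -> -1.5  (scale 1)
```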

Page 196: ds325ee

Job Debugging Environment Variables

The following environment variables can assist with debugging a job flow:

$OSH_PRINT_SCHEMAS=1
Outputs the actual schema used at runtime for each dataset in a job flow. This is useful for determining whether the actual schema matches what the job designer expected.

$APT_PM_PLAYER_TIMING=1
When set, prints detailed information in the job log for each operator, including CPU utilization and elapsed processing time

$APT_PM_PLAYER_MEMORY=1
When set, prints detailed information in the job log for each operator when additional memory is allocated

$APT_BUFFERING_POLICY=FORCE
$APT_BUFFER_FREE_RUN=1000
Used in conjunction, these two environment variables effectively isolate each operator from slowing upstream production. Using the job monitor statistics, this can identify which part of a job flow is impacting overall performance.
NOT recommended for production job runs!

Page 197: ds325ee

Buffer Environment Variables

The following environment variables may also be specified on a per-stage basis within Designer:
$APT_BUFFERING_POLICY
$APT_BUFFER_MAXIMUM_MEMORY
$APT_BUFFER_FREE_RUN
$APT_BUFFER_DISK_WRITE_INCREMENT

Page 198: ds325ee

Sequential File Stage Environment Variables

$APT_EXPORT_FLUSH_COUNT=[nrows]
Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty from increased I/O.

$APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS=1 (DataStage v7.01 and later)
Directs DataStage to reject Sequential File records with strings longer than their declared maximum column length. By default, imported string fields that exceed their maximum declared length are truncated.

$APT_IMPORT_BUFFER_SIZE=[Kbytes]
$APT_EXPORT_BUFFER_SIZE=[Kbytes]
Define the size of the I/O buffer for Sequential File reads (imports) and writes (exports), respectively. Default is 128 (128K), with a minimum of 8. Increasing these values on heavily-loaded file servers may improve performance.

$APT_CONSISTENT_BUFFERIO_SIZE=[bytes]
In some disk array configurations, setting this variable to a value equal to the read/write size in bytes can improve performance of Sequential File import/export operations.

$APT_DELIMITED_READ_SIZE=[bytes]
Specifies the number of bytes the Sequential File (import) stage reads ahead to get the next delimiter. The default is 500 bytes, but this can be set as low as 2. This setting should be set to a lower value when reading from streaming inputs (e.g. socket, FIFO) to avoid blocking.

$APT_MAX_DELIMITED_READ_SIZE=[bytes]
Controls the upper bound on delimited read-ahead, which is 100,000 bytes by default. When more than 500 bytes of read-ahead is desired, use this variable instead of $APT_DELIMITED_READ_SIZE.

Page 199: ds325ee

DB2 Environment Variables

Environment Variable Setting Description

$INSTHOME [path] Specifies the DB2 install directory. This variable is usually set in a user’s environment from .db2profile.

$APT_DB2INSTANCE_HOME [path] Used as a backup for specifying the DB2 installation directory (if $INSTHOME is undefined).

$APT_DBNAME [database] Specifies the name of the DB2 database for DB2/UDB Enterprise stages if the “Use Database Environment Variable” option is True. If $APT_DBNAME is not defined, $DB2DBDFT is used to find the database name.

$APT_RDBMS_COMMIT_ROWS [rows] Specifies the number of records to insert between commits. The default value is 2000. Can also be specified with the “Row Commit Interval” stage input property.

$DS_ENABLE_RESERVED_CHAR_CONVERT 1 Allows DataStage to handle DB2 databases which use the special characters # and $ in column names.
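A hypothetical DB2 setup sketch (instance path and database name are placeholders, not real values):

```shell
export INSTHOME=/home/db2inst1            # normally set by .db2profile
export APT_DBNAME=SAMPLE                  # used when "Use Database Environment Variable" is True
export APT_RDBMS_COMMIT_ROWS=5000         # default is 2000 rows per commit
export DS_ENABLE_RESERVED_CHAR_CONVERT=1  # only if column names contain # or $
```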

Page 200: ds325ee

Informix Environment Variables

Environment Variable Setting Description

$INFORMIXDIR [path] Specifies the Informix install directory.

$INFORMIXSQLHOSTS [filepath] Specifies the path to the Informix sqlhosts file.

$INFORMIXSERVER [name] Specifies the name of the Informix server matching an entry in the sqlhosts file.

$APT_COMMIT_INTERVAL [rows] Specifies the commit interval in rows for Informix HPL Loads. The default is 10000.
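A hypothetical Informix environment sketch (paths and server name are placeholders; the server name must match an entry in your own sqlhosts file):

```shell
export INFORMIXDIR=/opt/informix
export INFORMIXSQLHOSTS=$INFORMIXDIR/etc/sqlhosts
export INFORMIXSERVER=ids_server1        # must match an entry in sqlhosts
export APT_COMMIT_INTERVAL=50000         # HPL load commit interval; default 10000 rows
```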

Page 201: ds325ee

Oracle Environment Variables

Environment Variable Setting Description

$ORACLE_HOME [path] Specifies installation directory for current Oracle instance. Normally set in a user’s environment by scripts.

$ORACLE_SID [sid] Specifies the Oracle service name, corresponding to a TNSNAMES entry.

$APT_ORAUPSERT_COMMIT_ROW_INTERVAL [rows]
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL [seconds]

These two environment variables work together to specify how often target rows are committed for target Oracle stages with the Upsert method. Commits are made whenever the time interval has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.

$APT_ORACLE_LOAD_OPTIONS [SQL*Loader options]

Specifies Oracle SQL*Loader options used in a target Oracle stage with Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE)

$APT_ORACLE_LOAD_DELIMITED (DataStage 7.01 and later) [char] Specifies a field delimiter for target Oracle stages using the Load method. Setting this variable makes it possible to load fields with trailing or leading blank characters.

$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM 1 When set, a target Oracle stage with Load method will limit the number of players to the number of datafiles in the table’s tablespace.

$APT_ORA_WRITE_FILES [filepath] Useful in debugging Oracle SQL*Loader issues. When set, the output of a Target Oracle stage with Load method is written to files instead of invoking the Oracle SQL*Loader. The filepath specified by this environment variable specifies the file with the SQL*Loader commands.

$DS_ENABLE_RESERVED_CHAR_CONVERT 1 Allows DataStage to handle Oracle databases which use the special characters # and $ in column names.
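A hypothetical Oracle environment sketch (paths, SID, and the tuned intervals are placeholders; substitute values appropriate to your site):

```shell
export ORACLE_HOME=/opt/oracle/product/9.2
export ORACLE_SID=ORCL
# Commit every 10000 rows or 5 seconds, whichever comes first
# (defaults are 5000 rows / 2 seconds):
export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=10000
export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=5
# Fall back to a conventional-path load (the default is DIRECT=TRUE):
export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=FALSE, PARALLEL=TRUE)'
```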

Page 202: ds325ee

Teradata Environment Variables

Environment Variable Setting Description

$APT_TERA_SYNC_DATABASE [name] Starting with v7, specifies the database used for the terasync table.

$APT_TERA_SYNC_USER [user] Starting with v7, specifies the user that creates and writes to the terasync table.

$APT_TERA_SYNC_PASSWORD [password] Specifies the password for the user identified by $APT_TERA_SYNC_USER.

$APT_TERA_64K_BUFFERS 1 Enables 64K buffer transfers (32K is the default). May improve performance depending on network configuration.

$APT_TERA_NO_ERR_CLEANUP 1 This environment variable is not recommended for general use. When set, it may assist in job debugging by preventing the removal of error tables and the partially-written target table.

$APT_TERA_NO_PERM_CHECKS 1 Disables permission checking on the Teradata system tables that must be readable during the Teradata Enterprise load process. This can improve the startup time of the load.
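A hypothetical Teradata sync-table sketch (database, user, and password are placeholders; never hard-code real credentials in dsenv):

```shell
export APT_TERA_SYNC_DATABASE=syncdb
export APT_TERA_SYNC_USER=syncuser
export APT_TERA_SYNC_PASSWORD='change_me'   # placeholder only
export APT_TERA_64K_BUFFERS=1               # default transfer size is 32K
```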

Page 203: ds325ee

For More Information

Orchestrate “OEM” Documentation (available in the documentation section of the Ascential eServices public website): Admin/Install Guide, Chapter 11 (Environment Variables); Operators Reference

DataStage Enterprise Edition Best Practices and Performance Tuning document

Page 204: ds325ee

DataStage Enterprise Edition

Module 05: Environment Variables

Paul Christensen, Solution Architect

NOTE: These slides are Copyright © 2004 by Ascential Software, Inc. and are for the intended recipient only. Unauthorized copying or re-distribution is prohibited.

Last revision: June 22, 2004

Page 205: ds325ee

DataStage Enterprise Edition

Module 06: Introduction to Performance Tuning

Paul Christensen, Solution Architect

Page 206: ds325ee

Assumptions

This module assumes that you have an understanding of the topics covered in:
- Module 01: Parallel Framework Architecture
- Module 02: Partitioning, Collecting, and Sorting
- Module 03: Parallel Job Score
- Module 04: Best Practices and Job Design Tips
- DS324PX: DataStage Enterprise Edition Essentials

Page 207: ds325ee

Optimizing Performance

The ability to process large volumes of data in a short period of time requires optimizing all aspects of the job flow and environment for maximum throughput and performance:
- Job design
- Stage properties
- DataStage parameters
- Configuration file
- Disk subsystem (especially RAID arrays / SANs)
- Source and target databases
- Network
- etc.

Page 208: ds325ee

Enterprise Edition Performance

Within DataStage, examine (in order):
- The end-to-end process flow: intermediate results, sources/targets, disk usage
- The DataStage configuration file(s) for each job: degree of parallelism, impact on overall system resources, file system mappings, scratch disk
- Individual job design (including shared containers): stages chosen, overall design approach, partitioning strategy, combination, buffering (as a last resort)

Ultimate job performance may be constrained by external sources and targets (e.g. disk subsystem, network, database). It may be appropriate to scale back the degree of parallelism to conserve unused resources.

Page 209: ds325ee

Performance Tuning Methodology

Performance tuning is an iterative process:
- Test in isolation: nothing else should be running on the DataStage server or the source and target databases
- Change one item at a time, then examine the impact
- Use the Job Score to determine the number of processes generated, operator combination, and framework-inserted sorting and partitioning
- Use the DataStage Job Monitor to verify data distribution (partitioning), throughput, and bottlenecks
- Use UNIX system monitoring tools to determine resource utilization (CPU, memory, disk, network)

Page 210: ds325ee

Using DataStage Director Job Monitor

Enable “Show Instances” to show data distribution (skew) across partitions. Best performance comes with an even distribution.

Enable “Show %CP” to display CPU utilization

Page 211: ds325ee

Selectively Disabling Operator Combination

Operator combination is intended to improve overall performance and lower resource usage. It generally separates I/O from CPU activity.

There may be instances when operator combination hurts performance:
- One process cannot use more than 100% of a CPU
- It is also a good idea to separate I/O from CPU tasks

Use the DataStage Job Monitor to identify CPU bottlenecks, then selectively disable combination through Designer stage properties.

In unusual circumstances, disable all combination by setting $APT_DISABLE_COMBINATION=TRUE. This generates significantly more UNIX processes and may negatively impact performance.
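A sketch of the debug-only setting described above:

```shell
# Debug only: disable ALL operator combination, then re-run the job and
# compare the score -- each stage now runs as its own UNIX process, so
# the Job Monitor reports per-stage CPU separately.
export APT_DISABLE_COMBINATION=TRUE
```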

Page 212: ds325ee

Operator Combination Example

In this example, the combined operator is using 100% CPU. Disabling operator combination allows each stage to use more CPU, and separates I/O from CPU activity.

With operator combination, the score shows 2 operators running as 2 processes on 1 node:

  op0[1p] {(parallel APT_CombinedOperatorController:
        (FileSetIn.InStream)
        (APT_TransformOperatorImplJob_Transformer in Transformer)
        (APT_RealFileExportOperator in File_Set_6.ToOutput)
      ) on nodes ( node1[op0,p0] )}
  op1[1p] {(sequential APT_WriteFilesetExportOperator in File_Set_6.ToOutput) on nodes ( node1[op1,p0] )}

Without operator combination, the score shows 4 operators running as 4 processes on 1 node:

  op0[1p] {(parallel FileSetIn.InStream) on nodes ( node1[op0,p0] )}
  op1[1p] {(parallel APT_TransformOperatorImplJob_Transformer in Transformer) on nodes ( node1[op1,p0] )}
  op2[1p] {(parallel APT_RealFileExportOperator in File_Set_6.ToOutput) on nodes ( node1[op2,p0] )}
  op3[1p] {(sequential APT_WriteFilesetExportOperator in File_Set_6.ToOutput) on nodes ( node1[op3,p0] )}

Page 213: ds325ee

Configuration File Guidelines

Minimize I/O overlap across nodes:
- If multiple file systems are shared across nodes, alter the order of file systems within each node definition
- Pay particular attention to the mapping of file systems to physical controllers / drives within a RAID array or SAN
- Use local disks for scratch storage if possible

Named pools can be used to further separate I/O:
- “buffer” file systems are only used for buffer overflow
- “sort” file systems are only used for sorting

On clustered / MPP configurations, named pools can be used to further specify resources across physical servers:
- Through careful job design, this can minimize data shipping
- Pools can identify the server(s) with database connectivity
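As an illustration of the “buffer” and “sort” pools described above, a single-node configuration file might look like the following sketch (the node name and all paths are hypothetical; substitute your own):

```
{
  node "node1" {
    fastname "server1"
    pools ""
    resource disk "/data1/ds" {pools ""}
    resource scratchdisk "/scratch_local" {pools ""}
    resource scratchdisk "/scratch_buffer" {pools "buffer"}
    resource scratchdisk "/scratch_sort" {pools "sort"}
  }
}
```

With this layout, buffer overflow and sort spill land on separate file systems from ordinary scratch I/O.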

Page 214: ds325ee

Use Parallel Data Sets

Use Parallel Data Sets to land intermediate results between parallel jobs:
- Stored in native internal format (no conversion overhead)
- Retains data partitioning and sort order (end-to-end parallelism across jobs)
- Maximum performance through parallel I/O
- But can only be read by other DataStage Enterprise Edition parallel jobs

When generating Lookup reference data to be used in subsequent jobs, use Lookup File Sets:
- Internal format, partitioned
- Pre-indexed

Page 215: ds325ee

Impact of Partitioning

Ensure data is as close to evenly distributed as possible. When business rules dictate otherwise, re-partition to a more balanced distribution as soon as possible to improve the performance of downstream stages.

Minimize repartitions by optimizing the flow to re-use upstream partitioning, especially in clustered / MPP environments.

Know your data: choose hash key columns that generate sufficient unique key combinations (while meeting business requirements).

Use SAME partitioning carefully; it maintains the upstream partitioning and degree of parallelism.

Page 216: ds325ee

Impact of Sorting

Use parallel sorts where possible (sort by key-column groups). Where a sequential sort is required, a parallel sort followed by a Sort Merge collector is generally much faster than a sequential sort.

Complete sorts are expensive:
- They interrupt the pipeline: rows cannot be output until all rows have been read
- They use scratch disk for intermediate storage, unless the data set is small enough to fit in the sort buffer

Minimize and combine sorts where possible. Use the “Don’t Sort, Previously Sorted” key-column option to leverage previous sort groupings; this uses much less memory and outputs rows after each key-column group.

Parallel data sets maintain sort order and partitioning across jobs.

Stable sorts are slower than non-stable sorts; use them only when necessary.

Use the “Restrict Memory Usage (MB)” option to increase the amount of sort memory per partition (the default is 20MB).

Page 217: ds325ee

Impact of Transformers

Minimize the number and use of Transformers. Consider more appropriate stages / methods: Copy, Output Mappings, Modify, Lookup.

Combine derivations from multiple Transformers, and use stage variables to perform calculations used by multiple derivations.

Replace complex Transformers that do not meet performance requirements with BuildOps.

And NEVER use the BASIC Transformer for high-volume flows!

Page 218: ds325ee

Impact of Buffering

Consider maximum row width. For very wide rows, it may be necessary to increase the buffer size to hold more rows in memory (the default is 3MB per partition). Set this through stage properties, or for the entire job using $APT_BUFFER_MAXIMUM_MEMORY.

Tune all other factors (job design, configuration file, disk, resources, etc.) before tuning buffer settings.

Be careful when changing the buffering mode: disabling buffering might cause deadlocks (job hangs).

In some cases, the best solution to avoiding fork-join buffer contention may be to split the job, landing to intermediate data sets.
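A hypothetical tuning sketch for a job with very wide rows (the 9MB figure is illustrative, not a recommendation):

```shell
# Triple the per-link buffer from the 3MB default so more rows fit in
# memory per partition.
export APT_BUFFER_MAXIMUM_MEMORY=9437184   # bytes (9MB); default 3145728
```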

Page 219: ds325ee

Isolating Buffers from Overall Performance

Buffer operators may make it difficult to identify performance bottlenecks in a job flow.

Setting the following environment variables effectively isolates each stage (by inserting buffers) and prevents the buffers from slowing down upstream stages (by spilling to disk):
- $APT_BUFFERING_POLICY=FORCE inserts buffers between each operator (isolates stages)
- $APT_BUFFER_FREE_RUN=1000 writes excess buffer data to disk instead of slowing down the producer; the buffer will not slow the producer until it has written 1000 x $APT_BUFFER_MAXIMUM_MEMORY to disk

Important notes:
- These settings will generate a significant amount of disk I/O! Use configuration file “buffer” disk pools to isolate buffer file systems from scratch and resource disks.
- Do NOT use these settings for production jobs!
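The two isolation settings above as a sketch:

```shell
# Debug only -- never set these for production jobs.
export APT_BUFFERING_POLICY=FORCE   # insert a buffer on every link
export APT_BUFFER_FREE_RUN=1000     # spill to disk instead of throttling the producer
```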

Page 220: ds325ee

Other Performance Tips

Remove un-needed columns as early as possible within the flow; this minimizes memory usage and optimizes buffering:
- Use a select list when reading from database sources
- To remove columns on Output Mapping, disable runtime column propagation

Always specify a maximum length for VARCHAR columns; this has significant performance benefits.

Avoid type conversions where possible:
- Verify with $OSH_PRINT_SCHEMAS
- Always import Oracle table definitions using orchdbutil
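A sketch of the schema-verification setting mentioned above:

```shell
# Log the record schema of every link at run time so unintended type
# conversions show up in the job log.
export OSH_PRINT_SCHEMAS=1
```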

Page 221: ds325ee

Tuning Sequential File Performance

On heavily loaded file servers or some RAID/SAN configurations, setting these environment variables may improve performance (specify a number in Kbytes; the default is 128):
- $APT_IMPORT_BUFFER_SIZE
- $APT_EXPORT_BUFFER_SIZE

In some disk array configurations, set the following environment variable equal to the read/write size in bytes:
- $APT_CONSISTENT_BUFFERIO_SIZE
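A hypothetical sketch for a heavily loaded file server and a 1MB-stripe disk array (both figures are illustrative; match them to your own hardware):

```shell
export APT_IMPORT_BUFFER_SIZE=1024          # KBytes; default 128
export APT_EXPORT_BUFFER_SIZE=1024          # KBytes; default 128
# Match the array's read/write (stripe) size, in bytes:
export APT_CONSISTENT_BUFFERIO_SIZE=1048576
```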

Page 222: ds325ee

For More Information

Orchestrate “OEM” Documentation (available in the documentation section of the Ascential eServices public website): User Guide; Operators Reference

DataStage Enterprise Edition Best Practices and Performance Tuning document

Don’t be afraid to try!

Page 223: ds325ee

DataStage Enterprise Edition

Module 06: Introduction to Performance Tuning

Paul Christensen, Solution Architect
